Compliant database development using Redgate SQL Provision Part 2

In my first post on Redgate SQL Provision, I ran some basic tests against the AdventureWorksDW2017 database to demonstrate how SQL Provision works.

The headline was that I saved 47 seconds cloning a database in 2 seconds that took 49 seconds to provision, nothing to write home about. So to really prove how cool this product is, it needs to run against something much, much larger…if I still get the same sort of results, then that would be pretty cool.

The Stack Overflow Database

The Stack Overflow database is periodically published as an XML data dump and Brent Ozar uses it as part of his performance tuning courses and so being the super helpful guy he is carves it up into several different sized backup files. What I find most helpful about this, is given that the structure is consistent it means I can test stuff out at speed against one of the smaller databases before unleashing it on one of the larger datasets. There are tonnes of versions available, but the ones I have used for this test can be found on this page.

So the idea is to evaluate against StackOverflow2010 database and run sone preliminary tests against that. I would then duplicate the masking file, switch the database connection and run a full test against StackOverflow.

I’ve decided to mask a few columns within the User table to imitate masking a customer table. Here’s a screenshot of the first 10 rows;

Setting up the masking rules

There’s not a huge amount of columns to choose from, but we just need a few columns for the test because what we’re interested in isn’t how quickly my machine can process a load of update statements, but how quickly it can complete an end to end process and make multiple copies available to my (imaginary) team. I chose to mask the following columns;

  • AboutMe
  • DisplayName
  • Location
  • WebsiteURL

There are 9,737,247 users in this dataset and that posed a problem with trying to provide each user a unique name. I didn’t really have to do this, but I figured I’d keep it reasonably accurate. The predefined datasets weren’t really viable since none of them were going to provide enough unique values. There’s less than a quarter of a million words in the English dictionary #gofigure, there’s nowhere near enough person names, so I had to configure UserName differently.

So the first masking step covered AboutMe, Location, WebsiteURL with some basic substitution. I set the AboutMe column to produce some random words up to a max of 50 characters;

I applied a predefined list of country names to the location column;

And I configured the generation of random URLs as shown below;

For the user name, there is a large selection of options from scrambling to regular expressions, but I opted for a custom calculation based on the user ID since the user ID is unique. So I went for this;

‘User_’ + RIGHT(‘0000000000’ + CAST(ID AS VARCHAR),10)

Which will result in a consistent pattern of “User_” followed by 10 digits consisting of the User ID plus some leading zeros. So ID 1 will be User_0000000001, ID 9999 will be User_0000009999 and so on. That kinda fits nice with my OCD then as they’ll all be a consistent length. So the rule looks like this;

But for good measure I excluded the bot (which has an ID of -1) from the rule because User_00000000-1 just wouldn’t look right. You can apply filters on the WHERE clause tab;

And with my ruleset complete, I put some new Powershell scripts together and tested the results. Once I’d tested it against the 2010 dataset, switching it to the full dataset was as simple as changing the database name within the first masking step;

Building an image

I ran tests against the 2010, 2013 and the full dataset, below is a walkthrough of the full dataset…

To build an image, I need a backup file or an existing database and a fileshare for the image to be hosted. I then wrote a Powershell script to build the image which looks like this;

Connect-SqlClone -ServerUrl ‘http://localhost:14145’
$SqlServerInstance = Get-SqlCloneSqlServerInstance -MachineName DESKTOP-H4M4K2E -InstanceName “” `
$imageDestination = Get-SqlCloneImageLocation -Path ‘\DESKTOP-H4M4K2E\Image’
$imageName = “StackOverflow”
$myMaskingModification = New-SqlCloneMask -Path ‘C:\Code\demos-redgate-sql-provision\stackoverflow\stackoverflow.dmsmaskset’
$imageOperation = New-SqlCloneImage -Name $imageName -SqlServerInstance $sqlServerInstance -BackupFileName @(‘G:\Backups\StackOverflow.bak’) -Destination $imageDestination -Modifications $myMaskingModification
$imageOperation | Wait-SqlCloneOperation

This process will initially build an image, then mask four columns in two steps for just under 10 million rows. I’ve cut down the output of the log to the salient points;

14:33:42 Operation started
15:10:04 Attach database
15:10:04 Starting Data Masking Script
15:38:44 Finished Data Masking Script
15:39:21 Operation succeeded with a duration of 01:05:39

It restored the database in ~37 minutes, with masking taking ~28 minutes. Final footprint of the image on disk is ~329GB. The original database was ~323GB.

Building the clones

So this is where SQL Provision really comes into its own. I set up a Powershell script to imitate provisioning databases for 10 developers, 3 test databases and 1 pre prod database. The Powershell script looks like this;

Connect-SqlClone -ServerUrl ‘http://localhost:14145’
$SqlServerInstance = Get-SqlCloneSqlServerInstance -MachineName DESKTOP-H4M4K2E -InstanceName “” `
$imageName = “StackOverflow”
$image = Get-SqlCloneImage -Name $imageName
$image | New-SqlClone -Name ‘StackOverflow_clone_Developer001’ -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name ‘StackOverflow_clone_Developer002’ -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name ‘StackOverflow_clone_Developer003’ -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name ‘StackOverflow_clone_Developer004’ -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name ‘StackOverflow_clone_Developer005’ -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name ‘StackOverflow_clone_Developer006’ -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name ‘StackOverflow_clone_Developer007’ -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name ‘StackOverflow_clone_Developer008’ -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name ‘StackOverflow_clone_Developer009’ -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name ‘StackOverflow_clone_Developer010’ -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name ‘StackOverflow_clone_Test001’ -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name ‘StackOverflow_clone_Test002’ -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name ‘StackOverflow_clone_Test003’ -Location $sqlServerInstance | Wait-SqlCloneOperation
$image | New-SqlClone -Name ‘StackOverflow_clone_PreProd’ -Location $sqlServerInstance | Wait-SqlCloneOperation

Each clone build is a separate command, so there’s no end to end log and I don’t have the patience to collate them for this post! Here’s the output from one clone build;

20:13:40 Operation started
20:13:44 Started to attach database StackOverflow_clone_Developer001
20:13:45 Finished to attach database StackOverflow_clone_Developer001
20:13:45 Operation succeeded with a duration of 00:00:04.588

In under 5 seconds it provisioned a dedicated database! In total the process took just over a minute to complete with a footprint of ~640MB for all 14 databases;

I can now apply changes directly to a single database independent from the rest of my team;

In the screenshot above, you can see that I have updated the user database for Developer001. When I run a query against Developer001 and Developer002 you can see that all information is returned consistently except for the changes applied to Developer001.

Result!

This test is a bit more like it! I’ve provisioned a 300GB database and securely masked and provisioned independent copies of it for 10 developers, three test tracks and pre prod testing in under 70 minutes! As more changes are applied to these clones, their footprint will increase, but when a developer can be given the ability to drop their clone and re-create it in seconds or even create more copies, this tool adds so much flexibility to development teams!

Here’s the stats from each load;

20102013Full
Footprint of original database~9GB~51GB~323GB
Footprint of image
~9GB~48GB~329GB
Time to provision image00:02:1900:14:3501:11:39
Footprint of clones616MB616MB640MB
Time to create clones00:01:0100:00:5300:01:04

Next steps

There are changes I could make to the masking process to optimise the image load time. There are tips and techniques covered in the Redgate University videos, but since the main test was around how quickly I could spin up multiple copies of the database I didn’t spend a huge amount of time doing this.

All in all, I’m really impressed with this process, it is possible to securely run this since the backup can be hosted on a restricted file share, whilst the image is being constructed it is only accessible to the service account and once complete, developers can only obtain a clone of the de-sensitised image.

Share this post...

Leave a Reply

Your email address will not be published. Required fields are marked *