
Sh*t Happens, Retain Your AWS Resources

Published: at 10:00 PM

After reading today’s story, I suggest you check the AWS CDK Best Practices guide. Spoiler alert: you’ll encounter this paragraph:

The AWS CDK attempts to keep you from losing data by defaulting to policies that retain everything you create. For example, the default removal policy on resources that contain data (such as Amazon S3 buckets and database tables) is not to delete the resource when it is removed from the stack. Instead, the resource is orphaned from the stack.

This story combines a default behavior in CDK, a mistaken expectation, some good habits to have, and a happy ending.

It starts with experimenting with the latest version of Hawkbit on our dev server. Naturally, when testing a new version, you want to make sure migrations work. I decided to add a variable to our existing CDK stack through which I could provide a snapshot identifier to restore the database from. It looked something like this:

import { RemovalPolicy } from "aws-cdk-lib";
import * as rds from "aws-cdk-lib/aws-rds";

const snapshotIdentifier = this.node.tryGetContext("rdsSnapshotId");

const clusterConfig: rds.DatabaseClusterProps = {
  // Explicitly retain the cluster if it's ever removed from the stack.
  removalPolicy: RemovalPolicy.RETAIN,
  // ...engine, VPC, writer instance, etc.
};

// The two constructs are sibling L2s, not interchangeable classes;
// both implement IDatabaseCluster, so that's the type to hold them in.
let rdsCluster: rds.IDatabaseCluster;
if (snapshotIdentifier) {
  // Restore a brand-new cluster from the given snapshot.
  rdsCluster = new rds.DatabaseClusterFromSnapshot(this, "MyCluster", {
    snapshotIdentifier,
    ...clusterConfig,
  });
} else {
  rdsCluster = new rds.DatabaseCluster(this, "MyCluster", {
    ...clusterConfig,
  });
}

Here the value of rdsSnapshotId is the ARN of the snapshot you want to restore from, and you pass it to CDK with cdk deploy my-stack -c rdsSnapshotId=<mysnapshot>, which in our case creates a new RDS cluster with all the previous data. The idea was that once the new cluster had been created from the snapshot, I’d simply deploy again without the context and everything would keep working.

It didn’t. Assuming that a DatabaseClusterFromSnapshot can be automatically transformed into a DatabaseCluster just because they share a parent class is totally wrong: they are two different L2 constructs, and dropping the underlying SnapshotIdentifier property forces CloudFormation to replace the DBCluster. CDK will create a new RDS cluster and delete the old resource if it’s not marked as RETAIN.

As a side note, while testing other changes in the stack, I mistakenly ran cdk deploy my-stack without the context. Guess what happened? We ended up in the else block, spinning up a brand-new RDS cluster. This is because tryGetContext does not persist previous values in the stack’s state, contrary to a CfnParameter, whose value is retained across change sets unless overridden. This is by design, and using context values is the recommended approach [1]; just be careful.
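To make the difference concrete, here is a minimal sketch; the CfnParameter variant is illustrative, not what our stack actually uses:

import * as cdk from "aws-cdk-lib";

// Context value: resolved at synth time, and simply undefined on any
// deploy where you forget to pass -c rdsSnapshotId=<mysnapshot>.
const fromContext = this.node.tryGetContext("rdsSnapshotId");

// CfnParameter: lives in the CloudFormation template, so on subsequent
// deploys the previously provided value is reused unless overridden, e.g.
//   cdk deploy my-stack --parameters rdsSnapshotId=<mysnapshot>
const fromParameter = new cdk.CfnParameter(this, "rdsSnapshotId", {
  type: "String",
  default: "",
});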

TIP

Be mindful of your habits; they’ll betray you. Always double-check the diff.
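In practice that means running cdk diff with exactly the flags you are about to deploy with; it would have flagged the cluster replacement before it happened:

$ cdk diff my-stack -c rdsSnapshotId=<mysnapshot>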

Now the good news: in my clusterConfig I made sure not to change the default removal policy; I even set it explicitly to RETAIN. That’s a good habit to have when you know things might go wrong and you’d like to roll back. Sadly, CDK does not have a “restore to previous state” option; cdk rollback is only meant for when your stack is in a failed state.
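For the record, the policy can also be applied after construction instead of through the props; a minimal sketch (for RDS, RemovalPolicy.SNAPSHOT is another data-preserving option):

// Same effect as removalPolicy in clusterConfig: orphan the cluster
// instead of deleting it when it's removed from the stack.
rdsCluster.applyRemovalPolicy(RemovalPolicy.RETAIN);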

This is where cdk import comes in handy. It allows you to import existing resources into your stack. The fix to our issue becomes a set of simple steps:

  1. Make sure that our initially created cluster (from snapshot) has a RETAIN removal policy

  2. Delete the resource (the lines of code) from our stack and deploy again; thanks to the RETAIN policy, the cluster is orphaned rather than deleted

  3. Add the code back, only keeping:

    const rdsCluster = new rds.DatabaseCluster(this, "MyCluster", {
      ...clusterConfig,
    });
  4. Run cdk import; it will automatically detect the resource MyCluster and its related resources:

    $ cdk import my-stack
    
    start: Building my-stack Template
    success: Built my-stack Template
    start: Publishing my-stack Template ()
    success: Published my-stack Template ()
    my-stack/MyCluster/Subnets/Default (AWS::RDS::DBSubnetGroup): enter DBSubnetGroupName (empty to skip) <previous-cluster-subnet-group>
    my-stack/MyCluster/Resource (AWS::RDS::DBCluster): enter DBClusterIdentifier (empty to skip) <previous-cluster-id>
    my-stack/MyCluster/rds-writer/Resource (AWS::RDS::DBInstance): enter DBInstanceIdentifier (empty to skip) <previous-cluster-writer-id>
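Once the import completes, a good sanity check is one more diff; assuming the code now matches the imported resources, it should come back with no changes:

$ cdk diff my-stack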

Once done, you’re back to a stable state: the code no longer mentions any snapshot, and it behaves as if you had created the cluster with all the previous data on purpose, like this 😉

Why not just restore from the snapshot and avoid the issue altogether? Because this is how experience is built: making mistakes and fixing them.

Footnotes

  1. https://docs.aws.amazon.com/cdk/v2/guide/parameters.html