Performing Backups in Azure Cosmos DB
When it comes to backups in Azure Cosmos DB, we have two options. We can rely on the Automatic Backups that come with Azure Cosmos DB or we can manage our own backups. This article will go over both of these processes so you can determine which route is best for your application’s needs.
Azure Cosmos DB automatically backs up your data every 4 hours without affecting performance or availability. These backups are stored in Azure Blob Storage in the same region as your Cosmos DB write region (if you have multi-master configured for your Cosmos DB account, the backups are stored in one of the write regions).
Only the latest two backups are kept. If you delete a container or database within your account, the existing snapshots for that container or database are retained for 30 days.
Thankfully, these snapshots are taken without consuming any provisioned throughput on your account, so you don’t need to provision extra throughput on your Cosmos DB databases or containers for backups to happen.
How to restore data from an automatic backup
Azure Cosmos DB can restore data should any of the following situations occur:
- The entire account is deleted.
- One or more databases are deleted.
- One or more containers are deleted.
- Items within a container are deleted or modified.
- A shared throughput offer container within a shared offer database is deleted or corrupted.
According to the documentation, a new Cosmos DB account will be created in order to hold the restored data. If you’re in the portal at the time, you’ll see a Cosmos DB account with the following name:
The last digit shows the number of restore attempts that have been made. Unfortunately, you can’t restore data to a pre-created Cosmos DB account.
If the account itself was deleted, the Cosmos DB team can restore data into an account with the same name. Do not recreate the account yourself and then ask for a restore! The team won’t be able to restore the data to the newly created account.
When a database is deleted, either the whole database or a subset of the collections within it can be restored. However, if you provision throughput at the database level, you can’t restore individual containers.
The tricky part comes when restoring items within a container. In this scenario, we need to specify the point in time to restore our data to. With backups retained for only 8 hours in Cosmos DB, the longer you delay, the more likely it is that the backup you need will be overwritten.
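To make that time pressure concrete, here is a small sketch of the arithmetic. It assumes snapshots land on a fixed 4-hour grid counted from a known first snapshot, which is a simplification (the real schedule isn’t something we control), and the function names are my own.

```python
from datetime import datetime, timedelta

BACKUP_INTERVAL = timedelta(hours=4)  # automatic backup frequency
RETAINED_SNAPSHOTS = 2                # only the latest two backups are kept


def last_good_snapshot(incident: datetime, first_snapshot: datetime) -> datetime:
    """Most recent snapshot taken at or before the incident (assumes a fixed grid)."""
    intervals = (incident - first_snapshot) // BACKUP_INTERVAL
    return first_snapshot + intervals * BACKUP_INTERVAL


def restore_deadline(incident: datetime, first_snapshot: datetime) -> datetime:
    """Moment the last good snapshot is rotated out by newer snapshots."""
    good = last_good_snapshot(incident, first_snapshot)
    return good + RETAINED_SNAPSHOTS * BACKUP_INTERVAL


if __name__ == "__main__":
    first = datetime(2020, 1, 1, 0, 0)      # hypothetical first snapshot
    incident = datetime(2020, 1, 1, 9, 30)  # hypothetical time of the bad write
    deadline = restore_deadline(incident, first)
    print("Last good snapshot:", last_good_snapshot(incident, first))
    print("Snapshot is overwritten at:", deadline)
    print("Time left after the incident:", deadline - incident)
```

With these inputs the last good snapshot is the 08:00 one, and it is rotated out at 16:00, which leaves six and a half hours. Under this model the window is always somewhere between four and eight hours, depending on how soon after a snapshot the incident happens.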
Managing your own Backups
Depending on your workload requirements, you may want more control over how you back up your Azure Cosmos DB data. If you’re using the SQL API for your Cosmos DB account, you have two options. You can use Azure Data Factory to move data to another Cosmos DB account or to an alternative storage mechanism, or you can use the Azure Cosmos DB Change Feed to listen to a container and move any changes in that container to another type of storage.
Let’s go through two basic examples for both the Data Factory and Change Feed route and see how we can set this up.
Performing Backups with Azure Data Factory
Azure Data Factory is a managed cloud service that you can use to build awesome and complex ETL, ELT and data integration pipelines with.
We can back up our data periodically using a Copy Data job that copies the documents in our Azure Cosmos DB container and stores them in another location.
I’ve created a Data Factory account and I’ll use that to take my documents from Azure Cosmos DB and store them in Blob Storage. To create new pipelines, we go to Data Factory and click on ‘Author & Monitor’.
We’ll be redirected to the Azure Data Factory UI where we can set up our pipelines and data flows. For this tutorial, I’m just going to perform a simple Copy Data activity to copy the data in my Cosmos DB collection into a Blob container in Blob Storage. Click on ‘Copy Data’ to get started.
To create a new Copy Data activity, we need to set up some configuration for our pipeline. We first have to give it some properties. Below, I’ve given my activity a name and configured it to run on a schedule. If we are doing a one-time backup, we can just run this job once.
Now that we’ve given our activity a name, we need to define a data source and a data destination: where we are getting the data from and where we are sending it to. Let’s set up our data source, which will be our Cosmos DB container. Click on ‘Create New Connection’ and choose Azure Cosmos DB SQL API as our linked service.
Once you’ve configured your account, you need to choose which collection you want to get your data from. I’ve only got one container, so I’ll pick that. We can also choose to export our data as JSON files, so I’ll choose that option as well.
Now that I’ve got my source sorted, I need to configure my destination. I’m going to create a new linked service and choose ‘Azure Blob Storage’.
Once I’ve configured my Storage account, I need to choose where to send my Cosmos DB data as JSON files. I’ve got two containers in my Storage account (one for the Data Factory demo, another for my Change Feed demo). I’m going to choose ‘datafactoryfiles’ as my destination.
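Behind the UI, the source and destination we just picked end up in a Copy activity definition. The sketch below builds roughly what that definition looks like as a Python dictionary. The type names follow the Data Factory connector documentation for the Cosmos DB SQL API source and the JSON sink as I understand it, but the activity name, dataset names and query are placeholders of mine, and the JSON the wizard actually generates may differ.

```python
import json


def build_copy_activity(source_dataset: str, sink_dataset: str) -> dict:
    """Build a Copy activity definition that exports Cosmos DB documents as JSON."""
    return {
        "name": "BackupCosmosToBlob",  # hypothetical activity name
        "type": "Copy",
        "inputs": [{"referenceName": source_dataset, "type": "DatasetReference"}],
        "outputs": [{"referenceName": sink_dataset, "type": "DatasetReference"}],
        "typeProperties": {
            # Read every document in the container through the SQL API connector.
            "source": {"type": "CosmosDbSqlApiSource", "query": "SELECT * FROM c"},
            # Write the documents to Blob Storage as a single JSON array.
            "sink": {
                "type": "JsonSink",
                "storeSettings": {"type": "AzureBlobStorageWriteSettings"},
                "formatSettings": {
                    "type": "JsonWriteSettings",
                    "filePattern": "arrayOfObjects",
                },
            },
        },
    }


if __name__ == "__main__":
    # Both dataset names are made up for this example.
    activity = build_copy_activity("CosmosContainerDataset", "BackupBlobDataset")
    print(json.dumps(activity, indent=2))
```

Having the definition in this form is handy if you later want to keep the pipeline in source control instead of recreating it by hand in the UI.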
Once everything has been set up, Azure Data Factory will validate the pipeline and, provided everything is good, it’ll start running! We can monitor the pipeline to see if it’s still running, and restart it if necessary, by clicking on ‘Monitor’.
Once the pipeline has finished running, we will see our JSON file containing our backups in our Storage account.
Performing Backups with the Change Feed
The Change Feed is a feature in Cosmos DB that listens for changes to any container that you configure it for. The changes are then output in the order in which they were made.
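If the idea is new to you, a toy model may help. The class below is an in-memory stand-in that I made up for illustration, not the Cosmos DB SDK: it records each write in order and lets a reader pick up where it left off using a checkpoint. Keep in mind that in the real service the ordering is guaranteed per logical partition key, and the feed covers inserts and updates.

```python
from dataclasses import dataclass, field


@dataclass
class ToyContainer:
    """In-memory stand-in for a container. NOT the Cosmos DB SDK."""

    log: list = field(default_factory=list)  # every write, in the order it happened

    def upsert(self, item: dict) -> None:
        """Record an insert or update."""
        self.log.append(dict(item))

    def read_changes(self, checkpoint: int = 0):
        """Return the changes made after `checkpoint`, plus the new checkpoint."""
        return self.log[checkpoint:], len(self.log)


if __name__ == "__main__":
    container = ToyContainer()
    container.upsert({"id": "1", "status": "new"})
    container.upsert({"id": "2", "status": "new"})

    changes, checkpoint = container.read_changes()
    print(changes)  # both inserts, oldest first

    container.upsert({"id": "1", "status": "shipped"})
    changes, checkpoint = container.read_changes(checkpoint)
    print(changes)  # only the update made since the last read
```

The checkpoint is what makes the Change Feed useful for backups: each run only has to deal with what changed since the previous run.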
I’ve written up a couple of articles about the Change Feed, which you can read here:
- Working with the Azure Cosmos DB Change Feed Processor in C#
- Implementing Cosmos DB Change Feed using Azure Functions in C#
For the purposes of this article, I’m going to write a basic Azure Function that listens to a container and then persists the changed documents to a container in Blob Storage. Azure Functions are by far the easiest way to get started with the Change Feed in Cosmos DB.
Here’s the code:
In this Function, I’m connecting to my storage account that I want to persist my backups to. I’m then iterating through each of the changes on this container captured by the Change Feed and then writing those changes to a JSON file. I’m then uploading that file to blob storage.
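The steps described above can be sketched in Python as follows. This is not the Function from the repo, just a minimal illustration of the same logic: take a batch of changed documents, serialise them to JSON and hand the result to an uploader. The function name, the blob naming scheme and the `upload` callable are all assumptions of mine. In a real Function the uploader would wrap a Blob Storage client call, and the batch would come from the Cosmos DB trigger.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Iterable, Optional


def backup_changes(
    changes: Iterable[dict],
    upload: Callable[[str, bytes], None],
    now: Optional[datetime] = None,
) -> Optional[str]:
    """Write a batch of changed documents to one JSON blob and return its name."""
    documents = list(changes)
    if not documents:
        return None  # nothing changed, so there is nothing to upload

    # Hypothetical naming scheme: one file per batch, named by UTC timestamp.
    timestamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%S")
    blob_name = f"backup-{timestamp}.json"

    payload = json.dumps(documents, indent=2).encode("utf-8")
    upload(blob_name, payload)  # assumed to store `payload` under `blob_name`
    return blob_name


if __name__ == "__main__":
    fake_storage = {}  # a dict stands in for the blob container

    def fake_upload(name: str, data: bytes) -> None:
        fake_storage[name] = data

    batch = [{"id": "1", "status": "shipped"}, {"id": "2", "status": "new"}]
    name = backup_changes(batch, fake_upload)
    print("Uploaded", name, "with", len(fake_storage[name]), "bytes")
```

Passing the uploader in as a parameter keeps the serialisation logic testable without a Storage account. Note that a timestamp with one-second resolution can collide if two batches arrive within the same second, so a production version would want a more unique name.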
If you want to see the whole sample, please check out this GitHub repo.
We should now see our backup file in our Storage account:
In this article, I discussed two routes that you can use to handle backups for your Cosmos DB accounts. My recommendation when it comes to backups is to take matters into your hands as much as possible. I’d recommend exploring the Data Factory route or Change Feed route, particularly if your backup requirements are unique and require special handling.
If you do need the Cosmos DB team to restore anything for you, I’d recommend that you ask as soon as possible. Automatic backups are only kept for a limited time, so the sooner you get onto it the better.
Hopefully you’ve enjoyed this article. As always, if you have any questions, please feel free to ask in the comments below.