If you store a large number of data files in S3, one of the first recommendations you will hear is to implement life-cycle policies that transition S3 data to relatively inexpensive storage classes such as S3 Glacier or Glacier Deep Archive. The motivation is to save on storage costs when archiving large amounts of data that are accessed infrequently.
However, you may be surprised to know that implementing these life-cycle policies can actually turn out to be more expensive in some cases.
The intent of this blog is to explain, with the help of real-life use cases, illustrations and practical cost calculations, why it is important to evaluate the costs before you implement life-cycle policies to move your data from S3 standard storage to the S3 Glacier (or Glacier Deep Archive) storage class.
Here is a quick background on S3, storage classes and the need for transition, in case you are new to these topics.
S3 – Unlimited cloud storage
AWS S3 is one of the most popular cloud object stores in use today. Its use as a data store has risen phenomenally, and for good reason: it provides a fairly inexpensive way to store large amounts of data in the cloud in a highly available and reliable manner. However, there are two sides to every coin. Cloud computing is a boon, but it can cost you a fortune if you’re not careful about your resources in the cloud, and S3 is no exception. That is precisely why S3 gives you the option to store data in various storage classes.
S3 Storage Classes – flexible price points
S3 storage classes let you store data at varied price points depending on requirements such as availability, reliability and frequency of access. Given the number of storage classes S3 has added over the years, it’s a mammoth topic that deserves a blog post of its own, so we will limit our discussion to the Glacier storage class here. The idea discussed should apply to other storage classes as well.
For the sake of this discussion, it is worth mentioning that AWS S3 initially started off with standard storage class and was extended by providing a mechanism to transition files to Glacier storage as a cheaper storage alternative to facilitate archiving infrequently accessed data at a much more inexpensive storage price. The reduction in storage costs comes at a price – you pay an additional charge every time the data is accessed, thus, the more frequently you access data in Glacier, the more you pay.
S3 to Glacier – Quick Introduction
Glacier Use Case
The prescribed use case for Glacier is archival storage: if you have a large amount of data that needs to be archived and is accessed very infrequently, Glacier and Glacier Deep Archive are the storage classes to consider.
How it works
This topic is probably worth a blog post in itself, so we will just try to cover the basics.
In terms of setup, AWS makes the S3 to Glacier transition a cakewalk. It lets us define life-cycle policies at the bucket level, and each policy specifies:
- which object – the S3 objects (within the bucket) that the policy applies to. There are multiple ways to define that selection, such as tags, an object prefix, or all objects. For a versioned bucket, it also lets us define whether the policy applies to the current version, previous versions, or both.
- when – the time period after which the policy action takes place on selected object(s)
- what action – expire (delete) or transition to another storage class.
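To make the three parts of a policy concrete, here is a minimal sketch in Python using boto3. The bucket name, the `logs/` prefix, the 90-day delay and both helper functions are hypothetical and chosen only for illustration; `put_bucket_lifecycle_configuration` is the boto3 call that installs the rules on a bucket.

```python
def build_glacier_rule(prefix: str, days: int, storage_class: str = "GLACIER") -> dict:
    """Build one life-cycle rule: which objects, when, and what action."""
    return {
        "ID": f"archive-{prefix.strip('/') or 'all'}-after-{days}d",
        "Status": "Enabled",
        # which object: everything under this key prefix
        "Filter": {"Prefix": prefix},
        # when and what action: transition after `days` days
        "Transitions": [{"Days": days, "StorageClass": storage_class}],
    }


def apply_rules(bucket: str, rules: list) -> None:
    """Install the rules on a bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the rule builder works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )


# Hypothetical example: archive everything under logs/ after 90 days.
rule = build_glacier_rule("logs/", 90)
print(rule)
# apply_rules("my-example-bucket", [rule])  # uncomment to apply for real
```

Note that this call replaces the bucket's existing life-cycle configuration, so any rules already in place need to be included in the list.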
What makes Glacier attractive
The one and only reason that makes the Glacier storage class attractive is its price point. The price varies by region, but the per GB storage cost of the S3 Glacier storage class is less than one-fifth the price of S3 standard.
For example, as of March 2020, in the US East (N. Virginia) region, here is a comparison of per GB storage costs for S3 standard and Glacier:
|Data size||S3 Standard||S3 Glacier||S3 Glacier Deep Archive|
|First 50 TB / Month||$0.023/GB||$0.004/GB||$0.00099/GB|
|Next 450 TB / Month||$0.022/GB||$0.004/GB||$0.00099/GB|
|Over 500 TB / Month||$0.021/GB||$0.004/GB||$0.00099/GB|
It seems very tempting, doesn’t it? However, as we will see later in this blog, there is a lot more to this than meets the eye. Let us try to understand it in a bit more detail.
S3 to Glacier – Cost Analysis – Let’s experiment
Given the way S3, Glacier and life-cycle policies are perceived (and, to an extent, promoted), it is almost taken for granted that transitioning from S3 to Glacier will save us a ton of money. However, that is not always true; it depends on the context, because there are intricacies in how data placed in Glacier is charged.
Let us dive deeper to understand this better by taking an example and looking at various components that contribute to our overall cost – this includes overall costs when transition is being made and the overall costs after the transition.
Storage cost is the most complex and probably most neglected part of Glacier costs. Although the per GB price of Glacier storage is much less than the per GB price of S3 standard storage, storing an object in Glacier storage class requires an additional 40 KB per object (irrespective of the original size of the object).
So, we are billed for that additional 40 KB per object. 8 KB of this 40 KB is billed at S3 standard storage price, while 32 KB is billed at Glacier storage price. On the surface, this does not appear to be a problem, and seems negligible. However this could lead to a big escalation in overall costs in certain situations.
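The effect of the 40 KB overhead is easier to see with numbers. The following is a minimal sketch, not an official AWS calculator: it assumes the March 2020 US East prices of $0.023/GB for S3 standard (first tier only) and $0.004/GB for Glacier, uses binary units (1 GB = 1024 × 1024 KB), and ignores request and retrieval charges. The function and constant names are ours.

```python
KB_PER_GB = 1024 * 1024          # assumption: binary units
STANDARD_PER_GB = 0.023          # assumption: first-tier S3 standard price, per GB-month
GLACIER_PER_GB = 0.004           # Glacier price, per GB-month
OVERHEAD_AT_STANDARD_KB = 8      # per-object overhead billed at the S3 standard price
OVERHEAD_AT_GLACIER_KB = 32      # per-object overhead billed at the Glacier price


def standard_monthly_cost(num_objects: float, object_kb: float) -> float:
    """Monthly storage cost if the objects stay in S3 standard."""
    return num_objects * object_kb / KB_PER_GB * STANDARD_PER_GB


def glacier_monthly_cost(num_objects: float, object_kb: float) -> float:
    """Monthly storage cost in Glacier, including the 40 KB per-object overhead."""
    glacier_gb = num_objects * (object_kb + OVERHEAD_AT_GLACIER_KB) / KB_PER_GB
    standard_gb = num_objects * OVERHEAD_AT_STANDARD_KB / KB_PER_GB
    return glacier_gb * GLACIER_PER_GB + standard_gb * STANDARD_PER_GB


# 100 GB stored as objects of different sizes.
total_kb = 100 * KB_PER_GB
for object_kb in (1, 10, 100, 1000):
    n = total_kb / object_kb
    print(f"{object_kb:>5} KB objects: "
          f"standard ${standard_monthly_cost(n, object_kb):.2f}, "
          f"glacier ${glacier_monthly_cost(n, object_kb):.2f}")
```

Under these assumptions, 100 GB stored as 1 KB objects costs $2.30 per month in S3 standard but $31.60 per month in Glacier, because the overhead alone adds 4,000 GB of billable storage.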
To understand this better, consider the following table that illustrates the impact of this additional storage on net effective costs in various scenarios.
Let’s try to understand the data presented in the table above. For our cost analysis, we have considered two broad categories of scenarios.
Scenario 1 – Keeping the total size of data constant, vary the size of each object, which in turn would vary the number of objects.
For this analysis, we do the following:
- Total storage size = 100 GB.
- Vary the size of file(s) from 1 KB each to go up to 1000 KB each, keeping the total size of the data constant at 100 GB.
Here are our observations from this analysis of the scenario under consideration:
- When using the S3 standard storage, the monthly storage cost (which obviously would be recurring in nature) is constant at $2.30.
- However, when using Glacier storage for the same data files, the monthly storage cost varies from around $30 down to about 50 cents, depending on the size of each file (the smaller the files, the higher the cost).
This comes as a big surprise: instead of paying $2.30, we may end up paying around $30 per month after moving objects to Glacier, which is not something we would expect.
So does this mean we should not move our data to Glacier? In some cases, yes, that is the harsh reality. In other cases, however, moving data to Glacier does make sense because it saves us a lot of money on a recurring monthly basis. This can be seen for the bigger files in the above analysis, and it is illustrated further by our second scenario described below.
Scenario 2 – Keeping the number of files constant, we are increasing the size of each file thereby increasing the total size of data that is stored.
As can be seen from the table above, as the file size increases, there is a substantial benefit in moving files to Glacier. For example, with a file size of 2000 KB and 1 million files, S3 standard storage would cost us about $44 for the month, while S3 Glacier storage would cost us about $8.
If you took just storage costs into consideration while moving files from S3 standard storage to Glacier, then I’m afraid you may have to reconsider. That’s because there are additional costs for the transition operation as well. The transition is treated as a new write to Glacier storage, and a PUT request is charged for every file moved to Glacier. And if you look at the cost here, you are probably in for a ‘nasty’ surprise.
Each PUT operation to Glacier is 10 times costlier than a PUT operation in S3 standard.
Yes, you read that right. As an example, in the US East region, 1,000 PUT requests to S3 cost $0.005, while the same number of PUT requests to Glacier cost $0.05.
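Because the charge is per request, the one-time transition cost depends only on the number of objects and not on their size. Here is a small sketch using the $0.05 per 1,000 requests quoted above; the function name is ours, and the object counts assume 100 GB of data in binary units.

```python
TRANSITION_PER_1000 = 0.05   # PUT/transition requests to Glacier, per 1,000
STANDARD_PUT_PER_1000 = 0.005  # PUT requests to S3 standard, for comparison


def transition_cost(num_objects: float) -> float:
    """One-time cost of transitioning num_objects objects to Glacier."""
    return num_objects / 1000 * TRANSITION_PER_1000


# 100 GB (assumed 1 GB = 1024 * 1024 KB) split into objects of different sizes.
total_kb = 100 * 1024 * 1024
for object_kb in (1, 10, 100, 1000):
    n = total_kb / object_kb
    print(f"{object_kb:>5} KB objects: {n:>13,.0f} transitions "
          f"-> ${transition_cost(n):,.2f}")
```

With 1 KB objects, the same 100 GB needs about 105 million transition requests, which comes to roughly $5,243 before a single month of Glacier storage has been billed.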
So let’s look at this cost in the context of our earlier scenarios in another table below.
As you may observe, there is a significant one-time transition cost that you may have to pay, depending on the number of files that you’re moving to Glacier. Some of these scenarios may seem a bit hypothetical, but when you are working with a large amount of data in the form of files in S3, you cannot really control the size of each file unless the source producing the files takes that into account. The key here is the number of files being moved. If the size of each file is beyond a given threshold, then the total number of files being moved is considerably reduced (assuming the total data size is constant), and hence the initial transition cost to move to Glacier is well under control.
So far we have looked at the one-time transition cost and the recurring monthly storage costs. Now let’s analyze our break-even point, that is, when we really start saving money once the life-cycle policy kicks in and data is transitioned to Glacier, considering both the one-time transition cost and the recurring monthly storage costs.
Considering the example here, as you may note in the table above:
- For files below 100 KB, we would never reach a break-even point, so we will never be able to save money by moving our data to Glacier storage.
- Even with a file size of around 100 KB or 200 KB, it’ll take us 31 months or 15 months respectively before we can start saving any real money.
- For a file size of 500 KB, it’ll take us six months to reach the break-even point, after which we should be able to save some money on recurring storage costs.
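The break-even point can be estimated by dividing the one-time transition cost by the monthly saving. Since both scale with the number of objects, the result depends only on the object size. The sketch below is an approximation under our own assumptions (first-tier S3 standard price of $0.023/GB, binary units, no retrievals and no minimum storage duration charges), so its output will not match the figures above exactly.

```python
from typing import Optional

KB_PER_GB = 1024 * 1024        # assumption: binary units
STANDARD_PER_GB = 0.023        # assumption: first-tier S3 standard price
GLACIER_PER_GB = 0.004
TRANSITION_PER_REQUEST = 0.05 / 1000


def monthly_saving_per_object(object_kb: float) -> float:
    """Monthly saving (can be negative) from keeping one object in Glacier."""
    standard = object_kb / KB_PER_GB * STANDARD_PER_GB
    glacier = ((object_kb + 32) / KB_PER_GB * GLACIER_PER_GB
               + 8 / KB_PER_GB * STANDARD_PER_GB)   # 40 KB overhead
    return standard - glacier


def breakeven_months(object_kb: float) -> Optional[float]:
    """Months until savings repay the transition cost, or None if never."""
    saving = monthly_saving_per_object(object_kb)
    if saving <= 0:
        return None
    return TRANSITION_PER_REQUEST / saving


for object_kb in (1, 10, 100, 200, 500):
    months = breakeven_months(object_kb)
    label = "never" if months is None else f"{months:.1f} months"
    print(f"{object_kb:>4} KB objects: {label}")
```

Under these assumptions, 1 KB and 10 KB objects never break even, while 100 KB, 200 KB and 500 KB objects break even after roughly 33, 15 and 6 months, which is in the same range as the figures quoted above.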
Data Retrieval (Restore) Cost
So we have looked at storage costs as well as one-time transition costs and compared them to the respective numbers for standard S3 storage. Now let’s look at an additional type of cost associated with Glacier storage: the data retrieval cost. Since Glacier is archival storage, there is a cost associated with retrieving data from it.
That retrieval cost is divided into two parts:
- a data retrieval request submission portion, the cost of which is measured in terms of number of requests (one per object) submitted
- the actual data retrieval itself, which is measured (and charged) by the total size of the data retrieved, in GB.
Again with Glacier, there are three types of data retrieval supported, based on how fast the data needs to be retrieved and the charges vary based on the type. For example, for standard retrieval type, AWS charges $0.05 per thousand requests for data retrieval request portion of the cost and charges $0.01 per GB for the actual data retrieval portion of the cost.
The following table summarizes the data retrieval cost for the scenarios we have discussed earlier considering standard type of data retrieval request.
Additionally, as a result of the retrieval operation, the data (from Glacier) is “restored” or copied to S3 for the requested number of days. This additional copied data (temporarily re-stored) in S3 is in addition to the actual data that is stored in Glacier. Obviously, there is an additional charge associated with this “copied” data, and that is calculated on a per restore per day basis.
As you may note in the table above, there is a cost associated with each retrieval request, and the retrieval cost varies significantly based on the number of files, the total size of the data and the number of days for which the retrieved/copied data needs to be retained. Thus, the access patterns (in terms of frequency) also need to be considered before deciding to implement a life-cycle policy for transitioning data from S3 to archival storage like Glacier (or Glacier Deep Archive).
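The three parts of a restore can be added up as follows. This is a rough sketch for the standard retrieval type, using the $0.05 per 1,000 requests and $0.01 per GB quoted earlier. How the temporary copy is billed is our assumption: we prorate it at the first-tier S3 standard price of $0.023/GB over a 30-day month. The function name and the returned keys are ours.

```python
KB_PER_GB = 1024 * 1024          # assumption: binary units
REQUEST_PER_1000 = 0.05          # standard retrieval requests, per 1,000
RETRIEVAL_PER_GB = 0.01          # standard retrieval, per GB
STANDARD_PER_GB_MONTH = 0.023    # assumption: rate for the temporary copy
DAYS_PER_MONTH = 30              # assumption: used to prorate daily storage


def retrieval_cost(num_objects: float, object_kb: float, restore_days: int) -> dict:
    """Estimate the cost of restoring every object once for restore_days days."""
    data_gb = num_objects * object_kb / KB_PER_GB
    requests = num_objects / 1000 * REQUEST_PER_1000
    retrieval = data_gb * RETRIEVAL_PER_GB
    temporary_copy = data_gb * STANDARD_PER_GB_MONTH * restore_days / DAYS_PER_MONTH
    return {
        "requests": requests,
        "retrieval": retrieval,
        "temporary_copy": temporary_copy,
        "total": requests + retrieval + temporary_copy,
    }


# 100 GB stored as 100 KB objects, restored for 7 days.
cost = retrieval_cost(num_objects=1_048_576, object_kb=100, restore_days=7)
for part, dollars in cost.items():
    print(f"{part:>15}: ${dollars:.2f}")
```

For small objects the request portion dominates: in this example the requests cost about $52, while the retrieval itself costs $1 and the temporary copy about 54 cents.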
I realize we have covered a lot in this blog post, hopefully providing a good insight into somewhat opaque/complex (or not so obvious) aspects of costs that are associated with transitioning data from S3 standard to Glacier storage.
The data presented in the above scenarios illustrates the cases where transitioning S3 data to Glacier may not make much sense, while also clarifying the scenarios where it does make sense. So, the key is to evaluate and calculate the potential cost savings (or cost implications) before implementing a life-cycle policy to transition data from S3 standard to Glacier.
There is a lot more to talk about in terms of next steps and how best to utilize the archival storage classes to our advantage, but considering the amount of content already covered here (and the length of the blog post), it will probably be a better idea to cover it in a following blog post.
Hope you enjoyed reading this blog post, and it will help you design and implement a better life-cycle transition strategy for your data in S3.
Feedback and suggestions are most welcome as always.
Happy coding…Happy storing…Happy saving…..!!!