A Look At Deduplication In Keep
What is Keep?
Keep is the Arvados storage layer. All data in Arvados is stored in Keep.
Keep can be mounted as a regular filesystem on GNU/Linux via FUSE. It can also be accessed via HTTP, WebDAV, an S3-compatible interface, CLI tools, and programmatically via the Arvados SDKs.
Keep is implemented as a content-addressed distributed storage system, backed by object storage (e.g. an S3-compatible object store) or POSIX file systems. Content addressing means that the data in Keep is referenced by a cryptographic hash of its contents. Content addressing provides a number of benefits which are described in more detail in the Introduction to Keep.
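To make the idea of content addressing concrete, here is a minimal in-memory sketch. It is not Keep's API: the ToyBlockStore class and its methods are hypothetical names invented for illustration. It assumes an MD5 digest followed by the size in bytes as the address, which matches the shape of the block locators shown later in this article.

```python
import hashlib


class ToyBlockStore:
    """In-memory content-addressed store (illustrative only, not Keep's API)."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself: MD5 hex digest plus size.
        locator = f"{hashlib.md5(data).hexdigest()}+{len(data)}"
        # Identical content always maps to the same key, so it is stored only once.
        self._blocks.setdefault(locator, data)
        return locator

    def get(self, locator: str) -> bytes:
        return self._blocks[locator]

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self._blocks.values())


store = ToyBlockStore()
first = store.put(b"example data\n")
second = store.put(b"example data\n")  # same content, same locator, no extra storage
print(first == second, store.stored_bytes())
```

Writing the same bytes twice yields the same locator and consumes no additional space, which is the property that deduplication builds on.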
Keep is a system with two layers: collections refer to a group of files and/or directories, and blocks contain the contents of those files and directories. In Arvados, users and programs interface only with collections; blocks are how those collections store their contents, but they are not directly accessible to the Arvados user.
Collections have a manifest, which can be thought of as an index that describes the content of the collection: the list of files and directories it contains. It is implemented as a list of paths, each of which is accompanied by a list of hashes of blocks stored in Keep.
Collections and their manifests are stored in the Arvados API. Each collection has a UUID and a portable data hash, which is the hash of its manifest. Keep blocks are stored by the keepstore daemons.
The Keep architecture is described in more detail in the documentation.
Keep benefits: deduplication
Keep relies on content addressing to provide automatic, built-in data deduplication at the block level. This section shows how the deduplication works through an example. It also illustrates how the effectiveness of the deduplication can be measured.
As an experiment, we start by creating a new collection that contains 2 text files:
The arv-put command returned the collection UUID pirca-4zz18-46zcwbae8mkesob. Let’s have a look at that collection object:
The “manifest_text” field contains the collection manifest:
. 77fc3b646b70c5118ce358a9ef76b3b1+13+Ab4c280ce1400d0e94748522f3353fb8a4e76aaf0@60104148 0:6:hello.txt 6:7:world.txt\n
The contents of the 2 files were combined in 1 block with hash 77fc3b646b70c5118ce358a9ef76b3b1 and size 13 bytes. The remainder of the block locator is the permission signature, which can be ignored in the context of this discussion. Each filename is preceded by its start position and its size in bytes. The start position is counted from the beginning of the data formed by concatenating the listed blocks, which here is a single block. The manifest format is explained in greater detail in the manifest format documentation.
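The structure of that manifest line can be pulled apart with a few lines of Python. This is a simplified sketch, not the parser used by the Arvados SDKs: parse_stream, Block, and FileSegment are hypothetical names, and the code only handles the single-line form shown in this article.

```python
import re
from typing import NamedTuple


class Block(NamedTuple):
    md5: str
    size: int


class FileSegment(NamedTuple):
    start: int
    length: int
    name: str


# hash+size, optionally followed by hints such as the +A... permission signature
BLOCK_RE = re.compile(r"^([0-9a-f]{32})\+(\d+)(\+\S+)?$")
# start:length:filename
SEGMENT_RE = re.compile(r"^(\d+):(\d+):(\S+)$")


def parse_stream(line: str):
    """Split one manifest line into its stream name, blocks, and file segments."""
    tokens = line.split()
    stream_name, blocks, segments = tokens[0], [], []
    for token in tokens[1:]:
        block = BLOCK_RE.match(token)
        if block:
            blocks.append(Block(block.group(1), int(block.group(2))))
            continue
        seg = SEGMENT_RE.match(token)
        if seg:
            segments.append(
                FileSegment(int(seg.group(1)), int(seg.group(2)), seg.group(3))
            )
            continue
        raise ValueError(f"unrecognized token: {token!r}")
    return stream_name, blocks, segments


manifest = (
    ". 77fc3b646b70c5118ce358a9ef76b3b1+13"
    "+Ab4c280ce1400d0e94748522f3353fb8a4e76aaf0@60104148"
    " 0:6:hello.txt 6:7:world.txt\n"
)
stream, blocks, segments = parse_stream(manifest)
print(stream, blocks, segments)
```

The signature hint is matched but deliberately discarded, since only the hash and the size matter for deduplication.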
What would happen if we change one of the files and then create a new collection with the 2 files? For this example, I’ve uppercased the contents of the ‘hello.txt’ file.
Let’s have a look at the new collection:
We can compare the manifest text for both collections. This was the original manifest:
. 77fc3b646b70c5118ce358a9ef76b3b1+13+Ab4c280ce1400d0e94748522f3353fb8a4e76aaf0@60104148 0:6:hello.txt 6:7:world.txt\n
and this is the new manifest:
. 0084467710d2fc9d8a306e14efbe6d0f+6+A1556ac1a8d8a6e8cc5cdfacfde839ef6a61ace12@60104205 77fc3b646b70c5118ce358a9ef76b3b1+13+Ae501281d9a3f01f7a6606b08ef3fb347cc2c2135@60104205 0:6:hello.txt 12:7:world.txt\n
We see that an additional Keep block is referenced, with hash 0084467710d2fc9d8a306e14efbe6d0f and size 6 bytes, which contains the new contents of ‘hello.txt’. The original block is reused for the ‘world.txt’ file, which didn’t change.
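The reuse can be checked by resolving each file's position against the blocks listed in the new manifest. The sketch below assumes that file positions are offsets into the concatenation of the listed blocks, which is consistent with ‘world.txt’ starting at position 12 in the new manifest. The locate function is a hypothetical helper written for this article.

```python
def locate(blocks, start, length):
    """Map a file segment onto (block_hash, offset_in_block, nbytes) pieces."""
    pieces = []
    block_start = 0
    end = start + length
    for block_hash, size in blocks:
        block_end = block_start + size
        # Overlap between the file segment and this block, in stream coordinates.
        lo = max(start, block_start)
        hi = min(end, block_end)
        if lo < hi:
            pieces.append((block_hash, lo - block_start, hi - lo))
        block_start = block_end
    return pieces


new_blocks = [
    ("0084467710d2fc9d8a306e14efbe6d0f", 6),
    ("77fc3b646b70c5118ce358a9ef76b3b1", 13),
]
hello = locate(new_blocks, 0, 6)    # 0:6:hello.txt
world = locate(new_blocks, 12, 7)   # 12:7:world.txt
print(hello)
print(world)
```

Here ‘hello.txt’ resolves to the new 6-byte block, while ‘world.txt’ resolves to offset 6 of the original 13-byte block, the same place it occupied in the first collection.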
We can verify this by asking Arvados for a deduplication report for the 2 collections:
Keep’s deduplication saved us 7 bytes of storage (27% of the nominal storage size).
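The numbers in that report can be reproduced by hand. The sketch below is not the Arvados implementation. It assumes that the nominal size is the sum of the file sizes in every collection and that the actual size is the total size of the distinct blocks referenced, a reading chosen because it yields the 7 bytes and 27% quoted above. The dedup_report function is a hypothetical helper.

```python
def dedup_report(collections):
    """Each collection is (blocks, files):
    blocks = [(hash, size)], files = [(start, length, name)]."""
    # Nominal: what we would store if every collection kept its own copy of each file.
    nominal = sum(length for _, files in collections for _, length, _ in files)
    # Actual: each distinct block is stored once, however many collections use it.
    unique_blocks = {h: size for blocks, _ in collections for h, size in blocks}
    actual = sum(unique_blocks.values())
    saved = nominal - actual
    return {
        "nominal": nominal,
        "actual": actual,
        "saved": saved,
        "saved_pct": 100 * saved / nominal,
    }


original = (
    [("77fc3b646b70c5118ce358a9ef76b3b1", 13)],
    [(0, 6, "hello.txt"), (6, 7, "world.txt")],
)
updated = (
    [("0084467710d2fc9d8a306e14efbe6d0f", 6),
     ("77fc3b646b70c5118ce358a9ef76b3b1", 13)],
    [(0, 6, "hello.txt"), (12, 7, "world.txt")],
)
report = dedup_report([original, updated])
print(report)
```

The two collections hold 26 bytes of files between them, but only 19 bytes of distinct blocks need to be stored.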
Conclusion
While the gains in this contrived example are quite small, in the real world the deduplication feature is quite powerful. It saves a ton of storage space and, as a consequence, a lot of money. Here’s a real-world example, looking at how much data is saved by Keep’s deduplication in the top 100 largest collections on the Arvados playground:
A storage system without deduplication would have stored 15.4 TiB. Because of Keep’s built-in deduplication, we store that data in 1.7 TiB, for a savings of about 89% of the nominal size.
This is a relatively small Arvados installation. On a bigger installation, the savings really add up. Here’s an example from a bigger Arvados cluster with roughly 5.5 million collections. The deduplication report was run on the top 5000 largest collections on the cluster:
Keep saved about a petabyte of storage space (roughly 63% of the nominal size of the stored data).
If this data were stored on S3 at $0.022 per GiB per month, the nominal storage cost would be around $37,250 per month. Because of Keep’s deduplication, the actual storage cost would only be $13,350 per month. That’s a savings of $23,900 per month!
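As a quick sanity check, the cost figures can be converted back into storage volumes. The snippet below uses only the price and monthly costs quoted above; the variable names are illustrative.

```python
PRICE_PER_GIB = 0.022   # USD per GiB per month, as quoted above
nominal_cost = 37_250   # USD per month without deduplication
actual_cost = 13_350    # USD per month with deduplication

monthly_savings = nominal_cost - actual_cost

# Convert the monthly costs back into the storage volumes they imply.
nominal_gib = nominal_cost / PRICE_PER_GIB
actual_gib = actual_cost / PRICE_PER_GIB
saved_pib = (nominal_gib - actual_gib) / 1024**2  # 1 PiB = 1024**2 GiB

print(f"monthly savings: ${monthly_savings:,}")
print(f"storage saved: {saved_pib:.2f} PiB")
```

The implied saving is a little over 1 PiB, in line with the "about a petabyte" figure from the report.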
Try it yourself!
If you liked this experiment, feel free to replicate it with a free account on the Arvados playground, or have a look at the documentation for installing Arvados.
Alternatively, Curii Corporation provides managed Arvados installations as well as commercial support for Arvados. Please contact info@curii.com for more information.