Source Control HOWTO: Repositories


is mission critical data. We have to consider things like performance and backups and RAID and administration. The cost of storing 12 TB of ultra-important data is more than just the cost of the actual disk platters.

So we actually do have an incentive to store this information a bit more efficiently. Fortunately, there is an obvious reason why this is going to be easy to do. We observe that tree N is often not terribly different from tree N-1. By definition, each version of the tree is derived from its predecessor. A checkin might be as simple as a one-line fix to a single file. All of the other files are unchanged, so we don't really need to store another copy of them.

So, we don't want to store the full contents of the tree for every single change. Instead, we want a way to store a tree represented as a set of changes to another tree. We call this a "delta."
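To make the idea concrete, here is a minimal sketch of a delta between two trees. It assumes a tree can be modeled as a Python dict mapping each file path to its contents, and that a delta only needs whole-file granularity. The names `compute_delta` and `apply_delta` are hypothetical, chosen for illustration; they are not taken from any particular SCM tool.

```python
def compute_delta(old_tree, new_tree):
    """Return the changes that turn old_tree into new_tree.

    Assumption: a tree is a dict of {path: contents}. Unchanged files
    do not appear in the delta at all.
    """
    delta = {"set": {}, "remove": []}
    for path, contents in new_tree.items():
        if path not in old_tree or old_tree[path] != contents:
            delta["set"][path] = contents      # added or modified file
    for path in old_tree:
        if path not in new_tree:
            delta["remove"].append(path)       # deleted file
    return delta


def apply_delta(tree, delta):
    """Rebuild the newer tree from the older tree plus a delta."""
    result = dict(tree)
    for path in delta["remove"]:
        del result[path]
    result.update(delta["set"])
    return result


tree_n_minus_1 = {
    "README": "hello\n",
    "Makefile": "all:\n",
    "src/main.c": "int main() { return 1; }\n",
}
# A checkin that is a one-line fix to a single file.
tree_n = dict(tree_n_minus_1)
tree_n["src/main.c"] = "int main() { return 0; }\n"

delta = compute_delta(tree_n_minus_1, tree_n)
print(sorted(delta["set"]))    # only the one changed path is stored
```

The two unchanged files never appear in the delta, which is exactly the saving we are after.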

Delta Direction
As we decide to store our repositories using deltas, we must be concerned about performance. Retrieving a tree which is in a deltified representation requires more effort than retrieving one which is stored in full. For example, let's suppose that version 1 of the tree is stored in full, but every subsequent revision is represented as a delta from its predecessor. This means that in order to retrieve version 4,686, we must first retrieve version 1 and then apply 4,685 deltas. With this approach, the cost of retrieving a version grows with its distance from version 1, so early versions come back quickly and recent ones slowly. When using this approach we say that we are using "forward deltas," because each delta expresses the set of changes from one version to the next.
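The cost of forward deltas can be sketched in a few lines. This is an illustration under stated assumptions, not how any real tool stores its data: a tree is a dict of {path: contents}, a delta is a dict of changed paths where `None` means "remove this file," and the class name `ForwardDeltaStore` is hypothetical.

```python
def apply_delta(tree, delta):
    """Return a new tree with the delta applied (None removes a path)."""
    result = dict(tree)
    for path, contents in delta.items():
        if contents is None:
            result.pop(path, None)
        else:
            result[path] = contents
    return result


class ForwardDeltaStore:
    """Version 1 is stored in full; every later version is a forward delta."""

    def __init__(self, first_tree):
        self.first_tree = dict(first_tree)
        self.deltas = []          # deltas[i] turns version i+1 into version i+2

    def commit(self, delta):
        self.deltas.append(dict(delta))
        return len(self.deltas) + 1           # the new version number

    def retrieve(self, version):
        if not 1 <= version <= len(self.deltas) + 1:
            raise ValueError("no such version")
        tree = dict(self.first_tree)
        applied = 0
        for delta in self.deltas[:version - 1]:
            tree = apply_delta(tree, delta)
            applied += 1
        return tree, applied                  # applied == version - 1


store = ForwardDeltaStore({"counter.txt": "1"})
for n in range(2, 4687):
    store.commit({"counter.txt": str(n)})

tree, applied = store.retrieve(4686)
print(applied)    # 4685 deltas applied to reach the latest version
```

Retrieving version 1 applies no deltas at all, while the latest version pays for the entire history.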

We observe that not all versions of the tree are equally likely to be retrieved. For example, version 83 of the Vault tree is not special in any way. It is likely that we have not retrieved that version in over a year. I suspect that we will never retrieve it again. However, we retrieve the latest version of the tree many times per day. In fact, as a broad generalization, we can say that at any given moment, the most recent version of the tree is probably the most likely one to be needed.

The simplistic use of forward deltas delivers its worst performance for the most common case. Not good.

Another idea is to use "reverse deltas." In this approach, we store the most recent tree in full. Every other tree N is represented as a set of differences from tree N+1. This approach delivers its best performance for the most common case, but it can still take an awfully long time to retrieve older trees.
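Here is a comparable sketch for reverse deltas, again under assumptions of my own: trees are dicts of {path: contents}, a delta maps each changed path to its contents (or to `None` for a removal), and `ReverseDeltaStore` is a hypothetical name. The interesting part is `commit`, which records how to get from the new tree back to the old one.

```python
def diff(src, dst):
    """Changes that turn src into dst; None marks a path to remove."""
    delta = {p: c for p, c in dst.items() if p not in src or src[p] != c}
    for path in src:
        if path not in dst:
            delta[path] = None
    return delta


def apply_delta(tree, delta):
    result = dict(tree)
    for path, contents in delta.items():
        if contents is None:
            result.pop(path, None)
        else:
            result[path] = contents
    return result


class ReverseDeltaStore:
    """The latest tree is stored in full; tree N is a delta from tree N+1."""

    def __init__(self, first_tree):
        self.latest = dict(first_tree)
        self.reverse_deltas = []   # reverse_deltas[i] turns version i+2 into i+1

    def commit(self, new_tree):
        # Store the way back from the new tree to the previous one.
        self.reverse_deltas.append(diff(new_tree, self.latest))
        self.latest = dict(new_tree)

    def retrieve(self, version):
        latest_version = len(self.reverse_deltas) + 1
        if not 1 <= version <= latest_version:
            raise ValueError("no such version")
        tree = dict(self.latest)
        applied = 0
        # Walk backward from the latest version to the requested one.
        for delta in reversed(self.reverse_deltas[version - 1:]):
            tree = apply_delta(tree, delta)
            applied += 1
        return tree, applied


store = ReverseDeltaStore({"main.c": "v1"})
store.commit({"main.c": "v2"})
store.commit({"main.c": "v3", "util.c": "new"})

print(store.retrieve(3))   # latest tree, no deltas applied
print(store.retrieve(1))   # oldest tree, two deltas applied
```

The latest tree comes back without applying anything, and the price moves to the oldest versions instead.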

Some SCM tools use a compromise design. In one approach, instead of storing just one full tree and representing every other tree as a delta, we sprinkle a few more full trees along the way. For example, suppose that we store a full tree for every 10th version. This approach uses more disk space, but the SCM server never has to apply more than 9 deltas to retrieve any tree.
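The arithmetic behind that bound is easy to check. The sketch below assumes forward deltas, a full tree for version 1, and a full tree for every version that is a multiple of the interval. Both function names are hypothetical, and the model ignores everything except the count of deltas and full trees.

```python
def deltas_needed(version, interval=10):
    """Forward deltas to apply on top of the nearest earlier full tree."""
    if version < 1:
        raise ValueError("versions start at 1")
    if version < interval:
        return version - 1         # rebuild from version 1
    return version % interval      # rebuild from the last multiple of interval


def full_trees_stored(latest_version, interval=10):
    """Full trees on disk: version 1 plus one per multiple of interval."""
    return 1 + latest_version // interval


latest = 4686
print(deltas_needed(latest))                                   # 6
print(max(deltas_needed(v) for v in range(1, latest + 1)))     # 9
print(full_trees_stored(latest))                               # 469
```

Under these assumptions, version 4,686 is rebuilt from the full tree at version 4,680 plus 6 deltas, at the price of keeping 469 full trees instead of one.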

What is a Delta?
I've been throwing around this concept of deltas, but I haven't stopped to describe them.

A tree is a hierarchy of folders and files. A delta is the difference between two trees. In theory, those two trees do not need to be related. However, in practice, the only reason we calculate the difference between them is because one of them is derived from the other. Some developer started with
