Source Control HOWTO: Repositories


tree N and made one or more changes, resulting in tree N+1.

We can think of the delta as a set of changes. In fact, many SCM tools use the term "changeset" for exactly this purpose. A changeset is merely a list of the changes which express the difference between two trees.

For example, let's suppose that Wilbur starts with tree N and makes the following changes:

  1. He deletes $/top/subfolder/foo.c because it is no longer needed.
  2. He edits $/top/subfolder/Makefile to remove foo.c from the list of filenames.
  3. He edits $/top/bar.c to remove all the calls to the functions in foo.c.
  4. He renames $/top/hello.c and gives it the new name hola.c.
  5. He adds a new file called feature_creep.c to $/top/.
  6. He edits $/top/Makefile to add feature_creep.c to the list of filenames.
  7. He moves $/top/subfolder/readme.txt into $/top.

At this point, he commits all of these changes to the repository as a single transaction. When the SCM server stores this delta, it must remember all of these changes.
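As a rough illustration, Wilbur's commit could be written down as a plain list of change records. This is only a sketch: the Change class, its field names, and the kind strings are hypothetical and are not taken from any particular SCM tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    """One entry in a changeset (hypothetical record layout)."""
    kind: str                        # "delete", "edit", "rename", "add" or "move"
    path: str                        # path in tree N, or the new path for an "add"
    new_path: Optional[str] = None   # only used by "rename" and "move"


# Wilbur's seven changes, committed together as one changeset.
changeset = [
    Change("delete", "$/top/subfolder/foo.c"),
    Change("edit", "$/top/subfolder/Makefile"),
    Change("edit", "$/top/bar.c"),
    Change("rename", "$/top/hello.c", "$/top/hola.c"),
    Change("add", "$/top/feature_creep.c"),
    Change("edit", "$/top/Makefile"),
    Change("move", "$/top/subfolder/readme.txt", "$/top/readme.txt"),
]


def summarize(changes):
    """Count how many changes of each kind the changeset contains."""
    counts = {}
    for change in changes:
        counts[change.kind] = counts.get(change.kind, 0) + 1
    return counts
```

Calling summarize(changeset) shows three edits and one each of delete, rename, add and move. A real changeset would also carry the file contents or file deltas for the added and edited items.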

For changeset item 1 above, the deletion of foo.c is easily represented. We simply remember that foo.c existed in tree N but does not exist in tree N+1.

For changeset item 4, the rename of hello.c is a bit more complex. To handle renames, we need each object in the repository to have an identifier which never changes, even when the name or location of the item changes.

For changeset item 7, the move of readme.txt is another example of why repositories need IDs for each item. If we simply remember every item by its path, we cannot remember the occasions when that path changes.
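To see why permanent IDs help, here is a minimal sketch which assumes each tree is stored as a mapping from item ID to path. The function name diff_trees, the integer IDs, and the tuple layout are all invented for illustration. Because the ID stays the same, a rename or a move shows up as the same item with a different path, rather than as an unrelated delete and add.

```python
import posixpath


def diff_trees(old, new):
    """Compare two trees given as {item_id: path} and list what changed."""
    changes = []
    for item_id, old_path in old.items():
        if item_id not in new:
            changes.append(("delete", old_path, None))
            continue
        new_path = new[item_id]
        if new_path == old_path:
            continue  # same ID, same path: nothing to record here
        if posixpath.dirname(new_path) == posixpath.dirname(old_path):
            changes.append(("rename", old_path, new_path))
        else:
            # A different folder counts as a move in this sketch,
            # even if the file name changed as well.
            changes.append(("move", old_path, new_path))
    for item_id, new_path in new.items():
        if item_id not in old:
            changes.append(("add", None, new_path))
    return changes


# Hypothetical IDs for a few of the items in Wilbur's example.
tree_n = {
    1: "$/top/subfolder/foo.c",
    2: "$/top/hello.c",
    3: "$/top/subfolder/readme.txt",
}
tree_n_plus_1 = {
    2: "$/top/hola.c",
    3: "$/top/readme.txt",
    4: "$/top/feature_creep.c",
}
```

If the trees were keyed by path instead, hello.c and hola.c would look like two unrelated files, and the history of the file would be cut in two at the rename.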

Changeset item 5 is going to be a lot bulkier than some of the other items here. For this item we need to remember that tree N+1 has a file called feature_creep.c which was never present in tree N. However, a full representation of this changeset item needs to contain the entire contents of that file.

Changeset items 2, 3 and 6 represent situations where a file which already existed has been modified in some way. We could handle these items the same way as item 5, by storing the entire contents of the new version of the file. However, we will be happier if we can do deltas at the file level just as we are doing deltas at the tree level.

File Deltas
A file delta merely expresses the difference between two files. Once again, the reason we calculate a file delta is that we expect it to be smaller than the file itself, usually because one of the files is derived from the other.

For text files, a well-known approach to the file delta problem is to compare the two files line by line and output a list of the lines which have been modified, inserted or deleted. This is the same kind of result produced by the Unix 'diff' command. The bad news is that this approach only works for text files. The good news is that software developers and web developers have a lot of text files.
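A line-oriented delta can be sketched with Python's standard difflib module. The helper names line_delta and apply_line_delta and the tuple format are assumptions made for this example; this is not the storage format of any real SCM tool. The delta keeps only the regions which differ, and applying it to the old text reproduces the new text.

```python
import difflib


def line_delta(old_text, new_text):
    """Return the non-matching regions as (tag, start, end, new_lines) tuples."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Lines old_lines[i1:i2] are replaced by new_lines[j1:j2].
            delta.append((tag, i1, i2, new_lines[j1:j2]))
    return delta


def apply_line_delta(old_text, delta):
    """Rebuild the new text from the old text plus a delta from line_delta."""
    old_lines = old_text.splitlines()
    result = []
    position = 0
    for _tag, start, end, replacement in delta:
        result.extend(old_lines[position:start])  # copy the unchanged lines
        result.extend(replacement)                # then the changed region
        position = end
    result.extend(old_lines[position:])
    # Note: this sketch does not preserve a trailing newline.
    return "\n".join(result)


old_list = "bar.c\nfoo.c\nhello.c"
new_list = "bar.c\nhello.c"
delta = line_delta(old_list, new_list)
```

Here the delta holds a single entry recording that one line was deleted, which is much less to store than a full copy of the new file once files grow beyond a few lines.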

CVS and Perforce use this approach for repository storage. Text files are deltified using a line-oriented diff. Binary files are not deltified at all, although Perforce does reduce the penalty somewhat by compressing them.
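The general policy described here (line deltas for text, compression for binary) might be sketched as follows. This is not how CVS or Perforce is actually implemented: the function names, the text-detection heuristic, and the returned labels are all assumptions chosen for the example.

```python
import difflib
import zlib


def looks_like_text(data: bytes) -> bool:
    """Crude heuristic (an assumption): no NUL bytes and valid UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def store_new_version(previous: bytes, current: bytes):
    """Return (storage_kind, payload) for a new version of a file."""
    if looks_like_text(previous) and looks_like_text(current):
        # Text file: keep only a line-oriented diff against the previous version.
        diff = difflib.unified_diff(
            previous.decode("utf-8").splitlines(keepends=True),
            current.decode("utf-8").splitlines(keepends=True),
        )
        return ("line-delta", "".join(diff).encode("utf-8"))
    # Binary file: no delta at all, just a compressed full copy.
    return ("compressed-full", zlib.compress(current))
```

With this policy, every new version of a binary file costs a full (if compressed) copy, which is the penalty a byte-oriented delta is meant to avoid.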

Subversion and Vault are examples of tools which use binary file deltas for repository storage. Vault uses a file delta algorithm called VCDiff, as described in RFC 3284. This algorithm is byte-oriented, not line-oriented. It outputs a list of byte ranges which have been changed. This means it can handle any kind of file, binary or
