Source Control HOWTO: Repositories


can store anything you want in a repository, that doesn't mean you should. The best practice here is to store everything which is necessary to do a build, and nothing else. I call this "the canonical stuff."

To put this another way, I recommend that you do not store any file which is automatically generated. Checkin your hand-edited source code. Don't checkin EXEs and DLLs. If you use a code generation tool, checkin the input file, not the generated code file. If you generate your product documentation in several different formats, checkin the original format, the one that you manually edit.

If you have two files, one of which is automatically generated from the other, then you just don't need to checkin both of them. You would in effect be managing two expressions of the same thing. If one of them gets out of sync with the other, then you have a problem.

If you will be storing a lot of binary files, it is helpful to know how your SCM tool handles them. A tool which uses binary deltas in the repository may be a better choice.

If all of your files are binary, you may want to explore other solutions. Tools like Vault and Subversion were designed for programmers. These products contain features designed specifically for use with source code, including diff and automerge. You can use these systems to store all of your Excel spreadsheets, but they are probably not the best tool for the job. Consider exploring "document management" systems instead.

How is the Repository Itself Stored?
We need to descend through one more layer of abstraction before we turn our attention back to more practical matters. So far I have been talking about how things are stored and managed within a repository, but I have not broached the subject of how the repository itself is stored.

A repository must store every version of every file. It must remember the hierarchy of files and folders for every version of the tree. It must remember metadata, information about every file and folder. It must remember checkin comments, explanations provided by the developer for each checkin. For large trees and trees with very many revisions, this can be a lot of data that needs to be managed efficiently and reliably. There are several different ways of approaching the problem.

RCS kept one archive file for every file being managed. If your file was called "foo.c" then the archive file was called "foo.c,v." Usually these archive files were kept in a subdirectory of the working directory, just one level down. RCS files were plain text, you could just look at them with any editor. Inside the file you would find a bunch of metadata and a full copy of the latest version of the file, plus a series of line-oriented file deltas, one for each previous version. (Please forgive me for speaking of RCS in the past tense. Despite all the fond memories, that particular phase of my life is over.)

CVS uses a similar design, albeit with a lot more capabilities. A CVS repository is distinct, completely separate from the working directory, but it still uses ",v" files just like RCS. The directory structure of a CVS repository contains some additional metadata.

When managing larger and larger source trees, it becomes clear that the storage challenges of a repository are exactly the same as the storage challenges of a database. For this reason, many SCM tools use an actual database as the backend data store. Subversion uses Berkeley DB. Vault uses SQL Server 2000. The benefit of

About the author

AgileConnection is a TechWell community.

Through conferences, training, consulting, and online resources, TechWell helps you develop and deliver great software every day.