2015-06-07

Converting a CVS repository to git

With the recent loss of trust in SourceForge as a reputable open-source hosting platform, I decided to see how difficult it would be to move the CVS repository for AdPlug off SourceForge, and over to GitHub. As it turned out, the process is relatively straightforward once you work out what needs to be done.

For this process I used cvs2git, which is part of the cvs2svn project. There are a handful of alternative tools available, but this one seems to be the most widely recommended.

Preparation

cvs2svn ships with a sample config, so you should have a file called cvs2git-example.options somewhere on your system. On Arch Linux it is located at /usr/share/doc/cvs2svn/examples/cvs2git-example.options. Make a copy of this file so you can edit it for the conversion process. The conversion script later in this article expects the copy to be called cvs2git.options and to sit in the working directory.

Obtain server-side CVS

Firstly, you will need a copy of the CVS repository itself. Sadly, just checking out a working copy isn't enough: CVS is centralised, so all the history is stored on the server. If you don't have access to the server, you may still be able to perform a conversion, but it will have to be with a tool other than cvs2git.

Luckily for me, and to their credit, SourceForge offer rsync access to the CVS server for all project contributors. The documentation for this is somewhat well hidden, but a command like this is all you need:

rsync -av rsync://PROJECTNAME.cvs.sourceforge.net/cvsroot/PROJECTNAME/* .

Using the rsync:// protocol is mandatory (trying to use SSH returns a 'restricted shell' error).

In the cvs2git config file, change the first parameter of the set_project call to the path of the files retrieved by rsync. The correct folder is the one that has a subdirectory called CVSROOT inside it.
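
Before pointing set_project at a directory, it can be worth checking that the rsync copy really looks like a server-side repository. The following is only a rough sketch: the function name check_cvs_repo is made up for this example, and the second check assumes the usual CVS layout in which history is kept in RCS files whose names end in ",v".

```shell
#!/bin/sh
# Hypothetical helper: sanity-check a directory before using it in set_project.
# Returns 0 if the directory contains CVSROOT and at least one ",v" file.
check_cvs_repo() {
  dir=$1
  if [ ! -d "$dir/CVSROOT" ]; then
    echo "no CVSROOT directory in $dir" >&2
    return 1
  fi
  # Assumption: server-side history lives in files whose names end in ",v".
  if [ -z "$(find "$dir" -type f -name '*,v' | head -n 1)" ]; then
    echo "no ,v history files found in $dir" >&2
    return 1
  fi
  echo "$dir looks like a server-side CVS repository"
}
```

Running check_cvs_repo . in the directory where rsync was run should print the confirmation line. If it complains, the path is probably one level too high or too low.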

Map CVS users

CVS records each commit with a username, while git uses a name and e-mail address. Part of the conversion process involves converting each CVS username to the git equivalent, so to do this successfully you'll need both a list of CVS usernames, as well as a matching name and e-mail address to use with git.

You can get a list of all CVS usernames in a checked-out repository with a command like this:

cvs log | sed -n 's/^date: [^;]*; *author: \([^;]*\);.*/\1/p' | sort -u > authors.txt

This will give you a list of CVS usernames, one per line. You will however have to manually track down a name and e-mail address for each one! Once you have done so, update the cvs2git config file by finding the author_transforms section and copying the examples there, providing a mapping for each CVS username.

Make sure you don't omit anyone. If you don't know the e-mail address for someone, you'll have to make something up, like unknown@example.com. Otherwise, later git operations will fail with the error "Patch does not have a valid e-mail address."
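
To save some typing, the list of usernames can be turned into a skeleton for the author_transforms section. This is just a sketch: the function name make_author_skeleton is made up, and the entry layout (a username mapped to a name and e-mail pair) is my reading of the examples in the sample config, so compare the output against your own copy before pasting it in. The names and addresses it prints are placeholders that still need to be filled in by hand.

```shell
#!/bin/sh
# Hypothetical helper: print one author_transforms entry per CVS username.
# $1 is a file with one username per line (e.g. authors.txt).
# The name and e-mail are placeholders that must be edited by hand.
make_author_skeleton() {
  while IFS= read -r user; do
    # Skip blank lines.
    [ -n "$user" ] || continue
    printf "    '%s' : ('FIXME %s', '%s@example.com'),\n" "$user" "$user" "$user"
  done < "$1"
}
```

Something like make_author_skeleton authors.txt > authors-skeleton.txt gives you a block to edit and then copy into the config file.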

Symbol transforms

You may not need to modify any symbols. However, since git allows dots in tag names (and CVS doesn't), I wanted to change the tag names from their CVS format (e.g. "adplug-2-1") to a more readable alternative (like "adplug-2.1"). For this I uncommented the example RegexpSymbolTransform and modified it slightly to perform the required conversion:

RegexpSymbolTransform(r'([^-]+)-(\d+)-(\d+)',
                      r'\1-\2.\3'),

I ended up with four entries similar to the above to convert all the tags correctly.
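
Since each conversion run takes a while, it can be handy to try a rename pattern on a few tag names first. The sketch below approximates the transform above with sed. It is not what cvs2git runs: sed has no \d, so [0-9] is used instead, and the pattern is anchored on the assumption that cvs2git only rewrites a name when the whole name matches. The function name preview_tag_rename and the second and third sample names are made up for illustration.

```shell
#!/bin/sh
# Preview how CVS-style tag names would be rewritten.
# sed -E has no \d, so [0-9] stands in for it. The ^ and $ anchors mean
# only names that match in full are changed (an assumption about cvs2git).
preview_tag_rename() {
  sed -E 's/^([^-]+)-([0-9]+)-([0-9]+)$/\1-\2.\3/'
}

# Sample names: the first is from the article, the others are made up.
printf '%s\n' adplug-2-1 adplug-2-0-patches release | preview_tag_rename
```

Names that don't fit the pattern come out unchanged, which makes it easy to spot the ones that need another transform entry.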

Perform the conversion

The conversion happens in two steps. First the CVS repository is converted into git's "fast-import" format, and then the normal git command-line tool is used to import that to a newly created repository.

This script will perform the conversion and save it into the directory called output. Running it multiple times will erase the output folder and perform the conversion again, which is useful when working on symbol conversion regexes.

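#!/bin/sh
# Stop at the first failing command.
set -e

# Start from a clean slate on every run.
rm -rf output
# Step 1: write the fast-import files (into cvs2svn-tmp with the sample config).
cvs2git --options=cvs2git.options
# Step 2: import them into a new git repository and check out the files.
git init output
cd output
cat ../cvs2svn-tmp/* | git fast-import && git reset --hard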

Once this is done, you can examine the git repository and confirm everything is as it should be. If there are any mistakes, update the cvs2git config file and rerun the conversion.

It's a good idea to double-check you haven't missed any authors during the conversion, with a command like the following:

git log --all | grep ^Author: | cut -d' ' -f 2- | sort -u

Any authors missing an e-mail address should be added to the cvs2git config file and the conversion run again.
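
With a long author list it is easy to overlook one bad entry, so the check can be narrowed to just the suspicious ones. This is a sketch under an assumption: that an author with no mapping ends up with something other than a real address between the angle brackets, so anything without an "@" is worth a look. The function name find_unmapped_authors is made up.

```shell
#!/bin/sh
# Hypothetical helper: read "Name <email>" lines on stdin and print the ones
# whose address part contains no "@", which suggests a missing mapping.
find_unmapped_authors() {
  # grep exits non-zero when nothing is printed; treat that as success.
  grep -v '<[^<>]*@[^<>]*>' || true
}
```

It can be fed from git with git log --all --format='%an <%ae>' | sort -u | find_unmapped_authors. No output means every author has something that at least looks like an e-mail address.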

Splitting modules

In the case of AdPlug, the CVS tree contained a number of related but independent codebases. There are modules for the core library, the DOS player, the website, and so on. These end up as folders in the root of a single git tree, which in my case is not what I wanted - I wanted each project to be a separate repository.

There were also some branches that were imported from CVS. Because each of these branches only contained one module (directory), cvs2git was forced to create some fake commits at the split point (removing all the other modules' files) and again at the merge point (restoring the other modules' files). This was quite messy, but as it turned out, splitting off each module's folder into its own repository made these fake commits go away, and the branches were split and merged correctly.

The git subtree command offers the ability to extract a folder from one repository into an entirely new, independent one, only including the commits that affect the folder in question. The way it works is that it copies the relevant commits (or part thereof) and places them in a new branch. You can then either delete the old branches or pull the new branch into an empty repository, whichever is easier.

The command to split the folder called "adplug" into a branch called "split-master" is this:

git checkout master
git subtree split -P adplug -b split-master

This should be repeated for all branches that contain commits for the module being split off, keeping the module name the same, but supplying a new branch, for example:

git checkout adplug-2.0-patches
git subtree split -P adplug -b split-adplug-2.0-patches
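
With more than a couple of branches, typing these pairs of commands gets tedious. The sketch below only prints the commands for a given module and list of branches, so they can be read over before anything is run. The function name print_split_commands is made up, and the split- prefix simply follows the naming used above.

```shell
#!/bin/sh
# Hypothetical helper: print the checkout/split commands for one module
# across several branches, so they can be reviewed before being run.
# $1 is the module folder; the remaining arguments are branch names.
print_split_commands() {
  module=$1
  shift
  for branch in "$@"; do
    printf 'git checkout %s\n' "$branch"
    printf 'git subtree split -P %s -b split-%s\n' "$module" "$branch"
  done
}

print_split_commands adplug master adplug-2.0-patches
```

Once the printed commands look right, they can be pasted into the terminal or piped into sh from inside the converted repository.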

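Once this is done, the original branches can be deleted and the new branches renamed to take their place. Switch to one of the new branches first, because git will not delete the branch that is currently checked out, and use -D rather than -d, because the old branches are not merged into the new ones:

git checkout split-master

git branch -D adplug-2.0-patches
git branch -m split-adplug-2.0-patches adplug-2.0-patches

git branch -D master
git branch -m split-master master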

At this point it is probably a good idea to fire up gitk and have a look at how the branches connect (go to the View menu, choose New View, tick All (local) branches, then click Apply).

Post-split tags

After splitting a module, any tags in the git repository will no longer be valid. They'll point to the old commits before the split, which no longer exist in the new branches. To preserve the tags, they will need to be recreated.

The following script will recreate all valid tags. It works by looking at the commit date of each tag, then trying to find a commit on the new master branch made at the same date and time, or up to two seconds earlier. If it finds a match, the tag is pointed at the new commit; otherwise the tag is removed.

There's a chance that this may drop tags that apply to multiple CVS modules at the same time: if module A was tagged but that commit doesn't exist in module B, then the tag will have nowhere to point to in module B and will be dropped. I'm not sure if this is even possible, but if it is, the only solution will be to find a relevant commit and recreate the tag by hand.

Luckily in the case of AdPlug, each tag was made in the module itself so the script updated all the tags correctly.

#!/bin/sh
# Requires GNU date (for -R and -d).

for TAG in $(git tag); do
  printf '\n>> Tag: %s\n' "$TAG"
  # Date of the old commit the tag currently points to.
  TAG_DATE=$(git rev-list -1 --pretty="format:%aD" "$TAG" | tail -1)
  # Look for a commit on master in the two seconds up to that date.
  DATE_AFTER=$(date -R -d "$TAG_DATE -2sec")
  NEW_COMMIT=$(git rev-list -1 --before="$TAG_DATE" --after="$DATE_AFTER" master)
  git rev-list -1 --pretty="format:%aD %s" "$TAG"
  git tag -d "$TAG"
  if [ -n "$NEW_COMMIT" ]; then
    git tag "$TAG" "$NEW_COMMIT"
    echo ">> New tag $TAG = $NEW_COMMIT"
    git rev-list -1 --pretty="format:%aD %s" "$TAG"
  fi
done

You should check the output of this script to ensure it has picked the correct commits to point the tags to. You will see output like this:

>> Tag: linux-1.2
commit 3b5cbef3d699dc285a9e4bd40be7a38aabf7c05a
Wed, 9 Oct 2002 17:11:56 +0000 This commit was manufactured by cvs2svn to create tag 'linux-1.2'.
Deleted tag 'linux-1.2' (was 3b5cbef)
>> New tag linux-1.2 = 699bac02880247aabfd500fd4115d1593af5fca9
commit 699bac02880247aabfd500fd4115d1593af5fca9
Wed, 9 Oct 2002 17:11:55 +0000 AdPlay/UNIX release candidate #1

Here you can see that the first date and time closely match the second. They would be identical, except that the first commit is a fake one created by cvs2git, which is always one second newer than the real commit, so the one-second difference is fine. The commit message on the last line also makes sense, as it looks like the right place to put a tag.
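
The matching rule itself is simple enough to try out on plain text, without touching the repository. The sketch below is not part of the script above and does not call git. It takes commit timestamps as Unix time and picks the newest commit in the two seconds up to and including the tag's time. Treating both ends of that window as inclusive is my assumption, and the name find_commit_near is made up.

```shell
#!/bin/sh
# Sketch of the matching rule only; it does not call git.
# stdin: lines of "<unix-timestamp> <commit-id>" for commits on the new branch.
# $1:    unix timestamp of the old tagged commit.
# Prints the newest commit id whose timestamp lies within the two seconds
# up to and including the tag's timestamp, or nothing if there is none.
find_commit_near() {
  awk -v t="$1" '
    $1 <= t && $1 >= t - 2 && (best == "" || $1 > best_ts) {
      best = $2; best_ts = $1
    }
    END { if (best != "") print best }
  '
}
```

To try it against real data, something like git log --format='%ct %H' master | find_commit_near "$(git log -1 --format=%ct "$TAG")" should print the commit the tag would move to, or nothing if the tag would be dropped.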

After you have updated the tags, it might be a good idea to run gitk again to confirm that the tags appear to be in the correct place, and to remove any tags that were converted but are no longer relevant (e.g. they are for a different module.)

Finalise

At this point the git repo should be ready to publish. You might want to remove any leftover or unused tags and branches, and do one last check of the history to make sure there are no cvs2git/cvs2svn synthetic commits left behind. If you will continue to use the local repository, it might be a good idea to run git gc to clean up the discarded commits. This doesn't affect the repository or its history; it just frees up some disk space. Note that git gc only removes discarded commits once they are no longer referenced by the reflog and have passed the pruning grace period, so the space may not be freed straight away.
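
For the last check of the history, the synthetic commits can be picked out by their commit message. This is a rough sketch: it assumes they all contain the phrase "manufactured by cvs2svn", as in the sample output earlier, and the function name find_synthetic_commits is made up. Other kinds of generated commit, if there are any, would need their own pattern.

```shell
#!/bin/sh
# Hypothetical helper: read "<id> <subject>" lines on stdin and print those
# whose subject looks like a commit generated by the conversion tool.
find_synthetic_commits() {
  # grep exits non-zero when nothing matches; treat that as success.
  grep 'manufactured by cvs2svn' || true
}
```

Running git log --all --format='%h %s' | find_synthetic_commits in the finished repository should ideally print nothing.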

All that's left is to publish the shiny new git version of the CVS repo!
