Migrating an SVN repo with an inconsistent structure to Git.
Recently at work we migrated our code repository from Subversion to Git. You can find numerous guides all over the internet to do this, but none of them covered migrating a repo where the conventional directory structure (trunk, branches, tags) had been introduced halfway into the project. So, I figured it out and wrote a script to do it, and then decided I'd write a walkthrough so that you can follow along if you ever need to do this. Skip down to the script or continue reading...
The repository
To keep things simple, I'm going to be working against a sample SVN repository. Let's set that up now:
$ sudo svnadmin create /var/svn/svn2git-test
Here are the first few commits:
$ svn co file:///var/svn/svn2git-test svn
Checked out revision 0.
$ cd svn
$ echo "Lorem ipsum dolor sit amet" > test.txt
$ svn add test.txt
A test.txt
$ svn commit -m "Initial commit"
Adding test.txt
Transmitting file data .
Committed revision 1.
$ echo "Here's another line we added in rev 2" >> test.txt
$ svn commit -m "Update test.txt"
Sending test.txt
Transmitting file data .
Committed revision 2.
$ echo "And another line we added in rev 3" >> test.txt
$ svn commit -m "Update test.txt, again"
Sending test.txt
Transmitting file data .
Committed revision 3.
Up to now we haven't had a conventional structure; we've just been putting everything in the root. But now we want to make a branch, so let's change that:
$ svn mkdir trunk branches tags
A trunk
A branches
A tags
$ svn commit -m "Add trunk, branches, tags"
Adding branches
Adding tags
Adding trunk
Committed revision 4.
$ svn mv test.txt trunk
A trunk/test.txt
D test.txt
$ svn commit -m "Move everything to trunk"
Deleting test.txt
Adding trunk/test.txt
Committed revision 5.
Before we make the branch, though, we decide to commit something to trunk:
$ echo "Add another line from trunk, rev 6" >> trunk/test.txt
$ svn commit -m "Update test.txt from trunk"
Sending trunk/test.txt
Transmitting file data .
Committed revision 6.
We also need to make sure the revision the trunk folder is associated with is up to date, otherwise our branch will appear in the wrong place when we convert it:
$ svn up
At revision 6.
And now we make the branch:
$ svn copy trunk branches/foo
A branches/foo
$ svn commit -m "Make 'foo' branch"
Adding branches/foo
Adding branches/foo/test.txt
Committed revision 7.
make some changes in the branch:
$ echo "Add another line from foo, rev 8" >> branches/foo/test.txt
$ svn commit -m "Update test.txt from foo"
Sending branches/foo/test.txt
Transmitting file data .
Committed revision 8.
and go back and make some changes to trunk:
$ echo "Add another line from trunk, rev 9" >> trunk/test.txt
$ svn commit -m "Update test.txt finally, from trunk"
Sending trunk/test.txt
Transmitting file data .
Committed revision 9.
If you're like me, all you want to see is a picture, so here's a graph of the commit history:
The problem
At some point down the road, we decide that we want to convert this repository to Git. Now most guides these days will tell you to use svn2git. I agree, it's a great little script that takes care of details you don't want to have to worry about to ensure that your repo is converted properly.
However, if we try to use it on the sample repo we created above, we will quickly find that we don't have all of the commits. First let's do the conversion. Note that my version of Git doesn't let you pass a file:// URL to git-svn so we have to use svnserve so we can refer to the repository using svn://:
$ svnserve -d
$ mkdir git1 && cd git1
$ svn2git svn://localhost/var/svn/svn2git-test
Found possible branch point: svn://localhost/var/svn/svn2git-test/trunk => svn://localhost/var/svn/svn2git-test/branches/foo, 4
Found branch parent: (refs/remotes/foo) 2b09a1ad55e4b0b1bf826d3718821f9b065d198c
Following parent with do_switch
Successfully followed parent
Checked out HEAD:
svn://localhost/var/svn/svn2git-test/trunk r9
Note: moving to 'foo' which isn't a local branch
If you want to create a new branch from this checkout, you may do so
(now or later) by using -b with the checkout command again. Example:
git checkout -b <new_branch_name>
HEAD is now at 42679bc... Update test.txt from foo
Switched to a new branch 'foo'
Note: moving to 'trunk' which isn't a local branch
If you want to create a new branch from this checkout, you may do so
(now or later) by using -b with the checkout command again. Example:
git checkout -b <new_branch_name>
HEAD is now at dadf507... Update test.txt finally, from trunk
Switched to a new branch 'master'
Counting objects: 15, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (15/15), done.
Total 15 (delta 4), reused 0 (delta 0)
Now let's open the repository in GitX to see what we've got:
As you can see, the history stops at the point that we split the structure, because git-svn assumes the structure has been the same from the beginning.
The research
I spent a good two days just trying to figure out where to start. A lot of the different things I read involved combining three pieces of history into one repository using grafting and merging, assembling history using git-archive and git_load_dirs, or converting an inconsistent trunk using svn2git's --rootistrunk option, a temporary branch, and git format-patch. I also found plenty of solutions for merging one Git repository into another one as a subtree, but that wasn't really what I wanted to do.
However, what ended up being the best resource was a post by this guy, David Wheeler, who'd taken a CVS repo and an SVN repo, converted both of them to Git, and then mashed them together into one repository. There's a lot of good information on some of the problems he faced, how he solved them, and a few scripts that he wrote to automate the whole process.
Putting aside svn2git for the moment, I knew that I could use git svn init and git svn fetch -r REV1:REV2 to make two copies of the SVN repository, one repository that contained the first half before the split to trunk/branches/tags, and another that contained the last half. The trick was figuring out how to merge these halves into one continuous Git repository.
So, armed with new information, I did a couple of experiments on a sample repository (similar to the one I presented earlier). First I tried using git fetch and git rebase:
$ mkdir git.pre && cd git.pre
$ git svn init svn://localhost/var/svn/svn2git-test
Initialized empty Git repository in /tmp/svn2git/migration_with_rebase/git.pre/.git/
$ git svn fetch -r 1:3
A test.txt
r1 = 0a500d79a303cd0a0153e2208d42e825d9f90504 (refs/remotes/git-svn)
M test.txt
r2 = 00bfdbf380411ad86ed274aef02e9a28528e0211 (refs/remotes/git-svn)
M test.txt
r3 = 6b79e39d6e19256fdbb0a95210570972b349c30b (refs/remotes/git-svn)
Checked out HEAD:
svn://localhost/var/svn/svn2git-test r3
$ cd ..
$ mkdir git.post && cd git.post
$ git svn init svn://localhost/var/svn/svn2git-test -s
Initialized empty Git repository in /tmp/svn2git/migration_with_rebase/git.post/.git/
$ git svn fetch -r 5:HEAD
A test.txt
r5 = f8a9998b0bd00144c28a82d74a3099d719ef33f4 (refs/remotes/trunk)
M test.txt
r6 = a3a5c0896cac7cd8b4d26b48855a7314e4d30d1f (refs/remotes/trunk)
Found possible branch point: svn://localhost/var/svn/svn2git-test/trunk => svn://localhost/var/svn/svn2git-test/branches/foo, 4
A test.txt
r7 = 005cc439b5b81c9fcc2ab26ed8d8c60c11993061 (refs/remotes/foo)
M test.txt
r8 = 83904e2fd2afae9db7af372c2b3154c33c0a58e7 (refs/remotes/foo)
M test.txt
r9 = 28d31c35c2696fb477b2c27bded270a14b3dce56 (refs/remotes/trunk)
Checked out HEAD:
svn://localhost/var/svn/svn2git-test/trunk r9
$ git fetch ../git.pre master:pre
warning: no common commits
remote: Counting objects: 9, done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 9 (delta 2), reused 0 (delta 0)
Unpacking objects: 100% (9/9), done.
From ../git.pre
* [new branch] master -> pre
$ git branch
* master
pre
$ git branch -r
foo
trunk
$ git rebase pre
First, rewinding head to replay your work on top of it...
Applying: Move everything to trunk
Using index info to reconstruct a base tree...
Falling back to patching base and 3-way merge...
No changes -- Patch already applied.
Applying: Update test.txt from trunk
Applying: Update test.txt finally, from trunk
Basically, I'm attempting to copy all the commits in the first half to the last half, then replay the commits in the last half on top of the ones in the first half. However, as you can see, something was off:
If I'd copied the foo branch to a local branch it might have worked, but I didn't know how to do that yet.
I also tried something similar:
$ mkdir git.post && cd git.post
$ svn2git svn://localhost/var/svn/svn2git-test
Found possible branch point: svn://localhost/var/svn/svn2git-test/trunk => svn://localhost/var/svn/svn2git-test/branches/foo, 4
Found branch parent: (refs/remotes/foo) 2b09a1ad55e4b0b1bf826d3718821f9b065d198c
Following parent with do_switch
Successfully followed parent
Checked out HEAD:
svn://localhost/var/svn/svn2git-test/trunk r9
Note: moving to 'foo' which isn't a local branch
If you want to create a new branch from this checkout, you may do so
(now or later) by using -b with the checkout command again. Example:
git checkout -b <new_branch_name>
HEAD is now at 42679bc... Update test.txt from foo
Switched to a new branch 'foo'
Note: moving to 'trunk' which isn't a local branch
If you want to create a new branch from this checkout, you may do so
(now or later) by using -b with the checkout command again. Example:
git checkout -b <new_branch_name>
HEAD is now at dadf507... Update test.txt finally, from trunk
Switched to a new branch 'master'
Counting objects: 15, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (15/15), done.
Total 15 (delta 4), reused 0 (delta 0)
$ cd ..
$ mkdir git.merged && cd git.merged
$ git svn init svn://localhost/var/svn/svn2git-test
Initialized empty Git repository in /tmp/svn2git/migration_with_fetch/git.merged/.git/
$ git svn fetch -r 1:3
A test.txt
r1 = 0a500d79a303cd0a0153e2208d42e825d9f90504 (refs/remotes/git-svn)
M test.txt
r2 = 00bfdbf380411ad86ed274aef02e9a28528e0211 (refs/remotes/git-svn)
M test.txt
r3 = 6b79e39d6e19256fdbb0a95210570972b349c30b (refs/remotes/git-svn)
Checked out HEAD:
svn://localhost/var/svn/svn2git-test r3
$ cp ../git.post/.git/config ./.git/config # make sure fetch knows where to put trunk, branches, and tags
$ git svn fetch -r 4:HEAD
r4 = 2b09a1ad55e4b0b1bf826d3718821f9b065d198c (refs/remotes/trunk)
A test.txt
r5 = 020e7ff72c87b3711486bbffa4230b2fd9ccc266 (refs/remotes/trunk)
M test.txt
r6 = 80fdfdd4fc0f69f79f10c114219da4aae97aabce (refs/remotes/trunk)
Found possible branch point: svn://localhost/var/svn/svn2git-test/trunk => svn://localhost/var/svn/svn2git-test/branches/foo, 4
Found branch parent: (refs/remotes/foo) 2b09a1ad55e4b0b1bf826d3718821f9b065d198c
Following parent with do_switch
A test.txt
Successfully followed parent
r7 = 56356325a446aa2539a6f78f0e11673972a8a8b6 (refs/remotes/foo)
M test.txt
r8 = 42679bc779f0842de4a83e99c9fa718df88e4cc6 (refs/remotes/foo)
M test.txt
r9 = dadf507389f0342456e8e5e7367f2aa862ec1b4a (refs/remotes/trunk)
However, this didn't work either, because what were formerly revisions 3 and 4 were now disconnected:
What I ended up doing in the end was taking the script David Wheeler wrote to stitch repositories together, running it piece by piece against my sample SVN repository and working out any problems. Once I had that working, I could then run it against the real SVN repository. Before I did this, however, because our repository is rather large, I first copied it to my computer using svnsync:
$ sudo svnsync init /var/svn/store svn+ssh://my@server.com/path/to/real/repo
$ sudo svnsync sync /var/svn/store
I should say that GitX was invaluable in confirming that the new Git repo was complete and that all the history was intact.
The breakdown
I'll have the script that I ended up with at the very end, but because you might like to see how it works, I'm going to break it apart for you, using the sample SVN repo I gave at the very beginning. Note that I'm simplifying what the script does using shell script, but the final script is in Perl.
Here's what we'll be doing in a nutshell:
- Run svn2git for both halves of the SVN repo (pre-repo and post-repo)
- Copy commits from post-repo to pre-repo
- Copy remote branches to local ones on pre-repo
- Graft pre and post together
- Copy remote branches to local ones on the final repo
- Relocate the master branch
- Clean up (add remote, etc.)
One more thing. I ended up going with the Perl port of svn2git and then modifying it, converting it to an object-oriented version so I could require it in my script and adding a few things like a --revision option. I'll have this at the end too.
One
The first thing we do is run svn2git twice:
$ svn2git svn://localhost/var/svn/svn2git-test git.pre --root-is-trunk -r 1:3 --authors /path/to/authors.txt
$ svn2git svn://localhost/var/svn/svn2git-test git.post --authors /path/to/authors.txt
This will create two Git repositories: the first repo has the part of the SVN repo before the split occurred, the other has the part afterward. I'll be referring to them as "pre-repo" and "post-repo" (or simply "pre" and "post").
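The -r 1:3 range above depends on knowing that the layout changed in revision 4. For a larger repository, one way to find that revision is to scan the verbose SVN log for the first revision that adds /trunk. The following is a minimal sketch and not part of the stitch script: the function names split_revision and pre_range are made up for illustration, and it assumes the new top-level directory is literally named /trunk.

```shell
# Hypothetical helper: read `svn log -v -q -r 1:HEAD` output on stdin and
# print the number of the first revision that adds /trunk.
split_revision() {
    awk '
        # Header lines look like "r4 | elliot | 2010-01-31 ..."
        $1 ~ /^r[0-9]+$/ && $2 == "|" { rev = substr($1, 2) }
        # Changed-path lines look like "   A /trunk"
        $1 == "A" && $2 == "/trunk"   { print rev; exit }
    '
}

# Hypothetical helper: given the split revision, print the revision range
# that covers everything before it.
pre_range() {
    echo "1:$(( $1 - 1 ))"
}

# Usage (assumes the sample repository served by svnserve):
#   split=$(svn log -v -q -r 1:HEAD svn://localhost/var/svn/svn2git-test | split_revision)
#   echo "pre-repo range: $(pre_range "$split")"
```

For the sample repository this should report revision 4 as the split, which gives the 1:3 range passed to svn2git above.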
At this point here's what the git repos look like:
Two
The next thing we want to do is copy commits (or, in Git parlance, objects) from post-repo to pre-repo. We do this using git fetch:
$ cd git.pre
$ git fetch ../git.post
warning: no common commits
remote: Counting objects: 11, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 11 (delta 3), reused 11 (delta 3)
Unpacking objects: 100% (11/11), done.
From /tmp/git.post
* branch HEAD -> FETCH_HEAD
Now, this is supposed to copy objects in local branches from post-repo to pre-repo. For whatever reason, it didn't do that for me. That command isn't totally useless, however. As Git has told you, FETCH_HEAD now points to HEAD in the post-repo, which happens to point to post-repo's master branch. This will prove to be valuable later. So let's save that in a temporary branch:
$ git checkout -b post-master FETCH_HEAD
Switched to a new branch 'post-master'
$ git checkout master
Switched to branch 'master'
Since the first git fetch didn't do anything, let's try copying objects in remote branches from post to pre. Basically we get a list of the remote branches from post-repo using git branch -r, and for each branch we say something like this:
git fetch ../git.post refs/remotes/$branch:refs/remotes/$branch
Let's do that for the sample repo:
$ git fetch ../git.post refs/remotes/foo:refs/remotes/foo
remote: Counting objects: 6, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 4 (delta 1), reused 2 (delta 0)
Unpacking objects: 100% (4/4), done.
From /tmp/git.post
* [new branch] foo -> foo
I can't find any place that explains quite what that colon syntax does better than man git-fetch does, but I think here we're telling Git, "Hey, be a doll and look at post-repo, find a remote branch called foo, and copy all the commits inside to the remote branch foo in pre-repo."
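The per-branch fetch described above can be wrapped in a small loop. This is a rough sketch in shell rather than the Perl of the final script; refspec_for and fetch_remote_branches are names invented here, and the sketch assumes branch names contain no spaces.

```shell
# Hypothetical helper: build the "source:destination" refspec for one
# remote branch. The part before the colon names the ref in the repository
# being fetched from; the part after it names the ref to create or update
# in the repository doing the fetching.
refspec_for() {
    echo "refs/remotes/$1:refs/remotes/$1"
}

# Hypothetical helper: fetch every remote branch of the repository at $1
# into the repository in the current directory.
fetch_remote_branches() {
    src=$1
    git --git-dir="$src/.git" branch -r |
        sed -e 's/^[ *]*//' -e '/ -> /d' |
        while read -r branch; do
            # </dev/null keeps git from consuming the loop's input
            git fetch "$src" "$(refspec_for "$branch")" </dev/null
        done
}

# Usage, from inside git.pre:
#   fetch_remote_branches ../git.post
```

The sed step only trims the indentation that git branch -r prints and drops any symbolic "HEAD -> ..." entry, so each remaining line is a plain branch name.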
At this point here's what the pre-repo looks like:
Three
If you take a closer look at the screenshot above, you will notice that master is colored orange and foo is colored blue. This is because foo is a remote branch and master is a local branch:
$ cd git.pre
$ git branch
* master
post-master
$ git branch -r
foo
So, we need to copy the remote branches over to local branches. We can do that by looping through the remote branches, stripping "origin/" from the branch names if it's present, and saying something like:
git branch --no-track $branch refs/remotes/$branch
Let's do that for the sample repo:
$ git checkout foo
Note: moving to 'foo' which isn't a local branch
If you want to create a new branch from this checkout, you may do so
(now or later) by using -b with the checkout command again. Example:
git checkout -b <new_branch_name>
HEAD is now at 23a450b... Update test.txt from foo
$ git checkout -b foo
At this point here's what pre-repo looks like:
As you can see, there's now a green "foo" next to the blue "foo", since foo is now a local branch. We can double-check this through the command line:
$ cd git.pre
$ git branch
foo
* master
post-master
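The loop described at the start of this step might look like the following sketch. local_name and copy_remote_branches are illustrative names, not taken from the final script, and the loop assumes none of the local branches exist yet.

```shell
# Hypothetical helper: strip a leading "origin/" from a remote branch
# name, if present.
local_name() {
    echo "${1#origin/}"
}

# Hypothetical helper: create a local branch for every remote branch in
# the current repository.
copy_remote_branches() {
    git branch -r | sed -e 's/^[ *]*//' -e '/ -> /d' |
        while read -r remote; do
            git branch --no-track "$(local_name "$remote")" "refs/remotes/$remote"
        done
}

# Usage, from inside git.pre:
#   copy_remote_branches
```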
Four
You might think that everything is okay with the repository, judging by the screenshot above. There's another problem with it, however. Do you see it? There isn't a line connecting "Update test.txt, again" and "Add trunk, branches, tags". This is because, in fact, they aren't connected. The first half is the master branch -- you can see that master points to the end of it. The second half, though, isn't on the master branch, it's just kind of floating around in the repository. (post-master points to it, but forget that's there.) We can confirm this with git log:
$ cd git.pre
$ git checkout master
Already on 'master'
$ git log
commit ae7596f38fefc8a53bf6f2018391a8c5ff3c0379
Author: elliot <elliot@445a800c-2d3f-4b84-95f0-4d0e8c338740>
Date: Sun Jan 31 03:02:26 2010 +0000
Update test.txt, again
commit bead1ec713a5361b0988cc71cbaf08b9c4a1462b
Author: elliot <elliot@445a800c-2d3f-4b84-95f0-4d0e8c338740>
Date: Sun Jan 31 03:02:25 2010 +0000
Update test.txt
commit e89a208a2b1aa2bf5b569a4d4e3fbfd23e138615
Author: elliot <elliot@445a800c-2d3f-4b84-95f0-4d0e8c338740>
Date: Sun Jan 31 03:02:24 2010 +0000
Initial commit
So, that's obviously not good. How do we fix this? We can graft the two histories together using a file that's special to Git, .git/info/grafts. Each line in the file specifies a connection that looks something like this:
<commit id> <parent id> [<parent id>]*
So if we were to take another look at our repo in GitX and click on the commit where post-repo starts, we would see that its commit id is d2874583232277a9e8e425a80e1670e9b1193013. If we click on the commit below, we would see that its commit id is ae7596f38fefc8a53bf6f2018391a8c5ff3c0379. So in order to create the connection between them, all we have to do is open .git/info/grafts and add this:
d2874583232277a9e8e425a80e1670e9b1193013 ae7596f38fefc8a53bf6f2018391a8c5ff3c0379
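If you would rather not copy ids out of GitX, the same two commits can be looked up from the command line. This is a minimal sketch, assuming a Git recent enough to support git rev-list --max-parents=0 and assuming post-master has exactly one root commit; graft_line is a name made up for this example.

```shell
# Hypothetical helper: print one grafts entry ("<commit id> <parent id>")
# after checking that both arguments look like full 40-character SHA-1 ids.
graft_line() {
    for id in "$1" "$2"; do
        case $id in
            *[!0-9a-f]*) echo "not a commit id: $id" >&2; return 1 ;;
        esac
        [ "${#id}" -eq 40 ] || { echo "not a commit id: $id" >&2; return 1; }
    done
    echo "$1 $2"
}

# Usage, from inside git.pre:
#   child=$(git rev-list --max-parents=0 post-master)   # first commit of the post half
#   parent=$(git rev-parse master)                      # last commit of the pre half
#   graft_line "$child" "$parent" >> .git/info/grafts
```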
The cool thing is that if we save this file, refresh the view in GitX, scroll down and find the split again, we should see that the commits are connected now:
This connection is temporary, however; we won't be able to push the repository until we apply the change to the repository. To do that, we use git filter-branch. Not only will this create the connections from .git/info/grafts, it will also regenerate all the commit ids for every subsequent commit after the new connections. 1
Something worth noting is that git filter-branch by itself only applies the change to the branch that you're on. If you're only concerned about master, that's all right, but if you have any branches that come off of master after the split, like we do, those won't be properly rewritten. So to rewrite everything we say this:
git filter-branch --tag-name-filter cat -- --all
So we run that, and then we make sure to
rm .git/info/grafts
so that it doesn't interfere with future changes to pre-repo.
Now, it turns out that git filter-branch saves old commits as it rewrites everything (I suppose in case you want to roll it back or something, I'm not sure). As you can see it looks pretty funky:
To correct this, all we have to do is clone the repo:
git clone file:///tmp/git.pre /path/to/git.final
As David Wheeler points out, it's important to use file:// here to get a copy of everything, otherwise some of the copied files end up as hard links.
The cloned repo will look like:
Five
Since we've cloned the repository, we need to copy remote branches to local ones again. However, this time we're going to actually remove the remote branches since we don't want them to interfere with anything when we push the repository to its final place at the end.
This works pretty much the same way as step three (git branch --no-track $branch refs/remotes/$branch), except that we need to make sure we aren't copying over the master branch, since we already have that locally.
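A sketch of that loop follows. It assumes the clone's remote is named origin (the default for git clone); branches_to_copy and localize_branches are illustrative names rather than anything from the final script.

```shell
# Hypothetical helper: read `git branch -r` output on stdin and print the
# short names of the branches that still need a local copy, i.e. everything
# under origin/ except the symbolic HEAD entry and master.
branches_to_copy() {
    sed -e 's/^[ *]*//' -e '/ -> /d' -e 's#^origin/##' -e '/^master$/d'
}

# Hypothetical helper: create each local branch, then delete the
# remote-tracking ref it was created from.
localize_branches() {
    git branch -r | branches_to_copy |
        while read -r name; do
            git branch --no-track "$name" "refs/remotes/origin/$name"
            git branch -r -d "origin/$name"
        done
}

# Usage, from inside git.final:
#   localize_branches
```

Note that this leaves origin/master in place; it goes away when the origin remote is removed during cleanup.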
Six
So let's review: we've created two Git repos from the pre-split and post-split parts of the SVN repo, copied over commits from post to pre, copied remote branches to local branches on pre, grafted pre and post together, cloned that into a final repository, and re-copied remote branches to local branches in the final repo.
Okay, so let's inspect the repo:
$ cd git.final
$ git checkout master
Already on 'master'
$ git log
commit ae7596f38fefc8a53bf6f2018391a8c5ff3c0379
Author: elliot <elliot@445a800c-2d3f-4b84-95f0-4d0e8c338740>
Date: Sun Jan 31 03:02:26 2010 +0000
Update test.txt, again
commit bead1ec713a5361b0988cc71cbaf08b9c4a1462b
Author: elliot <elliot@445a800c-2d3f-4b84-95f0-4d0e8c338740>
Date: Sun Jan 31 03:02:25 2010 +0000
Update test.txt
commit e89a208a2b1aa2bf5b569a4d4e3fbfd23e138615
Author: elliot <elliot@445a800c-2d3f-4b84-95f0-4d0e8c338740>
Date: Sun Jan 31 03:02:24 2010 +0000
Initial commit
Hmm. That's weird. The history still only goes up to pre-split. But we grafted the two halves together -- that should have been enough, right?
Just to make sure, let's pull up the repo in GitX and see what we have:
Okay, now we can see that everything's there, but master still points to the commit before the split. Is there a way we can move it? Yes -- but we have to tell git exactly where it should move master to. You might think it's easy -- can't you just tell Git to find the very latest commit and move master to that? Unfortunately not, because what if the latest commit is on a branch, and the branch has been modified beyond master?
So we have to go with a different tactic. At some point, something must have pointed to the last commit on post-repo. Of course! It was post's master branch that pointed to its last commit. Is there a way we can find out which commit that points to? Probably, but unfortunately, even if we had that commit id, it would be invalid due to git filter-branch having rewritten the history. However, that doesn't mean we can't have stored it in a branch before said history is rewritten. If you remember way back in step two, we did this:
$ git fetch ../git.post
$ git checkout -b post-master FETCH_HEAD
Now you know why. We've had access to that through the post-master branch all along (it's been carried over through the steps). Now we can simply write:
$ git checkout post-master
Switched to branch 'post-master'
$ git branch -D master
Deleted branch master (was ae7596f).
$ git checkout -b master
Switched to a new branch 'master'
$ git branch -D post-master
Deleted branch post-master (was 2ca81cf).
And just like that, now master points to the very end:
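On versions of Git newer than the one used in this walkthrough, the delete-and-recreate sequence can be shortened with git checkout -B, which creates the named branch or resets it if it already exists. This is an alternative sketch, not what the stitch script does. The run wrapper and the DRY_RUN variable are invented for the example so the commands can be previewed before they touch the repo.

```shell
# Hypothetical wrapper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: point master at whatever post-master points to and
# check it out, then drop the temporary post-master branch.
relocate_master() {
    run git checkout -B master post-master &&
    run git branch -D post-master
}

# Usage, from inside git.final:
#   (DRY_RUN=1; relocate_master)   # preview the two commands
#   relocate_master                # apply them
```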
Seven
All that's left now is some cleanup to reduce the size of the repository:
$ git gc
$ git repack -a -d -f --depth 50 --window 50
While we're at it we can add the URL for the remote repo on our server:
$ git remote rm origin
$ git remote add origin ssh://you@yourserver.com/path/to/repository
And we're done!
Pushing the final repo
I wanted to say a few words about our Git setup. If you don't want to have to pay for GitHub it's actually very easy to host your Git repository right on your server. We followed this guide to set up a git user and set permissions on the directory that holds our Git repos. Assuming you have that, any time you create a new project, you can initialize a Git repo on your server using the following commands (taken partially from that and also this post):
su git
cd /path/to/git/repos_dir
umask 007
mkdir proj.git && cd proj.git
git init --bare
git config core.sharedrepository 1
git config receive.denyNonFastforwards true
find objects -type d -exec chmod 02770 {} \;
chmod -R o-rwx .
chmod -R g=u .
Notice that we initialize a bare repository. In a non-bare repository, all of the files that hold your project's history are stored in a special .git folder inside your project. In a bare repository, the folder is the .git folder. It's not totally necessary, but since we'll never be modifying the repo on the server, it saves some disk space.
Now that we have that, all we have to do to push our newly converted repository to the server is this:
$ git push origin --all
The script(s)
You can find my stitch script as well as my fork of svn2git in this gist. The stitch script I've put in a gist because it's very likely you'll need to change it to fit your repository. For instance, we didn't have any tags in our repository so the stitch script doesn't account for that. Also the svn2git fork should probably go on CPAN, but I may do that at a later time.
That's all! If you have any questions, leave them in the comments. ☯
Footnotes ☯
-
The reason we have to regenerate all the commit ids is that commit ids (like other object ids) are SHA1 hashes. These hashes are generated in part from certain attributes, and one of the attributes for a commit id is the id of the commit's parent.
So let's take our sample repo for example. When the connection is made between d287458 and ae7596f, Git will set ae7596f as the parent of d287458. Because d287458 has a new parent now, when git calculates the SHA1 hash, the commit will have to get a different commit id. But, that means that whichever commit has d287458 as its parent, that commit will have to get a new commit id. But, that means whichever commit above that will have to get a new commit id. And so on and so on. (You can read more about SHA1 calculation here and here.)
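The cascade can be seen with nothing but a SHA-1 tool. Git hashes a commit as the string "commit <size>", a NUL byte, and then the commit text, which includes a parent line. The sketch below builds two commit texts that differ only in that line and hashes both. The author, timestamp, and tree are placeholder values, the function names are made up, and sha1sum (or shasum on macOS) is assumed to be installed.

```shell
# Hash stdin with whichever SHA-1 tool is available and print only the digest.
sha1() {
    if command -v sha1sum >/dev/null 2>&1; then
        sha1sum
    else
        shasum -a 1
    fi | cut -d' ' -f1
}

# Print a minimal commit text; only the parent line depends on $1.
# Tree, author, and timestamp are placeholders for illustration.
commit_body() {
    printf 'tree %s\n' 4b825dc642cb6eb9a060e54bf8d69288fbee4904
    printf 'parent %s\n' "$1"
    printf 'author A U Thor <author@example.com> 1264906946 +0000\n'
    printf 'committer A U Thor <author@example.com> 1264906946 +0000\n'
    printf '\nAdd trunk, branches, tags\n'
}

# Hash the text the way Git hashes a commit object:
# SHA-1 over "commit <size>", a NUL byte, then the text itself.
commit_id() {
    size=$(commit_body "$1" | wc -c | tr -d ' ')
    { printf 'commit %s\0' "$size"; commit_body "$1"; } | sha1
}

# Usage: same commit text, two different parents, two different ids.
#   commit_id ae7596f38fefc8a53bf6f2018391a8c5ff3c0379
#   commit_id 0000000000000000000000000000000000000000
```

Because the resulting id would in turn appear in the parent line of the next commit, that commit's hash changes too, and so on up the history.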