So, autotest (http://autotest.github.com/), the project I work on, was kept in an svn repo for a long time (over 6 years). It turns out that people love git [1], so a git-svn mirror was set up.
It took a while to finally decide to move autotest entirely to git. When we did, we briefly discussed doing a proper history rewrite, to eliminate old svn cruft (the hideous git-svn-id: lines in the commit messages) and assign authorship properly (with git-svn, the committer is considered the patch author). It’s fair enough that git-svn behaves this way: there’s no reliable way to tell who the original author of a patch is, since each project handles attribution in a different way.
On autotest, in the Old Days (TM), this was done by means of a From: field in the commit message, and later by Signed-off-by. So the mirror did not contain reliable information to tell the authors apart by, say, using git shortlog -s. This is the mess the repo was originally in:
$ git shortlog -s
1 00:18.584417Z
1 Alex Jia
22 Amos Kong
1 Aneesh Kumar K.V
44 Chris Evich
81 Cleber Rosa
1 Daniel Veillard
3 David Greenberg
1 Eduardo Habkost
2 Guannan Ren
1 Jerry Tang
1 Jiri Zupka
19 Jiří Župka
4 Lei Yang
1 Liu Sheng
4 Lubos Kocman
331 Lucas Meneghel Rodrigues
28 Lukas Doktor
6 Madhuri Appana
3 Martin Krizek
6 Miroslav Rezanina
13 Nishanth Aravamudan
1 Onkar N Mahajan
2 Pavel Hrdina
1 Philipp Seiler
27 Qingtang Zhou
3 Quentin Deldycke
1 Satheesh Rajendran
1 Steve
5 Steve Conklin
3 Thomas Jarosch
16 Vinson Lee
1 Wei Yang
1 Wenyi Gao
9 Yiqiao Pu
1 Yu Mingfei
154 apw
9 ericli
8 guyanhua
454 jadmanski
138 jamesren
1380 lmr
2450 mbligh
1 pradeep
2 root
795 showard
12 tangchen
Now, how to solve this? Considering the complete lack of a rigid standard across the commits, it was clear I’d have to use several passes of git filter-branch voodoo.
So, in the first pass, I wanted to look for Signed-off-by: entries, extract the author names and emails, and rewrite the commits. If no Signed-off-by: name could be successfully obtained, fall back to the old author. If the old author happens to be one of the (few) committers of the old svn repo, use a conversion table (kept in /tmp/authors.txt) to replace the names:
mbligh=Martin Bligh <mbligh@google.com>
jadmanski=John Admanski <jadmanski@google.com>
ericli=Eric Li <ericli@google.com>
apw=Andy Whitcroft <apw@shadowen.org>
jamesren=James Ren <jamesren@google.com>
lmr=Lucas Meneghel Rodrigues <lookkas@gmail.com>
showard=Steve Howard <showard@google.com>
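To make the lookup concrete, here is a minimal sketch of how such a table can be queried from the shell. It assumes the table is stored one entry per line in the `user=Name <email>` layout; the helper names `lookup_name` and `lookup_email` and the `AUTHORS_FILE` variable are made up for this illustration and are not part of the script used in this post.

```shell
#!/bin/bash
# Hypothetical helpers: look up an svn username in a "user=Name <email>" table.
# AUTHORS_FILE defaults to /tmp/authors.txt, the path used in this post.
AUTHORS_FILE="${AUTHORS_FILE:-/tmp/authors.txt}"

# Print the full name mapped to svn user $1, or nothing if there is no entry.
lookup_name () {
    sed -n "s/^$1=\(.*\) <.*>\$/\1/p" "$AUTHORS_FILE" | head -n 1
}

# Print the email mapped to svn user $1, or nothing if there is no entry.
lookup_email () {
    sed -n "s/^$1=.* <\(.*\)>\$/\1/p" "$AUTHORS_FILE" | head -n 1
}
```

With the table above saved to that file, `lookup_name mbligh` prints `Martin Bligh` and `lookup_email mbligh` prints `mbligh@google.com`. An unknown user prints nothing, which lets the caller fall back to the existing committer.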
Then the following script was created:
#!/bin/bash
# First pass: take author and committer from the first Signed-off-by: line,
# falling back to the svn username table in /tmp/authors.txt.
git filter-branch -f --env-filter '
# Print the full name mapped to svn user $1, or the current committer name.
get_name_authors_txt () {
    fail=0
    grep -q "^$1=" /tmp/authors.txt || fail=1
    if [ $fail -eq 0 ]
    then
        lname=$(grep "^$1=" /tmp/authors.txt | sed "s/^.*=\(.*\) <.*>$/\1/")
        if [ -n "$lname" ]
        then
            echo "$lname"
        else
            echo "$GIT_COMMITTER_NAME"
        fi
    else
        echo "$GIT_COMMITTER_NAME"
    fi
}
# Print the email mapped to svn user $1, or the current committer email.
get_email_authors_txt () {
    fail=0
    grep -q "^$1=" /tmp/authors.txt || fail=1
    if [ $fail -eq 0 ]
    then
        lemail=$(grep "^$1=" /tmp/authors.txt | sed "s/^.*=.* <\(.*\)>$/\1/")
        if [ -n "$lemail" ]
        then
            echo "$lemail"
        else
            echo "$GIT_COMMITTER_EMAIL"
        fi
    else
        echo "$GIT_COMMITTER_EMAIL"
    fi
}
# Print the name from the first Signed-off-by: line. The trailing grep -v
# discards lines that sed could not parse (not in "Name <email>" format).
get_name () {
    name=$(git log -1 "$GIT_COMMIT" | grep "Signed-off-by:" | head -1 | sed "s/^.*Signed-off-by: *\(.*\) <.*>$/\1/" | grep -v "Signed-off-by:")
    if [ -n "$name" ]
    then
        echo "$name"
    else
        get_name_authors_txt "$1"
    fi
}
# Same as get_name, but for the email address.
get_email () {
    email=$(git log -1 "$GIT_COMMIT" | grep "Signed-off-by:" | head -1 | sed "s/^.*Signed-off-by: *.* <\(.*\)>$/\1/" | grep -v "Signed-off-by:")
    if [ -n "$email" ]
    then
        echo "$email"
    else
        get_email_authors_txt "$1"
    fi
}
GIT_AUTHOR_NAME=$(get_name "$GIT_COMMITTER_NAME") &&
GIT_AUTHOR_EMAIL=$(get_email "$GIT_COMMITTER_NAME") &&
GIT_COMMITTER_NAME=$GIT_AUTHOR_NAME &&
GIT_COMMITTER_EMAIL=$GIT_AUTHOR_EMAIL &&
export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL &&
export GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL
'
# Second step: drop the git-svn-id: lines from every commit message.
git filter-branch -f --msg-filter 'sed -e "/^git-svn-id:/d"'
So, the basic idea is to try to find the first Signed-off-by: line of the patch, extract the author name, provided the line is in the standard format (Name <email>), and set that as the author name. The same process is used for the email.
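The extraction step can be tried out on its own, without running filter-branch. The following is a minimal sketch under the assumption that sign-off lines look like `Signed-off-by: Name <email>`; the helper names `signoff_name` and `signoff_email` are hypothetical, and the people in the usage example are invented.

```shell
#!/bin/bash
# Hypothetical helpers: read a commit message on stdin, take its first
# Signed-off-by: line, and print the name or the email from it.
# Both print nothing when that line is not in "Name <email>" format.
signoff_name () {
    grep "Signed-off-by:" | head -n 1 |
        sed -n "s/^.*Signed-off-by: *\(.*\) <.*>\$/\1/p"
}

signoff_email () {
    grep "Signed-off-by:" | head -n 1 |
        sed -n "s/^.*Signed-off-by: *.* <\(.*\)>\$/\1/p"
}
```

For example, `printf 'Fix bug\n\nSigned-off-by: Jane Doe <jane@example.com>\n' | signoff_name` prints `Jane Doe`. Empty output is the signal to fall back to the conversion table.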
This first pass greatly cleaned up the state of the commits, by assigning proper names and then getting rid of the git-svn-id: lines.
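To see what the message filter does to a single commit message, the same sed expression can be run by hand. This is just a sketch: the `strip_svn_id` wrapper name is made up, and the sample git-svn-id: line in the usage note uses an invented URL and UUID.

```shell
#!/bin/bash
# Hypothetical wrapper around the sed expression passed to --msg-filter:
# reads a commit message on stdin and drops every line that starts with
# "git-svn-id:". Lines that merely mention it elsewhere are kept.
strip_svn_id () {
    sed -e "/^git-svn-id:/d"
}
```

Feeding it `printf 'Fix typo\n\ngit-svn-id: svn://example.invalid/trunk@1234 0000-uuid\n'` leaves only the `Fix typo` line and the blank line after it.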
As a matter of fact, I’ve used this same script with small variations over and over to get to the (mostly) complete cleanup, and now things look a lot better:
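One such variation might handle the older From: convention as well. The sketch below is only an illustration of the idea, not the script that was actually run: the `pick_identity` name is hypothetical, and both the `From: Name <email>` line format and the choice to prefer From: over Signed-off-by: are assumptions.

```shell
#!/bin/bash
# Hypothetical variation: read a commit message on stdin and print the first
# "Name <email>" identity found, preferring a From: line over Signed-off-by:.
# The precedence and the exact line formats are assumptions.
pick_identity () {
    local msg key ident
    msg=$(cat)
    for key in "From" "Signed-off-by"; do
        ident=$(printf "%s\n" "$msg" |
            sed -n "s/^ *$key: *\(.* <.*>\)\$/\1/p" | head -n 1)
        if [ -n "$ident" ]; then
            echo "$ident"
            return 0
        fi
    done
    # No usable identity: let the caller fall back to the svn username table.
    return 1
}
```

The non-zero return status when nothing matches makes it easy to chain a fallback, as in `pick_identity < message.txt || echo "$GIT_COMMITTER_NAME"`.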
$ git shortlog -s | wc -l
202
And obviously, this blog post is a way to remember how it was done, and hopefully it will be useful to others. As for rewriting the history: I know, that’s bad, and I’ll avoid doing it again, but now people can extract stats from the code base more easily.
[1] And in my humble opinion they should love it; it’s a pretty great program.