Rewriting git history

So autotest, the project I work on, was kept in an svn repo for a long time (over 6 years). It turns out that people love git [1], so a git-svn mirror was set up.

It took a while to finally decide to move autotest entirely to git. When we did, we briefly discussed doing a proper history rewrite to eliminate old svn cruft (the hideous git-svn-id: lines in the commit messages) and assign authorship properly (with git-svn, the committer is considered the patch author). It’s fair enough that git-svn behaves this way: there’s no reliable way to tell who the original author of a patch is, since each project records patch authorship in a different way.

On autotest, in the Old Days (TM), this was done by means of a From: field in the commit message, and later by Signed-off-by. So the mirror did not contain reliable information to identify the authors by, say, using git shortlog -s. This is the mess the repo was originally in:

$ git shortlog -s
     1  00:18.584417Z
     1  Alex Jia
    22  Amos Kong
     1  Aneesh Kumar K.V
    44  Chris Evich
    81  Cleber Rosa
     1  Daniel Veillard
     3  David Greenberg
     1  Eduardo Habkost
     2  Guannan Ren
     1  Jerry Tang
     1  Jiri Zupka
    19  Jiří Župka
     4  Lei Yang
     1  Liu Sheng
     4  Lubos Kocman
   331  Lucas Meneghel Rodrigues
    28  Lukas Doktor
     6  Madhuri Appana
     3  Martin Krizek
     6  Miroslav Rezanina
    13  Nishanth Aravamudan
     1  Onkar N Mahajan
     2  Pavel Hrdina
     1  Philipp Seiler
    27  Qingtang Zhou
     3  Quentin Deldycke
     1  Satheesh Rajendran
     1  Steve
     5  Steve Conklin
     3  Thomas Jarosch
    16  Vinson Lee
     1  Wei Yang
     1  Wenyi Gao
     9  Yiqiao Pu
     1  Yu Mingfei
   154  apw
     9  ericli
     8  guyanhua
   454  jadmanski
   138  jamesren
  1380  lmr
  2450  mbligh
     1  pradeep
     2  root
   795  showard
    12  tangchen

Now, how to solve this? Given the complete lack of a rigid standard across the commits, it was clear I’d have to use several passes of git filter-branch voodoo.

So, in the first pass, I wanted to look for Signed-off-by: entries, extract the author names and emails, and rewrite the commits. If no Signed-off-by: name could be successfully obtained, fall back to the old author. If the old author happens to be one of the (few) committers of the old svn repo, use a conversion table to replace the names:

mbligh=Martin Bligh <>
jadmanski=John Admanski <>
ericli=Eric Li <>
apw=Andy Whitcroft <>
jamesren=James Ren <>
lmr=Lucas Meneghel Rodrigues <>
showard=Steve Howard <>
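The lookup against a table like this can be tried in isolation before wiring it into a history rewrite. The following is only a sketch: the `jdoe` entry, the temporary file, and the `lookup_name`/`lookup_email` helper names are all made up for illustration and are not part of the real table.

```shell
#!/bin/sh
# Hypothetical mapping file; the entry below is a placeholder, not a real committer.
authors_file=$(mktemp)
trap 'rm -f "$authors_file"' EXIT
cat > "$authors_file" <<'EOF'
jdoe=Jane Doe <jane.doe@example.com>
EOF

# Print the full name for an svn username, or nothing if there is no entry.
lookup_name () {
    sed -n "s/^$1=\(.*\) <.*>\$/\1/p" "$authors_file" | head -1
}

# Print the email for an svn username, or nothing if there is no entry.
lookup_email () {
    sed -n "s/^$1=.* <\(.*\)>\$/\1/p" "$authors_file" | head -1
}

lookup_name jdoe     # prints: Jane Doe
lookup_email jdoe    # prints: jane.doe@example.com
```

Because `sed -n` with the `p` flag only prints lines where the substitution matched, an unknown username yields empty output, which makes it easy to fall back to another source for the name.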

Then the following script was created:


 git filter-branch -f --env-filter '

 # Map an svn username to a full name using /tmp/authors.txt
 # (lines of the form: user=Full Name <email>).
 get_name_authors_txt () {
     lname=$(grep "^$1=" /tmp/authors.txt | head -1 | sed -n "s/^.*=\(.*\) <.*>\$/\1/p")
     if [ -n "$lname" ]
     then
         echo "$lname"
     else
         echo "$GIT_COMMITTER_NAME"
     fi
 }

 # Same lookup, but for the email address.
 get_email_authors_txt () {
     lemail=$(grep "^$1=" /tmp/authors.txt | head -1 | sed -n "s/^.*=.* <\(.*\)>\$/\1/p")
     if [ -n "$lemail" ]
     then
         echo "$lemail"
     else
         echo "$GIT_COMMITTER_EMAIL"
     fi
 }

 # Name from the first Signed-off-by: line, else fall back to the table.
 get_name () {
     name=$(git log $GIT_COMMIT -1 | grep "Signed-off-by:" | head -1 | sed -n "s/^.*Signed-off-by: *\(.*\) <.*>.*\$/\1/p")
     if [ -n "$name" ]
     then
         echo "$name"
     else
         get_name_authors_txt "$1"
     fi
 }

 # Email from the first Signed-off-by: line, else fall back to the table.
 get_email () {
     email=$(git log $GIT_COMMIT -1 | grep "Signed-off-by:" | head -1 | sed -n "s/^.*Signed-off-by: .*<\(.*\)>.*\$/\1/p")
     if [ -n "$email" ]
     then
         echo "$email"
     else
         get_email_authors_txt "$1"
     fi
 }

 # Assumed final step (not shown in the listing as it survived):
 # apply the results as the commit author.
 GIT_AUTHOR_NAME=$(get_name "$GIT_COMMITTER_NAME")
 GIT_AUTHOR_EMAIL=$(get_email "$GIT_COMMITTER_NAME")
 export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL
 '

 git filter-branch -f --msg-filter 'sed -e "/^git-svn-id:/d"'

So, the basic idea is to try to find the first Signed-off-by: line of the patch, extract the author, provided it is in the standard format, and set that as the author name; the same process applies to the email.
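To see the extraction step on its own, here is a small sketch that runs outside of git entirely. It assumes the standard format is `Signed-off-by: Name <email>`; the helper names `signoff_name` and `signoff_email` and the sample message are invented for the example.

```shell
#!/bin/sh
# Read a commit message on stdin and print the name from the first
# Signed-off-by: line, or nothing if that line is not "Name <email>".
signoff_name () {
    grep "Signed-off-by:" | head -1 |
        sed -n "s/^.*Signed-off-by: *\(.*\) <.*>.*\$/\1/p"
}

# Same, but print the email address between the angle brackets.
signoff_email () {
    grep "Signed-off-by:" | head -1 |
        sed -n "s/^.*Signed-off-by: .*<\(.*\)>.*\$/\1/p"
}

# Made-up commit message with two sign-offs; only the first one is used.
msg="Fix a typo in the docs

Signed-off-by: Jane Doe <jane.doe@example.com>
Signed-off-by: John Roe <john.roe@example.com>"

printf "%s\n" "$msg" | signoff_name     # prints: Jane Doe
printf "%s\n" "$msg" | signoff_email    # prints: jane.doe@example.com
```

A sign-off that lacks the `<email>` part produces no output at all, which is the cue to fall back to the old author.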

For the first pass, this greatly cleaned up the state of the commits by assigning proper names and then getting rid of the git-svn-id: tags.
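The git-svn-id: removal is just a sed line deletion applied to each commit message, so it can be checked on a sample message first. In this sketch the `strip_svn_id` name, the repository URL, and the UUID are placeholders.

```shell
#!/bin/sh
# Delete any line that starts with "git-svn-id:" from the message on stdin.
strip_svn_id () {
    sed -e "/^git-svn-id:/d"
}

# Made-up commit message in the shape git-svn produces.
msg="Add new test

git-svn-id: svn://example.invalid/trunk@1234 00000000-0000-0000-0000-000000000000"

printf "%s\n" "$msg" | strip_svn_id
```

Note that the blank line that preceded the tag is left in place; only the tag line itself is dropped.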

As a matter of fact, I’ve used this same script with small variations over and over to get to the (mostly) complete cleanup, and now things look a lot better:

$ git shortlog -s | wc -l

And obviously, this blog post is a way for me to remember how it was done, and hopefully it might be useful to others. About rewriting the history: I know, that’s bad, and I’ll avoid doing it again, but now people can extract stats from the code base more easily.

[1] And in my humble opinion they should love it, it’s a pretty great program 🙂

