Rewriting git history

So, autotest (http://autotest.github.com/), the project I work on, was kept in an svn repo for a long time (over 6 years). It turns out that people love git [1], so a git-svn mirror was set up.

It took a while to finally decide to move autotest entirely to git. When we did, we briefly discussed doing a proper history rewrite, to eliminate old svn cruft (the hideous git-svn-id: lines in the commit messages) and assign authorship properly (with git-svn, the committer is recorded as the patch author). It’s fair enough that git-svn behaves this way: there’s no reliable way to tell who the original author of a patch is, since each project credits patch authors in a different way.

On autotest, in the Old Days (TM), this was done by means of a From: field in the commit message, and later by Signed-off-by:. So the mirror did not contain reliable author information that you could extract with, say, git shortlog -s. This is the original mess the repo was in:

$ git shortlog -s
     1  00:18.584417Z
     1  Alex Jia
    22  Amos Kong
     1  Aneesh Kumar K.V
    44  Chris Evich
    81  Cleber Rosa
     1  Daniel Veillard
     3  David Greenberg
     1  Eduardo Habkost
     2  Guannan Ren
     1  Jerry Tang
     1  Jiri Zupka
    19  Jiří Župka
     4  Lei Yang
     1  Liu Sheng
     4  Lubos Kocman
   331  Lucas Meneghel Rodrigues
    28  Lukas Doktor
     6  Madhuri Appana
     3  Martin Krizek
     6  Miroslav Rezanina
    13  Nishanth Aravamudan
     1  Onkar N Mahajan
     2  Pavel Hrdina
     1  Philipp Seiler
    27  Qingtang Zhou
     3  Quentin Deldycke
     1  Satheesh Rajendran
     1  Steve
     5  Steve Conklin
     3  Thomas Jarosch
    16  Vinson Lee
     1  Wei Yang
     1  Wenyi Gao
     9  Yiqiao Pu
     1  Yu Mingfei
   154  apw
     9  ericli
     8  guyanhua
   454  jadmanski
   138  jamesren
  1380  lmr
  2450  mbligh
     1  pradeep
     2  root
   795  showard
    12  tangchen

Now, how to solve this? Considering the complete lack of a rigid standard across the commits, it was clear I’d have to use several passes of git filter-branch voodoo.

So, in the first pass, I wanted to look for Signed-off-by: entries, extract the author names and emails, and rewrite the commits. If no Signed-off-by: name could be successfully obtained, fall back to the old author. If the old author happens to be one of the (few) committers of the old svn repo, use a conversion table to replace the names:

mbligh=Martin Bligh <mbligh@google.com>
jadmanski=John Admanski <jadmanski@google.com>
ericli=Eric Li <ericli@google.com>
apw=Andy Whitcroft <apw@shadowen.org>
jamesren=James Ren <jamesren@google.com>
lmr=Lucas Meneghel Rodrigues <lookkas@gmail.com>
showard=Steve Howard <showard@google.com>
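
To make the table format concrete, here is a minimal sketch of looking up one entry. The helper names lookup_name and lookup_email and the temporary file are made up for this illustration; only the username=Name <email> layout comes from the table above.

```shell
#!/bin/bash
# Hypothetical demo: write two sample entries to a temporary table.
table=$(mktemp)
cat > "$table" <<'EOF'
mbligh=Martin Bligh <mbligh@google.com>
lmr=Lucas Meneghel Rodrigues <lookkas@gmail.com>
EOF

# lookup_name USER: print the full name for an svn username, or nothing.
lookup_name () {
    grep "^$1=" "$table" | sed 's/^[^=]*=\(.*\) <.*>$/\1/'
}

# lookup_email USER: print the e-mail address for an svn username, or nothing.
lookup_email () {
    grep "^$1=" "$table" | sed 's/^[^=]*=.* <\(.*\)>$/\1/'
}

lookup_name mbligh     # prints: Martin Bligh
lookup_email lmr       # prints: lookkas@gmail.com
```

A username that is not in the table produces empty output, which is what lets a caller fall back to the existing committer name.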

I then wrote the following script, which expects the conversion table to be saved as /tmp/authors.txt:

 #!/bin/bash

 git filter-branch -f --env-filter '

 get_name_authors_txt () {
     # Map an svn username ($1) to a full name using the conversion table.
     fail=0
     grep -q "^$1=" /tmp/authors.txt || fail=1
     if [ $fail -eq 0 ];
     then
       lname=$(grep "^$1=" "/tmp/authors.txt" | sed "s/^.*=\(.*\) <.*>$/\1/")
       if [ -n "$lname" ]
       then
           echo "$lname"
       else
           echo "$GIT_COMMITTER_NAME"
       fi
     else
       echo "$GIT_COMMITTER_NAME"
     fi
 }

 get_email_authors_txt () {
     # Map an svn username ($1) to an email address using the conversion table.
     fail=0
     grep -q "^$1=" /tmp/authors.txt || fail=1
     if [ $fail -eq 0 ];
     then
       lemail=$(grep "^$1=" "/tmp/authors.txt" | sed "s/^.*=.* <\(.*\)>$/\1/")
       if [ -n "$lemail" ]
       then
           echo "$lemail"
       else
           echo "$GIT_COMMITTER_EMAIL"
       fi
     else
       echo "$GIT_COMMITTER_EMAIL"
     fi
 }

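 get_name () {
     # Name from the first Signed-off-by: line; lines not in "Name <email>" form are dropped by grep -v.
     name=$(git log "$GIT_COMMIT" -1 | grep "Signed-off-by:" | head -1 | sed "s/^.*: *\(.*\) <.*>$/\1/" | grep -v "Signed-off-by:")
     if [ -n "$name" ]
     then
         echo "$name"
     else
         get_name_authors_txt "$1"
     fi
 }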

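 get_email () {
     # Email from the first Signed-off-by: line; lines not in "Name <email>" form are dropped by grep -v.
     email=$(git log "$GIT_COMMIT" -1 | grep "Signed-off-by:" | head -1 | sed "s/^.*: *.* <\(.*\)>$/\1/" | grep -v "Signed-off-by:")
     if [ -n "$email" ]
     then
         echo "$email"
     else
         get_email_authors_txt "$1"
     fi
 }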

 GIT_AUTHOR_NAME=$(get_name $GIT_COMMITTER_NAME) &&
     GIT_AUTHOR_EMAIL=$(get_email $GIT_COMMITTER_NAME) &&
     GIT_COMMITTER_NAME=$GIT_AUTHOR_NAME &&
     GIT_COMMITTER_EMAIL=$GIT_AUTHOR_EMAIL &&
     export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL &&
     export GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL
 '

git filter-branch -f --msg-filter 'sed -e "/^git-svn-id:/d"'

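So, the basic idea is to try to find the first Signed-off-by: line of the commit message, extract the author name, provided it is in the standard format (Name <email>), and set that as the author name; the same process applies to the email. The script also sets the committer to the same identity. The second filter-branch call then strips the git-svn-id: lines from the commit messages.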
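
To see the extraction step in isolation, here is a small sketch that runs the same kind of grep and sed pipeline on a made-up commit message instead of real git log output. The message, names and addresses are hypothetical sample data.

```shell
#!/bin/bash
# Hypothetical commit message, as it might come out of the git-svn mirror.
msg="Fix timeout handling in the client

Signed-off-by: Jane Doe <jane@example.com>
Signed-off-by: John Roe <john@example.com>"

# Take the first Signed-off-by: line only.
line=$(printf '%s\n' "$msg" | grep "Signed-off-by:" | head -1)

# Split it into name and e-mail. A line that is not in the
# "Name <email>" form passes through sed unchanged, so grep -v drops it.
name=$(printf '%s\n' "$line" | sed 's/^.*: *\(.*\) <.*>$/\1/' | grep -v "Signed-off-by:")
email=$(printf '%s\n' "$line" | sed 's/^.*: *.* <\(.*\)>$/\1/' | grep -v "Signed-off-by:")

echo "$name"    # prints: Jane Doe
echo "$email"   # prints: jane@example.com
```

When the line is missing or malformed, both variables end up empty, which is the case where the script falls back to the conversion table.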

For the first pass, this greatly cleaned up the state of the commits, by assigning proper names and getting rid of the git-svn-id: tags.
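
The message cleanup can be tried on a sample message before touching a real repository. This is only a sketch: the commit message and the svn URL below are invented, and the sed expression is the one passed to --msg-filter.

```shell
#!/bin/bash
# Hypothetical commit message carrying a git-svn-id: trailer.
msg="Add a sample test

Signed-off-by: Jane Doe <jane@example.com>

git-svn-id: svn://example.org/trunk@1234 00000000-0000-0000-0000-000000000000"

# Same sed expression as the --msg-filter: delete every line
# that starts with git-svn-id:.
cleaned=$(printf '%s\n' "$msg" | sed -e "/^git-svn-id:/d")

printf '%s\n' "$cleaned"
```

The rest of the message, including the Signed-off-by: line, is left untouched.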

As a matter of fact, I’ve used this same script with small variations over and over to get to the (mostly) complete cleanup, and now things look a lot better:

$ git shortlog -s | wc -l
202

And obviously, this blog post is a way for me to remember how it was done, and hopefully it will be useful to others. As for rewriting history: I know, that’s bad, and I’ll avoid doing it again, but people can now extract stats from the code base more easily.

[1] And in my humble opinion they should love it, it’s a pretty great program 🙂
