Rewriting git history

So, autotest (http://autotest.github.com/), the project I work on, was kept in an svn repo for a long time (over 6 years). Turns out that people love git [1], so a git-svn mirror was set up.

It took a while to finally decide to move autotest entirely to git. When we did, we briefly discussed doing a proper history rewrite, to eliminate old svn cruft (the hideous git-svn-id: lines in the commit messages) and assign authorship properly (on git-svn, the committer is considered the patch author). It’s fair enough that git-svn behaves this way: there’s no reliable way to tell who the original author of a patch is, since each project handles author attribution in a different way.

On autotest, in the Old Days (TM), this was done by means of a From: field in the commit message, and later by Signed-off-by. So the mirror did not contain reliable information to identify the authors by, say, using git shortlog -s. This is the mess the repo was originally in:

$ git shortlog -s
     1  00:18.584417Z
     1  Alex Jia
    22  Amos Kong
     1  Aneesh Kumar K.V
    44  Chris Evich
    81  Cleber Rosa
     1  Daniel Veillard
     3  David Greenberg
     1  Eduardo Habkost
     2  Guannan Ren
     1  Jerry Tang
     1  Jiri Zupka
    19  Jiří Župka
     4  Lei Yang
     1  Liu Sheng
     4  Lubos Kocman
   331  Lucas Meneghel Rodrigues
    28  Lukas Doktor
     6  Madhuri Appana
     3  Martin Krizek
     6  Miroslav Rezanina
    13  Nishanth Aravamudan
     1  Onkar N Mahajan
     2  Pavel Hrdina
     1  Philipp Seiler
    27  Qingtang Zhou
     3  Quentin Deldycke
     1  Satheesh Rajendran
     1  Steve
     5  Steve Conklin
     3  Thomas Jarosch
    16  Vinson Lee
     1  Wei Yang
     1  Wenyi Gao
     9  Yiqiao Pu
     1  Yu Mingfei
   154  apw
     9  ericli
     8  guyanhua
   454  jadmanski
   138  jamesren
  1380  lmr
  2450  mbligh
     1  pradeep
     2  root
   795  showard
    12  tangchen

Now, how to solve this? Considering the complete lack of a rigid standard used across the commits, it was clear I’d have to use several passes of git filter-branch voodoo.

So, in the first pass, I wanted to look for signed-off-by: entries, extract the names and emails, and rewrite the commits. If no signed-off-by: name could be successfully obtained, fall back to the old author. If the old author happens to be one of the (few) committers of the old svn repo, use a conversion table (/tmp/authors.txt in the script) to replace the names:

mbligh=Martin Bligh <mbligh@google.com>
jadmanski=John Admanski <jadmanski@google.com>
ericli=Eric Li <ericli@google.com>
apw=Andy Whitcroft <apw@shadowen.org>
jamesren=James Ren <jamesren@google.com>
lmr=Lucas Meneghel Rodrigues <lookkas@gmail.com>
showard=Steve Howard <showard@google.com>
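To illustrate how one entry of this table can be split into a name and an email, here is a minimal sketch. The lookup_author helper and the temporary file are assumptions made for the example; they are not part of the script used for the rewrite.

```shell
# Hypothetical helper (not from the original script): print "name|email"
# for an svn user found in a mapping file, or nothing if the user is unknown.
lookup_author () {
    # $1 = svn user name, $2 = path to the mapping file
    sed -n "s/^$1=\(.*\) <\(.*\)>\$/\1|\2/p" "$2" | head -1
}

# Two entries copied from the table, written to a throwaway file.
authors_file=$(mktemp)
cat > "$authors_file" <<'EOF'
mbligh=Martin Bligh <mbligh@google.com>
lmr=Lucas Meneghel Rodrigues <lookkas@gmail.com>
EOF

entry=$(lookup_author mbligh "$authors_file")
name=${entry%%|*}     # everything before the first "|"
email=${entry#*|}     # everything after the first "|"
unknown=$(lookup_author nobody "$authors_file")

echo "$name <$email>"
rm -f "$authors_file"
```

An unknown user yields an empty string, which is the cue to fall back to whatever name the commit already carries.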

Then the following script was created:


 git filter-branch -f --env-filter '

 # Map svn user $1 to a real name via /tmp/authors.txt, else keep the committer name.
 get_name_authors_txt () {
     lname=$(grep "^$1=" /tmp/authors.txt | sed "s/^.*=\(.*\) <.*>$/\1/")
     if [ -n "$lname" ]; then
         echo "$lname"
     else
         echo "$GIT_COMMITTER_NAME"
     fi
 }

 # Same lookup for the email address.
 get_email_authors_txt () {
     lemail=$(grep "^$1=" /tmp/authors.txt | sed "s/^.*=.* <\(.*\)>$/\1/")
     if [ -n "$lemail" ]; then
         echo "$lemail"
     else
         echo "$GIT_COMMITTER_EMAIL"
     fi
 }

 # Name from the first Signed-off-by: line; grep -v drops lines sed could not parse.
 get_name () {
     name=$(git log $GIT_COMMIT -1 | grep "Signed-off-by:" | head -1 | sed "s/^.*: *\(.*\) <.*>$/\1/" | grep -v "Signed-off-by:")
     if [ -n "$name" ]; then
         echo "$name"
     else
         get_name_authors_txt "$1"
     fi
 }

 # Email from the same line, with the same fallback.
 get_email () {
     email=$(git log $GIT_COMMIT -1 | grep "Signed-off-by:" | head -1 | sed "s/^.*: *.* <\(.*\)>$/\1/" | grep -v "Signed-off-by:")
     if [ -n "$email" ]; then
         echo "$email"
     else
         get_email_authors_txt "$1"
     fi
 }

 # Assumed ending (the listing stops after get_email): apply the results to the author fields.
 GIT_AUTHOR_NAME=$(get_name "$GIT_COMMITTER_NAME")
 GIT_AUTHOR_EMAIL=$(get_email "$GIT_COMMITTER_NAME")
 export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL
 '

git filter-branch -f --msg-filter 'sed -e "/^git-svn-id:/d"'

So, the basic idea is to try to find the first Signed-off-by: line of the patch, extract the author, provided it is in the standard format, and set that as the author name. The same process applies to the email.
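The extraction step can be tried in isolation on a sample commit message. This is only a sketch: the message, names and addresses below are made up for the example, and the sob_name and sob_email variables are hypothetical.

```shell
# Made-up commit message with two sign-offs in the standard "Name <email>" form.
msg="client: fix typo in job setup

Signed-off-by: Jane Doe <jane@example.com>
Signed-off-by: John Roe <john@example.com>"

# Take the first Signed-off-by: line, then split it into name and email.
# sed -n ... p prints nothing if the line is not in the standard form.
first=$(printf "%s\n" "$msg" | grep "Signed-off-by:" | head -1 \
        | sed -n "s/^ *Signed-off-by: *\(.*\) <\(.*\)>\$/\1|\2/p")
sob_name=${first%%|*}
sob_email=${first#*|}

echo "$sob_name <$sob_email>"
```

Only the first sign-off is used, so a patch signed off by both its author and a maintainer is credited to whoever signed first.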

This, for the first pass, greatly cleaned up the state of the commits, by assigning proper names and then getting rid of the git-svn-id: tags.
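To see what the message filter does to a single commit message, the same sed expression can be run on a sample. The message and the svn URL below are invented for the example.

```shell
# Made-up commit message ending in a git-svn-id: line (URL and UUID are placeholders).
msg="Fix kvm test timeout

git-svn-id: svn://example.invalid/trunk@1234 00000000-0000-0000-0000-000000000000"

# Same expression as in the --msg-filter pass: delete lines starting with git-svn-id:
cleaned=$(printf "%s\n" "$msg" | sed -e "/^git-svn-id:/d")

printf "%s\n" "$cleaned"
```

The command substitution also drops the blank line that is left trailing once the tag is gone.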

As a matter of fact, I’ve used this same script with small variations over and over to get to the (mostly) complete cleanup, and now things look a lot better:

$ git shortlog -s | wc -l

And obviously, this blog post is a way to remember, and hopefully it might be useful to others. About rewriting the history: I know, that’s bad, and I’ll avoid doing it again, but now people can extract stats from the code base more easily.

[1] And in my humble opinion they should love it, it’s a pretty great program 🙂


Lucas version 32.0 released!

Yet another year has passed, and project Lucas reaches version 32.0. As usual, we’ve got more improvements that come from the mature (but not necessarily stable) codebase. Among the highlights of this version, we have:

* After 12 very successful releases in Campinas, project hosting has moved to Piracicaba. We thank Campinas for the many years of continued service, and welcome the new home 🙂
* All round, parenting, personal maintenance and housekeeping subsystems were greatly improved, making 32.0 a very lean and solid release.
* The dating subsystem (that spent some good releases disabled) was worked out, lots of bugs were fixed, and now it’s better than ever before 🙂

As for the offspring project Victoria, it reached version 9.0 last December. The reading and writing subsystems, which debuted in version 8.0, were vastly improved, the school subsystem is as solid as ever (good grades), and it keeps me amazed pretty much every day 🙂

Thank you very much to all the project contributors (friends, family and girlfriend) for being so awesome, all these years (and we hope, the upcoming years). Many thanks to the Architect for keeping the project running.