2.12.07. Distributed VCS's are the Great Enablers (or: don't fear the repo)

The more I play with the new breed of VCS tools, the more I appreciate them. The older generations (CVS, SVN) look increasingly archaic, supporting a computing and development model that seems unsustainable. Yet most of us lived with those tools, or something similar, for most of our development-focused lives.

When I speak of the new breed, the two standouts (to me) are Git and Mercurial. There are some other interesting ones, particularly Darcs, but Git and Mercurial seem to have the most steam and seem fairly grounded and stable. Between those two, I still find myself preferring Git. I’ve had some nasty webs to untangle and Git has provided me with the best resources to untangle them.

Those webs are actually all related to CVS and some messed up trunks and branches. Some of the code lives on in CVS, but thanks to Git, sorting out the mess and/or bringing in a huge amount of new work (done outside of version control because no one likes branching in CVS and everyone is afraid of ‘breaking the build’) was far less traumatic than usual.

One of those messes could have been avoided had we been using Git as a company (which is planned). One of the great things these tools provide is the ability to easily do speculative development. Branching and merging is so easy. And most of those branches are private. One big problem we have with CVS is what to name a branch: how to make the name unique, informative, and communicative to others. And then we have to tag its beginnings, its breaking off points, its merge points, etc, just in case something goes wrong (or even right, in the case of multiple merges). All of those tags end up in the big cloud: long, stuffy, confusing names that outlive their usefulness. It’s one thing to deal with all of this for an important branch that everyone agrees is important. It’s another to go through all of this just for a couple of days or weeks of personal work. So no one does it. And big chunks of work are just done dangerously - nothing checked in for days at a time. And what if that big chunk of work turned out to be a failed experiment? Maybe there are a couple of good ideas in that work, and it might be worth referring to later, so maybe now one makes a branch and does a single gigantic check-in, just so that there’s a record somewhere. But now, one can’t easily untangle a couple of good ideas from the majority of failed-experiment code. “Oh!” they’ll say in the future, “I had that problem solved! It’s just all tangled up in the soft-link-experimental-branch in one big check in and I didn’t have the time to sort it out!”

I speak from personal experience on that last one. I’m still kicking myself over that scenario. The whole problem turned out to be bigger than expected, and now there’s just a big blob of crap, sitting in the CVS repository somewhere.

With a distributed VCS, I could have branched the moment it looked like the problem was getting bigger than expected. Then I could have kept committing in small chunks to my personal branch until I realized the experiment had failed. With smaller check-ins, navigating the history to cherry-pick the couple of good, usable ideas would have been much easier, even if everything else was discarded. I wouldn’t have had to worry about ‘breaking the build’, or about finding a good name for my branch because everyone else would end up seeing it. I could have managed it all myself.
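Here is a rough sketch of how that might have gone in Git. Everything in it is made up for illustration: the throwaway project, the file names, and the soft-link-experiment branch name. It assumes a reasonably recent git on the PATH and is only one way to do it, not a recipe.

```shell
#!/bin/sh
set -eu

# Throwaway repository in a temp directory (hypothetical project).
work=$(mktemp -d)
git init -q "$work/project"
cd "$work/project"
git config user.name "Example Dev"        # placeholder identity, local to this repo
git config user.email "dev@example.com"

echo "stable code" > main.txt
git add main.txt
git commit -q -m "Initial stable state"
mainline=$(git symbolic-ref --short HEAD)  # master or main, depending on git config

# Branch privately the moment the problem starts to grow.
git checkout -q -b soft-link-experiment

# Small commit 1: an idea that turns out to be worth keeping.
echo "useful helper" > helper.txt
git add helper.txt
git commit -q -m "Add helper that solves the path problem"
good_idea=$(git rev-parse HEAD)

# Small commit 2: the part of the experiment that fails.
echo "dead end" > experiment.txt
git add experiment.txt
git commit -q -m "Try the soft-link approach (did not pan out)"

# The experiment failed: return to the main line and take only the good commit.
git checkout -q "$mainline"
git cherry-pick "$good_idea" >/dev/null

ls
```

After this runs, helper.txt is on the main line while experiment.txt exists only on the private branch, which can be kept around for reference or deleted.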

This is the speculative development benefit that alone makes these tools great. It’s so easy to branch, MERGE, rebase, etc. And it can all be done without impacting anyone else.

One thing that I often hear when I start advocating distributed VCS’s is “well, I like having a central repository that I can always get to” or “is always backed up” or “is the known master copy.” There’s nothing inherent in distributed VCS’s that prevents you from having that. You can totally have a model similar to SVN/CVS with regard to a central repository, with a mixture of read-only and read/write access. But unlike CVS (or SVN), what you publish out of that repository is basically the same thing that you have in a local clone. No repository is inherently more special than any other; only policy makes it so. You can say “all of our company’s main code is on server X under path /pub/scm/…”.

And unlike CVS (or SVN), really wild development can be done totally away from that central collection. A small team can share repositories amongst themselves, and then one person can push the changes in to the central place. Or the team may publish their repository at a new location for someone else to review and integrate. Since they all stem from the same source, comparisons and merges should all still work, even though the repositories are separate.

Imagine this in a company that has hired a new developer. Perhaps during their first three months (a typical probationary period), they do not get write access to the core repositories. With a distributed VCS, they can clone the project(s) to which they’re assigned, do their work, and then publish their results by telling their supervisor “hey, look at my changes, you can read them here …” where here may be an HTTP URL or just a file system path. Their supervisor can then conduct code reviews on the new guy’s work and make suggestions or push in changes of his own. When the new developer’s code is approved, the supervisor or some other more senior developer is responsible for doing the merge. It’s all still tracked, all under version control, but the source is protected from any new-guy mistakes, and the new guy doesn’t have to feel pressure about committing changes to a large code-base which he doesn’t yet fully grasp.
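A minimal sketch of that review flow in Git, using plain file system paths in a temp directory to stand in for the core repository and the new developer's published clone. The names (core, newdev, app.txt) and identities are all invented, and a real setup would add access controls that this does not show.

```shell
#!/bin/sh
set -eu

base=$(mktemp -d)

# The company's core repository (the new developer has no write access to it).
git init -q "$base/core"
cd "$base/core"
git config user.name "Supervisor"          # placeholder identity
git config user.email "supervisor@example.com"
echo "v1" > app.txt
git add app.txt
git commit -q -m "Core: initial version"

# The new developer clones it and works entirely in their own copy.
git clone -q "$base/core" "$base/newdev"
cd "$base/newdev"
git config user.name "New Developer"       # placeholder identity
git config user.email "newdev@example.com"
echo "v2" > app.txt
git commit -q -am "Improve app"

# The supervisor fetches from the location the developer advertised,
# reviews what came in, and only then merges it into the core repository.
cd "$base/core"
git fetch -q "$base/newdev" HEAD
git --no-pager log --oneline HEAD..FETCH_HEAD   # commits under review
git --no-pager diff HEAD FETCH_HEAD             # the actual changes
git merge -q FETCH_HEAD
```

Nothing reaches the core repository until the supervisor runs the final merge, yet every step the new developer took is recorded in their own clone.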

But perhaps the most killer feature of these tools is how easy it is to put anything under revision management. I sometimes have scripts that I start writing to do a small job, typically some kind of data transformation. Sometimes those scripts get changed a lot over the course of some small project, which is typically OK: they’re only going to be used once, right?

This past week, I found myself having to track down one such set of scripts again because some files had gotten overwritten with new files based on WAY old formats of the data. Basically I needed to find my old transformations and run them again. Fortunately, I still had the scripts. But they didn’t work 100%, and as I looked at the code I remembered one small difference that 5% of the old old files had. Well, I didn’t remember the difference itself, I just remembered that they had a minor difference and that I had adjusted the script to finish up that final small set of files. But now, I didn’t have the script that worked against the other 95%. When I did the work initially, it was done so quickly that I was probably using my editor’s UNDO/REDO buffer to move between the two versions as needed.

Now if I had just gone in to the directory with the scripts and done a git init; git add .; git commit sequence, I would probably have the minor differences right there. But I didn’t know such tools were available at the time. So now I had to rewrite things. This time, I put the scripts and data files under git’s control so that I had easy reference to the before and after stages of the data files, just in case this scenario ever happened again.
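Spelled out, that sequence and the payoff might look something like this. The transform.sh script and its contents are invented stand-ins for my real scripts, and the identity settings are placeholders; treat it as a sketch.

```shell
#!/bin/sh
set -eu

# A directory holding a one-off script (hypothetical contents).
scripts=$(mktemp -d)
cd "$scripts"
printf 'tr a-z A-Z\n' > transform.sh

# The three-step sequence: init, add, commit.
git init -q
git config user.name "Example Dev"        # placeholder identity
git config user.email "dev@example.com"
git add .
git commit -q -m "Transformation that handles the common format"

# Later: tweak the script for the handful of files that differ, and commit again.
printf 'tr a-z A-Z | tr -d _\n' > transform.sh
git commit -q -am "Adjust for the files with the minor difference"

# Both versions are still there: show the difference, or pull the old one back out.
git --no-pager diff HEAD~1 HEAD
git show HEAD~1:transform.sh
```

The last command prints the earlier version of the script, which is exactly the thing I no longer had.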

I didn’t have to think of a good place to put these things in our CVS repo. I just made the repository for myself and worried about where to put it for future access later. With CVS/SVN, you have to think about this up front. And when it’s just a personal little project or a personal couple of scripts, it hardly seems worth it, even if you may want some kind of history.

Actually, that is the killer feature! By making everything local, you can just do it: make a repository, make a branch, make a radical change, take a chance! If it’s worth sharing, you can think about how to do that when the time is right. With the forced-central/always-on repository structure of CVS and SVN, you have to think about those things ahead of time: where to import this code, what should I name this branch so it doesn’t interfere with others, how can I save this very experimental work safely so I can come back to it later without impacting others, is this work big enough to merit the headaches of maintaining a branch, can I commit this change and not break the build….?

As such, those systems punish speculation. I notice this behavior in myself and in my colleagues: it’s preferred to just work for two weeks on something critical with no backup solution, no ability to share, no ability to backtrack, etc., than it is to deal with CVS. I once lost three days’ worth of work due to working like this - and it was on a project that no one else was working on or depending on! I was just doing a lot of work simultaneously and never felt comfortable committing it to CVS. And then one day, I accidentally wiped out a parent directory and lost everything.

Now, in a distributed VCS, I could have been committing and committing and could have lost everything anyway, since the local repository lives inside that same directory. But I could have made my own “central” repository on my development machine or on the network, to which I could push from time to time. I would have lost a lot less.
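As a sketch of that safety net in Git: a bare repository serves as the personal “central” copy, and the working repository pushes to it now and then. The paths, the backup remote name, and notes.txt are all hypothetical; in practice the bare repository would sit on another disk or machine rather than next to the work, as it does here for convenience.

```shell
#!/bin/sh
set -eu

root=$(mktemp -d)

# A personal "central" repository. Bare means it has no working files of its own.
git init -q --bare "$root/backup.git"

# The everyday working repository.
git init -q "$root/work"
cd "$root/work"
git config user.name "Example Dev"        # placeholder identity
git config user.email "dev@example.com"
echo "three days of work" > notes.txt
git add notes.txt
git commit -q -m "Checkpoint"
branch=$(git symbolic-ref --short HEAD)    # master or main, depending on git config

# Push to the backup from time to time.
git remote add backup "$root/backup.git"
git push -q backup "$branch"

# Simulate the accident: the whole working directory is wiped out.
cd "$root"
rm -rf "$root/work"

# Everything that was pushed can be cloned back.
git clone -q -b "$branch" "$root/backup.git" "$root/restored"
cat "$root/restored/notes.txt"
```

Only the commits made since the last push would be gone, which is a much smaller loss than three days.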

There are so many good reasons to try one of these new tools out. But I think the most important one comes down to this: just get it out of your head. Just commit the changes. Just start a local repository. Don’t create undue stress and open loops in your head about what, where, or when to import or commit something. Don’t start making copies of ‘index.html’ as ‘index1.html’, ‘index2.html’, ‘index1-older.html’, ‘old/index.html’, ‘older/index.html’ and hope that you’ll remember their relationships to each other in the future. Just do your work, commit the changes, get that stress out of your head. Share the changes when you’re ready.

It’s a much better way of working, even if it’s only for yourself.
