A love letter to Git
"The Information Manager from Hell" that we all love
The most useful tech is the stuff you take for granted but can’t live without.
Git is one of those things.
You're in a good mood, and [Git] actually works for you. Angels sing, and a light suddenly fills the room.
- Linus Torvalds, describing Git in the README
I've been in this field long enough that I experienced the transition from “should we use source control?” to “of course we use source control”. In fact, this was the first question on the famous Joel Test back in 2000.
The source control tools that came before Git were intended to let teams collaborate on large projects at a global scale. Git actually met that goal.
Previous source control tools seemed regularly hampered by an all-too-common problem: the maker of the tool was not actually a user of the tool.
This was not true of Git. Linus made Git for himself, and he in fact used Git to track Git—in addition to a little thing by the name of the Linux kernel.
He ate his own dog food.
While “use your own tools” is indeed good advice, it’s not enough to explain why Git absolutely rocketed to having >90% market share.
I see two technical choices that made the difference: (1) every repo is equally canonical, and (2) multi-file code changes were tracked with a single changeset.
Traditionally, there was one copy of the code that was “true” or “central”, one that every other system or person copied from. Generally it lived on an organizational server somewhere. It was a very command-and-control, hub-and-spoke model.
Literally everyone—both open source (SVN, CVS) and commercial (Perforce and TFS)—had centralized canonical repos as the norm.
A consequence of this choice was that if you wanted to make a new file in a repo, you had to ask the central server to make it for you before you could really write code.
With Git, however, every copy of a repo is equal. The one on my laptop is just as “canonical” as the server. The idea of there being another copy to sync up with on the regular is not assumed at all. You can just work by yourself on your own laptop with no server.
To be clear, in practice engineers often treat a repo server elsewhere as the canonical version of the software. That is, the repo that is on GitHub or GitLab or Bitbucket is the repo that everyone should take a copy from and keep in sync with.
The keyword there is should. It's a convention that people generally follow when working on teams, a handshake deal that this is a sane way to operate at scale. However, Git does not enforce it. Should you need something different, it'll support your use case.
Because no repo is privileged by default, Git accommodates many practical exceptions that people need on a routine basis.
As mentioned before, you can just start a repo locally, and not sync it anywhere for personal projects that you realized you might want to keep organized. If another repo online you had been following becomes abandoned, you can “fork” it, and make a new copy that you put up.
Since the prior norm assumed centralization, it enabled a command and control structure of file locking that corporations found attractive for managing changes. If you wanted to change code in a file, you'd lock the file for changes and make your changes. If you wanted to change 5 files, you'd lock all 5.
The exact steps varied on each of those prior tools. But practically speaking, if a coworker locked a bunch of files to work on a feature and went on vacation, you'd need to get someone to unlock the files, generally speaking a manager, in order to allow you to work on them instead. What would that mean for that person's changes on their machine? Again it depends, but losing work was common.
The new world that Git enabled was the changeset: a single commit that could include one or more files, each stored whole and identified by its SHA hash, with the commits linked together in a graph over time. Or, more simply, it tracked whole files through time. While that was the underlying model, Git presented changes to the user line by line when viewing them. This gave Git both flexibility and granularity.
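To make "stored by SHA hash" a little more concrete, here is a minimal sketch of how Git derives the ID for a file's contents. It assumes the default SHA-1 object format (Git can also create SHA-256 repos), and the function name `git_blob_id` is made up for illustration; it is not part of any Git API.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a file with these bytes.

    Hypothetical helper for illustration; assumes the SHA-1 object format.
    """
    # Git hashes a small header ("blob <size>\0") followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The ID depends only on the bytes, not on the file name or location.
readme_v1 = b"hello\n"
readme_copy = b"hello\n"
readme_v2 = b"hello, world\n"

print(git_blob_id(readme_v1) == git_blob_id(readme_copy))  # same bytes, same ID
print(git_blob_id(readme_v1) == git_blob_id(readme_v2))    # different bytes, different ID
```

Because the ID is derived from the content itself, two copies of a repo that hold the same file agree on its ID without ever asking a central server.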
The consequence of tracking changes this way was that Git enabled teams to work on the very same file without blocking each other's progress. You could both change the file and get your work reviewed many times while being entirely unaware that the other person was working on the same file.
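Here is a toy sketch of the idea that makes this possible: comparing both people's versions against the common starting point, line by line. This is not Git's actual merge algorithm. It only handles the simplified case where neither person added or removed lines, and the name `merge_in_place_edits` is made up for illustration.

```python
from typing import List, Tuple


def merge_in_place_edits(
    base: List[str], ours: List[str], theirs: List[str]
) -> Tuple[List[str], List[int]]:
    """Toy three-way merge; returns (merged lines, indexes of conflicting lines).

    Assumption: all three versions have the same number of lines.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")

    merged: List[str] = []
    conflicts: List[int] = []
    for index, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:
            merged.append(o)   # both sides agree (changed or not)
        elif o == b:
            merged.append(t)   # only they changed this line
        elif t == b:
            merged.append(o)   # only we changed this line
        else:
            merged.append(b)   # both changed it differently: keep base, flag it
            conflicts.append(index)
    return merged, conflicts


base = ["def area(w, h):", "    return w * h", "", "print(area(2, 3))"]
ours = ["def area(w, h):", "    return float(w * h)", "", "print(area(2, 3))"]
theirs = ["def area(w, h):", "    return w * h", "", "print(area(4, 5))"]

merged, conflicts = merge_in_place_edits(base, ours, theirs)
print("\n".join(merged))
print("conflicting lines:", conflicts)
```

Only when both people change the same line in different ways does a human need to step in, which is why two people can usually share a file without noticing each other.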
Now I could keep breaking down specific technical choices further to understand how they each contributed. But to me, it was these two choices that enabled most of the benefits.
Git solved the source control problem so thoroughly that people generally don’t even think about alternatives any more, and people generally like using it because of how dang flexible it is: it handles many different cases and handles them all beautifully.
Now I don't mean to mislead you: the fundamental feeling that most people have when working with Git is not an appreciation of its beauty. It's just a tool you use while you work.
In fact many people find it somewhat annoying at times, like putting on a safety harness. But if your job entails scaling large buildings, you'd be kind of foolish to not buckle up.
And that is close to the feeling that Git aims for and largely achieves: safety.
I won’t lose my code. The change I made won’t be stepped on by another person accidentally. And even if it is, we can likely get it back without redoing the work. And if I make a change that brings down production, I can undo just that change.
All of these things used to be regular concerns for developers, and now they’re for the most part gone.
A few words on Mercurial (hg). I am aware that basically everything I said above that was good about Git could also apply to Mercurial. So why didn't Mercurial "win"? Why is it that it got dropped by its most prominent hosting provider, Bitbucket, in 2020?
I don't know. I never used it.
But let me give rumors and speculation: while I've heard more than once that Mercurial was easier to get started with, I've also heard that it lacked staging functionality, which seems like a way to potentially cut yourself by accident.
For the initial part of its lifespan it reportedly suffered from performance issues due to being written in Python while Git was written in C, but I also heard from Facebook engineers who used Mercurial that it "scaled better" for them than Git because of the internals of how Mercurial handled changes.
I don't know if any of that's true anymore, though, because it seems that Meta open sourced its Mercurial replacement, Sapling, several years back. But I have also never used that one.
Thanks to all those who worked on Git. We all knew there must be a better way, and you found it. Due to your efforts, we don’t have to suffer any more collectively.
Thanks to Eric from Greenville, SC for pointing out that I previously led people to believe that Git stored diffs instead of the whole file. I clarified my language.