How git internally stores the code changes (version control system)

In the previous article, we saw the high-level architecture of our time machine, git. However, that bit of knowledge alone doesn't get you started with time travelling. To master time travel, we need a thorough understanding of how the time machine works.

Problem Statement

First, let us try to understand what the time machine should be capable of:

The tool should be able to save code snapshots of the given repository (folder).
The tool should be able to traverse these code snapshots seamlessly.

By snapshots, we mean the current state of the code.

A Stab at Design

We can easily conclude that each snapshot needs a unique identifier. So whenever we save a snapshot, we should get back a unique identifier associated with it, which can be used to travel back to that snapshot.

Compressing Files

Since we are working with files, the disk space needed to store code snapshots can grow drastically over time. Disk space by itself isn't a big concern, but as the history grows, so does the time needed to retrieve those snapshots. To mitigate this, let us compress the content of our files when we store a snapshot. We can easily decompress the content when retrieving it.
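The compress-on-save, decompress-on-load idea can be sketched in a few lines of Python. This is only an illustration: the function names `compress_snapshot` and `decompress_snapshot` are made up for this article, and zlib is used because it ships with Python (git also relies on zlib for its stored objects).

```python
import zlib


def compress_snapshot(content: bytes) -> bytes:
    """Compress raw file content before it is stored (hypothetical helper)."""
    return zlib.compress(content)


def decompress_snapshot(blob: bytes) -> bytes:
    """Restore the original file content from its compressed form."""
    return zlib.decompress(blob)


# Repetitive text, like most source code, compresses well.
source = b"print('hello, time machine')\n" * 100
packed = compress_snapshot(source)

# The round trip must give back exactly what we stored.
assert decompress_snapshot(packed) == source
print(len(source), "bytes ->", len(packed), "bytes")
```

The round-trip assertion is the important part: compression is only useful here if retrieving a snapshot returns byte-for-byte what was saved.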

Hashing the compressed contents

Compressing files addresses the problem of storage, but what about the unique identifiers? The first thing that usually comes to mind when we think about a unique identifier is hashing, and we will use the same concept here. To generate an identifier for a snapshot, we simply hash the compressed content using SHA1 (since git uses the same algorithm). Because the hash depends only on the content, the same content always yields the same identifier, while different content yields a different one for all practical purposes. The resulting SHA1 can be used to travel back to this code snapshot.
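Here is a minimal sketch of that identifier, assuming the simplified design described in this article (hash the compressed bytes). The name `snapshot_id` is hypothetical. Note that real git computes its hash over a short header plus the uncompressed content rather than over the compressed bytes, so the values below will not match what git prints.

```python
import hashlib
import zlib


def snapshot_id(content: bytes) -> str:
    """Return a 40-character hex SHA1 of the compressed content.

    Follows this article's simplified design, not git's exact scheme.
    """
    compressed = zlib.compress(content)
    return hashlib.sha1(compressed).hexdigest()


first = snapshot_id(b"print('hello')\n")
again = snapshot_id(b"print('hello')\n")
changed = snapshot_id(b"print('hello, world')\n")

print(first)
print(first == again)    # same content, same identifier
print(first == changed)  # different content, different identifier
```

Hashing the compressed bytes ties the identifier to the compressor's output, which is one reason a production tool would rather hash the original content.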

Storage Mechanism

Now that we have a way to uniquely identify code snapshots and store them compactly, let us go one layer deeper and look at how to save this data and keep track of it. In other words, we now have to design how the output of our time machine is saved to, and retrieved from, client-side storage.

A few constraints to keep in mind: the client-side tool has to be lightweight, with a minimal number of extra dependencies, and it should be highly portable. These non-functional requirements more or less rule out client-side databases. Can we think of something simpler, something the operating system already provides out of the box?

Aha! Can we store the compressed file contents in files and keep the SHA1 identifiers in other files for quick look-up? This seems like a good deal. Or, at least, this is what Linus Torvalds, the inventor of the time machine git, thought. He leveraged the Unix-like file and directory structure to store the compressed content.

Final Outcome

VCS Client Side Code Snapshot Storing Mechanism

So we first compress the contents of a file and store the result in a separate file. The name of that file is the SHA1 of its content. To make look-ups easier, this SHA1 is also recorded in another file that serves as a reference. If you are familiar with the term inverted index, this might sound similar in principle. This enables easy storage and retrieval of code snapshots via the VCS tool!
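Putting the pieces together, the whole mechanism fits in a short Python sketch. Everything named here is an assumption made for illustration: the `objects` and `refs` directory names, the `save_snapshot` and `load_snapshot` helpers, and the idea of one reference file per snapshot name. It mirrors the design in this article, not git's actual on-disk format.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def save_snapshot(store: Path, name: str, content: bytes) -> str:
    """Compress content, store it under its SHA1, and record the SHA1 under `name`."""
    compressed = zlib.compress(content)
    digest = hashlib.sha1(compressed).hexdigest()

    # The compressed content lives in a file named after its own hash.
    objects = store / "objects"  # hypothetical directory name
    objects.mkdir(parents=True, exist_ok=True)
    (objects / digest).write_bytes(compressed)

    # A small reference file maps a readable name to that hash.
    refs = store / "refs"  # hypothetical directory name
    refs.mkdir(parents=True, exist_ok=True)
    (refs / name).write_text(digest)
    return digest


def load_snapshot(store: Path, name: str) -> bytes:
    """Look up the SHA1 recorded for `name` and return the decompressed content."""
    digest = (store / "refs" / name).read_text().strip()
    return zlib.decompress((store / "objects" / digest).read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    save_snapshot(store, "first", b"print('v1')\n")
    save_snapshot(store, "second", b"print('v2')\n")

    # Travel back to the first snapshot.
    print(load_snapshot(store, "first"))
```

Only the standard library and the file system are needed, which is exactly the lightweight, portable property we were after.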

Summary

Congratulations! You have designed a very basic version of git, the time machine! Without a doubt, git uses more sophisticated ways to store and keep track of the timeline. But worry not! In the next article, we will look at how git actually implements this: the power of commit, blob and tree!
