Git Internals: Understanding the Time Travel
Why git is the way it is...
In this article, we will explore why git is designed the way it is, and why we refer to it as a time machine. It is assumed that you have a fair idea of what a version control system (VCS) is.
Fundamentals
As always, let us try to understand the fundamentals. To do so, we need to understand why git came into existence in the first place. Git is a distributed version control tool, which means that developers can make changes to the code in their local systems, and once done, they can push the changes to a central collaborative repository.
This is the major problem that git solves: it enables each developer to maintain several different versions of the code locally on their system, while being able to pull new changes from the remote repository on the central server and push their own changes back to it.
Problem Statement
To get a better understanding, let us imagine that you are tasked with designing a tool that can move you back to any given code snapshot and also lets you share it with other developers.
By snapshot, we mean any particular version of code that you want your tool to be able to go back to. Consider a website development scenario: after the initial version (v1), the code is modified (v2), and you then need to return to the previous snapshot (v1). You should be able to do this with your tool. You should also be able to share a snapshot of any state of your code with other developers.
Solution 1
Now let us try to understand the components needed to design such a tool. Since we have to maintain the code snapshots, it is obvious that we need some storage mechanism, i.e. a database to store our code changes. We want to enable code sharing, and therefore we want the storage to be remote, central storage (don't even think about peer-to-peer connections!).
Just having central remote storage won't help. To maintain connections with multiple clients, keep the client-side tool lightweight, ensure a secure connection, and enforce role-based access control, we also need a remote server that interfaces between the clients and the central storage.
Workflow
The development flow entails a developer making changes and capturing a code snapshot, which the tool compresses and automatically syncs to the remote server.
Every time a new snapshot is created by the developer, it is directly saved to the remote server.
The client side maintains the current code version available on the remote server, excluding local modifications.
Even after having this structured approach, scalability concerns arise with the accumulation of data from numerous changes and snapshots, especially with a sizable number of developers.
The performance bottleneck comes from storing compressed data for every file in each snapshot. For example, suppose our tool takes the first snapshot (v1) of a codebase containing 1000 files and sends the compressed data of all 1000 files to the central repository. Now, even if the second version changes only 2 lines of code, the tool still sends the compressed data of all 1000 files.
So a better solution would be to store only what has changed, from v1 onward. The first snapshot, v1, will contain all the file contents, while the subsequent snapshots (v2, v3, v4, etc.) will be deltas, i.e. only the contents that have changed. This optimizes both performance and storage. Let's introduce some mathematical terms:
Initial Snapshot (v1):
$$S(v1)=C(1)+C(2)+…+C(1000)$$
Where S(v1) is the snapshot of version 1, and C(n) refers to the compressed data of the given file number
Subsequent Snapshots (v2, v3, ...):
$$Δ(v1,v2)=C(modified)$$
Where Δ is the change of file contents from v1 to v2.
By storing only the delta, representing what has changed in each file from v1 onward, we optimize both performance and storage. See, how simple it is to time travel from one version to another!
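The delta model above can be sketched in a few lines of Python. This is a deliberately simplified illustration (the `checkout` function and the dict-of-files representation are hypothetical; real delta-based tools like SVN use binary diffs), but it shows the key property: reaching any version means replaying every delta from v1 onward.

```python
# Full first snapshot: every file's contents.
v1 = {"index.html": "<h1>v1</h1>", "style.css": "body {}"}

# Each delta stores only the files that changed in that version.
deltas = [
    {"index.html": "<h1>v2</h1>"},          # Δ(v1, v2)
    {"style.css": "body { color: red; }"},  # Δ(v2, v3)
]

def checkout(version):
    """Rebuild a snapshot by replaying deltas from v1 onward.

    Note the linear replay: the cost of reaching version N grows
    with the length of the history.
    """
    state = dict(v1)
    for delta in deltas[:version - 1]:
        state.update(delta)
    return state
```

Calling `checkout(3)` applies both deltas in order, reconstructing v3; calling `checkout(1)` returns v1 unchanged. The storage savings are clear, but so is the cost we will run into shortly: there is no way to jump to v3 without passing through v2.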
Congratulations! You have just designed a tool similar to SVN.
Problems
This seems to be ideal but it is a nightmare for teams working on a decent-sized project. Problems?
Internet Dependency
The developer always needs to be connected to the internet. Since snapshots can only be saved on the remote server, developers struggle when they are offline or have a slow network connection. This hampers the developer's productivity.
Linear Snapshot Retrieval
Since only the delta changes are stored, accessing a specific snapshot requires reconstructing the code's state sequentially from the initial version. This is necessary because each subsequent snapshot builds upon the previous one. Retrieving snapshots linearly ensures the integrity of the code history, but it becomes less efficient as the project grows in size and complexity.
Single Point of Failure
If the remote server goes offline, all the developers will be stuck with their current local copy and won't be able to save any snapshots.
Although our system is optimized in terms of computer resources, it fares very poorly in terms of developer experience and productivity.
In our analogy, it means that if you have to travel in the past, you must always start from the beginning of time itself, the very first big bang! Not a pleasant experience for the time-travellers!
Solution 2
Developer time is more expensive than local disk storage.
Since developer time is more expensive than local disk storage, we can introduce client-side storage. With client-side storage, we need an internet connection only when we have to push data to or fetch data from the remote repository.
Since we consider storage cheaper than productive time, we can store the entire snapshot of the code instead of a delta. At any given instant, the developer can push snapshots to the remote server, which maintains them so that other developers can later pull them. With some smart compression and retrieval techniques, we can make the system performant. So here we go, we have the following three components:
Client-side storage, client tool, remote server.
Workflow
The developer changes the code stored in a folder (repository). This folder is registered with the tool we use, and thus there is a remote copy of this repository.
Multiple developers take a copy of this repository onto their local machines. Each developer can push their snapshot of code to the remote repository. Developers can make changes and store the required snapshots in client-side storage. Internet connectivity is not required for creating snapshots; it is only required to push snapshots to the remote server.
Since we store all the contents of the code with a given snapshot, it becomes easier to jump back to any version, without the need to calculate it linearly from the initial version.
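One way to make "store the full snapshot every time" cheap is content addressing: store each compressed file under a hash of its contents, so identical files are stored only once across versions. The sketch below is an assumption-laden illustration (the names `save_snapshot` and `checkout` are hypothetical), though git's object database works on a similar principle. Note that `checkout` jumps straight to any version, with no linear replay.

```python
import hashlib
import zlib

store = {}      # object database: content hash -> compressed blob
snapshots = {}  # version name -> {filename: content hash}

def save_snapshot(version, files):
    """Record a full snapshot, deduplicating unchanged files by hash."""
    manifest = {}
    for name, content in files.items():
        h = hashlib.sha1(content.encode()).hexdigest()
        store.setdefault(h, zlib.compress(content.encode()))  # stored once
        manifest[name] = h
    snapshots[version] = manifest

def checkout(version):
    """Jump directly to any version -- no replay from v1 required."""
    return {name: zlib.decompress(store[h]).decode()
            for name, h in snapshots[version].items()}

save_snapshot("v1", {"index.html": "<h1>v1</h1>", "style.css": "body {}"})
save_snapshot("v2", {"index.html": "<h1>v2</h1>", "style.css": "body {}"})
```

After the two saves above, `style.css` is unchanged between versions, so the store holds only three blobs for four file entries; "full snapshots" stay affordable even as history grows.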
Multiple developers can continue to work on their local copies even if the central remote server is down.
Since multiple copies of the snapshots exist on each local machine and can eventually be synced back with the remote server, we can call this a loosely coupled, distributed version control system that empowers developers and improves their productivity.
Congratulations, you have designed a very raw version of git!
Summary
In principle, we have designed a time machine, which can take us back in time to any snapshot by just knowing where we want to land (the version number). This is what git is. A time machine which can take us back to the dinosaur age of our code, and then bring us efficiently back to the current modern age (even if it is no better than the Stone Age. Pun intended!)
There are quite a few drawbacks in git as well, which are easily available on the internet if you want to read through them! Although in this post we have designed the most basic and scrappy version of git, in the upcoming posts, we will delve into the details of the internal workings of git and build upon these foundations! Looking forward to meeting you again, time-traveller!