Git Internals: Blobs, Commits & Trees
The fuel powering our time machine...
In the previous article, we discussed the approach taken by Git to save code snapshots. Now it is time to get our hands dirty and try using the time machine. In this article, we will see how git works behind the scenes.
Note: Although we delve into the basics of git internals, the reader is expected to be familiar with basic git concepts. The intent of this article is not to teach the basics of git, but to understand how things work behind the scenes and to find the rationale behind why git is the way it is.
Three-Tiered Architecture of Git
Let us first have a look at the workflow that git follows to save the code snapshots. Let us suppose that you have a couple of hundred files in the folder of your concern (the folder where your code lies), also known as the working directory. Git is already initialized in this working directory. Git creates its database inside the working directory itself in the .git folder. This folder holds the code snapshots and is also called the git repository or simply the repository.
Out of these ~200 files, you have updated thirty files. So ideally, one will expect git to take the snapshot of the current state of code by just executing one command. In our context, the process of taking the snapshot of code is known as committing the code to the repository. Straightforward, right?
Now think of a scenario where although you have made changes to 30 files, you are sure only of 14 files, while the remaining 16 files you want to continue working on. You want to have the snapshot of code where only those 14 files have changed, and the remaining 16 files unchanged. This can be a tricky problem to solve if there is no intermediary layer between the changes that the developer has made in the working directory and committing the changes to create a snapshot.
To address this problem, Git uses a three-tiered architecture. The first step remains the same, i.e., the developer makes changes in the working directory. Once the changes are made, instead of directly committing them (i.e. taking a snapshot of all the changed files), the developer first selects only those files that they want to commit. This step is known as staging, and the selected changes are held in the staging area (also called the index). Once the intended files are staged, they can be committed. So the three tiers are the working directory, the staging area, and the repository: changes move from the working directory to the staging area with git add, and from the staging area to the repository with git commit.
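The scenario above (30 edited files, 14 of them ready) can be mimicked with a toy model. This is only a sketch: plain dictionaries stand in for git's real storage, and every file name and content string is made up for illustration.

```python
# Toy model of the three tiers. Plain dicts stand in for git's real storage,
# and all file names and contents are hypothetical.
last_commit = {f"file_{i}.txt": "old" for i in range(30)}

working_directory = {path: "edited" for path in last_commit}  # 30 edited files
staging_area = dict(last_commit)  # starts out identical to the last snapshot
repository = [last_commit]        # list of snapshots, oldest first


def stage(paths):
    """Copy the current content of the chosen files into the staging area."""
    for path in paths:
        staging_area[path] = working_directory[path]


def commit():
    """Freeze whatever the staging area holds right now as a new snapshot."""
    snapshot = dict(staging_area)
    repository.append(snapshot)
    return snapshot


ready = [f"file_{i}.txt" for i in range(14)]  # the 14 files we are sure about
stage(ready)
snapshot = commit()

changed = [p for p in snapshot if snapshot[p] != last_commit[p]]
print(len(changed), "files changed in the snapshot")
print(len(snapshot) - len(changed), "files left as they were")
```

The remaining 16 edits are still sitting in working_directory, untouched by the commit, which is exactly what the intermediary layer buys us.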
Demystifying the .git folder
We now know how the git workflow for storing a code snapshot, i.e. committing code, works. But where is the snapshot stored? Yes, you guessed it right: in the .git folder. The entire mystery lies in that folder. It is where the magic happens!
Objects
git init is used to initialize git in any folder. Once git is initialized in that folder (repository), a new folder, .git, is added to the root of the folder where the git init command was executed.
Now, let us go to the place where the magic happens: the .git folder.
The .git folder has a bunch of sub-folders, out of which the one we focus on in this article is the objects folder. All the code snapshots are stored in the objects folder. All code-related data in git is stored in the form of objects. In the subsequent sections, we discuss in detail the different types of objects in git and their significance.
Blobs
Now let us create a new file in our working directory.
➜ time-travel-with-git git:(main) touch first_file.txt
➜ time-travel-with-git git:(main) ✗ echo "Hello" >> first_file.txt
We see that by simply adding a file in the working directory, nothing changes in the git repository.
➜ time-travel-with-git git:(main) ✗ cd .git/objects
➜ objects git:(main) tree
.
├── info
└── pack
3 directories, 0 files
Therefore, we can safely conclude that changes in the working directory are not automatically synced to the git repository. Let us use the git status command (run it in the root folder, not in the .git folder) to derive some more insights.
➜ time-travel-with-git git:(main) ✗ git status
On branch main
No commits yet
Untracked files:
(use "git add <file>..." to include in what will be committed)
first_file.txt
nothing added to commit but untracked files present
(use "git add" to track)
As seen in the message above, to track the file via git, we need to stage the file. Once we stage the file using the git add first_file.txt command, let us examine the contents of the objects folder once again. Here's the output:
➜ time-travel-with-git git:(main) ✗ git add first_file.txt
➜ time-travel-with-git git:(main) ✗ cd .git/objects
➜ objects git:(main) tree
.
├── a6
│ └── cb6ee3f97c51226a1e86bda81f48021ad27584
├── info
└── pack
4 directories, 1 file
We see a new folder here, which contains a file with a weird long name. What exactly is this?
If you recall, git stores the content in compressed form and uses a SHA-1 hash (also known as the objectId) as the file name. Strictly speaking, the hash is calculated over a short header (the object type and the content size) followed by the uncompressed content, and the result is then compressed with zlib for storage. So it means that once we stage the file, git creates a snapshot of the file and stores it in the .git/objects folder. An interesting thing to note is that git does not use the full objectId as a file name directly in the objects folder. Rather, it takes the first two characters of the objectId and creates a folder with that name. The remaining 38 characters are the name of the file containing the compressed contents. Therefore, the objectId is the folder name + file name. This enables quicker look-up of objects.
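To be precise about what gets hashed: the objectId of a blob is the SHA-1 of the text "blob", a space, the content length in bytes, a NUL byte, and then the uncompressed content. Here is a minimal Python sketch of that naming scheme. It assumes a repository using git's default SHA-1 object format; the function names and the sample content are ours, not part of git.

```python
import hashlib


def blob_object_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # Header format: b"blob <size in bytes>\0", followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Split an object ID into the folder and file name under .git/objects."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


object_id = blob_object_id(b"example content\n")  # hypothetical file content
print(object_id)
print(loose_object_path(object_id))
```

You can compare the result for any file against git hash-object <file>, which prints the ID git itself would assign.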
Let us inspect the file that holds the compressed contents:
➜ objects git:(main) cd a6
➜ a6 git:(main) cat cb6ee3f97c51226a1e86bda81f48021ad27584
??W??!Qc?x??%??V^?R????ṃk?X?C51i???.???R{?prə?,???d??Fc????
As seen above, we are not able to infer any meaning from the compressed content. However, git provides us with a command to see the original content of an object: git cat-file. This command takes the object ID as input.
➜ a6 git:(main) git cat-file -p a6cb6ee3f97c51226a1e86bda81f48021ad27584
Hello
Voila! We are now able to get the raw content back from Git!
One question that arises is what type of object the current file in question is. By using the -t flag we can find out the type of object it is.
➜ a6 git:(main) git cat-file -t a6cb6ee3f97c51226a1e86bda81f48021ad27584
blob
As seen above, the contents of our files are stored in compressed form in objects known as blobs.
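What git cat-file does for a loose object can be approximated in a few lines of Python: decompress the file with zlib, then split the header from the body at the first NUL byte. The sketch below builds its own compressed bytes in memory rather than reading from a repository, and parse_loose_object is a name we made up.

```python
import zlib


def parse_loose_object(compressed: bytes):
    """Decompress a loose object and split it into (type, size, body)."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\0")  # header looks like b"blob 6"
    object_type, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return object_type.decode("ascii"), int(size), body


# In-memory stand-in for a file under .git/objects
stored = zlib.compress(b"blob 6\0Hello\n")
object_type, size, body = parse_loose_object(stored)
print(object_type)    # what `git cat-file -t` reports
print(body.decode())  # what `git cat-file -p` reports for a blob
```

To try it on a real repository, read the object file in binary mode and pass its bytes to parse_loose_object.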
However, as discussed earlier, to store the current snapshot of code, we need to commit the changes to the git repository. Before moving ahead, let us make a few random changes to the existing file and add one more file.
If you inspect the contents of the .git/objects folder, it hasn't changed yet. However, once you stage the files by using the git add command, the content of the objects folder will have changed again.
➜ time-travel-with-git git:(main) ✗ git add .
➜ time-travel-with-git git:(main) ✗ cd .git/objects
➜ objects git:(main) tree
.
├── a6
│ └── cb6ee3f97c51226a1e86bda81f48021ad27584
├── e7
│ └── e57ac6308173dec7dc1749b72c5f682aa67344
├── ef
│ └── b89ab841a90e0a67fbcdb6cad8696d0357e715
├── info
└── pack
6 directories, 3 files
One thing that you will quickly notice is that although we have two files, three objects are present in the objects directory. Any idea why?
This is because every time you stage a file, git stores a blob of its current content. When you staged first_file.txt initially, git created one blob (the one in the a6 folder). The next time, when we staged two files, git created two more blobs: one for the modified first_file.txt and one for the new second_file.txt. The original blob is neither overwritten nor deleted, so three objects remain.
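This behaviour follows from objects being keyed by a hash of their content. The sketch below imitates it with a dictionary standing in for .git/objects; the file contents other than "Hello" are hypothetical, since the article does not show the edits that were made.

```python
import hashlib

object_store = {}  # object ID -> content, standing in for .git/objects


def stage_blob(content: bytes) -> str:
    """Store content under its blob ID and return that ID."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    object_id = hashlib.sha1(header + content).hexdigest()
    object_store[object_id] = content  # same content -> same key, stored once
    return object_id


first_v1 = stage_blob(b"Hello\n")                  # first git add
first_v2 = stage_blob(b"Hello\nsome more text\n")  # edited file, staged again
second = stage_blob(b"a brand new file\n")         # the second file
print(len(object_store))  # 3 blobs for 2 files

stage_blob(b"Hello\n")    # staging identical content again adds nothing
print(len(object_store))  # still 3
```

Changing even one byte of a file yields a different ID and therefore a new object, while identical content always maps to the same object.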
Commit
Let us try committing the code.
Once the changes have been committed, git successfully creates a snapshot of the current state of the code. Let us inspect the objects folder. We see two additional objects:
➜ objects git:(main) tree
.
├── 0d
│ └── 643279d1e62540cb18cb003d9b5e12b46bd992
├── a6
│ └── cb6ee3f97c51226a1e86bda81f48021ad27584
├── cc
│ └── 3a96b5a05c666fca63d7b4acd29ce779ad0ce5
├── e7
│ └── e57ac6308173dec7dc1749b72c5f682aa67344
├── ef
│ └── b89ab841a90e0a67fbcdb6cad8696d0357e715
├── info
└── pack
8 directories, 5 files
On inspecting the type of object present in the cc folder, we find a new object type: commit
➜ objects git:(main) cd cc
➜ cc git:(main) git cat-file -t cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5
commit
Let us try to get more details about this object by using the -p flag:
➜ objects git:(main) git cat-file -p cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5
tree 0d643279d1e62540cb18cb003d9b5e12b46bd992
author Aman <dummy-author-email@gmail.com> 1703969783 +0530
committer Aman <dummy-author-email@gmail.com> 1703969783 +0530
First stop in time-travel
You can see that the commit object contains a reference to another object known as a tree. The commit object also contains the commit message and the details of the author of the commit.
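The text printed by git cat-file -p for a commit is simple enough to parse by hand: a few "key value" header lines, a blank line, and then the commit message. Below is a rough Python sketch. The parse_commit helper is our own, it ignores multi-line headers such as signatures, and the sample text is the commit shown above with the blank separator line written out explicitly.

```python
def parse_commit(text: str) -> dict:
    """Split the output of `git cat-file -p <commit>` into headers and message."""
    head, _, message = text.partition("\n\n")  # blank line ends the headers
    commit = {"parents": [], "message": message.strip()}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # a commit may have several parents
        else:
            commit[key] = value
    return commit


sample = (
    "tree 0d643279d1e62540cb18cb003d9b5e12b46bd992\n"
    "author Aman <dummy-author-email@gmail.com> 1703969783 +0530\n"
    "committer Aman <dummy-author-email@gmail.com> 1703969783 +0530\n"
    "\n"
    "First stop in time-travel\n"
)
commit = parse_commit(sample)
print(commit["tree"])
print(commit["parents"])
print(commit["message"])
```

The two trailing fields on the author and committer lines are a Unix timestamp and a time zone offset.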
Tree
If you observe carefully, the tree object referenced by the commit object is already present in the objects folder (the 0d folder in objects/). Let us examine the object.
➜ objects git:(main) git cat-file -p 0d643279d1e62540cb18cb003d9b5e12b46bd992
100644 blob efb89ab841a90e0a67fbcdb6cad8696d0357e715 first_file.txt
100644 blob e7e57ac6308173dec7dc1749b72c5f682aa67344 second_file.txt
So the tree object contains references to the blob objects of the committed files, along with each file's mode and name. But then a question arises: how does git maintain the folder hierarchy of the working directory if the tree only consists of blob objects of individual files? For example, if I add another file, third_file.txt, in a new folder called notes inside my working directory, how will git know that this file belongs to the notes folder and not the root folder?
Let us add a few more files and directories to our working directory and figure out how git handles this use case.
➜ time-travel-with-git git:(main) tree
.
├── first_file.txt
├── folder1
│   ├── fourth_file.txt
│   └── sub-folder11
│       ├── seventh_file.txt
│       └── third_file.txt
├── folder2
│ └── fifth_file.txt
├── folder3
│ └── sixth_file.txt
└── second_file.txt
5 directories, 7 files
After we stage all the files and commit these changes to the git repository, there will be a lot of objects present in the objects folder. To keep it simple, let us directly focus on the commit object:
➜ objects git:(main) cd b4
➜ b4 git:(main) git cat-file -p b4f560a353a2202e9bb18939c7c3ccbf6e479639
tree 408c5d6e7750f028dd67dd5b25cf98076e400638
parent cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5
author AmanMulani <amanmulani369@gmail.com> 1703971554 +0530
committer AmanMulani <amanmulani369@gmail.com> 1703971554 +0530
Second commit to time travel!
This commit object is a bit different from the previous one: it has a parent entry. We will come back to that later, but let us first focus on the tree object. Let us further inspect the tree object:
➜ objects git:(main) git cat-file -p 408c5d6e7750f028dd67dd5b25cf98076e400638
100644 blob efb89ab841a90e0a67fbcdb6cad8696d0357e715 first_file.txt
040000 tree 44f140066e3411499a3f1dda2a4e1c1fb1fda7e4 folder1
040000 tree a6322941fa81fde0730c1bb6cdd69edfe23332b1 folder2
040000 tree 2028d42939fff173dece6e7050245b7f8e70a919 folder3
100644 blob e7e57ac6308173dec7dc1749b72c5f682aa67344 second_file.txt
Initially, we thought that the tree only consisted of blob objects, which is partly true. But it can also contain other tree objects. So if we inspect the rest of the tree objects, we find that git maintains the folder hierarchy via the tree objects themselves: each sub-folder is represented by its own tree.
# folder1
➜ objects git:(main) git cat-file -p 44f140066e3411499a3f1dda2a4e1c1fb1fda7e4
100644 blob c6c3571f4c4a10c36f3936bc0be783d2f742742b fourth_file.txt
040000 tree 464a225e9d96b59aba367df8e9d124c95ce4549e sub-folder11
# sub-folder11
➜ objects git:(main) git cat-file -p 464a225e9d96b59aba367df8e9d124c95ce4549e
100644 blob 84cfff544bbf9ea894413243bc6e882d5b56afb5 seventh_file.txt
100644 blob 72418f23007afa1d63594d8e50215ed51b84ec03 third_file.txt
# folder2
➜ objects git:(main) git cat-file -p a6322941fa81fde0730c1bb6cdd69edfe23332b1
100644 blob f5a688aed83b1c6776885b5b05c13ba6ca3145d1 fifth_file.txt
# folder3
➜ objects git:(main) git cat-file -p 2028d42939fff173dece6e7050245b7f8e70a919
100644 blob a952388abaf3c3894d7bbab02475a0359dd7afed sixth_file.txt
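Therefore, the root tree object is, in essence, the code snapshot we have been talking about. Each commit points to one root tree, which references blobs for files and further trees for sub-folders. When we commit our changes, git creates new objects only where something changed: a modified file gets a new blob, and the trees on the path from that file up to the root are written anew. For the files that have not changed, the old references remain as they are. For instance, first_file.txt and second_file.txt point to the same blobs (efb89ab and e7e57ac) in both commits.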
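Rebuilding full file paths from these nested trees is a short recursive walk. In the sketch below, the tree contents are copied by hand from the git cat-file -p outputs above into a dictionary; real git would look each ID up in the objects folder instead. The names TREES and list_files are ours.

```python
# tree ID -> list of (mode, type, object ID, name), copied from the outputs above
TREES = {
    "408c5d6e7750f028dd67dd5b25cf98076e400638": [
        ("100644", "blob", "efb89ab841a90e0a67fbcdb6cad8696d0357e715", "first_file.txt"),
        ("040000", "tree", "44f140066e3411499a3f1dda2a4e1c1fb1fda7e4", "folder1"),
        ("040000", "tree", "a6322941fa81fde0730c1bb6cdd69edfe23332b1", "folder2"),
        ("040000", "tree", "2028d42939fff173dece6e7050245b7f8e70a919", "folder3"),
        ("100644", "blob", "e7e57ac6308173dec7dc1749b72c5f682aa67344", "second_file.txt"),
    ],
    "44f140066e3411499a3f1dda2a4e1c1fb1fda7e4": [
        ("100644", "blob", "c6c3571f4c4a10c36f3936bc0be783d2f742742b", "fourth_file.txt"),
        ("040000", "tree", "464a225e9d96b59aba367df8e9d124c95ce4549e", "sub-folder11"),
    ],
    "464a225e9d96b59aba367df8e9d124c95ce4549e": [
        ("100644", "blob", "84cfff544bbf9ea894413243bc6e882d5b56afb5", "seventh_file.txt"),
        ("100644", "blob", "72418f23007afa1d63594d8e50215ed51b84ec03", "third_file.txt"),
    ],
    "a6322941fa81fde0730c1bb6cdd69edfe23332b1": [
        ("100644", "blob", "f5a688aed83b1c6776885b5b05c13ba6ca3145d1", "fifth_file.txt"),
    ],
    "2028d42939fff173dece6e7050245b7f8e70a919": [
        ("100644", "blob", "a952388abaf3c3894d7bbab02475a0359dd7afed", "sixth_file.txt"),
    ],
}


def list_files(tree_id, prefix=""):
    """Yield (path, blob ID) for every file reachable from the given tree."""
    for mode, kind, object_id, name in TREES[tree_id]:
        if kind == "tree":
            # A sub-folder: descend into its tree, extending the path prefix.
            yield from list_files(object_id, prefix + name + "/")
        else:
            yield prefix + name, object_id


files = list(list_files("408c5d6e7750f028dd67dd5b25cf98076e400638"))
for path, blob_id in files:
    print(blob_id[:7], path)
```

Running git ls-tree -r on the commit gives a comparable listing straight from the repository.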
Parent (object type is commit)
If you recall from the example above, the second commit also has a parent entry. If you inspect the object it refers to, you will find that it is of type commit: it simply points to the previous commit. The first commit has no parent entry because nothing came before it. Having the parent commit as a part of the commit object is what allows git to track history, and it has a bunch of other use cases as well, which are not discussed in this article.
➜ objects git:(main) git cat-file -t cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5
commit
➜ objects git:(main) git cat-file -p cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5
tree 0d643279d1e62540cb18cb003d9b5e12b46bd992
author AmanMulani <amanmulani369@gmail.com> 1703969783 +0530
committer AmanMulani <amanmulani369@gmail.com> 1703969783 +0530
First stop in time-travel
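Following these parent links from the newest commit backwards is how the history is reconstructed. A minimal sketch, with the two commits from this article hard-coded into a dictionary; it assumes every commit has at most one parent, which does not hold for merge commits.

```python
# commit ID -> parent ID and message, taken from the outputs in this article
COMMITS = {
    "b4f560a353a2202e9bb18939c7c3ccbf6e479639": {
        "parent": "cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5",
        "message": "Second commit to time travel!",
    },
    "cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5": {
        "parent": None,  # the first commit has no parent
        "message": "First stop in time-travel",
    },
}


def history(commit_id):
    """Follow parent links from a commit back to the first commit."""
    while commit_id is not None:
        commit = COMMITS[commit_id]
        yield commit_id, commit["message"]
        commit_id = commit["parent"]


log = list(history("b4f560a353a2202e9bb18939c7c3ccbf6e479639"))
for commit_id, message in log:
    print(commit_id[:7], message)
```

The result is ordered newest first, much like the output of git log.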
Summary
By now, you are equipped with a strong foundational understanding of how git works. With the help of a simple repository, we have understood how git stores the snapshots of our data, and how to inspect the objects ourselves.
In the upcoming articles, we will be delving into real-time time travel from the present to the past. There will also be surprises as we learn more about how we can travel from one timeline to another via branching in the madness of this multiverse!