Demystifying Git: Branch Merging, Index and Staging

Now we know that there can be multiple timelines. These timelines can represent different features that you are working on. Ultimately, you want these features to be consolidated in a single branch.

Note: In this article we are not going to cover the types of merging like fast-forward merge, three-way merge, etc. We also won't cover concepts like squash merge or rebasing. There are plenty of great resources available for these concepts.

What we do cover in this article is the rationale behind how Git merges branches and identifies which files have changed.

Before looking at how Git merges branches, let us start with something much simpler: how does Git identify which files have changed when we stage changes?

Index

Git keeps track of what will go into the next commit by maintaining an index, also known as the staging area. The index acts as an intermediary between our working directory and the repository. When we modify a file in our working directory, Git detects the change, but the change is not automatically committed, or even staged. Only when we run git add does Git record the new version of the file in the index. The index is therefore a snapshot of every tracked file as it will appear in the next commit; immediately after a commit, it matches that commit exactly.

Git recognizes files that have changed, even when they are not staged, by using the information stored in the index file within the .git directory. For every tracked file, the index records the object ID of its staged content together with file metadata such as size and modification time, which lets Git spot modified files quickly without re-reading every file. When we run commands like git status or git diff, Git compares the files in our working directory with the index to find changes that are modified but not yet staged, and compares the index with the last commit to find changes that are staged but not yet committed. This is what allows developers to selectively stage and commit modifications.
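As a rough model, git status can be thought of as two comparisons: the last commit against the index (staged changes) and the index against the working directory (unstaged changes). The Python sketch below illustrates that idea only; it is not how Git is implemented. The three dictionaries, their made-up content IDs, and the function name classify_changes are all assumptions introduced for illustration.

```python
def classify_changes(head, index, worktree):
    """Return (staged, unstaged, untracked) path lists for three snapshots.

    Each argument is a hypothetical dict mapping a file path to a content ID.
    """
    # Staged: the index differs from the last commit (this includes deletions).
    staged = sorted(p for p in set(head) | set(index) if head.get(p) != index.get(p))
    # Unstaged: a tracked file differs between the index and the working directory.
    unstaged = sorted(p for p in index if worktree.get(p) != index[p])
    # Untracked: present in the working directory but not in the index.
    untracked = sorted(p for p in worktree if p not in index)
    return staged, unstaged, untracked


# Made-up content IDs, used only to drive the example.
head = {"first_file.txt": "aaa", "second_file.txt": "bbb"}
index = {"first_file.txt": "aaa", "second_file.txt": "ccc"}
worktree = {"first_file.txt": "ddd", "second_file.txt": "ccc", "notes.txt": "eee"}

staged, unstaged, untracked = classify_changes(head, index, worktree)
print("staged:", staged)        # staged: ['second_file.txt']
print("unstaged:", unstaged)    # unstaged: ['first_file.txt']
print("untracked:", untracked)  # untracked: ['notes.txt']
```

In this model, second_file.txt was edited and staged, first_file.txt was edited but not staged, and notes.txt has never been added.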

Let us have a look at our index file present in the .git folder.

➜  time-travel-with-git git:(main) cat .git/index
DIRe??[
?k?e??[
?k?'????︚A?
g?Ͷ??imW?first_file.txte??82V
                             6e??82V
                                    6'???????WLJ?o96?
                                                     ???Bt+folder1/fourth_file.txte??m?gGe??m?gG'??????TK????A2C?n?-[V??%folder1/sub-folder11/seventh_file.txte??&12`e??&12`'?h???rA?#z?cYM?P!^??#folder1/sub-folder11/third_file.txte??MZ)De??MZ)D'??????????;gv?[[?;??1E?folder2/fifth_file.txte??\6|??e??\6|??'??????R8???ÉM{??$u?5?ׯ?folder3/sixth_file.txte?CZ    ??Ve?CZ    ??V(?b????ez??xe?<}?+n?n?x    merge.txte??5Ԇ8e??5Ԇ8'???????z?0?s???I?,_h*?sDsecond_file.txtTREE?8 3
???
   ??݋?`g#??fI?\cfolder13 1
D?@n4I???*N????sub-folder112 0
FJ"^?????6}???$?\?T?folder21 0
?2)A????s
         ?֞??32?folder31 0
 C?)?jƦz???%

The index is a binary file, so printing it directly does not give us anything human-readable. However, we can list the files currently tracked in the index by using the following command: git ls-files --stage.

➜  time-travel-with-git git:(main) git ls-files --stage
100644 efb89ab841a90e0a67fbcdb6cad8696d0357e715 0    first_file.txt
100644 c6c3571f4c4a10c36f3936bc0be783d2f742742b 0    folder1/fourth_file.txt
100644 84cfff544bbf9ea894413243bc6e882d5b56afb5 0    folder1/sub-folder11/seventh_file.txt
100644 72418f23007afa1d63594d8e50215ed51b84ec03 0    folder1/sub-folder11/third_file.txt
100644 f5a688aed83b1c6776885b5b05c13ba6ca3145d1 0    folder2/fifth_file.txt
100644 a952388abaf3c3894d7bbab02475a0359dd7afed 0    folder3/sixth_file.txt
100644 e965047ad7c57865823c7d992b1d046ea66edf78 0    merge.txt
100644 e7e57ac6308173dec7dc1749b72c5f682aa67344 0    second_file.txt

As you can see, the index contains all the tracked files of our repository. Now let us examine one of the object IDs shown in the output.
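Each line of the git ls-files --stage output has four fields: the file mode, the object ID, a stage number (0 when the path is not in a merge conflict), and the path. Here is a minimal Python sketch of splitting one such line into those fields. The names IndexEntry and parse_stage_line are hypothetical, and the sketch assumes plain paths that Git does not need to quote.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    mode: str       # e.g. 100644 for a regular, non-executable file
    object_id: str  # ID of the blob holding the staged content
    stage: int      # 0 normally; 1, 2 and 3 while a path is in conflict
    path: str


def parse_stage_line(line: str) -> IndexEntry:
    # Three whitespace-separated fields, then the path (which may contain spaces).
    mode, object_id, stage, path = line.rstrip("\n").split(None, 3)
    return IndexEntry(mode, object_id, int(stage), path)


entry = parse_stage_line(
    "100644 efb89ab841a90e0a67fbcdb6cad8696d0357e715 0\tfirst_file.txt"
)
print(entry.mode, entry.stage, entry.path)  # 100644 0 first_file.txt
```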

➜  time-travel-with-git git:(main) git cat-file -t efb89ab841a90e0a67fbcdb6cad8696d0357e715
blob
➜  time-travel-with-git git:(main) git cat-file -p efb89ab841a90e0a67fbcdb6cad8696d0357e715
Hello, for the first time!

Each object ID in the previous output refers to a blob that holds the staged content of the corresponding file. Since a blob's ID is computed from its content, Git can use these IDs as a reference to tell whether a file in the working directory has been updated or not.
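A blob's object ID depends only on the file content, so identical content always produces the same ID. The Python sketch below reproduces the calculation for repositories that use the default SHA-1 object format: the hash covers a short header (the word blob, the content length in bytes, and a NUL byte) followed by the content itself. The function name blob_id is hypothetical.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Compute the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has a well-known ID.
print(blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Any change to the content yields a different ID, which is how a modified
# file can be told apart from the version recorded in the index.
print(blob_id(b"Hello\n") == blob_id(b"Hello!\n"))  # False
```

For a file that is stored without any content filters, git hash-object followed by the file name should print the same value.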

Bonus:
Shouldn't there be a way to see the changes introduced in commits if there is a way to compare unstaged / staged changes?
Well, there is another command in git, git show, that displays the information about a specific commit, including the changes introduced in that commit. When we run git show followed by a commit hash or a branch name, Git internally compares the specified commit with its parent commit (the previous state). It then shows the differences, or changes, between the two commits.
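Conceptually, the comparison behind git show can be pictured as a diff between two snapshots: the file tree of the commit and the file tree of its parent. The Python sketch below models each snapshot as a hypothetical dict from path to a made-up blob ID and reports which paths were added, deleted, or modified. diff_snapshots is an illustrative name, not a Git API, and real Git goes on to show line-level differences as well.

```python
def diff_snapshots(parent, commit):
    """Compare two path -> blob ID mappings and group the differences."""
    added = sorted(p for p in commit if p not in parent)
    deleted = sorted(p for p in parent if p not in commit)
    modified = sorted(p for p in commit if p in parent and commit[p] != parent[p])
    return {"added": added, "deleted": deleted, "modified": modified}


# Made-up blob IDs, used only to drive the example.
parent = {"first_file.txt": "id-1", "merge.txt": "id-2"}
commit = {"first_file.txt": "id-1", "merge.txt": "id-3", "new_file.txt": "id-4"}

changes = diff_snapshots(parent, commit)
print(changes)
# {'added': ['new_file.txt'], 'deleted': [], 'modified': ['merge.txt']}
```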

Now let us move on to the merging of branches.

Merging

The process of combining the changes from different timelines into a single timeline is called merging. We know that Git knows what changes were introduced by the commits in a given timeline. Now let us understand how Git decides which files to update, and what happens when the same file is updated in both timelines or, worse, when the same content is updated in both timelines.

When merging two branches, Git identifies the changes made on each branch since they diverged. It uses a three-way merge algorithm, which compares the common ancestor (base commit) with the tips of the two branches being merged. Let us try to understand step by step how a three-way merge takes place:

Git identifies the common ancestor commit of the branches being merged. This commit represents the last point where the branches share the same commit history.
Now git identifies the changes made on each branch independently. This involves comparing the common ancestor with the tips of the two branches to find the modifications made on each side.
The three-way merge combines the changes from the common ancestor to the branch tips. It considers the changes made on both branches and attempts to automatically reconcile them.
1. If the changes made to a file are the same on both branches (even though the commits are different), Git recognizes this as a non-conflicting scenario and keeps that change once.
2. If a file was changed on only one branch, or both branches changed different parts of it, Git performs a content-level merge and applies the changes from both branches.
3. If Git encounters conflicting changes (i.e., different changes made to the same lines of the same file), it marks those files as conflicted. Git requires manual intervention to resolve these conflicts. In this case, Git can no longer run on autopilot. Here's where the experience of the time-traveller is required.
The result of the merge is a new commit that represents the combination of changes from both branches. This merge commit has more than one parent commit (two when merging two branches), one for the tip of each branch that was merged.
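The decisions described in the steps above can be sketched at the level of whole files. The Python sketch below is a simplified model rather than Git's actual implementation: each snapshot is a hypothetical dict from path to a made-up blob ID, and three_way_merge is an illustrative name. Where both branches changed a file differently, the sketch simply reports the path; real Git first attempts a line-level merge inside the file and raises a conflict only if the same lines were touched.

```python
def three_way_merge(base, ours, theirs):
    """Merge three path -> blob ID mappings; return (merged, needs_attention)."""
    merged, needs_attention = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o   # both sides agree: unchanged, or the same change twice
        elif o == b:
            result = t   # only "theirs" changed this path
        elif t == b:
            result = o   # only "ours" changed this path
        else:
            needs_attention.append(path)  # both sides changed it differently
            continue
        if result is not None:  # None means the path was deleted
            merged[path] = result
    return merged, needs_attention


# Made-up blob IDs, used only to drive the example.
base = {"a.txt": "a1", "b.txt": "b1", "c.txt": "c1", "d.txt": "d1"}
ours = {"a.txt": "a2", "b.txt": "b1", "c.txt": "c2", "d.txt": "d2"}
theirs = {"a.txt": "a1", "b.txt": "b2", "c.txt": "c2", "d.txt": "d3"}

merged, needs_attention = three_way_merge(base, ours, theirs)
print(merged)           # {'a.txt': 'a2', 'b.txt': 'b2', 'c.txt': 'c2'}
print(needs_attention)  # ['d.txt']
```

Note that the two branch tips are never compared only with each other: every decision depends on what the base looked like, which is why the merge is called three-way.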

During this process, Git doesn't simply compare the files at the tips of the two branches with each other. Instead, it uses the commit history to find the common ancestor, and then looks at what each branch changed relative to that ancestor. Those changes are then combined and applied to the files, and the resulting state is what you see after completing the merge.
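The common ancestor that a merge starts from is found by walking the commit history, not by looking at file content. The Python sketch below shows one straightforward way to do that on a toy history. The graph dict (commit name to list of parent names) and the function names ancestors and merge_bases are assumptions made for illustration; Git's own traversal is more optimized.

```python
def ancestors(graph, commit):
    """Return the commit itself plus every commit reachable through its parents."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


def merge_bases(graph, a, b):
    """Common ancestors of a and b that are not ancestors of another common ancestor."""
    common = ancestors(graph, a) & ancestors(graph, b)
    return sorted(
        c for c in common
        if not any(c != other and c in ancestors(graph, other) for other in common)
    )


# Toy history: main is C1 <- C2 <- C3, and a feature branch forks from C2.
graph = {
    "C1": [],
    "C2": ["C1"],
    "C3": ["C2"],
    "F1": ["C2"],
    "F2": ["F1"],
}

print(merge_bases(graph, "C3", "F2"))  # ['C2']
```

On a real repository, the git merge-base command reports this commit for two given branches.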

Summary

So, the next time you wonder why some changes are unintentionally showing up in the PRs that you have raised against your parent / other branches, do remember that Git is not only about comparing content; it is also about comparing commits and where those commits are placed in the history of our timeline!

In the next article, we will be having a look at how to recover from the disastrous event of resetting our timeline, and why the recovery is possible!