<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[ArcheTech with Aman]]></title><description><![CDATA[Experience what it means to delve deep into software!]]></description><link>https://tech.amanmulani.com</link><image><url>https://cdn.hashnode.com/uploads/logos/658e754e0bcdc5c057e05fc5/04713fc9-f39f-4c79-a065-09231d9360ad.png</url><title>ArcheTech with Aman</title><link>https://tech.amanmulani.com</link></image><generator>RSS for Node</generator><lastBuildDate>Wed, 22 Apr 2026 14:35:49 GMT</lastBuildDate><atom:link href="https://tech.amanmulani.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[What If a Hash Function Wanted Collisions?]]></title><description><![CDATA[I was reading about universal hashing last week. You know, the standard stuff: pick random coefficients a and b, compute h(x) = (ax + b) mod p, and you get a family of hash functions where collisions ]]></description><link>https://tech.amanmulani.com/what-if-a-hash-function-wanted-collisions</link><guid isPermaLink="true">https://tech.amanmulani.com/what-if-a-hash-function-wanted-collisions</guid><dc:creator><![CDATA[Aman Mulani]]></dc:creator><pubDate>Sat, 14 Mar 2026 15:19:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/658e754e0bcdc5c057e05fc5/4a71a550-cecd-4591-829e-fb15bf1981ae.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I was reading about universal hashing last week. You know, the standard stuff: pick random coefficients a and b, compute <code>h(x) = (ax + b) mod p</code>, and you get a family of hash functions where collisions are provably rare. It is one of those foundational ideas that shows up everywhere, from hash tables to load balancing to streaming algorithms.</p>
<p>And then a thought hit me. What if I wanted the exact opposite?</p>
<p>What if, instead of scattering keys as far apart as possible, I <em>wanted</em> similar things to land in the same bucket?</p>
<p>That question turns out to have a name: Locality-Sensitive Hashing (LSH). And it is the trick behind some pretty significant systems.</p>
<h2>The Problem That Started All of This</h2>
<p>Think about what Google had to deal with in, say, 2007. Billions of web pages. A huge chunk of them are duplicates or near-duplicates. Mirror sites, scraped content, syndicated articles, press releases copy-pasted across a hundred news outlets. If you don't catch these, your search index bloats, your rankings get polluted, and users see the same article five times on the first page.</p>
<p>The naive approach is to compare every pair of documents. If you have a million documents, that is about 500 billion comparisons. Even if each comparison takes a microsecond, you are looking at roughly 6 days of compute. For a billion documents, the math just falls apart entirely.</p>
<p>Google needed a way to find near-duplicates without comparing every pair. And they needed it to work on a scale where even O(n log n) felt expensive.</p>
<p>This is also exactly the problem plagiarism detection tools face. Turnitin, Copyscape, and similar services need to take a submitted essay and check it against millions of existing documents. They can't afford to do a full text comparison against every document in their database. They need a shortcut.</p>
<h2>The Core Insight</h2>
<p>Here is the key idea behind LSH, and I think it is genuinely clever.</p>
<p>A regular hash function tries to make <code>h("cat") != h("bat")</code> even though the inputs differ by one character. That is the whole point of hashing. Uniform distribution, minimal collisions, all of that.</p>
<p>An LSH hash function does the opposite. It is <em>designed</em> so that similar inputs are likely to produce the same hash value. Not guaranteed. Likely. And you can tune exactly how likely.</p>
<p>This gives you a filter. Instead of comparing all pairs, you hash everything, and only compare documents that ended up in the same bucket. Most of the truly dissimilar pairs never even get looked at.</p>
<p>The tradeoff is that you might miss some similar pairs (false negatives) and you might flag some dissimilar pairs (false positives). But you can control both of those error rates by tuning parameters. And you go from O(n^2) comparisons to something much closer to O(n).</p>
<h2>Two Flavors of the Same Idea</h2>
<p>There are two main algorithms I studied that fall under the LSH umbrella, and they solve slightly different problems. (Ref: Image generated by Google's Gemini)</p>
<img src="https://cdn.hashnode.com/uploads/covers/658e754e0bcdc5c057e05fc5/a7f25626-817e-4179-b993-cab816de5d1b.png" alt="Gemini Generated Image for LSH" style="display:block;margin:0 auto" />

<h3>MinHash: "Did someone copy-paste this?"</h3>
<p>MinHash is built for detecting structural overlap. It works with sets, and it estimates something called Jaccard similarity, which is just "what fraction of elements do these two sets share?"</p>
<p>The setup works like this. You take a document and break it into overlapping character sequences called shingles. The sentence "the cat sat" with shingle size 5 becomes {"the c", "he ca", "e cat", " cat ", "cat s", "at sa", "t sat"}. Now your document is a set, and you can ask set overlap questions.</p>
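<p>As a concrete sketch (a minimal, hypothetical implementation of my own, not code from any particular library), shingling and exact Jaccard similarity fit in a few lines:</p>

```python
def shingles(text, k=5):
    # All overlapping k-character substrings, e.g. "the cat sat" with k=5
    # yields {"the c", "he ca", "e cat", " cat ", "cat s", "at sa", "t sat"}.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    # Fraction of the combined shingle pool that both documents share.
    return len(a & b) / len(a | b)
```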
<p>The problem is that these sets are huge. A typical article might produce tens of thousands of shingles. Comparing two sets directly is fine, but comparing all pairs is not.</p>
<p>MinHash compresses each set into a short signature (say, 128 numbers) with a nice property: the probability that two signatures agree at any position equals the Jaccard similarity between the original sets. So you compare 128 numbers instead of tens of thousands of shingles.</p>
<p>This is not a heuristic. There is a clean theorem behind it. If you pick a random hash function h and look at the minimum hash value across all elements of a set, then:</p>
<pre><code class="language-markdown">P( min_hash(A) = min_hash(B) ) = |A ∩ B| / |A ∪ B|
</code></pre>
<p>The intuition is straightforward. Pool all shingles from both documents together, randomly order them by h, and look at whichever shingle lands first. That shingle is shared by both documents with probability equal to the fraction of the pool that is shared. That fraction is exactly the Jaccard similarity. Each position in the signature is an independent trial of this experiment, so averaging over 128 positions gives you a tight estimate.</p>
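<p>To make the theorem concrete, here is a toy MinHash sketch. This is my own illustrative code, not a production implementation: I salt Python's built-in <code>hash</code> to stand in for a proper universal hash family, and use small integer sets in place of shingle sets.</p>

```python
import random

def minhash_signature(elements, num_hashes=128, seed=42):
    rng = random.Random(seed)
    # One random salt per position; hashing (salt, x) approximates drawing
    # an independent random hash function for each signature position.
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in elements) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    # Fraction of signature positions that agree ≈ Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

s1 = set(range(8))        # {0..7}
s2 = set(range(4, 16))    # {4..15}; true Jaccard = 4/16 = 0.25
est = estimate_jaccard(minhash_signature(s1), minhash_signature(s2))
```

<p>With 128 positions the estimate lands close to 0.25, with a standard error of roughly <code>sqrt(p(1-p)/128) ≈ 0.04</code>.</p>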
<p>And here is where the LSH part comes in. You don't even compare all pairs of signatures. You split each 128-number signature into bands (say 16 bands of 8 numbers each) and hash each band separately. Two documents become candidates only if they match on at least one band. The math works out so that pairs with high similarity almost always match on at least one band, while pairs with low similarity almost never do.</p>
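<p>A minimal version of the banding trick might look like this (again a hypothetical sketch of my own; signatures are lists of numbers, documents are keyed by id):</p>

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, bands=16, rows=8):
    # signatures: dict of doc_id -> 128-number MinHash signature.
    # Two docs become a candidate pair if any band of `rows` numbers matches.
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[band].append(doc_id)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates

sigs = {"a": [1] * 128, "b": [1] * 128, "c": list(range(128))}
print(lsh_candidates(sigs))  # → {('a', 'b')}
```

<p>A pair with Jaccard similarity s agrees on one band of 8 with probability <code>s^8</code>, and on at least one of 16 bands with probability <code>1 - (1 - s^8)^16</code>: the S-curve that produces the sharp threshold described below.</p>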
<p>The result is a system with a sharp threshold. Below some similarity level, pairs are almost never detected. Above it, they almost always are. And you can tune exactly where that cutoff sits by adjusting how many bands you use and how many numbers go in each band.</p>
<p>This is what plagiarism detectors use. If someone copy-pastes a paragraph and changes a few words, the shingle sets still overlap heavily, MinHash catches the overlap, and the LSH index surfaces the pair without having to scan the whole database.</p>
<h3>SimHash: "Is this the same article in different words?"</h3>
<p>SimHash solves a different problem. It was developed by Moses Charikar, and Google used a variant of it for web-scale deduplication. Where MinHash asks "do these documents share literal text?", SimHash asks "do these documents have similar content?" even if the wording is completely different.</p>
<p>The idea is to compress an entire document into a single 64-bit number (a fingerprint) such that similar documents produce fingerprints that differ in very few bits.</p>
<p>How you get there: take each word in the document, weight it by importance (common words like "the" get low weight, distinctive words get high weight), hash each word to a 64-bit value, and then do a weighted vote across all 64 bit positions. If more weight voted for a 1 in position k, the fingerprint gets a 1 there. Otherwise it gets a 0.</p>
<p>Two documents about the same topic, using similar vocabulary, will have fingerprints that differ in maybe 2 or 3 bits out of 64. Two completely unrelated documents will differ in around 32 bits. You measure similarity by counting the number of differing bits (Hamming distance).</p>
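<p>Here is a toy SimHash, using a truncated MD5 as an arbitrary stand-in for the per-word 64-bit hash and hand-picked weights (both are my illustrative choices, not requirements of the algorithm):</p>

```python
import hashlib

def simhash(weighted_terms, bits=64):
    # weighted_terms: dict of word -> importance weight.
    votes = [0.0] * bits
    for term, weight in weighted_terms.items():
        h = int(hashlib.md5(term.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            # Each word votes +weight or -weight at every bit position.
            votes[i] += weight if (h >> i) & 1 else -weight
    fp = 0
    for i, v in enumerate(votes):
        if v > 0:            # majority (by weight) wins the bit
            fp |= 1 << i
    return fp

def hamming(a, b):
    return bin(a ^ b).count("1")

doc_a = {"locality": 3.0, "sensitive": 3.0, "hashing": 3.0, "the": 0.2}
doc_b = {"locality": 3.0, "sensitive": 3.0, "hashing": 3.0, "a": 0.2}
```

<p>Because the heavy, distinctive words dominate the vote, swapping out only the low-weight stopword leaves the two fingerprints within a tiny Hamming distance of each other.</p>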
<p>Again, there is a theorem that makes this precise. It comes from a result by Goemans and Williamson, originally in a different context, but the relevant part is this: if you represent two documents as weighted vectors in some high-dimensional feature space, then the probability that a random hyperplane hash disagrees at a single bit position is:</p>
<pre><code class="language-markdown">P( bit_i(A) ≠ bit_i(B) ) = θ(A, B) / π
</code></pre>
<p>where θ is the angle between the two document vectors. Small angle means similar documents, and similar documents will disagree on very few bits. The expected Hamming distance between two fingerprints is <code>64 × θ / π</code>. So Hamming distance is not just a rough proxy for similarity. It is a direct, linear function of the angle between the original document representations.</p>
<p>For the index, you use the pigeonhole principle. Split the 64-bit fingerprint into 4 blocks of 16 bits. If two fingerprints differ in at most 3 bits total, then at least one of the four blocks must be identical between them. So you build 4 separate indexes (one per block), and two documents become candidates if they match on any block.</p>
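<p>The block index is small enough to sketch directly (hypothetical code; 4 blocks of 16 bits, matching the layout described above):</p>

```python
from collections import defaultdict

BLOCKS, WIDTH = 4, 16
MASK = (1 << WIDTH) - 1

def build_indexes(fingerprints):
    # One hash index per 16-bit block of each 64-bit fingerprint.
    indexes = [defaultdict(list) for _ in range(BLOCKS)]
    for doc_id, fp in fingerprints.items():
        for b in range(BLOCKS):
            indexes[b][(fp >> (b * WIDTH)) & MASK].append(doc_id)
    return indexes

def candidates(fp, indexes):
    # Any document sharing at least one exact block is a candidate.
    out = set()
    for b in range(BLOCKS):
        out.update(indexes[b].get((fp >> (b * WIDTH)) & MASK, []))
    return out

fps = {"a": 0,                                   # all-zero fingerprint
       "b": 1,                                   # 1 bit away from "a"
       "c": (1 << 63) | (1 << 40) | (1 << 20) | (1 << 5)}  # differs in every block
idx = build_indexes(fps)
# candidates(fps["a"], idx) contains "b" but not "c".
```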
<p>Google used this (and variations of it) to deduplicate their web crawl. When you are indexing billions of pages and a huge fraction of them are just copies of the same content with different ads wrapped around it, SimHash lets you identify the duplicates in time proportional to the number of documents rather than the number of pairs.</p>
<h2>Why Not Just Use Embeddings?</h2>
<p>This is a fair question. Today, if you wanted to find semantically similar documents, you would probably reach for a sentence embedding model, throw the vectors into a vector database, and call it done.</p>
<p>But consider the constraints Google faced in 2007. Transformer models did not exist. Word2Vec was still six years away. The embedding models we take for granted today simply were not available.</p>
<p>Even today, embeddings are not always the right tool. They are expensive to compute, they require GPU infrastructure, and the vectors are large (768 or 1536 floats per document). SimHash gives you a 64-bit integer. That is 8 bytes. You can fit a billion SimHash fingerprints in about 8 GB of RAM. Try doing that with embedding vectors.</p>
<p>For the specific problem of "is this document a near-copy of something I have already seen?", LSH methods are still hard to beat on cost and simplicity. They don't require a model, they don't require GPUs, and they don't require a vector database. You need a hash function and some hash tables.</p>
<h2>Building It for Real</h2>
<p>I ended up building both pipelines from scratch to really understand how the pieces fit together. A MinHash-LSH pipeline for plagiarism detection, a SimHash-Pigeonhole pipeline for semantic deduplication, both backed by a custom on-disk hash index (because if you are already going down this rabbit hole, why stop at the algorithm and not also build the storage layer?).</p>
<p>I tested it against 10,000 documents: a mix of real news articles, synthetically mutated copies at various similarity levels, and noise documents. The system picks up copy-paste plagiarism with over 95% recall, and the on-disk index handles single-document queries in under 5ms.</p>
<p>The thing that surprised me most was how well the banded LSH approach works as a filter. Out of the roughly 50 million possible document pairs in a 10k corpus, the index narrows it down to a few thousand candidates. The verification step (which does the expensive exact comparison) only runs on those candidates. That is the real win. Not the hashing itself, but the fact that hashing lets you skip 99.99% of the work.</p>
<h2>The Takeaway</h2>
<p>I started with a silly question about inverting hash functions and ended up understanding how some of the largest-scale data processing systems actually work.</p>
<p>The core idea is worth remembering: sometimes the best way to solve a search problem is to design a hash function that is deliberately bad at being a hash function. Instead of scattering similar items apart, you scatter them <em>together</em>. And then you just look inside the buckets.</p>
<p>If you want to dig into the code, I have put the full implementation (with the custom disk-based index and all the benchmarks) on GitHub. However, given the coding agents at our disposal, I would rather ask you to give it a try yourself. If you are still interested in my implementation (which I generated using Claude), feel free to comment. Cheers! 🥂</p>
]]></content:encoded></item><item><title><![CDATA[The Halting Problem]]></title><description><![CDATA[Lately, I've been diving deep into one of computer science's most fundamental unsolvable problems. Yes, you read that right, unsolvable! Not just difficult, not just complex, but mathematically proven to be impossible to solve in the general case.
Wh...]]></description><link>https://tech.amanmulani.com/the-halting-problem</link><guid isPermaLink="true">https://tech.amanmulani.com/the-halting-problem</guid><category><![CDATA[Computer Science]]></category><category><![CDATA[halting-problem]]></category><category><![CDATA[Orchestration]]></category><category><![CDATA[Python]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Aman Mulani]]></dc:creator><pubDate>Sun, 15 Jun 2025 09:00:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/YKEjw5l-kNQ/upload/635be719e1916915bc336b6fa0f7fde3.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Lately, I've been diving deep into one of computer science's most fundamental unsolvable problems. Yes, you read that right, <em>unsolvable</em>! Not just difficult, not just complex, but mathematically proven to be impossible to solve in the general case.</p>
<h2 id="heading-whats-this-halting-problem-anyway">What's This Halting Problem Anyway?</h2>
<p>In a nutshell, the halting problem asks: "Can we write a program that can determine whether any arbitrary program will eventually finish running or continue forever?" Sounds reasonable, right? We should be able to analyze code and figure out if it'll terminate; after all, both humans and computers are apparently intelligent, so we should be able to predict the outcome of what we have written!</p>
<p>Turns out, nope. Not possible. And this isn't just me being pessimistic, this was proven by Alan Turing back in 1936. The proof is beautiful in its simplicity, and we can demonstrate it using Python.</p>
<h2 id="heading-the-python-paradox-seeing-is-believing">The Python Paradox: Seeing is Believing</h2>
<p>Let's imagine we have a magical function called <code>halt(program, input)</code> that can determine if a program will halt when given specific input. It returns <code>True</code> if the program terminates, and <code>False</code> if it runs forever.</p>
<p>Now, watch this paradox unfold:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">paradox</span>(<span class="hljs-params">program_code</span>):</span>
    <span class="hljs-comment"># Assuming we have a magical halt function that works</span>
    <span class="hljs-keyword">if</span> halt(program_code, program_code):
        <span class="hljs-comment"># If halt predicts the program will terminate...</span>
    <span class="hljs-comment"># For all practical purposes, when testing you can just use a timeout here instead.</span>
        <span class="hljs-keyword">while</span> <span class="hljs-literal">True</span>:  <span class="hljs-comment"># Run forever!</span>
            <span class="hljs-keyword">pass</span>
    <span class="hljs-keyword">else</span>:
        <span class="hljs-comment"># If halt predicts the program will run forever...</span>
        <span class="hljs-keyword">return</span>  <span class="hljs-comment"># Terminate immediately!</span>

<span class="hljs-comment"># Now what happens when we run: paradox(paradox)?</span>
</code></pre>
<p>Do you see the contradiction? If our hypothetical <code>halt</code> function says <code>paradox(paradox)</code> will terminate, then the function enters an infinite loop (making <code>halt</code> wrong). If <code>halt</code> says it won't terminate, then the function returns immediately (also making <code>halt</code> wrong).</p>
<p>This simple example proves that our magical <code>halt</code> function cannot exist. It's not just that we haven't figured out how to implement it - it's mathematically impossible.</p>
<h2 id="heading-implementation-details">Implementation Details</h2>
<p>I won’t go deep into the implementation details; instead, here is the test file in which I mocked the halt function to return the desired output. This lets us demonstrate the paradox without actually implementing anything in the halt function.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> unittest
<span class="hljs-keyword">from</span> unittest.mock <span class="hljs-keyword">import</span> patch

<span class="hljs-keyword">from</span> paradox <span class="hljs-keyword">import</span> paradox_self_reference
<span class="hljs-keyword">from</span> banner <span class="hljs-keyword">import</span> banner, conclusion

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">TestHaltingProblem</span>(<span class="hljs-params">unittest.TestCase</span>):</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_halting_mock_true</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-string">"""
        Simulate halts(paradox) == True → paradox() should loop (i.e., timeout).
        """</span>
        <span class="hljs-keyword">with</span> patch(<span class="hljs-string">'halt.halt'</span>, return_value=<span class="hljs-literal">True</span>) <span class="hljs-keyword">as</span> mock_halt:
            halted = paradox_self_reference(mock_halt)
            self.assertFalse(halted, <span class="hljs-string">"Expected paradox() to infinitely loop. [Due to timeout, returns empty]"</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">test_halting_mock_false</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-string">"""
        Simulate halts(paradox) == False → paradox() should halt quickly.
        """</span>
        <span class="hljs-keyword">with</span> patch(<span class="hljs-string">'halt.halt'</span>, return_value=<span class="hljs-literal">False</span>) <span class="hljs-keyword">as</span> mock_halt:
            halted = paradox_self_reference(mock_halt)
            self.assertTrue(halted, <span class="hljs-string">"Expected paradox() to halt, and therefore paradox() returns True"</span>)


<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    unittest.main()
</code></pre>
<h2 id="heading-outcome">Outcome</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749977597854/2a6255bf-f201-4883-8878-5dbbf3199dee.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-real-world-implications">Real World Implications:</h2>
<p>While the halting problem might seem like an abstract concept, it has serious practical implications:</p>
<ol>
<li><p><strong>Orchestration systems and job scheduling</strong>: Since we can't determine in advance whether every program will terminate, we can't precisely calculate how long each job will take. That's why orchestrators use timeouts and resource limits rather than trying to predict exact runtimes.</p>
</li>
<li><p><strong>Bug detection tools</strong>: Static analysis tools that try to find infinite loops or deadlocks are fundamentally limited. They may find some problems, but they cannot detect all possible non-terminating executions.</p>
</li>
<li><p><strong>Compiler optimizations</strong>: Some optimizations require knowing if certain code sections terminate. Compilers must use conservative approximations, potentially missing optimization opportunities.</p>
</li>
</ol>
<h2 id="heading-conclusion">Conclusion</h2>
<p>While you won’t frequently come across scenarios where you need to worry about the halting problem, it shows that mathematics places hard limits on how deterministically we can predict the outcome of our programs. Surely, a long way to go in the evolution of computing!</p>
]]></content:encoded></item><item><title><![CDATA[Python Dependency Isolation with Subprocesses]]></title><description><![CDATA[Introduction
Resolving conflicting dependencies can be a real headache in python applications. Different parts of the codebase might require different versions of the same package, creating a seemingly impossible situation. While virtual environments...]]></description><link>https://tech.amanmulani.com/python-dependency-isolation-with-subprocesses</link><guid isPermaLink="true">https://tech.amanmulani.com/python-dependency-isolation-with-subprocesses</guid><category><![CDATA[Python]]></category><category><![CDATA[python projects]]></category><category><![CDATA[dependency management]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Aman Mulani]]></dc:creator><pubDate>Mon, 26 May 2025 19:50:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/aVOACNd1cc0/upload/cba7e7f6af9128cdc08145e846554a22.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Resolving conflicting dependencies can be a real headache in Python applications. Different parts of the codebase might require different versions of the same package, creating a seemingly impossible situation. While virtual environments help with project-level isolation, they don't solve the problem of conflicting dependencies within a single application. In this article, we'll explore a proof-of-concept approach that uses Python's subprocess mechanism to run specific functions with isolated dependencies. This is purely for exploration and is not at all recommended for production, unless you are ready for late-night pages!</p>
<h2 id="heading-glossary">Glossary</h2>
<p><strong>Subprocess</strong>: A separate process spawned by a parent process, with its own memory space and environment<br /><strong>Runtime Environment</strong>: The set of dependencies available to a function when it executes<br /><strong>Dependency Isolation</strong>: Running code with specific dependencies regardless of what's installed in the parent process</p>
<h2 id="heading-the-challenge">The Challenge</h2>
<p>Consider this scenario: one part of your application needs <code>requests==2.22.0</code> while another requires <code>requests==2.32.2</code>. How do you satisfy both requirements within the same Python process? The initial thoughts are:</p>
<ol>
<li><p><strong>Compromise</strong>: Choose one version and hope everything works</p>
</li>
<li><p><strong>Refactoring</strong>: Rewrite code to work with a single version</p>
</li>
</ol>
<p>Let us explore a third option: run specific functions in isolated subprocesses with their own dependencies.</p>
<h2 id="heading-the-subprocess-solution">The Subprocess Solution</h2>
<p>The core idea is: when a function needs specific dependencies, we:</p>
<ol>
<li><p>Spawn a new Python subprocess</p>
</li>
<li><p>Install the required dependencies in a temporary directory</p>
</li>
<li><p>Execute the function code in this subprocess</p>
</li>
<li><p>Capture the result and return it to the parent process</p>
</li>
</ol>
<p>This approach leverages the fact that subprocesses have their own memory space and Python environment, allowing for true isolation without affecting the parent process.</p>
<h2 id="heading-key-components">Key Components</h2>
<p>Our solution consists of two main parts:</p>
<ol>
<li><p><strong>A decorator</strong> (<code>runtime_env</code>) that wraps functions to execute in isolated subprocesses</p>
</li>
<li><p><strong>A subprocess runner</strong> that handles dependency installation and function execution</p>
</li>
</ol>
<p>Here's how the decorator is used:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> decorator <span class="hljs-keyword">import</span> runtime_env

<span class="hljs-meta">@runtime_env("requirements.txt")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_data</span>(<span class="hljs-params">data, factor=<span class="hljs-number">1</span></span>):</span>
    <span class="hljs-comment"># This function runs in a subprocess with dependencies from requirements.txt</span>
    <span class="hljs-keyword">import</span> requests  <span class="hljs-comment"># Will use version specified in requirements.txt</span>

    <span class="hljs-comment"># Process data...</span>
    result = {
        <span class="hljs-string">"original"</span>: data,
        <span class="hljs-string">"processed"</span>: {<span class="hljs-string">"name"</span>: data[<span class="hljs-string">"name"</span>].upper(), <span class="hljs-string">"value"</span>: data[<span class="hljs-string">"value"</span>] * factor},
        <span class="hljs-string">"metadata"</span>: {<span class="hljs-string">"requests_version"</span>: requests.__version__}
    }

    <span class="hljs-keyword">return</span> result
</code></pre>
<h2 id="heading-how-it-works">How It Works</h2>
<p>When the decorated function is called, the following sequence happens:</p>
<ol>
<li><p><strong>Function identification</strong>: The decorator captures the source code of the module containing the function</p>
</li>
<li><p><strong>Argument serialization</strong>: Function arguments are serialized using Python's pickle module</p>
</li>
<li><p><strong>Subprocess creation</strong>: A new Python process is started, running our subprocess runner</p>
</li>
<li><p><strong>Dependency installation</strong>: The subprocess installs required packages in a temporary directory</p>
</li>
<li><p><strong>Function execution</strong>: The function runs in the isolated environment with its own dependencies</p>
</li>
<li><p><strong>Result communication</strong>: The result is serialized and passed back to the parent process</p>
</li>
<li><p><strong>Cleanup</strong>: Temporary directories and files are removed</p>
</li>
</ol>
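<p>To make the sequence above concrete, here is a deliberately minimal sketch of a runner (my own hypothetical code, not the actual implementation; it skips timeouts, error handling, and the concurrency concerns discussed later):</p>

```python
import os
import pickle
import subprocess
import sys
import tempfile

def run_isolated(func_source, func_name, args, requirements=None):
    """Run func_name (defined in func_source) in a fresh Python subprocess,
    optionally pip-installing requirements into a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        if requirements:
            # Install the isolated dependencies into the temp directory.
            subprocess.run([sys.executable, "-m", "pip", "install",
                            "--target", tmp, *requirements], check=True)
        in_path = os.path.join(tmp, "in.pkl")
        out_path = os.path.join(tmp, "out.pkl")
        with open(in_path, "wb") as f:
            pickle.dump(args, f)                    # serialize the arguments
        runner = "\n".join([
            "import pickle, sys",
            f"sys.path.insert(0, {tmp!r})",         # isolated deps win over parent's
            func_source,                            # define the function
            f"with open({in_path!r}, 'rb') as f: args = pickle.load(f)",
            f"result = {func_name}(*args)",
            f"with open({out_path!r}, 'wb') as f: pickle.dump(result, f)",
        ])
        subprocess.run([sys.executable, "-c", runner], check=True)
        with open(out_path, "rb") as f:
            return pickle.load(f)                   # deserialize the result

print(run_isolated("def double(x):\n    return 2 * x", "double", (21,)))  # → 42
```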
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748287980031/65f50224-aaa9-4ede-bf91-a6a4a7f7ec17.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-example-in-action">Example in Action</h2>
<p>Let's explore a concrete example. The parent process might have <code>requests==2.26.0</code> installed, while we want a function to run with <code>requests==2.28.0</code>:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Parent process</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_data</span>():</span>
    <span class="hljs-keyword">try</span>:
        <span class="hljs-keyword">import</span> requests
        print(<span class="hljs-string">f"Parent process using requests version: <span class="hljs-subst">{requests.__version__}</span>"</span>)
    <span class="hljs-keyword">except</span> ImportError:
        print(<span class="hljs-string">"Requests not available in parent"</span>)
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"name"</span>: <span class="hljs-string">"Test Data"</span>, <span class="hljs-string">"value"</span>: <span class="hljs-number">42</span>}

<span class="hljs-comment"># This will run in a subprocess with its own dependencies</span>
<span class="hljs-meta">@runtime_env("requirements.txt")  # Contains requests==2.28.0</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_data</span>(<span class="hljs-params">data</span>):</span>
    <span class="hljs-keyword">import</span> requests
    print(<span class="hljs-string">f"Subprocess using requests version: <span class="hljs-subst">{requests.__version__}</span>"</span>)
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"result"</span>: data, <span class="hljs-string">"version_used"</span>: requests.__version__}

<span class="hljs-comment"># Main execution</span>
input_data = get_data()
result = process_data(input_data)
print(<span class="hljs-string">f"Process completed with requests version: <span class="hljs-subst">{result[<span class="hljs-string">'version_used'</span>]}</span>"</span>)
</code></pre>
<p>When executed, we see different versions of the requests library being used in each process, demonstrating successful isolation. Here’s a screenshot from the PoC that we did:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748288391181/f53889f1-d5f1-471d-ad3c-8e1efd2284c7.png" alt class="image--center mx-auto" /></p>
<p>As you can see in the example above, the parent process PID is 1, whereas the subprocess PID is 7, and the requests library version differs between the parent process and the subprocess.</p>
<h2 id="heading-behind-the-scenes">Behind the Scenes</h2>
<p>I won’t delve into the exact approach here; I would rather let you take a stab at it (and yes, use LLMs to help you find a solution!). However, here’s a summary of the approach at a very high level:</p>
<ul>
<li><p>The decorator first identifies the module to which the function belongs, gets the module via the <code>sys</code> package, and uses the <code>inspect</code> module to fetch the source file. We need the whole module to ensure that the imports used by the decorated function are also available.</p>
</li>
<li><p>Once we have the code information, we use the subprocess module to create a new process and a new temporary directory in which we install the required dependencies.</p>
</li>
<li><p>Now we execute the function and return the output back to the caller process.</p>
</li>
</ul>
<h2 id="heading-only-for-fun-not-for-production">Only for fun, not for production!</h2>
<p>This approach is for educational purposes only, the primary reason being that it is very naive: there are a lot of aspects to cover even to make it dev-ready, for example, the protocol or mechanism used to communicate results back to the parent process. What about concurrent processes creating new isolated environments? What about dependencies shared between the parent and child processes? And so on!</p>
<h2 id="heading-explore-something-production-ready">Explore something production ready?</h2>
<p>To checkout an implementation that deals with conflicting dependencies you can read about <a target="_blank" href="https://docs.ray.io/en/latest/ray-core/handling-dependencies.html">Ray's Runtime Environments</a> feature, which provides a more robust, optimized approach to dependency isolation with additional features like package caching and environment reuse.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Python’s subprocess is a powerful module. The example above reveals the power of process isolation as a solution to conflicting dependencies, a common challenge in modern Python development. While this is only a proof-of-concept, understanding the underlying principles of process isolation provides valuable insight into how dependency conflicts can be managed in complex applications.</p>
<p>If you want to explore the source code, feel free to get in touch! However, I strongly recommend building it yourself; with LLMs at your disposal, it should be about an hour of effort!</p>
]]></content:encoded></item><item><title><![CDATA[The performance cost of dynamic lists]]></title><description><![CDATA[Introduction
Ever wondered why you are always told that numpy performs better than python lists? Well, it circles back to dynamic + reference based arrays vs static arrays.
In this article, we will understand it by exploring how Python's implementati...]]></description><link>https://tech.amanmulani.com/the-performance-cost-of-dynamic-lists</link><guid isPermaLink="true">https://tech.amanmulani.com/the-performance-cost-of-dynamic-lists</guid><category><![CDATA[Python]]></category><category><![CDATA[Performance Optimization]]></category><category><![CDATA[numpy]]></category><category><![CDATA[Python Internals]]></category><dc:creator><![CDATA[Aman Mulani]]></dc:creator><pubDate>Wed, 21 May 2025 19:26:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/szPxqGFNS6Y/upload/516677fbeb8b7ef6efca39b8c30748da.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>Ever wondered why you are always told that NumPy performs better than Python lists? It circles back to dynamic, reference-based arrays versus static arrays.</p>
<p>In this article, we will understand it by exploring how Python's implementation of dynamic lists affects cache performance and why NumPy arrays can be dramatically faster for numerical operations.</p>
<h2 id="heading-glossary">Glossary</h2>
<ul>
<li><p><strong>L1 Cache (D1)</strong>: Smallest, fastest cache directly connected to the CPU</p>
</li>
<li><p><strong>Last-Level Cache (LL)</strong>: Larger but slower cache before accessing main memory</p>
</li>
<li><p><strong>Cache Lines</strong>: Data is loaded in fixed-size blocks (typically 64 bytes), not individual values</p>
</li>
<li><p><strong>Spatial Locality</strong>: The principle that after accessing one memory location, programs are likely to access nearby locations soon</p>
</li>
</ul>
<h2 id="heading-python-lists-under-the-hood">Python Lists: Under the Hood</h2>
<p>Python's lists are implemented as dynamic arrays of references. This means that when you create a list like <code>[1, 2, 3]</code>, what you're actually getting is an array of pointers, each pointing to a separate Python object stored somewhere else in memory:</p>
<pre><code class="lang-plaintext">List in memory:
[ptr] → PyObject(1)
[ptr] → PyObject(2)
[ptr] → PyObject(3)
</code></pre>
<p>Each PyObject carries type information, a reference count, and the actual value, requiring 16+ bytes per integer on 64-bit systems. When the list grows beyond its capacity, Python allocates a new, larger array (CPython over-allocates by roughly 12.5% plus a small constant), copies over the pointers, and frees the old array.</p>
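<p>You can watch this over-allocation from Python itself; the exact figures are CPython-version and platform specific:</p>
<pre><code class="lang-python">import sys

items = []
sizes = []
for i in range(17):
    items.append(i)
    sizes.append(sys.getsizeof(items))

# Plateaus are spare capacity; each step is a resize that copied the pointers.
print(sizes)
</code></pre>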
<h2 id="heading-numpy-arrays-the-contiguous-approach">NumPy Arrays: The Contiguous Approach</h2>
<p>In contrast, NumPy arrays store values directly in a contiguous memory block:</p>
<pre><code class="lang-plaintext">NumPy array in memory:
[1][2][3][4][5]...
</code></pre>
<p>This approach sacrifices Python's flexibility (mixing types) for dramatic performance improvements with homogeneous data.</p>
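<p>A rough way to see the footprint difference without leaving the standard library is the <code>array</code> module, which, like NumPy, stores values contiguously (the exact byte counts below vary by platform and Python version):</p>
<pre><code class="lang-python">import sys
from array import array

n = 1000
nums = list(range(n))            # an array of pointers, one PyObject per int
packed = array("q", range(n))    # one contiguous block of 8-byte integers

# For the list, count the pointer array plus every boxed integer it points to.
list_total = sys.getsizeof(nums) + sum(sys.getsizeof(x) for x in nums)
print(list_total, sys.getsizeof(packed))
</code></pre>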
<h2 id="heading-two-approaches-compared">Two Approaches Compared</h2>
<p>Let's implement Python-like and NumPy-like structures in C++ to demonstrate the difference.</p>
<h3 id="heading-python-lists-dynamic-arrays">Python Lists (Dynamic Arrays)</h3>
<pre><code class="lang-cpp"><span class="hljs-comment">// Python-like list implementation (simplified)</span>

<span class="hljs-keyword">template</span> &lt;<span class="hljs-keyword">typename</span> T&gt;
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">PythonList</span> {</span>
<span class="hljs-keyword">private</span>:
    T* <span class="hljs-built_in">array</span>;
    <span class="hljs-keyword">int</span> capacity;
    <span class="hljs-keyword">int</span> size;

    <span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">resize</span><span class="hljs-params">()</span> </span>{
        capacity *= <span class="hljs-number">2</span>;
        T* newArray = <span class="hljs-keyword">new</span> T[capacity];
        <span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; size; i++) {
            newArray[i] = <span class="hljs-built_in">array</span>[i];
        }
        <span class="hljs-keyword">delete</span>[] <span class="hljs-built_in">array</span>;
        <span class="hljs-built_in">array</span> = newArray;
    }

<span class="hljs-keyword">public</span>:
    PythonList() : capacity(<span class="hljs-number">1</span>), size(<span class="hljs-number">0</span>) {
        <span class="hljs-built_in">array</span> = <span class="hljs-keyword">new</span> T[capacity];
    }

    ~PythonList() {
        <span class="hljs-keyword">delete</span>[] <span class="hljs-built_in">array</span>;
    }

    <span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">append</span><span class="hljs-params">(T value)</span> </span>{
        <span class="hljs-keyword">if</span> (size == capacity) {
            resize();
        }
        <span class="hljs-built_in">array</span>[size++] = value;
    }

    <span class="hljs-function">T <span class="hljs-title">get</span><span class="hljs-params">(<span class="hljs-keyword">int</span> index)</span> <span class="hljs-keyword">const</span> </span>{
        <span class="hljs-keyword">if</span> (index &lt; <span class="hljs-number">0</span> || index &gt;= size) {
            <span class="hljs-keyword">throw</span> <span class="hljs-built_in">std</span>::out_of_range(<span class="hljs-string">"Index out of range"</span>);
        }
        <span class="hljs-keyword">return</span> <span class="hljs-built_in">array</span>[index];
    }

    <span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">getSize</span><span class="hljs-params">()</span> <span class="hljs-keyword">const</span> </span>{
        <span class="hljs-keyword">return</span> size;
    }

    <span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">getCapacity</span><span class="hljs-params">()</span> <span class="hljs-keyword">const</span> </span>{
        <span class="hljs-keyword">return</span> capacity;
    }
};
</code></pre>
<p>Dynamic lists resize once the initially allocated capacity is full, and then repeatedly double their size. In simple words: [capacity reached → create new array of double the size → copy elements → repeat]</p>
<blockquote>
<p>Extras!</p>
<p>When a Python list runs out of capacity, it needs to resize. This resizing operation requires:<br />1. Allocating a new, larger array (typically 2× the previous size)<br />2. Copying all existing elements to the new array<br />3. Deallocating the old array</p>
<p>This resize operation is expensive: O(n), where n is the number of elements. However, it occurs infrequently. Over n append operations, resizing happens only about log₂(n) times, and the copying costs form a geometric series: 1 + 2 + 4 + ... + n/2 + n, which sums to less than 2n. The total cost after n operations is therefore O(n) for the appends plus O(n) for the copies, i.e., O(n) overall.</p>
<p>When divided by n operations, the amortized cost per operation becomes: O(n)/n = O(1)</p>
</blockquote>
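<p>The amortized argument above can be checked empirically with a toy counter; this is a sketch, not CPython's actual growth policy:</p>
<pre><code class="lang-python">class GrowArray:
    """Toy dynamic array that doubles when full and counts element copies."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.copies = 0

    def append(self, value):
        if self.size == self.capacity:
            self.capacity *= 2
            self.copies += self.size  # cost of moving every existing element
        self.size += 1

arr = GrowArray()
for i in range(1000000):
    arr.append(i)

# Total copies stay proportional to n, so each append is amortized O(1).
print(arr.copies / arr.size)
</code></pre>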
<h3 id="heading-numpy-similar-to-primitive-arrays">Numpy (Similar to Primitive Arrays)</h3>
<p>This approach sacrifices flexibility for performance, as the size and the data type cannot be changed.</p>
<pre><code class="lang-cpp">
<span class="hljs-comment">// NumPy-like array implementation (simplified)</span>
<span class="hljs-keyword">template</span> &lt;<span class="hljs-keyword">typename</span> T&gt;
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">NumpyArray</span> {</span>
<span class="hljs-keyword">private</span>:
    T* <span class="hljs-built_in">array</span>;
    <span class="hljs-keyword">int</span> size;

<span class="hljs-keyword">public</span>:
    NumpyArray(<span class="hljs-keyword">int</span> n) : size(n) {
        <span class="hljs-built_in">array</span> = <span class="hljs-keyword">new</span> T[size];
    }

    ~NumpyArray() {
        <span class="hljs-keyword">delete</span>[] <span class="hljs-built_in">array</span>;
    }

    <span class="hljs-function"><span class="hljs-keyword">void</span> <span class="hljs-title">set</span><span class="hljs-params">(<span class="hljs-keyword">int</span> index, T value)</span> </span>{
        <span class="hljs-keyword">if</span> (index &lt; <span class="hljs-number">0</span> || index &gt;= size) {
            <span class="hljs-keyword">throw</span> <span class="hljs-built_in">std</span>::out_of_range(<span class="hljs-string">"Index out of range"</span>);
        }
        <span class="hljs-built_in">array</span>[index] = value;
    }

    <span class="hljs-function">T <span class="hljs-title">get</span><span class="hljs-params">(<span class="hljs-keyword">int</span> index)</span> <span class="hljs-keyword">const</span> </span>{
        <span class="hljs-keyword">if</span> (index &lt; <span class="hljs-number">0</span> || index &gt;= size) {
            <span class="hljs-keyword">throw</span> <span class="hljs-built_in">std</span>::out_of_range(<span class="hljs-string">"Index out of range"</span>);
        }
        <span class="hljs-keyword">return</span> <span class="hljs-built_in">array</span>[index];
    }

    <span class="hljs-function"><span class="hljs-keyword">int</span> <span class="hljs-title">getSize</span><span class="hljs-params">()</span> <span class="hljs-keyword">const</span> </span>{
        <span class="hljs-keyword">return</span> size;
    }
};
</code></pre>
<p>Both serve similar purposes but have fundamentally different memory layouts.</p>
<h2 id="heading-analyzing-cache-performance">Analyzing Cache Performance</h2>
<p>Let's examine what happens when we iterate through both structures to calculate a sum:</p>
<pre><code class="lang-cpp"><span class="hljs-comment">// PythonList iteration</span>
<span class="hljs-keyword">long</span> <span class="hljs-keyword">long</span> sum = <span class="hljs-number">0</span>;
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; pyList.getSize(); i++) {
    sum += pyList.get(i);  <span class="hljs-comment">// Each access may cause a cache miss</span>
}

<span class="hljs-comment">// NumpyArray iteration</span>
<span class="hljs-keyword">long</span> <span class="hljs-keyword">long</span> sum = <span class="hljs-number">0</span>;
<span class="hljs-keyword">for</span> (<span class="hljs-keyword">int</span> i = <span class="hljs-number">0</span>; i &lt; npArray.getSize(); i++) {
    sum += npArray.get(i);  <span class="hljs-comment">// Sequential access has better locality</span>
}
</code></pre>
<h2 id="heading-cache-performance-results">Cache Performance Results</h2>
<p>Using Valgrind's cachegrind tool, we measured cache misses and execution time for both implementations across different dataset sizes:</p>
<ol>
<li><p>NumPy’s execution time is lower than the list’s as the dataset size increases.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747854415826/1f051c90-ab39-4d4e-a055-03cb601c4268.png" alt class="image--center mx-auto" /></p>
</li>
<li><p>D1 and LLd cache misses are higher for Python’s dynamic list implementation.</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747854557175/caef39dc-9584-4be8-a419-bd2971c78274.png" alt class="image--center mx-auto" /></p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747854586937/934bb2b0-e6ff-4e33-a821-9a6941110e9c.png" alt class="image--center mx-auto" /></p>
</li>
</ol>
<h2 id="heading-the-spatial-locality-advantage">The Spatial Locality Advantage</h2>
<p>NumPy outperforms Python lists because:</p>
<p><strong>Contiguous Storage</strong>: When a cache line is loaded for array[i], it also contains array[i+1], array[i+2], etc., which will be used immediately afterward.</p>
<p><strong>Fewer Memory Indirections</strong>: Python lists require a pointer dereference for each access, potentially bringing in a cache line that contains only one useful value.</p>
<p><strong>Better Prefetching</strong>: Modern CPUs can predict sequential access patterns and prefetch data, but struggle with the scattered memory access of pointer chasing.</p>
<h2 id="heading-cache-misses-explained">Cache Misses Explained</h2>
<p>When iterating through a Python list, each element access follows this pattern:</p>
<ol>
<li><p>Load pointer from list array (potential cache miss #1)</p>
</li>
<li><p>Follow pointer to actual object (potential cache miss #2)</p>
</li>
<li><p>Extract value from object (potential cache miss #3)</p>
</li>
<li><p>Move to next element (which may be in a completely different memory area)</p>
</li>
</ol>
<p>For NumPy arrays, the access pattern is:</p>
<ol>
<li><p>Calculate position (base address + i * element_size)</p>
</li>
<li><p>Access value directly (adjacent to previous value)</p>
</li>
<li><p>Move to next element (likely in the same cache line)</p>
</li>
</ol>
<p>With Python lists generating significantly more cache misses, the performance impact is enormous.</p>
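<p>Even from pure Python you can glimpse this indirection: the objects a list points to live at scattered heap addresses, while an array-style container owns a single contiguous buffer (the exact addresses differ on every run):</p>
<pre><code class="lang-python">from array import array

items = [i * 1000 for i in range(8)]   # multiples dodge the small-int cache
packed = array("q", range(8))

# Each list element is a separate heap object with its own address.
print([id(x) for x in items])

# The array owns one contiguous buffer: (base address, element count).
print(packed.buffer_info())
</code></pre>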
<h2 id="heading-conclusion">Conclusion</h2>
<p>This post is not about how lists are internally implemented in Python; rather, it is an attempt to spark your curiosity to understand how things work under the hood and to appreciate the trade-offs. For ease of use, and for the flexibility of holding elements of different types in a list, Python has traded away performance. Hope you found this insightful!</p>
]]></content:encoded></item><item><title><![CDATA[Dynamic Programming & Spatial Locality]]></title><description><![CDATA[Introduction
Computers use a hierarchy of memory systems to balance speed and capacity. The fastest memory (cache) sits closest to the CPU but is limited in size. When data isn't found in cache, it must be fetched from slower memory, creating a "cach...]]></description><link>https://tech.amanmulani.com/dynamic-programming-and-spatial-locality</link><guid isPermaLink="true">https://tech.amanmulani.com/dynamic-programming-and-spatial-locality</guid><category><![CDATA[Dynamic Programming]]></category><category><![CDATA[data structures]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[Performance Optimization]]></category><dc:creator><![CDATA[Aman Mulani]]></dc:creator><pubDate>Sat, 17 May 2025 22:27:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/b8th6xvb-DU/upload/fa130d6b5c50051f9f2cfe696b068cdc.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction"><strong>Introduction</strong></h2>
<p>Computers use a hierarchy of memory systems to balance speed and capacity. The fastest memory (cache) sits closest to the CPU but is limited in size. When data isn't found in cache, it must be fetched from slower memory, creating a "cache miss" that significantly slows execution. In this article, we will explore how spatial locality can have a significant impact on the performance of algorithms that leverage dynamic programming techniques.</p>
<h2 id="heading-glossary">Glossary</h2>
<ul>
<li><p><strong>L1 Cache (D1)</strong>: Smallest, fastest cache directly connected to the CPU</p>
</li>
<li><p><strong>Last-Level Cache (LL)</strong>: Larger but slower cache before accessing main memory</p>
</li>
<li><p><strong>Cache Lines</strong>: Data is loaded in fixed-size blocks (typically 64 bytes), not individual values</p>
</li>
<li><p><strong>Spatial Locality</strong> means that after accessing one memory location, programs are likely to access nearby locations soon, and therefore, having these locations in the cache can speed up the operations.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747520128226/52c0f214-1d9c-4fb6-bd8e-b4da769694f6.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-the-two-dp-approaches"><strong>The Two DP Approaches</strong></h2>
<p>Our example calculates Fibonacci numbers using two approaches:</p>
<ol>
<li><p><strong>Memoization</strong>: Stores previously calculated values in an array but accesses them in a somewhat random pattern through recursion.</p>
<pre><code class="lang-cpp"> long long fib_memoization_array(int n, std::vector&lt;long long&gt;&amp; memo) {
     if (n &lt;= 1) return n;

     if (memo[n] != -1) {
         return memo[n];
     }

     memo[n] = fib_memoization_array(n-1, memo) + fib_memoization_array(n-2, memo);
     return memo[n];
 }
</code></pre>
</li>
<li><p><strong>Tabulation</strong>: Builds the solution incrementally, accessing memory in a highly sequential pattern</p>
<pre><code class="lang-cpp"> long long fib_tabulation(int n) {
     if (n &lt;= 1) return n;

     std::vector&lt;long long&gt; dp(n+1, 0);
     dp[0] = 0;
     dp[1] = 1;

     // Sequential access pattern - excellent spatial locality
     for (int i = 2; i &lt;= n; i++) {
         dp[i] = dp[i-1] + dp[i-2];
     }

     return dp[n];
 }
</code></pre>
</li>
</ol>
<p>Both solve the same problem, but their memory access patterns differ dramatically.</p>
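<p>The difference in access order is easy to see by tracing which table index each approach touches; here is a rough Python transcription of the two C++ functions above:</p>
<pre><code class="lang-python">import sys
sys.setrecursionlimit(10000)

def fib_memo(n, memo, trace):
    trace.append(n)  # record every table index the recursion touches
    if n in memo:
        return memo[n]
    memo[n] = fib_memo(n - 1, memo, trace) + fib_memo(n - 2, memo, trace)
    return memo[n]

def fib_tab(n, trace):
    dp = [0] * (n + 1)
    dp[1] = 1
    for i in range(2, n + 1):
        trace.append(i)  # strictly sequential walk over the table
        dp[i] = dp[i - 1] + dp[i - 2]
    return dp[n]

jumpy, ordered = [], []
fib_memo(10, {0: 0, 1: 1}, jumpy)
fib_tab(10, ordered)
print(jumpy)    # descends 10, 9, 8, ... then backtracks and jumps around
print(ordered)  # sequential: [2, 3, 4, 5, 6, 7, 8, 9, 10]
</code></pre>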
<h2 id="heading-setup"><strong>Setup</strong></h2>
<p>For our cache performance experiments, we are using Docker to run Valgrind's cachegrind tool on macOS with an M3 chip. Since Valgrind does not support Apple Silicon, we used Docker to spin up a Linux instance.</p>
<h2 id="heading-analyzing-cache-performance"><strong>Analyzing Cache Performance</strong></h2>
<h3 id="heading-memoization">Memoization</h3>
<pre><code class="lang-plaintext">root@62969aff3a5f:/workspace# valgrind --main-stacksize=10000000000 --tool=cachegrind \
                                ./access_test memoization

Time taken for 1 runs: 20404 ms
==224== 
==224== I   refs:      4,251,823,292
==224== I1  misses:            2,032
==224== LLi misses:            1,674
==224== I1  miss rate:          0.00%
==224== LLi miss rate:          0.00%
==224== 
==224== D   refs:      2,950,712,351  (1,600,511,544 rd   + 1,350,200,807 wr)
==224== D1  misses:      187,518,436  (   87,515,735 rd   +   100,002,701 wr)
==224== LLd misses:      187,505,766  (   87,504,607 rd   +   100,001,159 wr)
==224== D1  miss rate:           6.4% (          5.5%     +           7.4%  )
==224== LLd miss rate:           6.4% (          5.5%     +           7.4%  )
==224== 
==224== LL refs:         187,520,468  (   87,517,767 rd   +   100,002,701 wr)
==224== LL misses:       187,507,440  (   87,506,281 rd   +   100,001,159 wr)
==224== LL miss rate:            2.6% (          1.5%     +           7.4%  )
</code></pre>
<h3 id="heading-tabulation">Tabulation</h3>
<pre><code class="lang-plaintext">root@62969aff3a5f:/workspace# valgrind --tool=cachegrind ./access_test tabulation
Time taken for 1 runs: 1620 ms
==225== 
==225== I   refs:      751,823,239
==225== I1  misses:          2,031
==225== LLi misses:          1,673
==225== I1  miss rate:        0.00%
==225== LLi miss rate:        0.00%
==225== 
==225== D   refs:      250,712,360  (100,511,549 rd   + 150,200,811 wr)
==225== D1  misses:     25,018,742  (     15,990 rd   +  25,002,752 wr)
==225== LLd misses:     25,010,403  (      8,689 rd   +  25,001,714 wr)
==225== D1  miss rate:        10.0% (        0.0%     +        16.6%  )
==225== LLd miss rate:        10.0% (        0.0%     +        16.6%  )
==225== 
==225== LL refs:        25,020,773  (     18,021 rd   +  25,002,752 wr)
==225== LL misses:      25,012,076  (     10,362 rd   +  25,001,714 wr)
==225== LL miss rate:          2.5% (        0.0%     +        16.6%  )
</code></pre>
<p>Let's examine the cachegrind output:</p>
<h3 id="heading-execution-time-comparison">Execution Time Comparison</h3>
<ul>
<li><p><strong>Memoization</strong>: 20,404 ms</p>
</li>
<li><p><strong>Tabulation</strong>: 1,620 ms (12.6× faster!)</p>
</li>
</ul>
<h3 id="heading-cache-statistics-breakdown">Cache Statistics Breakdown</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Metric</td><td>Memoization</td><td>Tabulation</td><td>Difference</td></tr>
</thead>
<tbody>
<tr>
<td>Instructions</td><td>4.25 billion</td><td>751 million</td><td>5.7× fewer</td></tr>
<tr>
<td>Data references</td><td>2.95 billion</td><td>250 million</td><td>11.8× fewer</td></tr>
<tr>
<td>D1 cache misses</td><td>187.5 million</td><td>25 million</td><td>7.5× fewer</td></tr>
<tr>
<td>LL cache misses</td><td>187.5 million</td><td>25 million</td><td>7.5× fewer</td></tr>
</tbody>
</table>
</div><p>While tabulation shows a higher D1 miss-rate percentage (10.0% vs 6.4%), the absolute number of cache misses is dramatically lower because it performs far fewer operations overall. Memoization incurs roughly 7.5× more D1 cache misses, which can make a big difference in highly latency-sensitive applications.</p>
<h2 id="heading-the-spatial-locality-advantage"><strong>The Spatial Locality Advantage</strong></h2>
<p>Tabulation outperforms memoization because:</p>
<ol>
<li><p><strong>Sequential Access Pattern</strong>: Tabulation accesses array elements in order (dp[0], dp[1], dp[2]...), which aligns perfectly with how cache lines are loaded; the simple loop also avoids the overhead of recursion, requiring far fewer instructions and memory accesses.</p>
</li>
<li><p><strong>More Efficient Cache Usage</strong>: When a cache line is loaded for dp[i], it also contains dp[i+1], dp[i+2], etc., which will be used immediately afterward, whereas in memoization, this does not hold true, because of the recurrence.</p>
</li>
</ol>
<h2 id="heading-cache-misses-explained"><strong>Cache Misses Explained</strong></h2>
<p>In the memoization approach, the recursive calls create a less predictable access pattern. When calculating fib(n), the function might jump to calculating fib(n-2) before fib(n-1) is fully resolved, causing scattered memory access that defeats the cache's prefetching mechanisms.</p>
<p>Each cache miss can cost 100-300 CPU cycles compared to 1-3 cycles for a cache hit. With memoization generating 7.5× more cache misses, the performance impact is enormous.</p>
<h2 id="heading-conclusion"><strong>Conclusion</strong></h2>
<p>This example perfectly demonstrates why understanding spatial locality matters for performance-critical code. Though both algorithms have the same mathematical complexity, the memoization approach's inferior cache utilisation and extensive call-stack usage make it over 12 times slower.</p>
<p>When designing algorithms, especially for dynamic programming problems, consider not just the time complexity but also how your data access patterns interact with the cache. Sequential access patterns will almost always outperform scattered access, particularly for large datasets that exceed cache size.</p>
<p><em>P.S. For those wondering why the cover photo of the article features lilies, here’s an extra byte for you: Lilies have a Fibonacci number of petals!</em></p>
]]></content:encoded></item><item><title><![CDATA[Simple maths behind Big-O]]></title><description><![CDATA[Isn’t it very common to hear about the worst-case time complexity of an algorithm? Well, let’s hear the math behind that!

Let’s say we have two functions: f(x) and g(x). Big-O notation is a way to describe how fast functions grow, especially as x be...]]></description><link>https://tech.amanmulani.com/simple-maths-behind-big-o</link><guid isPermaLink="true">https://tech.amanmulani.com/simple-maths-behind-big-o</guid><category><![CDATA[Time Complexity]]></category><category><![CDATA[#big o notation]]></category><category><![CDATA[Mathematics]]></category><category><![CDATA[algorithms]]></category><dc:creator><![CDATA[Aman Mulani]]></dc:creator><pubDate>Mon, 05 May 2025 18:05:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/yfScbLDv2ME/upload/d7815fa09d456be7b15c911010e938a5.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Isn’t it very common to hear about the worst-case time complexity of an algorithm? Well, let’s hear the <em>math</em> behind that!</p>
</blockquote>
<p>Let’s say we have two functions: f(x) and g(x). Big-O notation is a way to describe how fast functions grow, especially as x becomes really, really large.</p>
<h2 id="heading-set-theory"><strong>Set Theory</strong></h2>
<p>To put it in simple words, the notation O(g(x)) represents a family of functions that grow <em>no faster than</em> g(x), up to constant factors and for large enough x.</p>
<p>So, when we write:</p>
<p>$$|f(x)| \in O(g(x)) \text{ we are saying that } f(x) \text{ belongs to this family, such that } $$</p><p> $$f(x) \text{ doesn’t outgrow } g(x) \text{ as } x \to \infty$$</p>
<h2 id="heading-calculus">Calculus</h2>
<p>Now let’s back that up with some actual math.</p>
<p>We say f(x) ∈ O(g(x)) if there exist positive constants C and x₀ such that:</p>
<p>$$|f(x)| ≤ C\cdot|g(x)| \text{ for all } x &gt; x_0$$</p><p>Or, in terms of limits:</p>
<p>$$\lim_{x \to \infty} \frac{|f(x)|}{|g(x)|} &lt; \infty$$</p><p>This tells us that the ratio of f(x) to g(x) stays bounded. In other words, f(x) may grow, but it’s never going to grow <em>faster</em> than some constant multiple of g(x) beyond a certain point. For sufficiently large values of x, f(x) always stays below some constant multiple of g(x).</p>
<h2 id="heading-example">Example</h2>
<p>Let us use the following example to understand it better:</p>
<p>$$\text{Prove that: } 2^{n} \text{ is upper bounded by } e^n \text{ i.e. } 2^n \in O(e^n)$$</p><p>Let us test with the limit, as n → infinity,</p>
<p>$$\lim_{n \to \infty} \frac{f(n)}{g(n)} = \lim_{n \to \infty} \frac{2^n}{e^n} = \lim_{n \to \infty} (\frac{2}{e})^n [\text{ By power of quotient rule}]$$</p><p>$$\text{ Since } e\sim2.718 &gt; 2 \text{, therefore, } \frac{2}{e} &lt; 1. \text{ Therefore as } n \to \infty \text{ , } ({\frac{2}{e}})^n \to 0$$</p><p> $$ \text{Therefore, } \lim_{n \to \infty} \frac{f(n)}{g(n)} = 0 &lt; \infty. \text{ Hence, } 2^n \in O(e^n)$$</p>
<p>From this, it becomes clear that 2^n belongs to the set O(e^n): it is one of the family of functions that do not grow faster than e^n as n → infinity.</p>
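<p>A quick numeric check of the limit (illustrative only; the proof above is what actually establishes the bound):</p>
<pre><code class="lang-python">import math

# The ratio f(n)/g(n) = (2/e)^n collapses toward 0 as n grows.
for n in (1, 10, 50, 100):
    print(n, (2 / math.e) ** n)
</code></pre>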
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746468075402/c78324b5-df91-409b-815a-2d5ef3f4014c.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>That’s all Big-O is: a mathematical way to group functions into families based on their long-term growth rates. It gives us a high-level idea of performance, complexity, or behaviour—without sweating the exact details. Similarly, you can later explore Big-Omega and Big-Theta to complete the picture!</p>
]]></content:encoded></item><item><title><![CDATA[[Kind] Kubernetes in Docker]]></title><description><![CDATA[Introduction
I am sure that if you use Kubernetes for container orchestration, you must have used either minikube or kind for running k8s locally. When first faced with the challenge of testing a few loading balancing scenarios, I chose minikube, as ...]]></description><link>https://tech.amanmulani.com/kind-kubernetes-in-docker</link><guid isPermaLink="true">https://tech.amanmulani.com/kind-kubernetes-in-docker</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Docker]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Aman Mulani]]></dc:creator><pubDate>Wed, 02 Apr 2025 18:15:12 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/_PDPRax9dm8/upload/c55f77178c872c7a9a386f410206cdab.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><img src="https://kind.sigs.k8s.io/logo/logo.png" alt="kind" /></p>
<h2 id="heading-introduction">Introduction</h2>
<p>I am sure that if you use Kubernetes for container orchestration, you must have used either minikube or kind for running k8s locally. When first faced with the challenge of testing a few load balancing scenarios, I chose minikube, as I had used it while learning Kubernetes. This time, however, I ran into installation issues on my Apple Silicon laptop. So I started looking for alternatives, and that is where I came across kind.</p>
<blockquote>
<p>FYI, I haven’t extensively used both minikube and kind. The following are just my observations based on my usage and research.</p>
</blockquote>
<p>In essence, a Kubernetes cluster consists of a control plane (master node) and a set of worker nodes. The control plane consists of the scheduler, controller, api-server, etc. Each worker node runs a container runtime engine (CRE) that manages and runs containers; kube-proxy, which handles service networking; a Container Network Interface (CNI) plugin (such as Flannel or Calico) that enables inter-node communication; and the kubelet, the agent that ensures the node is running its assigned workloads.</p>
<p>As you can see, for Kubernetes itself to function, quite a few components need to work in sync. While the majority of cloud providers make it easy to set up Kubernetes, testing Kubernetes locally can be challenging.</p>
<p>Here’s where kind comes into the picture: kind enables you to create a Kubernetes cluster where each node is a separate container running inside Docker. While kind uses Docker to launch the cluster, each Kubernetes node inside the cluster runs its own container runtime (usually containerd). Kind manages their networking and lifecycle to simulate a real Kubernetes cluster.</p>
<h2 id="heading-kind">Kind</h2>
<p>As you can find in the documentation, kind stands for <strong>K</strong>ubernetes <strong>in</strong> <strong>D</strong>ocker. To put it simply, kind is a tool that lets you run Kubernetes inside containers. How ironic! A container to run a tool whose sole purpose is to orchestrate other containers! The master node and worker nodes are provisioned as separate Docker containers when a cluster is created via kind.</p>
<p>If you look through the code base of <a target="_blank" href="https://github.com/kubernetes-sigs/kind">kind</a>, you will find that kind at its core, internally, just executes docker commands (if docker provider is used).</p>
<pre><code class="lang-go"><span class="hljs-function"><span class="hljs-keyword">func</span> <span class="hljs-title">createContainer</span><span class="hljs-params">(name <span class="hljs-keyword">string</span>, args []<span class="hljs-keyword">string</span>)</span> <span class="hljs-title">error</span></span> {
    <span class="hljs-keyword">return</span> exec.Command(<span class="hljs-string">"docker"</span>, <span class="hljs-built_in">append</span>([]<span class="hljs-keyword">string</span>{<span class="hljs-string">"run"</span>, <span class="hljs-string">"--name"</span>, name}, args...)...).Run()
}
</code></pre>
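<p>The same shell-out idea in Python looks roughly like this (a sketch; the container name and image below are illustrative, and docker must be on your PATH for the actual call to succeed):</p>
<pre><code class="lang-python">import subprocess

def build_run_command(name, args):
    # Fixed "docker run --name NAME" prefix, mirroring kind's Go helper.
    return ["docker", "run", "--name", name] + list(args)

def create_container(name, args):
    """Launch a container node by shelling out to the docker CLI."""
    return subprocess.run(build_run_command(name, args), check=True)

print(build_run_command("kind-control-plane", ["-d", "kindest/node"]))
</code></pre>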
<p>This is what makes kind an interesting, lightweight tool: it simplifies cluster management by using Docker to create nodes, while still managing cluster configuration, networking, and lifecycle using Kubernetes tools like kubeadm. The image below is a rough illustration of what kind does; it does not show the true inner workings and is for representational purposes only.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1743537195797/a308a658-827a-43df-b90c-a39636afeb73.png" alt="The image below is a rough illustration of what kind does, it does not show the true working, and is only for representational purposes. " class="image--center mx-auto" /></p>
<h3 id="heading-code-highlights">Code Highlights</h3>
<p>Since I don’t want this article to be a tutorial, I am attaching references to the code bits that I found interesting with respect to kind.</p>
<ol>
<li><p>Provider Interface (<a target="_blank" href="https://github.com/kubernetes-sigs/kind/blob/72d13903c5395be3d739ad92f30d14a22ecb7850/pkg/cluster/internal/providers/provider.go">provider.go</a>)</p>
<p> At the time of writing this article, kind supports three container runtime providers: Docker, nerdctl and Podman. Kind uses the provider pattern (or the strategy pattern, for the OO folks): a simple yet powerful design pattern that makes any provider pluggable.</p>
</li>
<li><p>Docker Provider (<a target="_blank" href="https://github.com/kubernetes-sigs/kind/blob/64b1432242dd788a2ca1ca4d47a14bbe2067be40/pkg/cluster/internal/providers/docker/provider.go">docker/provider.go</a>)</p>
<p> This is the concrete implementation of the provider interface for Docker. In this file, you can find how nodes are created from a base image.</p>
</li>
<li><p>Kind CLI (<a target="_blank" href="https://github.com/kubernetes-sigs/kind/blob/84101f25ddb9724d3dc774eff3b18c7e530cba19/pkg/cmd/kind/root.go">cmd/kind/root.go</a>)</p>
<p> The kind CLI tool internally figures out which provider to use (docker, podman, etc.) and goes on to run the commands against the concrete implementation for that provider. The CLI supports the following top-level subcommands:</p>
<pre><code class="lang-go"> <span class="hljs-comment">// add all top level subcommands: build, completion, create, delete, export, get,</span>
 <span class="hljs-comment">// version and load</span>
 cmd.AddCommand(build.NewCommand(logger, streams))
 cmd.AddCommand(completion.NewCommand(logger, streams))
 cmd.AddCommand(create.NewCommand(logger, streams))
 cmd.AddCommand(<span class="hljs-built_in">delete</span>.NewCommand(logger, streams))
 cmd.AddCommand(export.NewCommand(logger, streams))
 cmd.AddCommand(get.NewCommand(logger, streams))
 cmd.AddCommand(version.NewCommand(logger, streams))
 cmd.AddCommand(load.NewCommand(logger, streams))
</code></pre>
</li>
</ol>
<p>Another interesting thing to explore is how the kubeadm initialisation happens, but I will leave that as homework for the readers!</p>
<h3 id="heading-images-in-the-kind-cluster">Images in the kind cluster</h3>
<p>Kind cannot automatically access <strong>locally</strong> built images because the cluster nodes do not have direct access to the Docker daemon on your host. The interesting thing is that, using the following command, you can load a locally built image into the kind cluster:</p>
<pre><code class="lang-bash">kind load docker-image kind.local/image-name:image-tag --name kind-cluster-name
</code></pre>
<p>where <code>kind.local/</code> is a custom namespace for local images in the kind cluster. The <code>kind load</code> command allows users to push images from the host’s Docker daemon into the kind cluster, making them accessible to the nodes running inside the containers.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Overall, I feel that kind successfully caters to the following needs: fast cluster spin-up, lightweight resource usage, and easy CI/CD integration, and it can without a doubt be used for these use cases.</p>
<p>Regardless of your choice of tool, kind or minikube, understanding how Kubernetes works internally, especially concepts like kubeadm, container runtimes, and networking, is crucial for effectively managing clusters in production. This is where studying such tools and reading their source code can deepen your understanding. I also find the code of kind to be clean, and it can be used to learn a couple of design patterns. So the next time you want to test Kubernetes locally, do have a look at kind, and not just at kind but at its source code as well!</p>
]]></content:encoded></item><item><title><![CDATA[Git Internals: Recover from Git HARD Reset!]]></title><description><![CDATA[In this short article, we will have a look at some practical implications of what we have seen in the previous articles. We will be looking at ways to do damage recovery and control.

This also addresses one of the most common questions that I get, "...]]></description><link>https://tech.amanmulani.com/git-internals-recover-from-git-hard-reset</link><guid isPermaLink="true">https://tech.amanmulani.com/git-internals-recover-from-git-hard-reset</guid><category><![CDATA[Git]]></category><category><![CDATA[GitHub]]></category><category><![CDATA[version control]]></category><dc:creator><![CDATA[Aman Mulani]]></dc:creator><pubDate>Sun, 04 Feb 2024 07:24:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/_whs7FPfkwQ/upload/be1bec9b2df6f9c2ddec92c80d7945f1.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this short article, we will have a look at some practical implications of what we have seen in the previous articles. We will be looking at ways to do damage recovery and control.</p>
<blockquote>
<p>This also addresses one of the most common questions that I get, "Can I recover data from a hard reset?"</p>
</blockquote>
<h2 id="heading-disaster-recovery">Disaster Recovery</h2>
<p>Let us consider the following as your commit history:</p>
<pre><code class="lang-plaintext">a (first-commit) &lt;- b &lt;- c &lt;- d &lt;- e &lt;- f (latest-commit)
</code></pre>
<p>Now you realise that the commits <code>d, e, f</code> were incorrect, and you no longer want them. So, you have two options now:</p>
<ol>
<li><p>Revert these commits individually. The commit history will look something like this:</p>
<pre><code class="lang-plaintext"> a (first-commit) &lt;- b &lt;- c &lt;- d &lt;- e &lt;- f &lt;- ~f &lt;- ~e &lt;- ~d (latest-commit),

 Where, ~f is the revert of f
        ~e is the revert of e
        ~d is the revert of d.
</code></pre>
<p> You must have noticed here that you have to revert the changes from the latest to the oldest: always revert the latest commit first, and then move to the older ones. The command to revert a git commit is:</p>
<pre><code class="lang-plaintext"> git revert &lt;commit-id&gt;
</code></pre>
<p> The advantage of this is that the history is preserved, and if at some later point in time, you realise that you needed the commit, you can simply refer to the commit ID and get back the changes that were made in those commits.</p>
</li>
<li><p>Delete the commits from the timeline itself, which removes them from the history entirely, as if they never existed. This is the option we will be looking at in detail in the next section.</p>
</li>
</ol>
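<p>Option 1 can be reproduced in a throwaway repository (file names and messages below are made up for illustration): the bad commit stays in the history, and the file content is restored by a new revert commit.</p>

```shell
# Demonstrate option 1: revert keeps history and restores the old content.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "Time Traveller"
echo "good" > notes.txt
git add notes.txt && git commit -qm "good change"
echo "bad" >> notes.txt
git add notes.txt && git commit -qm "bad change"
# Revert the latest commit; a new "revert" commit is appended to history
git revert --no-edit HEAD >/dev/null
count=$(git rev-list --count HEAD)   # 3 commits: original two + the revert
content=$(cat notes.txt)             # file content is back to "good"
```

<p>Note that the reverted commit remains reachable in the history, which is exactly the advantage described above.</p>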
<h2 id="heading-git-reset">Git RESET</h2>
<p>As mentioned above, we have the option of deleting the commits themselves. Git actually supports three reset modes (soft, mixed and hard); here we focus on the two most relevant ones:</p>
<ol>
<li><p><strong>Soft Reset:</strong></p>
<ul>
<li><p><code>git reset --soft &lt;commit&gt;</code></p>
</li>
<li><p>A soft reset does not modify the working directory or staging area. It only moves the branch pointer to the specified commit, effectively "undoing" the commits that came after that commit.</p>
</li>
<li><p>The changes from the undone commits are retained in the staging area. You can then make further modifications and create a new commit that incorporates both the retained changes and your new modifications.</p>
</li>
</ul>
</li>
</ol>
<p>    Example:</p>
<pre><code class="lang-bash">    ➜  time-travel-with-git git:(main) git <span class="hljs-built_in">log</span> --oneline
    892665b (HEAD -&gt; main) main-branch commit to merge
    9393d41 Add merge.txt
    b4f560a (soft-delete) Second commit to time travel!
    cc3a96b First stop <span class="hljs-keyword">in</span> time-travel

    ➜  time-travel-with-git git:(main) git reset --soft b4f560a
    //ON SOFT RESET, THE CHANGES MADE AFTER THE b4f560a COMMIT
    //ARE MOVED IN THE STAGING AREA.
    ➜  time-travel-with-git git:(main) ✗ git status
    On branch main
    Changes to be committed:
      (use <span class="hljs-string">"git restore --staged &lt;file&gt;..."</span> to unstage)
            new file:   merge.txt

    //THE TOP TWO COMMITS ARE COMPLETELY REMOVED FROM THE HISTORY.
    ➜  time-travel-with-git git:(main) git <span class="hljs-built_in">log</span> --oneline
    b4f560a (HEAD -&gt; main) Second commit to time travel!
    cc3a96b First stop <span class="hljs-keyword">in</span> time-travel
</code></pre>
<ol start="2">
<li><p><strong>Hard Reset:</strong></p>
<ul>
<li><p><code>git reset --hard &lt;commit&gt;</code></p>
</li>
<li><p>A hard reset, on the other hand, not only moves the branch pointer but also modifies the working directory and staging area to match the specified commit. It discards changes in both the working directory and the staging area, effectively resetting the branch to the specified commit.</p>
</li>
</ul>
</li>
</ol>
<p>    Example:</p>
<pre><code class="lang-bash">    ➜  time-travel-with-git git:(hard-reset) git <span class="hljs-built_in">log</span> --oneline
    892665b (HEAD -&gt; hard-reset, main) main-branch commit to merge
    9393d41 Add merge.txt
    b4f560a (soft-delete) Second commit to time travel!
    cc3a96b First stop <span class="hljs-keyword">in</span> time-travel

    ➜  time-travel-with-git git:(hard-reset) git reset --hard b4f560a
    HEAD is now at b4f560a Second commit to time travel!

    //HARD RESET COMPLETELY DELETES THE COMMITS FROM THE HISTORY. 
    //THE CHANGES ARE LOST FOREVER
    ➜  time-travel-with-git git:(hard-reset) git status
    On branch hard-reset
    nothing to commit, working tree clean

    //THE COMMITS ARE LOST FOREVER
    ➜  time-travel-with-git git:(hard-reset) git <span class="hljs-built_in">log</span> --oneline
    b4f560a (HEAD -&gt; hard-reset, soft-delete) Second commit to time travel!
    cc3a96b First stop <span class="hljs-keyword">in</span> time-travel
</code></pre>
<p>    It is evident from the above explanation that a hard reset causes irrecoverable damage. But is it really irrecoverable for someone who is a time-traveller? Aren't we Git time-travellers?</p>
<h2 id="heading-reflog">Reflog</h2>
<p>"reflog" (reference log) is a mechanism that records the history of changes to Git references in the local repository. References in Git include branches (<code>refs/heads/</code>), remote branches (<code>refs/remotes/</code>), and tags (<code>refs/tags/</code>). The reflog is particularly useful for recovering lost commits, branches, or other changes. The reflog entries have a limited lifetime, and older entries are eventually pruned by Git's garbage collection mechanism. The default expiration time is 90 days, but it can be configured.</p>
<p>Let us see what the reflog shows us:</p>
<pre><code class="lang-bash">➜  time-travel-with-git git:(hard-reset) git reflog
b4f560a (HEAD -&gt; hard-reset, soft-delete) HEAD@{0}: reset: moving to b4f560a
892665b (main) HEAD@{1}: checkout: moving from main to hard-reset
892665b (main) HEAD@{2}: checkout: moving from soft-delete to main
b4f560a (HEAD -&gt; hard-reset, soft-delete) HEAD@{3}: reset: moving to HEAD
b4f560a (HEAD -&gt; hard-reset, soft-delete) HEAD@{4}: reset: moving to b4f560a
892665b (main) HEAD@{5}: checkout: moving from main to soft-delete
892665b (main) HEAD@{6}: checkout: moving from soft-delete to main
b4f560a (HEAD -&gt; hard-reset, soft-delete) HEAD@{7}: reset: moving to HEAD
b4f560a (HEAD -&gt; hard-reset, soft-delete) HEAD@{8}: reset: moving to b4f560a
892665b (main) HEAD@{9}: checkout: moving from main to soft-delete
892665b (main) HEAD@{10}: commit: main-branch commit to merge
9393d41 HEAD@{11}: checkout: moving from merge-branch to main
adb0db6 (merge-branch) HEAD@{12}: commit: Merge-branch commit to merge
9393d41 HEAD@{13}: checkout: moving from main to merge-branch
9393d41 HEAD@{14}: commit: Add merge.txt
b4f560a (HEAD -&gt; hard-reset, soft-delete) HEAD@{15}: checkout: moving from timeline1 to main
0c8e8d9 (timeline1) HEAD@{16}: checkout: moving from main to timeline1
b4f560a (HEAD -&gt; hard-reset, soft-delete) HEAD@{17}: checkout: moving from timeline1 to main
0c8e8d9 (timeline1) HEAD@{18}: checkout: moving from 01b3b247a53a68b399494b392003c81d0ea63087 to timeline1
01b3b24 HEAD@{19}: checkout: moving from timeline1 to 01b3b247a53a68b399494b392003c81d0ea63087
0c8e8d9 (timeline1) HEAD@{20}: checkout: moving from main to timeline1
b4f560a (HEAD -&gt; hard-reset, soft-delete) HEAD@{21}: checkout: moving from timeline1 to main
0c8e8d9 (timeline1) HEAD@{22}: commit: Second commit on timeline1
01b3b24 HEAD@{23}: commit: First commit on timeline1
b4f560a (HEAD -&gt; hard-reset, soft-delete) HEAD@{24}: checkout: moving from main to timeline1
b4f560a (HEAD -&gt; hard-reset, soft-delete) HEAD@{25}: commit: Second commit to time travel!
cc3a96b HEAD@{26}: commit (initial): First stop <span class="hljs-keyword">in</span> time-travel
</code></pre>
<p>Woah! There's a lot in here. But what we are particularly interested in is the latest commit that we had, which we deleted. It is <code>892665b</code>. Here comes the interesting recovery step:</p>
<pre><code class="lang-bash">➜  time-travel-with-git git:(hard-reset) git checkout -b recover-from-hard-reset 892665b
Switched to a new branch <span class="hljs-string">'recover-from-hard-reset'</span>
➜  time-travel-with-git git:(recover-from-hard-reset) git status
On branch recover-from-hard-reset
nothing to commit, working tree clean
➜  time-travel-with-git git:(recover-from-hard-reset) git <span class="hljs-built_in">log</span> --oneline
892665b (HEAD -&gt; recover-from-hard-reset, main) main-branch commit to merge
9393d41 Add merge.txt
b4f560a (soft-delete, hard-reset) Second commit to time travel!
cc3a96b First stop <span class="hljs-keyword">in</span> time-travel
</code></pre>
<p>Congratulations, we have magically recovered everything!</p>
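<p>The whole rescue can be replayed end-to-end in a scratch repository (names below are illustrative); <code>HEAD@{1}</code> is the reflog entry from just before the reset:</p>

```shell
# Hard-reset a commit away, then bring it back from the reflog.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "Time Traveller"
echo one  > f.txt; git add f.txt; git commit -qm "first"
echo two >> f.txt; git add f.txt; git commit -qm "second"
git reset -q --hard HEAD~1            # "second" vanishes from git log
after=$(git rev-list --count HEAD)    # only "first" is left
lost=$(git rev-parse 'HEAD@{1}')      # the reflog still knows the old HEAD
git checkout -q -b recovered "$lost"  # branch off the "lost" commit
recovered=$(git rev-list --count HEAD)
lines=$(grep -c . f.txt)              # both lines of f.txt are back
```

<p>The recovery works because the hard reset only moved the branch pointer; the commit object itself was never deleted, and the reflog kept a reference to it.</p>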
<h2 id="heading-conclusion">Conclusion</h2>
<p>However, do remember that the git reflog is only available locally, i.e. only on your machine, and its entries have a limited lifetime. So it is always better to be cautious with hard resets. But just in case you ever mess up, remember that you are an expert git time-traveller. Recover, travel and build!</p>
]]></content:encoded></item><item><title><![CDATA[Git Internals: Index and Merging Timelines]]></title><description><![CDATA[Now we know that there can be multiple timelines. These timelines can represent different features that you are working on. Ultimately, you want these features to be consolidated in a single branch.

Note: In this article we are not going to cover th...]]></description><link>https://tech.amanmulani.com/git-internals-index-and-merging-timelines</link><guid isPermaLink="true">https://tech.amanmulani.com/git-internals-index-and-merging-timelines</guid><category><![CDATA[Git]]></category><category><![CDATA[version control]]></category><category><![CDATA[version control systems]]></category><category><![CDATA[vcs]]></category><category><![CDATA[git branching]]></category><dc:creator><![CDATA[Aman Mulani]]></dc:creator><pubDate>Sun, 31 Dec 2023 12:11:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/H7LxvEmVZnE/upload/6294daec52c74f925a0aec8420015cf1.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Now we know that there can be multiple timelines. These timelines can represent different features that you are working on. Ultimately, you want these features to be consolidated in a single branch.</p>
<blockquote>
<p>Note: In this article we are not going to cover the types of merging like fast-forward merge, three-way merge, etc. We also won't cover concepts like squash merge or rebasing. There are plenty of great resources available for these concepts.</p>
<p>What we do cover in this article is the rationale behind how git merges branches and identifies what files have changed.</p>
</blockquote>
<p>Before getting started with how git does merging, let us start with something much simpler: how does git identify which files have changed when we stage changes?</p>
<h2 id="heading-index">Index</h2>
<p>Git identifies changes in the working directory by maintaining an index, also known as the staging area. The index acts as an intermediary between our working directory and the repository, and it helps Git determine which changes should be included in the next commit. When we modify a file in our working directory, Git detects these changes, but they are not automatically committed. Instead, Git relies on the index to keep track of the staged changes. The index essentially serves as a snapshot of what the next commit will look like, recording the state of each tracked file.</p>
<p>Git recognizes files that have changed even when they are not staged by utilizing the information stored in the index file within the .git directory. The index maintains a record of the files and their respective states, helping Git efficiently track modifications. When we run commands like <code>git status</code> or <code>git diff</code>, Git consults the index to compare the current state of the files in our working directory with the last committed state. This allows Git to highlight any differences and provide insights into which files have been modified but not yet staged. By leveraging the index, Git ensures a systematic and organized approach to managing changes, allowing developers to selectively stage and commit modifications.</p>
<p>Let us have a look at our <code>index</code> file present in the <code>.git</code> folder.</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(main) cat .git/index
DIRe??[
?k?e??[
?k?'????︚A?
g?Ͷ??imW?first_file.txte??82V
                             6e??82V
                                    6'???????WLJ?o96?
                                                     ???Bt+folder1/fourth_file.txte??m?gGe??m?gG'??????TK????A2C?n?-[V??%folder1/sub-folder11/seventh_file.txte??&amp;12`e??&amp;12`'?h???rA?#z?cYM?P!^??#folder1/sub-folder11/third_file.txte??MZ)De??MZ)D'??????????;gv?[[?;??1E?folder2/fifth_file.txte??\6|??e??\6|??'??????R8???ÉM{??$u?5?ׯ?folder3/sixth_file.txte?CZ    ??Ve?CZ    ??V(?b????ez??xe?&lt;}?+n?n?x    merge.txte??5Ԇ8e??5Ԇ8'???????z?0?s???I?,_h*?sDsecond_file.txtTREE?8 3
???
   ??݋?`g#??fI?\cfolder13 1
D?@n4I???*N????sub-folder112 0
FJ"^?????6}???$?\?T?folder21 0
?2)A????s
         ?֞??32?folder31 0
 C?)?jƦz???%
</code></pre>
<p>The contents of the index file are not human-readable, and there isn't a direct way to decode its contents. However, we can see the current files being tracked in the index by using the following command: <code>git ls-files --stage</code>.</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(main) git ls-files --stage
100644 efb89ab841a90e0a67fbcdb6cad8696d0357e715 0    first_file.txt
100644 c6c3571f4c4a10c36f3936bc0be783d2f742742b 0    folder1/fourth_file.txt
100644 84cfff544bbf9ea894413243bc6e882d5b56afb5 0    folder1/sub-folder11/seventh_file.txt
100644 72418f23007afa1d63594d8e50215ed51b84ec03 0    folder1/sub-folder11/third_file.txt
100644 f5a688aed83b1c6776885b5b05c13ba6ca3145d1 0    folder2/fifth_file.txt
100644 a952388abaf3c3894d7bbab02475a0359dd7afed 0    folder3/sixth_file.txt
100644 e965047ad7c57865823c7d992b1d046ea66edf78 0    merge.txt
100644 e7e57ac6308173dec7dc1749b72c5f682aa67344 0    second_file.txt
</code></pre>
<p>As you see, the current index contains all the files of our working repository. Now let us examine any of the object IDs that are shown in the output.</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(main) git cat-file -t efb89ab841a90e0a67fbcdb6cad8696d0357e715
blob
➜  time-travel-with-git git:(main) git cat-file -p efb89ab841a90e0a67fbcdb6cad8696d0357e715
Hello, for the first time!
</code></pre>
<p>Each of the objects shown in the previous output is a reference to the blob of a file in the working directory. Using these references, git can identify whether the files have been updated or not.</p>
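<p>You can verify this relationship yourself in a scratch repository (the file name below is made up): the blob id recorded in the index for a staged file is exactly the id <code>git hash-object</code> computes for that file's content.</p>

```shell
# The index entry for a staged file points at the blob of its content.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
echo "Hello, for the first time!" > first_file.txt
git add first_file.txt
staged=$(git ls-files --stage | awk '{print $2}')  # blob id recorded in .git/index
direct=$(git hash-object first_file.txt)           # blob id of the raw content
```

<p>If the two ids match, the file is unchanged since staging; any edit to the file would make its content hash diverge from the index entry, which is how git spots modifications cheaply.</p>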
<blockquote>
<p>Bonus:<br />Shouldn't there be a way to see the changes introduced in commits if there is a way to compare unstaged / staged changes?<br />Well, there is another command in git, <code>git show</code>, that displays the information about a specific commit, including the changes introduced in that commit. When we run <code>git show</code> followed by a commit hash or a branch name, Git internally compares the specified commit with its parent commit (the previous state). It then shows the differences, or changes, between the two commits.</p>
</blockquote>
<p>Now let us move on to the merging of branches.</p>
<h2 id="heading-merging">Merging</h2>
<p>The process of combining the changes from different timelines into a single timeline is called merging. We know that git knows what changes were introduced by the commits in a given timeline. Now let us understand how git decides which files to update, and what happens when the same file is updated in both timelines, or worse, when the same content is updated in both timelines.</p>
<p>When merging two branches, Git identifies the changes made on each branch since they diverged. Git uses a three-way merge algorithm to combine the changes from the common ancestor (base commit) to the tips of the two branches being merged. Let us try to understand, step by step, how a three-way merge takes place:</p>
<ol>
<li><p>Git identifies the common ancestor commit of the branches being merged. This commit represents the last point where the branches share the same commit history.</p>
</li>
<li><p>Now git identifies the changes made on each branch independently. This involves comparing the common ancestor with the tips of the two branches to find the modifications made on each side.</p>
</li>
<li><p>The three-way merge combines the changes from the common ancestor to the branch tips. It considers the changes made on both branches and attempts to automatically reconcile them.</p>
<ol>
<li><p>If the changes made to the files are the same on both branches (even though the commits are different), Git recognizes this as a non-conflicting scenario.</p>
</li>
<li><p>Git performs a content-level merge, applying the shared changes to the files from both branches.</p>
</li>
<li><p>If Git encounters conflicting changes (i.e., changes made on the same lines of the same file), it marks those files as conflicted. Git requires manual intervention to resolve these conflicts. In this case, git can no longer be in the autopilot mode. Here's where the experience of the time-traveller is required.</p>
</li>
</ol>
</li>
<li><p>The result of the merge is a new commit that represents the combination of changes from both branches. This new commit will have multiple parent commits, indicating the branches that were merged.</p>
</li>
</ol>
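<p>The steps above can be traced with git's own plumbing in a scratch repository (branch names are made up); <code>git merge-base</code> is how you can ask git for step 1's common ancestor yourself:</p>

```shell
# Trace a three-way merge: common ancestor, independent changes, merge commit.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "Time Traveller"
start=$(git symbolic-ref --short HEAD)       # default branch (main or master)
echo base > a.txt; git add a.txt; git commit -qm "base"
git checkout -qb feature
echo feature > b.txt; git add b.txt; git commit -qm "feature work"
git checkout -q "$start"
echo more > c.txt; git add c.txt; git commit -qm "mainline work"
ancestor=$(git merge-base "$start" feature)  # step 1: last shared commit
base=$(git rev-parse "$start~1")             # ...which is the "base" commit
git merge -q --no-edit feature >/dev/null    # steps 2-4: non-conflicting merge
parents=$(git log -1 --pretty=%P | wc -w)    # the merge commit has two parents
```

<p>Because the two branches touched different files, git reconciles them automatically; the final merge commit lists both branch tips as its parents, exactly as step 4 describes.</p>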
<p>During this process, Git doesn't necessarily identify "changed files" directly or compare the differences in the content of the files. Instead, it looks at changes in terms of commits and the differences introduced in those commits. The changes are then applied to the files, and the resulting state is what you see after completing the merge.</p>
<h2 id="heading-summary">Summary</h2>
<p>So, now, if you come across scenarios where some changes unintentionally show up in the PRs that you have raised against your parent / other branches, do remember that git is not only about comparing content, it is also about comparing commits and where those commits are placed in the history of our timeline!</p>
<p>In the next article, we will be having a look at how to recover from the disastrous event of resetting our timeline, and why the recovery is possible!</p>
]]></content:encoded></item><item><title><![CDATA[Git Internals: Branches, Logs, Refs & The Multiverse of Madness]]></title><description><![CDATA[In the previous articles, we understood how git stores our data in the .git repository. Now that we are familiar with it, we can start with time-travelling.
But, wait! Did I mention that we can traverse through different timelines of code in the gran...]]></description><link>https://tech.amanmulani.com/git-internals-branches-logs-refs-the-multiverse-of-madness</link><guid isPermaLink="true">https://tech.amanmulani.com/git-internals-branches-logs-refs-the-multiverse-of-madness</guid><category><![CDATA[Git]]></category><category><![CDATA[vcs]]></category><category><![CDATA[version control]]></category><category><![CDATA[version control systems]]></category><dc:creator><![CDATA[Aman Mulani]]></dc:creator><pubDate>Sun, 31 Dec 2023 09:39:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/OMj98kMI2dE/upload/5dde6d2ab78ca839d31c192681719e8e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous articles, we understood how git stores our data in the .git repository. Now that we are familiar with it, we can start with time-travelling.</p>
<p>But, wait! Did I mention that we can traverse through different timelines of code in the grand scheme of the multiverse of your code? My bad, I didn't! So let us get started with it!</p>
<blockquote>
<p>Note: You might already be aware of git branches and the related operations; however, again, the intent of this article is not to impart knowledge on the basics of branching, but rather on how git internally maintains these branches!</p>
</blockquote>
<h2 id="heading-git-init-the-start-of-everything"><code>git init</code>, The start of everything...</h2>
<p>When git is initialized in the working directory, it automatically creates a default timeline for you, whose history is maintained by git. The default timeline is known as the <code>main</code> timeline; formerly, it was known as the <code>master</code> timeline. All the snapshots of your code are saved in this timeline, and you can move back in time within it. So what exactly are these timelines?</p>
<p>Now, enter the fascinating world of Git branches. A branch, in essence, is a divergent timeline in the vast multiverse of your code. It allows developers to create alternate realities where they can experiment, innovate, and make changes without affecting the main timeline. Picture each branch as a parallel universe, coexisting independently until the developer decides to merge them back into the main timeline.</p>
<p>To bring this into perspective: when you initialize git in your working directory, git creates a default branch called <code>main</code>. This branch consists of all your commits (code snapshots). From this main branch, you can cut off other timelines for development purposes and later merge them back. When you merge two timelines, git automatically brings in commits from both timelines; you might need to intervene in case of conflicts!</p>
<p>So how does git maintain the branches internally?</p>
<h2 id="heading-refs-where-timelines-are-defined"><code>refs</code>: Where timelines are defined</h2>
<p><em>A branch is nothing but a reference to the head of a sequence of commits.</em> Since you can be working on multiple features simultaneously, you may have multiple branches, and the branch itself is not a code snapshot, just a reference to the latest commit of that branch. Then how can we move back in time in that branch? Well, if you remember, each commit object records its parent commit, which enables git to easily traverse backwards within a specific timeline as well.</p>
<p>So where does git store these references? Yes, you guessed it right, it stores it in the <code>.git/refs</code> folder. Let us inspect the contents of this folder.</p>
<pre><code class="lang-plaintext">➜  .git git:(main) cd refs 
➜  refs git:(main) tree
.
├── heads
│   └── main
└── tags

3 directories, 1 file
</code></pre>
<p>Continuing the example from our previous article, we have only one branch, i.e. the main branch. As shown above, the reference of branches is stored in the <code>.git/refs/heads</code> folder. The name of the branch is the name of the file containing the reference to the top commit of that branch. Let us inspect the contents of the <code>main</code> file in the folder mentioned above.</p>
<pre><code class="lang-plaintext">➜  refs git:(main) cat heads/main
b4f560a353a2202e9bb18939c7c3ccbf6e479639
➜  refs git:(main) git cat-file -p b4f560a353a2202e9bb18939c7c3ccbf6e479639
tree 408c5d6e7750f028dd67dd5b25cf98076e400638
parent cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5
author Aman &lt;dummy-email@gmail.com&gt; 1703971554 +0530
committer Aman &lt;dummy-email@gmail.com&gt; 1703971554 +0530

Second commit to time travel!
</code></pre>
<p>As seen above, the main file contains only one commit, which on inspecting, turns out to be the latest commit on the main branch. Now, let us create a new branch.</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(main) git checkout -b timeline1 main
Switched to a new branch 'timeline1'
</code></pre>
<p>Now let us examine the contents of the <code>refs</code> folder.</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(timeline1) ✗ cd .git/refs                             
➜  refs git:(timeline1) tree
.
├── heads
│   ├── main
│   └── timeline1
└── tags

3 directories, 2 files
</code></pre>
<p>As seen above, a new file gets added to the heads folder, which has the same name as that of the new branch name. Now let us have a look at timeline1's content.</p>
<pre><code class="lang-plaintext">➜  refs git:(timeline1) cat heads/timeline1
b4f560a353a2202e9bb18939c7c3ccbf6e479639
➜  refs git:(timeline1) git cat-file -p b4f560a353a2202e9bb18939c7c3ccbf6e479639
tree 408c5d6e7750f028dd67dd5b25cf98076e400638
parent cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5
author Aman &lt;dummy-email@gmail.com&gt; 1703971554 +0530
committer Aman &lt;dummy-email@gmail.com&gt; 1703971554 +0530

Second commit to time travel!
</code></pre>
<p>The above commit message is the same as that of the latest commit on the main branch. Why is that so?<br />If you look carefully at how we created our branch, <code>git checkout -b timeline1 main</code>, we created it from the main branch. This means that our new timeline starts from the latest commit of main and contains all the commits of the main timeline up to the point where timeline1 was cut off. From this point onwards, the timelines are separate and can have separate commits until they are merged.</p>
<p>Let us add a few more commits to the timeline1 branch.</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(timeline1) touch timeline1-file.txt
➜  time-travel-with-git git:(timeline1) ✗ echo "Hello from timeline1! " &gt;&gt; timeline1-file.txt 
➜  time-travel-with-git git:(timeline1) ✗ git add .
➜  time-travel-with-git git:(timeline1) ✗ git commit -m "First commit on timeline1"
[timeline1 01b3b24] First commit on timeline1
 1 file changed, 1 insertion(+)
 create mode 100644 timeline1-file.txt
➜  time-travel-with-git git:(timeline1) touch file2-timeline1.txt
➜  time-travel-with-git git:(timeline1) ✗ echo "Hello from file2 timeline1! " &gt;&gt; file2-timeline1.txt 
➜  time-travel-with-git git:(timeline1) ✗ git add .
➜  time-travel-with-git git:(timeline1) ✗ git commit -m "Second commit on timeline1"          
[timeline1 0c8e8d9] Second commit on timeline1
 1 file changed, 1 insertion(+)
 create mode 100644 file2-timeline1.txt
</code></pre>
<p>Let us inspect the contents of the timeline1 branch file in the refs folder.</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(timeline1) cat .git/refs/heads/timeline1 
0c8e8d91dce2222d669776aea7af55a8c27685fa
➜  time-travel-with-git git:(timeline1) git cat-file -p 0c8e8d91dce2222d669776aea7af55a8c27685fa
tree 71fa913d1f22a71e1718b0df4dae7ca63759a088
parent 01b3b247a53a68b399494b392003c81d0ea63087
author Aman &lt;dummy-email@gmail.com&gt; 1704010558 +0530
committer Aman &lt;dummy-email@gmail.com&gt; 1704010558 +0530

Second commit on timeline1
</code></pre>
<p>As seen here, the head is now pointing to the latest commit on the timeline1 branch. Now if you want to move to another branch, git simply looks up the branch name in this refs folder and moves to the commit objectId present in that file.</p>
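<p>With git's default loose-ref storage, you can see this for yourself in a scratch repository (names below are illustrative): creating a branch just writes a small file whose content is the current commit id. (A freshly created branch is always a loose file; older refs may get moved into <code>.git/packed-refs</code>.)</p>

```shell
# A branch is just a file under .git/refs/heads containing a commit id.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "Time Traveller"
echo hi > f.txt; git add f.txt; git commit -qm "first stop"
git branch timeline1                        # no checkout needed: just writes a ref
ref_file=$(cat .git/refs/heads/timeline1)   # raw content of the ref file
head_id=$(git rev-parse HEAD)               # the commit both branches point at
```

<p>This is also why branch creation in git is effectively free: no code is copied, only a 41-byte file is written.</p>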
<p>Good progress! Down goes the theory of timelines in git!</p>
<h2 id="heading-logs-the-registry-holding-all-commits-refs">logs: The registry holding all ref movements</h2>
<p>What if you want to have a look at all the commits made to date in a single go? With our current model, we would traverse backwards from the current commit to the very first one, since each commit object records its parent. This is, in fact, exactly what git does when it walks history, and it works well because commit objects are small and cheap to read. But it leaves one gap: a commit that no branch points to any more, say after a reset or a branch deletion, can never be reached by walking parent pointers. So what do we do now?</p>
<p>Git's answer is the <em>reflog</em>: an append-only journal that records every movement of every ref. Each commit, checkout, reset or merge appends one line with the old and new commit ids, so the full trail of where a branch has ever pointed can be read back directly, without traversing the object graph, and "lost" commits can be recovered from it. So where does Git store this information?</p>
<p>It resides inside the <code>.git/logs</code> folder. Let us examine the contents of the logs folder.</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(timeline1) cd .git/logs 
➜  logs git:(timeline1) tree
.
├── HEAD
└── refs
    └── heads
        ├── main
        └── timeline1

3 directories, 3 files
</code></pre>
<p>As seen above, we have the HEAD file and the refs folder. Let us first have a look at the heads folder. The file names clearly correspond to the branch names. Let us inspect the contents of these files.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1704012511601/393f0793-1174-4a48-9297-c524b9c228b7.png" alt class="image--center mx-auto" /></p>
<p>I have trimmed the output (and shortened the hashes) to keep it readable for the time being:</p>
<pre><code class="lang-plaintext">➜  logs git:(timeline1) cat refs/heads/main
0000 cc3a commit (initial): First stop in time-travel
cc3a b4f5 commit: Second commit to time travel!
➜  logs git:(timeline1) cat refs/heads/timeline1 
0000 b4f5 branch: Created from main
b4f5 01b3 commit: First commit on timeline1
01b3 0c8e commit: Second commit on timeline1
</code></pre>
<p>As seen from the above excerpt, the files contain the entire history of the given timeline (branch). The main branch has only two commits, whereas the timeline1 file clearly shows that the branch was cut off from the main branch, after which two commits were made. Git provides a command to check the history of the current timeline: <code>git log</code>.</p>
<pre><code class="lang-plaintext">➜  logs git:(timeline1) git log 

commit 0c8e8d91dce2222d669776aea7af55a8c27685fa (HEAD -&gt; timeline1)
Author: Aman
Date:   Sun Dec 31 13:45:58 2023 +0530

    Second commit on timeline1

commit 01b3b247a53a68b399494b392003c81d0ea63087
Author: Aman
Date:   Sun Dec 31 13:45:09 2023 +0530

    First commit on timeline1

commit b4f560a353a2202e9bb18939c7c3ccbf6e479639 (main)
Author: Aman
Date:   Sun Dec 31 02:55:54 2023 +0530

    Second commit to time travel!

commit cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5
Author: Aman
Date:   Sun Dec 31 02:26:23 2023 +0530

    First stop in time-travel
</code></pre>
<p>Now let us move back to the main timeline, i.e. the <code>main</code> branch and see what <code>git log</code> has in store for us.</p>
<pre><code class="lang-plaintext">➜  logs git:(main) git log

commit b4f560a353a2202e9bb18939c7c3ccbf6e479639 (HEAD -&gt; main)
Author: AmanMulani &lt;amanmulani369@gmail.com&gt;
Date:   Sun Dec 31 02:55:54 2023 +0530

    Second commit to time travel!

commit cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5
Author: AmanMulani &lt;amanmulani369@gmail.com&gt;
Date:   Sun Dec 31 02:26:23 2023 +0530

    First stop in time-travel
(END)
</code></pre>
<p>As seen here, git shows the history of the current timeline. In the first line, you see that HEAD is pointing to the main branch. Git maintains the HEAD reference to mark which timeline (and hence which commit) you are currently on. This reference is maintained in the <code>.git/HEAD</code> file. Let us examine the contents of the <code>HEAD</code> file:</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(main) cat .git/HEAD
ref: refs/heads/main
➜  time-travel-with-git git:(main) git checkout timeline1
Switched to branch 'timeline1'
➜  time-travel-with-git git:(timeline1) cat .git/HEAD
ref: refs/heads/timeline1
</code></pre>
<p>As seen from the snippet above, the <code>HEAD</code> file always contains the reference to the current branch. (Although in a detached state, <code>HEAD</code> can also point to a commit ID). Using the combination of techniques mentioned above and some other optimizations, git can perform a look-up of the history of any timeline efficiently.</p>
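<p>The resolution rule just described is small enough to sketch. Here is a hypothetical Python version (toy temp-directory layout; hashes taken from the transcripts above): if <code>.git/HEAD</code> starts with <code>ref:</code> we follow the named ref file, otherwise it is a detached HEAD holding a raw commit id.</p>

```python
import tempfile
from pathlib import Path

def resolve_head(repo: Path) -> str:
    """Return the commit id HEAD ultimately points to."""
    head = (repo / ".git" / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # Symbolic ref: HEAD names a branch file, follow it.
        ref_path = repo / ".git" / head[len("ref: "):]
        return ref_path.read_text().strip()
    return head  # detached HEAD: already a raw commit id

# Toy layout mirroring the snippets above.
repo = Path(tempfile.mkdtemp())
(repo / ".git" / "refs" / "heads").mkdir(parents=True)
(repo / ".git" / "HEAD").write_text("ref: refs/heads/timeline1\n")
(repo / ".git" / "refs" / "heads" / "timeline1").write_text(
    "0c8e8d91dce2222d669776aea7af55a8c27685fa\n"
)
print(resolve_head(repo))
# prints 0c8e8d91dce2222d669776aea7af55a8c27685fa
```

<p>This two-step indirection (HEAD to branch, branch to commit) is why committing only has to rewrite one small branch file.</p>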
<h2 id="heading-summary">Summary</h2>
<p>Ahhh! Finally! We have understood how timelines, aka branches, are maintained internally in Git, and how Git looks up the history of any given timeline. This is particularly useful when you get into situations where basic git commands can no longer save your day!</p>
<p>With this knowledge in your arsenal, in the upcoming articles we will focus on how timelines are merged, resulting in the creation of something beautiful, or sometimes, something very, very terrible!</p>
]]></content:encoded></item><item><title><![CDATA[Git Internals: Blobs, Commits & Trees]]></title><description><![CDATA[In the previous article, we discussed the approach taken by Git to save code snapshots. Now it is time to get our hands dirty and try using the time machine. In this article, we will see how git works behind the scenes.

Note: Although we delve into ...]]></description><link>https://tech.amanmulani.com/git-internals-blobs-commits-trees</link><guid isPermaLink="true">https://tech.amanmulani.com/git-internals-blobs-commits-trees</guid><category><![CDATA[Git]]></category><category><![CDATA[vcs]]></category><category><![CDATA[version control]]></category><category><![CDATA[version control systems]]></category><dc:creator><![CDATA[Aman Mulani]]></dc:creator><pubDate>Sat, 30 Dec 2023 22:08:32 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/05HLFQu8bFw/upload/4934a113c9c848481ecfc6183d8f95d0.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous article, we discussed the approach taken by Git to save code snapshots. Now it is time to get our hands dirty and try using the time machine. In this article, we will see how git works behind the scenes.</p>
<blockquote>
<p>Note: Although we delve into the basics of git internals, it is expected that the reader is familiar with the basic git concepts. The intent of the article is not to teach the basics of git, but to understand how things work behind the scenes, and to find the rationale for why git is the way it is.</p>
</blockquote>
<h2 id="heading-three-tiered-architecture-of-git">Three-Tiered Architecture of Git</h2>
<p>Let us first have a look at the workflow that git follows to save the code snapshots. Let us suppose that you have a couple of hundred files in the folder of your concern (the folder where your code lies), also known as the working directory. Git is already initialized in this working directory. Git creates its database inside the working directory itself in the <code>.git</code> folder. This folder holds the code snapshots and is also called the git repository or simply the repository.</p>
<p>Out of these ~200 files, you have updated thirty. Ideally, one would expect git to take a snapshot of the current state of the code by executing just one command. In our context, the process of taking a snapshot of the code is known as committing the code to the repository. Straightforward, right?</p>
<p>Now think of a scenario where, although you have made changes to 30 files, you are confident about only 14 of them, while you want to keep working on the remaining 16. You want a snapshot of the code where only those 14 files have changed, and the remaining 16 are untouched. This is a tricky problem to solve if there is no intermediary layer between the changes the developer has made in the working directory and committing those changes to create a snapshot.</p>
<p>To address this problem, Git uses a three-tiered architecture. The first step remains the same, i.e., the developer makes changes in the working directory. Once the changes are made, instead of directly committing them (i.e. taking a snapshot of all the changed files), the developer first selects only the files they want to commit. This step is known as staging. Once the intended files are staged, they can be committed. Here's what the three-tiered architecture looks like:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703962165187/6095de23-f2fb-4a12-9d16-700e700f476a.png" alt=" the three-tiered architecture of git" class="image--center mx-auto" /></p>
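<p>The three tiers can be modelled in a few lines. The sketch below is a deliberately toy Python model (all names are made up; real git stores none of this as plain dicts), showing the one property that matters: a commit snapshots the <em>index</em>, not the working directory.</p>

```python
# Toy model of git's three tiers: working directory, staging area
# (index), and repository. Purely illustrative.
working_dir = {f"file{i}.txt": f"edited content {i}" for i in range(30)}

index = {}       # the staging area
repository = []  # list of committed snapshots

def add(paths):
    """Stage only the selected files, copying them into the index."""
    for path in paths:
        index[path] = working_dir[path]

def commit(message):
    """Snapshot exactly what is staged, not the whole working dir."""
    repository.append({"message": message, "files": dict(index)})

# Stage 14 of the 30 changed files, leave the other 16 untouched.
add([f"file{i}.txt" for i in range(14)])
commit("Snapshot of the 14 files I am sure about")

print(len(repository[-1]["files"]))  # 14: only staged files are committed
```

<p>The 16 unstaged files stay in the working directory, ready for a later <code>add</code> and a commit of their own.</p>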
<h2 id="heading-demystifying-the-git-folder">Demystifying the .git folder</h2>
<p>We now know the workflow git follows to store a code snapshot, i.e. the workflow to commit code. But where is the snapshot stored? Yes, you guessed it right: in the <code>.git</code> folder. The entire mystery lies in that folder. It is where the magic happens!</p>
<h3 id="heading-objects">Objects</h3>
<p><code>git init</code> is used to initialize git in any folder. Once git is initialized in that folder (repository), a new folder <code>.git</code> is added to the root of the folder where the <code>git init</code> command was executed.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703965206010/a48c82bf-e12b-4313-85c8-79546dc4822f.png" alt class="image--center mx-auto" /></p>
<p>Now, let us go to the place where magic happens, the <code>.git</code> folder</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703965769158/3ef9f687-37bf-4ae2-91c6-a2ac57612d8c.png" alt class="image--center mx-auto" /></p>
<p>The <code>.git</code> folder has a bunch of sub-folders, out of which the one that we are focused on in this article will be the <code>objects</code> folder. All the code snapshots are stored in the objects folder. The entire code-related data in git is stored in the form of objects. In the subsequent sections, we discuss in detail the different types of objects in git and their significance.</p>
<h3 id="heading-blobs">Blobs</h3>
<p>Now let us create a new file in our working directory.</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(main) touch first_file.txt                                                                                    
➜  time-travel-with-git git:(main) ✗ echo "Hello" &gt;&gt; first_file.txt
</code></pre>
<p>We see that by simply adding a file in the working directory, nothing changes in the git repository.</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(main) ✗ cd .git/objects 
➜  objects git:(main) tree
.
├── info
└── pack

3 directories, 0 files
</code></pre>
<p>Therefore, we can safely conclude that no changes are automatically synced up in the git repository. Let us use the <code>git status</code> command (use this command in the root folder, not in the <code>.git</code> folder) to derive some more insights.</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(main) ✗ git status
On branch main

No commits yet

Untracked files:
  (use "git add &lt;file&gt;..." to include in what will be committed)
    first_file.txt

nothing added to commit but untracked files present 
(use "git add" to track)
</code></pre>
<p>As seen in the message above, to track the file via git, we need to stage the file. Once we stage the file using the <code>git add first_file.txt</code> command, let us examine the contents of the <code>objects</code> folder once again. Here's the output:</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(main) ✗ git add first_file.txt 
➜  time-travel-with-git git:(main) ✗ cd .git/objects 
➜  objects git:(main) tree
.
├── a6
│   └── cb6ee3f97c51226a1e86bda81f48021ad27584
├── info
└── pack

4 directories, 1 file
</code></pre>
<p>We see a new folder here, which contains a file with a weird long name. What exactly is this?</p>
<p>If you recall, git computes the SHA-1 (also known as the objectId) of the content, prefixed with a small header, and stores the zlib-compressed content under that name. So once we stage the file, git creates a snapshot of it in the <code>.git/objects</code> folder. An interesting thing to note is that git does not store the object under a single 40-character file name. Instead, it takes the first two characters of the objectId as a folder name, and the remaining 38 characters as the name of the file containing the compressed content. Therefore, the objectId is the folder name + file name. This fan-out keeps the number of entries per directory small and enables quicker look-up of objects.</p>
<p>Let us inspect the contents of the file containing the compressed contents</p>
<pre><code class="lang-plaintext">➜  objects git:(main) cd a6
➜  a6 git:(main) cat cb6ee3f97c51226a1e86bda81f48021ad27584 
??W??!Qc?x??%??V^?R????ṃk?X?C51i???.???R{?prə?,???d??Fc????
</code></pre>
<p>As seen above, we cannot infer any meaning from the compressed content. However, git provides a command to recover the raw content from the file: <code>git cat-file</code>. It takes the object ID as input; the <code>-p</code> flag pretty-prints the object's content.</p>
<pre><code class="lang-plaintext">➜  a6 git:(main) git cat-file -p a6cb6ee3f97c51226a1e86bda81f48021ad27584
Hello
</code></pre>
<p>Voila! We are now able to get the raw content back from Git!<br />One question remains: what type of object is this? The <code>-t</code> flag tells us.</p>
<pre><code class="lang-plaintext">➜  a6 git:(main) git cat-file -t a6cb6ee3f97c51226a1e86bda81f48021ad27584
blob
</code></pre>
<p>As seen above, the contents of our files are stored in compressed form in objects known as <em>blobs</em>.</p>
<p>However, as discussed earlier, to store the current snapshot of code, we need to commit the changes to the git repository. Before moving ahead, let us make a few random changes to the existing file and add one more file.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703969224371/cb842b86-74e9-49e7-af27-76eb2e8d8fd9.png" alt class="image--center mx-auto" /></p>
<p>If you inspect the contents of the <code>.git/objects</code> folder, it hasn't changed yet. However, once you stage the files by using the <code>git add</code> command, the content of the objects folder will have changed again.</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(main) ✗ git add .
➜  time-travel-with-git git:(main) ✗ cd .git/objects 
➜  objects git:(main) tree
.
├── a6
│   └── cb6ee3f97c51226a1e86bda81f48021ad27584
├── e7
│   └── e57ac6308173dec7dc1749b72c5f682aa67344
├── ef
│   └── b89ab841a90e0a67fbcdb6cad8696d0357e715
├── info
└── pack

6 directories, 3 files
</code></pre>
<p>One thing that you will quickly notice is that although we have two files, three objects are present in the <code>objects</code> directory. Any idea why?<br />Git creates a blob for every piece of staged content it has not seen before. Staging first_file.txt initially created one blob. Then we modified first_file.txt and added second_file.txt, so staging them created two new blobs, one for each new piece of content. The blob for the original content of first_file.txt remains in the store, which gives us three objects in total.</p>
<h3 id="heading-commit">Commit</h3>
<p>Let us try committing the code.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703969803894/952d425f-7706-4315-a8bf-c92fb2aa09b8.png" alt class="image--center mx-auto" /></p>
<p>Once the changes have been committed, git successfully creates a snapshot of the current state of the code. Let us inspect the objects folder. We see two additional objects:</p>
<pre><code class="lang-plaintext">➜  objects git:(main) tree
.
├── 0d
│   └── 643279d1e62540cb18cb003d9b5e12b46bd992
├── a6
│   └── cb6ee3f97c51226a1e86bda81f48021ad27584
├── cc
│   └── 3a96b5a05c666fca63d7b4acd29ce779ad0ce5
├── e7
│   └── e57ac6308173dec7dc1749b72c5f682aa67344
├── ef
│   └── b89ab841a90e0a67fbcdb6cad8696d0357e715
├── info
└── pack

8 directories, 5 files
</code></pre>
<p>On inspecting the type of object present in the cc folder, we find a new object type: <strong>commit</strong></p>
<pre><code class="lang-plaintext">➜  objects git:(main) cd cc
➜  cc git:(main) git cat-file -t cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5
commit
</code></pre>
<p>Let us try to get more details about this object by using the <code>-p</code> flag:</p>
<pre><code class="lang-plaintext">➜  objects git:(main) git cat-file -p cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5
tree 0d643279d1e62540cb18cb003d9b5e12b46bd992
author Aman &lt;dummy-author-email@gmail.com&gt; 1703969783 +0530
committer Aman &lt;dummy-author-email@gmail.com&gt; 1703969783 +0530

First stop in time-travel
</code></pre>
<p>You can see that the commit object contains a reference to another object known as a tree. The commit object also contains the commit message and the details of the author of the commit.</p>
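<p>The pretty-printed commit is a simple text format: header lines, a blank line, then the message. A small Python parser makes the structure explicit (the sample text is the commit object we just inspected):</p>

```python
def parse_commit(text: str) -> dict:
    """Split `git cat-file -p <commit-id>` output into its parts:
    headers (tree, parent(s), author, committer), then the message."""
    headers, _, message = text.partition("\n\n")
    parsed = {"parents": [], "message": message.strip()}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            parsed["parents"].append(value)  # merges can have several
        else:
            parsed[key] = value
    return parsed

commit_text = """tree 0d643279d1e62540cb18cb003d9b5e12b46bd992
author Aman <dummy-author-email@gmail.com> 1703969783 +0530
committer Aman <dummy-author-email@gmail.com> 1703969783 +0530

First stop in time-travel
"""
parsed = parse_commit(commit_text)
print(parsed["tree"])     # 0d643279d1e62540cb18cb003d9b5e12b46bd992
print(parsed["message"])  # First stop in time-travel
```

<p>Note that this first commit has an empty parents list, which is exactly what distinguishes it from every later commit.</p>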
<h3 id="heading-tree">Tree</h3>
<p>If you carefully observe, the reference of the tree object provided in the commit object is already present in the <code>objects</code> folder (the <code>0d</code> folder in <code>objects/</code>). Let us examine the object.</p>
<pre><code class="lang-plaintext">➜  objects git:(main) git cat-file -p 0d643279d1e62540cb18cb003d9b5e12b46bd992
100644 blob efb89ab841a90e0a67fbcdb6cad8696d0357e715    first_file.txt
100644 blob e7e57ac6308173dec7dc1749b72c5f682aa67344    second_file.txt
</code></pre>
<p>So the tree object contains references to the blob objects of all the files in the working directory. But then, how does git maintain the folder hierarchy of the working directory if the tree only consists of blob objects of individual files? For example, if I add another file, <code>third_file.txt</code>, in a new folder called <code>notes</code> inside my working directory, how will git know that this file belongs to the <code>notes</code> folder and not the root folder?</p>
<p>Let us add a few more files and directories to our working directory and figure out how git handles this use case.</p>
<pre><code class="lang-plaintext">➜  time-travel-with-git git:(main) tree
.
├── first_file.txt
├── folder1
│   ├── fourth_file.txt
│   └── sub-folder11
│       ├── seventh_file.txt
│       └── third_file.txt
├── folder2
│   └── fifth_file.txt
├── folder3
│   └── sixth_file.txt
└── second_file.txt

5 directories, 7 files
</code></pre>
<p>After we stage all the files and commit these changes to the git repository, there will be a lot of objects present in the objects folder. To keep it simple, let us focus directly on the commit object:</p>
<pre><code class="lang-plaintext">➜  objects git:(main) cd b4
➜  b4 git:(main) git cat-file -p b4f560a353a2202e9bb18939c7c3ccbf6e479639
tree 408c5d6e7750f028dd67dd5b25cf98076e400638
parent cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5
author AmanMulani &lt;amanmulani369@gmail.com&gt; 1703971554 +0530
committer AmanMulani &lt;amanmulani369@gmail.com&gt; 1703971554 +0530

Second commit to time travel!
</code></pre>
<p>This commit object seems a bit different from the previous one: it has a parent reference. We will come back to that later; let us first inspect the tree object:</p>
<pre><code class="lang-plaintext">➜  objects git:(main) git cat-file -p 408c5d6e7750f028dd67dd5b25cf98076e400638
100644 blob efb89ab841a90e0a67fbcdb6cad8696d0357e715    first_file.txt
040000 tree 44f140066e3411499a3f1dda2a4e1c1fb1fda7e4    folder1
040000 tree a6322941fa81fde0730c1bb6cdd69edfe23332b1    folder2
040000 tree 2028d42939fff173dece6e7050245b7f8e70a919    folder3
100644 blob e7e57ac6308173dec7dc1749b72c5f682aa67344    second_file.txt
</code></pre>
<p>Initially, we thought that a tree only consisted of blob objects, which is only partly true: a tree entry can itself be another tree. If we inspect the remaining tree objects, we find that git encodes the folder hierarchy by nesting tree objects, one per sub-folder.</p>
<pre><code class="lang-plaintext"># folder1
➜  objects git:(main) git cat-file -p 44f140066e3411499a3f1dda2a4e1c1fb1fda7e4
100644 blob c6c3571f4c4a10c36f3936bc0be783d2f742742b    fourth_file.txt
040000 tree 464a225e9d96b59aba367df8e9d124c95ce4549e    sub-folder11

# subfolder11
➜  objects git:(main) git cat-file -p 464a225e9d96b59aba367df8e9d124c95ce4549e
100644 blob 84cfff544bbf9ea894413243bc6e882d5b56afb5    seventh_file.txt
100644 blob 72418f23007afa1d63594d8e50215ed51b84ec03    third_file.txt

# folder2
➜  objects git:(main) git cat-file -p a6322941fa81fde0730c1bb6cdd69edfe23332b1
100644 blob f5a688aed83b1c6776885b5b05c13ba6ca3145d1    fifth_file.txt

# folder3
➜  objects git:(main) git cat-file -p 2028d42939fff173dece6e7050245b7f8e70a919
100644 blob a952388abaf3c3894d7bbab02475a0359dd7afed    sixth_file.txt
</code></pre>
<p>The tree object, therefore, is in essence the code snapshot we have been talking about. After we commit our changes, git creates a new tree object with references to the relevant blob objects. Only the files that have been modified get new blob entries; files that have not changed keep their old blob references as is.</p>
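<p>This reuse falls straight out of content addressing, and we can sketch it with a toy recursive tree hash in Python. The encoding below (hashing a sorted text listing of entries) is a simplification invented for illustration; real git hashes a binary entry format with file modes. The key property survives the simplification: an untouched sub-folder produces the exact same tree id in both snapshots.</p>

```python
import hashlib

def tree_id(tree: dict) -> str:
    """Content-address a directory sketch: strings are file contents,
    nested dicts are sub-folders. (Simplified vs real git trees.)"""
    entries = []
    for name in sorted(tree):
        value = tree[name]
        if isinstance(value, dict):
            entries.append(f"tree {tree_id(value)} {name}")
        else:
            blob = hashlib.sha1(value.encode()).hexdigest()
            entries.append(f"blob {blob} {name}")
    return hashlib.sha1("\n".join(entries).encode()).hexdigest()

v1 = {"first_file.txt": "Hello\n",
      "folder1": {"fourth_file.txt": "4\n",
                  "sub-folder11": {"third_file.txt": "3\n"}}}
# v2 edits only first_file.txt; folder1's contents are identical.
v2 = {"first_file.txt": "Hello again\n",
      "folder1": {"fourth_file.txt": "4\n",
                  "sub-folder11": {"third_file.txt": "3\n"}}}

print(tree_id(v1) == tree_id(v2))                        # False
print(tree_id(v1["folder1"]) == tree_id(v2["folder1"]))  # True
```

<p>Because <code>folder1</code> hashes identically in both versions, a second snapshot can point at the existing tree object instead of storing anything new for it.</p>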
<h3 id="heading-parent-object-type-is-commit">Parent (object type is commit)</h3>
<p>If you recall from the example above, we also have a parent reference. If you inspect it, you will find that it is of type commit, and it simply points to the previous commit. Having the parent commit as part of the commit object is what lets git track history, and it has a number of other uses as well which are not discussed in this article.</p>
<pre><code class="lang-plaintext">➜  objects git:(main) git cat-file -t cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5
commit
➜  objects git:(main) git cat-file -p cc3a96b5a05c666fca63d7b4acd29ce779ad0ce5
tree 0d643279d1e62540cb18cb003d9b5e12b46bd992
author AmanMulani &lt;amanmulani369@gmail.com&gt; 1703969783 +0530
committer AmanMulani &lt;amanmulani369@gmail.com&gt; 1703969783 +0530

First stop in time-travel
</code></pre>
<h2 id="heading-summary">Summary</h2>
<p>By now, you are equipped with a strong foundational understanding of how git works. With the help of a simple repository, we have understood how git stores the snapshots of our data, and how to inspect the objects ourselves.</p>
<p>In the upcoming articles, we will be delving into real-time time travel from the present to the past. There will also be surprises as we learn more about how we can travel from one timeline to another via branching in the madness of this multiverse!</p>
]]></content:encoded></item><item><title><![CDATA[Git Internals: Understanding the Time Machine]]></title><description><![CDATA[In the previous article, we saw the high-level architecture of our time machine, git. However, only having that bit of knowledge doesn't get you started with time travelling. In the quest of mastering time travel, we need to have a thorough understan...]]></description><link>https://tech.amanmulani.com/git-internals-understanding-the-time-machine</link><guid isPermaLink="true">https://tech.amanmulani.com/git-internals-understanding-the-time-machine</guid><category><![CDATA[Git]]></category><category><![CDATA[vcs]]></category><category><![CDATA[version control]]></category><category><![CDATA[versioning]]></category><category><![CDATA[version control systems]]></category><dc:creator><![CDATA[Aman Mulani]]></dc:creator><pubDate>Sat, 30 Dec 2023 16:08:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/UQ2Fw_9oApU/upload/1180b8a6d145eb808af66b0b61c80b79.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous article, we saw the high-level architecture of our time machine, git. However, only having that bit of knowledge doesn't get you started with time travelling. In the quest of mastering time travel, we need to have a thorough understanding of how the time machine works.</p>
<h2 id="heading-problem-statement">Problem Statement</h2>
<p>First, let us try to understand what the time machine should be capable of:</p>
<ul>
<li><p>The tool should be able to save code snapshots of the given repository (folder).</p>
</li>
<li><p>The tool should be able to traverse these code snapshots seamlessly.</p>
</li>
</ul>
<p>By snapshots, we mean the current state of the code.</p>
<h2 id="heading-a-stab-at-design">A Stab at Design</h2>
<p>We can easily conclude that each snapshot needs a unique identifier. So whenever we save a snapshot, we should get a unique identifier associated with it, which can be used to travel back to that snapshot.</p>
<h3 id="heading-compressing-files">Compressing Files</h3>
<p>Since we are working with files, the amount of disk space used to store code snapshots can grow drastically over time. Disk space itself isn't the biggest concern; the real problem is that as the history grows, the time to retrieve snapshots grows with it. To solve this, let us compress the contents of our files when we store a snapshot. We can easily decompress the content when retrieving it.</p>
<h3 id="heading-hashing-the-compressed-contents">Hashing the compressed contents</h3>
<p>Compressing files solves the problem of storage, but what about the unique identifiers? The first thing that usually strikes us when we think about unique identifiers is hashing. We will use the same concept here: to generate a unique identifier, we will simply hash the compressed content using SHA-1, the same algorithm git historically uses. (Git itself actually hashes the <em>uncompressed</em> content, prefixed with a small header, and compresses only for storage.) The SHA-1 so generated can be used to travel back to this code snapshot.</p>
<h3 id="heading-storage-mechanism">Storage Mechanism</h3>
<p>Now that we have somewhat solved the problems of uniquely identifying code snapshots and saving them compactly, let us go one layer deeper: how should the contents produced by our time machine be saved on, and retrieved from, client-side storage?</p>
<p>A few constraints to keep in perspective are that the client-side tool has to be lightweight with a minimal number of extra dependencies and should be highly portable. This non-functional requirement somewhat rules out the possibility of using client-side databases. Can we think of something simpler, one that is already available out of the box by the operating system?</p>
<p>Ahh! Can we store the compressed file contents in files, and maintain the SHA-1 identifiers in other files for quick look-up of hashes? Well, this seems like a good deal. Or, at least, this is what <strong>Linus Torvalds,</strong> the inventor of the time machine git, thought. He leveraged the <em>Unix-like</em> file directory structure to store the compressed content.</p>
<h3 id="heading-final-outcome">Final Outcome</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703947313718/f9ed17d3-47e3-4c46-9198-0d43f36d4dfd.png" alt="VCS Client Side Code Snapshot Storing Mechanism" class="image--center mx-auto" /></p>
<p>So we first compress the contents of the file and store them in a new file, whose name is the SHA-1 of the content. To enable easier look-up, this SHA-1 is also recorded in a separate reference file. If you are aware of the term inverted index, this should sound similar in principle. This enables easy storage and retrieval of code snapshots via the VCS tool!</p>
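<p>The whole design in the diagram fits in a few lines. The Python sketch below is a toy implementation of exactly the scheme described here (a dict stands in for the on-disk object files; note it hashes the compressed bytes, per this article's design, whereas git hashes the uncompressed content):</p>

```python
import hashlib
import zlib

object_store = {}  # stands in for the on-disk object files

def save_snapshot(content: bytes) -> str:
    """Compress the content, name it by its SHA-1, file it away,
    and hand back the identifier."""
    compressed = zlib.compress(content)
    object_id = hashlib.sha1(compressed).hexdigest()
    object_store[object_id] = compressed
    return object_id

def load_snapshot(object_id: str) -> bytes:
    """Travel back: look the id up and decompress."""
    return zlib.decompress(object_store[object_id])

oid = save_snapshot(b"print('hello, v1')\n")
print(oid)
print(load_snapshot(oid))  # b"print('hello, v1')\n"
```

<p>Identical content always yields the same id, so saving the same snapshot twice costs nothing extra, which is the quiet superpower of content addressing.</p>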
<h2 id="heading-summary">Summary</h2>
<p>Congratulations! You have designed a very basic version of git, the time machine! Without a doubt, git uses more sophisticated ways to store and keep track of the timeline. But worry not! In the next article, we will equip ourselves with how git is implemented. The power of commit, blob and tree!</p>
]]></content:encoded></item><item><title><![CDATA[Git Internals: Understanding the Time Travel]]></title><description><![CDATA[In this article, we will unfold why git is designed the way it is, and why we have referred to it as a time machine. Here, it is expected that you have a fair bit of an idea of what is a VCS.
Fundamentals
As always, let us try to understand the funda...]]></description><link>https://tech.amanmulani.com/understanding-git-time-travel-options</link><guid isPermaLink="true">https://tech.amanmulani.com/understanding-git-time-travel-options</guid><category><![CDATA[Git]]></category><category><![CDATA[vcs]]></category><category><![CDATA[version control]]></category><category><![CDATA[version control systems]]></category><dc:creator><![CDATA[Aman Mulani]]></dc:creator><pubDate>Fri, 29 Dec 2023 21:58:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/h0dngiRxMeA/upload/db5a32dfe4bfbff8566320801607738c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this article, we will unfold why git is designed the way it is, and why we have referred to it as a time machine. Here, it is expected that you have a fair bit of an idea of what is a <a target="_blank" href="https://en.wikipedia.org/wiki/Version_control">VCS</a>.</p>
<h2 id="heading-fundamentals">Fundamentals</h2>
<p>As always, let us try to understand the fundamentals. To do so, we need to understand why git came into existence in the first place. Git is a distributed version control tool, which means that developers can make changes to the code in their local systems, and once done, they can push the changes to a central collaborative repository.</p>
<p>This is the major problem that git solves: it enables each developer to maintain several different versions of the code locally on their own system, while being able to pull new changes from the remote repository on the central server and push their own changes back to it.</p>
<h2 id="heading-problem-statement">Problem Statement</h2>
<p>To have a better understanding, let us imagine that you are tasked with designing a tool, that can move you back to any given code snapshot and also give you the ability to share it with other developers as well.</p>
<p>By snapshot, we mean any particular version of the code that you want your tool to be able to go back to. Consider a website under development: after the initial model (v1), code modifications produce v2, and you then need to return to the previous snapshot (v1). You should be able to do this with your tool. You should also be able to share a snapshot of any state of your code with other developers.</p>
<h2 id="heading-solution-1">Solution 1</h2>
<p>Now let us try to understand the components needed to design such a tool. Since we have to maintain code snapshots, we obviously need a storage mechanism, i.e. a database to store our code changes. We also want to enable code sharing, and therefore the storage should be remote central storage (don't even think about peer-to-peer connections!).</p>
<p>Just having central remote storage won't be enough. To maintain connections with multiple clients, keep the client-side tool lightweight, ensure a secure connection and enforce role-based access control, we also need a remote server, which interfaces between the clients and the central storage.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703885242769/ebe361a7-2884-457a-aef9-723499c07f27.png" alt="VCS Tool Design v1" class="image--center mx-auto" /></p>
<h3 id="heading-workflow">Workflow</h3>
<ul>
<li><p>The development flow entails a developer making changes, capturing a code snapshot, and automatically syncing with the remote server while compressing the snapshot.</p>
</li>
<li><p>Every time a new snapshot is created by the developer, it is directly saved to the remote server.</p>
</li>
<li><p>The client side holds only the current code version from the remote server; the snapshots themselves live on the remote server, not locally.</p>
</li>
</ul>
<p>Even with this structured approach, scalability concerns arise as data accumulates from numerous changes and snapshots, especially with a sizable number of developers.</p>
<p>The performance bottleneck comes from storing compressed data for every file in each snapshot. For example, suppose our tool takes the first snapshot of a codebase with 1000 files as v1 and sends the compressed data of all 1000 files to the central repository. For the second version, <em>even if we have updated only 2 lines in total, it will again send the compressed data of all 1000 files.</em></p>
<p>A better solution would be to store only what has changed since the previous snapshot. The first snapshot v1 holds all the file contents, while the subsequent snapshots v2, v3, v4, etc. hold only the delta, i.e. the contents that have changed. This way we optimize both performance and storage. Let's introduce some mathematical notation:</p>
<ul>
<li><p><strong>Initial Snapshot (v1):</strong></p>
<p>  $$S(v_1) = C(1) + C(2) + \dots + C(1000)$$</p>
<p>  Where S(v1) is the snapshot of version 1, and C(n) is the compressed data of file number n.</p>
</li>
<li><p><strong>Subsequent Snapshots (v2, v3, ...):</strong></p>
<p>  $$\Delta(v_1, v_2) = C(\text{modified})$$</p>
<p>  Where Δ(v1, v2) is the compressed data of only the files that changed between v1 and v2.</p>
</li>
</ul>
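<p>The delta scheme above can be sketched in a few lines of Python. This is an illustrative toy, not real SVN internals, and the function names are hypothetical: v1 stores every file, while each later snapshot stores only the files whose contents changed (file deletions are ignored for simplicity).</p>

```python
# Toy delta storage: v1 stores the whole tree, later snapshots store only
# the files that changed since the previous version.

def take_snapshot(history, files):
    """history: list of deltas; files: dict mapping filename -> content."""
    if not history:                      # initial snapshot (v1): store everything
        delta = dict(files)
    else:                                # later snapshots: store only changes
        current = materialize(history, len(history))
        delta = {name: content
                 for name, content in files.items()
                 if current.get(name) != content}
    history.append(delta)

def materialize(history, version):
    """Rebuild the full tree at `version` by replaying deltas from v1."""
    state = {}
    for delta in history[:version]:
        state.update(delta)
    return state

history = []
take_snapshot(history, {"a.txt": "hello", "b.txt": "world"})   # v1: both files
take_snapshot(history, {"a.txt": "hello!", "b.txt": "world"})  # v2: only a.txt changed
```

<p>Notice that <code>materialize</code> has to replay the deltas in order, starting from v1, every single time.</p>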
<p>By storing only the delta, i.e. what has changed in each file from v1 onward, we optimize both performance and storage. See how simple it is to time-travel from one version to another!</p>
<p>Congratulations! You have just designed a tool similar to <a target="_blank" href="https://en.wikipedia.org/wiki/Apache_Subversion">SVN</a>.</p>
<h3 id="heading-problems">Problems</h3>
<p>This seems ideal, but it is a nightmare for teams working on a decent-sized project. Why?</p>
<ol>
<li><p><strong>Internet Dependency</strong><br /> The developer always needs to be connected to the internet, because snapshots can only be saved on the remote server. Developers struggle when offline or on a slow network connection, which hampers their productivity.</p>
</li>
<li><p><strong>Linear Snapshot Retrieval</strong><br /> Since only the delta changes are stored, accessing a specific snapshot requires reconstructing the code's state sequentially from the initial version. This is necessary because each subsequent snapshot builds upon the previous one. Retrieving snapshots linearly ensures the integrity of the code history, but it may become less efficient as the project grows in size and complexity.</p>
</li>
<li><p><strong>Single point of failure</strong></p>
<p> If the remote server goes offline, all the developers will be stuck with their current local copy and won't be able to save any snapshots.</p>
</li>
</ol>
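<p>The linear-retrieval problem is easy to demonstrate with a toy sketch (hypothetical code; real systems mitigate this cost with tricks such as SVN's skip-deltas): checking out version N forces us to replay every delta since v1.</p>

```python
# With delta storage, checkout cost grows with history length:
# restoring version N means replaying all N deltas from the first snapshot.

def checkout(deltas, version):
    state, deltas_applied = {}, 0
    for delta in deltas[:version]:   # must start from v1 every time
        state.update(delta)
        deltas_applied += 1
    return state, deltas_applied

# 10,000 snapshots, each touching a single file
deltas = [{"file.txt": f"revision {v}"} for v in range(10_000)]

# Even though each snapshot touched only one file, restoring the
# latest version still walks every delta since the beginning.
state, cost = checkout(deltas, 10_000)
```

<p>In our analogy: every checkout starts from the big bang.</p>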
<p>Although our system is optimized in terms of computer resources, it fares <em>very poorly in terms of developer experience and productivity.</em></p>
<p>In our analogy, it means that if you have to travel in the past, you must always start from the beginning of time itself, the very first big bang! Not a pleasant experience for the time-travellers!</p>
<h2 id="heading-solution-2">Solution 2</h2>
<blockquote>
<p>Developer time is more expensive than local disk storage.</p>
</blockquote>
<p>Since developer time is more expensive than local disk storage, we can afford client-side storage. With client-side storage, an internet connection is needed only when we have to push data to or fetch data from the remote repository.</p>
<p>And since storage is cheaper than productive time, we can store the entire snapshot of the code instead of only a delta. At any instant, a developer can push snapshots to the remote server, which maintains them so that other developers can later pull them. With some smart compression and retrieval techniques, we can keep the system performant. So here we go, we have the following three components:</p>
<p>Client-side storage, the client tool, and the remote server.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703885213630/86b7a676-7e23-4e02-ba45-294771ba9191.png" alt="VCS Tool Design v2" class="image--center mx-auto" /></p>
<h3 id="heading-workflow-1">Workflow</h3>
<ul>
<li><p>The developer changes the code stored in a folder (repository). This folder is registered with the tool, which maintains a remote copy of the repository.</p>
</li>
<li><p>Multiple developers take a copy of this repository onto their local machines, make changes, and store the required snapshots in client-side storage. No internet connectivity is needed to create snapshots; a connection is required only to push them to the remote server, after which other developers can pull them.</p>
</li>
<li><p>Since each snapshot stores the full contents of the code, jumping back to any version is easy, with no need to reconstruct it linearly from the initial version.</p>
</li>
<li><p>Multiple developers can continue to work on their local copies even if the central remote server is down.</p>
</li>
</ul>
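<p>Solution 2 can be sketched as a tiny content-addressed store. This is a minimal illustration with hypothetical names; real git adds blobs, trees, commits, and pack-file compression on top of this idea. Every snapshot keeps the full tree locally, so checkout is a single lookup, and the network is touched only on push.</p>

```python
import hashlib
import json
import zlib

class LocalRepo:
    def __init__(self):
        self.objects = {}   # client-side storage: hash -> compressed snapshot
        self.log = []       # ordered history of snapshot hashes

    def snapshot(self, files):
        """Store the entire tree, compressed; needs no network at all."""
        blob = zlib.compress(json.dumps(files, sort_keys=True).encode())
        digest = hashlib.sha1(blob).hexdigest()
        self.objects[digest] = blob
        self.log.append(digest)
        return digest

    def checkout(self, digest):
        """Jump straight to any version: one lookup, no linear replay."""
        return json.loads(zlib.decompress(self.objects[digest]))

    def push(self, remote):
        """Only here do we need the network: sync missing objects upstream."""
        for digest, blob in self.objects.items():
            remote.setdefault(digest, blob)

repo = LocalRepo()
v1 = repo.snapshot({"a.txt": "hello", "b.txt": "world"})
v2 = repo.snapshot({"a.txt": "hello!", "b.txt": "world"})

remote = {}          # stand-in for the central server's storage
repo.push(remote)    # local snapshots survive even if this step fails
```

<p>Because each snapshot is addressed by a hash of its full contents, jumping to any version is a constant-time lookup rather than a linear replay, and <code>snapshot</code> works entirely offline.</p>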
<p>Since each local machine holds its own copies of the snapshots, which can eventually be synced back with the remote server, we can call this a loosely coupled distributed version control system, one that empowers developers and boosts their productivity.</p>
<p>Congratulations, you have designed a very raw version of git!</p>
<h2 id="heading-summary">Summary</h2>
<p>In principle, we have designed a time machine that can take us back to any snapshot just by knowing where we want to land (the version number). This is what git is: a time machine that can take us back to the dinosaur age of our code, and then bring us efficiently back to the modern age (even if it is no better than the Stone Age. Pun intended!)</p>
<p>Git has quite a few drawbacks of its own, which you can easily read about on the internet! Although in this post we have designed the most basic and scrappy version of git, in upcoming posts we will delve into the details of git's internal workings and build upon these foundations! Looking forward to meeting you again, time-traveller!</p>
]]></content:encoded></item></channel></rss>