Advanced Git concepts (or Git internals): How to tackle ‘plumbing’ operations

Jul 23, 2021

In this article we are going to explore Git plumbing operations like cat-file, write-tree, commit-tree, update-ref and more. These are low-level commands that the everyday Git commands use under the hood. Exploring them gives a better understanding of Git’s inner workings.

You are probably familiar with Git commands such as add, commit, checkout and merge. Under the hood, however, Git uses the so-called ‘plumbing’ operations: the commands just mentioned can be seen as high-level abstractions that combine these low-level instructions. In this article, we are first going to see how a ‘test’ environment can be set up to better understand the inner workings of Git. After that, we’ll explore some of the plumbing commands and the effect they have on a Git project. We will also reconstruct some of the high-level Git commands using the low-level operations.

Set up

We are going to work on the command line; the setup is slightly different on Linux and Windows machines.

First, we’ll work in a split terminal:

On the right, we’ll have the tree structure of the .git folder, and on the left we are going to explore the commands. Here, I will be using Windows PowerShell; if you do not have it, I recommend downloading it. Linux users can use their preferred terminal emulator.

Let’s now git init to initialize our test environment:

We can see that this command created a hidden folder named .git :

We can explore the content of this folder using the tree command. Here the behaviour is slightly different for Windows and Linux users: both versions recurse into subfolders, but the Linux tree command lists files by default, while the Windows one only shows folders unless we pass the /F flag:

So we can see that there are some files ( config , description , HEAD ) and some folders ( hooks , info , objects , refs ) which the command git init has created for us. In this tutorial we will not focus on the hooks folder, so we can delete it to have a cleaner folder structure:

rm -recurse .\.git\hooks\

Now, on the right panel, we are going to create a loop showing the tree structure of the .git folder refreshing every 2 seconds:

# Windows users
for(;;){ clear; tree /F .git; sleep 2 }

# Linux users
while true; do clear; tree .git; sleep 2; done
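If you would rather have one watcher that behaves the same on both systems, the loop can also be sketched in Python. This is only a convenience sketch: the function names render_tree and watch are made up for this example, and the output is a rough imitation of tree, not an exact copy of it.

```python
import os
import time

def render_tree(root, prefix=""):
    """Return the lines of a simple recursive listing of `root`."""
    lines = []
    entries = sorted(os.listdir(root))
    for i, name in enumerate(entries):
        last = i == len(entries) - 1
        lines.append(prefix + ("└── " if last else "├── ") + name)
        path = os.path.join(root, name)
        if os.path.isdir(path):
            # Indent the children, keeping a vertical bar if siblings follow.
            lines.extend(render_tree(path, prefix + ("    " if last else "│   ")))
    return lines

def watch(root=".git", interval=2.0):
    """Clear the screen and redraw the listing every `interval` seconds."""
    while True:
        os.system("cls" if os.name == "nt" else "clear")
        print(root)
        print("\n".join(render_tree(root)))
        time.sleep(interval)
```

Calling watch() from the project folder in the right panel redraws the .git tree every 2 seconds until you stop it with Ctrl+C.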

Cool, we can now start exploring the low-level Git commands.

Plumbing Operations

Hash-object

To manually create a Git object, we can pipe some content into the git hash-object command. By default, this command just computes and prints the hash of the file it is given. With the --stdin flag we tell it to read its input from stdin instead, and with the -w flag we tell it to also write the object to the Git database (the .git/objects folder). Let’s try it by piping the text hello world into git hash-object --stdin -w:

In the right panel, we should now see something like:

That command has created a file 5d3e... in a folder f3 inside the objects folder. The hash of this new object is the complete f35d3e...: the folder name is its first two characters and the file name is the remaining 38. If we try to cat the content of this file we get:

cat .\.git\objects\f3\5d3e67b4cdad5ef058bec4a2ef955a98c4848a

Wait, why aren’t we getting the hello world that we actually put in there? That’s because Git stores objects in a compressed format (zlib), so to see the content of Git objects we need another command.
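To make the storage format less mysterious, here is a minimal Python sketch of how a loose object is built and read back, assuming a repository that uses SHA-1 (the default). The helper names encode_object and decode_object are invented for this example. Note that the hash depends on the exact bytes: a shell that pipes different line endings or a different encoding will give a different hash.

```python
import hashlib
import zlib

def encode_object(obj_type, payload):
    """Return (hex hash, compressed bytes) for a loose object."""
    # Git hashes "<type> <size>\0" followed by the raw content.
    header = f"{obj_type} {len(payload)}".encode() + b"\x00"
    store = header + payload
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)

def decode_object(compressed):
    """Inverse of encode_object: return (type, payload)."""
    store = zlib.decompress(compressed)
    header, _, payload = store.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    assert int(size) == len(payload)
    return obj_type, payload

sha, data = encode_object("blob", b"hello world\n")
print(sha)               # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(sha[:2], sha[2:])  # folder name and file name under .git/objects
print(decode_object(data))
```

Passing the raw bytes of a file from .git/objects to decode_object gives back the type and the original content, which is essentially what the next command does for us.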

Cat-file

This command allows us to see the content of a Git object and also its type. Git has four object types (blob, tree, commit and annotated tag); in this article we will meet the first three:

Blob
Tree
Commit

Blob objects contain the actual content of files (any kind of file). Tree objects can be seen as UNIX directory entries: they are written from the staging area and link the blobs (content) to file names in a tree-like structure. Commit objects attach metadata (author, dates, message, etc.) to a tree.

For now we have only one object. Let’s see its content by passing its hash to git cat-file -p:

where we only need to specify enough leading characters of the hash (at least four) to uniquely identify it among all the other objects.
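The idea of an abbreviated hash can be sketched in a few lines of Python. This is a simplified illustration, not how Git implements it: the function name resolve_prefix is hypothetical, and it only looks at loose objects in .git/objects, ignoring packed ones.

```python
import os

def resolve_prefix(git_dir, prefix):
    """Return the full hashes of loose objects that start with `prefix`."""
    if len(prefix) < 4:
        raise ValueError("at least 4 hex characters are needed")
    # The first two characters select the folder, the rest the file name.
    folder = os.path.join(git_dir, "objects", prefix[:2])
    if not os.path.isdir(folder):
        return []
    return sorted(prefix[:2] + name for name in os.listdir(folder)
                  if name.startswith(prefix[2:]))
```

If the returned list has exactly one element the prefix is unambiguous; with more than one, a longer prefix is needed.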

With -p we print the content of the object. To see its type, we can use the -t flag instead:

Update-index

As we mentioned above, to create tree objects, we need some files in the staging area. This is done by creating/updating the index with the command:

git update-index --add --cacheinfo 100644 <hash> <filename>

where the 100644 stands for a normal file. Alternatives might be 100755 for executables and 120000 for symbolic links.
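These numbers are ordinary UNIX file modes written in octal, so Python’s standard stat module can decode them. The helper describe_mode below is just an illustrative name; the 040000 case is the mode shown for subtrees when listing a tree object.

```python
import stat

def describe_mode(mode_str):
    """Decode a Git mode string such as '100644' into (kind, permissions)."""
    mode = int(mode_str, 8)
    if stat.S_ISLNK(mode):
        kind = "symbolic link"
    elif stat.S_ISREG(mode):
        kind = "executable file" if mode & stat.S_IXUSR else "regular file"
    elif stat.S_ISDIR(mode):
        kind = "directory (tree)"
    else:
        kind = "other"
    return kind, stat.filemode(mode)

for m in ("100644", "100755", "120000", "040000"):
    print(m, describe_mode(m))
```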

Let’s create two versions of a given file:

echo "version 1" > test.txt

and then use hash-object to create a Git blob object:

git hash-object -w test.txt

Cool, on the right panel, a new object should have appeared:

Let’s do that process again:

echo "version 2" > test.txt ; git hash-object -w test.txt

and we should see another object popping up:

Wonderful, we can now update the index (actually creating one at this point) by staging the first version:

git update-index --add --cacheinfo 100644 594dc0e39bc4468ee19c67e65d37b97eb963b68b test.txt

and now, a new file index should have been created in the .git folder:

To see what’s inside this new file, we can issue the command git ls-files --stage :

We can also use the commonly used git status command. With this command we should indeed see that there is one file ready to be committed but also a new modified version not staged yet:

Write-tree

With git write-tree, Git creates a new tree object from the staging area:

and a new object should have appeared:

let’s see its content and its type:

git cat-file -t 674d

Since tree objects are like directories, we can create a new tree and then add the older tree as a subtree of the newer one.

First, let’s create a new file:

echo "new file" > new.txt

and by adding it to the staging area, a new blob object will be created automatically:

git update-index --add .\new.txt

We can now add ‘version 2’ of test.txt to the staging area by grabbing its hash:

git update-index --add --cacheinfo 100644 f0d983103c610431663d84b3012d1b172f2f52ea test.txt

we can inspect the staging area to see what’s happening:

and:

Alright, we can now create a new tree object with git write-tree:

let’s see its content:

Perfect, we can now add the previous tree as a subtree of this one by first putting the old tree into the staging area (the command syntax is git read-tree --prefix=<name> <hash>):

git read-tree --prefix=old_tree 674d4d31b97233152f3be1825cc9e765fa2b2859

checking the staging area:

we see a new entry named ‘old_tree/test.txt’.

Let’s write that tree with git write-tree:

This should have created a new tree object, which we can inspect by:

At this point, the structure that we constructed can be represented as follows:
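As a rough textual stand-in for that picture, the last tree can be modelled as nested Python dictionaries. The short hashes are abbreviations of the ones used in the commands above (the old tree is 674d4d3); the hash of the new.txt blob is not shown in the text, so it is left out. The render helper is just an illustrative name.

```python
# Blobs are plain strings, subtrees are nested dicts.
first_tree = {"test.txt": "blob 594dc0e (version 1)"}
third_tree = {
    "new.txt": "blob (new file)",
    "test.txt": "blob f0d9831 (version 2)",
    "old_tree": first_tree,  # added with read-tree --prefix=old_tree
}

def render(tree, indent=0):
    """Return an indented listing of a nested tree dictionary."""
    lines = []
    for name in sorted(tree):
        entry = tree[name]
        pad = "    " * indent
        if isinstance(entry, dict):
            lines.append(f"{pad}{name}/  tree")
            lines.extend(render(entry, indent + 1))
        else:
            lines.append(f"{pad}{name}  {entry}")
    return lines

print("\n".join(render(third_tree)))
```

The second tree we wrote is the same as the third one without the old_tree entry.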

Commit-tree

Now that we have our trees, we can create the commit objects to store some metadata regarding these objects. The syntax is the following:

echo "<commit_message>" | git commit-tree <tree_hash>

We’ll then create three commits, one for each tree that we created.

Let’s start with the first tree:

echo "First Commit" | git commit-tree 674d4d31b97233152f3be1825cc9e765fa2b2859

If we grab the hash of this new object:

we can inspect it:

At the bottom we can see the message “First Commit” that we passed to the git commit-tree command, while the author and committer information is taken from the Git configuration (the .gitconfig file in the $HOME directory).

Beautiful! We can now create new commits and chain them together by passing the parent commit with -p, building the so-called ‘commit history’:

echo "<commit_message>" | git commit-tree <tree_hash> -p <previous_commit_hash>

so:

and

Now, the hashes of your commit objects will be different from mine, because commit objects contain a timestamp (as can be seen in the cat-file output above) and information about the author.
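To see why, here is a small Python sketch that builds the text of a commit object and hashes it, assuming a SHA-1 repository. The function name commit_hash, the author string and the timestamps are made-up values for illustration only.

```python
import hashlib

def commit_hash(tree, message, author, timestamp, tz="+0000", parent=None):
    """Hash a commit object built from the given fields."""
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")  # this line is what -p adds
    ident = f"{author} {timestamp} {tz}"
    lines.append(f"author {ident}")
    lines.append(f"committer {ident}")
    payload = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    store = b"commit " + str(len(payload)).encode() + b"\x00" + payload
    return hashlib.sha1(store).hexdigest()

tree = "674d4d31b97233152f3be1825cc9e765fa2b2859"
who = "Jane Doe <jane@example.com>"  # hypothetical author
first = commit_hash(tree, "First Commit", who, 1627000000)
later = commit_hash(tree, "First Commit", who, 1627000001)
print(first != later)  # one second of difference changes the hash
```

The same tree and message give a different commit hash as soon as the author, the timestamp or the parent changes.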

At this point, we can view the commit history by issuing the following command:

git log --stat <last_commit_hash>

in my case:

Wonderful! We built a commit history completely from low-level commands. What remains to be done is to create a so called ‘branch’. At this point, in fact, if we try to git log we will get:

What needs to be done is update the refs. This is a way for us to refer to a commit by not using the hash but using a human-friendly string. In order to do that, we can write in the .git/refs/heads folder.

Update-ref

We could directly echo the commit hash into a file under .git/refs/heads as follows:

echo <commit_hash> > .git/refs/heads/<branch_name>

This could, however, cause problems (with the file encoding, for example), so it is better to use the update-ref command as follows:

git update-ref refs/heads/<branch_name> <commit_hash>

In my case, I can write the third commit hash to the ‘main’ branch:

git update-ref refs/heads/main 6b05d1e73ea01f7baeb2ae1c7e3bab920db49e0a

and we should now have:

Also, it should have created a folder logs as follows:

Since the git log command will take the ref from the HEAD (which can be seen by using cat .\.git\HEAD ) and by default, this is the master branch, we still get nothing. In fact if we do:

we see that it contains refs/heads/master . Let’s change it to refs/heads/main :

git symbolic-ref HEAD refs/heads/main

and if we now see the content of the HEAD again we get:

cat .\.git\HEAD

At this point, we can issue the command git log , which should work fine:

Wonderful, the above is the process to create a new branch and move the HEAD to point to that branch.
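The chain HEAD → branch ref → commit can be followed by hand with a few lines of Python. This is a simplified sketch: the function name resolve_head is hypothetical, and it only reads loose ref files, so refs that Git has moved into the packed-refs file are not found.

```python
import os

def resolve_head(git_dir=".git"):
    """Return (ref name, commit hash) for HEAD; either may be None."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if not head.startswith("ref: "):
        return None, head  # detached HEAD: the file holds a hash directly
    ref = head[len("ref: "):]
    ref_path = os.path.join(git_dir, *ref.split("/"))
    if not os.path.exists(ref_path):
        return ref, None   # the branch has no commit yet
    with open(ref_path) as f:
        return ref, f.read().strip()
```

Before update-ref the second element is None (which is why git log had nothing to show); afterwards it is the hash we wrote.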

We should now have all the tools needed to re-create some of Git’s widely used commands.

Reconstructing Git Commands

In order to better understand all these commands, we can try to reconstruct some of the high-level Git commands like add and commit.

Let’s first set up our environment to test all these features.

The folder tree structure that we are going to use will be the following:

.
│   file1.txt
│
├───folder1
│   │   file11.txt
│   │
│   ├───folder11
│   │       file111.txt
│   │       file112.txt
│   │
│   └───folder12
│           file121.txt
│
└───folder2
        file21.txt

where it doesn’t actually matter what is inside the files.

To create this structure, in a new folder, you can use the following commands (they work for both Windows PowerShell and Linux, since PowerShell accepts forward slashes “/” as path separators):

echo "file1" > file1.txt
mkdir folder1
echo "file11" > ./folder1/file11.txt
mkdir folder1/folder11
echo "file111" > ./folder1/folder11/file111.txt
echo "file112" > ./folder1/folder11/file112.txt
mkdir folder1/folder12
echo "file121" > ./folder1/folder12/file121.txt
mkdir folder2
echo "file21" > ./folder2/file21.txt

Initialize the Git repository:

git init

and on the right panel issue the earlier command to display the tree structure of the .git folder.

Add

The add command takes files or directories and adds them to the staging area. As we have seen, to add something to the staging area we first create the blobs of these files and then update the index with them. Alternatively, we can use update-index --add <filename> directly to create the blob and add it to the staging area in one step.

First, let’s inspect what happens when the Git add command is issued:

git add folder1

After sending this command, 4 objects should have been created:

checking the staging area (the index), we see that the command has created 4 blobs and added them to the staging area with their relative paths as their names:

Cool, let’s implement a Python script with the same functionality. First, let’s clear the project with:

# For Windows (Powershell)
rm -Recurse -Force .\.git\objects\8f\
rm -Recurse -Force .\.git\objects\a2\
rm -Recurse -Force .\.git\objects\fa\
rm -Recurse -Force .\.git\objects\a3\
git reset .\folder1\

# For Linux
rm -R -f .git/objects/8f/
rm -R -f .git/objects/a2/
rm -R -f .git/objects/fa/
rm -R -f .git/objects/a3/
git reset folder1

Create now a new add.py file and put inside it the following code:
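A minimal sketch of what such a script could look like is given below. It is one possible version under simple assumptions, not necessarily the exact code used for the outputs in this article: it walks the given path and calls git update-index --add on every file, and the function names are my own choice.

```python
import os
import subprocess
import sys

def collect_files(path):
    """Return the files under `path` (or `path` itself) as forward-slash paths."""
    path = os.path.normpath(path)
    if os.path.isfile(path):
        return [path.replace(os.sep, "/")]
    found = []
    for folder, dirs, files in os.walk(path):
        dirs[:] = [d for d in dirs if d != ".git"]  # never stage Git's own files
        for name in files:
            found.append(os.path.join(folder, name).replace(os.sep, "/"))
    return sorted(found)

def add(path):
    """Stage every file under `path` using the plumbing command."""
    for file_path in collect_files(path):
        # --add lets update-index write the blob and register a new path.
        subprocess.run(["git", "update-index", "--add", file_path], check=True)
        print("staged", file_path)

if __name__ == "__main__":
    if len(sys.argv) == 2:
        add(sys.argv[1])
    else:
        print("usage: add.py <file-or-folder>")
```

Unlike the real git add, this sketch does not honour .gitignore and does not remove deleted files from the index.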

If we run it with:

py add.py folder1

we get:

and we should see the same 4 objects created in the .git folder:

plus the same index as before:

Wonderful, we have created our own simplified version of the git add command!

Commit

Let’s try now to implement the Git commit command.

First, we are going to see what it actually does. With the structure that we have left from the previous part, let’s commit with the message First Commit (for example with git commit -m "First Commit"):

and boom! A lot of changes in the .git folder should have happened.

First, we have a .git\COMMIT_EDITMSG which simply contains First Commit .

Then we have a new .git\logs folder which contains the logs of our commits.

In the .git\objects folder, quite a few new elements should be there now (your commit object will be different from mine):

To get the hash of the commit object, let’s use the git log command:

grab the commit hash and check its content:

and from here, we can take the tree object hash and check its content:

and we see that this tree contains another tree named folder1 as a subtree. Grab that hash and check its content:

And here we see that this tree contains the file11.txt and two other trees: folder11 and folder12 as subtrees.

Alright, I think it’s pretty clear how things work here.

Another important thing that the commit command has done is create the ‘master’ branch. In fact, commit creates or updates whatever ref HEAD points to: it creates a commit object, grabs its hash and does something like git update-ref $(git symbolic-ref HEAD) <commit_hash> . We can check that the master ref contains the correct commit:

Perfect, before implementing our own commit, let’s clean everything up. The simplest way is to delete the .git folder, initialize a new git project, and issue again the py add.py folder1 command:

# For Windows (powershell)
rm -Recurse -Force .git\
git init
rm -Recurse -Force .git\hooks\
py add.py folder1

# For Linux
rm -R -f .git
git init
rm -R -f .git/hooks/
python3 add.py folder1

Now, with the naming of the staged files:

the command git write-tree is smart enough to build for us the nested tree structure implied by the path names. Namely, if we issue a git write-tree command, it will produce the same Git objects as the commit command, except the commit object.
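The grouping that write-tree performs on the flat index paths can be imitated in plain Python. This sketch only reproduces the folder structure, not the hashing; nest_paths and count_trees are names invented for this example.

```python
def nest_paths(paths):
    """Group flat index paths into nested dicts (folder -> dict, file -> None)."""
    root = {}
    for path in paths:
        *folders, filename = path.split("/")
        node = root
        for folder in folders:
            node = node.setdefault(folder, {})
        node[filename] = None
    return root

def count_trees(node):
    """Count the tree objects needed: one per folder, including the root."""
    return 1 + sum(count_trees(child) for child in node.values()
                   if child is not None)

staged = [
    "folder1/file11.txt",
    "folder1/folder11/file111.txt",
    "folder1/folder11/file112.txt",
    "folder1/folder12/file121.txt",
]
structure = nest_paths(staged)
print(structure)
print(count_trees(structure))  # 4: root, folder1, folder11, folder12
```

Each nested dictionary corresponds to one tree object, which matches the root → folder1 → folder11 / folder12 chain we inspected after the real commit.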

Now, let’s see the logic flow that our python script implementing a simplified version of commit should follow:

Let’s see the same diagram with the relative git commands:

Then create a new commit.py file and put inside it the following code:
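A minimal sketch of such a script, following the flow described above (write-tree, look up the parent, commit-tree, update-ref), could look like this. It is one possible version and not necessarily the exact code behind the outputs shown here: the helper names and the “nothing to commit” check against the parent’s tree are my own assumptions.

```python
import subprocess
import sys

def git(*args, stdin=None):
    """Run a git command and return its stripped stdout (raises on failure)."""
    result = subprocess.run(["git", *args], input=stdin, text=True,
                            capture_output=True, check=True)
    return result.stdout.strip()

def commit_tree_args(tree, parent=None):
    """Arguments for commit-tree, with -p only when a parent exists."""
    args = ["commit-tree", tree]
    if parent:
        args += ["-p", parent]
    return args

def current_commit():
    """Hash that HEAD resolves to, or None if the branch has no commit yet."""
    result = subprocess.run(["git", "rev-parse", "--verify", "--quiet", "HEAD"],
                            text=True, capture_output=True)
    return result.stdout.strip() if result.returncode == 0 else None

def commit(message):
    tree = git("write-tree")
    parent = current_commit()
    # Assumption: refuse to commit when the tree did not change.
    if parent and git("rev-parse", parent + "^{tree}") == tree:
        print("nothing to commit")
        return None
    new_commit = git(*commit_tree_args(tree, parent), stdin=message + "\n")
    git("update-ref", git("symbolic-ref", "HEAD"), new_commit)
    print(new_commit)
    return new_commit

if __name__ == "__main__":
    if len(sys.argv) == 2:
        commit(sys.argv[1])
    else:
        print("usage: commit.py <message>")
```

In this sketch the commit message is the only command-line argument, and the script prints the hash of the new commit object.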

Now, we can run this script by:

And on the right panel, we should see some objects popping up. Grabbing the hash that the command printed for us, we can run:

Cool, we have a commit object with the correct message and the correct tree hash. In fact, if we recall, when we issued the git commit command we got a commit object pointing to the same tree that this commit points to. This means that our command has created the correct tree objects (one can inspect them as we did before).

Also, if we check our refs, we get:

git symbolic-ref HEAD ; git rev-parse HEAD

meaning that our HEAD points to the correct commit object.

Finally, we can try to see the logs:

and that seems to be perfect too!

Now, if we were to try to commit again, we would get:

Wonderful!!!

Let’s now try to add more elements to the staging area:

checking the index:

we have the new ‘file21.txt’ in the ‘folder2’ folder.

If we commit:

we can check the new commit object:

and here we see the parent hash ( f7bb... ) and the new tree object:

Also, git log correctly displays the commit history:

Conclusions

We explored the so-called ‘plumbing’ operations in Git, which, as we have seen, are low-level commands that can be combined to construct the higher-level (and more commonly used) Git commands. Finally, we saw some simple Python programs implementing simplified versions of the add and commit commands.

I hope this article has brought you some new knowledge and insights (as writing it did for me). Explore these commands, and try to implement more features using these low-level operations! I guarantee that it will deepen your understanding of Git a lot. It’s fun and it’s useful!

Thanks for reading!

Cheers!

Kevin