Advanced Git concepts (or Git internals)— How to tackle ‘plumbing’ operations
In this article we are going to explore Git plumbing operations such as cat-file, write-tree, commit-tree, update-ref and more. These are the low-level commands that the everyday Git commands use under the hood. Exploring them gives a better understanding of Git’s inner workings.
You are probably familiar with the add, commit, checkout and merge commands. Under the hood, however, Git uses these so-called ‘plumbing’ operations: the commands just mentioned can be seen as high-level abstractions built by combining low-level instructions. In this article, we are first going to see how a ‘test’ environment can be set up to better understand the inner workings of Git. After that, we’ll explore some of these plumbing commands and the effect they have on the Git project. We will also reconstruct some of the high-level Git commands using the low-level operations.
Set up
We are going to work on the command line; the setup is slightly different for Linux and Windows machines.
First, we’ll work in a split terminal:
On the right, we’ll have the tree structure of the .git folder, and on the left we are going to explore the commands. Here, I will be using Windows PowerShell; if you do not have it, I recommend installing it. Linux users can use their preferred terminal emulator.
Let’s now run git init to initialize our test environment:
We can see that this command created a hidden folder named .git:
We can explore the content of this folder using the tree command. Here the syntax is slightly different for Windows and Linux users as far as listing files is concerned: the Linux tree command shows files by default, while the Windows one shows only folders unless we pass the /F flag:
So we can see that git init has created some files (config, description, HEAD) and some folders (hooks, info, objects, refs) for us. In this tutorial we will not focus on the hooks folder, so we can delete it to have a cleaner folder structure:
Now, on the right panel, we are going to create a loop showing the tree structure of the .git folder refreshing every 2 seconds:
# Windows users
for(;;){ clear; tree /F .git; sleep 2 }
# Linux users
while true; do clear; tree .git; sleep 2; done
Cool, we can now start exploring the low-level Git commands.
Plumbing Operations
Hash-object
To manually create a Git object, we can pipe some content into the git hash-object command. By default, this command just returns the hash of the object (file) specified. With the --stdin flag, however, we tell it to read its input from stdin, and with the -w flag we tell it to write that object to the Git database (in the .git/objects folder). Let’s try it:
In the right panel, we should now see something like:
That command has created a file named 5d3e.. in a folder named f3 inside the objects folder. The hash of this new object is the full f35d3e..., that is, the folder name followed by the file name. If we try to cat the content of this file we get:
Wait, why aren’t we getting the hello world that we actually put in there? That’s because Git stores the content in a compressed format, so to see the content of Git objects we need to use another command.
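To make the storage format concrete, here is a minimal Python sketch of what hash-object does for a blob, assuming a repository that uses the default SHA-1 object format. The function names are mine, not Git’s. It hashes the bytes hello world followed by a newline (what a Unix echo would pipe in); PowerShell may pipe different bytes, so the resulting hash need not match the f35d3e... seen above.

```python
import hashlib
import zlib


def encode_blob(content: bytes) -> bytes:
    """Prefix the raw content with the object header: type, space, size, NUL byte."""
    return b"blob " + str(len(content)).encode("ascii") + b"\0" + content


def object_id(store: bytes) -> str:
    """The object name is the SHA-1 of header plus content, taken before compression."""
    return hashlib.sha1(store).hexdigest()


def loose_object_path(oid: str) -> str:
    """The first two hex characters name the folder, the remaining 38 name the file."""
    return f".git/objects/{oid[:2]}/{oid[2:]}"


store = encode_blob(b"hello world\n")
oid = object_id(store)
compressed = zlib.compress(store)  # zlib-compressed bytes are what lands on disk

print(oid)
print(loose_object_path(oid))
```

The compressed bytes are what ends up in the file, which is why a plain cat prints gibberish.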
Cat-file
This command allows us to see the content of a Git object and also its type. Git has four types of objects (blob, tree, commit and annotated tag); in this article we will work with the first three:
- Blob
- Tree
- Commit
Blob objects contain the actual content of files (of any kind). Tree objects can be seen as UNIX directory entries: they are written from the staging area and link the blobs (content) with file names in a tree-like structure. Commit objects associate some metadata (author, dates, message, etc.) with a tree.
For now, we have only one object. Let’s see its content by:
Here we only need to specify enough leading characters of the hash (at least four) to identify the object uniquely among all the others.
With the -p flag we print the content of the object. To see its type, we can use the -t flag:
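As a rough illustration of what cat-file has to do, the sketch below decompresses a loose object and splits it into type, size and body. It builds the compressed bytes in memory instead of reading a real file under .git/objects, so it runs anywhere; parse_object is a hypothetical helper name.

```python
import zlib


def parse_object(raw: bytes):
    """Decompress a loose object and split it into (type, size, body)."""
    data = zlib.decompress(raw)
    header, _, body = data.partition(b"\0")   # header and body are separated by a NUL byte
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type.decode("ascii"), int(size), body


# Stand-in for the bytes of a file under .git/objects (built here so the sketch runs anywhere).
raw = zlib.compress(b"blob 12\0hello world\n")
obj_type, size, body = parse_object(raw)

print(obj_type)              # the kind of answer `cat-file -t` gives
print(body.decode(), end="") # the kind of answer `cat-file -p` gives for a blob
```

For trees and commits, cat-file -p additionally pretty-prints the body, which this sketch does not attempt.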
Update-index
As we mentioned above, to create tree objects, we need some files in the staging area. This is done by creating/updating the index with the command:
git update-index --add --cacheinfo 100644 <hash> <filename>
where 100644 stands for a normal file. Alternatives are 100755 for executables and 120000 for symbolic links.
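These mode strings are ordinary UNIX file modes written in octal, which can be checked with Python’s stat module. A small sketch (describe_mode is a hypothetical helper, not part of Git):

```python
import stat


def describe_mode(mode_text):
    """Return (kind, executable) for an octal mode string as used in the index."""
    mode = int(mode_text, 8)
    if stat.S_ISLNK(mode):
        kind = "symbolic link"
    elif stat.S_ISREG(mode):
        kind = "regular file"
    else:
        kind = "other"
    # The owner-execute bit is what separates 100755 from 100644.
    return kind, bool(mode & stat.S_IXUSR)


for mode_text in ("100644", "100755", "120000"):
    print(mode_text, describe_mode(mode_text))
```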
Let’s create two versions of a given file:
and then use hash-object to create a Git blob object:
Cool, on the right panel, a new object should have appeared:
Let’s repeat the process for the second version:
and we should see another object popping up:
Wonderful, we can now update the index (actually creating it at this point) by staging the first version:
and now, a new file named index should have been created in the .git folder:
To see what’s inside this new file, we can issue the command git ls-files --stage:
We can also use the familiar git status command. With it we should indeed see that there is one file ready to be committed, but also a newer modified version that is not staged yet:
Write-tree
With this command, Git will create a new tree object from the staging area:
and a new object should have appeared:
Let’s see its content and its type:
Since tree objects are like directories, we can create a new tree and then add the older tree as a subtree of the newer one.
First, let’s create a new file:
and by adding it to the staging area, a new blob object will be automatically created:
We can now add ‘version 2’ of test.txt to the staging area by grabbing its hash:
we can inspect the staging area to see what’s happening:
and:
Alright, we can now create a new tree object by:
let’s see its content:
Perfect, we can now add the previous tree as a subtree of this one by first reading the old tree into the staging area (the command syntax is git read-tree --prefix=<name> <hash>):
checking the staging area:
we see a new entry named ‘old_tree/test.txt’.
Let’s write that tree:
This should have created a new tree object, which we can inspect by:
At this point, the structure that we constructed can be represented as follows:
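To make the tree format less abstract, here is a minimal Python sketch of how a tree object’s bytes are laid out and hashed, assuming SHA-1. The helper names are hypothetical, and the blob used is the well-known empty blob rather than the real content of test.txt, so the printed hashes will not match the ones in your repository.

```python
import hashlib


def tree_entry(mode: str, name: str, oid_hex: str) -> bytes:
    """One entry is: mode, space, name, NUL byte, then the 20 raw bytes of the object id."""
    return mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0" + bytes.fromhex(oid_hex)


def sort_key(entry):
    """Git orders entries by name, comparing subtrees as if their name ended in a slash."""
    mode, name, _ = entry
    return (name + "/" if mode == "40000" else name).encode("utf-8")


def tree_id(entries) -> str:
    """Hash the tree header plus the concatenated, sorted entries."""
    body = b"".join(tree_entry(*e) for e in sorted(entries, key=sort_key))
    store = b"tree " + str(len(body)).encode("ascii") + b"\0" + body
    return hashlib.sha1(store).hexdigest()


EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"  # id of a zero-byte blob

inner = tree_id([("100644", "test.txt", EMPTY_BLOB)])
outer = tree_id([
    ("100644", "test.txt", EMPTY_BLOB),
    ("40000", "old_tree", inner),  # a subtree is just an entry pointing at another tree
])

print(inner)
print(outer)
```

Note that the raw object stores the directory mode as 40000, while cat-file -p displays it padded as 040000.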
Commit-tree
Now that we have our trees, we can create the commit objects to store some metadata regarding these objects. The syntax is the following:
echo "<commit_message>" | git commit-tree <tree_hash>
We’ll then create three commits, one for each tree that we created.
Let’s start with the first tree:
If we grab the hash of this new object:
we can inspect it:
At the bottom we can see the message “First Commit” that we passed to the git commit-tree command, while the author and committer details are taken from the Git configuration (for example the .gitconfig file in the $HOME directory).
Beautiful! We can now create new commits and chain each one to its predecessor, building the so-called ‘commit history’:
echo "<commit_message>" | git commit-tree <tree_hash> -p <previous_commit_hash>
so:
and
Now, the hashes of your commit objects will be different from mine, because commit objects contain a timestamp (as the earlier cat-file output shows) and the author information.
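The effect of the timestamp can be sketched in a few lines of Python. The layout below (tree, optional parent, author, committer, blank line, message) follows what cat-file -p prints for a commit; the author identity and timestamps are made up for illustration, and SHA-1 is assumed.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # id of a tree with no entries


def commit_id(tree, message, author, timestamp, parent=None, tz="+0000"):
    """Build the commit body the way Git lays it out and return its SHA-1."""
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    lines.append(f"author {author} {timestamp} {tz}")
    lines.append(f"committer {author} {timestamp} {tz}")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")
    store = b"commit " + str(len(body)).encode("ascii") + b"\0" + body
    return hashlib.sha1(store).hexdigest()


author = "Jane Doe <jane@example.com>"  # hypothetical identity
first = commit_id(EMPTY_TREE, "First Commit", author, 1700000000)
later = commit_id(EMPTY_TREE, "First Commit", author, 1700000001)  # one second later
child = commit_id(EMPTY_TREE, "Second Commit", author, 1700000002, parent=first)

print(first)
print(later)
print(child)
```

Same tree, same message, same author: one second of difference is enough to produce a completely different hash.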
At this point, we can view the commit history by issuing the following command:
git log --stat <last_commit_hash>
in my case:
Wonderful! We built a commit history completely from low-level commands. What remains to be done is to create a so-called ‘branch’. At this point, in fact, if we try git log we will get:
What needs to be done is to update the refs. A ref lets us refer to a commit with a human-friendly name instead of its hash. To do that, we can write into the .git/refs/heads folder.
Update-ref
We could directly echo the commit hash into a file such as .git/refs/heads/main as follows:
echo <commit_hash> > .git/refs/heads/<branch_name>
This could, however, cause problems with encoding and the like, so it is better to use the update-ref command as follows:
git update-ref refs/heads/<branch_name> <commit_hash>
In my case, I can write the third commit hash to the ‘main’ branch:
and we should now have:
It should also have created a logs folder, as follows:
Since the git log command resolves its starting point through HEAD (whose content can be seen with cat .\.git\HEAD) and, in this repository, HEAD points to the master branch by default, we still get nothing. In fact, if we do:
we see that it contains refs/heads/master. Let’s change it to refs/heads/main:
and if we now look at the content of HEAD again we get:
At this point, we can issue the command git log, which should work fine:
Wonderful, the above is the process to create a new branch and move HEAD to point to that branch.
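The lookup that git log performs here can be approximated with a short Python sketch. It works on a throwaway folder that merely imitates the layout of .git, uses a made-up hash, and ignores packed refs, so treat it as an illustration of the idea rather than of Git’s actual implementation.

```python
import os
import tempfile

FAKE_HASH = "0123456789abcdef0123456789abcdef01234567"  # placeholder, not a real commit


def resolve_head(git_dir):
    """Return (ref_name, commit_hash); either one may be None."""
    with open(os.path.join(git_dir, "HEAD")) as fh:
        head = fh.read().strip()
    if not head.startswith("ref: "):
        return None, head              # detached HEAD: the file holds a hash directly
    ref = head[len("ref: "):]
    ref_path = os.path.join(git_dir, *ref.split("/"))
    if not os.path.exists(ref_path):
        return ref, None               # the branch does not exist yet: nothing to log
    with open(ref_path) as fh:
        return ref, fh.read().strip()


with tempfile.TemporaryDirectory() as git_dir:
    with open(os.path.join(git_dir, "HEAD"), "w") as fh:
        fh.write("ref: refs/heads/main\n")
    before = resolve_head(git_dir)     # HEAD names a branch that has no ref file yet

    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "refs", "heads", "main"), "w") as fh:
        fh.write(FAKE_HASH + "\n")
    after = resolve_head(git_dir)      # now the branch resolves to a commit hash

print(before)
print(after)
```

The first result mirrors the situation where HEAD pointed to a branch with no ref behind it, and the second mirrors the state after both the ref and HEAD were set.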
We should now have all the tools needed to recreate some of the most widely used Git commands.
Reconstructing Git Commands
In order to better understand all these commands, we can try to reconstruct some of the high-level Git commands like add and commit.
Let’s first set up our environment to test all these features.
The folder tree structure that we are going to use will be the following:
.
│   file1.txt
│
├───folder1
│   │   file11.txt
│   │
│   ├───folder11
│   │       file111.txt
│   │       file112.txt
│   │
│   └───folder12
│           file121.txt
│
└───folder2
        file21.txt
where it doesn’t actually matter what is inside the files.
To create this structure in a new folder, you can use the following commands, which work for both Windows (using PowerShell) and Linux, since forward slashes “/” are automatically converted to backslashes “\” on Windows:
echo "file1" > file1.txt
mkdir folder1
echo "file11" > ./folder1/file11.txt
mkdir folder1/folder11
echo "file111" > ./folder1/folder11/file111.txt
echo "file112" > ./folder1/folder11/file112.txt
mkdir folder1/folder12
echo "file121" > ./folder1/folder12/file121.txt
mkdir folder2
echo "file21" > ./folder2/file21.txt
Initialize the Git repository:
git init
and on the right panel issue the earlier command to display the tree structure of the .git folder.
Add
The add command takes files or directories and adds them to the staging area. As we have seen, to add something to the staging area, we first create the blobs for these files and then update the index with all of these blobs. Alternatively, we can use update-index --add <filename> directly to create the blob and add it to the staging area in one step.
First, let’s inspect what happens when the Git add command is issued:
After sending this command, 4 objects should have been created:
checking the staging area (the index), we see that the command has created 4 blobs and added them to the staging area with their relative paths as their names:
Cool, let’s implement a Python script with the same functionality. First, let’s clear the project with:
# For Windows (Powershell)
rm -Recurse -Force .\.git\objects\8f\
rm -Recurse -Force .\.git\objects\a2\
rm -Recurse -Force .\.git\objects\fa\
rm -Recurse -Force .\.git\objects\a3\
git reset .\folder1\
# For Linux
rm -R -f .git/objects/8f/
rm -R -f .git/objects/a2/
rm -R -f .git/objects/fa/
rm -R -f .git/objects/a3/
git reset folder1
Now create a new add.py file and put the following code inside it:
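The listing does not appear in this text, so what follows is only a minimal sketch of what such a script could look like, not necessarily the code used for the outputs in this article. It assumes it is run from the root of the repository, that git is on the PATH, and that every file is a normal file (mode 100644); the function names are my own.

```python
import os
import subprocess
import sys


def iter_files(path):
    """Yield every file under path (or path itself if it is a file), using '/' separators."""
    if os.path.isfile(path):
        yield os.path.relpath(path).replace(os.sep, "/")
        return
    for root, dirs, files in os.walk(path):
        dirs[:] = sorted(d for d in dirs if d != ".git")  # never descend into .git
        for name in sorted(files):
            yield os.path.relpath(os.path.join(root, name)).replace(os.sep, "/")


def add(path):
    """Write each file as a blob, then register blob and path in the index."""
    for file_path in iter_files(path):
        blob = subprocess.run(
            ["git", "hash-object", "-w", file_path],
            check=True, capture_output=True, text=True,
        ).stdout.strip()
        subprocess.run(
            ["git", "update-index", "--add", "--cacheinfo", "100644", blob, file_path],
            check=True,
        )
        print(f"added {file_path} ({blob})")


def main(argv):
    if not argv:
        print("usage: add.py <path> [<path> ...]")
        return
    for path in argv:
        add(path)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Hard-coding 100644 keeps the sketch short; a fuller version would inspect each file to pick 100755 or 120000 where appropriate.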
If we run it with:
py add.py folder1
we get:
and we should see the same 4 objects created in the .git folder:
plus the same index as before:
Wonderful, we have created our own simplified git add command!
Commit
Let’s now try to implement the Git commit command.
First, we are going to see what it actually does. With the structure that we have left from the previous part, let’s run the following command:
and boom! A lot of changes should have happened in the .git folder.
First, we have a .git\COMMIT_EDITMSG file, which simply contains First Commit.
Then we have a new .git\logs folder, which contains the logs of our commits.
In the .git\objects folder, quite a few new elements should be there now (your commit object will be different from mine):
To get the hash of the commit object, let’s use the git log command:
Grab the commit hash and check its content:
From here, we can take the tree object hash and check its content:
and we see that this tree contains another tree named folder1 as a subtree. Grab that hash and check its content:
And here we see that this tree contains file11.txt and two other trees, folder11 and folder12, as subtrees.
Alright, I think it’s pretty clear how things work here.
Another important thing that the commit command has created is the ‘master’ branch. In fact, commit creates or updates whatever ref HEAD points to (so it creates a commit object, grabs its hash and does something like git update-ref $(git symbolic-ref HEAD) <commit_hash>). We can check that the master ref contains the correct commit:
Perfect. Before implementing our own commit, let’s clean everything up. The simplest way is to delete the .git folder, initialize a new Git project, and issue the py add.py folder1 command again:
# For Windows (PowerShell)
rm -Recurse -Force .git\
git init
rm -Recurse -Force .git\hooks\
py add.py folder1
# For Linux
rm -R -f .git
git init
rm -R -f .git/hooks/
py add.py folder1
Now, given the path names of the staged files:
git write-tree is smart enough to build the whole tree structure implied by those path names for us. Namely, if we issue a git write-tree command, it will produce the same Git objects as the commit command, except for the commit object.
Now, let’s see the logic flow that our Python script implementing a simplified version of commit should follow:
Let’s see the same diagram with the corresponding Git commands:
Then create a new commit.py file and put the following code inside it:
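As with add.py, the listing is not included in this text, so here is a minimal sketch of one way to write it from the plumbing commands covered so far. It assumes git is on the PATH, that HEAD points to a branch (not a detached commit), and that the commit message is passed as the first argument; the helper names and the exact "nothing to commit" check are my own choices.

```python
import subprocess
import sys


def git(*args, stdin=None, check=True):
    """Run a git command and return its trimmed stdout ('' if it failed and check is False)."""
    result = subprocess.run(["git", *args], input=stdin, capture_output=True, text=True)
    if check and result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.strip() if result.returncode == 0 else ""


def commit_tree_args(tree, parent):
    """Arguments for git commit-tree; the very first commit has no parent."""
    return ["commit-tree", tree] + (["-p", parent] if parent else [])


def commit(message):
    tree = git("write-tree")                                   # index -> tree objects
    parent = git("rev-parse", "--verify", "-q", "HEAD", check=False)  # '' on a new repo
    if parent and git("rev-parse", f"{parent}^{{tree}}") == tree:
        print("nothing to commit")                             # same tree as the parent
        return None
    new = git(*commit_tree_args(tree, parent), stdin=message + "\n")
    branch = git("symbolic-ref", "HEAD")                       # e.g. refs/heads/master
    git("update-ref", branch, new)                             # move the branch to the new commit
    return new


def main(argv):
    if not argv:
        print("usage: commit.py <message>")
        return
    new = commit(argv[0])
    if new:
        print(new)


if __name__ == "__main__":
    main(sys.argv[1:])
```

With this sketch, a run would look like py commit.py "First Commit", and the script prints the hash of the new commit object.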
Now, we can run this script by:
On the right panel, we should see some objects popping up. Grabbing the hash that the command printed for us, we can run:
Cool, we have a commit object with the correct message and the correct tree hash. In fact, recall that when we issued the git commit command, we got a commit object pointing to the same tree that this commit points to. This means that our command has created the correct tree objects (one can inspect them as we did before).
Also, if we check our refs, we get:
meaning that our HEAD points to the correct commit object.
Finally, we can try to see the logs:
and that seems to be perfect too!
Now, if we were to try to commit again, we would get:
Wonderful!!!
Let’s now try to add more elements to the staging area:
checking the index:
we have the new ‘file21.txt’ in the ‘folder2’ folder.
If we commit:
we can check the new commit object:
and here we see the parent hash (f7bb...) and the new tree object:
Also, git log correctly displays the commit history:
Conclusions
We explored the so-called ‘plumbing’ operations in Git, which, as we have seen, are low-level commands that can be combined to build the higher-level (and more commonly used) Git commands. Finally, we saw some simple Python programs implementing simplified versions of the add and commit commands.
I hope this article has brought you some new knowledge and insights (as writing it did for me). Explore these commands, and try to implement more features using these low-level operations! I guarantee that you will greatly deepen your understanding of Git. It’s fun and it’s useful!
Thanks for reading!
Cheers!
Kevin