Git-LFS under the hood

When working with Git, you should know that cloning a repository will give you a full copy of nearly all the data that the server has.
This includes every version of every file in the repository.

Although Git does a fantastic job when it comes to storing files efficiently, this is not the case for binary files.

So... what happens when you store binary files (e.g., media files) in your repository?
How does Git even handle the versioning of such files? And what does your current setup do under the hood?

Let's take a look by simulating those scenarios in a test environment!

Why bother?

Since Git does in fact retrieve a full copy of your repository you might have already guessed that this can be especially problematic when
cloning a repository with many binary files that are changed frequently (e.g., large game assets like sprites, audio samples,.etc.).

Using git-lfs solves this problem and reduces your repository size while also speeding up the clone time.

Sounds great, right? Let's see how this works in practice.

How does git-lfs work?

To get an idea of how Git LFS works under the hood, we can install the lfs-test-server and use it as the LFS server for our GitHub test repository.

If you want to follow along or want to know how you can set up a LFS test server, feel free to read the Appendix section where the configuration is described.

Since we now have a local LFS server for testing purposes, we can start configuring our test repository to use it.
For this, we can create the following .lfsconfig file, add it to our repository, and run git lfs install to initialize Git LFS for our repository.

    [lfs]
      url = "https://localhost:9999/"

    [http]
      sslverify = false

For testing, I added the git-lfs-linux-adm64-v3.4.0.tar.gz file to my repository and tracked it with Git LFS using git lfs track "*.tar.gz".

This creates a .gitattributes file with the following content:

    *.tar.gz filter=lfs diff=lfs merge=lfs -text

Ensure that this file is also added to your repository, so that collaborators also track *.tar.gz files in the same way.

Git LFS also comes with a file locking API for collaboration, but I deactivated it for simplicity with the following command:

    git config lfs.https://localhost:8080/.locksverify false

At this point, we have configured git to use our local LFS test server, configured tracking for *.tar.gz files, added the .gitattributes file generated by Git LFS
to our repository, and also added git-lfs-linux-adm64-v3.4.0.tar.gz to test our setup.

Once we commit and push those files to our remote server, we can see the following output:

    Uploading LFS objects: 100% (1/1), 4.1 MB | 0 B/s, done.
    Objekte aufzählen: 5, fertig.
    Zähle Objekte: 100% (5/5), fertig.
    Delta-Kompression verwendet bis zu 12 Threads.
    Komprimiere Objekte: 100% (4/4), fertig.
    Schreibe Objekte: 100% (5/5), 541 Bytes | 541.00 KiB/s, fertig.
    Gesamt 5 (Delta 0), Wiederverwendet 0 (Delta 0), Pack wiederverwendet 0

The first line indicates that our LFS object was successfully uploaded. We can verify that by navigating to the management UI of our test server:

LFS Server

So... if the file is stored on the LFS... what is stored on the remote server?
This is where the pointer files come into play:

Pointer file
(The tar.gz file is a text file on our remote server.)

The remote server only knows the name of the file through the text file, which points to the actual file and can be resolved on the client side via Git LFS.
You can find the actual binary file in the .git/lfs/objects/../..-folder, as the pointer suggests:
Local LFS folder

To make this a bit more interesting, I changed our .tar.gz file by unpacking it and removing some files.
After committing and pushing the changed file, we can see the following output again:

    Uploading LFS objects: 100% (1/1), 4.8 MB | 0 B/s, done.
    Objekte aufzählen: 9, fertig.
    Zähle Objekte: 100% (9/9), fertig.
    Delta-Kompression verwendet bis zu 12 Threads.
    Komprimiere Objekte: 100% (7/7), fertig.
    Schreibe Objekte: 100% (7/7), 3.22 KiB | 3.22 MiB/s, fertig.
    Gesamt 7 (Delta 0), Wiederverwendet 0 (Delta 0), Pack wiederverwendet 0

...and a second file appears on our test server:

Large file changed

If we switch to GitHub, we also see that the main branch now points to the new version of the file, as expected:

Pointer file

Since we had both versions of the tar.gz file in our local environment, we can find both of them in our lfs objects folder:
Loca Files

So... what's the benefit in this case?

If we clone the test repository again, we will notice that we only receive the most recent version of our tar.gz file without having to download all previous versions of this file again:

Preview

If we want to take this further, we can even verify that our remote server never stored this file by cloning the repository again without having LFS configured.
This would give us all the files, except the binary files:

Cloned without LFS

Let's say we want to have the previous version of the file. What can we do?

One option would be to simply checkout the previous state of the main branch. Git LFS takes care of getting the previous file for you using the pointer file.
After that, you can see that you have both versions locally available again:

Preview

To sum up, Git LFS makes use of the smudge and clean filters of Git. Those filters allow git-lfs to change a file on checkout (smudge) and on commit (clean).
This enables Git LFS to "replace" the binary file with a pointer file for your remote server while simultaneously sending the binary file to your LFS server.
When using git checkout, Git LFS intercepts Git via the smudge filter again and replaces the pointer file with the actual file by requesting it from the storage server.

In short
With Git LFS, you will get large files from an external data source on-demand instead of getting every version of every (binary) file all the time.

If you are interested in how exactly Git LFS makes use of smudge and clean, you can find a detailed explanation in the official git-lfs client specification.

Thanks for reading!

Appendix

Setting up the local LFS test server

Setting up the test server is fairly easy. If you want to follow along, you can do it with the following steps:

Install git-lfs by following the instructions on their page for your OS.
Install Go(Lang) as described here.
Install the lfs-test-server with go install github.com/git-lfs/lfs-test-server@latest.
Generate a cert with openssl req -x509 -sha256 -nodes -days 2100 -newkey rsa:2048 -keyout lfs.key -out lfs.crt
Create a run-lfs.sh script to configure and run the server as described below.
Run chmod +x run-lfs.sh and execute the run-script in order to start the local lfs test server.

run-lfs.sh

    #!/bin/bash

    set -eu
    set -o pipefail


    LFS_LISTEN="tcp://:9999"
    LFS_HOST="127.0.0.1:9999"
    LFS_CONTENTPATH="content"
    LFS_ADMINUSER="admin"
    LFS_ADMINPASS="password"
    LFS_CERT="lfs.crt"
    LFS_KEY="lfs.key"
    LFS_SCHEME="https"

    export LFS_LISTEN LFS_HOST LFS_CONTENTPATH LFS_ADMINUSER LFS_ADMINPASS LFS_CERT LFS_KEY LFS_SCHEME

    # Needs to be adapted ;)
    /home/philipp/go/bin/lfs-test-server

Once this is done, you can access the management UI via https://localhost:9999/mgmt/objects.

Analyzing and migrating repositories

If you are not sure if you have a problem with large files, you can analyze your repository with the git-sizer tool.
There are also tools that help you remove large files entirely (including the history) that you can check out: bfg-repo-cleaner, git-filter-repo
Git LFS also comes with a git lfs migrate command that is pretty useful.

Large files on GitHub

GitHub blocks files larger than 100 MiB and only allows adding files up to 25 MiB through the browser.
If you want to store larger files, you would have to use Git LFS.

GitHub also recommends distributing huge files via GitHub releases if necessary, whereby each file must be smaller than 2 GiB.
If you are planning to use GitHub's LFS, please ensure you read about the quota for your subscription plan.

Also, be aware that GitHub has no way of deleting remote Git LFS objects except by deleting the whole repository.

Sources / relevant links:

by Philipp Meier, Oct 2023

Tags:

git

git-lfs

visitor@p-meier.dev:~$