You slice and dice your files in a git repo like a pro and accidentally commit a binary file. It happened to you as well, don’t pretend it didn’t.
Sooner or later you realize this file shouldn’t be there; it is clogging your git repo for no reason. OK, you delete the file and commit. But the repo doesn’t get any smaller. Hm…
Indeed, next time you do git clone, you wonder why your repo is still megabytes in size when it holds just a few source code files.
The thing is, just deleting the file from your working tree and committing the deletion doesn’t make things any better. The large file still sits somewhere in the .git directory, waiting for you to rewind the history and get it back. The problem, though, is that you want this file gone for good.
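To see this for yourself, here is a minimal sketch in a throwaway repo. The sandbox path, the file name big.bin, and the 2 MB size are all made up for the demo. It commits a file, deletes it in a second commit, and then checks that the blob is still stored in the object database.

```shell
set -eu

# Throwaway sandbox repo; every name below is invented for the demo.
sandbox="$(mktemp -d)"
git init -q "$sandbox/demo"
cd "$sandbox/demo"
git config user.name "Demo"
git config user.email "demo@example.com"

# Commit a 2 MB "binary", then delete it in a second commit.
head -c 2097152 /dev/urandom > big.bin
git add big.bin
git commit -q -m "add big.bin by accident"
blob_id="$(git rev-parse HEAD:big.bin)"
git rm -q big.bin
git commit -q -m "remove big.bin"

# The file is gone from the working tree, but the blob is still
# in the object database and keeps its full size.
blob_size="$(git cat-file -s "$blob_id")"
echo "blob $blob_id is still stored: $blob_size bytes"
```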
All tags and branches are preserved with this procedure, although I do not guarantee that the workflow will work in your case. Always make a backup.
To see whether your repo holds such monster files, you can leverage some git commands, but I found this self-contained Python script a good fit for the purpose:
Nice and easy: we get the 20 largest files, none of which I intend to keep, and they push the repo into the 70 MB range even with compression. No good.
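If you would rather stay with git itself, the same kind of listing can be sketched with plumbing commands. This is not the Python script mentioned above, just a rough equivalent under my own assumptions; the function name largest_blobs and its two arguments are hypothetical.

```shell
# List the largest blobs anywhere in history, biggest first.
# Usage: largest_blobs [repo-path] [count]   (defaults: current dir, 20)
largest_blobs() {
  repo="${1:-.}"
  count="${2:-20}"
  # rev-list prints "<hash> <path>" for every object reachable from any ref;
  # cat-file adds type and size, and %(rest) carries the path through.
  git -C "$repo" rev-list --objects --all |
    git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { sub(/^blob /, ""); print }' |
    sort -k1,1 -n -r |
    head -n "$count"
}
```

Calling largest_blobs . 20 inside a repo prints one line per blob, with the size in bytes followed by the path.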
Now to the fun part: let’s remove those files and make our repo fit again!
One problem, though: it’s not that easy. First of all, this operation is very intrusive. Because Git stores commits as a graph in which each commit references its parent, you can’t change a commit in the middle without rewriting every commit after it.
So be prepared: every commit from the first affected one onward will end up with a new hash. Evaluate the consequences and implications of that.
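The reason hashes cascade is visible in the commit object itself. The sketch below uses a throwaway repo with invented names and two empty commits, and prints the raw second commit to show that it records its parent by hash. Change the parent and the child’s content changes, and with it the child’s own hash.

```shell
set -eu

# Throwaway sandbox repo with two empty commits.
sandbox="$(mktemp -d)"
git init -q "$sandbox/demo"
cd "$sandbox/demo"
git config user.name "Demo"
git config user.email "demo@example.com"

git commit -q --allow-empty -m "first"
first="$(git rev-parse HEAD)"
git commit -q --allow-empty -m "second"

# The raw commit object contains a "parent <hash>" line.
git cat-file -p HEAD
recorded_parent="$(git cat-file -p HEAD | awk '$1 == "parent" { print $2 }')"
echo "first commit:    $first"
echo "recorded parent: $recorded_parent"
```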
If you start searching, you will find many workflows, some dating back to the 2000s and git 1.x. I tried them and regretted it. Eventually I found a working combination that I tested on two of my repos, and it worked flawlessly.
Make a backup of your original repo.
Now clone the repo with the --mirror option. That step is very important. You will have your repo cloned under the <repo-name>.git directory, but you won’t see the actual files; instead you will have the git database of this repo.
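A minimal sketch of the mirror clone, run in a sandbox so it works without network access. The local repo named upstream is an invented stand-in for your real remote; in practice you would pass the remote URL instead.

```shell
set -eu

sandbox="$(mktemp -d)"
cd "$sandbox"

# Stand-in for the real remote: a tiny local repo with one commit.
git init -q upstream
git -C upstream -c user.name=Demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "initial commit"

# In real use, replace "$sandbox/upstream" with the remote URL.
git clone -q --mirror "$sandbox/upstream" upstream.git

# A mirror clone is bare: no working tree, only the git database.
is_bare="$(git -C upstream.git rev-parse --is-bare-repository)"
echo "bare: $is_bare"
ls upstream.git
```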
The actual tool that does the job is called git-filter-repo. It is a successor to git-filter-branch and BFG Repo Cleaner.
Then you can read about the options the tool supports. I chose the easiest path possible: delete every file bigger than X megabytes. So I entered the directory created by git clone --mirror and executed the following command:
For my not-so-big repo with 500 commits, it finished in under a second. It removed all files bigger than 3 MB and rewrote every commit affected by that change.
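One way to express that with git-filter-repo is its --strip-blobs-bigger-than option, shown here as a sketch with the 3 MB threshold from above. It assumes git-filter-repo is installed and on the PATH, and skips itself otherwise. The sandbox repo, file names, and sizes are invented so the example can run end to end; the clone uses --no-local so that the local stand-in looks like a fresh clone to the tool.

```shell
set -eu

result="skipped"
if command -v git-filter-repo >/dev/null 2>&1; then
  sandbox="$(mktemp -d)"
  cd "$sandbox"

  # Stand-in for the real remote: one small source file and one 4 MB binary.
  git init -q upstream
  printf 'print("hello")\n' > upstream/main.py
  head -c 4194304 /dev/urandom > upstream/big.bin
  git -C upstream add .
  git -C upstream -c user.name=Demo -c user.email=demo@example.com \
    commit -q -m "code plus an accidental binary"

  # In real use this would be: git clone --mirror <remote-url>
  git clone -q --mirror --no-local "$sandbox/upstream" upstream.git
  cd upstream.git

  # Drop every blob larger than 3 MB from the whole history.
  git filter-repo --strip-blobs-bigger-than 3M

  # Only the small file should be left in the rewritten commit.
  result="$(git ls-tree -r --name-only HEAD)"
fi
echo "files left at HEAD: $result"
```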
We are not done yet. Although the files are gone from the history, we still need to tell git to run garbage collection so that it forgets about the now-unreferenced objects:
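A minimal sketch of that cleanup, assuming the usual pair of reflog expire followed by gc with immediate pruning. The sandbox repo and the deleted branch named oops are invented; deleting the branch stands in for the history rewrite by making the blob unreachable.

```shell
set -eu

sandbox="$(mktemp -d)"
git init -q "$sandbox/demo"
cd "$sandbox/demo"
git config user.name "Demo"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "keep me"

# Commit a 1 MB file on a side branch, then delete the branch so the
# blob is no longer reachable from any ref.
git checkout -q -b oops
head -c 1048576 /dev/urandom > big.bin
git add big.bin
git commit -q -m "accidental binary"
blob_id="$(git rev-parse HEAD:big.bin)"
git checkout -q -
git branch -q -D oops

# Drop reflog entries that still point at old commits, then prune
# unreachable objects right away instead of after the grace period.
git reflog expire --expire=now --all
git gc -q --prune=now --aggressive

if git cat-file -e "$blob_id" 2>/dev/null; then
  blob_state="present"
else
  blob_state="gone"
fi
echo "blob is $blob_state"
```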
Now the final part. We are ready to update the remote with our new git history. Interestingly enough, it is done with a simple
force is needed ¯_(ツ)_/¯
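A hedged sketch of that last push, again in a sandbox. The bare repo remote.git stands in for the real remote, and the rewrite is simulated with commit-tree rather than git-filter-repo; all names are invented. It shows that replacing history on the remote is a non-fast-forward update, which goes through once it is forced.

```shell
set -eu

sandbox="$(mktemp -d)"
cd "$sandbox"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the real remote, holding the original history on main.
git init -q --bare remote.git
git -C remote.git symbolic-ref HEAD refs/heads/main
git init -q work
echo "hello" > work/README
git -C work add README
git -C work commit -q -m "original history"
git -C work push -q "$sandbox/remote.git" HEAD:refs/heads/main

git clone -q --mirror "$sandbox/remote.git" mirror.git
cd mirror.git

# Simulate the rewrite: point main at a brand-new root commit.
rewritten="$(git commit-tree -m "rewritten history" "refs/heads/main^{tree}")"
git update-ref refs/heads/main "$rewritten"

# The new main does not descend from the old one, so the update is forced.
# (A mirror clone has remote.origin.mirror=true and pushes all refs.)
git push -q --force

remote_main="$(git -C "$sandbox/remote.git" rev-parse refs/heads/main)"
echo "remote main: $remote_main"
echo "rewritten:   $rewritten"
```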