Git Filter Branch Performance Limitations

order). This is a very destructive approach, so *make a backup* or go back to cloning it. You have been warned. * Remove the original refs backed up by git-filter-branch: say `git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d`. * Expire all reflogs with `git reflog expire --expire=now --all`. * Garbage collect all unreferenced objects with `git gc --prune=now` (or if your git-gc is not new enough to support arguments to `--prune`, use `git repack -ad; git prune` instead). [[PERFORMANCE]] PERFORMANCE ----------- The performance of git-filter-branch is glacially slow; its design makes it impossible for a backward-compatible implementation to ever be fast: * In editing files, git-filter-branch by design checks out each and every commit as it existed in the original repo. If your repo has `10^5` files and `10^5` commits, but each commit only modifies five files, then git-filter-branch will make you do `10^10` modifications, despite only having (at most) `5*10^5` unique blobs. * If you try and cheat and try to make git-filter-branch only work on files modified in a commit, then two things happen ** you run into problems with deletions whenever the user is simply trying to rename files (because attempting to delete files that don't exist looks like a no-op; it takes some chicanery to remap deletes across file renames when the renames happen via arbitrary user-provided shell) ** even if you succeed at the map-deletes-for-renames chicanery, you still technically violate backward compatibility because users are allowed to filter files in ways that depend upon topology of commits instead of filtering solely based on file contents or names (though this has not been observed in the wild). * Even if you don't need to edit files but only want to e.g. rename or remove some and thus can avoid checking out each file (i.e. you can use --index-filter), you still are passing shell snippets for your filters. This means that for every commit, you have to have a prepared git repo where those filters can be run. That's a significant setup. * Further, several additional files are created or updated per commit by git-filter-branch. Some of these are for supporting the convenience functions provided by git-filter-branch (such as map()), while others are for keeping track of internal state (but could have also been accessed by user filters; one of git-filter-branch's regression tests does so). This essentially amounts to using the filesystem as an IPC mechanism between git-filter-branch and the user-provided filters. Disks tend to be a slow IPC mechanism, and writing these files also effectively represents a forced synchronization point between separate processes that we hit with every commit. * The user-provided shell commands will likely involve a pipeline of commands, resulting in the creation of many processes per commit. Creating and running another process takes a widely varying amount of time between operating systems, but on any platform it is very slow relative to invoking a function. * git-filter-branch itself is written in shell, which is kind of slow. This is the one performance issue that could be backward-compatibly fixed, but compared to the above problems that are intrinsic to the design of git-filter-branch, the language of the tool itself is a relatively minor issue. ** Side note: Unfortunately, people tend to fixate on the written-in-shell aspect and periodically ask if git-filter-branch could be rewritten in another language to fix the performance issues. Not only does that ignore the bigger intrinsic problems with the design, it'd help less than you'd expect: if git-filter-branch itself were not shell, then the convenience functions (map(), skip_commit(), etc) and the `--setup` argument could no longer be executed once at the beginning

The text discusses the performance limitations of git filter-branch, explaining that its design makes it inherently slow due to factors such as checking out each commit, running shell commands, and creating multiple processes, and that attempts to optimize it by rewriting in another language would have limited impact due to the fundamental design issues.