7th chunk of `Documentation/git-filter-branch.adoc`
 order).  This is a very destructive
approach, so *make a backup* or go back to cloning it.  You have been
warned.

* Remove the original refs backed up by git-filter-branch: say `git
  for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git
  update-ref -d`.

* Expire all reflogs with `git reflog expire --expire=now --all`.

* Garbage collect all unreferenced objects with `git gc --prune=now`
  (or if your git-gc is not new enough to support arguments to
  `--prune`, use `git repack -ad; git prune` instead).
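
The three steps above can be collected into one small shell function. This is
only a sketch under stated assumptions: `shrink_filtered_repo` is a
hypothetical helper name (not part of Git), it assumes a Git recent enough to
support `git -C` and `git gc --prune=now`, and it uses a `while read` loop in
place of the `xargs` pipeline so that a repository with no `refs/original/`
backups is handled as a no-op.

```shell
#!/bin/sh
# Sketch only.  DESTRUCTIVE: back up the repository before running this.
# shrink_filtered_repo is a hypothetical helper name, not part of Git.
shrink_filtered_repo () {
	repo=${1:-.}	# repository to shrink; defaults to the current directory

	# 1. Delete the refs that git-filter-branch backed up under
	#    refs/original/.  The read loop stands in for the xargs pipeline
	#    so that an empty list of backup refs does nothing.
	git -C "$repo" for-each-ref --format="%(refname)" refs/original/ |
	while read -r ref
	do
		git -C "$repo" update-ref -d "$ref"
	done &&

	# 2. Expire all reflogs so they no longer keep old objects reachable.
	git -C "$repo" reflog expire --expire=now --all &&

	# 3. Garbage collect everything that is now unreferenced.
	git -C "$repo" gc --prune=now
}
```

It would be called as `shrink_filtered_repo path/to/repo`. Chaining the steps
with `&&` means the garbage collection only runs if the earlier steps
succeeded.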

[[PERFORMANCE]]
PERFORMANCE
-----------

git-filter-branch is glacially slow, and its design makes it
impossible for a backward-compatible implementation to ever be fast:

* In editing files, git-filter-branch by design checks out each and
  every commit as it existed in the original repo.  If your repo has
  `10^5` files and `10^5` commits, but each commit only modifies five
  files, then git-filter-branch will make you do `10^10` modifications,
  despite only having (at most) `5*10^5` unique blobs.

* If you try to cheat by making git-filter-branch only work on
  files modified in a commit, then two things happen:

  ** you run into problems with deletions whenever the user is simply
     trying to rename files (because attempting to delete files that
     don't exist looks like a no-op; it takes some chicanery to remap
     deletes across file renames when the renames happen via arbitrary
     user-provided shell)

  ** even if you succeed at the map-deletes-for-renames chicanery, you
     still technically violate backward compatibility because users
     are allowed to filter files in ways that depend upon topology of
     commits instead of filtering solely based on file contents or
     names (though this has not been observed in the wild).

* Even if you don't need to edit files but only want to e.g. rename or
  remove some and thus can avoid checking out each file (i.e. you can
  use `--index-filter`), you are still passing shell snippets for your
  filters.  This means that for every commit, you have to have a
  prepared git repo where those filters can be run.  That's a
  significant setup.

* Further, several additional files are created or updated per commit
  by git-filter-branch.  Some of these are for supporting the
  convenience functions provided by git-filter-branch (such as `map()`),
  while others are for keeping track of internal state (but could have
  also been accessed by user filters; one of git-filter-branch's
  regression tests does so).  This essentially amounts to using the
  filesystem as an IPC mechanism between git-filter-branch and the
  user-provided filters.  Disks tend to be a slow IPC mechanism, and
  writing these files also effectively represents a forced
  synchronization point between separate processes that we hit with
  every commit.

* The user-provided shell commands will likely involve a pipeline of
  commands, resulting in the creation of many processes per commit.
  Creating and running another process takes a widely varying amount
  of time between operating systems, but on any platform it is very
  slow relative to invoking a function.

* git-filter-branch itself is written in shell, which is kind of slow.
  This is the one performance issue that could be backward-compatibly
  fixed, but compared to the above problems that are intrinsic to the
  design of git-filter-branch, the language of the tool itself is a
  relatively minor issue.

  ** Side note: Unfortunately, people tend to fixate on the
     written-in-shell aspect and periodically ask if git-filter-branch
     could be rewritten in another language to fix the performance
     issues.  Not only does that ignore the bigger intrinsic problems
     with the design, it'd help less than you'd expect: if
     git-filter-branch itself were not shell, then the convenience
     functions (`map()`, `skip_commit()`, etc.) and the `--setup` argument
     could no longer be executed once at the beginning
