Git Range-Diff Algorithm and Implementation

commit, for example, is completely red, even if the intent of the old commit was to add something. To help with that, `range` uses the `--dual-color` mode by default. In this mode, the diff of diffs will retain the original diff colors, and prefix the lines with -/+ markers that have their *background* red or green, to make it more obvious that they describe how the diff itself changed. Algorithm --------- The general idea is this: we generate a cost matrix between the commits in both commit ranges, then solve the least-cost assignment. The cost matrix is populated thusly: for each pair of commits, both diffs are generated and the "diff of diffs" is generated, with 3 context lines, then the number of lines in that diff is used as cost. To avoid false positives (e.g. when a patch has been removed, and an unrelated patch has been added between two iterations of the same patch series), the cost matrix is extended to allow for that, by adding fixed-cost entries for wholesale deletes/adds. Example: Let commits `1--2` be the first iteration of a patch series and `A--C` the second iteration. Let's assume that `A` is a cherry-pick of `2,` and `C` is a cherry-pick of `1` but with a small modification (say, a fixed typo). Visualize the commits as a bipartite graph: ------------ 1 A 2 B C ------------ We are looking for a "best" explanation of the new series in terms of the old one. We can represent an "explanation" as an edge in the graph: ------------ 1 A / 2 --------' B C ------------ This explanation comes for "free" because there was no change. Similarly `C` could be explained using `1`, but that comes at some cost c>0 because of the modification: ------------ 1 ----. A | / 2 ----+---' B | `----- C c>0 ------------ In mathematical terms, what we are looking for is some sort of a minimum cost bipartite matching; `1` is matched to `C` at some cost, etc. The underlying graph is in fact a complete bipartite graph; the cost we associate with every edge is the size of the diff between the two commits' patches. To explain also new commits, we introduce dummy nodes on both sides: ------------ 1 ----. A | / 2 ----+---' B | o `----- C c>0 o o o o ------------ The cost of an edge `o--C` is the size of `C`'s diff, modified by a fudge factor that should be smaller than 100%. The cost of an edge `o--o` is free. The fudge factor is necessary because even if `1` and `C` have nothing in common, they may still share a few empty lines and such, possibly making the assignment `1--C`, `o--o` slightly cheaper than `1--o`, `o--C` even if `1` and `C` have nothing in common. With the fudge factor we require a much larger common part to consider patches as corresponding. The overall time needed to compute this algorithm is the time needed to compute n+m commit diffs and then n*m diffs of patches, plus the time needed to compute the least-cost assignment between n and m diffs. Git uses an implementation of the Jonker-Volgenant algorithm to solve the assignment problem, which has cubic runtime complexity. The matching found in this case will look like this: ------------ 1 ----. A | / 2 ----+---' B .--+-----' o -' `----- C c>0 o ---------- o o ---------- o ------------ SEE ALSO -------- linkgit:git-log[1] GIT --- Part of the linkgit:git[1] suite

The git range-diff algorithm generates a cost matrix between commits, solves the least-cost assignment, and uses a bipartite graph to find the best explanation of the new commit series in terms of the old one, utilizing the Jonker-Volgenant algorithm to achieve this with cubic runtime complexity.