Home Explore Blog CI



git

4th chunk of `Documentation/git-range-diff.adoc`
5f8c919a58766aa05981fecf1fbf3388a139c2d11faee4520000000100000daa
 commit, for example, is completely red,
even if the intent of the old commit was to add something.

To help with that, `range` uses the `--dual-color` mode by default. In
this mode, the diff of diffs will retain the original diff colors, and
prefix the lines with -/+ markers that have their *background* red or
green, to make it more obvious that they describe how the diff itself
changed.


Algorithm
---------

The general idea is this: we generate a cost matrix between the commits
in both commit ranges, then solve the least-cost assignment.

The cost matrix is populated thusly: for each pair of commits, both
diffs are generated and the "diff of diffs" is generated, with 3 context
lines, then the number of lines in that diff is used as cost.

To avoid false positives (e.g. when a patch has been removed, and an
unrelated patch has been added between two iterations of the same patch
series), the cost matrix is extended to allow for that, by adding
fixed-cost entries for wholesale deletes/adds.

Example: Let commits `1--2` be the first iteration of a patch series and
`A--C` the second iteration. Let's assume that `A` is a cherry-pick of
`2,` and `C` is a cherry-pick of `1` but with a small modification (say,
a fixed typo). Visualize the commits as a bipartite graph:

------------
    1            A

    2            B

		 C
------------

We are looking for a "best" explanation of the new series in terms of
the old one. We can represent an "explanation" as an edge in the graph:


------------
    1            A
	       /
    2 --------'  B

		 C
------------

This explanation comes for "free" because there was no change. Similarly
`C` could be explained using `1`, but that comes at some cost c>0
because of the modification:

------------
    1 ----.      A
	  |    /
    2 ----+---'  B
	  |
	  `----- C
	  c>0
------------

In mathematical terms, what we are looking for is some sort of a minimum
cost bipartite matching; `1` is matched to `C` at some cost, etc. The
underlying graph is in fact a complete bipartite graph; the cost we
associate with every edge is the size of the diff between the two
commits' patches. To explain also new commits, we introduce dummy nodes
on both sides:

------------
    1 ----.      A
	  |    /
    2 ----+---'  B
	  |
    o     `----- C
	  c>0
    o            o

    o            o
------------

The cost of an edge `o--C` is the size of `C`'s diff, modified by a
fudge factor that should be smaller than 100%. The cost of an edge
`o--o` is free. The fudge factor is necessary because even if `1` and
`C` have nothing in common, they may still share a few empty lines and
such, possibly making the assignment `1--C`, `o--o` slightly cheaper
than `1--o`, `o--C` even if `1` and `C` have nothing in common. With the
fudge factor we require a much larger common part to consider patches as
corresponding.

The overall time needed to compute this algorithm is the time needed to
compute n+m commit diffs and then n*m diffs of patches, plus the time
needed to compute the least-cost assignment between n and m diffs. Git
uses an implementation of the Jonker-Volgenant algorithm to solve the
assignment problem, which has cubic runtime complexity. The matching
found in this case will look like this:

------------
    1 ----.      A
	  |    /
    2 ----+---'  B
       .--+-----'
    o -'  `----- C
	  c>0
    o ---------- o

    o ---------- o
------------


SEE ALSO
--------
linkgit:git-log[1]

GIT
---
Part of the linkgit:git[1] suite

Title: Git Range-Diff Algorithm and Implementation
Summary
The git range-diff algorithm generates a cost matrix between commits, solves the least-cost assignment, and uses a bipartite graph to find the best explanation of the new commit series in terms of the old one, utilizing the Jonker-Volgenant algorithm to achieve this with cubic runtime complexity.