are not ancestors
of the new "bad" commit.
So when a commit is marked as "bad" we know we can remove all the
commits in the graph except those that are ancestors of the new "bad"
commit. This means that:
-------------
information_if_bad(X) = N - number_of_ancestors(X) (TRUE)
-------------
where N is the number of commits in the (cleaned up) graph.
So in the end this means that to find the best bisection commits we
should maximize the function:
-------------
f(X) = min(number_of_ancestors(X), N - number_of_ancestors(X))
-------------
And this is nice because at step 2) we compute number_of_ancestors(X),
so at step 3) computing f(X) costs nothing extra.
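As a sketch (the function name is mine, git computes the same quantity internally), step 3) then amounts to:

```python
def bisect_value(n_ancestors, n_total):
    """Guaranteed information from testing X, whatever the outcome.

    n_ancestors: number_of_ancestors(X), computed at step 2).
    n_total:     N, the number of commits in the cleaned up graph.
    """
    # "good" eliminates the n_ancestors ancestors of X;
    # "bad" eliminates the n_total - n_ancestors non-ancestors.
    return min(n_ancestors, n_total - n_ancestors)
```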
Let's take the following graph as an example:
-------------
            G-H-I-J
           /       \
A-B-C-D-E-F         O
           \       /
            K-L-M-N
-------------
If we compute the following non-optimal function on it:
-------------
g(X) = min(number_of_ancestors(X), number_of_descendants(X))
-------------
we get:
-------------
            4 3 2 1
            G-H-I-J
1 2 3 4 5 6/       \0
A-B-C-D-E-F         O
           \       /
            K-L-M-N
            4 3 2 1
-------------
but with the algorithm used by git bisect we get:
-------------
            7 7 6 5
            G-H-I-J
1 2 3 4 5 6/       \0
A-B-C-D-E-F         O
           \       /
            K-L-M-N
            7 7 6 5
-------------
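The values in the two annotated graphs above can be reproduced with a short sketch (the `parents` encoding and function names are mine, not git's internals; ancestor counts include the commit itself, descendant counts do not):

```python
# Hypothetical encoding of the example history: commit -> list of parents.
parents = {
    "A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"], "F": ["E"],
    "G": ["F"], "H": ["G"], "I": ["H"], "J": ["I"],
    "K": ["F"], "L": ["K"], "M": ["L"], "N": ["M"],
    "O": ["J", "N"],
}
N = len(parents)

def ancestors(x):
    # Ancestors of x, including x itself.
    seen, stack = set(), [x]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen

def g(x):
    # Non-optimal: min(ancestors incl. x, descendants excl. x).
    a = len(ancestors(x))
    d = sum(1 for y in parents if x in ancestors(y)) - 1
    return min(a, d)

def f(x):
    # What git bisect maximizes: min(ancestors, non-ancestors).
    a = len(ancestors(x))
    return min(a, N - a)

print([f(x) for x in "GHKL"])  # all four reach the top value 7
print(f("F"), g("F"))          # F only reaches 6 under either function
```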
So we choose G, H, K or L as the best bisection point, which is better
than F, because if for example L is bad, then we will know not only
that L, M and N are bad but also that G, H, I and J are not the first
bad commit (since we suppose that there is only one first bad commit
and it must be an ancestor of L).
So the current algorithm seems to be the best possible given what we
initially supposed.
Skip algorithm
~~~~~~~~~~~~~~
When some commits have been skipped (using "git bisect skip"), then
the bisection algorithm is the same for steps 1) to 3). But then we use
roughly the following steps:
6) sort the commits by decreasing associated value
7) if the first commit has not been skipped, we can return it and stop
here
8) otherwise filter out all the skipped commits in the sorted list
9) use a pseudo random number generator (PRNG) to generate a random
number between 0 and 1
10) multiply this random number by its square root to bias it toward 0
11) multiply the result by the number of commits in the filtered list
to get an index into this list
12) return the commit at the computed index
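Steps 6) to 12) can be sketched like this (a simplified model of the selection, not git's actual C implementation; the names are made up):

```python
import math
import random

def choose_skip_fallback(sorted_commits, skipped):
    """Pick a commit to test when some commits have been skipped.

    sorted_commits: commits sorted by decreasing associated value (step 6).
    skipped: the set of commits marked with "git bisect skip".
    """
    if sorted_commits[0] not in skipped:    # step 7): best one is testable
        return sorted_commits[0]
    # step 8): drop the skipped commits
    candidates = [c for c in sorted_commits if c not in skipped]
    r = random.random()                     # step 9): r in [0, 1)
    r *= math.sqrt(r)                       # step 10): r**1.5 biases toward 0
    # steps 11) and 12): scale to an index and return that commit
    return candidates[int(r * len(candidates))]
```

Because the random value is biased toward 0 and the list is sorted by decreasing value, high-value commits stay more likely to be picked while still spreading the choice away from the skipped area.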
Skip algorithm discussed
~~~~~~~~~~~~~~~~~~~~~~~~
After step 7) (in the skip algorithm), we could check if the second
commit has been skipped and return it if it is not the case. And in
fact that was the algorithm we used from when "git bisect skip" was
developed in Git version 1.5.4 (released on February 1st 2008) until
Git version 1.6.4 (released July 29th 2009).
But Ingo Molnar and H. Peter Anvin (another well-known Linux kernel
developer) both complained that sometimes the best bisection points
all happened to be in an area where all the commits are
untestable. And in this case the user was asked to test many
untestable commits, which could be very inefficient.
Indeed untestable commits are often untestable because a breakage was
introduced at one time, and that breakage was fixed only after many
other commits were introduced.
This breakage is of course most of the time unrelated to the breakage
we are trying to locate in the commit graph. But it prevents us from
knowing if the interesting "bad behavior" is present or not.
So it is a fact that commits near an untestable commit have a high
probability of being untestable themselves. And the best bisection
commits are often found together too (due to the bisection algorithm).
This is why it is a bad idea to just choose the next best unskipped
bisection commit when the first one has been skipped.
We found that most commits on the graph may give quite a lot of
information when they are tested. And the commits that