sábado, 20 de julio de 2013

Patience diff algorithm benefits for darcs

In this post i am going to explain the benefits of Bram Cohen's patience diff algorithm for darcs but first we have to understand how this algorithm works. There are great posts on the web that explain it really well so instead of trying to explain it again i'm going to quote the important things i need to remark the benefits it have for darcs and haskell-like non-curly languages by examples.

A brief summary of what the patience diff algorithm does from Bram Cohen's Blog:
1. Match the first lines of both if they're identical, then match the second, third, etc. until a pair doesn't match.
2. Match the last lines of both if they're identical, then match the next to last, second to last, etc. until a pair doesn't match.
3. Find all lines which occur exactly once on both sides, then do longest common subsequence on those lines, matching them up.
4. Do steps 1-2 on each section between matched lines
From Alfedenzo's Blog:
The common diff algorithm is based on the longest common subsequence problem. Given (in this case) two documents, finding all lines that occur in both, in the same order. That is, making a third document such that every line in the document appears in both of the original documents, and in the same order. Once you have the longest common subsequence, all that remains is to describe the differences between each document and the common document, a much easier problem since the common document is a subset of the other documents.
While the diffs generated by this method are efficient, they tend not to be as human readable.
Patience Diff also relies on the longest common subsequence problem, but takes a different approach. First, it only considers lines that are (a) common to both files, and (b) appear only once in each file. This means that most lines containing a single brace or a new line are ignored, but distinctive lines like a function declaration are retained. Computing the longest common subsequence of the unique elements of both documents leads to a skeleton of common points that almost definitely correspond to each other. The algorithm then sweeps up all contiguous blocks of common lines found in this way, and recurses on those parts that were left out, in the hopes that in this smaller context, some of the lines that were ignored earlier for being non-unique are found to be unique. Once this process is finished, we are left with a common subsequence that more closely corresponds to what humans would identify.

Then when you modify something that is between unique lines like this:
you get this two different patches depending which algorithm is used:
Patience diff:
Myers diff:
In this case, you have one hunk instead of three in the case of unrelated functions, but in the case of doSomething, you still have a separate hunk because of the unique line in common.
Normally the Myers diff should perform bad when some lines are only moved from one place to another like in this case, and i'm glad to say that with the fine tuned myers implementation of darcs this doesn't happen. But it still happens in curly-braced languages, like in this case(from here):
You get this two different diffs:
Patience diff:
Myers Diff:
In theory this could happen with not curly braces if there are non-unique equal lines in a file like this:
Patience Diff:
Myers Diff:
Here you can see that the hunks offered by the patience diff algorithm are more useful and understandable. But this example depends in equal lines hardly found in real cases especially in haskell when the whitespaces aren't necessarily the same between functions as in languages ​​like python.
Usually i would say is better to have smaller hunks that are isolated between functions, because it should avoid dependencies between patches, but then there are more changes to select/unselect and some times it depends on what you think is best to avoid conflicts between patchs. That is why you get the choice to use one algorithm or the other.
You should take in consideration how the algorithm is used.
When you use a command with a diff algorithm flag, the algorithm is used always to calculate the hunks of the actual unrecorded changes. The commands that make this are record, apply, mark-conflicts, pull, unpull, obliterate, revert, unrevert and rebase(suspend, unsuspend, reify, obliterate, inject and pull) . The flag don't change an already saved patch, like one sended or one pushed. Thereby patches to be applied or pulled, are not modified by the diff flag.
When you use the record command, the patch saved depends on the diff algorithm and the hunks manually chosen. The patch is saved in hunks  so when you resolve conflicts between patchs this saved hunks are used.
In the case of unrevert, you should take into account that the patch saved by revert is not affected by the unrevert's diff flag. You can only get a different patch is you use the flag when you make the revert i.e. "darcs revert --patience".

No hay comentarios:

Publicar un comentario en la entrada