viernes, 20 de septiembre de 2013

Automatic detection of file renames for Darcs - Part 2

In the last few weeks i was refining the automatic detection of file renames implementation adding support for windows, and support for more complicated renames.

Now if you like you can consult the inode information saved in the index at any time with "darcs show index":
⮁ darcs init
⮁ mkdir testdir
⮁ touch testfile
⮁ darcs record -al -m "test files"
Finished recording patch 'test files'
⮁ ls -i1d . testdir testfile
2285722 .
2326707 testdir
2238437 testfile

⮁ darcs show index
07ec6ccf873cf215ac0789a420f154ba9218b7ca5c4fce432584edab49766a7c 2285722 ./
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 2326707 testdir/
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 2238437 testfile
Now with the new dependency algorithm, you can make more complicated renames, like exchange of filenames, folder moves. The algorithm don't manage exchange of filenames inside of a folder that have been renamed exchanging names, anything else is managed fine.
For example:
⮁ ls -1pC
_darcs/  dir/  dir2/  dir3/  foo  foo2  foo3  foo4  foo5
⮁ mv foo dir3
⮁ mv foo2 dir
⮁ mv foo3 dir2
⮁ mv foo4 foo4.tmp
⮁ mv foo5 foo4
⮁ mv foo4.tmp foo5
⮁ mv dir3 dir
⮁ mv dir dir2/dir2
⮁ mv dir2 dir
⮁ darcs whatsnew --look-for-moves
move ./dir ./dir2/dir2
move ./dir2 ./dir
move ./dir3 ./dir/dir2/dir3
move ./foo ./dir/dir2/dir3/foo3
move ./foo2 ./dir/dir2/foo2
move ./foo3 ./dir/foo3
move ./foo4 ./foo4.tmp~
move ./foo5 ./foo4
move ./foo4.tmp~ ./foo5
The moves shown by "darcs whatsnew --look-for-moves" are not exactly the ones made but yield the same final result.

jueves, 19 de septiembre de 2013

Automatic detection of replaces for Darcs - Part 1

In the last post i show some examples and use cases of the "--look-for-replaces" flag for whatsnew, record, and amend-record commands in Darcs. When used, this flag provides automatic detection of replaces(possible ones), even when the modified files shows more differences than only the replaces, and even shows possible "forced" replaces.
The simplest case is when you made a replace in you editor in of choice and don't do any other change to the file and then, after check all is ok, remember that you could have used a replace patch.
file before:
line1 foo
line2 foo
line3 foo
file after:
line1 bar
line2 bar
line3 bar
Then, instead of:
> darcs revert -a file
Reverting changes in "file":

Finished reverting.
> darcs replace foo bar file
> darcs record -m "replace foo bar"
replace ./file [A-Za-z_0-9] foo bar
Shall I record this change? (1/1) [ynW...], or ? for more options: y
Do you want to record these changes? [Yglqk...], or ? for more options: y
Finished recording patch 'replace foo bar'
You could do:
> darcs record --look-for-replaces -m "replace foo bar"
replace ./file [A-Za-z_0-9] foo bar
Shall I record this change? (1/1) [ynW...], or ? for more options: y
Do you want to record these changes? [Yglqk...], or ? for more options: y
Finished recording patch 'replace foo bar'
But it doesn't have to be a full replace. For instance, if you don't want to change a pair replaces, when you try to detect the changes instead of:
file before:
line1 foo
line2 foo
line3 foo
line4 foo
file after:
line1 bar
line2 bar
line3 bar
line4 foo
Then, instead of:
> darcs whatsnew
hunk ./file 1
-line1 foo
-line2 foo
-line3 foo
+line1 bar
+line2 bar
+line3 bar
With the new flag you could record this:
> darcs whatsnew --look-for-replaces
replace ./file [A-Za-z_0-9] foo bar
hunk ./file 4
-line4 bar
+line4 foo
Say you replace a word for another word that was already in the file. Normally this would mean that you should use "darcs replace --force". The look-for-replaces flag always "forces" the replaces, so if you try this, the changes to make the replace reversible will be shown before the replace patch:
file before:
line1 foo
line2 foo
line3 foo
line4 bar
file after:
line1 bar
line2 bar
line3 bar
line4 bar
With the new flag you will see the same patches like if you have made a "darcs replace --force foo bar file":
> darcs whatsnew --look-for-replaces
hunk ./file 4
-line4 bar
+line4 foo
replace ./file [A-Za-z_0-9] foo bar
Given certain limitations you could have any number of replaces detected, like this:
file before:
foo foo2 foo3
fee fee2 fee3
file after:
bar bar2 bar3
bor bor2 bor3
All the replaces are shown below:
> darcs whatsnew --look-for-replaces
replace ./file [A-Za-z_0-9] fee bor
replace ./file [A-Za-z_0-9] fee2 bor2
replace ./file [A-Za-z_0-9] fee3 bor3
replace ./file [A-Za-z_0-9] foo bar
replace ./file [A-Za-z_0-9] foo2 bar2
replace ./file [A-Za-z_0-9] foo3 bar3
If you want to know more about the limitations of this functionality, check Automatic detection of replaces for Darcs - Part 2.

Automatic detection of replaces for Darcs - Part 2

The last weeks i was implementing "--look-for-replaces" flag for whatsnew, record, and amend-record commands in Darcs. When used, this flag provides automatic detection of replaces(possible ones) even when the modified files shows more differences than only the replaces, given they meet the following prerequisites:
1. For a given "word" and a given file, there is not need for all the instances to be replaced, but there must be only one replace suggestion posible. i.e.:
this is ok:
file before:
foo
foo
foo
file after:
foo
bar
bar
this is not detected:
file before:
foo
foo
foo
file after:
foo
bar
bar2
2. The replace must happen in lines that have the same amount of words between the recorded and the working state, otherwise it would not be detected.
this is ok:
file before:
foo
foo
foo
file after:
foo roo
bar fee
bar
this is not detected(i don't know which is to detect anyway):
file before:
figaro foo
figaro foo
figaro foo
file after:
figaro foo
figaro bar bee
figaro foo bar
3. There must be at least one hunk with the same amount of lines in the - and + side that contains the replace.
this is not detected:
file before:
line1 foo
line2 foo
line3 foo
file after:
line1 bar
line2or3 bar
It would not detect this replace, even if it is a "perfect" replace, because it does not have the same number of lines, and is not trivial to tell which line is the one "modified" and which one is the one "deleted".

For more details about the implementation you could look on the look-for-replaces wiki page

martes, 13 de agosto de 2013

Automatic detection of file renames for Darcs

In the last few weeks i was implementing automatic detection of file renames adding "look-for-moves" flag to the amend-record, record, and whatsnew commands.

In darcs are 3 states:
  • The recorded state is the one is marked by the last record made.
  • The working state is the actual state of the files in the repository with all the last changes.
  • The pending state is the one that mark changes like file adds, moves, replaces, etc, before they are recorded. Is a temporal state between recorded and working that let darcs know about what filenames to track, and changes that are not common like replaces.

 If a file rename is not marked in the pending state, darcs lost track of the file and can't know where it is, and then `darcs whatsnew` and `darcs record` will indicate the file as deleted.
To detect this file rename I choose to use the inode info in the filesystem to check for equality between different filenames in the recorded and working state of the repo. for those who don't know, the inode is an index number assigned by the file system to identify a specific file data. The file name is linked to the data by this number, and it's used by directories as well. You can consult this number with "ls -i".
⮁ mkdir testdir
⮁ touch testfile
⮁ ln testfile testfile.hardlink
⮁ ln -s testfile testfile.symboliclink
⮁ ls -i1
10567718 testdir
10485776 testfile
10485776 testfile.hardlink
10485767 testfile.symboliclink 
You can see that the hardlink shares the same number with the test file, this is because a file is essentially a hardlink to the file data and when you make a new hardlink you are sharing the same file data, so the same inode number.
To have an old inode to filename mapping, there must be some record of the files inodes in some place, so I added the inode info to the index of hashed-storage in _darcs/index. The index save the last info about the record plus the pending state, sort of, so is a perfect fit to save this info.
Then comparing the RecordedAndPending Tree(from the index) with the Working Tree i get the file changes in a pair list mapping between the two states. With this list I resolve dependencies between the different moves, making temporal names if it's necessary and generating a FL list of move patches to merge with the changes between pending and working patches.
This patches are shown in with whatsnew or are selected with record/amend-record to be recorded in the repo.
There is a little more to make this happen but that's the core idea of the implementation.
The algorithm doesn't care if the file are modified or not, because it doesn't care of the content of the files, so it's very robust in that sense.
With this implementation you could do any move directly with "mv", and is very lightweight and fast in detecting moves so is likely a good decision make "--look-for-moves" a default flag. You could do things like this:
⮁ darcs init
Repository initialized.
touch foo
darcs record -a -m add_file_foo -A x --look-for-adds
Finished recording patch 'add_file_foo'
mv foo foo2
darcs whatsnew --look-for-moves
move ./foo ./foo2
This doesn't work on Windows yet, because fileID(the function on unix-compat that get the inode number) is lacking an implementation on windows. I know the windows API have GetFileInformationByHandle (it returns a BY_HANDLE_FILE_INFORMATION structure that contains the file index number[1]), so there doesn't have to be too hard to add an implementation of it with some boilerplate code to make the interface.
More complicated moves should work and some does but I was having problems with the dependency resolving algorithm implementation. I made some mistakes in the first implementation and I'm dragging them since then. I'm confident to know what is the error so I will fix it soon.
UPDATE: i'm testing a windows implementation with the Win32 haskell library on a virtual machine.

sábado, 20 de julio de 2013

Patience diff algorithm benefits for darcs

In this post i am going to explain the benefits of Bram Cohen's patience diff algorithm for darcs but first we have to understand how this algorithm works. There are great posts on the web that explain it really well so instead of trying to explain it again i'm going to quote the important things i need to remark the benefits it have for darcs and haskell-like non-curly languages by examples.

A brief summary of what the patience diff algorithm does from Bram Cohen's Blog:
1. Match the first lines of both if they're identical, then match the second, third, etc. until a pair doesn't match.
2. Match the last lines of both if they're identical, then match the next to last, second to last, etc. until a pair doesn't match.
3. Find all lines which occur exactly once on both sides, then do longest common subsequence on those lines, matching them up.
4. Do steps 1-2 on each section between matched lines
From Alfedenzo's Blog:
The common diff algorithm is based on the longest common subsequence problem. Given (in this case) two documents, finding all lines that occur in both, in the same order. That is, making a third document such that every line in the document appears in both of the original documents, and in the same order. Once you have the longest common subsequence, all that remains is to describe the differences between each document and the common document, a much easier problem since the common document is a subset of the other documents.
While the diffs generated by this method are efficient, they tend not to be as human readable.
Patience Diff also relies on the longest common subsequence problem, but takes a different approach. First, it only considers lines that are (a) common to both files, and (b) appear only once in each file. This means that most lines containing a single brace or a new line are ignored, but distinctive lines like a function declaration are retained. Computing the longest common subsequence of the unique elements of both documents leads to a skeleton of common points that almost definitely correspond to each other. The algorithm then sweeps up all contiguous blocks of common lines found in this way, and recurses on those parts that were left out, in the hopes that in this smaller context, some of the lines that were ignored earlier for being non-unique are found to be unique. Once this process is finished, we are left with a common subsequence that more closely corresponds to what humans would identify.

Then when you modify something that is between unique lines like this:
you get this two different patches depending which algorithm is used:
Patience diff:
Myers diff:
In this case, you have one hunk instead of three in the case of unrelated functions, but in the case of doSomething, you still have a separate hunk because of the unique line in common.
Normally the Myers diff should perform bad when some lines are only moved from one place to another like in this case, and i'm glad to say that with the fine tuned myers implementation of darcs this doesn't happen. But it still happens in curly-braced languages, like in this case(from here):
You get this two different diffs:
Patience diff:
Myers Diff:
In theory this could happen with not curly braces if there are non-unique equal lines in a file like this:
Patience Diff:
Myers Diff:
Here you can see that the hunks offered by the patience diff algorithm are more useful and understandable. But this example depends in equal lines hardly found in real cases especially in haskell when the whitespaces aren't necessarily the same between functions as in languages ​​like python.
Usually i would say is better to have smaller hunks that are isolated between functions, because it should avoid dependencies between patches, but then there are more changes to select/unselect and some times it depends on what you think is best to avoid conflicts between patchs. That is why you get the choice to use one algorithm or the other.
You should take in consideration how the algorithm is used.
When you use a command with a diff algorithm flag, the algorithm is used always to calculate the hunks of the actual unrecorded changes. The commands that make this are record, apply, mark-conflicts, pull, unpull, obliterate, revert, unrevert and rebase(suspend, unsuspend, reify, obliterate, inject and pull) . The flag don't change an already saved patch, like one sended or one pushed. Thereby patches to be applied or pulled, are not modified by the diff flag.
When you use the record command, the patch saved depends on the diff algorithm and the hunks manually chosen. The patch is saved in hunks  so when you resolve conflicts between patchs this saved hunks are used.
In the case of unrevert, you should take into account that the patch saved by revert is not affected by the unrevert's diff flag. You can only get a different patch is you use the flag when you make the revert i.e. "darcs revert --patience".

viernes, 12 de julio de 2013

Integrating patience diff and some benchmarks

The past two weeks i was integrating the patience diff algorithm adding an option flag so it can be selected from the command line. This was not as easy as it seems because the diff algorithm was deeply integrated in the codebase and adding a way to select between the two implementation would mean changing more than changing little bits of code here and there. But its done. There are a few rough edges here and there but is ready to be used.
Now you can use "--patience" to use the alternative patience diff algorithm. There should be relatively easy to add more algorithms now.

Regarding the performance, after running a variety of benchmarks with both algorithms in the end I really couldn't get conclusive results. Depending of the input one algorithm could perform better that the other, but only by a shallow margin. This are some of the runs i got(ShellTestsBenchmark, PureCriterionBenchmark).

You can see the code in darcs-patience and darcs-patience-benchmarks
 repositories.

This week I also was researching the best way to implement --look-for-moves flag. I came to the conclusion that the best trade-off is to extend the functionality in hashed-storage adding the inode of the files and then making a changes in some of the darcs commands to maintain and support this new info. I will be using unix-compat package. The functionality is not yet fully implemented for Windows but it may be a solution to that with this. Unfortunately I will have to move to a windows machine to see how this could be implemented. There are some corner cases to consider as well.

You are going to be able to see my progress in darcs-look-for-moves and hashed-storage-look-for-moves repositories in the next weeks.

domingo, 30 de junio de 2013

Patience Diff for Darcs

As you know, my GSoC project consist in enhancing record command for Darcs.

As part of this, the last two weeks I was implementing the patience diff algorithm for patches and testing it for correctness, performance and usefulness.

First, after identifying where I had to add the code, I proceeded to see if there was any implementation that could use (reuse code, don't reinvent the wheel). To my joy, David Roundy implemented it in his RCS Iolaus and it was almost a direct replacement.
Before doing this I was working in a stack overflow issue (issue2313, patch1076) and i found that big files and diffs cause a stack overflow in this implementation too. After profiling i reduce the problem to the function "byparagraph". It wasn't tail-recursive so after convert it the problem was away and i happy to say it outperform the old diff implementation in this use cases.
For testing the performance i used two tools: criterion and GHC time and space profiling system. The result shows that there isn't any difference in day to day use, performance wise. It particularly perform really bad with files that have many lines equal. I don't think this is really a problem because is very rare(but i'm looking into it) . The old algorithm use more memory but i couldn't make use of all my 4gb of ram as much as i try it, so i can say this is not an issue.
For testing usefulness i make some examples(based on this posts [1][2][3]). It seems to happen that the two different algorithms return actually quite similar patches(if not exactly the same patch) in non-curly braced languages like haskell. On the other hand, in curly-braced languages patience diff is much better.

All in all, i could say that there is several advantages to make the switch. I will be updating my findings about the performance and usefulness in next posts.
You can review the code in http://hub.darcs.net/jlneder/darcs-patience.
In http://hub.darcs.net/jlneder/darcs-patience-tests there is code to compare the two algorithm implementations.