
adiff (arbitrary-tokens diff, patch and merge)

This is a half-working pre-alpha version; use with care.

Short summary

The main aim of this toolbox is to help with finding differences in text formats that do not have the fixed "line-by-line" semantics assumed by standard unix diff and related tools.

The problem was previously tackled by Arek Antoniewicz at MFF CUNI, who produced a working software package in C++ and designed the Regex-edged DFAs (REDFAs) that were used for user-specifiable tokenization of the input. The work on the corresponding thesis is finished.

This started as a simple Haskell port of that work, with some relatively orthogonal improvements packed in (mainly the histogram-style diffing). I later got rid of the REDFA concept: while super-interesting and useful in theory, I didn't find a sufficiently universal way to build good lexers from user-specified strings. Having a proper regex representation library (so that e.g. reconstructing Flex is easy) would help a lot.

TODO list

  • Implement patch functionality, mainly patchfile parsing and fuzzy matching of hunk context (diff and diff3 already work).
  • Implement a splitting heuristic for diffs, so that diffing large files doesn't take aeons.
  • Check whether we can have external lexers, unix-style.

How-To

Install using cabal. The adiff program has 3 sub-commands modeled on diff, patch and diff3 (patch is still on the TODO list).

Example

Let's have a file orig:

Roses are red. Violets are blue.
Patch is quite hard. I cannot rhyme.

and a modified file mine:

Roses are red. Violets are blue.
Patching is hard. I still cannot rhyme.

Let's use the words lexer, which marks everything whitespace-ish as whitespace, and picks up groups of non-whitespace "content" characters.
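
To make the idea concrete, here is a minimal sketch of such a lexer in Haskell. It is not the code from src; the names Token and wordsLexer are made up for illustration, and it assumes that "whitespace-ish" simply means whatever Data.Char.isSpace accepts.

```haskell
import Data.Char (isSpace)
import Data.Function (on)
import Data.List (groupBy)

-- | A token is either a run of whitespace or a run of "content" characters.
-- (Hypothetical type, not taken from the adiff sources.)
data Token = Whitespace String | Content String
  deriving (Eq, Show)

-- | Split the input into maximal runs of whitespace and non-whitespace.
-- Concatenating the token texts gives back the original input.
wordsLexer :: String -> [Token]
wordsLexer = map classify . groupBy ((==) `on` isSpace)
  where
    -- groupBy never yields empty groups, so looking at the first
    -- character is enough to decide the kind of the whole run.
    classify run@(c:_)
      | isSpace c = Whitespace run
    classify run = Content run

-- | Recover the text of a single token.
tokenText :: Token -> String
tokenText (Whitespace s) = s
tokenText (Content s) = s
```

With this sketch, the second line of orig would start with the tokens Content "Patch", Whitespace " ", Content "is", and so on, which is the granularity at which the diff below operates.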

Diffing the two files is done like this:

 $ cabal run adiff -- -l words diff orig mine

You should get something like this:

@@ -7 +7 @@
 . 
 |are
 . 
 |blue.
 .\n
-|Patch
+|Patching
 . 
 |is
-. 
-|quite
 . 
 |hard.
 . 
 |I
+. 
+|still
 . 
 |cannot
 . 
 |rhyme.
 .\n
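
Judging only from the listing above, each output line seems to consist of a diff marker (space, - or +), a token-kind marker (. for whitespace, | for content), and the token text with newlines escaped. The following sketch reproduces that layout; the types and the renderLine function are hypothetical, and the format rules are inferred from the example rather than taken from the adiff sources.

```haskell
-- | Token kinds as they appear in the listing above (hypothetical type).
data Token = Whitespace String | Content String

-- | How a token takes part in a two-way diff (hypothetical type).
data Change = Keep | Delete | Insert

-- | Render one token as one line of the token-wise diff.
renderLine :: Change -> Token -> String
renderLine change tok = marker : kind : concatMap escape text
  where
    marker = case change of
      Keep -> ' '
      Delete -> '-'
      Insert -> '+'
    (kind, text) = case tok of
      Whitespace s -> ('.', s)
      Content s -> ('|', s)
    -- Only the newline escape is visible in the example; other escapes
    -- may exist in the real tool but are not guessed here.
    escape '\n' = "\\n"
    escape c = [c]
```

For instance, renderLine Delete (Content "Patch") gives the line -|Patch from the hunk above.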

Let's pretend someone has sent us a new version, with a better formatted verse and some other improvements, in file yours:

Roses are red.
Violets are blue.
Patch is quite hard.
I cannot do verses.

We can run diff3 to get a patch with both changes, optionally with reduced context:

 $ cabal run adiff -- -l words diff3 mine orig yours -C1

...which outputs:

@@ -4 +4 @@
 |red.
-. 
+.\n
 |Violets
@@ -11 +11 @@
 .\n
-|Patch
+|Patching
 . 
 |is
-. 
-|quite
 . 
 |hard.
-. 
+.\n
 |I
+. 
+|still
 . 
@@ -23 +23 @@
 . 
-|rhyme.
+|do
+. 
+|verses.
 .\n

...or get a merged output right away, using the -m/--merge option:

Roses are red.
Violets are blue.
Patching is hard.
I still cannot do verses.

...or completely ignore whatever whitespace changes people decided to make for whatever reason, with -i/--ignore-whitespace (also works without -m):

Roses are red. Violets are blue.
Patching is hard. I still cannot do verses.

If there's a conflict (e.g. after substituting Merging for Patch in file yours), it gets highlighted in the merged diff as such:

[...]
 . 
 |blue.
 .\n
<|Patching
=|Patch
>|Merging
 . 
 |is
-. 
-|quite
[...]

and using the standard conflict marks in the merged output:

Roses are red.
Violets are blue.
<<<<<<<Patching|||||||Patch=======Merging>>>>>>> is hard.
I still cannot do verses.
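
The merge decision for a single changed spot can be sketched as the usual three-way rule: take whichever side changed, and emit conflict marks only when both sides changed differently. The function mergeChunk below is a hypothetical illustration of that rule on plain strings; it is not the actual adiff implementation, which works on aligned token sequences.

```haskell
-- | Merge one aligned chunk of text from the three versions.
-- Arguments follow the order used on the command line: mine, orig, yours.
mergeChunk :: String -> String -> String -> String
mergeChunk mine orig yours
  | mine == yours = mine   -- both sides agree (or nobody changed anything)
  | mine == orig = yours   -- only "yours" changed
  | yours == orig = mine   -- only "mine" changed
  | otherwise =            -- both changed differently: mark the conflict
      "<<<<<<<" ++ mine ++ "|||||||" ++ orig ++ "=======" ++ yours ++ ">>>>>>>"
```

Calling mergeChunk "Patching" "Patch" "Merging" produces the conflict text shown in the merged output above, while mergeChunk "Patching" "Patch" "Patch" simply returns "Patching".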