This article series come in N parts:
- Part 1: Overview of the article series, outline of algorithms used.
- Part 2: The Longest Common Substring implementation.
A "diff algorithm" is an algorithm that takes two collections, presumably an old and a newer version of the same collection, compares them, and outputs a sequence of operations or pieces of information about the two collections, usually what happened to it.
As an example, you could read the contents of two source code files, before and after a change, store the contents in a collection of strings, one string per line, and compare the two collections. Out of this comparison you would get a "diff" telling you what happened to the file between the two versions.
There are two ways to consider the output:
- A description of the relationship between the two collections
- What to do to the old version in order to upgrade it to have the same content as the new version.
As such, the output of the diff algorithm is a series of:
- equal, added, removed operations for the first case
- or copy, insert, delete operations for the second case
They mean exactly the same however, and in this article series we will treat it as case #2, how to modify the old version in order to bring up to date compared to the new version, so I will deal with copy, insert, and delete operations.
The goal of this article series is to walk you through implementing such a diff algorithm in C# for your .NET projects.
All the source code will be shown, no tricks, no "left as an exercise for the reader". The code will in some places be naive, simple, and perhaps not as performant as a more advanced implementation, and those things, i.e. making it a top-notch implementation, will be left as an exercise for the reader, but at the end of the series you should have a set of classes that will do everything I describe in these articles.
This first part will focus on the algorithm, taking a very simple text example and showing how the algorithm will figure out the various operations. Then, in part 2 and onwards we will focus on implementing this algorithm.
When we deal with "diff" algorithms, at least in context of what I've described above, we have some basic assumptions that has to hold true for the data, otherwise we will get odd and generally not useful output from the diff code.
- The two collections are "ordered" the same way. For instance, in terms of lines from source code files, we assume that you haven't sorted the lines or anything. If you didn't modify a piece of code in the file, it is still at or around the place it was originally.
- The two collections are related. You could open up a source code file and replace everything in it. The diff, however, would probably end up being just "delete all the old stuff" and "insert all the new stuff". Correct, but not very helpful.
The basic algorithm that I will use revolves around the Longest Common Substring (LCS) algorithm.
Basically, this algorithm compares two collections of elements, and finds the longest sequence that exists unbroken in both collections. In terms of a source code file, it would be longest stretch of lines of text that exists, without differences, in both files.
Now, since it is easy to illustrate and explain a diff implementation using text strings, that's what I'm going to use for this series. However, the implementation I will show will revolve around generic collections of elements. This means that you could use the diff code to figure out the differences between two strings (which are collections of characters), two collections of strings read in from text files, two collections of ORM-like objects. Basically, anything that can be compared.
However, for this series I will use text examples as they lend themselves easily to illustrations.
Let's show an example and see how it works.
Take the following two pieces of text:
This long piece of text will have a common part found by LCS. This extra long piece of text will have some common parts found by LCS.
The diff function would take the two collections, run them through a LCS function, and "align up the common parts".
The result could look like this:
This long piece of text will have a common part found by LCS. This extra long piece of text will have some common parts found by LCS. ^- longest common substring -^
Now, the way the diff implementation will work is that a piece of the two collections that are equal, i.e. found by the LCS function, is a copy operation. But, in this case there's text both before and after the LCS, and we don't yet know what kind of operations that will handle those, but the current list of operations is now:
- some operations here
- copy 30 characters,
·long piece of text will have·
- some more operations here
(I used the small dot, ·, to indicate a space, since code formatting on the site removes the extra spaces at the start and end of those
inline code blocks)
The diff function would then recursively apply itself to the portions before and after the LCS to figure out what operations to use.
I'm going to gloss over all the recursive calls here and instead illustrate what the end result will be. We will see more about how this works in part 2 when we start implementing the algorithm in code. Of course, the illustration below won't actually be manifested by the diff function, but we can use it, knowing that this is basically how the data will be treated, to figure out what operations the final result will contain.
Here's how the end result would look:
This______ long piece of text will have ____a common part_ found by LCS. This extra long piece of text will have some_ common parts found by LCS.
I used underscores to indicate portions of the two collections that aren't present in both. These will end up as either a insert operation, or a delete operation.
The rules for the operations are as follows:
- Any text present in both, i.e. found by LCS, is a copy operation
- Any text present in the first collection, but not the second, is a delete operation. After all, it was present in the old version, but not in the new, and thus it was deleted.
- Any text present in the second collection, but not the first, is a insert operation. It was not present in the old version, but it was in the new, and thus it had to be added.
So, the final sequence of operations for our result is as follows:
- copy 4 characters,
- insert 6 characters,
- copy 30 characters,
·long piece of text will have·
- insert 4 characters,
- delete 1 character,
- copy 12 characters,
- insert 1 character,
- copy 14 character,
·found by LCS.
If you start out from the left end of the old version of the text, and apply the operations one at a time, you will end up with the new version.
The exact ordering of operations when a delete and insert operation are together is not important. If you reverse the order of operation 4 and 5 above, you still get the same result.
Note: You could say that the delete operation is mislabeled and that it should really be called a skip operation. Since the text you're deleting isn't actually present in the new version yet, you're in fact skipping the text in the old version.