3
This is part 3 of an article series where I will walk you through creating a simple and reusable diff-algorithm for your .NET projects.

Previously, on DevDirective...

If you haven't already read part 1 and part 2, I strongly suggest you do so now and come back here afterwards. It will make this part, and subsequent ones, much easier to understand when you have all the context.

Article Series

This article series come in N parts:

  1. Part 1: Overview of the article series, outline of algorithms used.
  2. Part 2: The Longest Common Substring implementation.
  3. Part 3: The Diff implementation.

Repository

As mentioned in part 2, you can use LINQPad and download the program code for these articles from my Kiln account.

Overview

In this part I will write the actual Diff implementation. If you remember from the first part, I said that the basic implementation looks a bit like this:

  1. Find the longest common substring between the two collections
  2. Consider that section as a copy operation
  3. Recursively apply itself to the sections before and after the longest common substring

I glossed over the details, so in this article I will formalize it a bit.

Let's take the two strings step by step and see how the recursive bit actually works.

First we find the LCS between the two full texts, and we already know how that looks from the first two parts:

      This long piece of text will have a common part found by LCS.
This extra long piece of text will have some common parts found by LCS.
          ^- longest common substring -^

The LCS is a copy operation, and we need to remember this. Now we have to recursively diff the sections before and after the LCS, let's start with the section before.

We thus compare these two strings (I'm going to just show the two strings and their LCS in one go, to cut down on the length of this article):

This
This extra
^^^^ LCS 

This is also a copy operation. We diff recursively again, but this time there are no sections before the LCS, so let's take the one after.

Basically we compare an empty string (there is nothing after This in the first string) with ·extra, and this time we find no LCS.

What if we don't find an LCS? Well, there are 3 distinct cases that can produce this situation:

  1. The first collection is empty, the second is not, thus everything in the second collection is considered as inserted data, remember that the second collection is the new content, so if there is stuff in the new collection that isn't in the old, it has to have been inserted.
  2. The first collection contains data, but the second one is empty. This is the opposite case compared to the first one, and everything in the first collection here was thus deleted, since it is not in the new collection.
  3. We actually have content in both, but no LCS, which means there are no common parts. In other words, everything in the first collection (old collection), was replaced with everything in the second (new) collection. We will implement this as a delete + an insert.

So, in the case we cornered ourselves into above we had text in the second collection, but the first was empty. This is case #1, so we now have inserted text.

Let's summarize here before moving on.

  1. copy 30 characters
  2. before #1, copy 4 characters
  3. after #2, insert 6 characters

Let's reorder:

  1. copy 4 characters (This)
  2. insert 6 characters (·extra)
  3. copy 30 characters (·long piece of text will have·)

Is this starting to look familiar? It should because this is the start of the operations I outlined at the end of part 1.

So, now we have dealt with everything before that long 30-character LCS, let's unwind the recursive calls back to it and deal with the sections after it.

We thus compare the following two strings:

   a common part found by LCS.
some common parts found by LCS.
    ^--- LCS --^

This is a copy operation of 12 characters. Recursively we diff the section before it:

a
some

There's no LCS here. Now we have found case #3, we have content in both collections, so we implement this case as a delete + an insert operation. Since we're going to unwind back to the section after the copy 12 operation above, let's tuck the new operations on to the end of the list we made earlier:

  1. copy 4 characters (This)
  2. insert 6 characters (·extra)
  3. copy 30 characters (·long piece of text will have·)
  4. delete 1 character (a)
  5. insert 4 characters (some)
  6. copy 12 characters (·common part)

Compare the section after our 12-character copy:

 ·found by LCS.
s found by LCS.
 ^---- LCS ---^

This is a 14-character copy operation. There is a small section before it though that consists of a single s in the second collection and nothing in the first.

Thus, before we copy the 14 characters we have to insert the s (case #1), and we can tuck on the final operations to our list:

  1. copy 4 characters (This)
  2. insert 6 characters (·extra)
  3. copy 30 characters (·long piece of text will have·)
  4. delete 1 character (a)
  5. insert 4 characters (some)
  6. copy 12 characters (·common part)
  7. insert 1 character (s)
  8. copy 14 characters (·found by LCS.)

I hope you were able to follow all of that, when you see my implementation below I'm also hoping you will be able to match the implementation up with the logic above.

Changes changes changes

However, before we get to the Diff implementation, we need to adjust the code we wrote in the previous part, as it has a problem, a big problem.

The problem is that the methods operate on the complete collections. If we are to recursively apply the Diff algorithm to different sections of the collections, we need a way to "split" the collections into sections.

Thus, we can do one of two things:

  1. We can make new collections, copying out only the section of the original collections that we need.
  2. We can adjust the methods we wrote last time to specify which section of the collection to operate on.

Since there will be potentially lots of recursive calls here, creating new collections and copying elements doesn't sound like something I'd like to do, the pressure on the garbage collector would be quite large compared to the alternative.

So instead, I will adjust the methods we wrote last time to be able to specify which section to operate on.

The type we created last time, the LongestCommonSubstringResult struct is unchanged, so I'm not going to repeat that, but the FindLongestCommonSubstring and the CountEqual methods both need to be adjusted, and here they are in adjusted form:

public LongestCommonSubstringResult FindLongestCommonSubstring<T>(
    IList<T> firstCollection, int firstStart, int firstEnd,
    IList<T> secondCollection, int secondStart, int secondEnd,
    IEqualityComparer<T> equalityComparer)
{
    // default result, if we can't find anything
    var result = new LongestCommonSubstringResult();

    for (int index1 = firstStart; index1 < firstEnd; index1++)
    {
        for (int index2 = secondStart; index2 < secondEnd; index2++)
        {
            if (equalityComparer.Equals(
                firstCollection[index1],
                secondCollection[index2]))
            {
                int length = CountEqual(
                    firstCollection, index1, firstEnd,
                    secondCollection, index2, secondEnd,
                    equalityComparer);

                // Is longer than what we already have --> record new LCS
                if (length > result.Length)
                {
                    result = new LongestCommonSubstringResult(
                        index1,
                        index2,
                        length);
                }
            }
        }
    }

    return result;
}

public int CountEqual<T>(
    IList<T> firstCollection, int firstPosition, int firstEnd,
    IList<T> secondCollection, int secondPosition, int secondEnd,
    IEqualityComparer<T> equalityComparer)
{
    int length = 0;
    while (firstPosition < firstEnd
        && secondPosition < secondEnd)
    {
        if (!equalityComparer.Equals(
            firstCollection[firstPosition],
            secondCollection[secondPosition]))
        {
            break;
        }

        firstPosition++;
        secondPosition++;
        length++;
    }
    return length;
}

If you compare the above two methods with their part-2 counterparts, you'll notice that I added two extra parameters for each collection firstPosition, firstEnd for the first collection, and secondStart and secondEnd for the second to the FindLongestCommonSubstring, and I added the two *end parameters to the CountEqual method as well.

I also changed the loops inside to start and stop at the respective places, instead of using 0 as the start and *.Count as the end.

This now allows us to specify the sections of the two collections that we want to find the LCS in. Perfect.

Diff

Finally we get to the meat of the whole article series, the actual Diff implementation. This is, I hope, what you have been waiting for.

I'm going to implement the method as an iterator method, using yield return to produce the different sections. However, before (you knew it wouldn't be that easy didn't you?) we get to the actual Diff code, we need to define something that we will return from it.

The list of operations described earlier has to be returned in some kind of data structure, and I'm going to define two types, an enum for the types of operations, and a new struct to hold the type + the number of elements that were copied, inserted, or deleted.

So, the following two types are needed:

public enum DiffSectionType
{
    Copy,
    Insert,
    Delete
}

public struct DiffSection
{
    private readonly DiffSectionType _Type;
    private readonly int _Length;

    public DiffSection(DiffSectionType type, int length)
    {
        _Type = type;
        _Length = length;
    }

    public DiffSectionType Type
    {
        get
        {
            return _Type;
        }
    }

    public int Length
    {
        get
        {
            return _Length;
        }
    }

    public override string ToString()
    {
        return string.Format("{0} {1}", Type, Length);
    }
}

There's not much code here, so I assume you can read it through without me explaining it. Let's move on, finally.

The Diff method, as I said, will be implemented as an iterator method, so let's start with the obvious part of it, finding the LCS and dealing with it:

private IEnumerable<DiffSection> Diff<T>(
    IList<T> firstCollection, int firstStart, int firstEnd,
    IList<T> secondCollection, int secondStart, int secondEnd,
    IEqualityComparer<T> equalityComparer)
{
    var lcs = FindLongestCommonSubstring(
        firstCollection, firstStart, firstEnd,
        secondCollection, secondStart, secondEnd,
        equalityComparer);

    if (lcs.Success)
    {
        // deal with the section before
        var sectionsBefore = Diff(
            firstCollection, firstStart, lcs.PositionInFirstCollection,
            secondCollection, secondStart, lcs.PositionInSecondCollection,
            equalityComparer);
        foreach (var section in sectionsBefore)
            yield return section;

        // output the copy operation
        yield return new DiffSection(
            DiffSectionType.Copy,
            lcs.Length);

        // deal with the section after
        var sectionsAfter = Diff(
            firstCollection, lcs.PositionInFirstCollection + lcs.Length, firstEnd,
            secondCollection, lcs.PositionInSecondCollection + lcs.Length, secondEnd,
            equalityComparer);
        foreach (var section in sectionsAfter)
            yield return section;

        yield break;
    }

    // if we get here, no LCS

As you can see, I added the same start and end parameters to this method as I did with the FindLongestCommonSubstring and CountEqual. I'm not going to bother with the overloads this time, you can add all the overloads you you see fit. Also notice that the method is not complete, I'm going to cover the rest below.

After having found the LCS, I calculate the sections before and after it. The sections before start at the same spot we originally started looking for a LCS, so we can reuse the start values, but the section before the LCS will of course stop where the LCS starts, so we now use the positions reported from FindLongestCommonSubstring as the end of the sections before the LCS. phew

Basically, we started with this:

  • firstStart ... firstEnd
  • secondStart ... secondEnd

And after finding a LCS, we operate on the section before, which is:

  • firstStart ... lcs.PositionInFirstCollection
  • secondStart ... lcs.PositionInSecondCollection

Likewise for the section after the LCS, it will start where the LCS stops, and it will stop where we originally stopped, so we go to this:

  • lcs.PositionInFirstCollection+lcs.Length ... firstEnd
  • lcs.PositionInSecondCollection+lcs.Length ... secondEnd

Broken down the method (so far) looks like this:

  1. Find the LCS in the current section we're analyzing
  2. If we find an LCS, first apply the Diff recursively to the section before the LCS, and output all sections found in that diff
  3. Then output the copy operation
  4. Then apply the Diff recursively to the section after the LCS, and output all sections found in that diff
  5. Then finally stop, there is some more code that follows that is only supposed to execute if we don't find a LCS, so we need to exit here

Hopefully the above code is not black magic. Since the method is an iterator method, and we call it recursively, what we get back from that call is yet another collection. So we enumerate over that collection yielding the elements in it.

Ok, so what if we don't find an LCS? Remember the 3 cases that result in either a delete operation, an insert operation, or a delete+insert combination? Now we need to deal with those cases.

We don't have to write code 3 times though, we can write code that handles the delete and code that handles the insert, and in the case of delete+insert we can let both those two code pieces run.

Thus, the final piece of the method looks like this:

    if (firstStart < firstEnd)
    {
        // we got content from first collection --> deleted
        yield return new DiffSection(
            DiffSectionType.Delete,
            firstEnd - firstStart);
    }
    if (secondStart < secondEnd)
    {
        // we got content from second collection --> inserted
        yield return new DiffSection(
            DiffSectionType.Insert,
            secondEnd - secondStart);
    }
}

The first if-statement is case #2, the second is case #1.

Let's take it for a spin:

string text1 = "This long piece of text will have a common part found by LCS.";
string text2 = "This extra long piece of text will have some common parts found by LCS.";

var result = Diff(
    text1.ToCharArray(), 0, text1.Length,
    text2.ToCharArray(), 0, text2.Length,
    EqualityComparer<char>.Default);

If you write out the result of the above call, which will be a collection (IEnumerable<T>) of DiffSection values, you will get the following:

  • Copy 4
  • Insert 6
  • Copy 30
  • Delete 1
  • Insert 4
  • Copy 12
  • Insert 1
  • Copy 14

If you compare this with the original list of operations from part 1:

  1. copy 4 characters, This
  2. insert 6 characters, ·extra
  3. copy 30 characters, ·long piece of text will have·
  4. insert 4 characters, some
  5. delete 1 character, a
  6. copy 12 characters, ·common part
  7. insert 1 character, s
  8. copy 14 character, ·found by LCS.

Then you will notice that operation #4 and #5 is in the opposite order, but as I said in part 1, that doesn't matter, the end result will be the same.

End of part 3

This concludes the third part of this series. In the fourth part I will look at some optimizations in order to speed up the implementation for larger collections. This article will not come as speedy as part 3 did, which I wrote the same evening as part 2, so don't hold your breath for it :)