Previously, on DevDirective...
If you haven't already read part 1, I strongly suggest you do so now and come back here afterwards. It will make this part, and subsequent ones, much easier to understand when you have all the context.
Article Series
This article series come in N parts:
- Part 1: Overview of the article series, outline of algorithms used.
- Part 2: The Longest Common Substring implementation.
- Part 3: The Diff implementation.
Repository
I've made the article markdown texts, as well as the code for these articles, available on my Kiln account.
You can download a LINQPad program that contains all the code in this article, in executable and testable form, so if you want to avoid having to type it all out, go download LINQPad (it's free, and excellent!) and the program file, and go experiment!
The source code can be downloaded here. Note that while the Kiln website is a Mercurial front-end, you do not need Mercurial in order to download the code. Simply click on the "Files" tab, view to the file you want, and click on the Download link out in the right side of the toolbar directly above the file content.
Overview
In this part I'm going to implement a very basic version of the Longest Common Substring algorithm. In a later part of this article series I will revisit this implementation in order to improve it, but to begin with, let's make it simple so that it is easy to explain, write, test, and not least, understand.
In the first article, I used the following two pieces of text:
This long piece of text will have a common part found by LCS. This extra long piece of text will have some common parts found by LCS.
I'm going to reuse those for this part and subsequent ones, so that we can see that the final outcome indeed matches the one I predicted in the first article in the series.
Also remember that I mentioned that the code I will write to implement the Diff algorithm should handle any type of objects. Therefore, though I will be using texts in these articles, and thus the string type in .NET for the examples, the code will be written using generics.
Goal
I will implement a LongestCommonSubstring method that can:
- Take two collections of type
T, whereTis any type. - Take a
IEqualityComparer<T>object, that will be used to determine if twoT's are the same. - Return the result, indicating whether the method was able to find a longest-common-substring or not, and if successful, where it can be found in the two collections and how long it is.
Return values
For the 3rd goal there I will create a new type, so let's start with that.
I will create a type that:
- Contains a
Successproperty, of typeboolthat indicates whether we found a LCS or not. - Two position properties, for the position of the LCS in the two respective collections.
- A property to hold the length of the LCS.
In the case of "not a success", the three properties for the positions and the length are undefined but we'll ensure these have the value 0 in the current implementation.
Here's how I implemented type:
public struct LongestCommonSubstringResult
{
private readonly bool _Success;
private readonly int _PositionInFirstCollection;
private readonly int _PositionInSecondCollection;
private readonly int _Length;
public LongestCommonSubstringResult(
int positionInFirstCollection,
int positionInSecondCollection,
int length)
{
_Success = true;
_PositionInFirstCollection = positionInFirstCollection;
_PositionInSecondCollection = positionInSecondCollection;
_Length = length;
}
public bool Success
{
get
{
return _Success;
}
}
public int PositionInFirstCollection
{
get
{
return _PositionInFirstCollection;
}
}
public int PositionInSecondCollection
{
get
{
return _PositionInSecondCollection;
}
}
public int Length
{
get
{
return _Length;
}
}
public override string ToString()
{
if (Success)
return string.Format(
"LCS ({0}, {1} x{2})",
PositionInFirstCollection,
PositionInSecondCollection,
Length);
else
return "LCS (-)";
}
}
}
I made this a struct since we will potentially have quite a few of these, and they're rather small (3 ints and a bool), so I didn't want to build up any pressure on the garbage collector. The type is also immutable.
Since it is a struct there will be an implicit public parameterless constructor that initializes the whole struct to zeroes, which means that the length and positions will all be zero, and the Success will be false. Perfect.
Thus, if the method that implements the longest common substring fails to find a common substring, here's what it would return:
return new LongestCommonSubstringResult();
and if it does find a common substring, here's what it would return instead:
return new LongestCommonSubstringResult(position1, position2, length);
Longest common substring method
Now on to the method that our Diff implementation needs, the one that implements the longest common substring algorithm.
As I said earlier, this will be a generic method, with a naive and simple implementation. It will be simple to understand, but it will perform rather horribly. Still, for this article series, at least until a bit later in the series, this is more than enough.
This is how I see the method as being defined:
public LongestCommonSubstringResult FindLongestCommonSubstring<T>(
IList<T> firstCollection,
IList<T> secondCollection,
IEqualityComparer<T> equalityComparer)
Since it is imperative that we can navigate a bit back and forth in this implementation, I made it clear right in the parameter list that I need something that is indexable, otherwise I would've declared the two collections as IEnumerable<T> instead. This has the unfortunate consequence, however, that we cannot call the method passing in two strings, as the string type does not implement IList<T>.
I will create a couple of overloads for the method, however, to make it simpler to call for the common cases:
public LongestCommonSubstringResult FindLongestCommonSubstring(
string firstText,
string secondText)
{
return FindLongestCommonSubstring(
firstText.ToCharArray(),
secondText.ToCharArray());
}
public LongestCommonSubstringResult FindLongestCommonSubstring<T>(
IList<T> firstCollection,
IList<T> secondCollection)
{
return FindLongestCommonSubstring(
firstCollection,
secondCollection,
EqualityComparer<T>.Default);
}
With these two overloads we can easily compare strings, and for the generic version we don't have to pass in the IEqualityComparer<T> object, as long as we can make do with the default implementation for the type T that is being used. Remember that these are overloads, which means we will have these two plus the first one that takes two collections and a comparer object. Three (3) variations in total.
With the above code in mind, let's write our super-naive and slow implementation of the longest common substring.
Basically, I will loop through the first collection, and for each position in the first collection, I will loop through the entire second collection. Every time I find an element in the two collections that compare equal, I will count how many consecutive elements compare equal, and keep the longest one found.
If we manage to get to the end of the loop without having found any equal elements, there is no common substring between the two collections.
The performance characteristics of this implementation is rather bad. At the very least we will do N*M comparisons, where N and M are the number of elements in the two respective collections. At the most, in a degenerative case (two collections consisting only of elements that are equal, ie. the same element repeated over and over again), we can end up doing N*(N+1)/2*M comparisons. Later in this article series I will speed up this somewhat, but this is an excellent place to take your optimization skills out for a spin.
Here's the code. I broke the code that counts the equal elements out into its own method to make the code easier to read:
public LongestCommonSubstringResult FindLongestCommonSubstring<T>(
IList<T> firstCollection,
IList<T> secondCollection,
IEqualityComparer<T> equalityComparer)
{
// default result, if we can't find anything
var result = new LongestCommonSubstringResult();
for (int index1 = 0; index1 < firstCollection.Count; index1++)
{
for (int index2 = 0; index2 < secondCollection.Count; index2++)
{
if (equalityComparer.Equals(
firstCollection[index1],
secondCollection[index2]))
{
int length = CountEqual(
firstCollection, index1,
secondCollection, index2,
equalityComparer);
// Is longer than what we already have --> record new LCS
if (length > result.Length)
{
result = new LongestCommonSubstringResult(
index1,
index2,
length);
}
}
}
}
return result;
}
public int CountEqual<T>(
IList<T> firstCollection, int firstPosition,
IList<T> secondCollection, int secondPosition,
IEqualityComparer<T> equalityComparer)
{
int length = 0;
while (firstPosition < firstCollection.Count
&& secondPosition < secondCollection.Count)
{
if (!equalityComparer.Equals(
firstCollection[firstPosition],
secondCollection[secondPosition]))
{
break;
}
firstPosition++;
secondPosition++;
length++;
}
return length;
}
Now, let's test this code.
string text1 = "This long piece of text will have a common part found by LCS.";
string text2 = "This extra long piece of text will have some common parts found by LCS.";
var result = FindLongestCommonSubstring(text1, text2);
The contents of result should now correspond to this:
Success = true
PositionInFirstCollection = 4
PositionInSecondCollection = 10
Length = 30
Does this match the two texts? Let's see:
v position 4 in the first text
This long piece of text will have a common part found by LCS.
This extra long piece of text will have some common parts found by LCS.
^- longest common substring -^
^ position 10 in the second text
^------ 30 characters -------^
Since collections are 0-based, the 5th and 11th characters are at position 4 and 10 respectively.
This matched quite perfect what we expected.
End of part 2
This concludes the second part of this series. In the third part I will implement the actual Diff algorithm.
If you went ahead and installed LINQPad and downloaded the program file that belongs to this part, feel free to take a crack at the Diff implementation yourself. You can read the overview of the implementation in the first part of the article series.