

May 10, 2004  |  Word frequency revisited  |  3015 hit(s)

Having finished up a pile of work-work, I can now return to the interesting suggestions raised by my recent word frequency entry. Simon suggested using a custom class that implements IComparable. That was new to me, so I gave it a try. It wasn't immediately obvious what to do, but with a little poking around I found a number of examples, including in the .NET QuickStarts (who knew?).

To jump ahead a moment, you can see the results of the effort first. I now have two word frequency pages: the original, which uses a DataTable, and a new one that uses the custom class and array sorting. Try them out:

Word frequency with data table
Word frequency with custom class implementing IComparable

The version with a data table is no longer interesting from an implementation POV, but I was curious about timings.[1]

The custom class is a simple class with a textbook implementation of CompareTo. The mildly fun twist is that I coded the CompareTo method to sort by two values (first frequency, descending, then word, ascending).
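The idea behind that two-key CompareTo is language-neutral; as a hedged sketch (in Python here, since the comparison logic is the same anywhere), the comparison function below mirrors the ordering the VB class implements:

```python
from functools import cmp_to_key

# Mirrors the CompareTo method: positive means a sorts after b.
def compare(a, b):
    result = b[1] - a[1]                        # frequency, descending
    if result == 0:
        result = (a[0] > b[0]) - (a[0] < b[0])  # word, ascending
    return result

pairs = [("the", 3), ("few", 1), ("proud", 1), ("generics", 1)]
pairs.sort(key=cmp_to_key(compare))
# pairs is now [('the', 3), ('few', 1), ('generics', 1), ('proud', 1)]
```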

Eric had some other suggestions. One was to use a Hashtable, which I couldn't figure out how to do; putting instances of the custom class into a Hashtable worked, but Hashtable does not support a Sort method. He's also got an implementation using generics, which are new in Whidbey and thus not possible yet in 1.1. If you're curious, though, have a look at his second comment.
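For what it's worth, the usual way around the missing Sort is to use the hash table only for counting and then copy its entries into a list that can be sorted. A sketch of that shape (in Python; a Hashtable-to-ArrayList copy in VB.NET would follow the same pattern):

```python
# Count with a hash table, then copy the entries out and sort those.
# A hash table has no order of its own, so "sorting a Hashtable"
# really means sorting a snapshot of its key/value pairs.
counts = {}
for word in "the few the proud the generics".split():
    counts[word] = counts.get(word, 0) + 1

entries = list(counts.items())             # copy out of the hash table
entries.sort(key=lambda e: (-e[1], e[0]))  # frequency desc, word asc
```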

Anyway, here's the code:
Sub Button1_Click(sender As Object, e As EventArgs)
    Dim startTime As DateTime = DateTime.Now
    Dim endTime As DateTime

    Dim i As Integer
    Dim s As String

    Dim punctuation() As Char = {".", ",", "!", "=", "-", _
        "_", ";", ":", "(", ")", "[", "]", """", "?", "/", "\", _
        "@", "#", "$", "%", "&", "*", "<", ">", "|", _
        "~", "‘", "`"}
    Dim t As String = TextBox1.Text
    t = t.ToLower()
    t = t.Trim()
    For i = 0 To punctuation.Length - 1
        t = t.Replace(punctuation(i), " ")
    Next i
    t = t.Replace(vbCrLf, " ")
    t = t.Replace(vbTab, " ")

    ' Dumb old so-called smart quotes, grrr
    t = t.Replace(Chr(145), " ")
    t = t.Replace(Chr(146), "'") ' smart apostrophe
    t = t.Replace(Chr(147), " ")
    t = t.Replace(Chr(148), " ")
    t = t.Replace(Chr(151), " ")

    ' Collapse runs of spaces down to single spaces
    While t.IndexOf("  ") > -1
        t = t.Replace("  ", " ")
    End While

    ' Create array of all words
    Dim wordArray() As String
    wordArray = t.Split(" "c)
    Array.Sort(wordArray)

    Dim WordsByCount As New ArrayList()
    ' Walk through word array, accumulating count of (sorted)
    ' words. When we run out of words, write word and accumulator
    ' to new array of custom WordFrequency objects.
    Dim arrayLength As Integer = wordArray.Length - 1
    Dim accumulator As Integer = 0
    Dim nextWord As String = ""
    Dim currentWord As String = ""

    For i = 0 To arrayLength
        nextWord = wordArray(i)
        If nextWord = currentWord Then
            accumulator += 1
        Else
            If i > 0 Then
                WordsByCount.Add(New WordFrequency(currentWord, accumulator))
            End If
            currentWord = nextWord
            accumulator = 1
        End If
    Next
    WordsByCount.Add(New WordFrequency(currentWord, accumulator))

    ' Sort method invokes custom comparison method of objects in array
    WordsByCount.Sort()

    ' Display results
    s = "<table cellpadding=4>"
    For Each wf As WordFrequency In WordsByCount
        s &= "<tr>"
        s &= "<td>" & wf.Frequency & "</td>"
        s &= "<td>" & wf.Word & "</td>"
        s &= "</tr>"
    Next
    s &= "</table>"
    Literal1.Text = s
    labelWordCount.Text = wordArray.Length.ToString()
    endTime = DateTime.Now
    Dim timeDiff As TimeSpan = endTime.Subtract(startTime)
    Dim totalSeconds As Double = (timeDiff.TotalMilliseconds / 1000)
    labelTime.Text = totalSeconds.ToString("g")
End Sub


Class WordFrequency : Implements IComparable
    Dim WordValue As String
    Dim FrequencyValue As Integer

    Public Sub New()
    End Sub

    Public Sub New(word As String, freq As Integer)
        Me.Word = word
        Me.Frequency = freq
    End Sub

    Public Property Word() As String
        Get
            Return WordValue
        End Get
        Set(value As String)
            WordValue = value
        End Set
    End Property

    Public Property Frequency() As Integer
        Get
            Return FrequencyValue
        End Get
        Set(value As Integer)
            FrequencyValue = value
        End Set
    End Property

    Public Function CompareTo(ByVal ObjectToCompare As Object) As Integer _
            Implements IComparable.CompareTo
        Dim WordFrequencyObject As WordFrequency = _
            CType(ObjectToCompare, WordFrequency)
        ' Sort by frequency, descending
        CompareTo = WordFrequencyObject.Frequency - Me.Frequency
        If CompareTo = 0 Then
            ' Word frequencies are the same, so now compare words, ascending
            If WordFrequencyObject.Word < Me.Word Then
                CompareTo = 1
            ElseIf WordFrequencyObject.Word > Me.Word Then
                CompareTo = -1
            Else
                CompareTo = 0
            End If
        End If
    End Function
End Class

[1] The data table implementation seems to be marginally faster than the custom class, at least as implemented by me. I used a test of 10,996 words (the first two chapters of Dickens's David Copperfield), and in three trials got these timings: DataTable - 7.23/6.0156/6.1093; Custom class - 8.1098/6.578/6.48.





Eric Lippert   10 May 04 - 10:40 AM

I think you're misunderstanding my solution. The last loop in my second solution IS a hash table sort, and what's more, it is an order-n sort. Any sort that uses IComparable is order n-log-n at best.

Let me go through that solution and annotate it:

char[] punctuation = new char[] {' ', '`', '\n', '\r', '\t', '?', '.', ',', '!', '=', '-', '_', ';', ':', '(', ')', '[', ']', '"', '\''};
string[] tokens = source.ToLower().Split(punctuation, true);

First off, your solution for eliminating punctuation is unnecessarily inefficient. Repeatedly searching and replacing is extremely expensive. You can just pass the text to Split and give a list of the characters you want to split upon, and if "true" is the last argument then it will automatically eliminate the "empty" entries that would be found between double spaces.

Suppose I had the string "The few, the proud, the generics."

You can think of a string array as a device that maps each of a finite number of consecutive integers starting with zero onto a string:

0 -> the
1 -> few
2 -> the
3 -> proud
4 -> the
5 -> generics


Now I have two parts to my strategy. First pass, create a mapping from word to word count. Second, put that in word-count order.

For my first pass, I'll need a table which maps strings onto integers.

Dictionary<string, int> firstPass = new Dictionary<string, int>();

For my second pass, I'll need a table which maps integers onto lists of strings.

Dictionary<int, List<string>> secondPass = new Dictionary<int, List<string>>();

It will be helpful after the second pass to know what is the most popular word frequency. We'll track that in the max variable.

int max = 1;



 
Eric Lippert   10 May 04 - 10:40 AM

OK, pass one. I'm going to go through my list of words above and check to see if they're in my table. If they are not, I'll add a map from the word to the number one -- so far, I've encountered one such word. If it is in the table then the word is mapped onto its count, so I will increase the count by one.

foreach (string token in tokens)
{
    if (!firstPass.ContainsKey(token))
        firstPass[token] = 1;
    else
        ++firstPass[token];
}

firstPass now contains this map:

the -> 3
few -> 1
proud -> 1
generics -> 1

Now how the heck am I going to sort this thing by word frequency? I'm going to do a second pass. This time I'm going to go through this table and make a map from frequency to a list of words that have that frequency.

This is basically the same strategy as before. If I find a frequency which is not represented in my table, I'll add it and map it to a list containing its word. If it already exists, I'll add the new word to the end of the list:

foreach (string key in firstPass.Keys)
{
    int count = firstPass[key];
    if (count > max)
        max = count;
    if (!secondPass.ContainsKey(count))
        secondPass[count] = new List<string>();
    secondPass[count].Add(key);
}

The secondPass map now contains this:

3 -> [the]
1 -> [few, proud, generics]

and max -> 3.

How does that help? Well, now we can go from 1 to max and print out that table:

for (int i = 1; i <= max; ++i)
{
    if (secondPass.ContainsKey(i))
    {
        Console.WriteLine(i);
        foreach (string s in secondPass[i])
        {
            Console.WriteLine(s);
        }
    }
}

and we print out

1
few
proud
generics
3
the

And we're done.
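Pulling the pieces above together, the whole two-pass scheme is short enough to see at a glance. Here's a sketch in Python, with plain dicts and lists standing in for Dictionary&lt;string, int&gt; and Dictionary&lt;int, List&lt;string&gt;&gt;:

```python
def frequency_table(tokens):
    # Pass one: map word -> count, via a hash table.
    first_pass = {}
    for token in tokens:
        first_pass[token] = first_pass.get(token, 0) + 1
    # Pass two: map count -> list of words with that count,
    # tracking the largest count seen.
    second_pass = {}
    max_count = 1
    for word, count in first_pass.items():
        if count > max_count:
            max_count = count
        second_pass.setdefault(count, []).append(word)
    # Walk the counts from 1 to max; no comparison sort anywhere.
    result = []
    for i in range(1, max_count + 1):
        if i in second_pass:
            result.append((i, second_pass[i]))
    return result
```

On "The few, the proud, the generics.", `frequency_table("the few the proud the generics".split())` produces [(1, ['few', 'proud', 'generics']), (3, ['the'])], matching the output above.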


 
Eric Lippert   10 May 04 - 10:43 AM

Let's take a look at the order of this.

I don't know what algorithm the tokenizer uses, but there are single-character-delimiter tokenizer algorithms that are O(c + d) where c is the number of characters in the string and d is the number of delimiters. Let's assume that we have an optimal tokenizer and that it produces an array of n words.

Since adding and searching a hashtable is of constant order, clearly the first pass is O(n). And the second pass prints out all the unique words, of which there cannot possibly be more than n, so this must also be O(n). Thus, the algorithm is O(c + d + n), which is way better than the n log n you're going to get with a comparison sort.

How is that possible? Isn't it the case that the best possible sort is n log n? No, the best possible sort ON ARBITRARY DATA is n-log-n. Here we know something about the data, namely that we are sorting a finite set of integers where every integer is between 1 and max. That's highly constrained data, and therefore we can do better than the theoretical minimum on arbitrary data.