

May 10, 2004  |  Word frequency revisited  |  3015 hit(s)

Having finished up a pile of work-work, I can now return to the interesting suggestions raised by my recent word frequency entry. Simon suggested using a custom class that implements IComparable. That was new to me, so I gave it a try. It wasn't immediately obvious what to do, but with a little poking around I found a number of examples, including in the .NET QuickStarts (who knew?).

To jump ahead a moment, you can see the results of the effort first. I now have two word frequency pages: the original, which uses a DataTable, and a new one that uses the custom class and array sorting. Try them out:

Word frequency with data table
Word frequency with custom class implementing IComparable

The version with a data table is no longer interesting from an implementation POV, but I was curious about timings.[1]

The custom class is a simple class with a textbook implementation of CompareTo. The mildly fun twist is that I coded the CompareTo method to sort by two values (first frequency, descending, then word, ascending).
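The idea behind that two-key CompareTo is language-neutral; as a hedged sketch (in Python here, since the comparison logic is the same anywhere), the comparison function below mirrors the ordering the VB class implements:

```python
from functools import cmp_to_key

# Mirrors the CompareTo method: positive means a sorts after b.
def compare(a, b):
    result = b[1] - a[1]                        # frequency, descending
    if result == 0:
        result = (a[0] > b[0]) - (a[0] < b[0])  # word, ascending
    return result

pairs = [("the", 3), ("few", 1), ("proud", 1), ("generics", 1)]
pairs.sort(key=cmp_to_key(compare))
# pairs is now [('the', 3), ('few', 1), ('generics', 1), ('proud', 1)]
```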

Eric had some other suggestions. One was to use a Hashtable, which I couldn't figure out how to do; putting instances of the custom class into a Hashtable worked, but Hashtable does not support a Sort method. He's also got an implementation using generics, which are new in Whidbey and thus not possible yet in 1.1. If you're curious, though, have a look at his second comment.
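For what it's worth, the usual way around the missing Sort is to use the hash table only for counting and then copy its entries into a list that can be sorted. A sketch of that shape (in Python; a Hashtable-to-ArrayList copy in VB.NET would follow the same pattern):

```python
# Count with a hash table, then copy the entries out and sort those.
# A hash table has no order of its own, so "sorting a Hashtable"
# really means sorting a snapshot of its key/value pairs.
counts = {}
for word in "the few the proud the generics".split():
    counts[word] = counts.get(word, 0) + 1

entries = list(counts.items())             # copy out of the hash table
entries.sort(key=lambda e: (-e[1], e[0]))  # frequency desc, word asc
```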

Anyway, here's the code:
Sub Button1_Click(sender As Object, e As EventArgs)
    Dim startTime As DateTime = DateTime.Now
    Dim endTime As DateTime

    Dim i As Integer
    Dim s As String

    Dim punctuation() As Char = {".", ",", "!", "=", "-", _
        "_", ";", ":", "(", ")", "[", "]", """", "?", "/", "\", _
        "@", "#", "$", "%", "&", "*", "<", ">", "|", _
        "~", "‘", "`"}
    Dim t As String = TextBox1.Text
    t = t.ToLower()
    t = t.Trim()
    For i = 0 To punctuation.Length - 1
        t = t.Replace(punctuation(i), " ")
    Next i
    t = t.Replace(vbCrLf, " ")
    t = t.Replace(vbTab, " ")

    ' Dumb old so-called smart quotes, grrr
    t = t.Replace(Chr(145), " ")
    t = t.Replace(Chr(146), "'") ' smart apostrophe
    t = t.Replace(Chr(147), " ")
    t = t.Replace(Chr(148), " ")
    t = t.Replace(Chr(151), " ")

    ' Collapse runs of spaces down to single spaces
    While t.IndexOf("  ") > -1
        t = t.Replace("  ", " ")
    End While

    ' Create array of all words
    Dim wordArray() As String
    wordArray = t.Split(" "c)
    Array.Sort(wordArray)

    Dim WordsByCount As New ArrayList()
    ' Walk through word array, accumulating count of (sorted)
    ' words. When we run out of words, write word and accumulator
    ' to new array of custom WordFrequency objects.
    Dim arrayLength As Integer = wordArray.Length - 1
    Dim accumulator As Integer = 0
    Dim nextWord As String = ""
    Dim currentWord As String = ""

    For i = 0 To arrayLength
        nextWord = wordArray(i)
        If nextWord = currentWord Then
            accumulator += 1
        Else
            If i > 0 Then
                WordsByCount.Add(New WordFrequency(currentWord, accumulator))
            End If
            currentWord = nextWord
            accumulator = 1
        End If
    Next
    WordsByCount.Add(New WordFrequency(currentWord, accumulator))

    ' Sort method invokes custom comparison method of objects in array
    WordsByCount.Sort()

    ' Display results
    s = "<table cellpadding=4>"
    For Each wf As WordFrequency In WordsByCount
        s &= "<tr>"
        s &= "<td>" & wf.Frequency & "</td>"
        s &= "<td>" & wf.Word & "</td>"
        s &= "</tr>"
    Next
    s &= "</table>"
    Literal1.Text = s
    labelWordCount.Text = wordArray.Length.ToString()
    endTime = DateTime.Now
    Dim timeDiff As TimeSpan = endTime.Subtract(startTime)
    Dim totalSeconds As Double = (timeDiff.TotalMilliseconds / 1000)
    labelTime.Text = totalSeconds.ToString("g")
End Sub


Class WordFrequency : Implements IComparable
    Dim WordValue As String
    Dim FrequencyValue As Integer

    Public Sub New()
    End Sub

    Public Sub New(word As String, freq As Integer)
        Me.Word = word
        Me.Frequency = freq
    End Sub

    Public Property Word() As String
        Get
            Return WordValue
        End Get
        Set(value As String)
            WordValue = value
        End Set
    End Property

    Public Property Frequency() As Integer
        Get
            Return FrequencyValue
        End Get
        Set(value As Integer)
            FrequencyValue = value
        End Set
    End Property

    Public Function CompareTo(ByVal ObjectToCompare As Object) As Integer _
            Implements IComparable.CompareTo
        Dim WordFrequencyObject As WordFrequency = _
            CType(ObjectToCompare, WordFrequency)
        ' Sort by frequency, descending
        CompareTo = WordFrequencyObject.Frequency - Me.Frequency
        If CompareTo = 0 Then
            ' Word frequencies are the same, so now compare words, ascending
            If WordFrequencyObject.Word < Me.Word Then
                CompareTo = 1
            ElseIf WordFrequencyObject.Word > Me.Word Then
                CompareTo = -1
            Else
                CompareTo = 0
            End If
        End If
    End Function
End Class

[1] The data table implementation seems to be marginally faster than the custom class, at least as implemented by me. I used a test of 10,996 words (the first two chapters of Dickens's David Copperfield), and in three trials got these timings: DataTable - 7.23/6.0156/6.1093; Custom class - 8.1098/6.578/6.48.





Eric Lippert   10 May 04 - 10:40 AM

I think you're misunderstanding my solution. The last loop in my second solution IS a hash table sort, and what's more, it is an order-n sort. Any sort that uses IComparable is order n-log-n at best.

Let me go through that solution and annotate it:

char[] punctuation = new char[] {' ', '`', '\n', '\r', '\t', '?', '.', ',', '!', '=', '-', '_', ';', ':', '(', ')', '[', ']', '"', '\''};
string[] tokens = source.ToLower().Split(punctuation, true);

First off, your solution for eliminating punctuation is unnecessarily inefficient. Repeatedly searching and replacing is extremely expensive. You can just pass the text to Split and give a list of the characters you want to split upon, and if "true" is the last argument then it will automatically eliminate the "empty" entries that would be found between double spaces.

Suppose I had the string "The few, the proud, the generics."

You can think of a string array as a device that maps each of a finite number of consecutive integers starting with zero onto a string:

0 -> the
1 -> few
2 -> the
3 -> proud
4 -> the
5 -> generics


Now I have two parts to my strategy. First pass, create a mapping from word to word count. Second, put that in word-count order.

For my first pass, I'll need a table which maps strings onto integers.

Dictionary<string, int> firstPass = new Dictionary<string, int>();

For my second pass, I'll need a table which maps integers onto lists of strings.

Dictionary<int, List<string>> secondPass = new Dictionary<int, List<string>>();

It will be helpful after the second pass to know what is the most popular word frequency. We'll track that in the max variable.

int max = 1;



 
Eric Lippert   10 May 04 - 10:40 AM

OK, pass one. I'm going to go through my list of words above and check to see if they're in my table. If they are not, I'll add a map from the word to the number one -- so far, I've encountered one such word. If it is in the table then the word is mapped onto its count, so I will increase the count by one.

foreach (string token in tokens)
{
    if (!firstPass.ContainsKey(token))
        firstPass[token] = 1;
    else
        ++firstPass[token];
}

firstPass now contains this map:

the -> 3
few -> 1
proud -> 1
generics -> 1

Now how the heck am I going to sort this thing by word frequency? I'm going to do a second pass. This time I'm going to go through this table and make a map from frequency to a list of words that have that frequency.

This is basically the same strategy as before. If I find a frequency which is not represented in my table, I'll add it and map it to a list containing its word. If it already exists, I'll add the new word to the end of the list:

foreach (string key in firstPass.Keys)
{
    int count = firstPass[key];
    if (count > max)
        max = count;
    if (!secondPass.ContainsKey(count))
        secondPass[count] = new List<string>();
    secondPass[count].Add(key);
}

The secondPass map now contains this:

3 -> [the]
1 -> [few, proud, generics]

and max -> 3.

How does that help? Well, now we can go from 1 to max and print out that table:

for (int i = 1; i <= max; ++i)
{
    if (secondPass.ContainsKey(i))
    {
        Console.WriteLine(i);
        foreach (string s in secondPass[i])
        {
            Console.WriteLine(s);
        }
    }
}

and we print out

1
few
proud
generics
3
the

And we're done.
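Pulling the pieces above together, the whole two-pass scheme is short enough to see at a glance. Here's a sketch in Python, with plain dicts and lists standing in for Dictionary&lt;string, int&gt; and Dictionary&lt;int, List&lt;string&gt;&gt;:

```python
def frequency_table(tokens):
    # Pass one: map word -> count, via a hash table.
    first_pass = {}
    for token in tokens:
        first_pass[token] = first_pass.get(token, 0) + 1
    # Pass two: map count -> list of words with that count,
    # tracking the largest count seen.
    second_pass = {}
    max_count = 1
    for word, count in first_pass.items():
        if count > max_count:
            max_count = count
        second_pass.setdefault(count, []).append(word)
    # Walk the counts from 1 to max; no comparison sort anywhere.
    result = []
    for i in range(1, max_count + 1):
        if i in second_pass:
            result.append((i, second_pass[i]))
    return result
```

On "The few, the proud, the generics.", `frequency_table("the few the proud the generics".split())` produces [(1, ['few', 'proud', 'generics']), (3, ['the'])], matching the output above.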


 
Eric Lippert   10 May 04 - 10:43 AM

Let's take a look at the order of this.

I don't know what algorithm the tokenizer uses, but there are single-character-delimiter tokenizer algorithms that are O(c + d) where c is the number of characters in the string and d is the number of delimiters. Let's assume that we have an optimal tokenizer and that it produces an array of n words.

Since adding and searching a hashtable is of constant order, clearly the first pass is O(n). And the second pass prints out all the unique words, of which there cannot possibly be more than n, so this must also be O(n). Thus, the algorithm is O(c + d + n), which is way better than the n log n you're going to get with a comparison sort.

How is that possible? Isn't it the case that the best possible sort is n log n? No, the best possible sort ON ARBITRARY DATA is n-log-n. Here we know something about the data, namely that we are sorting a finite set of integers where every integer is between 1 and max. That's highly constrained data, and therefore we can do better than the theoretical minimum on arbitrary data.