Word frequency - mike's web log/comments

New comments for this entry are disabled.

April 25, 2004 | Word frequency | 3384 hit(s)

I got interested today in an issue that involves counting word frequency in an arbitrary list of words, so I played around with it a little. I believe that the general algorithm is not very complex:

Parse text to find individual words.
Add words to list.
Sort list.
Walk list, counting words and accumulating totals.
Sort results by accumulated total.
Serve and enjoy.

I realized, however, that I wasn't clear on what .NET structures to use to implement this algorithm. Steps 1 I can do[2]; for Steps 2 and 3, I add the words to an array and then use Array.Sort(array).

Step 5 is the tricky one, it seems. You need a structure that will accommodate data like this:

the 5
jumped 4
fox 3
brown 2
quick 1
etc.

in other words, a two-field structure that allows sorting by one of the fields. The Array.Sort method supports only 1-dimensional arrays. SortedList looked promising, but it sorts only by key (word), not value (count), and you can't use count as key, since it's not unique.

The only structure that came to mind was DataTable, which supports a DataView that allows sorting. So that's what I've used. I'd love to hear from folks about better ways to accomplish this task.

You can give my primitive experiment a whirl here. Here's the code I'm using (except the sample formats the output slightly):

Dim i As Integer
Dim s As String

Dim punctuation() As Char = {".", ",", "!", "=", "-", "_", ";", ":", _
     "(", ")", "[", "]", """"}
Dim t As String = TextBox1.Text
t = t.ToLower()
t = t.Trim()
For i = 0 to punctuation.Length - 1
   t = t.Replace(punctuation(i), " ")
Next i
t = t.Replace(vbcrlf, " ")
t = t.Replace(vbtab, " ")

While t.indexOf("  ") > -1
   t = t.Replace("  ", " ") ' double spaces
End While

' Create array of all words
Dim wordArray() As String
wordArray = t.split
Array.Sort(wordArray)

' Create data table with two columns, word and count
Dim dt As New System.Data.DataTable("temp")
dt.Columns.Add("word", Type.GetType("System.String"))
dt.Columns.Add("count", Type.GetType("System.Int32"))

Dim dr As System.Data.DataRow

' Walk through word array, accumulating count of (sorted)
' words. As we get to each new word, write out the previous word 
' and its accumulator to a data table
Dim arrayLength As Integer = wordArray.Length - 1
Dim accumulator As Integer = 0
Dim nextWord As String = ""
Dim currentWord As String = ""

For i = 0 To arrayLength
   nextWord = wordArray(i)
   If nextWord = currentWord Then
	  accumulator += 1
   Else
	  If i > 0 Then
		dr = dt.NewRow
		dr("word") = currentWord
		dr("count") = accumulator
		dt.Rows.Add(dr)
	  End If
	  currentWord = nextWord
	  accumulator = 1
   End If
Next
' This should be in a sub, since it's repeated ... 
dr = dt.NewRow
dr("word") = currentWord
dr("count") = accumulator
dt.Rows.Add(dr)

' Sort entries in data table by count (desc), then word
dt.DefaultView.Sort = "count DESC,word"

' Display results
Dim drv As System.Data.DataRowView
For Each drv In dt.DefaultView
   s &= "<br>" & drv("count") & " = " & drv("word")
Next
Label1.text = s

[2] Corey will be disappointed, but I'm not using regular expressions for this ...

The Stygian Depths Of Hacked-Together Scripts (Fabulous Adventures In Coding)

Last time on FABULOUS ADVENTURES: Eric Lippert: The second term is asymptotically insignificant compared

Corey 25 Apr 04 - 8:40 AM

Regex's definitely would lend themselves to this sort of work!

Actually, I'm starting work on my life's masterpiece today, a book that I've been contemplating for the past few years, and one of the first steps is to parse arbitrary text into words.
I was fiddling the idea between regex's or just token parsing an incoming stream, sort of a "brute force" method like what you've implemented.
I'm going to be parsing enormous bits of text, and I'll probably want to be able to experiment with the token parsing, and what makes up a word, exactly, since I'm going to include punctuation into my analysis. I'll probably implement your method first, then see about the relative performance of a regex.

Simon 26 Apr 04 - 8:22 AM

I would have created a simple class that has a string for the word and an int for the count. I'd have then implemented IComparable for that class and used the built-in list sorting stuff.

Eric Lippert 26 Apr 04 - 1:36 PM

The data structure you want is a hash table.

char[] punctuation = new char[] {' ', '`', '\n', '\r', '\t', '?', '.', ',', '!', '=', '-', '_', ';', ':', '(', ')', '[', ']', '"', '\''};
string[] tokens = source.ToLower().Split(punctuation, true);
Hashtable table = new Hashtable();
foreach(string token in tokens)
{
if (!table.Contains(token))
table[token] = 1;
else
table[token] = 1 + (int)table[token];
}
foreach(DictionaryEntry entry in table)
{
Console.WriteLine(entry.Key + " " + entry.Value);
}

Putting the hash table in order is left as an exercise for the reader -- Simon's got a good idea above.

Eric Lippert 26 Apr 04 - 9:20 PM

Waiting for my carpool to show up. Bored. Here's another implementation -- this one uses generics, and has the nice property that it is order-n. (Any solution that requires sorting will be n log n at best -- seeing how to beat that time is instructive.)

Dictionary<string, int> firstPass = new Dictionary<string, int>();
Dictionary<int, List<string>> secondPass = new Dictionary<int, List<string>>();
int max = 1;
foreach(string token in tokens)
{
    if (!firstPass.ContainsKey(token))
        firstPass[token] = 1;
    else
        ++firstPass[token];
}
foreach(string key in firstPass.Keys)
{
    int count = firstPass[key];
    if (count > max)
        max = count;
    if (!secondPass.ContainsKey(count))
        secondPass[count] = new List<string>();
    secondPass[count].Add(key);
}
for (int i = 1 ; i <= max ; ++i)
{
    if (secondPass.ContainsKey(i))
    {
        Console.WriteLine(i);
        foreach(string s in secondPass[i])
        {
            Console.WriteLine(s);
        }
    }
}