I got interested today in an issue that involves counting word frequency in an arbitrary list of words, so I played around with it a little. I believe that the general algorithm is not very complex:
1. Parse text to find individual words.
2. Add words to list.
3. Sort list.
4. Walk list, counting words and accumulating totals.
5. Sort results by accumulated total.
6. Serve and enjoy.
I realized, however, that I wasn't clear on which .NET structures to use to implement this algorithm. Step 1 I can do[2]; for Steps 2 and 3, I add the words to an array and then call Array.Sort(array).
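For what it's worth, Steps 1 to 3 can also be folded together by splitting directly on the punctuation and whitespace characters. This is only a sketch: it assumes .NET 2.0 or later (for the Split overload that takes StringSplitOptions), the module and function names are made up for illustration, and the separator set simply mirrors the punctuation list in the code further down.

```vb
Imports System
Imports Microsoft.VisualBasic

Public Module WordSplitter
    ' Separator set: the punctuation used in the listing plus common
    ' whitespace. This list is an assumption; extend it as needed.
    Private ReadOnly Separators() As Char = { _
        "."c, ","c, "!"c, "="c, "-"c, "_"c, ";"c, ":"c, _
        "("c, ")"c, "["c, "]"c, """"c, _
        " "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf}

    ' Hypothetical helper: lowercases the text, splits it into words
    ' (dropping empty entries), and returns the words sorted.
    Public Function SplitIntoSortedWords(ByVal text As String) As String()
        Dim words() As String = text.ToLower().Split( _
            Separators, StringSplitOptions.RemoveEmptyEntries)
        Array.Sort(words)
        Return words
    End Function
End Module
```

Because empty entries are dropped during the split, there is no need to collapse double spaces or trim the text first.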
Step 5 is the tricky one, it seems. You need a structure that will accommodate data like this:
the 5
jumped 4
fox 3
brown 2
quick 1
etc.
In other words, a two-field structure that allows sorting by either field. The Array.Sort method supports only one-dimensional arrays. SortedList looked promising, but it sorts only by key (the word), not by value (the count), and you can't use the count as the key, since counts aren't unique.
The only structure that came to mind was DataTable, which supports a DataView that allows sorting. So that's what I've used. I'd love to hear from folks about better ways to accomplish this task.
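One alternative that fits the "two-field structure" description is a small class with a custom comparer, sorted with Array.Sort. Treat this as a sketch rather than a recommendation: it assumes VB 2005 or later (for generics), and WordCount and WordCountComparer are hypothetical names, not framework types.

```vb
Imports System
Imports System.Collections.Generic

' Hypothetical two-field record: a word and how often it occurred.
Public Class WordCount
    Public Word As String
    Public Count As Integer

    Public Sub New(ByVal wordText As String, ByVal occurrences As Integer)
        Word = wordText
        Count = occurrences
    End Sub
End Class

' Orders by Count (highest first), then by Word (ordinal, ascending).
Public Class WordCountComparer
    Implements IComparer(Of WordCount)

    Public Function Compare(ByVal x As WordCount, ByVal y As WordCount) As Integer _
        Implements IComparer(Of WordCount).Compare
        ' Comparing y to x (rather than x to y) gives descending counts
        Dim byCount As Integer = y.Count.CompareTo(x.Count)
        If byCount <> 0 Then Return byCount
        Return String.CompareOrdinal(x.Word, y.Word)
    End Function
End Class

Public Module WordCountDemo
    Public Sub Main()
        ' Sample data taken from the example above, deliberately shuffled
        Dim results() As WordCount = { _
            New WordCount("quick", 1), _
            New WordCount("fox", 3), _
            New WordCount("the", 5), _
            New WordCount("brown", 2), _
            New WordCount("jumped", 4)}

        Array.Sort(Of WordCount)(results, New WordCountComparer())

        For Each wc As WordCount In results
            Console.WriteLine(wc.Word & " " & wc.Count.ToString())
        Next
    End Sub
End Module
```

The comparer keeps the "count descending, then word" rule in one place, so the array itself stays one-dimensional and Array.Sort can handle it.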
You can give my primitive experiment a whirl here. Here's the code I'm using (except the sample formats the output slightly):
Dim i As Integer
Dim s As String
' Char literals (the c suffix) so this also compiles with Option Strict On
Dim punctuation() As Char = {"."c, ","c, "!"c, "="c, "-"c, "_"c, ";"c, ":"c, _
    "("c, ")"c, "["c, "]"c, """"c}
Dim t As String = TextBox1.Text
t = t.ToLower()
t = t.Trim()
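For i = 0 To punctuation.Length - 1
    t = t.Replace(punctuation(i), " ")
Next i
t = t.Replace(vbCrLf, " ")
t = t.Replace(vbTab, " ")
' Collapse runs of spaces into a single space
While t.IndexOf("  ") > -1
    t = t.Replace("  ", " ")
End While
' Trim again: replacing punctuation can leave a leading or trailing
' space, which Split would turn into an empty "word"
t = t.Trim()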
' Create array of all words
Dim wordArray() As String
wordArray = t.Split()
Array.Sort(wordArray)
' Create data table with two columns, word and count
Dim dt As New System.Data.DataTable("temp")
dt.Columns.Add("word", GetType(String))
dt.Columns.Add("count", GetType(Integer))
Dim dr As System.Data.DataRow
' Walk through word array, accumulating count of (sorted)
' words. As we get to each new word, write out the previous word
' and its accumulator to a data table
Dim arrayLength As Integer = wordArray.Length - 1
Dim accumulator As Integer = 0
Dim nextWord As String = ""
Dim currentWord As String = ""
For i = 0 To arrayLength
    nextWord = wordArray(i)
    If nextWord = currentWord Then
        accumulator += 1
    Else
        If i > 0 Then
            dr = dt.NewRow
            dr("word") = currentWord
            dr("count") = accumulator
            dt.Rows.Add(dr)
        End If
        currentWord = nextWord
        accumulator = 1
    End If
Next
' This should be in a sub, since it's repeated ...
dr = dt.NewRow
dr("word") = currentWord
dr("count") = accumulator
dt.Rows.Add(dr)
' Sort entries in data table by count (desc), then word
dt.DefaultView.Sort = "count DESC,word"
' Display results
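Dim drv As System.Data.DataRowView
For Each drv In dt.DefaultView
    ' HtmlEncode the word, since it comes straight from user input
    s &= "<br>" & drv("count").ToString() & " = " & _
        Server.HtmlEncode(drv("word").ToString())
Next
Label1.Text = s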
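As the comment in the listing says, the row-adding code is repeated and belongs in a Sub. Here is a minimal sketch of that refactoring, assuming VB 2005 or later (for IsNot); AddWordRow and BuildCountTable are hypothetical names. One small difference from the listing: an empty word array produces an empty table rather than a single blank row.

```vb
Imports System
Imports System.Data

Public Module WordTableHelpers
    ' Hypothetical helper (not in the listing above): appends one
    ' word/count pair to a table that has "word" and "count" columns.
    Public Sub AddWordRow(ByVal dt As DataTable, ByVal word As String, _
                          ByVal count As Integer)
        Dim dr As DataRow = dt.NewRow()
        dr("word") = word
        dr("count") = count
        dt.Rows.Add(dr)
    End Sub

    ' Walks an already-sorted word array and returns a table whose
    ' DefaultView is ordered by count (descending), then word.
    Public Function BuildCountTable(ByVal sortedWords() As String) As DataTable
        Dim dt As New DataTable("temp")
        dt.Columns.Add("word", GetType(String))
        dt.Columns.Add("count", GetType(Integer))

        Dim currentWord As String = Nothing
        Dim accumulator As Integer = 0
        For Each nextWord As String In sortedWords
            If currentWord IsNot Nothing AndAlso nextWord = currentWord Then
                accumulator += 1
            Else
                ' A new word starts: write out the previous one, if any
                If currentWord IsNot Nothing Then
                    AddWordRow(dt, currentWord, accumulator)
                End If
                currentWord = nextWord
                accumulator = 1
            End If
        Next
        ' Write out the final word
        If currentWord IsNot Nothing Then
            AddWordRow(dt, currentWord, accumulator)
        End If

        dt.DefaultView.Sort = "count DESC, word"
        Return dt
    End Function
End Module
```

With this in place, the page code would reduce to preparing the sorted word array, calling BuildCountTable, and looping over the returned table's DefaultView.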