
Word occurrence counter and analyzer

Published: 07-03-2013 | Author: Remy van Elst


With these commands you can analyze a text file. They count the occurrences of every word and print the statistics. This is useful for song lyrics, books, notes and everything else. It helps me analyze my writing style: which words do I use most often, where are my spelling errors, and so on. It is also nice for winning an argument with someone over a Dragonforce song. This article uses lyrics as the example, but the approach applies to any text file.


Get the Lyrics (text)

First get the lyrics, or the text you want to analyze, into a text file. I've heard nano, vi(m) and emacs are quite good with text. In this example I will use a song by Dragonforce. It does not matter which one, because they're all full of the same words.

My lyrics file is named: df1.txt

Sanitize them

The tools we are going to use do not like all those commas, colons, exclamation marks and other non-alphanumeric characters. So sanitize the file like this:

cat df1.txt | tr -cd '[:alnum:] [:space:]' > df1san.txt

This pumps the file through the tr command, which (with these arguments) strips everything that is not a letter (a-zA-Z), a digit (0-9) or whitespace. Exactly what we want.
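To see what the sanitizing step does without touching a real file, here is a minimal sketch that runs the same tr filter over a single made-up line. The sanitize function name and the sample text are my own, purely for illustration.

```shell
# Hypothetical helper wrapping the tr call from the article.
sanitize() {
    # -c complements the set and -d deletes, so together they delete
    # every character that is NOT alphanumeric or whitespace.
    tr -cd '[:alnum:][:space:]'
}

# Made-up sample line (not from df1.txt).
printf "Fire, flames & light: it's 2013.\n" | sanitize
```

The punctuation disappears, but the spaces around the removed "&" both stay, so you end up with two spaces in a row. That matters for the counting step later on.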

Analyze it

Now we do the magic:

remy@vps8:~$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' df1san.txt | sort | uniq -c | sort -nr | head -n 20
     72 the
     32
     25 and
     22 of
     20 in
     17 we
     16 on
     14 our
     13 a
      8 were
      8 lost
      8 for
      7 will
      7 still
      7 light
      6 to
      6 so
      6 fire
      6 far
      5 through
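The one-liner does four things: lowercase everything, put every word on its own line, count identical lines, and sort by count. As far as I know the \L in the sed expression is a GNU sed extension, so here is a rough sketch of the same idea using only tr for the text handling. The word_counts function name is hypothetical, and this is a variation on the pipeline, not the exact command from the article.

```shell
# Hypothetical helper: reads text on stdin, prints "count word" lines.
word_counts() {
    tr -cd '[:alnum:][:space:]' |   # strip punctuation, as in the sanitize step
    tr '[:upper:]' '[:lower:]' |    # lowercase, instead of GNU sed's \L
    tr -s '[:space:]' '\n' |        # one word per line, squeezing repeated whitespace
    grep -v '^$' |                  # drop a leading empty line, if there is one
    sort |                          # put identical words next to each other
    uniq -c |                       # count each run of identical words
    sort -nr                        # highest count first
}

# Made-up sample text.
printf 'The fire the Flames\nthe  fire\n' | word_counts
```

On the sample this should list "the" three times, "fire" twice and "flames" once. The squeeze (-s) and the grep are there because doubled spaces and blank lines otherwise turn into empty "words", which is most likely where a count with no word next to it comes from.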

Other Example

Here is the same analysis run on my (Dutch) class notes about blood and the immune system:

remy@vps8:~$ cat afweer.txt | tr -cd '[:alnum:] [:space:]' > afweersan.txt
remy@vps8:~$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' afweersan.txt | sort | uniq -c | sort -nr | head -n 20
    195
    108 de
     80 een
     72 van
     65 het
     51 in
     46 is
     40 en
     24 zijn
     24 op
     24 afweer
     22 die
     20 vraag
     20 deze
     19 worden
     18 kan
     17 bij
     16 dit
     15 er
     14 of

After stripping out the non-useful words:

remy@vps8:~$ cat afwres.txt | head -n 10
     24 afweer
     14 cellen
     11 bacterin
      9 waar
      9 reactie
      9 antigeen
      8 specifieke
      7 milieu
      7 lymfocyten
      7 lichaam
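The commands above do not show how the common words were stripped. One possible way, and this is an assumption rather than what was actually done for afwres.txt, is to keep a list of stop words in a file and filter them out with grep before counting. The strip_stopwords function and the temporary stop word file are hypothetical.

```shell
# Hypothetical stop word list, one word per line.
stopwords=$(mktemp)
printf '%s\n' de een van het in is en > "$stopwords"

# Hypothetical helper: drops every input line that exactly equals a stop word.
strip_stopwords() {
    # -v invert the match, -x match whole lines only,
    # -F treat patterns as fixed strings, -f read patterns from the file
    grep -vxFf "$1"
}

# Made-up word stream, one word per line, as it looks just before "sort | uniq -c".
printf '%s\n' de afweer van cellen afweer het |
    strip_stopwords "$stopwords" | sort | uniq -c | sort -nr

rm -f "$stopwords"
```

The -x flag is the important one: without it, a stop word like "in" would also throw away every longer word that contains it.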

Fabian Scherschel's NaNoWriMo 2011 Book: Nightwatch

Git tree of the book & NaNoWriMo page. The book is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

   1020 the
    454 he
    421 and
    418 of
    357 to
    347 had
    297 a
    267 was
    257 his
    241 that
    216 in
    132 it
    130 marc
    112 him
    108 as
    105 this
    105 they
     93 with
     90 but
     82 were
     82 from
     82 been
     82 at
     74 on
     70 would
     68 for
     68 could
     56 their
     56 be
     53 out
     51 into
     50 man
     49 all
     48 there
     48 so
     48 by
     47 looked
     46 not
     44 up
     44 them
     44 like

Analyzing IP and log files

Today I found another useful application for this command: analyzing IP addresses. First I grepped my entire lighttpd log file:

cat access.log | egrep -o '[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}' | tr '[:space:]' '\n' | grep -v "^\s*$" | sort | uniq -c | sort -bnr

(egrep -o prints only the matched IP address, not the whole line on which the IP address appears.)

That gives a nice list (this list is made up, not real IP addresses):
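To try the idea without a real access log, the sketch below feeds a few invented log lines through a shortened version of the pipeline. The count_ips function name and the log lines are hypothetical, and the addresses come from the ranges reserved for documentation. It uses grep -E instead of egrep and escapes the dots so that they only match a literal dot.

```shell
# Hypothetical helper: reads log lines on stdin, prints "count ip" lines.
count_ips() {
    # -E extended regex, -o print only the matching part of each line
    grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' |
    sort |          # put identical addresses next to each other
    uniq -c |       # count each address
    sort -nr        # busiest address first
}

# Invented log lines, not from a real server.
printf '%s\n' \
    '192.0.2.10 - - [07/Mar/2013:10:00:00 +0100] "GET / HTTP/1.1" 200 512' \
    '198.51.100.7 - - [07/Mar/2013:10:00:01 +0100] "GET /a HTTP/1.1" 200 128' \
    '192.0.2.10 - - [07/Mar/2013:10:00:02 +0100] "GET /b HTTP/1.1" 404 0' |
    count_ips
```

On these three lines 192.0.2.10 should come out on top with a count of 2. Because grep -o already prints one match per line, the tr and grep -v steps from the long command are not needed in this shortened form.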


Thanks to the wonderful community at Stack Exchange.
