Raymii.org

Quis custodiet ipsos custodes?

Word occurrence counter and analyzer

Published: 07-03-2013 | Author: Remy van Elst

❗ This post is over twelve years old. It may no longer be up to date. Opinions may have changed.

With these commands you can analyze a text file. They count the occurrences of every word and print the statistics. This is useful for song lyrics, books, notes and pretty much everything else. It helps me analyze my writing style: which words I use most often, where my spelling errors are, and so on. It is also nice for winning an argument with someone over a DragonForce song. This article uses lyrics as the example, but it applies to all text files.

Get the Lyrics (text)

First get the lyrics, or the text you want to analyze, into a text file. I've heard nano, vi(m) and emacs are quite good with text. In this example I will use a song by DragonForce. It does not matter which one because they're all full of the same words.

My lyrics file is named: df1.txt

Sanitize them

The tools we are going to use do not like all those commas, colons, exclamation marks and other weird non-alphanumeric characters. So sanitize the file like this:

cat df1.txt | tr -cd '[:alnum:] [:space:]' > df1san.txt

This pipes the file through the tr command, which (with these arguments) deletes every character that is not a letter, a digit or whitespace (spaces, tabs and newlines). Exactly what we want.
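To see what the sanitize step does before running it on a whole file, here is a minimal sketch on a single made-up line. The sanitize function name and the sample text are assumptions for illustration; the character set is equivalent to the one used above, since a plain space is already part of [:space:].

```shell
# Minimal sketch of the sanitize step on one made-up line.
sanitize() {
    # -c complements the set, -d deletes: everything that is NOT
    # alphanumeric or whitespace is removed.
    tr -cd '[:alnum:][:space:]'
}

# Hypothetical sample text, not taken from the real lyrics file.
printf 'Fire, fire! Through the night...\n' | sanitize
```

The commas, the exclamation mark and the dots disappear, while the words and the spaces between them stay in place.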

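Analyze it

Now we do the magic:

sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' df1san.txt | sort | uniq -c | sort -nr | head -n 20

The sed expression removes any remaining dots, lowercases every line (\L is a GNU sed extension) and puts each word on its own line. sort and uniq -c then count the identical words, and sort -nr | head -n 20 shows the 20 most frequent ones. The count without a word next to it is the number of empty lines.

remy@vps8:~$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' df1san.txt | sort | uniq -c | sort -nr | head -n 20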
72 the
32 
25 and
22 of
20 in
17 we
16 on
14 our
13 a
8 were
8 lost
8 for
7 will
7 still
7 light
6 to
6 so
6 fire
6 far
5 through
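The same counting can be sketched without sed, which also avoids the entry for empty lines. This is an alternative under assumptions, not the command used above: the wordfreq function name and the sample input are made up, and tr takes over the splitting and lowercasing.

```shell
# Word-frequency sketch: one word per line, lowercased, counted, sorted.
wordfreq() {
    # Turn every run of non-alphanumeric characters into a single newline,
    # fold upper case to lower case, drop a possible empty first line,
    # then count identical lines and sort by count, highest first.
    tr -cs '[:alnum:]' '\n' |
        tr '[:upper:]' '[:lower:]' |
        grep -v '^$' |
        sort | uniq -c | sort -nr
}

# Hypothetical sample input; replace it with: wordfreq < df1.txt
printf 'The fire and the light\nThe light!\n' | wordfreq | head -n 20
```

Because tr -cs already treats punctuation as a word separator, this variant can read the unsanitized file directly.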

Other Example

The same commands, run on my (Dutch) class notes about blood and the immune system:

remy@vps8:~$ cat afweer.txt | tr -cd '[:alnum:] [:space:]' > afweersan.txt      
remy@vps8:~$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' afweersan.txt | sort | uniq -c | sort -nr | head -n 20                 
195 
108 de
80 een
72 van
65 het
51 in
46 is
40 en
24 zijn
24 op
24 afweer
22 die
20 vraag
20 deze
19 worden
18 kan
17 bij
16 dit
15 er
14 of

After stripping out the non-useful words:

remy@vps8:~$ cat afwres.txt | head -n 10
24 afweer
14 cellen
11 bacterin
9 waar
9 reactie
9 antigeen
8 specifieke
7 milieu
7 lymfocyten
7 lichaam
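The article does not show how the common words were stripped. One possible way, sketched here under assumptions, is to filter the word list against a hand-made stop-word file before counting. The stopwords.txt file, its contents and the count_without_stopwords function are hypothetical.

```shell
# Assumption: a hand-made stop-word list, one word per line.
workdir=$(mktemp -d)
printf 'de\neen\nvan\nhet\n' > "$workdir/stopwords.txt"

count_without_stopwords() {
    # $1: stop-word file; stdin: sanitized text.
    # grep -v inverts the match, -x matches whole lines only,
    # -F treats patterns as fixed strings, -f reads them from a file.
    tr -s '[:space:]' '\n' |
        tr '[:upper:]' '[:lower:]' |
        grep -v '^$' |
        grep -vxFf "$1" |
        sort | uniq -c | sort -nr
}

# Hypothetical sample text standing in for afweersan.txt.
printf 'De afweer van het lichaam\nEen afweer reactie\n' |
    count_without_stopwords "$workdir/stopwords.txt"

rm -rf "$workdir"
```

Matching whole lines with -x matters here: without it, a stop word such as "de" would also remove every longer word that merely contains it.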

Fabian Scherschel's NaNoWriMo 2011 Book: Nightwatch

Git tree of the book & NaNoWriMo page. The book is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

1020 the
454 he
421 and
418 of
357 to
347 had
297 a
267 was
257 his
241 that
216 in
132 it
130 marc
112 him
108 as
105 this
105 they
93 with
90 but
82 were
82 from
82 been
82 at
74 on
70 would
68 for
68 could
56 their
56 be
53 out
51 into
50 man
49 all
48 there
48 so
48 by
47 looked
46 not
44 up
44 them
44 like

Analyzing IP and log files

Today I found another use for this command: analyzing IP addresses. First I grepped my entire lighttpd log file:

cat access.log | grep -Eo '[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}' | tr '[:space:]' '\n' | grep -v "^\s*$" | sort | uniq -c | sort -bnr

(grep -Eo prints only the IP address, not the whole line the IP address is on. The dots are escaped so that they only match a literal dot.)

That gives a list like this one (the list is made up, these are not real IP addresses, and it is shown in ascending order, while sort -bnr prints the highest counts first):

2 83.64.150.248
2 94.0.74.75
2 94.142.55.252
2 95.237.133.3
2 98.225.130.26
3 108.100.28.45
3 213.93.70.87
5 81.30.145.69
348 66.228.43.247
467 173.255.236.50
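The IP counting can be tried out on a few made-up log lines first. This is a sketch under assumptions: the count_ips function name and the log layout are hypothetical, and the addresses come from the ranges reserved for documentation.

```shell
# Count how often each IPv4-looking string occurs on stdin.
count_ips() {
    # -E: extended regular expressions, -o: print only the matched part.
    # Three groups of "1-3 digits plus a literal dot", then 1-3 digits.
    grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' | sort | uniq -c | sort -nr
}

# Hypothetical access-log lines, not real visitors.
printf '%s\n' \
    '192.0.2.10 - - [07/Mar/2013:10:00:00 +0100] "GET / HTTP/1.1" 200 512' \
    '198.51.100.7 - - [07/Mar/2013:10:00:01 +0100] "GET /a HTTP/1.1" 200 128' \
    '192.0.2.10 - - [07/Mar/2013:10:00:02 +0100] "GET /b HTTP/1.1" 404 0' |
    count_ips
```

The pattern only checks the shape of the string, so it also accepts impossible addresses such as 999.999.999.999 and any four-part version number that happens to appear in the log.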

Thanks to the wonderful community at Stack Exchange.

Tags: articles, awk, bash, log, lyrics, notes, sed, tr, word