Word occurrence counter and analyzer

Published: 07-03-2013 | Author: Remy van Elst


    With these commands you can analyze a text file: they count the occurrences of every word and print the statistics. This is useful for song lyrics, books, notes and any other text. It helps me analyze my writing style: which words I use most often, where my spelling errors are, and so on. It is also a nice way to win an argument about a Dragonforce song. This article uses lyrics as the example, but the approach applies to any text file.

    Get the Lyrics (text)

    First get the lyrics, or whatever text you want to analyze, into a text file. I've heard nano, vi(m) and emacs are quite good with text. In this example I will use a song by Dragonforce. It does not matter which one, because they're all full of the same words.

    My lyrics file is named: df1.txt

    Sanitize them

    The tools we are going to use do not like commas, colons, exclamation marks and other non-alphanumeric characters, so sanitize the file like this:

    cat df1.txt | tr -cd '[:alnum:] [:space:]' > dfsan.txt
    

    This pumps the file through the tr command. With these arguments, -c takes the complement of the listed set and -d deletes it, so every character that is not a letter, a digit or whitespace (spaces, tabs and newlines) is stripped. Exactly what we want.
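To see what the sanitizing step does, here is a minimal sketch that runs the same tr invocation on a single made-up line (the sample text is hypothetical, not from the lyrics file):

```shell
# Hypothetical sample line, not taken from the real lyrics file.
sample='Fire, light & steel: we ride on!'

# -c complements the set, -d deletes: everything that is not
# alphanumeric or whitespace is removed.
cleaned=$(printf '%s\n' "$sample" | tr -cd '[:alnum:][:space:]')
printf '%s\n' "$cleaned"
```

Note the doubled space left behind where the & used to be. When spaces are later turned into newlines one by one, runs like this become empty lines, which is one way an empty "word" can end up in the counts.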

    Analyze it

    Now we do the magic:

    sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' dfsan.txt | sort | uniq -c | sort -nr | head -n 20
    
    
    
    remy@vps8:~$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' dfsan.txt | sort | uniq -c | sort -nr | head -n 20
    72 the
    32 
    25 and
    22 of
    20 in
    17 we
    16 on
    14 our
    13 a
    8 were
    8 lost
    8 for
    7 will
    7 still
    7 light
    6 to
    6 so
    6 fire
    6 far
    5 through
    
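The one-liner packs several steps together: sed removes dots, lowercases each line (the \L escape is a GNU sed extension) and turns every space into a newline; sort and uniq -c then group and count the words, and sort -nr with head keeps the most frequent ones. The sketch below spells out the same idea stage by stage. It is a variant, not the original command: it uses tr for lowercasing and splitting, and squeezes repeated whitespace so that no empty entry appears in the result. The sample text is made up.

```shell
# Hypothetical sample text; any sanitized file works the same way.
sample='the fire and the light
The Fire burns'

top_words=$(printf '%s\n' "$sample" |
  tr '[:upper:]' '[:lower:]' |   # lowercase everything
  tr -s '[:space:]' '\n' |       # one word per line, squeezing repeated whitespace
  sort |                         # put identical words next to each other
  uniq -c |                      # count each run of identical words
  sort -nr |                     # highest count first
  head -n 2)                     # keep the two most frequent words
printf '%s\n' "$top_words"
```

For this sample the two most frequent words are "the" (3 times) and "fire" (2 times). The exact column padding of uniq -c differs between implementations.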

    Other Example

    The same commands, this time run on my (Dutch) class notes about blood and the immune system:

    remy@vps8:~$ cat afweer.txt | tr -cd '[:alnum:] [:space:]' > afweersan.txt      
    remy@vps8:~$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' afweersan.txt | sort | uniq -c | sort -nr | head -n 20                 
    195 
    108 de
    80 een
    72 van
    65 het
    51 in
    46 is
    40 en
    24 zijn
    24 op
    24 afweer
    22 die
    20 vraag
    20 deze
    19 worden
    18 kan
    17 bij
    16 dit
    15 er
    14 of
    

    After stripping out the non-useful words:

    remy@vps8:~$ cat afwres.txt | head -n 10
    24 afweer
    14 cellen
    11 bacterin
    9 waar
    9 reactie
    9 antigeen
    8 specifieke
    7 milieu
    7 lymfocyten
    7 lichaam
    
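How afwres.txt was produced is not shown here. One possible way to do such filtering, assuming a hand-written stopword list with one word per line, is sketched below. The file names counts.txt and stopwords.txt are hypothetical; the sample counts are copied from the output above.

```shell
# Hypothetical files for illustration: counts.txt stands in for the output of
# the counting pipeline, stopwords.txt is a hand-written list, one word per line.
workdir=$(mktemp -d)
printf '%s\n' '108 de' '80 een' '24 afweer' '14 cellen' > "$workdir/counts.txt"
printf '%s\n' 'de' 'een' 'van' 'het' > "$workdir/stopwords.txt"

# awk reads the stopwords first (NR == FNR), then prints only those count
# lines whose second field (the word) is not in that list.
filtered=$(awk 'NR == FNR { stop[$1] = 1; next } !($2 in stop)' \
  "$workdir/stopwords.txt" "$workdir/counts.txt")
printf '%s\n' "$filtered"

rm -r "$workdir"
```

Matching on the whole second field avoids accidentally dropping words that merely contain a stopword, such as "deze" when "de" is on the list.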

    Fabian Scherschel's NaNoWriMo 2011 Book: Nightwatch

    Git tree of the book & NaNoWriMo page. The book is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

    1020 the
    454 he
    421 and
    418 of
    357 to
    347 had
    297 a
    267 was
    257 his
    241 that
    216 in
    132 it
    130 marc
    112 him
    108 as
    105 this
    105 they
    93 with
    90 but
    82 were
    82 from
    82 been
    82 at
    74 on
    70 would
    68 for
    68 could
    56 their
    56 be
    53 out
    51 into
    50 man
    49 all
    48 there
    48 so
    48 by
    47 looked
    46 not
    44 up
    44 them
    44 like
    

    Analyzing IP and log files

    Today I found another useful application for this command: analyzing IP addresses. First I grepped my entire lighttpd log file:

    cat access.log | grep -Eo '[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}' | tr '[:space:]' '\n' | grep -v '^\s*$' | sort | uniq -c | sort -bnr

    (grep -o prints only the IP address, not the whole line it appears on; -E enables the {1,3} repetition syntax. The dots are escaped so they match a literal dot rather than any character.)

    That gives this nice list (made up, not real IP addresses; it is shown here in ascending order, whereas sort -bnr prints the highest counts first):

    2 83.64.150.248
    2 94.0.74.75
    2 94.142.55.252
    2 95.237.133.3
    2 98.225.130.26
    3 108.100.28.45
    3 213.93.70.87
    5 81.30.145.69
    348 66.228.43.247
    467 173.255.236.50
    
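A small self-contained sketch of the same idea, run on made-up log lines instead of a real access.log. The addresses come from the ranges reserved for documentation, and the log format is an assumption. The dots in the pattern are escaped so that they match only a literal dot.

```shell
# Made-up log lines using documentation-only addresses (RFC 5737).
log='192.0.2.10 - - [07/Mar/2013:10:00:00 +0100] "GET / HTTP/1.1" 200 512
198.51.100.7 - - [07/Mar/2013:10:00:01 +0100] "GET /a HTTP/1.1" 200 128
192.0.2.10 - - [07/Mar/2013:10:00:02 +0100] "GET /b HTTP/1.1" 404 0'

# -E enables {1,3}; -o prints only the matched address, one per line.
ip_counts=$(printf '%s\n' "$log" |
  grep -Eo '[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}' |
  sort | uniq -c | sort -nr)
printf '%s\n' "$ip_counts"
```

Because grep -o already emits one match per line, no further splitting is needed before sort and uniq -c. The pattern is deliberately loose: it also accepts out-of-range values such as 999.999.999.999, which is usually fine for counting addresses in a log.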

    Thanks to the wonderful community at Stack Exchange.
