Saturday, November 1, 2014

Count the Occurrence of All Words in a Number of Text Files

Recently I was given the task of analysing a set of files (around 300) and counting the occurrences of every word in each file. So the aim was to put together a piece of code that goes through all the files in a directory, reads each one in, lists all the words occurring in it and counts how many times each word occurs.

I quickly found out that for a single file the process is rather simple. The following code does a fine job:

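for w in `cat FILE.txt`; do echo $w;done|sort|uniq -c >> results.out

This code reads in FILE.txt and prints each whitespace-separated word on its own line. sort then groups identical words together, uniq -c collapses each group into one line prefixed with its count, and the resulting list is appended to results.out.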
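The same single-file count can also be sketched without the for loop. This is only an illustration, not the code I ended up using: the function name count_words is made up for the example, it uses tr rather than a loop to split the words, and it adds a final sort -nr so the most frequent words come first.

```shell
# count_words FILE (hypothetical helper name)
# Prints every word of FILE with its number of occurrences, highest count first.
count_words() {
    # tr -s turns each run of whitespace into a single newline: one word per line.
    # sed drops the empty line that leading whitespace in FILE would leave behind.
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' | sort | uniq -c | sort -nr
}
```

It would be called as count_words FILE.txt >> results.out.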

However, putting this into a script that loops over a whole directory was a little more complicated. So I took another direction and found a piece of code that uses sed to do the same job on a single file. With this and some scripting knowledge I was able to put together just what I needed.

Additionally, I used the basename command to write out the name of each file, so I know which counts belong to which file.
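As a quick illustration of what basename does, here is a minimal sketch. The file name report.txt is hypothetical and only there to show the effect: basename strips the leading directory part and keeps the last path component.

```shell
# Hypothetical path, used only to illustrate basename.
path="/PATH/TO/DIRECTORY/report.txt"

# basename removes everything up to and including the last slash.
name=$(basename "$path")
echo "$name"    # prints report.txt
```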

The final piece of code looks like this,

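for file in /PATH/TO/DIRECTORY/*
do
    # write the file name as a heading for this file's counts
    basename "$file" >> results.out
    # one word per line (split on spaces), then count and sort from high to low
    sed 's/ /\n/g' "$file" | sort | uniq -c | sort -nr >> results.out
    # empty line to separate this file from the next
    echo "" >> results.out
done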

This does the job nicely and creates a single output file, results.out, which contains for every input file:
  • File name of each file
  • Occurrence of each word in the file, sorted from high to low
  • Empty line to separate data from each other
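The three parts of the output listed above can be sketched as a small function. This is a variation on the loop rather than the exact code I ran: the name report_dir is made up, the [ -f ] check for skipping subdirectories is an addition, and it assumes GNU sed (which turns \n in the replacement into a newline) and that results.out is written somewhere outside the directory being scanned.

```shell
# report_dir DIR (hypothetical helper name)
# Prints, for every regular file in DIR: its name, its word counts, an empty line.
report_dir() {
    for file in "$1"/*
    do
        # skip subdirectories and anything else that is not a regular file
        [ -f "$file" ] || continue
        # file name of each file
        basename "$file"
        # occurrence of each word, sorted from high to low (GNU sed assumed)
        sed 's/ /\n/g' "$file" | sort | uniq -c | sort -nr
        # empty line to separate the data of one file from the next
        echo ""
    done
}
```

Calling it as report_dir /PATH/TO/DIRECTORY > results.out redirects the whole loop once, so a rerun overwrites the old results instead of appending to them.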
If anyone has any other suggestion or comment I am happy to hear it!


