Tuesday, January 1, 2013

Perl one-liners: File Extension Frequency

Perl Squirrel was curious about what types of files were on the Linux system.

A series of Linux commands isolates the extension of each file name, and that output is piped to a Perl one-liner that stores each file extension as a hash key, with its occurrence count as the value.
The output is sorted by key (that is, by file extension, in alphabetical order).

This is all done as a one-liner.

Description and One-Liner Code:
## For all files,
## get basename of each file (meaning drop the directory portion of the string)
## except files starting with dot (meaning don't look at hidden files)
## except files containing comma (no strange files)
## awk uses period as field separator ( -F\. )
## for lines that have more than 1 field (NF > 1 meaning there is at least one period in the file name)
## print the last field $NF (meaning the file extension)
## Display each unique extension and its number of occurrences
find . -type f -print | xargs -I {} basename {} | grep -v "^\." | grep -v ',' | awk -F\. 'NF > 1 {print $NF}' | perl -e 'while(<>){chomp;$f{$_}++;}printf("%-40s  %12i\n",$_,$f{$_}) foreach sort keys %f;'


Bonus:   Frequency counter for numeric values

To get frequency counts for numeric values, you'll want a slightly different kind of sort when printing the hash keys and values, so that the keys display in numerical order rather than alphabetical order.

This is just an example to show the numeric sort.
In a long directory listing (ls -l), the 7th field contains the day of the month.
Let's get a frequency distribution of the day of the month for each item in the directory.

One-Liner Code:
ls -l | awk '$7 >= 1 {print $7}' | perl -e 'while(<>){chomp;$f{$_}++;}printf("%-40s  %12i\n",$_,$f{$_}) foreach sort {$a<=>$b} keys %f;'


Happy New Year!!

1 comment:

  1. Perl makes it easier to write this one-liner using some of its command line options.

    perl -lne '$f{$_}++; END {printf("%-40s %12i\n",$_,$f{$_}) foreach sort keys %f}'

    -n provides you with the while loop that reads the input. You need to supply only the body of the loop. The code to be run after the while loop is wrapped in the END block.

    -l turns on auto-chomping of the input lines.

    To make it even shorter, you can use `for' instead of `foreach'. There is no difference between the two.
