Tuesday, 1 January 2013

Sorting From Back to Front


The other day I was presented with a list of names and email addresses, something like this one:





Fred Flinstone <flintstone@bedrock.sag>
Barney & Wilma Rubble <bambamsfolks@bedrock.sag>
Steadholder Honor Harrington <dutchess@harrington.mdc>
Count Miles Vorkosigan <auditor@vorkosigan.byr>
Dennis & Margaret Mitchell <stilltrouble@funnies.comics.net>
Homer Simpson <donuts@springfield.st.us>
Rudolph <rednose@reindeer.np>
Kimball Kinnison <kinnison@graylens.gp>
Wile E. Coyote <genius@acme.net>
Tiberius Claudius Drusus Nero Germanicus Julius Caesar <imperator@spqr.rm>
Gen. Jack O'Neill <jack.oneill@stargate.oml>




Except that it had around 100 names. What I wanted to do was to alphabetize this by last name, to make it easier to figure out who was missing from the list, but keep the final result as


     FirstName MiddleName(s) LastName <email>


since this was input to an email list in that format.




This would not be difficult if each person had exactly two names, say


     FirstName LastName <email>


in which case we'd just run the command


     sort -k 2 < elist


and we'd be done.
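As a quick check of the two-name case, here is a minimal sketch. The three entries are borrowed from the list above, and the variable names two_names and by_last are mine, not part of the original file.

```shell
# Hypothetical two-name sample (entries borrowed from the list above).
two_names='Homer Simpson <donuts@springfield.st.us>
Kimball Kinnison <kinnison@graylens.gp>
Fred Flinstone <flintstone@bedrock.sag>'

# -k 2 starts the sort key at the second whitespace-separated field.
by_last=$(printf '%s\n' "$two_names" | sort -k 2)

printf '%s\n' "$by_last"
```

The exact ordering of names with punctuation or mixed case can depend on your locale settings, so treat this as an illustration rather than a guarantee.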




Unfortunately, each line contains between two and eight fields, counting the email address, and we want to sort on the next-to-last one. As far as I can tell, sort has no way to count fields from the end of the line.




However, the awk (or gawk) command does. For example, the command


     awk '{print $NF}' < elist


would list just the email addresses from the above file, and


     awk '{print $(NF-1)}' < elist


would list the last names. The parentheses are required: without them, awk reads $NF-1 as the value of the last field minus one.
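To see the difference the parentheses make, here is a small sketch on a made-up line with numeric fields. The sample line and the variable names are my own, chosen only so that the two results are easy to tell apart.

```shell
# Hypothetical one-line sample; the numeric fields make the difference visible.
line='alpha beta 10 20'

# $(NF-1): the field at position NF-1, i.e. the next-to-last field.
with_parens=$(printf '%s\n' "$line" | awk '{print $(NF-1)}')

# $NF-1: the value of the last field, minus one.
without_parens=$(printf '%s\n' "$line" | awk '{print $NF-1}')

echo "with parentheses:    $with_parens"
echo "without parentheses: $without_parens"
```

The first command prints 10 (the next-to-last field) and the second prints 19 (the last field, 20, minus one).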




So what we need is a way to have awk pull out the last name from the file, sort those, then put everything back together. It turns out we can do that with a one-liner. I found it on the web yesterday, but I've lost the link, so I can't give proper credit. I did save the command, or my modification of it, at least:





awk '{print $(NF-1), $0}' < elist | sort | cut -f2- -d' '




Let's look at that in detail:





  • awk '{print $(NF-1), $0}' < elist


    prints out the next to last column of each line, followed by the entire line ($0).



  • sort


    then sorts the lines. Since each line now starts with the last name, this puts them in last-name order. Unfortunately, that leaves you with entries like this:


     Simpson Homer Simpson <donuts@springfield.st.us>


    To get rid of these, we need



  • cut -f2- -d' '


    which splits each line at every single space (the -d' ') and prints everything from the second field on (the -f2-; if we wanted just the second and third fields it would be -f2-3).
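The three stages can also be run one at a time to watch the lines change. This is only a sketch: the three entries come from the sample list, while the variable names (elist_sample, keyed, sorted, result) are mine.

```shell
# Three entries taken from the sample list above.
elist_sample='Homer Simpson <donuts@springfield.st.us>
Wile E. Coyote <genius@acme.net>
Kimball Kinnison <kinnison@graylens.gp>'

# Stage 1: prefix each line with its next-to-last field (the sort key).
keyed=$(printf '%s\n' "$elist_sample" | awk '{print $(NF-1), $0}')

# Stage 2: sort the keyed lines; the key sits at the start of each line.
sorted=$(printf '%s\n' "$keyed" | sort)

# Stage 3: drop the key, i.e. everything up to and including the first space.
result=$(printf '%s\n' "$sorted" | cut -d' ' -f2-)

printf '%s\n' "$result"
```

Printing keyed or sorted instead of result shows the intermediate form, with the last name duplicated at the front of each line.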




And the correctly sorted output is:





Tiberius Claudius Drusus Nero Germanicus Julius Caesar <imperator@spqr.rm>
Wile E. Coyote <genius@acme.net>
Fred Flinstone <flintstone@bedrock.sag>
Steadholder Honor Harrington <dutchess@harrington.mdc>
Kimball Kinnison <kinnison@graylens.gp>
Dennis & Margaret Mitchell <stilltrouble@funnies.comics.net>
Gen. Jack O'Neill <jack.oneill@stargate.oml>
Barney & Wilma Rubble <bambamsfolks@bedrock.sag>
Rudolph <rednose@reindeer.np>
Homer Simpson <donuts@springfield.st.us>
Count Miles Vorkosigan <auditor@vorkosigan.byr>




Fairly simple, huh? I generalized it a bit, so that we can sort on an arbitrary column from the end:




#! /bin/bash

# Usage

# lastsort N filename
# Sorts the file filename on the field N columns from the end
# N=0 is last column of the file

awk '{print $(NF-'$1'), $0}' "$2" | sort | cut -f2- -d' '



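Note the single quotes around the $1 in the awk command. They close awk's quoted program just before the $1 and reopen it just after, so the shell substitutes the first argument of the calling command into the awk program. Without them, awk itself would see $1 and treat it as the first field of each line, not as the script's argument.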
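Another way to get the number into awk is its -v option, which sets an awk variable from the command line. The following is a sketch of that alternative, not the script I actually used: the function name lastsort_v and the argument check are hypothetical additions of mine.

```shell
#!/bin/bash
# lastsort_v N [file]: hypothetical variant of the script above.
lastsort_v() {
    # Require N to be a non-negative integer.
    case "$1" in
        ''|*[!0-9]*) echo "usage: lastsort_v N [file]" >&2; return 1 ;;
    esac
    n=$1
    shift
    # -v n=... hands N to awk as an awk variable, so the program text
    # stays inside one pair of single quotes. "$@" is the optional
    # file name; with none given, awk reads standard input.
    awk -v n="$n" '{print $(NF-n), $0}' "$@" | sort | cut -d' ' -f2-
}
```

With this, lastsort_v 1 elist should behave like the one-liner above, and lastsort_v 0 elist would sort on the email address.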




OK, this could have a few bells and whistles, but I'm not going to bother with that now.

