15.4. Text Processing Commands

Commands affecting text and text files

sort

File sort utility, often used as a filter in a pipe. This command sorts a text stream or file forwards or backwards, or according to various keys or character positions. Using the -m option, it merges presorted input files. The info page lists its many capabilities and options. See Example 10-9, Example 10-10, and Example A-8.
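
As a minimal sketch of key-based sorting and merging (the records and the temporary file names below are made up for illustration), the -t and -k options select a field to sort on, and -m merges files that are already sorted:

```shell
# Hypothetical colon-separated records: name:score
records=$(printf 'carol:7\nalice:12\nbob:3\n')

# Sort numerically on the second field, largest first.
by_score=$(printf '%s\n' "$records" | LC_ALL=C sort -t: -k2,2nr)
echo "$by_score"

# Merge two inputs that are each already sorted.
tmpdir=$(mktemp -d)
printf 'a\nc\ne\n' > "$tmpdir/one"
printf 'b\nd\n'    > "$tmpdir/two"
merged=$(LC_ALL=C sort -m "$tmpdir/one" "$tmpdir/two")
echo "$merged"
rm -rf "$tmpdir"
```

LC_ALL=C is set here only to make the ordering independent of the current locale.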

tsort

Topological sort, reading in pairs of whitespace-separated strings and ordering them so that the first string of each pair comes before the second. The original purpose of tsort was to sort a list of dependencies for an obsolete version of the ld linker in an "ancient" version of UNIX.

The results of a tsort will usually differ markedly from those of the standard sort command, above.
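
A small sketch, using made-up item names, of how tsort treats each input pair as a "comes before" constraint:

```shell
# Each input pair "X Y" means "X must come before Y".
order=$(printf 'socks shoes\nshoes laces\n' | tsort)
echo "$order"
```

With a simple chain like this one there is only one valid ordering. When several orderings satisfy the constraints, tsort is free to print any of them.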

uniq

This filter removes duplicate lines from a sorted file (only adjacent duplicates are detected, hence the sort). It is often seen in a pipe coupled with sort.

    cat list-1 list-2 list-3 | sort | uniq > final.list
    # Concatenates the list files,
    # sorts them,
    # removes duplicate lines,
    # and finally writes the result to an output file.
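
When no per-line counts are needed, sort -u does the work of the sort | uniq pair in one step. A minimal sketch on inline sample data (the words are made up for illustration):

```shell
# Hypothetical sample data: three lines, one duplicated.
deduped=$(printf 'pear\napple\npear\n' | LC_ALL=C sort | uniq)

# sort -u sorts and drops duplicate lines in a single command.
deduped_u=$(printf 'pear\napple\npear\n' | LC_ALL=C sort -u)

echo "$deduped"
echo "$deduped_u"
```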

The useful -c option prefixes each line of output with the number of times it occurred in the input.

 bash$ cat testfile
 This line occurs only once.
 This line occurs twice.
 This line occurs twice.
 This line occurs three times.
 This line occurs three times.
 This line occurs three times.
 
 
 bash$ uniq -c testfile
       1 This line occurs only once.
       2 This line occurs twice.
       3 This line occurs three times.
 
 
 bash$ sort testfile | uniq -c | sort -nr
       3 This line occurs three times.
       2 This line occurs twice.
       1 This line occurs only once.
 	      

The sort INPUTFILE | uniq -c | sort -nr command string produces a frequency of occurrence listing on the INPUTFILE file (the -nr options to sort cause a reverse numerical sort). This template finds use in analysis of log files and dictionary lists, and wherever the lexical structure of a document needs to be examined.


Example 15-12. Word Frequency Analysis

    #!/bin/bash
    # wf.sh: Crude word frequency analysis on a text file.
    # This is a more efficient version of the "wf2.sh" script.


    # Check for input file on command line.
    ARGS=1
    E_BADARGS=65
    E_NOFILE=66

    if [ $# -ne "$ARGS" ]  # Correct number of arguments passed to script?
    then
      echo "Usage: `basename $0` filename"
      exit $E_BADARGS
    fi

    if [ ! -f "$1" ]       # Check if file exists.
    then
      echo "File \"$1\" does not exist."
      exit $E_NOFILE
    fi



    ########################################################
    # main ()
    sed -e 's/\.//g'  -e 's/\,//g' -e 's/ /\
    /g' "$1" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr
    #                           =========================
    #                            Frequency of occurrence

    #  Filter out periods and commas, and
    #+ change space between words to linefeed,
    #+ then shift characters to lowercase, and
    #+ finally prefix occurrence count and sort numerically.
    #  (The linefeed in the sed command is literal: in the actual
    #+ script, the "/g'" line must start at the left margin.)

    #  Arun Giridhar suggests modifying the above to:
    #  . . . | sort | uniq -c | sort +1 [-f] | sort +0 -nr
    #  This adds a secondary sort key, so instances of
    #+ equal occurrence are sorted alphabetically.
    #  As he explains it:
    #  "This is effectively a radix sort, first on the
    #+ least significant column
    #+ (word or string, optionally case-insensitive)
    #+ and last on the most significant column (frequency)."
    #
    #  As Frank Wang explains, the above is equivalent to
    #+       . . . | sort | uniq -c | sort +0 -nr
    #+ and the following also works:
    #+       . . . | sort | uniq -c | sort -k1nr -k
    ########################################################

    exit 0

    # Exercises:
    # ---------
    # 1) Add 'sed' commands to filter out other punctuation,
    #+   such as semicolons.
    # 2) Modify the script to also filter out multiple spaces and
    #+   other whitespace.

 bash$ cat testfile
 This line occurs only once.
 This line occurs twice.
 This line occurs twice.
 This line occurs three times.
 This line occurs three times.
 This line occurs three times.
 
 
 bash$ ./wf.sh testfile
       6 this
       6 occurs
       6 line
       3 times
       3 three
       2 twice
       1 only
       1 once
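
The script's comments use the older +N notation for sort keys, which current versions of sort may not accept. As a hedged sketch (not the script author's code), the same secondary-key idea can be written with the -k option. The word list is made up for illustration:

```shell
# Hypothetical input: one word per line, some repeated.
freq=$(printf '%s\n' this line occurs this line occurs once |
       LC_ALL=C sort | uniq -c |
       LC_ALL=C sort -k1,1nr -k2,2)
#                    ^^^^^^^ ^^^^^
#   primary key: count, numeric, descending
#   secondary key: the word itself, ascending
echo "$freq"
```

Words with equal counts now come out in alphabetical order, instead of in the reversed order seen in the sample run above.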
 	       

expand, unexpand

The expand filter converts tabs to spaces. It is often used in a pipe.

The unexpand filter converts spaces to tabs. This reverses the effect of expand.
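
A brief sketch of both filters on inline data. The tab-stop width of 4 passed to expand is an arbitrary choice for the example:

```shell
# expand: replace the tab with spaces, using tab stops every 4 columns.
expanded=$(printf 'a\tb\n' | expand -t 4)
echo "$expanded"

# unexpand: by default only leading blanks are converted;
# eight leading spaces become one tab (default tab stops every 8 columns).
unexpanded=$(printf '        x\n' | unexpand)
printf '%s\n' "$unexpanded" | od -c | head -n 1
```

The -a option to unexpand converts runs of blanks throughout the line, not only the leading ones.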

cut

A tool for extracting fields from files. It is similar to the print $N command set in awk, but more limited. It may be simpler to use cut in a script than awk. Particularly important are the -d (delimiter) and -f (field specifier) options.

Using cut to obtain a listing of the mounted filesystems:
    cut -d ' ' -f1,2 /etc/mtab

Using cut to list the OS and kernel version:
    uname -a | cut -d" " -f1,3,11,12

Using cut to extract message headers from an e-mail folder:
 bash$ grep '^Subject:' read-messages | cut -c10-80
 Re: Linux suitable for mission-critical apps?
 MAKE MILLIONS WORKING AT HOME!!!
 Spam complaint
 Re: Spam complaint

Using cut to parse a file:
    # List all the users in /etc/passwd.

    FILENAME=/etc/passwd

    for user in $(cut -d: -f1 "$FILENAME")
    do
      echo "$user"
    done

    # Thanks, Oleg Philon for suggesting this.

cut -d ' ' -f2,3 filename is equivalent to awk -F'[ ]' '{ print $2, $3 }' filename
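
The equivalence holds because both commands treat every single space as a separator. A small sketch on a made-up line shows the difference from awk's default field splitting, which collapses runs of blanks:

```shell
line='alpha  beta gamma'   # two spaces after "alpha"

# cut sees an empty field between the two adjacent spaces.
with_cut=$(printf '%s\n' "$line" | cut -d ' ' -f2,3)

# awk with an explicit single-space separator class behaves the same way.
with_awk_fs=$(printf '%s\n' "$line" | awk -F'[ ]' '{ print $2, $3 }')

# awk's default splitting collapses runs of blanks instead.
with_awk_default=$(printf '%s\n' "$line" | awk '{ print $2, $3 }')

printf '[%s]\n' "$with_cut" "$with_awk_fs" "$with_awk_default"
```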

Note

It is even possible to specify a linefeed as a delimiter. The trick is to actually embed a linefeed (RETURN) in the command sequence.

 bash$ cut -d'
 ' -f3,7,19 testfile
 This is line 3 of testfile.
 This is line 7 of testfile.
 This is line 19 of testfile.
 	      

Thank you, Jaka Kranjc, for pointing this out.

See also Example 15-48.

paste

Tool for merging together different files into a single, multi-column file. In combination with cut, useful for creating system log files.
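
A minimal sketch of paste and cut working as a pair. The file names and contents are hypothetical:

```shell
tmpdir=$(mktemp -d)
printf 'alice\nbob\n' > "$tmpdir/names"
printf '1001\n1002\n' > "$tmpdir/ids"

# The default delimiter between columns is a tab; -d selects another one.
pasted=$(paste -d: "$tmpdir/names" "$tmpdir/ids")
echo "$pasted"

# cut reverses the operation, pulling one column back out.
ids_again=$(printf '%s\n' "$pasted" | cut -d: -f2)
echo "$ids_again"
rm -rf "$tmpdir"
```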

join

Consider this a special-purpose cousin of paste. This powerful utility allows merging two files in a meaningful fashion, which essentially creates a simple version of a relational database.

The join command operates on exactly two files, but pastes together only those lines with a common tagged field (usually a numerical label), and writes the result to stdout. The files to be joined should be sorted according to the tagged field for the matchups to work properly.

    File: 1.data

    100 Shoes
    200 Laces
    300 Socks

    File: 2.data

    100 $40.00
    200 $1.00
    300 $2.00

 bash$ join 1.data 2.data
 File: 1.data 2.data

 100 Shoes $40.00
 200 Laces $1.00
 300 Socks $2.00
 	      

Note

The tagged field appears only once in the output.
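
Since join expects both inputs to be sorted on the join field, unsorted data is normally passed through sort first. A sketch under that assumption, with hypothetical file names and the same kind of data as above:

```shell
tmpdir=$(mktemp -d)
# Hypothetical inputs, deliberately out of order.
printf '300 Socks\n100 Shoes\n200 Laces\n'  > "$tmpdir/items"
printf '200 $1.00\n300 $2.00\n100 $40.00\n' > "$tmpdir/prices"

# Sort both files on the first field, then join on it.
LC_ALL=C sort -k1,1 "$tmpdir/items"  > "$tmpdir/items.sorted"
LC_ALL=C sort -k1,1 "$tmpdir/prices" > "$tmpdir/prices.sorted"
joined=$(LC_ALL=C join "$tmpdir/items.sorted" "$tmpdir/prices.sorted")
echo "$joined"
rm -rf "$tmpdir"
```

Using the same locale setting for sort and join keeps the two commands in agreement about what "sorted" means.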

head

Lists the beginning of a file to stdout. The default is 10 lines, but a different number can be specified with the -n option. The -c option outputs a given number of bytes rather than lines.


Example 15-13. Which files are scripts?

    #!/bin/bash
    # script-detector.sh: Detects scripts within a directory.

    TESTCHARS=2    # Test first 2 characters.
    SHABANG='#!'   # Scripts begin with a "sha-bang."

    for file in *  # Traverse all the files in current directory.
    do
      if [[ `head -c$TESTCHARS "$file"` = "$SHABANG" ]]
      #      head -c2                      #!
      #  The '-c' option to "head" outputs a specified
      #+ number of characters, rather than lines (the default).
      then
        echo "File \"$file\" is a script."
      else
        echo "File \"$file\" is *not* a script."
      fi
    done

    exit 0

    #  Exercises:
    #  ---------
    #  1) Modify this script to take as an optional argument
    #+    the directory to scan for scripts
    #+    (rather than just the current working directory).
    #
    #  2) As it stands, this script gives "false positives" for
    #+    Perl, awk, and other scripting language scripts.
    #     Correct this.


Example 15-14. Generating 10-digit random numbers

    #!/bin/bash
    # rnd.sh: Outputs a random number of up to 10 digits

    # Script by Stephane Chazelas.

    head -c4 /dev/urandom | od -N4 -tu4 | sed -ne '1s/.* //p'


    # =================================================================== #

    # Analysis
    # --------

    # head:
    # -c4 option takes first 4 bytes.

    # od:
    # -N4 option limits output to 4 bytes.
    # -tu4 option selects unsigned decimal format for output.

    # sed:
    # -n option, in combination with "p" flag to the "s" command,
    # outputs only matched lines.



    # The author of this script explains the action of 'sed', as follows.

    # head -c4 /dev/urandom | od -N4 -tu4 | sed -ne '1s/.* //p'
    # ----------------------------------> |

    # Assume output up to "sed" --------> |
    # is 0000000 1198195154\n

    #  sed begins reading characters: 0000000 1198195154\n.
    #  Here it finds a newline character,
    #+ so it is ready to process the first line (0000000 1198195154).
    #  It looks at its <range><action>s. The first and only one is

    #   range     action
    #   1         s/.* //p

    #  The line number is in the range, so it executes the action:
    #+ tries to substitute the longest string ending with a space in the line
    #  ("0000000 ") with nothing (//), and if it succeeds, prints the result
    #  ("p" is a flag to the "s" command here, this is different
    #+ from the "p" command).

    #  sed is now ready to continue reading its input. (Note that before
    #+ continuing, if -n option had not been passed, sed would have printed
    #+ the line once again).

    #  Now, sed reads the remainder of the characters, and finds the
    #+ end of the file.
    #  It is now ready to process its 2nd line (which is also numbered '$' as
    #+ it's the last one).
    #  It sees it is not matched by any <range>, so its job is done.

    #  In a few words, this sed command means:
    #  "On the first line only, remove any character up to the right-most space,
    #+ then print it."

    # A better way to do this would have been:
    #           sed -e 's/.* //;q'

    # Here, two <range><action>s (could have been written
    #           sed -e 's/.* //' -e q):

    #   range                    action
    #   nothing (matches line)   s/.* //
    #   nothing (matches line)   q (quit)

    #  Here, sed only reads its first line of input.
    #  It performs both actions, and prints the line (substituted) before
    #+ quitting (because of the "q" action) since the "-n" option is not passed.

    # =================================================================== #

    # An even simpler alternative to the above one-line script would be:
    #           head -c4 /dev/urandom| od -An -tu4

    exit 0

See also Example 15-39.

tail

Lists the (tail) end of a file to stdout. The default is 10 lines, but this can be changed with the -n option. Commonly used to keep track of changes to a system logfile, using the -f option, which outputs lines appended to the file.


Example 15-15. Using tail to monitor the system log

    #!/bin/bash

    filename=sys.log

    cat /dev/null > $filename; echo "Creating / cleaning out file."
    #  Creates file if it does not already exist,
    #+ and truncates it to zero length if it does.
    #  : > filename   and   > filename also work.

    tail /var/log/messages > $filename
    # /var/log/messages must have world read permission for this to work.

    echo "$filename contains tail end of system log."

    exit 0

Tip

To list a specific line of a text file, pipe the output of head to tail -n 1. For example head -n 8 database.txt | tail -n 1 lists the 8th line of the file database.txt.

To set a variable to a given block of a text file:
    var=$(head -n "$m" "$filename" | tail -n "$n")

    # filename = name of file
    # m = line number, counted from the beginning of the file,
    #     of the last line of the block
    # n = number of lines in the block
    #     (taken from the end of the first m lines)
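
As a worked sketch on a generated ten-line file (the temporary file and the values of m and n are chosen only for illustration), extracting lines 5 through 7 looks like this:

```shell
tmpfile=$(mktemp)
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > "$tmpfile"

m=7   # last line of the block
n=3   # number of lines in the block
block=$(head -n "$m" "$tmpfile" | tail -n "$n")
echo "$block"

# sed can print the same range directly, given first and last line numbers.
first=$(( m - n + 1 ))
block_sed=$(sed -n "${first},${m}p" "$tmpfile")
rm -f "$tmpfile"
```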

Note

Newer implementations of tail deprecate the older tail -$LINES filename usage. The standard tail -n $LINES filename is correct.

See also Example 15-5, Example 15-39 and Example 29-6.

grep

A multi-purpose file search tool that uses Regular Expressions. It was originally a command/filter in the venerable ed line editor: g/re/p -- global - regular expression - print.

grep pattern [file...]

Search the target file(s) for occurrences of pattern, where pattern may be literal text or a Regular Expression.

 bash$ grep '[rst]ystem.$' osinfo.txt
 The GPL governs the distribution of the Linux operating system.
 	      

If no target file is specified, grep works as a filter on stdin, as in a pipe.

 bash$ ps ax | grep clock
 765 tty1     S      0:00 xclock
 901 pts/1    S      0:00 grep clock
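
In the listing above, grep also finds its own entry in the process table. A common workaround is to wrap one character of the pattern in brackets. The sketch below simulates ps output with made-up lines so that it does not depend on what is actually running:

```shell
# Simulated "ps ax" output (hypothetical) for a search written as [c]lock.
ps_output=$(printf '765 tty1 S 0:00 xclock\n901 pts/1 S 0:00 grep [c]lock\n')

# The regex [c]lock matches "clock" but not the literal text "[c]lock",
# so the grep process no longer matches its own command line.
matches=$(printf '%s\n' "$ps_output" | grep '[c]lock')
echo "$matches"
```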
 	      

The -i option causes a case-insensitive search.

The -w option matches only whole words.

The -l option lists only the files in which matches were found, but not the matching lines.

The -r (recursive) option searches files in the current working directory and all subdirectories below it.

The -n option lists the matching lines, together with line numbers.

 bash$ grep -n Linux osinfo.txt
 2:This is a file containing information about Linux.
 6:The GPL governs the distribution of the Linux operating system.
 	      

The -v (or --invert-match) option filters out matches.
    grep pattern1 *.txt | grep -v pattern2

    # Matches all lines in "*.txt" files containing "pattern1",
    # but ***not*** "pattern2".

The -c (--count) option gives a numerical count of matching lines, rather than actually listing the matches.
    grep -c txt *.sgml   # (number of lines containing "txt" in each "*.sgml" file)


    #   grep -cz .
    #            ^ dot
    # means count (-c) zero-separated (-z) items matching "."
    # that is, non-empty ones (containing at least 1 character).
    #
    printf 'a b\nc  d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz .     # 3
    printf 'a b\nc  d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '$'   # 5
    printf 'a b\nc  d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -cz '^'   # 5
    #
    printf 'a b\nc  d\n\n\n\n\n\000\n\000e\000\000\nf' | grep -c '$'    # 9
    # By default, newline chars (\n) separate items to match.

    # Note that the -z option is GNU "grep" specific.


    # Thanks, S.C.

The --color (or --colour) option marks the matching string in color (on the console or in an xterm window). Since grep prints out each entire line containing the matching pattern, this lets you see exactly what is being matched. See also the -o option, which shows only the matching portion of the line(s).
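
A short sketch of the -o option on a made-up input line. The option is not part of POSIX, but the GNU and BSD versions of grep both provide it:

```shell
# Print only the matched parts, one per output line.
numbers=$(echo 'order 66 shipped in 3 boxes' | grep -oE '[0-9]+')
echo "$numbers"
```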


Example 15-16. Printing out the From lines in stored e-mail messages

    #!/bin/bash
    # from.sh

    #  Emulates the useful "from" utility in Solaris, BSD, etc.
    #  Echoes the "From" header line in all messages
    #+ in your e-mail directory.


    MAILDIR=~/mail/*               #  No quoting of variable. Why?
    GREP_OPTS="-H -A 5 --color"    #  Show file, plus extra context lines
                                   #+ and display "From" in color.
    TARGETSTR="^From"              # "From" at beginning of line.

    for file in $MAILDIR           #  No quoting of variable.
    do
      grep $GREP_OPTS "$TARGETSTR" "$file"
      #    ^^^^^^^^^^              #  Again, do not quote this variable.
      echo
    done

    exit $?

    #  Might wish to pipe the output of this script to 'more' or
    #+ redirect it to a file . . .

When invoked with more than one target file given, grep specifies which file contains matches.

 bash$ grep Linux osinfo.txt misc.txt
 osinfo.txt:This is a file containing information about Linux.
 osinfo.txt:The GPL governs the distribution of the Linux operating system.
 misc.txt:The Linux operating system is steadily gaining in popularity.
 	      

Tip

To force grep to show the filename when searching only one target file, simply give /dev/null as the second file.

 bash$ grep Linux osinfo.txt /dev/null
 osinfo.txt:This is a file containing information about Linux.
 osinfo.txt:The GPL governs the distribution of the Linux operating system.
 	      

If there is a successful match, grep returns an exit status of 0, which makes it useful in a condition test in a script, especially in combination with the -q option to suppress output.
   1 SUCCESS=0                      # if grep lookup succeeds
   2 word=Linux
   3 filename=data.file
   4 
   5 grep -q "$word" "$filename"    #  The "-q" option
   6                                #+ causes nothing to echo to stdout.
   7 if [ $? -eq $SUCCESS ]
   8 # if grep -q "$word" "$filename"   can replace lines 5 - 7.
   9 then
  10   echo "$word found in $filename"
  11 else
  12   echo "$word not found in $filename"
  13 fi
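
As the comment in the listing notes, the test on $? can be folded into the if statement itself. A self-contained sketch, using a temporary file in place of data.file:

```shell
tmpfile=$(mktemp)
printf 'Linux is a kernel.\n' > "$tmpfile"   # hypothetical data file
word=Linux

# "if" tests the exit status of grep directly; no $? comparison is needed.
if grep -q "$word" "$tmpfile"
then
  result="found"
else
  result="not found"
fi
echo "$word $result in $tmpfile"
rm -f "$tmpfile"
```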

Example 29-6 demonstrates how to use grep to search for a word pattern in a system logfile.


Example 15-17. Emulating grep in a script

    #!/bin/bash
    # grp.sh: Rudimentary reimplementation of grep.

    E_BADARGS=85

    if [ -z "$1" ]    # Check for argument to script.
    then
      echo "Usage: `basename $0` pattern"
      exit $E_BADARGS
    fi

    echo

    for file in *     # Traverse all files in $PWD.
    do
      output=$(sed -n /"$1"/p "$file")  # Command substitution.

      if [ ! -z "$output" ]           # What happens if "$output" is not quoted?
      then
        echo -n "$file: "
        echo "$output"
      fi              #  sed -ne "/$1/s|^|${file}: |p"  is equivalent to above.

      echo
    done

    echo

    exit 0

    # Exercises:
    # ---------
    # 1) Add newlines to output, if more than one match in any given file.
    # 2) Add features.

How can grep search for two (or more) separate patterns? What if you want grep to display all lines in a file or files that contain both "pattern1" and "pattern2"?

One method is to pipe the result of grep pattern1 to grep pattern2.

For example, given the following file:

    # Filename: tstfile

    This is a sample file.
    This is an ordinary text file.
    This file does not contain any unusual text.
    This file is not unusual.
    Here is some text.

Now, let's search this file for lines containing both "file" and "text" . . .

 bash$ grep file tstfile
 # Filename: tstfile
 This is a sample file.
 This is an ordinary text file.
 This file does not contain any unusual text.
 This file is not unusual.
 
 bash$ grep file tstfile | grep text
 This is an ordinary text file.
 This file does not contain any unusual text.
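
The two-stage pipe is one method among several. As a hedged alternative sketch, awk can test both patterns against each line in a single process. The temporary file below holds the sample text from tstfile, minus its header comment:

```shell
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
This is a sample file.
This is an ordinary text file.
This file does not contain any unusual text.
This file is not unusual.
Here is some text.
EOF

# The method shown above: two greps in a pipe.
piped=$(grep file "$tmpfile" | grep text)

# One awk process, both conditions joined with a logical AND.
with_awk=$(awk '/file/ && /text/' "$tmpfile")
echo "$with_awk"
rm -f "$tmpfile"
```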

Now, for an interesting recreational use of grep . . .


Example 15-18. Crossword puzzle solver

   1 #!/bin/bash
   2 # cw-solver.sh
   3 # This is actually a wrapper around a one-liner (line 46).
   4 
   5 #  Crossword puzzle and anagramming word game solver.
   6 #  You know *some* of the letters in the word you're looking for,
   7 #+ so you need a list of all valid words
   8 #+ with the known letters in given positions.
   9 #  For example: w...i....n
  10 #               1???5????10
  11 # w in position 1, 3 unknowns, i in the 5th, 4 unknowns, n at the end.
  12 # (See comments at end of script.)
  13 
  14 
  15 E_NOPATT=71
  16 DICT=/usr/share/dict/word.lst
  17 #                    ^^^^^^^^   Looks for word list here.
  18 #  ASCII word list, one word per line.
  19 #  If you happen to need an appropriate list,
  20 #+ download the author's "yawl" word list package.
  21 #  http://ibiblio.org/pub/Linux/libs/yawl-0.3.2.tar.gz
  22 #  or
  23 #  http://personal.riverusers.com/~thegrendel/yawl-0.3.2.tar.gz
  24 
  25 
  26 if [ -z "$1" ]   #  If no word pattern specified
  27 then             #+ as a command-line argument . . .
  28   echo           #+ . . . then . . .
  29   echo "Usage:"  #+ Usage message.
  30   echo
  31   echo ""$0" \"pattern,\""
  32   echo "where \"pattern\" is in the form"
  33   echo "xxx..x.x..."
  34   echo
  35   echo "The x's represent known letters,"
  36   echo "and the periods are unknown letters (blanks)."
  37   echo "Letters and periods can be in any position."
  38   echo "For example, try:   sh cw-solver.sh w...i....n"
  39   echo
  40   exit $E_NOPATT
  41 fi
  42 
  43 echo
  44 # ===============================================
  45 # This is where all the work gets done.
  46 grep ^"$1"$ "$DICT"   # Yes, only one line!
  47 #    |    |
  48 # ^ is start-of-line regex anchor (one word per line in the list).
  49 # $ is end-of-line regex anchor.
  50 
  51 #  From _Stupid Grep Tricks_, vol. 1,
  52 #+ a book the ABS Guide author may yet get around
  53 #+ to writing . . . one of these days . . .
  54 # ===============================================
  55 echo
  56 
  57 
  58 exit $?  # Script terminates here.
  59 #  If there are too many words generated,
  60 #+ redirect the output to a file.
  61 
  62 $ sh cw-solver.sh w...i....n
  63 
  64 wellington
  65 workingman
  66 workingmen

egrep -- extended grep -- is the same as grep -E. This uses a somewhat different, extended set of Regular Expressions, which can make the search a bit more flexible. It also allows the boolean | (or) operator.
 bash$ egrep 'matches|Matches' file.txt
 Line 1 matches.
 Line 3 Matches.
 Line 4 contains matches, but also Matches
               

fgrep -- fast grep (the "f" is also read as "fixed-string") -- is the same as grep -F. It does a literal string search (no Regular Expressions), which generally speeds things up a bit.

Note

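On some Linux distros, egrep and fgrep are symbolic links to, or aliases for grep, but invoked with the -E and -F options, respectively. Recent releases of GNU grep treat the egrep and fgrep names as obsolescent and print a warning when they are used; grep -E and grep -F are the preferred spellings.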
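
A small sketch of the difference a literal search makes, using two made-up input lines in which the dot matters:

```shell
# Hypothetical input: the dot is a regex metacharacter.
input=$(printf 'a.b\naxb\n')

as_regex=$(printf '%s\n' "$input" | grep 'a.b')       # dot matches any character
as_literal=$(printf '%s\n' "$input" | grep -F 'a.b')  # dot matches only a dot

echo "$as_regex"
echo "$as_literal"
```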


Example 15-19. Looking up definitions in Webster's 1913 Dictionary

    #!/bin/bash
    # dict-lookup.sh

    #  This script looks up definitions in the 1913 Webster's Dictionary.
    #  This Public Domain dictionary is available for download
    #+ from various sites, including
    #+ Project Gutenberg (http://www.gutenberg.org/etext/247).
    #
    #  Convert it from DOS to UNIX format (only LF at end of line)
    #+ before using it with this script.
    #  Store the file in plain, uncompressed ASCII.
    #  Set DEFAULT_DICTFILE variable below to path/filename.


    E_BADARGS=65
    MAXCONTEXTLINES=50                        # Maximum number of lines to show.
    DEFAULT_DICTFILE="/usr/share/dict/webster1913-dict.txt"
                                              # Default dictionary file pathname.
                                              # Change this as necessary.
    #  Note:
    #  ----
    #  This particular edition of the 1913 Webster's
    #+ begins each entry with an uppercase letter
    #+ (lowercase for the remaining characters).
    #  Only the *very first line* of an entry begins this way,
    #+ and that's why the search algorithm below works.



    if [[ -z $(echo "$1" | sed -n '/^[A-Z]/p') ]]
    #  Must at least specify word to look up, and
    #+ it must start with an uppercase letter.
    then
      echo "Usage: `basename $0` Word-to-define [dictionary-file]"
      echo
      echo "Note: Word to look up must start with capital letter,"
      echo "with the rest of the word in lowercase."
      echo "--------------------------------------------"
      echo "Examples: Abandon, Dictionary, Marking, etc."
      exit $E_BADARGS
    fi


    if [ -z "$2" ]                            #  May specify different dictionary
                                              #+ as an argument to this script.
    then
      dictfile=$DEFAULT_DICTFILE
    else
      dictfile="$2"
    fi

    # ---------------------------------------------------------
    Definition=$(fgrep -A $MAXCONTEXTLINES "$1 \\" "$dictfile")
    #                  Definitions in form "Word \..."
    #
    #  And, yes, "fgrep" is fast enough
    #+ to search even a very large text file.


    # Now, snip out just the definition block.

    echo "$Definition" |
    sed -n '1,/^[A-Z]/p' |
    #  Print from first line of output
    #+ to the first line of the next entry.
    sed '$d' | sed '$d'
    #  Delete last two lines of output
    #+ (blank line and first line of next entry).
    # ---------------------------------------------------------

    exit 0

    # Exercises:
    # ---------
    # 1)  Modify the script to accept any type of alphabetic input
    #   + (uppercase, lowercase, mixed case), and convert it
    #   + to an acceptable format for processing.
    #
    # 2)  Convert the script to a GUI application,
    #   + using something like 'gdialog' or 'zenity' . . .
    #     The script will then no longer take its argument(s)
    #   + from the command line.
    #
    # 3)  Modify the script to parse one of the other available
    #   + Public Domain Dictionaries, such as the U.S. Census Bureau Gazetteer.

Note

See also Example A-43 for an example of speedy fgrep lookup on a large text file.

agrep (approximate grep) extends the capabilities of grep to approximate matching. The search string may differ by a specified number of characters from the resulting matches. This utility is not part of the core Linux distribution.

Tip

To search compressed files, use zgrep, zegrep, or zfgrep. These also work on non-compressed files, though slower than plain grep, egrep, fgrep. They are handy for searching through a mixed set of files, some compressed, some not.

To search bzipped files, use bzgrep.

look

The command look works like grep, but does a lookup on a "dictionary," a sorted word list. By default, look searches for a match in /usr/dict/words (/usr/share/dict/words on most current Linux systems), but a different dictionary file may be specified.


Example 15-20. Checking words in a list for validity

    #!/bin/bash
    # lookup: Does a dictionary lookup on each word in a data file.

    file=words.data  # Data file from which to read words to test.

    echo

    while [ "$word" != end ]  # Last word in data file.
    do               # ^^^
      read word      # From data file, because of redirection at end of loop.
      look $word > /dev/null  # Don't want to display lines in dictionary file.
      lookup=$?      # Exit status of 'look' command.

      if [ "$lookup" -eq 0 ]
      then
        echo "\"$word\" is valid."
      else
        echo "\"$word\" is invalid."
      fi

    done <"$file"    # Redirects stdin to $file, so "reads" come from there.

    echo

    exit 0

    # ----------------------------------------------------------------
    # Code below line will not execute because of "exit" command above.


    # Stephane Chazelas proposes the following, more concise alternative:

    while read word && [[ $word != end ]]
    do if look "$word" > /dev/null
       then echo "\"$word\" is valid."
       else echo "\"$word\" is invalid."
       fi
    done <"$file"

    exit 0

sed, awk

Scripting languages especially suited for parsing text files and command output. May be embedded singly or in combination in pipes and shell scripts.

sed

Non-interactive "stream editor", permits using many ex commands in batch mode. It finds many uses in shell scripts.
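
A minimal sketch of two of the most common sed operations, substitution and line selection, on made-up input:

```shell
# s/old/new/ replaces the first match on each line; the "g" flag replaces all.
first_only=$(echo 'one two one' | sed 's/one/1/')
all=$(echo 'one two one' | sed 's/one/1/g')

# -n suppresses automatic printing; the "p" command prints the selected line.
second_line=$(printf 'a\nb\nc\n' | sed -n '2p')

printf '%s\n' "$first_only" "$all" "$second_line"
```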

awk

Programmable file extractor and formatter, good for manipulating and/or extracting fields (columns) in structured text files. Its syntax is similar to C.
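
A brief sketch of awk's field handling, using hypothetical name and age records:

```shell
# Hypothetical whitespace-separated records: name and age.
data=$(printf 'alice 30\nbob 25\n')

# $1, $2, ... refer to the fields of the current line.
names=$(printf '%s\n' "$data" | awk '{ print $1 }')

# The END block runs once, after the last line has been read.
total=$(printf '%s\n' "$data" | awk '{ sum += $2 } END { print sum }')

echo "$names"
echo "$total"
```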

wc

wc gives a "word count" on a file or I/O stream:
 bash$ wc /usr/share/doc/sed-4.1.2/README
 13  70  447 /usr/share/doc/sed-4.1.2/README
 [13 lines  70 words  447 bytes]

wc -w gives only the word count.

wc -l gives only the line count.

wc -c gives only the byte count.

wc -m gives only the character count.

wc -L gives only the length of the longest line.

Using wc to count how many .txt files are in current working directory:
    ls *.txt | wc -l
    #  Will work as long as none of the "*.txt" files
    #+ have a linefeed embedded in their name.

    #  Alternative ways of doing this are:
    #      find . -maxdepth 1 -name \*.txt -print0 | grep -cz .
    #      (shopt -s nullglob; set -- *.txt; echo $#)

    #  Thanks, S.C.

Using wc to total up the size of all the files whose names begin with letters in the range d - h:
 bash$ wc [d-h]* | grep total | awk '{print $3}'
 71832
 	      

Using wc to count the lines containing the word "Linux" in the main source file for this book:
 bash$ grep Linux abs-book.sgml | wc -l
 50
 	      

See also Example 15-39 and Example 19-8.

Certain commands include some of the functionality of wc as options.
    ... | grep foo | wc -l
    # This frequently used construct can be more concisely rendered.

    ... | grep -c foo
    # Just use the "-c" (or "--count") option of grep.

    # Thanks, S.C.

tr

Character translation filter.

Caution

Must use quoting and/or brackets, as appropriate. Quotes prevent the shell from reinterpreting the special characters in tr command sequences. Brackets should be quoted to prevent expansion by the shell.

Either tr "A-Z" "*" <filename or tr A-Z \* <filename changes all the uppercase letters in filename to asterisks (writes to stdout). On some systems this may not work, but tr A-Z '[**]' will.

The -d option deletes a range of characters.
    echo "abcdef"                 # abcdef
    echo "abcdef" | tr -d b-d     # aef


    tr -d 0-9 <filename
    # Deletes all digits from the file "filename".

The --squeeze-repeats (or -s) option deletes all but the first character in a run of consecutive identical characters. This option is useful for removing excess whitespace.
 bash$ echo "XXXXX" | tr --squeeze-repeats 'X'
 X
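
Two short sketches of the whitespace use case, on made-up input:

```shell
# Collapse each run of spaces into a single space.
squeezed=$(echo 'too    many     spaces' | tr -s ' ')
echo "$squeezed"

# Collapse runs of newlines, which removes blank lines.
no_blanks=$(printf 'a\n\n\nb\n' | tr -s '\n')
echo "$no_blanks"
```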

The -c "complement" option inverts the character set to match. With this option, tr acts only upon those characters not matching the specified set.

 bash$ echo "acfdeb123" | tr -c b-d +
 +c+d+b++++

Note that tr recognizes POSIX character classes. [1]

 bash$ echo "abcd2ef1" | tr '[:alpha:]' -
 ----2--1
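
The character classes combine with the options described above. As a sketch on a made-up string, -c together with -d deletes everything outside the named class:

```shell
# -c complements the set and -d deletes: every non-digit is removed,
# including the trailing newline that echo adds.
digits=$(echo 'Tel: (555) 010-9999' | tr -cd '[:digit:]')
echo "$digits"
```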
 	      


Example 15-21. toupper: Transforms a file to all uppercase.

    #!/bin/bash
    # Changes a file to all uppercase.

    E_BADARGS=65

    if [ -z "$1" ]  # Standard check for command line arg.
    then
      echo "Usage: `basename $0` filename"
      exit $E_BADARGS
    fi

    tr a-z A-Z <"$1"

    # Same effect as above, but using POSIX character set notation:
    #        tr '[:lower:]' '[:upper:]' <"$1"
    # Thanks, S.C.

    exit 0

    #  Exercise:
    #  Rewrite this script to give the option of changing a file
    #+ to *either* upper or lowercase.


Example 15-22. lowercase: Changes all filenames in working directory to lowercase.

    #!/bin/bash
    #
    #  Changes every filename in working directory to all lowercase.
    #
    #  Inspired by a script of John Dubois,
    #+ which was translated into Bash by Chet Ramey,
    #+ and considerably simplified by the author of the ABS Guide.


    for filename in *                # Traverse all files in directory.
    do
       fname=`basename $filename`
       n=`echo $fname | tr A-Z a-z`  # Change name to lowercase.
       if [ "$fname" != "$n" ]       # Rename only files not already lowercase.
       then
         mv $fname $n
       fi
    done

    exit $?


    # Code below this line will not execute because of "exit".
    #--------------------------------------------------------#
    # To run it, delete script above line.

    # The above script will not work on filenames containing blanks or newlines.
    # Stephane Chazelas therefore suggests the following alternative:


    for filename in *    # Not necessary to use basename,
                         # since "*" won't return any file containing "/".
    do n=`echo "$filename/" | tr '[:upper:]' '[:lower:]'`
    #                             POSIX char set notation.
    #                    Slash added so that trailing newlines are not
    #                    removed by command substitution.
       # Variable substitution:
       n=${n%/}          # Removes trailing slash, added above, from filename.
       [[ $filename == $n ]] || mv "$filename" "$n"
                         # Checks if filename already lowercase.
    done

    exit $?


Example 15-23. du: DOS to UNIX text file conversion.

   #!/bin/bash
   # Du.sh: DOS to UNIX text file converter.

   E_WRONGARGS=65

   if [ -z "$1" ]
   then
     echo "Usage: $(basename "$0") filename-to-convert"
     exit $E_WRONGARGS
   fi

   NEWFILENAME=$1.unx

   CR='\015'  # Carriage return.
              # 015 is octal ASCII code for CR.
              # Lines in a DOS text file end in CR-LF.
              # Lines in a UNIX text file end in LF only.

   tr -d "$CR" < "$1" > "$NEWFILENAME"
   # Delete CR's and write to new file.
   # Quoting the variables keeps filenames with blanks intact.

   echo "Original DOS text file is \"$1\"."
   echo "Converted UNIX text file is \"$NEWFILENAME\"."

   exit 0

   # Exercise:
   # --------
   # Change the above script to convert from UNIX to DOS.


Example 15-24. rot13: ultra-weak encryption.

   #!/bin/bash
   # rot13.sh: Classic rot13 algorithm,
   #           encryption that might fool a 3-year old.

   # Usage: ./rot13.sh filename
   # or     ./rot13.sh <filename
   # or     ./rot13.sh and supply keyboard input (stdin)

   cat "$@" | tr 'a-zA-Z' 'n-za-mN-ZA-M'   # "a" goes to "n", "b" to "o", etc.
   #  The 'cat "$@"' construction
   #+ permits getting input either from stdin or from files.

   exit 0
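
Because the alphabet has 26 letters, shifting by 13 twice returns every letter to where it started, so the same tr command both encrypts and decrypts. A minimal sketch of that property, wrapping the command in a function named rot13 for illustration:

```shell
#!/bin/bash
# rot13 here is an illustrative function, reading stdin and writing stdout.
rot13 () {
  tr 'a-zA-Z' 'n-za-mN-ZA-M'
}

plain="Hello, World"
cipher=$(echo "$plain" | rot13)

echo "$cipher"            # Uryyb, Jbeyq
echo "$cipher" | rot13    # Hello, World
```

Non-alphabetic characters, such as the comma and the space, pass through unchanged.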


Example 15-25. Generating "Crypto-Quote" Puzzles

   #!/bin/bash
   # crypto-quote.sh: Encrypt quotes

   #  Will encrypt famous quotes in a simple monoalphabetic substitution.
   #  The result is similar to the "Crypto Quote" puzzles
   #+ seen in the Op Ed pages of the Sunday paper.


   key=ETAOINSHRDLUBCFGJMQPVWZYXK
   # The "key" is nothing more than a scrambled alphabet.
   # Changing the "key" changes the encryption.

   # The 'cat "$@"' construction gets input either from stdin or from files.
   # If using stdin, terminate input with a Control-D.
   # Otherwise, specify filename as command-line parameter.

   cat "$@" | tr "a-z" "A-Z" | tr "A-Z" "$key"
   #        |  to uppercase  |     encrypt
   # Will work on lowercase, uppercase, or mixed-case quotes.
   # Passes non-alphabetic characters through unchanged.


   # Try this script with something like:
   # "Nothing so needs reforming as other people's habits."
   # --Mark Twain
   #
   # Output is:
   # "CFPHRCS QF CIIOQ MINFMBRCS EQ FPHIM GIFGUI'Q HETRPQ."
   # --BEML PZERC

   # To reverse the encryption:
   # cat "$@" | tr "$key" "A-Z"


   #  This simple-minded cipher can be broken by an average 12-year old
   #+ using only pencil and paper.

   exit 0

   #  Exercise:
   #  --------
   #  Modify the script so that it will either encrypt or decrypt,
   #+ depending on command-line argument(s).

fold

A filter that wraps lines of input to a specified width. This is especially useful with the -s option, which breaks lines at word spaces (see Example 15-26 and Example A-1).
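
To see what -s changes, compare the two invocations in the sketch below. The helper name wrap_words and the sample sentence are made up for illustration.

```shell
#!/bin/bash
# wrap_words is a hypothetical helper name.
# Usage: wrap_words WIDTH TEXT
wrap_words () {
  echo "$2" | fold -s -w "$1"
}

# With -s, lines break after a blank, so no word is split.
wrap_words 10 "the quick brown fox jumps"

echo "-----"

# Without -s, fold cuts at exactly 10 characters, even mid-word.
echo "the quick brown fox jumps" | fold -w 10
```

Note that fold only inserts newlines. It does not reflow or trim text, so a line broken with -s keeps the blank at which it was split.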

fmt

Simple-minded file formatter, used as a filter in a pipe to "wrap" long lines of text output.


Example 15-26. Formatted file listing.

   #!/bin/bash

   WIDTH=40                    # 40 columns wide.

   b=`ls /usr/local/bin`       # Get a file listing...

   echo $b | fmt -w $WIDTH

   # Could also have been done by
   #    echo $b | fold - -s -w $WIDTH

   exit 0

See also Example 15-5.

Tip

A powerful alternative to fmt is the par utility, written by Adam M. Costello and available from http://www.cs.berkeley.edu/~amc/Par/.

col

This deceptively named filter removes reverse line feeds from an input stream. It also attempts to replace whitespace with equivalent tabs. The chief use of col is in filtering the output from certain text processing utilities, such as groff and tbl.

column

Column formatter. This filter transforms list-type text output into a "pretty-printed" table by inserting tabs at appropriate places.


Example 15-27. Using column to format a directory listing

   #!/bin/bash
   # colms.sh
   # A minor modification of the example file in the "column" man page.


   (printf "PERMISSIONS LINKS OWNER GROUP SIZE MONTH DAY HH:MM PROG-NAME\n" \
   ; ls -l | sed 1d) | column -t
   #         ^^^^^^           ^^

   #  The "sed 1d" in the pipe deletes the first line of output,
   #+ which would be "total        N",
   #+ where "N" is the total number of disk blocks used by the listed files.

   # The -t option to "column" pretty-prints a table.

   exit 0

colrm

Column removal filter. This removes columns (characters) from a file and writes the file, lacking the range of specified columns, back to stdout. colrm 2 4 <filename removes the second through fourth characters from each line of the text file filename.

Caution

If the file contains tabs or nonprintable characters, this may cause unpredictable behavior. In such cases, consider using expand and unexpand in a pipe preceding colrm.

nl

Line numbering filter: nl filename lists filename to stdout, but inserts consecutive numbers at the beginning of each non-blank line. If filename is omitted, nl operates on stdin.

The output of nl is very similar to that of cat -b since, by default, nl does not number blank lines.
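
The default behavior and the effect of the -ba option (number all lines) can be compared with a short sketch. The helper names sample and last_number are made up for illustration.

```shell
#!/bin/bash
# sample and last_number are hypothetical helper names.

# Three lines of input, the middle one blank.
sample () {
  printf 'first\n\nthird\n'
}

# Extracts the line number that prefixes the last line of numbered output.
last_number () {
  tail -n 1 | awk '{print $1}'
}

sample | nl        # Blank line is passed through unnumbered.
sample | nl -ba    # Blank line gets a number too.

echo "default: $(sample | nl | last_number)"     # default: 2
echo "nl -ba:  $(sample | nl -ba | last_number)" # nl -ba:  3
```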


Example 15-28. nl: A self-numbering script.

   1 #!/bin/bash
   2 # line-number.sh
   3 
   4 # This script echoes itself twice to stdout with its lines numbered.
   5 
   6 # 'nl' sees this as line 4 since it does not number blank lines.
   7 # 'cat -n' sees the above line as number 6.
   8 
   9 nl `basename $0`
  10 
  11 echo; echo  # Now, let's try it with 'cat -n'
  12 
  13 cat -n `basename $0`
  14 # The difference is that 'cat -n' numbers the blank lines.
  15 # Note that 'nl -ba' will also do so.
  16 
  17 exit 0
  18 # -----------------------------------------------------------------

pr

Print formatting filter. This will paginate files (or stdout) into sections suitable for hard copy printing or viewing on screen. Various options permit row and column manipulation, joining lines, setting margins, numbering lines, adding page headers, and merging files, among other things. The pr command combines much of the functionality of nl, paste, fold, column, and expand.

pr -o 5 --width=65 fileZZZ | more gives a nice paginated listing to screen of fileZZZ with margins set at 5 and 65.

A particularly useful option is -d, forcing double-spacing (same effect as sed -G).
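
When pr is used purely as a filter, the page header and trailer usually get in the way. The sketch below adds the -t option to suppress them; the function name double_space is made up for illustration.

```shell
#!/bin/bash
# double_space is a hypothetical helper name.
# -d double-spaces the output.
# -t omits the page header and trailer, leaving only the text itself.
double_space () {
  pr -d -t
}

printf 'one\ntwo\nthree\n' | double_space
```

With the header suppressed, the result should match what sed G produces for the same input.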

gettext

The GNU gettext package is a set of utilities for localizing and translating the text output of programs into foreign languages. While originally intended for C programs, it now supports quite a number of programming and scripting languages.

The gettext program works on shell scripts. See the info page.

msgfmt

A program for generating binary message catalogs. It is used for localization.

iconv

A utility for converting file(s) to a different encoding (character set). Its chief use is for localization.

   # Convert a string from UTF-8 to UTF-16 and print to the BookList
   function write_utf8_string {
       STRING=$1
       BOOKLIST=$2
       echo -n "$STRING" | iconv -f UTF8 -t UTF16 | \
       cut -b 3- | tr -d \\n >> "$BOOKLIST"
   }

   #  From Peter Knowles' "booklistgen.sh" script
   #+ for converting files to Sony Librie/PRS-50X format.
   #  (http://booklistgensh.peterknowles.com)
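
In the function above, cut -b 3- discards the first two bytes of the converted text. With GNU iconv, the generic UTF16 target begins its output with a two-byte byte-order mark, which is presumably what is being stripped. Naming the byte order explicitly (UTF-16LE or UTF-16BE) avoids the mark altogether. A minimal sketch, assuming an iconv that supports these encoding names; the helper names are made up for illustration.

```shell
#!/bin/bash
# utf16le_bytes and roundtrip are hypothetical helper names.

# Counts the bytes of a string after conversion to UTF-16LE.
utf16le_bytes () {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-16LE | wc -c | tr -d ' '
}

# Converts to UTF-16LE and back again; the text should be unchanged.
roundtrip () {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-16LE | iconv -f UTF-16LE -t UTF-8
}

utf16le_bytes "abc"      # 6  (two bytes per character, no byte-order mark)
roundtrip "abc"; echo    # abc
```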

recode

Consider this a fancier version of iconv, above. It is a very versatile utility for converting a file to a different encoding scheme. Note that recode is not part of the standard Linux installation.

TeX, gs

TeX and PostScript are text markup languages used for preparing copy for printing or formatted video display.

TeX is Donald Knuth's elaborate typesetting system. It is often convenient to write a shell script encapsulating all the options and arguments passed to one of these markup languages.

Ghostscript (gs) is a GPL-ed PostScript interpreter.

texexec

Utility for processing TeX and pdf files. Found in /usr/bin on many Linux distros, it is actually a shell wrapper that calls Perl to invoke TeX.

   texexec --pdfarrange --result=Concatenated.pdf *pdf

   #  Concatenates all the pdf files in the current working directory
   #+ into the merged file, Concatenated.pdf . . .
   #  (The --pdfarrange option repaginates a pdf file. See also --pdfcombine.)
   #  The above command line could be parameterized and put into a shell script.

enscript

Utility for converting a plain text file to PostScript.

For example, enscript filename.txt -p filename.ps produces the PostScript output file filename.ps.

groff, tbl, eqn

Yet another text markup and display formatting language is groff. This is the enhanced GNU version of the venerable UNIX roff/troff display and typesetting package. Manpages use groff.

The tbl table processing utility is considered part of groff, as its function is to convert table markup into groff commands.

The eqn equation processing utility is likewise part of groff, and its function is to convert equation markup into groff commands.


Example 15-29. manview: Viewing formatted manpages

   #!/bin/bash
   # manview.sh: Formats the source of a man page for viewing.

   #  This script is useful when writing man page source.
   #  It lets you look at the intermediate results on the fly
   #+ while working on it.

   E_WRONGARGS=85

   if [ -z "$1" ]
   then
     echo "Usage: $(basename "$0") filename"
     exit $E_WRONGARGS
   fi

   # ---------------------------
   groff -Tascii -man "$1" | less
   # From the man page for groff.
   # ---------------------------

   #  If the man page includes tables and/or equations,
   #+ then the above code will barf.
   #  The following line can handle such cases.
   #
   #   gtbl < "$1" | geqn -Tlatin1 | groff -Tlatin1 -mtty-char -man
   #
   #   Thanks, S.C.

   exit 0

See also Example A-41.

lex, yacc

The lex lexical analyzer produces programs for pattern matching. This has been replaced by the nonproprietary flex on Linux systems.

The yacc utility creates a parser based on a set of specifications. This has been replaced by the nonproprietary bison on Linux systems.

Notes

[1]

This is only true of the GNU version of tr, not the generic version often found on commercial UNIX systems.