Unix/Programming Tips and Tricks

This page is for adding various tips and tricks for programming. Please feel free to add anything you think would be helpful for others!

The Software Carpentry site has a nice collection of tutorials and tips regarding software development.

Handy Unix Commands

Downloading files remotely

Sometimes you are working remotely on the lab machines, and you want to download a file from somewhere that you found on a web page. Rather than downloading it to your home machine and then uploading it to the lab, use "wget":

jbaldrid@quiche:~/tmp$ wget <url>

It can also be handy to use the ''lynx'' textual web browser in some contexts, and it can be used for such downloads as well. It is especially useful when you are trying to access a journal publication that can be downloaded from UT machines but not from off campus: just log on to the lab machines, run lynx in your terminal, go to the page, and tab to the paper you want to download.

Reinvoking a previous command

If you have entered a bunch of commands and want to recall one that had a particular prefix, you can invoke it again using "!", as done to recall the first "cat ..." command with "!ca" in the following:

/groups/corpora/nltk-data/gutenberg$ cat austen-emma.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | sort -nr | grep -E "\b[Hh]er?\b"
   2400 her
   1368 he
    443 He
     90 Her
/groups/corpora/nltk-data/gutenberg$ ls
austen-emma.txt        blake-songs.txt          README
austen-persuasion.txt  chesterton-ball.txt      shakespeare-caesar.txt
austen-sense.txt       chesterton-brown.txt     shakespeare-hamlet.txt
bible-kjv.txt          chesterton-thursday.txt  shakespeare-macbeth.txt
blake-poems.txt        milton-paradise.txt      whitman-leaves.txt
/groups/corpora/nltk-data/gutenberg$ wc austen-emma.txt
 17078 159826 914529 austen-emma.txt
/groups/corpora/nltk-data/gutenberg$ !ca
cat austen-emma.txt | tr -cs 'A-Za-z' '\n' | sort | uniq -c | sort -nr | grep -E "\b[Hh]er?\b"
   2400 her
   1368 he
    443 He
     90 Her

Command Line NLP Tools

Chris Brew (OSU) and Marc Moens wrote a draft of a book on NLP that has 40 or so pages that describe many UNIX shell commands which are useful for building n-grams, etc. Fred Hoyt extracted these pages to a single PDF.


Splitting a file into a series of files by taking every nth line

Say you have a file with a bunch of lines, and you want to put all even-numbered lines into one file and odd-numbered ones into another. Here's an easy way to do it (NR is the current line number, so 2-NR%2 is 1 for odd-numbered lines, which go to file1, and 2 for even-numbered lines, which go to file2):

  $ cat <file> | awk '{print > "file"2-NR%2}'

For three files with every third line, do:

  $ cat <file> | awk '{print > "file"3-NR%3}'

And so on.
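As a quick sanity check, here is the two-way split run on the numbers 1 through 6 in a scratch directory (the output file names file1 and file2 come straight from the awk expression):

```shell
cd "$(mktemp -d)"   # scratch directory so no real files get clobbered
seq 1 6 | awk '{print > "file"2-NR%2}'
cat file1   # odd-numbered lines: 1, 3, 5
cat file2   # even-numbered lines: 2, 4, 6
```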


Removing Lines from a File

If you want to remove all lines from a file which contain some character, the sed ('stream editor') utility is useful:

  $ cat <file> | sed '/<pattern>/d' > outfile

For example, if you have a LaTeX file and you want to remove all the lines containing comments:

  $ cat my_tex.tex | sed '/%/d' > my_commentless_tex.tex
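One caveat: sed deletes the whole line whenever the pattern matches anywhere in it, so a line of real LaTeX with a trailing comment (or an escaped \%) disappears too. A quick check on made-up input:

```shell
# the middle line contains a %, so the whole line is dropped
printf 'text\n%% a comment\nmore text\n' | sed '/%/d'
# prints:
# text
# more text
```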


Extracting Columns from a File

The awk utility is useful for extracting columns from a file that contains columnar data.

For example, the following pulls out the first column:

  $ awk '{print $1}' my_file > column_one

To pull out the 2nd and 3rd columns separated by a comma:

  $ awk '{print $2 "," $3}' my_file
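For instance, on a made-up three-column input fed through standard input:

```shell
# concatenating $2, a literal comma, and $3 joins the columns with no spaces
printf 'a b c\nd e f\n' | awk '{print $2 "," $3}'
# prints:
# b,c
# e,f
```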


Performing Calculations on a Column in a File

You can also use awk to perform arithmetic operations on a column containing numbers.

For example, say you are extracting unigram counts for words in a text. You use the usual command-line tools to calculate word counts and print them to a file:

$ cat <file> | tr -cs 'A-Za-z' '\n' | sort | uniq -c | sort -rn > fish.count.txt

(if you don't know what these commands mean, see the PDF file on command line NLP tools just above).

This produces output like the following (displaying the results with ''less'' rather than printing to a file):

$ cat <file> | tr -cs 'A-Za-z' '\n' | sort | uniq -c | sort -rn | less

100 fish
100 heads
81  yum
23  chomp

If you want to get the sum of all these counts, do the following:

$ cat fish.count.txt | awk '{total = total + $1} END {print total}'

The key is that both bracketed expressions in the awk command are within the quote marks. The first expression runs once per input line, adding the value of the first column (the $1 variable) to the variable total, which starts at 0. The second expression runs after the last line has been read (that is what the END keyword means) and prints the final total.
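Run on the sample counts shown above (fed in via printf here rather than a file), the command sums 100+100+81+23:

```shell
printf '100 fish\n100 heads\n81 yum\n23 chomp\n' \
  | awk '{total = total + $1} END {print total}'
# prints: 304
```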


Leaving a process running while you log out

Say you have an experiment to run or code to compile and it's going to take a long time. You need to leave the lab, you don't want to tie up one of the machines by leaving it screen locked, and you want to be able to check the process from another machine.

A nice way to do this is by using the screen utility. It works like this:

(1) Log or ssh into the machine you are going to use.

(2) Type:

    ''$ screen''

(3) You will see an intro window. Press any key and you will then be presented with a prompt. Start your process:

    ''$ my-process''

(4) Now you have to leave. Press ctrl-a followed by "d". This "detaches" you from the process. Now you can log out, lock up the lab, and go about your business.

(5) To re-attach to the process later, log or ssh back into the computer on which the process is running, and start screen again.

(6) Type:

    ''$ screen -ls''

This will show you a list of screen processes running:

    ''$ screen -ls
      There are screens on:
                28354.pts-0.odyssey     (Detached)
                28302.pts-0.odyssey     (Detached)
      2 Sockets in /var/run/screen/S-bubba.''

Say you want to re-attach to 28302. Type:

    ''$ screen -r 28302.pts-0.odyssey''

The process will re-appear in the terminal and you're ready to go.

For more details on how to use screen, see its manual page (''man screen'')


How to leave a process running without needing to see it again

If you want to leave a process running until it finishes, and you only need to see its output (not interact with it), you can use the ''nohup'' utility ("nohup" stands for "no hangup"). Say you want to run a program, e.g. ''my-process'' from above; do the following:

     ''$ nohup my-process > dog_treats.txt &''

Now you can log out and go about your business, and the process will keep running until it's through. Be sure to include the ampersand, which puts the job in the background.


A Quick and Easy Command for Checking Disk Space

''du'' is a useful command-line utility for taking inventory of disk space usage. Say you're in your home directory; the following command returns a list of directory contents followed by their sizes:

     ''$ du -h --max-depth=1''

If you use ''du'' without any options, it will operate recursively, giving a potentially long list. The -h option prints sizes in a human-readable format (e.g. 1.5K, 23M), and --max-depth=1 blocks recursion below the top level.

If you just want a sum of disk space usage for the contents of the directory, do the following:

     ''$ du -hs''
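If you want that per-directory listing sorted by size, GNU sort's -h option understands the human-readable suffixes that du -h emits (this assumes GNU coreutils; on other systems sort may lack -h):

```shell
# largest directories end up at the bottom of the list
du -h --max-depth=1 | sort -h
```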

Revision Control Systems

If you aren't using a revision control system, you should start doing so now. The choice is simple: use git.

Shell Scripts for Fun and Profit

Automated LaTeX-ing
If you like to generate your LaTeX output on the command line, here are some quick, simple scripts for streamlining the process.

Say you have a LaTeX document with a bibliography. The following script runs latex once, generates the bibliography, re-runs latex twice so that the references all appear in the text, and then converts the result first to .ps and then to .pdf:


A=$1
latex $A.tex
bibtex $A
latex $A.tex
latex $A.tex
dvips -t letter $A.dvi -o
ps2pdf $A.ps

For a simpler version, just eliminate the bibtex command and one or two of the latex commands.

Now, say you have a directory full of .tex documents, and you want to generate all of them. You can put a loop in the shell script:

for i in *.tex
do
    latex $i
    dvips -t letter ${i%.tex}.dvi -o
done

Bashing your files to rename them
Let's say you have a bunch of files that have complex names that you'd like to simplify, like changing "1997A-Document1.txt" to "Document1.txt". For example, you have the following files in a directory:

$ ls
1997A-Document1.txt  1997A-Document2.txt  1997A-Document3.txt

You could do a bunch of ''mv'' commands that would handle each one individually. Or, you can do the following (using bash; note that in recent versions of bash the regular expression after =~ must be unquoted, or it is matched as a literal string):

$ for i in *; do if [[ $i =~ -([A-Za-z0-9]+\.txt)$ ]]; then mv "$i" "${BASH_REMATCH[1]}"; fi; done

Now the names of the files have been changed:

$ ls
Document1.txt  Document2.txt  Document3.txt

You can enter that bash program on a single line, but if you'd like to make an actual script, you can spread things out a bit:

for i in *; do
  if [[ $i =~ -([A-Za-z0-9]+\.txt)$ ]]; then
    mv "$i" "${BASH_REMATCH[1]}"
  fi
done

This example should be a good pointer for how to do lots of other similar things in bash.

In certain cases, the Unix command ''rename'' is a simpler way to accomplish such tasks. For example, to rename all files ending in ".txt" to ".foo", do the following:

$ rename 's/\.txt$/.foo/' *.txt
Making Emacs Look the Way You Want It
I like using emacs with a black background. This can be invoked on startup with the following command:

emacs -bg black -fg white

If you don't want to have to type all that each time you start emacs, you can stick it in a script:

emacs -bg black -fg white $1


Producing Graphs

There are a number of ways to produce graphs. Probably the best thing to do if you are learning for the first time is to use the R language.

In the meantime, there are some old ways of doing it too: xgraph and gnuplot.

xgraph provides a simple way to create graphs quickly. Say you have an xgraph specification file like this:

TitleText: Sample Data

"Plot one"
1 2
2 3
3 4
4 5
5 6

"Plot two"
1 1
2 4
3 9
4 16
5 25

"Plot three"
1 10
2 8
3 6
4 4
5 2

This should be pretty self-explanatory: there are three different relationships being plotted, and we can name them by putting a string in quotes along with the block giving the data. The first column gives x values, the second gives y values.

To see the graph, save this to a file like ''foo.txt'' and do this:

$ xgraph foo.txt

Unfortunately, you have to either be logged onto the machine or have X-forwarding working in order for this to work. It is a known issue that xgraph segfaults when you try to output directly to a file, which you are supposed to be able to do this way:

$ xgraph -device ps -o graph.txt
Segmentation fault

Another alternative is gnuplot, which will also work if you are working remotely. First, let's set it up for a machine you are sitting in front of, or for when you have X-forwarding working.

Let's start by creating two data files 'numbers1.dat' and 'numbers2.dat'.

  * numbers1.dat

1 2
2 3
3 4
4 5
5 6

  * numbers2.dat

1 1
2 4
3 9
4 16
5 25

Now, create a command file for gnuplot in the same directory; call it, say, ''plot.gp'' (the name is arbitrary), with the following contents:

set xlabel 'My X-axis label'
set ylabel 'My Y-axis label'
plot 'numbers1.dat' title 'linear' with l, \
     'numbers2.dat' title 'squared' with l

To see the output of visualizing the data, do this (assuming the command file is called ''plot.gp''):

$ gnuplot -persist plot.gp

If you instead want to save the graph to a Postscript file (which you'll need to do if doing this remotely), put the terminal and output settings before the plot command in your gnuplot file:

set terminal postscript landscape enhanced mono dashed lw 1 'Helvetica' 14
set out 'graph.ps'
set xlabel 'My X-axis label'
set ylabel 'My Y-axis label'
plot 'numbers1.dat' title 'linear' with l, \
     'numbers2.dat' title 'squared' with l

This saves the graph to the file ''graph.ps'' (again, the name is up to you). You can then use ''scp'' to retrieve the file from the remote machine and look at it on yours. If you are on a *nix machine, you can do it as follows. Say your login name is johndoe, the remote machine is called ''lab-machine'' (substitute the real hostname), and ''graph.ps'' is in your home directory ''/home/johndoe'':

$ scp johndoe@lab-machine:graph.ps .

That will securely copy ''graph.ps'' to your machine.

If your home machine is a Windows box, use PSCP from PuTTY to copy the file to your home machine.