Makefiles

Kevin Thornton
Advanced Informatics, week 4

The problem

  • You have a bunch of scripts expecting input and generating output
  • How do you quickly figure out if everything is done?
  • How do you know every file is based on the latest code?

One solution

Manually inspect the time stamp of each and every file.

Hint: that doesn't really work well.

Another solution

Generate a “master” shell script that checks time stamps, etc., and only does the work if something has changed.

Hint: this is re-inventing an extremely complex wheel.

Another solution

  • A Makefile
  • The program “make” is, in effect, that really complicated shell script that I just referred to.

What does it do?

  • It combines “targets”, “patterns”, “dependencies”, and “rules” into a workflow.
  • Individual steps only run when a target is missing or older than one of its dependencies.

Why was it created?

  • To compile source code into programs
  • This is a hugely repetitive task where the same pattern is used over and over

Why do this?

  • Because it tells you when you are done!
make 
make: Nothing to be done for `all'.

Example

all: Figs/fig1.pdf

Figs/fig1.pdf: R/fig1.R
	cd R;R --no-save --quiet < fig1.R

clean:
	find . -name '*.pdf' | xargs rm -f

Break it down

  • all is the first target in the Makefile, which makes it the default target executed by make.
  • Here, “all” means we make a single figure that'll end up in the subdirectory “Figs”
all: Figs/fig1.pdf

Break it down

  • This is the rule to make fig1.pdf
  • It depends on R/fig1.R
  • The command to run is given on the indented line of the rule (the indent must be a tab, not spaces)
Figs/fig1.pdf: R/fig1.R
	cd R;R --no-save --quiet < fig1.R

Break it down

  • clean is the name of another target
  • It deletes every PDF file under the current directory, not just the ones make created
  • Be really careful how you write your clean targets!!
clean:
	find . -name '*.pdf' | xargs rm -f
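
One way to make a clean target less dangerous is to declare it phony and restrict it to the directory that make writes into. A minimal sketch, assuming (as in the example above) that figures are only ever generated into Figs/; the .PHONY line is an addition that is not in the original example:

```make
# Sketch only: assumes every generated PDF lives in Figs/.
# Recipe lines must start with a tab character.

# Tell make that "clean" is a command name, not a file to build,
# so it still runs even if a file called "clean" exists.
.PHONY: clean

clean:
	rm -f Figs/*.pdf
```

Compared with the find version, this cannot touch PDFs elsewhere in the project, at the cost of missing any output written outside Figs/.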

Target dependencies can be complex

Figs/fig2.pdf: R/fig2.R R/makefig2data.R data/fig2data.txt
	cd R;R --no-save --quiet < fig2.R

data/fig2data.txt: R/makefig2data.R
	cd R;R --no-save --quiet < makefig2data.R

Hmmm, this is getting repetitive...

all: Figs/fig1.pdf Figs/fig2.pdf

Figs/fig1.pdf: R/fig1.R
	cd R;R --no-save --quiet < fig1.R

Figs/fig2.pdf: R/fig2.R R/makefig2data.R data/fig2data.txt
	cd R;R --no-save --quiet < fig2.R

data/fig2data.txt: R/makefig2data.R
	cd R;R --no-save --quiet < makefig2data.R

Patterns!!!

  • We can write rules that apply to a filename pattern.
Figs/%.pdf: R/%.R
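	cd R;R --no-save --quiet < $(<F)
  • The '%' is a wildcard: it matches any non-empty string, and the same string is substituted into the dependency
  • $(<F) is an automatic variable which refers to the file part of the first dependency (fig1.R for R/fig1.R).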
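
To see what the automatic variables expand to, the recipe can be temporarily swapped for echo commands. This is a throwaway sketch using GNU make's documented automatic variables; the echo lines are illustrative and do not build anything:

```make
# Sketch only: a demo rule that prints automatic variables instead of running R.
# Recipe lines must start with a tab character.
Figs/%.pdf: R/%.R
	@echo "target (\$$@)         = $@"
	@echo "first dep (\$$<)      = $<"
	@echo "file part (\$$(<F))   = $(<F)"
	@echo "dir part (\$$(<D))    = $(<D)"
	@echo "stem (\$$*)           = $*"
```

With R/fig1.R present, running make Figs/fig1.pdf should report Figs/fig1.pdf, R/fig1.R, fig1.R, R and fig1 respectively.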

Getting more compact

all: Figs/fig1.pdf Figs/fig2.pdf

Figs/%.pdf: R/%.R
	cd R;R --no-save --quiet < $(<F)

Figs/fig1.pdf: R/fig1.R

Figs/fig2.pdf: R/fig2.R R/makefig2data.R data/fig2data.txt

data/fig2data.txt: R/makefig2data.R
	cd R;R --no-save --quiet < makefig2data.R

clean:
	find . -name '*.pdf' | xargs rm -f
	rm -f data/*
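
The list of figures can itself be generated rather than typed out. A possible sketch, assuming GNU make (for the wildcard and patsubst functions) and assuming every figure script is named R/fig*.R; the variable names SCRIPTS and FIGS are made up for illustration:

```make
# Sketch only: assumes GNU make and that every figure script matches R/fig*.R.
# Recipe lines must start with a tab character.

# All figure scripts currently in R/ (e.g. R/fig1.R R/fig2.R).
SCRIPTS := $(wildcard R/fig*.R)

# Map each script to the PDF it produces (e.g. Figs/fig1.pdf Figs/fig2.pdf).
FIGS := $(patsubst R/%.R,Figs/%.pdf,$(SCRIPTS))

all: $(FIGS)

Figs/%.pdf: R/%.R
	cd R;R --no-save --quiet < $(<F)

# Extra dependencies can still be listed per target, without a recipe.
Figs/fig2.pdf: R/makefig2data.R data/fig2data.txt

data/fig2data.txt: R/makefig2data.R
	cd R;R --no-save --quiet < makefig2data.R
```

Adding R/fig3.R would then be enough for make to pick up Figs/fig3.pdf, with no change to the Makefile.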

General comments

  • make translates to “make -f Makefile all” by default (all is the default only because it is the first target in the file)
  • Name the file Makefile, not makefile. GNU make will read either, but Makefile is the convention. (File-name case messes up OS X users all the time.)
  • Multiple Makefiles are totally fine (and in fact encouraged!!!):
make -f Makefile.process_intermediate_files
make -f Makefile.figures
make -f Makefile.latex
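
If the separate Makefiles need to run in a fixed order, a small top-level Makefile can delegate to them. A minimal sketch: the stage order (intermediate files, then figures, then LaTeX) and the target names are assumptions, not something the file names above guarantee:

```make
# Sketch only: the stage order below is an assumption.
# Recipe lines must start with a tab character.
.PHONY: all intermediate figures latex

all: latex

intermediate:
	$(MAKE) -f Makefile.process_intermediate_files

figures: intermediate
	$(MAKE) -f Makefile.figures

latex: figures
	$(MAKE) -f Makefile.latex
```

Using $(MAKE) rather than a literal make lets options such as -j pass through to the sub-makes.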

Use cases

  • Packageable, highly repetitive tasks
  • Take a list of FASTQ files + a reference. Run the aligner.
  • Given a list of .bam files, do something with samtools.
  • Given a list of .vcf files, do something with GATK.
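
As one concrete instance of the .bam case, indexing every alignment file might look like the sketch below. The bam/ directory and the variable names are hypothetical, and the files are assumed to be coordinate-sorted already, which samtools index requires:

```make
# Sketch only: bam/ is a hypothetical directory holding sorted .bam files.
# Recipe lines must start with a tab character.
BAMS    := $(wildcard bam/*.bam)
INDEXES := $(addsuffix .bai,$(BAMS))

all: $(INDEXES)

# samtools index writes <file>.bam.bai next to the input by default.
%.bam.bai: %.bam
	samtools index $<
```

Only .bam files that are newer than their index, or have no index yet, get re-indexed on the next run.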

Pros

  • Reproducibility.
  • Forced organization. Your file names all gotta make sense.
  • Your project's workflow becomes self-documenting (to the extent that a Makefile is readable by mortals).
  • Automagic parallel computing:
make -j 8 -f Makefile.figs

Cons

  • It slows you down
  • It is a whole lot of arcane Unix nerdiness
  • Advanced pattern/rule stuff is tricky.

More resources