Why Learn Unix/Shell/Bash/Linux for Bioinformatics?

Since the most common languages in bioinformatics are Python and R, one might wonder why it is still worth learning the Linux shell. After all, anyone who wants to rename several thousand data files can do so interactively in the Python interpreter, and anyone doing serious data analysis will probably do most of their work in the IPython Notebook or RStudio. So why teach the shell?

  • Why do we learn to use the shell?
    • It allows users to automate repetitive tasks
    • It captures small data manipulation steps that normally go unrecorded, which makes research more reproducible
  • The Problem
    • Running the same workflow on several samples can be unnecessarily labour intensive
    • Manual manipulation of data files:
      • is often not captured in documentation
      • is hard to reproduce
      • is hard to troubleshoot, review, or improve
  • The Shell
    • Workflows can be automated through the use of shell scripts
    • Built-in commands allow for easy data manipulation (e.g. sort, grep, etc.)
    • Every step is captured in the shell script, which allows reproducibility and easy troubleshooting
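
As a minimal sketch of that kind of automation, the script below runs the same step on every sample and records the results in one summary file. The file names, the demo FASTA contents, and the summary layout are assumptions made up for illustration; a real workflow would point the loop at existing data.

```shell
#!/usr/bin/env bash
# Sketch: count the sequences in every FASTA file in a directory.
set -euo pipefail

# Hypothetical demo data so the script runs on its own.
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$workdir/sample_A.fasta"
printf '>seq1\nTTAA\n' > "$workdir/sample_B.fasta"

# Start with an empty summary file.
summary="$workdir/sequence_counts.tsv"
: > "$summary"

# Apply the same step to each sample instead of repeating it by hand.
for fasta in "$workdir"/*.fasta; do
    sample=$(basename "$fasta" .fasta)
    # FASTA header lines start with '>'; grep -c counts them.
    # '|| true' keeps the script going when a file has no headers.
    count=$(grep -c '^>' "$fasta" || true)
    printf '%s\t%s\n' "$sample" "$count" >> "$summary"
done

cat "$summary"
```

Because the loop and the counting step live in a file, the same script can be rerun on new samples, reviewed by a colleague, or corrected later without anyone having to remember what was typed.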

“Because so much else depends on it.” Installing software, configuring your default editor, and controlling remote machines frequently assume a basic familiarity with the shell, and with related ideas like standard input and output. Many tools also use its terminology (for example, the %ls and %cd magic commands in IPython).
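
Standard input and output can be illustrated with a short sketch. The temporary files and their contents below are hypothetical and exist only so the example runs on its own.

```shell
# Hypothetical demo file with three lines.
reads_file=$(mktemp)
printf 'ACGT\nGGCC\nTTAA\n' > "$reads_file"

# '<' feeds the file to wc on standard input; wc writes the line count
# to standard output, and '|' passes that output on to tr, which strips
# the padding some versions of wc add. $(...) captures the final output.
line_count=$(wc -l < "$reads_file" | tr -d ' ')

# '>' redirects standard output into a file instead of the terminal.
count_file=$(mktemp)
echo "$line_count" > "$count_file"

cat "$count_file"
```

The same three symbols (`<`, `>`, and `|`) appear in the documentation of many command-line tools, which is part of why so much else assumes them.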

“Because it’s an easy way to introduce some fundamental ideas about how to use computers.” We learn to get the computer to repeat things (via tab completion, ! followed by a command number, and for loops) rather than repeating them ourselves. We learn to take things we have discovered we do frequently and save them for later re-use (via shell scripts), to give things sensible names, and to write a little bit of documentation (like a comment at the top of a shell script) to make our future selves’ lives better.

“Because it enables the use of many domain-specific tools and compute resources that researchers cannot access otherwise.” Familiarity with the shell is very useful for accessing remote machines, using high-performance computing (HPC) infrastructure, and running new specialist tools in many disciplines. HPC and domain-specific skills are not taught here, but the shell lays the groundwork for developing them. In particular, understanding the syntax of commands, flags, and help systems is useful for domain-specific tools, and understanding the file system (and how to navigate it) is useful for remote access.

Learning the shell lets us learn to think about programming in terms of function composition. In the case of the shell, this takes the form of pipelines rather than nested function calls, but the core idea of “small pieces, loosely joined” is the same.
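
A pipeline of this kind might look like the sketch below, which finds the two most frequent entries in a list. The gene names are made-up demo input; each command does one small job and hands its output to the next.

```shell
# Hypothetical input: a list of gene names with repeats.
top_genes=$(
    printf 'BRCA1\nTP53\nBRCA1\nEGFR\nTP53\nBRCA1\n' |
        sort |                       # group identical names together
        uniq -c |                    # collapse each group into "count name"
        sort -rn |                   # order by count, largest first
        head -n 2 |                  # keep the two most frequent
        awk '{print $2 "\t" $1}'     # tidy the output to "name<TAB>count"
)

echo "$top_genes"
```

Written as nested function calls, the same logic would read inside-out, roughly head(sort(uniq(sort(input)))). The pipeline expresses the identical composition left to right, and any stage can be swapped out or inspected on its own.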
