These skills are absolutely essential for day-to-day work in this lab. They are NOT a requirement to join this lab, nor are they the primary or major things you will learn here. Rather, they are important prerequisites for successful work on projects in this lab, and if you do not know them already, learning them will be one of your first tasks in this lab.

It is my job and yours to ensure that you know these skills well, and we will work together to get you from wherever you are right now to wherever you need to be so that you can succeed in your projects and ultimately your degree. What matters is that you are willing to learn these skills, and learn them well, if you want to succeed in this lab (or field, for that matter).

It is important for you to recognize that these skills/tools are what you will eventually be using on a day-to-day and hour-to-hour basis here, and thus in many ways capture the spirit of work in this lab (and this field). They are not just tedious/painful/necessary evils, or something you are going to pick up quickly to put on your CV and then put behind you, but rather will be part and parcel of your daily life for your entire tenure in this lab. These skills are to a computational biologist the equivalent of sequencing and sequence analysis skills to a molecular biologist, knowledge of anatomy to a surgeon, basic fighter maneuvers to a fighter pilot, etc. So, if this lab (or, indeed, this field) is where you want to invest your precious time and energy, then you are going to need to not just learn these skills, but actually enjoy using them (at least some of them some of the time), or otherwise, frankly, you are not going to be happy here, much less succeed.

Note that these skills are just the beginning of what you will learn in this lab. You are going to learn a whole lot more, but the particulars are going to depend on your interests and projects, whereas the skills described here are going to be important no matter what you do, because you will definitely be using them, at least partially, in anything and everything that you do.

Foundational

Working in a POSIX-like operating system shell: Bash

What does it mean to “work in the bash shell”? Here, it means being able to use the shell to navigate around the filesystem (which, in turn, presupposes a conceptual understanding of the filesystem organization), manipulate (create, move/rename, copy, delete) files and directories in the filesystem, as well as carry out other general “house-keeping” tasks. In particular, it means that you know and understand the following concepts or operations in bash:

  • Current working directory
  • Relative vs. absolute paths
  • File/directory ownership and permissions
  • Command invocation and arguments, including running privileged vs. non-privileged commands
  • Basic filesystem navigation and manipulation:
    • How to see the files in a particular directory (current or otherwise), including getting information on file sizes, modification dates, types, etc.
    • How to create, copy, move/rename, open, edit, and delete files.
    • Wildcard usage
    • How to create and expand compressed archives of files/directories.
  • Variables: setting, printing, and expansion in commands.
  • Important environment variables such as $PATH, $PWD, etc.
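
As a rough illustration of the concepts listed above, the following sketch exercises most of them inside a throwaway directory. The directory and file names (`project`, `notes.txt`, `backup.tar.gz`, `restored`) are placeholders invented for this example, not names the lab prescribes.

```shell
#!/usr/bin/env bash

# Work inside a throwaway directory so none of your own files are touched.
sandbox="$(mktemp -d)"
cd "$sandbox"
echo "Current working directory: $PWD"

# Create a directory tree and a file, then copy and move/rename it.
mkdir -p project/data
echo "first note" > project/notes.txt
cp project/notes.txt project/data/copy.txt
mv project/data/copy.txt project/data/sample.txt

# Relative vs. absolute paths: both of these refer to the same file.
relative_path="project/notes.txt"
absolute_path="$PWD/project/notes.txt"
cat "$relative_path"
cat "$absolute_path"

# Long listing (permissions, owner, size, modification date), then wildcards.
ls -l project
ls project/*.txt project/data/*.txt

# Permissions: remove write permission for everyone, then inspect the result.
chmod a-w project/notes.txt
ls -l project/notes.txt

# Compressed archives: create one, then expand it into another directory.
tar -czf backup.tar.gz project
mkdir restored
tar -xzf backup.tar.gz -C restored

# Delete a file; the copy inside the expanded archive is unaffected.
rm project/data/sample.txt

# Variables: setting, printing, and expansion inside a command.
greeting="hello"
echo "${greeting}, shell"
echo "Commands are searched for in: $PATH"
```

Running each command by hand, and checking the result with `ls -l` or `pwd` after every step, is a reasonable way to build intuition before moving on to scripts.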

Resources to help you to learn this:

VS Code: A powerful idea-to-product-and-publication development environment

You will spend an inordinate amount, perhaps the majority, of your time working in the text editor, and you need to be part of it as much as it is part of you. As such, this is a very personal choice. At the same time, though, there are some important criteria that your choice needs to fulfill for it to be a workable platform on which you can shine. Critical features for such a text editor include:

  • read/write plain text format
  • handle and/or convert between different line-endings
  • regular expression search/search-and-replace
  • display of and jumping to particular line numbers
  • show hidden characters
  • syntax highlighting

Examples of text editors that fulfill the above criteria are:

  • VS Code: Powerful integrated development environment, not just for writing everything from one-off scripts and major coding projects to manuscripts and websites, but also for compiling, running, debugging, and publishing them.
  • NeoVim or Vim: Pure text editors—you can do all your writing and coding here, but need plugins to reach the comprehensive development and authoring support of VS Code.
    • Note: I had been using NeoVim/Vim for over 15 years, extending its capabilities not only with community-generated scripts but also by spending inordinate amounts of time developing my own, with a resulting ecosystem as capable as VS Code. However, once I discovered that VS Code had gained the ability to support a full NeoVim instance as its internal editor, most of my workflow switched to VS Code almost overnight.
  • Emacs: An operating system with a text-editor.

Also, I should not need to say this, but just in case there is any confusion: Microsoft Word, NotePad, TextPad, Gedit, Nano, etc. do not count as appropriate text editors.

A Version Control System: Git.

You are free to learn and use other version control systems (CVS, SVN, Mercurial, Fossil, etc.), but you are going to need to be very comfortable using Git simply because of its prevalence in our field, and for most collaborative projects within the lab (with me, at least) or even with other labs/folks, you are going to find Git is the de facto standard. Note that “Git” is a distributed version control system, and “GitHub” is a cloud-based Git repository hosting service. They are not one and the same, and you might find that quite a number of your Git repositories are not hosted on GitHub.
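
To make the Git vs. GitHub distinction concrete, here is a minimal sketch of a repository that lives entirely on the local machine. It assumes `git` is installed; the file name, commit message, and the name and email passed on the command line are placeholders for this example only.

```shell
#!/usr/bin/env bash

# Create a brand-new repository in a throwaway directory.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Add one file and record it in the repository's history.
echo "# My analysis" > README.md
git add README.md
# The identity is passed inline so the example does not rely on global config.
git -c user.name="Example Student" -c user.email="student@example.org" \
    commit -q -m "Add README"

# The full history is available locally.
git log --oneline

# No remotes are configured, so this prints nothing: no hosting service
# (GitHub or otherwise) is involved at any point.
git remote
```

A remote such as GitHub only enters the picture once you add one to share or back up the repository; everything above works offline.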

Resources to help you learn Git:

Programming Languages

You will almost certainly be doing some sort of programming or scripting in this lab. The languages you might learn will be dependent on your particular project needs and requirements, but will certainly include at least some, and quite probably all, of the following:

Julia: For scientific, mathematical, and research computing or simulations

Julia is the language that I now recommend that all students adopt for their research projects, unless there are legacy reasons to use Python.

Why?

How?

Setting up
Learning
Extra-/Post-Curricular metacognition and practice

A great way to self-assess, practice, or otherwise have a LOT of fun, is to put your programming into practice on the:

Rosalind Challenges http://rosalind.info/problems/list-view/

Bash: Basic shell scripting

In addition to using bash to navigate and manipulate the file system, here you will also be using bash to script a lot of your work. Typically, this will be in a supporting or supplemental role rather than primary. That is, the main computing/calculations will usually be done in other languages or programs, and you will be using bash to set up jobs or carry out post-processing on the results.
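
As one hedged illustration of bash in this supporting role, the sketch below post-processes the output of a hypothetical analysis. The log file names (`run-01.log`, `run-02.log`), the `score:` line format, and the `summary.tsv` output are all invented for this example; a real project would substitute whatever its main program actually writes.

```shell
#!/usr/bin/env bash

workdir="$(mktemp -d)"

# Stand-ins for results written by the main analysis program.
printf 'iteration 1\nscore: 12.5\n' > "$workdir/run-01.log"
printf 'iteration 1\nscore: 9.75\n' > "$workdir/run-02.log"

# Start a tab-separated summary table with a header row.
summary="$workdir/summary.tsv"
printf 'run\tscore\n' > "$summary"

for log in "$workdir"/run-*.log; do
    # Strip the directory and the .log suffix to get the run name.
    run_name="$(basename "$log" .log)"
    # Take the last "score:" line and keep only the number after it.
    score="$(grep '^score:' "$log" | tail -n 1 | awk '{print $2}')"
    printf '%s\t%s\n' "$run_name" "$score" >> "$summary"
done

cat "$summary"
```

The heavy computation happens elsewhere; the script only gathers the pieces into a form that is easy to load into an analysis language.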

Some resources to help you get started:

A high-performance programming language

Use Julia.

Your projects may not need the performance advantages of a high-performance language sufficiently to justify the considerable overhead in time, effort, and care required to program in one over other languages such as Python.

If they do, however, then the choice of the language will be driven largely by a number of factors, such as: Are you starting from scratch, or is there an existing code base that might be worth building on or otherwise modifying? Are libraries/frameworks available in the language that make your life easier? Which language appeals to you the most, for whatever reason?

Data science: statistical analyses, modeling, and visualization

Use Julia.

If you have had any exposure to any scientific work in the past few years, you almost certainly have heard of R. R is a language for statistical computing and visualization. You are definitely going to be using it in this lab and quite probably in your future careers. R is not always the best choice (or even a good choice) for many things that you might want to do (or, for that matter, unfortunately, many of the things that people do use R for), but for a number of other things, it is. Chances are it is going to be a supplemental analysis language for your projects rather than the primary one. Either way though, it is difficult to imagine a situation where you are going to be successful in this lab (or this field) without knowing R.

Regular Expressions: Arcane invocation of surgically precise yet powerful text manipulation

Regular expressions are a way of specifying patterns of strings. This may not sound terribly impressive, but if someone were to tell me that regular expressions are probably one of the coolest things to come out of the programming world after programming itself, I’m not sure I would necessarily object. The fact is that regular expressions are impressive, giving you an incredibly powerful, precise, flexible, and elegant way to search for and manipulate data. You will use them not just in your programming, but also when editing text and documents including your programs, which is why support for regular expressions is one of the mandatory criteria for a “good” text editor as I describe above.
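
A small sketch of both uses, searching and search-and-replace, with the standard `grep -E` and `sed -E` tools. The FASTA-like records and the header layout (`>id|species|date`) are made up for this example.

```shell
#!/usr/bin/env bash

# Hypothetical sequence records; header lines start with ">".
records="$(mktemp)"
cat > "$records" <<'EOF'
>seq1|Homo_sapiens|2019-03-14
ACGTACGT
>seq2|Mus_musculus|2020-11-02
TTGACCAA
EOF

# Search: print only header lines whose last field looks like YYYY-MM-DD.
#   ^>        line starts with ">"
#   [|]       a literal pipe character
#   [0-9]{4}  exactly four digits, and so on
grep -E '^>.*[|][0-9]{4}-[0-9]{2}-[0-9]{2}$' "$records"

# Search-and-replace: capture year, month, and day in groups 1 to 3,
# then write them back in day/month/year order.
reformatted="$(sed -E 's#([0-9]{4})-([0-9]{2})-([0-9]{2})#\3/\2/\1#' "$records")"
echo "$reformatted"
```

The same patterns carry over, with minor syntax differences, to the search-and-replace dialog of your text editor and to the regular expression libraries of most programming languages.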

Python: for interoperability and legacy code

We will use Julia for all our research computing, which supports integration with Python libraries if a Julia equivalent does not exist (yet!).

Python, however, remains useful—if not indispensable—in many areas of computational biology, especially where domain-specific tools, pretrained models, or data pipelines have been implemented only in Python. While you may not need to develop major software in Python during your time in this lab, you will very likely need to be able to read, modify, and interface with Python codebases, particularly in situations where:

  • existing methods or tools are only available in Python (e.g., for genomics, machine learning, NLP, image analysis, etc.),
  • collaborators or data providers supply scripts or packages written in Python, or
  • you are using shared infrastructure such as Jupyter notebooks, pipeline management systems, or bioinformatics platforms built around the Python ecosystem.

Documentation and Authoring

Markdown: A lightweight markup language with heavyweight reach

Documentation (and good documentation at that) is a fundamental part of good programming. All projects that we work on in this lab are going to be well-documented. There are a number of documentation languages out there, and it is quite possible that you are going to use more than one. The choices are often driven by the programming language or languages that a particular project is developed in.

Regardless of all this, you are almost certainly going to need to use Markdown (or one of its flavors) to document your project at some level, even if it is just in the design stage or in the README at its publication stage. You will almost certainly use Markdown to communicate your ideas with me and other collaborators, both within the lab and outside.

Quarto: A scientific and technical publishing system from books to papers to websites.

Quarto is a Markdown-based publishing system that allows you to publish your writings in beautiful high-quality print or web formats, and, in conjunction with embedded LaTeX for mathematical expressions and the Pandoc-style citation system, provides all the support you need to express yourself scientifically, technically, and/or creatively.

Programming Best Practices

There is a lot more to methods/software development than just the programming language. Robust program design, testing (including unit tests, functional tests, integration tests, etc.), version control, and documentation are all important aspects of our work here. You will learn not just how to code, but how to code well.

Supporting Tools and Concepts

  1. Python packaging and package creation, usage, and management.
  2. Python virtual environments, virtualenv and virtualenvwrapper: usage concepts, setup and installation, management, activating/deactivating.
  3. Makefiles and build tools. How to compile and install a pre-written C/C++ program, conceptually (i.e. the stages of building a C++ program, compiling vs. linking, library header file includes, static vs. dynamic library linking, etc.) as well as practically (e.g. “./configure --prefix=$HOME/Environment && make && make install”).
  4. SSH/SCP usage. SSH keys, etc.
  5. Using a job scheduler (SGE, SLURM, etc.)
  6. Virtual machines: Vagrant
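
For the compiling vs. linking distinction in item 3, a minimal sketch follows. It assumes a C compiler is available under the name `cc`; the file names (`hello.c`, `hello.o`, `hello`) are placeholders. A real package would normally drive these same steps through `make` rather than by hand.

```shell
#!/usr/bin/env bash

# Build in a throwaway directory.
build="$(mktemp -d)"
cd "$build"

# A one-file C program standing in for a real code base.
cat > hello.c <<'EOF'
#include <stdio.h>
int main(void) { puts("hello from a compiled program"); return 0; }
EOF

# Stage 1, compile: translate the source file into an object file.
# The header <stdio.h> is read here, but no executable exists yet.
cc -c hello.c -o hello.o

# Stage 2, link: combine the object file with the C library
# to produce the final executable.
cc hello.o -o hello

./hello
```

Keeping the two stages apart is what lets build tools recompile only the source files that changed and then relink.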

Useful

  1. Relational database design and usage (including SQL)
  2. JavaScript
  3. HTML, CSS
Copyright (C) 2018-2020 Jeet Sukumaran. All rights reserved.