These skills are absolutely essential for day-to-day work in this lab. They are NOT a requirement to join this lab, nor are they the primary or major things you will learn here. Rather, they are important prerequisites for successful work on projects in this lab, and if you do not know them already, learning them will be one of your first tasks in this lab.

It is my job and yours to ensure that you know these skills well, and we will work together to get you from wherever you are right now to wherever you need to be so that you can succeed in your projects and ultimately your degree. What matters is that you are willing to learn these skills, and learn them well, if you want to succeed in this lab (or field, for that matter).

It is important for you to recognize that these skills/tools are what you will eventually be using on a day-to-day and hour-to-hour basis here, and thus in many ways capture the spirit of work in this lab (and this field). They are not just tedious/painful/necessary evils, or something you are going to pick up quickly to put on your CV and then put behind you, but rather will be part and parcel of your daily life for your entire tenure in this lab. These skills are to a computational biologist the equivalent of sequencing and sequence analysis skills to a molecular biologist, knowledge of anatomy to a surgeon, basic fighter maneuvers to a fighter pilot, etc. So, if this lab (or, indeed, this field) is where you want to invest your precious time and energy, then you are going to need to not just learn these skills, but actually enjoy using them (at least some of them some of the time), or otherwise, frankly, you are not going to be happy here, much less succeed.

Note that these skills are just the beginning of what you will learn in this lab. You are going to learn a whole lot more, but the particulars are going to depend on your interests and projects, whereas the skills described here are going to be important no matter what you do, because you will definitely be using them, at least partially, in anything and everything that you do.

Foundational

Working in a POSIX-like operating system shell: Bash

What does it mean to “work in the bash shell”? Here, it means being able to use the shell to navigate around the filesystem (which, in turn, presupposes a conceptual understanding of how the filesystem is organized), manipulate (create, move/rename, copy, delete) files and directories in the filesystem, as well as carry out other general “house-keeping” tasks. In particular, it means that you know and understand the following concepts or operations in bash:

  • Current working directory
  • Relative vs. absolute paths
  • File/directory ownership and permissions
  • Command invocation and arguments, including running privileged vs. non-privileged commands
  • Basic filesystem navigation and manipulation:
    • How to see the files in a particular directory (current or otherwise), including getting information on file sizes, modification dates, types, etc.
    • How to create, copy, move/rename, open, edit, and delete files.
    • Wildcard usage
    • How to create and expand compressed archives of files/directories.
  • Variables: setting, printing, and variable expansion in commands.
  • Important environment variables such as $PATH, $PWD, etc.
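
To make the list above concrete, here is a minimal sketch that touches most of these operations in one sitting. It is only an illustration, not a substitute for a proper tutorial: the directory and file names (project, data, sample1.txt, etc.) are made up for the example, and everything happens inside a throwaway directory created by mktemp.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work inside a throwaway directory so nothing of yours is touched.
# (All names below are hypothetical, chosen only for illustration.)
sandbox="$(mktemp -d)"
cd "$sandbox"

# Current working directory: the shell tracks it in $PWD.
echo "Working in: $PWD"

# Create a directory tree and some files, then list them with details.
mkdir -p project/data
printf 'ACGT\n' > project/data/sample1.txt
printf 'TTGA\n' > project/data/sample2.txt
ls -l project/data

# Relative vs. absolute paths: both name the same file.
relative_path="project/data/sample1.txt"
absolute_path="$PWD/project/data/sample1.txt"
cat "$relative_path" "$absolute_path"

# Copy, move/rename, and delete.
cp project/data/sample1.txt project/data/sample1.bak
mv project/data/sample1.bak project/data/backup.txt
rm project/data/backup.txt

# Wildcards: the shell expands *.txt before the command runs.
txt_count="$(ls project/data/*.txt | wc -l | tr -d ' ')"
echo "Text files: $txt_count"

# Permissions: make a file read-only for everyone.
chmod 444 project/data/sample2.txt

# Compressed archives: create one, then expand it elsewhere.
tar -czf project.tar.gz project
mkdir restored
tar -xzf project.tar.gz -C restored

# Variables and variable expansion in a command.
label="restored"
restored_file="$label/project/data/sample1.txt"
echo "Contents of $restored_file: $(cat "$restored_file")"

# $PATH is a colon-separated list of directories searched for commands.
echo "First PATH entry: ${PATH%%:*}"
```

If every line of this reads as obvious to you, you are in good shape; if not, the lines that puzzle you are a reasonable place to start.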

Resources to help you to learn this:

A Text Editor

You will spend an inordinate amount, perhaps the majority, of your time working in the text editor, and you need to be part of it as much as it is part of you. As such, this is a very personal choice. At the same time, though, there are some important criteria that your choice needs to fulfill for it to be a workable platform on which you can shine. Critical features for such a text editor include:

  • read/write plain text format
  • handle and/or convert between different line-endings
  • regular expression search/search-and-replace
  • display of and jumping to particular line numbers
  • show hidden characters
  • syntax highlighting

Examples of text editors that fulfill the above criteria are:

You may decide that a dedicated Integrated Development Environment (IDE) such as PyCharm or RStudio may suit your needs better than a plain text editor. You may be right some of the time for some of your projects (assuming that you are already familiar with the nuts-and-bolts, behind-the-scenes details that IDEs hide from you), but this will not be the case all of the time for all of your projects. Being absolutely comfortable with a powerful text editor is a foundational skill that is not only indispensable but mandatory in this lab. So, whether or not you eventually use a fancy IDE for your work, you are going to need to (and want to) learn to use a plain text editor really well.

Also, I should not need to say this, but just in case there is any confusion: Microsoft Word, NotePad, TextPad, Gedit, Nano, etc. do not count as appropriate text editors.

A Version Control System: Git.

You are free to learn and use other version control systems (CVS, SVN, Mercurial, Fossil, etc.), but you are going to need to be very comfortable using Git simply because of its prevalence in our field, and for most collaborative projects within the lab (with me, at least) or even with other labs/folks, you are going to find Git is the de facto standard. Note that “Git” is a distributed version control system, and “GitHub” is a cloud-based Git repository hosting service. They are not one and the same, and you might find that quite a number of your Git repositories are not hosted on GitHub.
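
As a small illustration of the Git vs. GitHub distinction, the sketch below creates a repository, records two commits, and inspects the history without ever talking to a hosting service. It assumes git is installed; the repository location, file name, and the name/email identity are placeholders invented for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# A throwaway repository; nothing here touches GitHub or any other host.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Identity for this repository only (placeholder values).
git config user.name "Example Student"
git config user.email "student@example.org"

# First commit: add a file and record it.
printf 'Notes on the project\n' > README.txt
git add README.txt
git commit -q -m "Add README"

# Second commit: change the file and record the change.
printf 'More notes\n' >> README.txt
git commit -q -am "Expand README"

# The full history lives locally in .git; no remote is configured.
commit_count="$(git rev-list --count HEAD)"
remote_count="$(git remote | wc -l | tr -d ' ')"
echo "Commits: $commit_count, remotes: $remote_count"
```

A remote (on GitHub or anywhere else) is something you add to a repository like this later, if and when you want to share it.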

Resources to help you learn Git:

Programming Languages

You will almost certainly be doing some sort of programming or scripting in this lab. The languages you might learn will be dependent on your particular project needs and requirements, but will certainly include at least some, and quite probably all, of the following:

Basic Shell Scripting: Bash.

In addition to using bash to navigate and manipulate the file system, here you will also be using bash to script a lot of your work. Typically, this will be in a supporting or supplemental role rather than primary. That is, the main computing/calculations will usually be done in other languages or programs, and you will be using bash to set up jobs or carry out post-processing on the results.
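
A minimal sketch of this supporting role follows. The run_analysis function is a hypothetical stand-in for whatever program does the real computation (here it just reports the length of the first line of its input), and the file names and layout are invented for the example. The point is the shape: loop over inputs, run one job per input, then gather the results.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"
mkdir -p "$workdir/input" "$workdir/results"

# Stand-in input files; in real use these would already exist.
printf 'ACGTACGT\n' > "$workdir/input/sampleA.seq"
printf 'GGCC\n' > "$workdir/input/sampleB.seq"

# Placeholder for the real analysis program (hypothetical): it just
# reports the number of characters in the first line of its input.
run_analysis() {
    local infile="$1"
    local first_line
    IFS= read -r first_line < "$infile"
    echo "${#first_line}"
}

# Set up and run one "job" per input file.
for infile in "$workdir"/input/*.seq; do
    name="$(basename "$infile" .seq)"
    run_analysis "$infile" > "$workdir/results/$name.out"
done

# Post-processing: gather the per-sample results into one table.
summary="$workdir/summary.tsv"
printf 'sample\tlength\n' > "$summary"
for outfile in "$workdir"/results/*.out; do
    name="$(basename "$outfile" .out)"
    printf '%s\t%s\n' "$name" "$(cat "$outfile")" >> "$summary"
done

cat "$summary"
```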

Some resources to help you get started:

A Modern, Robust, Object-Oriented, And Scalable Scripting Language: Python.

Resources to help you learn Python:

A High-Performance Programming Language: C++, Julia, etc.

Your projects may not need the performance advantages of these languages sufficiently to justify the considerable overhead in time, effort, and care required to program in them over other languages such as Python. If they do, however, then the choice of the language will be driven largely by a number of factors such as: are you starting from scratch or is there an existing code base that might be worth building on or otherwise modifying? are libraries/frameworks available in the language that make your life easier? which language appeals to you the most for whatever reason?

Regular Expressions

Regular expressions are a way of specifying patterns of strings. This may not sound terribly impressive, but if someone were to tell me that regular expressions are probably one of the coolest things to come out of the programming world after programming itself, I’m not sure I would necessarily object. The fact is that regular expressions are impressive, giving you an incredibly powerful, precise, flexible, and elegant way to search for and manipulate data. You will use them not just in your programming, but also when editing text and documents including your programs, which is why support for regular expressions is one of the mandatory criteria for a “good” text editor as I describe above.
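
To give a flavor of what this looks like in practice, here is a small sketch using grep and sed with extended regular expressions. The FASTA-style records and their "id|species|year" header layout are made up for the example; your own data will look different, but the ideas (anchors, character classes, repetition, capture groups) carry over.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Made-up FASTA-style records for illustration.
records="$(mktemp)"
cat > "$records" <<'EOF'
>seq1|Homo_sapiens|2015
ACGTTGCA
>seq2|Mus_musculus|2018
TTGACCGT
>seq3|Homo_sapiens|2018
GGCATGCA
EOF

# Search: header lines whose last field is the year 2018.
# ^> anchors to the start of a header line; [^|]+ is a run of
# non-pipe characters; [|] is a literal pipe; 2018$ requires the
# line to end with that year.
headers_2018="$(grep -E '^>[^|]+[|][^|]+[|]2018$' "$records")"
echo "$headers_2018"

# Count: how many lines end in TGCA?
tgca_count="$(grep -E -c 'TGCA$' "$records")"
echo "Lines ending in TGCA: $tgca_count"

# Search-and-replace: reorder each header to "species year id" using
# the capture groups \1 (id), \2 (species), and \3 (four-digit year).
reordered="$(sed -E -n 's/^>([^|]+)[|]([^|]+)[|]([0-9]{4})$/\2 \3 \1/p' "$records")"
echo "$reordered"
```

The same patterns, give or take minor syntax differences, work in the search-and-replace dialog of any text editor that meets the criteria above.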

A Statistical Analysis System: R

If you have had any exposure to any scientific work in the past few years, you almost certainly have heard of R. R is a language for statistical computing and visualization. You are definitely going to be using it in this lab and quite probably in your future careers. R is not always the best choice (or even a good choice) for many things that you might want to do (or, for that matter, unfortunately, many of the things that people do use R for), but for a number of other things, it is. Chances are it is going to be a supplemental analysis language for your projects rather than the primary one. Either way though, it is difficult to imagine a situation where you are going to be successful in this lab (or this field) without knowing R.

Documentation Languages

A Lightweight Documentation Markup Language: Markdown, reStructuredText

Documentation (and good documentation at that) is a fundamental part of good programming. All projects that we work on in this lab are going to be well-documented. There are a number of documentation languages out there, and it is quite possible that you are going to use more than one. The choices are often driven by the programming language or languages that a particular project is developed in.

Regardless of all this, you are almost certainly going to need to use Markdown to document your project at some level, even if it is just in the design stage or the README at its publication stage. You will almost certainly use Markdown to communicate your ideas with me and other collaborators, both within the lab and outside.

reStructuredText is the de facto standard documentation language for Python, in particular when using the de facto standard documentation engine, Sphinx. You might be able to get away with Markdown for the documentation of your project, Python or not, though Sphinx does provide a lot of features that may pay for themselves in terms of your time if you are trying to document something complex (e.g., a programming library) that requires multiple pages with some more sophisticated features (e.g., cross-referencing concepts/terms, a glossary, etc.).

A Full-Fledged Document Preparation System: LaTeX

If we are going to collaborate to write anything together, whether it is designing or documenting a program, a paper, or anything else, I am going to use LaTeX to do it. There are many reasons for this, but the most important are: (1) historical inertia; (2) it works well; (3) other document preparation languages do not work as well; (4) it lends itself well to version control and collaboration; and (5) I cannot stand using Microsoft Word.

There are a number of really nice tutorials and introductions to LaTeX available to help you get started, including:

Best of all, Overleaf (which hosts the first of the tutorials listed above) provides a platform not just to learn and experiment with LaTeX, but even, if you wish, to be used as your primary cloud-based collaborative version-controlled document writing platform.

Programming Best Practices

There is a lot more to methods/software development than just the programming language. Robust program design, testing (including unit tests, functional tests, integration tests, etc.), version control, and documentation are all important aspects of our work here. You will learn not just how to code, but how to code well.

Supporting Tools and Concepts

  1. Python packaging and package creation, usage, and management.
  2. Python virtual environments, virtualenv and virtualenvwrapper: usage concepts, setup and installation, management, activating/deactivating.
  3. Makefiles and build tools. How to compile and install a pre-written C/C++ program, conceptually (i.e. the stages of building a C++ program, compiling vs. linking, library header file includes, static vs. dynamic library linking, etc.) as well as practically (e.g. “./configure --prefix=$HOME/Environment && make && make install”).
  4. SSH/SCP usage. SSH keys, etc.
  5. Using a job scheduler (SGE, SLURM, etc.)
  6. Virtual machines: Vagrant
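
As one illustration from the list above, here is a rough sketch of what a job script for a scheduler such as SLURM can look like. The #SBATCH lines are ordinary comments as far as bash is concerned, so the same script can be run directly for testing. The job name, resource requests, and the "computation" itself are placeholders invented for the example; the actual limits and conventions will depend on the cluster you end up using.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=example-analysis
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=2
#SBATCH --mem=1G
#SBATCH --output=example-analysis-%j.log
set -euo pipefail

# Under SLURM these variables are set by the scheduler; the defaults
# let the same script run directly with bash for testing.
job_id="${SLURM_JOB_ID:-local}"
n_cpus="${SLURM_CPUS_PER_TASK:-1}"

# Per-job scratch directory so concurrent jobs do not collide.
scratch="$(mktemp -d)"
result_file="$scratch/result-$job_id.txt"

# Placeholder for the real computation (hypothetical).
echo "job=$job_id cpus=$n_cpus" > "$result_file"

cat "$result_file"
```

On a SLURM cluster a script like this would typically be submitted with sbatch rather than run directly; other schedulers (SGE, etc.) use the same idea with different directive syntax.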

Useful

  1. Relational database design and usage (including SQL)
  2. JavaScript
  3. HTML, CSS
Copyright (C) 2018 Jeet Sukumaran. All rights reserved.