These skills are absolutely essential for day-to-day work in this lab. They are NOT a requirement to join this lab, nor are they the primary or major things you will learn here. Rather, they are important prerequisites for successful work on projects in this lab, and if you do not know them already, learning them will be one of your first tasks.
It is my job and yours to ensure that you know these skills well, and we will work together to get you from wherever you are right now to wherever you need to be so that you can succeed in your projects and ultimately your degree. What matters is that you are willing to learn these skills, and learn them well, if you want to succeed in this lab (or field, for that matter).
It is important for you to recognize that these skills/tools are what you will eventually be using on a day-to-day and hour-to-hour basis here, and thus in many ways capture the spirit of work in this lab (and this field). They are not just tedious/painful/necessary evils, or something you are going to pick up quickly to put on your CV and then put behind you, but rather will be part and parcel of your daily life for your entire tenure in this lab. These skills are to a computational biologist the equivalent of sequencing and sequence analysis skills to a molecular biologist, knowledge of anatomy to a surgeon, basic fighter maneuvers to a fighter pilot, etc. So, if this lab (or, indeed, this field) is where you want to invest your precious time and energy, then you are going to need to not just learn these skills, but actually enjoy using them (at least some of them some of the time), or otherwise, frankly, you are not going to be happy here, much less succeed.
Note that these skills are just the beginning of what you will learn in this lab. You are going to learn a whole lot more, but the particulars are going to depend on your interests and projects, whereas the skills described here are going to be important no matter what you do, because you will definitely be using them, at least in part, in anything and everything that you do.
What does it mean to “work in the bash shell”? Here, it means being able to use the shell to navigate around the filesystem (which, in turn, presupposes a conceptual understanding of how the filesystem is organized), manipulate (create, move/rename, copy, delete) files and directories in the filesystem, and carry out other general “house-keeping” tasks. In particular, it means that you know and understand the following concepts or operations in bash:
$PATH, $PWD, etc.
Resources to help you learn this:
You will spend an inordinate amount of time, perhaps the majority of it, working in the text editor, and you need to be part of it as much as it is part of you. As such, this is a very personal choice. At the same time, though, there are some important criteria that your choice needs to fulfill for it to be a workable platform on which you can shine. Critical features for such a text editor include:
Examples of text editors that fulfill the above criteria are:
Also, I should not need to say this, but just in case there is any confusion: Microsoft Word, NotePad, TextPad, Gedit, Nano, etc. do not count as appropriate text editors.
You are free to learn and use other version control systems (CVS, SVN, Mercurial, Fossil, etc.), but you are going to need to be very comfortable using Git simply because of its prevalence in our field. For most collaborative projects within the lab (with me, at least) or with other labs/folks, you are going to find that Git is the de facto standard. Note that “Git” is a distributed version control system, and “GitHub” is a cloud-based Git repository hosting service. They are not one and the same, and you might find that quite a number of your Git repositories are not hosted on GitHub.
Resources to help you learn Git:
You will almost certainly be doing some sort of programming or scripting in this lab. The languages you might learn will be dependent on your particular project needs and requirements, but will certainly include at least some, and quite probably all, of the following:
Julia is the language that I now recommend that all students adopt for their research projects, unless there are legacy reasons to use Python.
A great way to self-assess, practice, or otherwise have a LOT of fun, is to put your programming into practice on the:
In addition to using bash to navigate and manipulate the file system, here you will also be using bash to script a lot of your work. Typically, this will be in a supporting or supplemental role rather than primary. That is, the main computing/calculations will usually be done in other languages or programs, and you will be using bash to set up jobs or carry out post-processing on the results.
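To make the "supporting role" concrete, here is a minimal sketch of the two jobs bash typically does around an analysis: laying out the runs beforehand, and gathering the results afterwards. Everything named here is an assumption made up for illustration: the helper functions `setup_jobs` and `collect_results`, the `.fasta` inputs, the `result.txt` file name, and the placeholder program `my_analysis`. Your actual projects will have their own layout and tools.

```bash
# Hypothetical helper: create one working directory per input file and
# write the command that would be run there into a small job script.
# Usage: setup_jobs INPUT_DIR JOBS_DIR
setup_jobs() {
    local input_dir="$1"
    local jobs_dir="$2"
    local input name
    for input in "$input_dir"/*.fasta; do
        [ -e "$input" ] || continue      # glob matched nothing: skip
        name="$(basename "$input" .fasta)"
        mkdir -p "$jobs_dir/$name"
        # "my_analysis" is a placeholder for whatever does the real work.
        printf 'my_analysis --input "%s" --output "%s"\n' \
            "$input" "$jobs_dir/$name/result.txt" > "$jobs_dir/$name/run.sh"
    done
}

# Hypothetical helper: print one line per job, consisting of the job name
# and the last line of its result file, separated by a tab.
# Usage: collect_results JOBS_DIR
collect_results() {
    local jobs_dir="$1"
    local result name
    for result in "$jobs_dir"/*/result.txt; do
        [ -e "$result" ] || continue     # no results yet: skip
        name="$(basename "$(dirname "$result")")"
        printf '%s\t%s\n' "$name" "$(tail -n 1 "$result")"
    done
}
```

With these defined, something like `setup_jobs data jobs` followed later by `collect_results jobs > summary.tsv` would cover the set-up and post-processing steps, with the heavy computation left to the program each `run.sh` invokes.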
Some resources to help you get started:
Use Julia.
Your projects may not need the performance advantages of these languages sufficiently to justify the considerable overhead in time, effort, and care required to program in them over other languages such as Python.
If they do, however, then the choice of the language will be driven largely by a number of factors such as: are you starting from scratch or is there an existing code base that might be worth building on or otherwise modifying? are libraries/frameworks available in the language that make your life easier? which language appeals to you the most for whatever reason?
Use Julia.
If you have had any exposure to any scientific work in the past few years, you almost certainly have heard of R.
R is a language for statistical computing and visualization.
You are definitely going to be using it in this lab and quite probably in your future careers.
R is not always the best choice (or even a good choice) for many things that you might want to do (or, for that matter, unfortunately, many of the things that people do use R for), but for a number of other things, it is.
Chances are it is going to be a supplemental analysis language for your projects rather than the primary one.
Either way though, it is difficult to imagine a situation where you are going to be successful in this lab (or this field) without knowing R.
Regular expressions are a way of specifying patterns of strings. This may not sound terribly impressive, but if someone were to tell me that regular expressions are probably one of the coolest things to come out of the programming world after programming itself, I’m not sure I would necessarily object. The fact is that regular expressions are impressive, giving you an incredibly powerful, precise, flexible, and elegant way to search for and manipulate data. You will use them not just in your programming, but also when editing text and documents including your programs, which is why support for regular expressions is one of the mandatory criteria for a “good” text editor as I describe above.
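As a small taste of what this looks like in practice, the sketch below uses extended regular expressions with `grep -E` (to filter lines) and `sed -E` (to pull pieces out of them with capture groups). The identifiers and their `database|accession|GENE_SPECIES` layout are invented purely for illustration, as are the function names; the point is the pattern, not the data.

```bash
# Invented sample identifiers, one per line, including one malformed entry.
sample_ids() {
    printf '%s\n' \
        'db|A001|GENE1_HUMAN' \
        'db|B002|GENE2_MOUSE' \
        'malformed line' \
        'db|C003|GENE3_YEAST'
}

# Keep only lines of the form: two lowercase letters, a literal pipe, an
# accession of uppercase letters/digits, a literal pipe, then NAME_SPECIES.
valid_ids() {
    grep -E '^[a-z]{2}\|[A-Z0-9]+\|[A-Z0-9]+_[A-Z]+$'
}

# For each matching line, print "accession species" using two capture
# groups; lines that do not match are dropped (-n plus the p flag).
accession_and_species() {
    sed -E -n 's/^[a-z]{2}\|([A-Z0-9]+)\|[A-Z0-9]+_([A-Z]+)$/\1 \2/p'
}
```

Running `sample_ids | valid_ids` drops the malformed line, and `sample_ids | accession_and_species` reduces each valid line to its accession and species. The same pattern syntax, give or take some dialect differences, carries over to the search-and-replace of any decent text editor.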
We will use Julia for all our research computing. Julia supports integration with Python libraries, which covers cases where a Julia equivalent does not exist (yet!).
Python, however, remains useful—if not indispensable—in many areas of computational biology, especially where domain-specific tools, pretrained models, or data pipelines have been implemented only in Python. While you may not need to develop major software in Python during your time in this lab, you will very likely need to be able to read, modify, and interface with Python codebases, particularly in situations where:
existing methods or tools are only available in Python (e.g., for genomics, machine learning, NLP, image analysis, etc.),
collaborators or data providers supply scripts or packages written in Python,
or you are using shared infrastructure such as Jupyter notebooks, pipeline management systems, or bioinformatics platforms built around the Python ecosystem.
Documentation (and good documentation at that) is a fundamental part of good programming. All projects that we work on in this lab are going to be well-documented. There are a number of documentation languages out there, and it is quite possible that you are going to use more than one. The choices are often driven by the programming language or languages that a particular project is developed in.
Regardless of all this, you are almost certainly going to need to use Markdown (or one of its flavors) to document your project at some level, even if it is just in the design stage or the README at its publication stage. You will almost certainly use Markdown to communicate your ideas with me and other collaborators, both within the lab and outside.
Quarto is a Markdown-based publishing system that allows you to publish your writings in beautiful high-quality print or web formats, and, in conjunction with embedded LaTeX for mathematical expressions and the Pandoc-style citation system, provides all the support you need to express yourself scientifically, technically, and/or creatively.
There is a lot more to methods/software development than just the programming language. Robust program design, testing (including unit tests, functional tests, integration tests, etc.), version control, and documentation are all important aspects of our work here. You will learn not just how to code, but how to code well.
./configure --prefix=$HOME/Environment && make && make install