Programming for Scientists (Autumn 2012)

Dates & Times

Session I: Sep 14 (15h00-17h00). All day (9h00-18h00) Sep 17 & 24

Session II: Oct 11 (15h00-17h00). All day (9h00-18h00) Oct 15 & 22

In the all-day sessions, there will be a break from 12h30 to 14h00 (so that you can attend the Monday seminar if you want to).

Tentative Schedule

This schedule is subject to change (up until class time).

I added slide links below, but the presentations rely heavily on live coding demonstrations and dialogue with the students. Therefore, if you did not attend the class, the slides will not be very helpful.

Intro Session

  • Introduction to the course and the instructor.
  • Syllabus.
  • Discussion of necessary software.
  • Introduction to programming (Python I).

[slides 01]

[slides 02]

Day I

Python II [slides 03]
Numpy & matplotlib [slides 04]
Guided exercises [slides 05]

Break (you can go to the Monday Lecture at 12h30)

Python III [slides 06]
Numeric issues [slides 07]
File formats & parsing. FastQ example [slides 08]

Day II

HW Review/In-class Quiz [slides 09 and possible solution]
Open Source Software [slides 10]
Unit tests [slides 11]

Break (you can go to the Monday Lecture at 12h30)

Python IV [slides 12]
Guided exercises [slides 13]
Review [slides 14]
Alternative (image processing) ending [slides 14]. Images used: DNA and protein


The homework is optional (there are no grades in this class). However, if you do turn it in, I will give you feedback on it, so you can treat it as a learning experience. Also, at the start of Day II, I will go over my solution.

I will discuss the assignment in class, but here it is for reference:

  1. Download the FastQ data (or the compressed version).
  2. Write a Python script that reads in the file and plots the average quality and standard deviation per base position (like FastQC does, if you are familiar with that tool).
  3. Write a second Python script (or extend the one above) that trims & filters the sequence (see below).
  4. Write a third Python script (or extend the one above) that plots a histogram of sequence sizes after trimming (before trimming all sequences have the same size).
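
As a rough illustration of step 2 (not the solution discussed in class), the per-position statistics could be computed along the following lines. This sketch assumes 4-line FastQ records, Phred+33 quality encoding, and reads of equal length; the function names are hypothetical, and the standard deviation is the population one.

```python
import math

PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding


def read_qualities(lines):
    """Yield a list of integer quality scores for each 4-line FastQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 3:  # the 4th line of each record holds the qualities
            yield [ord(c) - PHRED_OFFSET for c in line.rstrip("\n")]


def per_position_stats(quality_lists):
    """Return (means, stds) per base position in a single pass over the reads."""
    n = 0
    sums, sq_sums = [], []
    for quals in quality_lists:
        if not sums:  # size the accumulators from the first read
            sums = [0.0] * len(quals)
            sq_sums = [0.0] * len(quals)
        for pos, q in enumerate(quals):
            sums[pos] += q
            sq_sums[pos] += q * q
        n += 1
    means = [s / n for s in sums]
    # population standard deviation: sqrt(E[q^2] - E[q]^2)
    stds = [math.sqrt(max(sq / n - m * m, 0.0)) for sq, m in zip(sq_sums, means)]
    return means, stds


def plot_stats(means, stds, out_path):
    """Save an error-bar plot of mean quality per position (needs matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    positions = list(range(1, len(means) + 1))
    plt.errorbar(positions, means, yerr=stds, fmt="o-")
    plt.xlabel("Base position")
    plt.ylabel("Quality (mean and standard deviation)")
    plt.savefig(out_path)
```

With a file on disk, this might be used as `means, stds = per_position_stats(read_qualities(open("reads.fastq")))`, where `reads.fastq` is a placeholder name.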

Use the following trim & filter criteria: given a minimum base quality (you can use the value 24 as a baseline), look for the longest substring such that all bases in that substring have quality (strictly) above the minimum. If the resulting string is too short (you can use the value 30 as the minimum number of base pairs), discard the sequence.
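
One possible way to express this criterion in code is sketched below. The function names are hypothetical, qualities are assumed to be already decoded to integers, and a read is assumed to be kept when its trimmed length is at least the minimum. If several substrings tie for the longest, this sketch keeps the first one.

```python
def longest_good_run(quals, min_quality=24):
    """Return (start, end) of the longest run where every quality > min_quality.

    The run is quals[start:end]; (0, 0) means no base passed the threshold.
    """
    best_start, best_end = 0, 0
    run_start = 0
    for i, q in enumerate(quals):
        if q > min_quality:
            # the current run is quals[run_start:i + 1]
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            # a low-quality base ends the run; the next one starts after it
            run_start = i + 1
    return best_start, best_end


def trim_and_filter(seq, quals, min_quality=24, min_length=30):
    """Return the trimmed (seq, quals), or None if the result is too short."""
    start, end = longest_good_run(quals, min_quality)
    if end - start < min_length:
        return None
    return seq[start:end], quals[start:end]
```

For example, with qualities `[30, 30, 10, 30, 30, 30, 24, 30]` and the default threshold, the longest run covers positions 3 to 5 (zero-based), since the value 24 is not strictly above the minimum.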

Here are some possible variations:

  • Can you write the script so that the inputs, outputs, and thresholds are given on the command line?
  • Can you do it all in one pass through the data (and in a single script file)? Why would this be better/worse?
  • Semi-Advanced: can you do this using HTSeq (by the way, HTSeq is a very good package for NGS processing)?
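
For the first variation, a minimal sketch using the standard library's `argparse` module might look like this. The argument names (`input`, `output`, `--min-quality`, `--min-length`) are illustrative choices, not part of the assignment; the defaults are the baseline values given above.

```python
import argparse


def build_parser():
    """Build a command-line parser for a hypothetical trim & filter script."""
    parser = argparse.ArgumentParser(
        description="Trim and filter FastQ reads by base quality.")
    parser.add_argument("input", help="path to the input FastQ file")
    parser.add_argument("output", help="path for the trimmed FastQ file")
    parser.add_argument("--min-quality", type=int, default=24,
                        help="bases must be strictly above this quality")
    parser.add_argument("--min-length", type=int, default=30,
                        help="discard reads shorter than this after trimming")
    return parser


# Parsing an explicit list here for illustration; a real script would call
# build_parser().parse_args() with no arguments to read the command line.
args = build_parser().parse_args(
    ["reads.fastq", "trimmed.fastq", "--min-quality", "20"])
print(args.input, args.output, args.min_quality, args.min_length)
```

Note that `argparse` turns `--min-quality` into the attribute `args.min_quality`.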


Do I need to bring a computer?

The course will take place in the computer labs and I have asked that the necessary software be installed there. However, you will be asked to install it on your own computer and, if you want to bring a laptop to class, that is a good idea too.

I will discuss this in the first session, but if you want to go ahead, please install the following (for Windows):

For Mac OS, please install EPD:

(It is free of cost for academics).

For Linux, just use your distribution's package manager. You will need the following packages:

  • python
  • python-numpy
  • python-matplotlib
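
If you want to check that the installation worked, a small script along these lines should do (a sketch; the helper name is hypothetical). It only tests whether the `numpy` and `matplotlib` modules can be found, without importing them.

```python
import importlib.util


def check_packages(names=("numpy", "matplotlib")):
    """Return a dict mapping each module name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_packages().items():
    print(name, "OK" if found else "MISSING")
```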

Why are you teaching Python?

Python is a modern language that is increasingly used for scientific programming. It is open-source and has a thriving community that develops open-source scientific software.