Introduction to Cluster Usage

Job Arrays

A job array is a way to take advantage of many machines with the same script.
Clusters are ideal for embarassingly parallel problems, which characterize many settings in science (examples from biology):
- Applying the same analysis to all images in a screen.
- BLASTing a large set of genes against the same database
- Parsing all abstracts in Pub Med Central
- ...

For small things, just run separate processes

#/bin/bash

input=$1
grep -c mouse $input > ${input}.counts

And now run it many times, using a loop on the shell:

for f in data/*; do
    qsub ./script.sh $f
done

How do job arrays work

Write a script.
Submit it as a job array.
The script is run multiple times with a different index
Use the index to decide what to do!

Detour: environmental variables

Do you know what they are?

Environmental variables are variables that scripts can set & access.
Example: $PBS_O_HOST

PBS uses variables to communicate with your script

PBS_ARRAY_INDEX: This is the job index
...: Check documentation

Exercise: write and submit a job for this process

Input is a series of files named x00, x01, ..., x09
Task is to run the same script on each and save results to output0, output1, ... output9
In our case, the task is to count the number of occurrences of the word mouse

In particular,

please copy the directory cluster/data/by-number to your home directory
write a script which will execute for all outputs
```
grep -c mouse $input > $input.out
```
Actually, you can start with the script count.mouse.sh that is already there.

Rarely is the input organized in such a nice fashion

Here is a more realistic scenario (1)

Your input is a huge single file.
Use split to break it up.

Rarely is the input organized in such a nice fashion

Here is a more realistic scenario (2)

Your input is a list of files, but they have arbitrary names
A few helpful shell commands:
1. ls -1 > file-list.txt
2. To get the fourth line of a file sed -n "4p" file-list.txt
please copy the directory cluster-training/data/unordered to your home directory and write a script to count the number of mice in each of the files. Again, a script count.mouse.sh is present if you need to start somewhere.

Fail well

Common Unix strategy:

Write your output to file.tmp, preferably on the same directory
Call sync (!)
Move to the final location

Unix guarantees that the move is atomic.

Rewrite the mouse count script to use the temp-move strategy

Remember to allocate resources

CPUs (same machine or different machines)
Memory
GPU (graphical processing units)
Time
Disk
Software licenses
Network usage

How can you check how much memory your process uses?

Guess-timate
Measure (look at top)

Job dependencies

You can schedule a job after another job has finished.
Common setting:
1. Extract some information from a large set of inputs (parallel)
2. Summarise this information (textual/plot/&c)
In our case, we summarize the mouse counts.

Shameless plug for jug

If you use Python, you may want to look at my package jug which can make running jobs on clusters easier

(Only makes sense if you're using Python.)

Introduction to Cluster Usage

What is a Cluster?

Queuing systems

Euclid uses PBS (GE or SGE are very similar)

First Step: Let's all SSH to the head node

Using an interactive session

Running our first job on the queue

Checking up on your jobs

Do not compute on the head node

Test your jobs before submitting!

Advanced Cluster Usage

Job Arrays

For small things, just run separate processes

How do job arrays work

Detour: environmental variables

PBS uses variables to communicate with your script

Exercise: write and submit a job for this process

Rarely is the input organized in such a nice fashion

Rarely is the input organized in such a nice fashion

Fail well

Remember to allocate resources

How can you check how much memory your process uses?

Job dependencies

Shameless plug for jug

If you get stuck

Thank You