Advanced Cluster Usage
- Job arrays
- Allocating resources
- Job dependencies
Job Arrays
- A job array is a way to take advantage of many machines with the same script.
- Clusters are ideal for embarrassingly parallel problems, which characterize many settings in science (examples from biology):
- Applying the same analysis to all images in a screen.
- BLASTing a large set of genes against the same database.
- Parsing all abstracts in PubMed Central.
- ...
For small things, just run separate processes
#!/bin/bash
# Count occurrences of the word "mouse" in the input file
input=$1
grep -c mouse "$input" > "${input}.counts"
And now run it many times, using a shell loop:
for f in data/*; do
    qsub ./script.sh "$f"
done
How do job arrays work?
- Write a script.
- Submit it as a job array.
- The script is run once per task, each time with a different index.
- Use the index to decide what to do!
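For example, with SGE a job array is submitted with the -t flag (a minimal sketch; the range 1-10 is just an illustration):

qsub -t 1-10 ./script.sh

This runs script.sh ten times, once per task, and each task can read its own index from the environment (see the next slides).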
Detour: environment variables
Do you know what they are?
- Environment variables are variables that scripts can set & access.
- Example: $SGE_O_HOST
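A quick illustration in the shell (MYVAR is just a made-up name):

MYVAR=hello     # set a variable in the current shell
export MYVAR    # make it visible to child processes (such as your scripts)
echo $MYVAR     # prints: hello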
SGE uses environment variables to communicate with your script
- SGE_TASK_ID
- This is the index of the current task in the array
- ...
- Check the SGE documentation for the full list
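A minimal job-array script using SGE_TASK_ID might look like this (a sketch; the input names input.1, input.2, ... are an assumption):

#!/bin/bash
# Each task of the array processes one input file, selected by its index
grep -c mouse input.${SGE_TASK_ID} > output.${SGE_TASK_ID}

Submitted with qsub -t 1-10, this runs ten tasks, one per input file.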
Exercise: write and submit a job for this process
- Input is a series of files named x00, x01, ..., x09
- Task is to run the same script on each and save results to output0, output1, ... output9
- In our case, the task is to count the number of occurrences of the word mouse
In particular,
- please copy the directory cluster/data/by-number to your home directory
- write a script which will execute, for each input file:
grep -c mouse "$input" > "${input}.out"
- Actually, you can start with the script count.mouse.sh that is already there.
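One possible solution (a sketch; assumes the array is submitted with qsub -t 1-10):

#!/bin/bash
# count.mouse.sh -- map tasks 1..10 onto inputs x00..x09
input=$(printf "x%02d" $((SGE_TASK_ID - 1)))   # zero-pad: task 1 -> x00
grep -c mouse "$input" > "${input}.out"

Submit it with qsub -t 1-10 count.mouse.sh.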
Rarely is the input organized in such a nice fashion
Here is a more realistic scenario (1)
- Your input is a huge single file.
- Use split to break it up.
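For example (a sketch; huge-input.txt is a made-up name and 1000 lines per chunk is arbitrary):

split -l 1000 -d huge-input.txt chunk.   # produces chunk.00, chunk.01, ...

The -d flag asks GNU split for numeric suffixes, which map naturally onto SGE_TASK_ID.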
Here is a more realistic scenario (2)
- Your input is a list of files, but they have arbitrary names
- A few helpful shell commands:
- ls -1 > file-list.txt
- To get the fourth line of a file: sed -n "4p" file-list.txt
- please copy the directory cluster-training/data/unordered to your home directory and write a script to count the number of mice in each of the files. Again, a script count.mouse.sh is present if you need to start somewhere.
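A sketch of such a script (assumes file-list.txt was built with ls -1 beforehand and contains only the input files):

#!/bin/bash
# count.mouse.sh -- pick the file for this task from a pre-built list
input=$(sed -n "${SGE_TASK_ID}p" file-list.txt)   # task N reads line N
grep -c mouse "$input" > "${input}.out"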
Fail well
Common Unix strategy:
- Write your output to file.tmp, preferably in the same directory as the final destination
- Call sync (!)
- Move to the final location
Unix guarantees that the move is atomic (as long as both paths are on the same filesystem).
Rewrite the mouse count script to use the temp-move strategy
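A sketch of the rewritten script (same counting command as before):

#!/bin/bash
input=$1
grep -c mouse "$input" > "${input}.tmp"    # write to a temporary file first
sync                                       # flush buffers to disk
mv "${input}.tmp" "${input}.counts"        # the rename is atomic (same filesystem)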
Remember to allocate resources
- CPUs (same machine or different machines)
- Memory
- GPUs (graphics processing units)
- Time
- Disk
- Software licenses
- Network usage
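With SGE, resources are requested at submission time. The exact resource and parallel-environment names vary between clusters, so treat these flags as a sketch and check your site's documentation:

qsub -l h_vmem=4G -l h_rt=02:00:00 -pe smp 4 ./script.sh
# h_vmem: memory limit; h_rt: wall-clock limit; -pe smp 4: four CPUs on one machine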
How can you check how much memory your process uses?
- Guesstimate
- Measure (look at top)
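For a one-off measurement, GNU time (note: /usr/bin/time, not the shell built-in) reports peak memory use:

/usr/bin/time -v ./count.mouse.sh x00   # look for "Maximum resident set size"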
Job dependencies
- You can schedule a job after another job has finished.
- Common setting:
- Extract some information from a large set of inputs (parallel)
- Summarize this information (textual/plot/etc.)
- In our case, we summarize the mouse counts.
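With SGE, this can be expressed with -hold_jid (a sketch; summarize.sh is a made-up name):

qsub -N count -t 1-10 count.mouse.sh   # the parallel extraction, as a named job array
qsub -hold_jid count summarize.sh      # held until every count task has finished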
Shameless plug for jug
If you use Python, you may want to look at my package jug, which can make running jobs on clusters easier.