Large Scale Analysis of Bioimages Using Python

Luis Pedro Coelho
On twitter: @luispedrocoelho

Python has a good ecosystem for data analysis

  • NumPy
  • Matplotlib
  • IPython
  • Scikit-learn
  • Mahotas

Python has a growing ecosystem of scientific packages around numpy

Numpy provides basic data types (arrays, matrices).
Packages provide intelligence.

The wider ecosystem

Multiple packages act together

Mahotas can rely on pre-existing functionality

  1. An image type (numpy array).
  2. Types to hold computed data (numpy array again).
  3. Plotting & displaying (matplotlib).
  4. Machine learning (sklearn or milk).

Modularity is good software engineering

  • Improvements to one package benefit all.
  • Separation of concerns.

Consistency also helps human users

  • Single type for many uses.
  • Many simple operations can be done in numpy.
  • Same basic conventions.
  • No copying/conversion of data between packages.
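
As an illustration of the first two bullets, a basic operation like thresholding needs no image-specific type at all — it is a plain numpy comparison that every package in the ecosystem accepts (synthetic image used here so the snippet is self-contained):

```python
import numpy as np

# A synthetic 8-bit "image": a bright square on a dark background
im = np.zeros((64, 64), dtype=np.uint8)
im[20:40, 20:40] = 200

# Thresholding is just a numpy comparison; the result is a boolean
# array that mahotas, matplotlib, and sklearn can all consume directly.
binary = im > im.mean()
print(binary.sum())  # number of foreground pixels -> 400
```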

Mahotas: computer vision in Python

  • Works with NumPy arrays
  • Basic blocks necessary for image processing
  • Sister package: mahotas-imread for parsing image files
  • Modern open-source: github, automated tests, up-to-date documentation...

A first example
  1. Load an image
  2. Basic smoothing & thresholding


import mahotas as mh
from matplotlib import pyplot as plt

# Load the image and average the colour channels to get greyscale
im = mh.imread('image_stretched.jpeg')
sigma = 2.3
imf = mh.gaussian_filter(im.mean(2), sigma)
# Threshold at the mean intensity, then label connected components
binary = (imf > imf.mean())
labeled, _ = mh.label(binary)


Implementation is in C++

  • Fast C++ code with a Python interface
  • C++ templates allow for specialization
  • Hand-written interface code

// Compute the bounding box of the non-zero elements of an array
template<typename T>
void bbox(const numpy::aligned_array<T> array,
          numpy::index_type* extrema) {
    gil_release nogil;  // release the Python GIL while in pure C++
    const int N = array.size();
    typename numpy::aligned_array<T>::const_iterator pos = array.begin();
    for (int i = 0; i != N; ++i, ++pos) {
        if (*pos) {
            numpy::position where = pos.position();
            for (int j = 0; j != array.ndims(); ++j) {
                extrema[2*j] = std::min<numpy::index_type>(
                                        extrema[2*j], where[j]);
                extrema[2*j+1] = std::max<numpy::index_type>(
                                        extrema[2*j+1], where[j]+1);
            }
        }
    }
}

Result is fast, type-safe & flexible

  • Fast as it is compiled without runtime type checks
  • Type-safe due to template wrappers
  • Flexible as the interface is in Python
  • Compilation in debug mode inserts many run-time checks

Interfaces are hand-written

  • Automated generators exist (swig)
  • They work very well, but give poor error messages
  • I find good error messages important

Scikit-image is another good alternative

  • Uses Cython instead of C++
  • More heavily focused on natural images
    (Mahotas is heavily influenced by scientific images)

Jug for large scale analysis

  • Reproducibility
  • Parallelism
  • Memoization

Jug use cases

  • Parameter sweeps
  • Preprocessing of large data
  • Embarrassingly parallel problems
  • Coarse grained parallelism

Jug Tasks

  • A Task is a Python function & its arguments
  • Arguments can be either values or results of other Tasks
  • Thus, Tasks implicitly define a dependency graph
  • A Task is identified by its hash
  • Hash of function name + arguments (recursively)
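
The identification idea can be sketched in a few lines (a toy model, not jug's actual hashing code; `TaskSketch` and `task` names here are illustrative):

```python
import hashlib

class TaskSketch:
    """Toy model of a jug Task: a function plus arguments, identified
    by a hash of the function name and the arguments. Dependencies are
    hashed recursively, which is how the graph enters the identity."""
    def __init__(self, f, *args):
        self.f = f
        self.args = args

    def hash(self):
        h = hashlib.sha1(self.f.__name__.encode('utf-8'))
        for a in self.args:
            if isinstance(a, TaskSketch):
                # Result of another Task: its hash stands in for its value
                h.update(a.hash().encode('utf-8'))
            else:
                # A plain value: hash a stable encoding of it
                h.update(repr(a).encode('utf-8'))
        return h.hexdigest()

def double(x):
    return 2 * x

t1 = TaskSketch(double, 3)
t2 = TaskSketch(double, 3)
t3 = TaskSketch(double, t1)     # depends on t1's result
print(t1.hash() == t2.hash())   # True: same function & arguments
```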

Design Decisions

  • Code is not taken into account for hash.
  • This means that changing the code will not trigger recomputation
  • Explicit invalidate operation triggers recomputation of all dependent tasks

Jug execution loop

for task in alltasks:
    if backend.has_data(task):
        print("Task is already done")
    elif all(backend.has_data(dep) for dep in task.dependencies()):
        # Lock so that no two processes run the same task
        if backend.lock(task):
            r = task.execute()
            backend.write(task.hash(), r)

Processes communicate through backend

  • All communication between processes is through the backend
  • User must start processes manually.
  • Backend handles all locking issues
    (jobs are separate processes).
  • Jobs may start at different times.
  • Backend is very simple: key/value store + locking
  • Two backends available: filesystem & Redis
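
The contract a backend must satisfy is small enough to sketch as an in-memory stand-in (`DictBackend` is hypothetical; the real filesystem and Redis backends provide the same operations across separate processes):

```python
import threading

class DictBackend:
    """Minimal sketch of the backend contract: a key/value store plus
    locking. A dict and a mutex stand in for persistent storage."""
    def __init__(self):
        self._store = {}
        self._locks = set()
        self._mutex = threading.Lock()

    def has_data(self, key):
        return key in self._store

    def write(self, key, value):
        self._store[key] = value

    def read(self, key):
        return self._store[key]

    def lock(self, key):
        # Atomic test-and-set: returns False if the lock is already held
        with self._mutex:
            if key in self._locks:
                return False
            self._locks.add(key)
            return True

    def unlock(self, key):
        with self._mutex:
            self._locks.discard(key)

backend = DictBackend()
assert backend.lock('task0')        # first worker wins the lock
assert not backend.lock('task0')    # a second worker is refused
backend.write('task0', 42)
```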

Example (demo)

  • Some images that you want to segment.
  • Compare to a gold standard (hand-segmentation).

import mahotas as mh
def method1(image, sigma):
    # Use the first colour channel, smooth, threshold at the mean, label
    image = mh.imread(image)[:, :, 0]
    image = mh.gaussian_filter(image, sigma)
    binimage = (image > image.mean())
    labeled, _ = mh.label(binimage)
    return labeled

Now, we do a live demo...

import mahotas as mh
from jug import TaskGenerator
from glob import glob

@TaskGenerator
def method1(image, sigma):
    ...
    return labeled

@TaskGenerator
def print_results(...):
    ...

inputs = glob('images/*.jpg')
results = []
for im in inputs:
    m1 = method1(im, 2)
    m2 = method2(im, 4)
    ref = im.replace('images', 'references').replace('jpg', 'png')
    v1 = compare(m1, ref)
    v2 = compare(m2, ref)
    results.append((v1, v2))

Jugfile is like Python except for TaskGenerator

Decorator magic, but without decorators it is still simple:

from jug import TaskGenerator

@TaskGenerator
def method1(image, sigma):
    ...
    return labeled

m1 = method1(im, 2)

is equivalent to

from jug import Task

def method1(image, sigma):
    ...
    return labeled

m1 = Task(method1, im, 2)

Some Details

  • Anything that can be serialized by Python (pickle) will work fine.
  • Special support for NumPy arrays (speed/disk usage)
  • Works with filesystem (including working well on NFS).
  • Alternatively, use Redis

To run on a cluster, use the cluster interface



jug execute

Run that command in as many processes as you want. Then use your
cluster's submission interface (e.g., a job array) to start the
workers; they coordinate through the shared backend.
Some more operations

  • jug cleanup: run garbage collection on the backend
  • jug sleep-until: wait until all tasks are finished

Example Application

  • Quantification of Neutrophil Extracellular Traps (NETs)
  • Neutrophils trap bacteria by expelling their own DNA as a mesh of extracellular fibers (the "net")

(This is work currently under review.)

Can we use a supervised approach?

Initial remarks

  • We do not need perfect segmentation
    (similar to Learning to Count framework)
  • We want to try several methods

Different submodules lead to different methods

  • How to break up image?
  • Which features to compute?
  • Which regression module to use?
    (we always used random forests).

Different ways to break up image

  1. Oversegmentation
  2. Regular grid
  3. Interest point detection
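
Option 2, the regular grid, is simple enough to sketch with plain numpy (`grid_patches` is an illustrative helper, not part of the published pipeline):

```python
import numpy as np

def grid_patches(image, patch_size):
    """Break an image into non-overlapping square patches.
    Edge remainders that do not fill a full patch are dropped."""
    h, w = image.shape[:2]
    for i in range(0, h - patch_size + 1, patch_size):
        for j in range(0, w - patch_size + 1, patch_size):
            yield image[i:i + patch_size, j:j + patch_size]

im = np.arange(64 * 64).reshape(64, 64)
patches = list(grid_patches(im, 16))
print(len(patches))  # a 4 x 4 grid -> 16 patches
```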

Different ways to compute features

  • Texture features
  • Image filterings
  • SURF features (different scales)
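
A minimal stand-in for the feature step, using only simple filterings (image gradients); the actual pipeline used texture features (e.g. Haralick's, as in mahotas) and SURF descriptors, and `filter_features` below is purely illustrative:

```python
import numpy as np

def filter_features(patch):
    """Summary statistics of a patch and of filtered versions of it
    (here, gradients). Each patch maps to a fixed-length vector that
    a regression module can consume."""
    gy, gx = np.gradient(patch.astype(float))
    return np.array([
        patch.mean(), patch.std(),
        np.abs(gx).mean(), np.abs(gy).mean(),
    ])

patch = np.random.RandomState(0).rand(16, 16)
fv = filter_features(patch)
print(fv.shape)  # one 4-dimensional feature vector per patch
```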

Initial (raw) results

Can we combine the methods?

Above diagonal, we show a scatter plot of residuals.
Below diagonal, we show Pearson correlation of residuals.

Linear regression for combination
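
The combination step can be sketched with scikit-learn on synthetic predictions (the numbers below are made-up stand-ins, not the paper's data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
truth = rng.uniform(0, 100, size=200)           # true counts per image
pred_a = truth + rng.normal(0, 10, 200)         # method A: noisy
pred_b = 0.8 * truth + rng.normal(0, 5, 200)    # method B: biased, tighter

# Fit a linear combination of the two methods' outputs against the truth
X = np.column_stack([pred_a, pred_b])
comb = LinearRegression().fit(X, truth)
print(comb.score(X, truth))  # R^2 of the combined predictor
```

On training data the combination can only match or beat either method alone, since each single method is a special case of the linear combination.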

How good is 93 percent?

Well, how good are humans (the baseline method)?

Comparison example

Label it twice

  • If your method uses human labeled data,
    label it twice
  • Many cases, humans are not that great
  • In our case, operator bias is major issue
    but bias seems consistent

Cross-validate by experiment, not by image!

  • Each experiment resulted in several images from the same slide.
  • Using images from the same experiment in training/testing
    results in a 3-6% Q² boost!
  • Other work in related problems shows analogous effects.
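
scikit-learn's GroupKFold expresses exactly this rule, if each image carries its experiment id as the group label (the sizes below are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# One group id per image: the experiment (slide) it came from.
images = np.arange(12)                  # 12 images
groups = np.repeat([0, 1, 2, 3], 3)     # 4 experiments, 3 images each

# Splitting on groups keeps all images from one experiment on the
# same side of the train/test divide, avoiding the optimistic bias.
for train, test in GroupKFold(n_splits=4).split(images, groups=groups):
    assert not set(groups[train]) & set(groups[test])
```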

Whole computation is managed with jug

  • Prototype to publication to reproduction without breaks
  • Easy to reproduce
  • Smallish problem (100 CPU hours)
  • We will make the code available when the paper is accepted

Live demo time again...

Summary
  • Python is great for data analysis
  • Partly because it makes it easy to interface with non-Python code
  • Partly because of the ecosystem of packages
  • Jug makes it easier to manage computations
  • IPython is great for communication
  • When using human-labeled data, label it twice
  • Beware the cross-validation schedule

Acknowledgements
  • Everyone who contributed patches and bug reports to my projects
  • Catarina Pato
  • Ana Friães
  • Ariane Neumann
  • Mário Ramirez
  • Marien von Köckritz-Blickwede
  • João Carriço

Thank You