Parallel Mean & Variance Calculation

class parallel_statistics.ParallelMeanVariance(size, sparse=False)

ParallelMeanVariance is a parallel and incremental calculator for mean and variance statistics. “Incremental” means that it does not need to read the entire data set at once, and requires only a single pass through the data.

The calculator is designed to work on data in a collection of different bins, for example a map (where the bins are pixels).

The usual life-cycle of this class is:

  • create an instance of the class (on each process if in parallel)

  • repeatedly call add_data or add_datum on it to add new data points

  • call collect (supplying an MPI communicator if in parallel)

You can also call the run method with an iterator to combine these.
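The life-cycle above can be sketched with a small serial stand-in. ToyMeanVariance is a hypothetical class written for this illustration: it borrows only the method names documented here, it is not the library's implementation, and it assumes the variance is the weighted population variance (sum of weighted squared deviations divided by total weight).

```python
import math


class ToyMeanVariance:
    """Hypothetical serial stand-in mirroring the documented method names."""

    def __init__(self, size):
        self.size = size
        self.weight = [0.0] * size  # total weight per bin
        self.mean = [0.0] * size    # running weighted mean per bin
        self.m2 = [0.0] * size      # running weighted sum of squared deviations

    def add_datum(self, bin, value, weight=1):
        # Single-pass weighted update of the running mean and M2.
        if weight == 0:
            return
        w_new = self.weight[bin] + weight
        delta = value - self.mean[bin]
        self.mean[bin] += delta * weight / w_new
        self.m2[bin] += weight * delta * (value - self.mean[bin])
        self.weight[bin] = w_new

    def add_data(self, bin, values, weights=None):
        # Missing weights are treated as 1 for every value.
        if weights is None:
            weights = [1] * len(values)
        for v, w in zip(values, weights):
            self.add_datum(bin, v, w)

    def collect(self):
        # Empty bins get weight 0 and nan for mean and variance.
        mean = [m if w > 0 else math.nan
                for m, w in zip(self.mean, self.weight)]
        var = [m2 / w if w > 0 else math.nan
               for m2, w in zip(self.m2, self.weight)]
        return self.weight, mean, var


# 1. create an instance, 2. add data, 3. collect
calc = ToyMeanVariance(3)
calc.add_data(0, [1.0, 2.0, 3.0])
calc.add_datum(1, 10.0)
calc.add_datum(1, 20.0, weight=3)
weight, mean, var = calc.collect()
```

Bin 2 never receives data, so it illustrates the empty-bin case: its weight is 0 and its mean and variance are nan.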

If only a few of the bin indices are expected to be used, set the sparse option. Data is then stored and returned in a sparse form, which uses less memory and is faster below a certain size.
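The trade-off behind the sparse option can be illustrated with plain Python containers. This is only a sketch of the idea: the dict used here is an assumption chosen for illustration and says nothing about how the library's SparseArray is actually built.

```python
size = 1_000_000

dense = [0.0] * size  # one slot per bin, whether or not it is ever used
sparse = {}           # entries only for bins that actually receive data

# Three data points that touch just two of the million bins.
for b, v in [(5, 1.5), (999_999, 2.5), (5, 0.5)]:
    dense[b] += v
    sparse[b] = sparse.get(b, 0.0) + v

occupied = len(sparse)  # number of bins stored in the sparse form
```

Both containers hold the same sums, but the sparse one stores two entries rather than a million.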

Bins that contain no objects are given weight=0, mean=nan, and var=nan.

The algorithm is based on Schubert & Gertz (2018), Numerically Stable Parallel Computation of (Co-)Variance.
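The core idea behind this family of algorithms is that two partial summaries can be merged without revisiting the data. The sketch below shows the standard weighted pairwise-merge identity for one bin. The function names summarize and merge are hypothetical, and the code is an illustration of the technique rather than the library's own source.

```python
def summarize(values, weights):
    """Return (W, mean, M2) for one partition via a direct two-pass sum."""
    w_total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / w_total
    m2 = sum(w * (v - mean) ** 2 for v, w in zip(values, weights))
    return w_total, mean, m2


def merge(a, b):
    """Combine two (W, mean, M2) summaries into one."""
    w_a, mean_a, m2_a = a
    w_b, mean_b, m2_b = b
    w = w_a + w_b
    if w == 0:
        return 0.0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * w_b / w
    # The cross term accounts for the offset between the two partial means.
    m2 = m2_a + m2_b + delta * delta * w_a * w_b / w
    return w, mean, m2


# Two "processes" each summarize their own share of the data ...
part_a = summarize([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
part_b = summarize([10.0, 20.0], [2.0, 0.5])
merged = merge(part_a, part_b)

# ... and the merged summary matches a direct computation on all of it.
direct = summarize([1.0, 2.0, 3.0, 10.0, 20.0], [1.0, 1.0, 1.0, 2.0, 0.5])
```

Dividing the merged M2 by the merged weight W gives the weighted population variance; whether the library normalizes in exactly this way is an assumption.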

By default the module looks for the package “Numba” and uses its just-in-time compilation to speed up this class. To disable this, export the environment variable PAR_STATS_NO_JIT=1.
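The variable can also be set from inside Python instead of the shell. The sketch below assumes the variable is read when the module is first imported, which is not stated explicitly here, so the assignment is placed before the import. The import is wrapped so that the snippet still runs where the package is not installed.

```python
import os

# Assumption: the setting is checked at import time, so set it first.
os.environ["PAR_STATS_NO_JIT"] = "1"

try:
    import parallel_statistics
except ImportError:
    parallel_statistics = None  # package not available in this environment
```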

Attributes
size: int

number of pixels or bins

sparse: bool

whether sparse representations of arrays are used

Methods

add_data(bin, values[, weights])

Add a chunk of data in the same bin.

add_datum(bin, value[, weight])

Add a single data point to the sum.

collect([comm, mode])

Finalize the statistics calculation, collecting together results from multiple processes.

run(iterator[, comm, mode])

Run the whole life cycle on an iterator returning data chunks.

add_data(bin, values, weights=None)

Add a chunk of data in the same bin.

Add a set of values assigned to a given bin or pixel. Weights may be supplied; if they are not, every weight is set to 1.

Parameters
bin: int

The bin or pixel for these values

values: sequence

A sequence (e.g. array or list) of values assigned to this bin

weights: sequence, optional

A sequence (e.g. array or list) of weights per value

add_datum(bin, value, weight=1)

Add a single data point to the sum.

Parameters
bin: int

Index of the bin or pixel this value applies to

value: float

Value for this bin to accumulate

weight: float, optional

A weight for this data point (default 1)

collect(comm=None, mode='gather')

Finalize the statistics calculation, collecting together results from multiple processes.

If mode is set to “allgather” then every calling process will return the same data. Otherwise the non-root processes will return None for all the values.

Call this only once, after you have finished adding data. The internal data is deleted afterwards.

Parameters
comm: MPI Communicator, optional

The communicator, or None (default) for serial

mode: string, optional

‘gather’ (default), or ‘allgather’

Returns
weight: array or SparseArray

The total weight or count in each bin

mean: array or SparseArray

An array of the computed mean for each bin

variance: array or SparseArray

An array of the computed variance for each bin
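The difference between the two modes can be sketched without MPI by treating each entry of a list as one process. Everything in this block is a simulation under stated assumptions: simulated_collect and merge are hypothetical helpers, rank 0 is assumed to be the root, and each per-process summary is a (W, mean, M2) tuple for a single bin.

```python
def merge(a, b):
    """Combine two (W, mean, M2) summaries (weighted pairwise merge)."""
    w = a[0] + b[0]
    if w == 0:
        return 0.0, 0.0, 0.0
    delta = b[1] - a[1]
    mean = a[1] + delta * b[0] / w
    m2 = a[2] + b[2] + delta * delta * a[0] * b[0] / w
    return w, mean, m2


def simulated_collect(partials, mode="gather"):
    """Return what each simulated rank would receive, as a list by rank."""
    total = (0.0, 0.0, 0.0)
    for p in partials:
        total = merge(total, p)
    w, mean, m2 = total
    if w > 0:
        result = (w, mean, m2 / w)
    else:
        result = (0.0, float("nan"), float("nan"))
    if mode == "allgather":
        # Every rank receives the same final values.
        return [result for _ in partials]
    # 'gather': only the root (rank 0 here) receives values.
    return [result if rank == 0 else (None, None, None)
            for rank in range(len(partials))]


partials = [(3.0, 2.0, 2.0), (2.5, 12.0, 40.0)]  # one summary per rank
gathered = simulated_collect(partials)
everywhere = simulated_collect(partials, mode="allgather")
```

With 'gather' only the first entry carries results, while with 'allgather' every entry is identical, at the cost of sending the final values to every process.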

run(iterator, comm=None, mode='gather')

Run the whole life cycle on an iterator returning data chunks.

This is equivalent to calling add_data repeatedly and then collect.

Parameters
iterator: iterator

Iterator yielding (bin, values) or (bin, values, weights)

comm: MPI Communicator, optional

The communicator, or None for serial

mode: str, optional

“gather” (default) or “allgather”

Returns
weight: array or SparseArray

The total weight or count in each bin

mean: array or SparseArray

An array of the computed mean for each bin

variance: array or SparseArray

An array of the computed variance for each bin
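The iterator protocol accepted by run can be sketched as a generator that yields either two- or three-element tuples. The names chunks, drive and the module-level add_data below are hypothetical stand-ins. To keep the example short, the accumulator tracks only a naive weighted mean and is not the numerically stable method used by the class.

```python
def chunks():
    """Hypothetical data source yielding one chunk of one bin at a time."""
    yield 0, [1.0, 2.0, 3.0]         # (bin, values): weights default to 1
    yield 2, [4.0, 6.0], [1.0, 3.0]  # (bin, values, weights)
    yield 0, [5.0]                   # a bin may appear more than once


totals = {}  # bin -> [sum of weights, sum of weight * value]


def add_data(bin, values, weights=None):
    # Stand-in accumulator with the documented add_data argument order.
    if weights is None:
        weights = [1.0] * len(values)
    acc = totals.setdefault(bin, [0.0, 0.0])
    for v, w in zip(values, weights):
        acc[0] += w
        acc[1] += w * v


def drive(add, iterator):
    # Unpacking with * handles both the 2- and 3-element forms.
    for item in iterator:
        add(*item)


drive(add_data, chunks())
means = {b: s / w for b, (w, s) in totals.items()}
```

Because the data arrives one chunk at a time, the full data set never has to be held in memory, which is the point of the incremental design.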