Parallel Mean & Variance Calculation

class parallel_statistics.ParallelMeanVariance(size, sparse=False)

ParallelMeanVariance is a parallel and incremental calculator for mean and variance statistics. “Incremental” means that it does not need to read the entire data set at once, and requires only a single pass through the data.
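The single-pass idea can be illustrated with a small weighted running update. This is only a sketch of the general technique, not the library's code: the function name `update` and the `(weight, mean, m2)` state layout are assumptions made for illustration, and the final variance is assumed to be normalized by the total weight.

```python
def update(state, value, weight=1.0):
    """Fold one weighted value into a running (total_weight, mean, m2) summary.

    m2 is the weighted sum of squared deviations from the current mean.
    """
    total_w, mean, m2 = state
    if weight == 0:
        # A zero-weight point contributes nothing.
        return state
    new_w = total_w + weight
    delta = value - mean
    mean += (weight / new_w) * delta
    # Uses the deviation from the old mean and from the new mean.
    m2 += weight * delta * (value - mean)
    return new_w, mean, m2


values = [2.0, 4.0, 6.0]
weights = [1.0, 2.0, 1.0]

state = (0.0, 0.0, 0.0)
for v, w in zip(values, weights):
    # Each point is seen exactly once; nothing is stored but the summary.
    state = update(state, v, w)

total_w, mean, m2 = state
variance = m2 / total_w  # assumption: normalize by the total weight
```

Only three numbers are kept per bin, so memory use does not grow with the number of data points.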

The calculator is designed to work on data in a collection of different bins, for example a map (where the bins are pixels).

The usual life-cycle of this class is:

create an instance of the class (on each process if in parallel)
repeatedly call add_data or add_datum on it to add new data points
call collect (supplying an MPI communicator if in parallel)

You can also call the run method with an iterator to combine these.
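To make the life-cycle concrete without requiring MPI, here is a minimal serial stand-in. `ToyMeanVariance` is a hypothetical class written for this illustration; it mirrors the documented method names but is not the library's implementation, ignores `comm`, `mode` and `sparse`, and assumes the variance is normalized by the total weight in each bin.

```python
import math


class ToyMeanVariance:
    """Serial toy mirroring the documented method names (not the real class)."""

    def __init__(self, size):
        self.size = size
        self._weight = [0.0] * size
        self._mean = [0.0] * size
        self._m2 = [0.0] * size  # weighted sum of squared deviations

    def add_datum(self, bin, value, weight=1):
        if weight == 0:
            return
        w = self._weight[bin] + weight
        delta = value - self._mean[bin]
        self._mean[bin] += (weight / w) * delta
        self._m2[bin] += weight * delta * (value - self._mean[bin])
        self._weight[bin] = w

    def add_data(self, bin, values, weights=None):
        if weights is None:
            weights = [1] * len(values)
        for v, w in zip(values, weights):
            self.add_datum(bin, v, w)

    def collect(self):
        # Empty bins get mean=nan and var=nan, with weight left at 0.
        mean = [m if w > 0 else math.nan
                for m, w in zip(self._mean, self._weight)]
        var = [m2 / w if w > 0 else math.nan
               for m2, w in zip(self._m2, self._weight)]
        return list(self._weight), mean, var

    def run(self, iterator):
        # Equivalent to calling add_data for every chunk, then collect.
        for chunk in iterator:
            self.add_data(*chunk)
        return self.collect()


# 1. create an instance, 2. add data, 3. collect
calc = ToyMeanVariance(size=3)
calc.add_data(0, [1.0, 3.0])
calc.add_datum(2, 5.0, weight=2)
weight, mean, variance = calc.collect()
```

In this toy run, bin 1 receives no data, so its mean and variance come back as nan while its weight stays 0.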

If only a few indices in the data are expected to be used, set the sparse option. Data is then stored and returned in a sparse form, which uses less memory and is faster below a certain size.

Bins that contain no objects are given weight=0, mean=nan, and var=nan.

The algorithm here is based on Schubert & Gertz 2018, Numerically Stable Parallel Computation of (Co-)Variance.
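The key step in any parallel scheme of this kind is combining partial summaries computed on separate processes. The sketch below shows the standard pairwise combination formula for `(weight, mean, m2)` summaries; the helper names `summarize` and `merge` are illustrative assumptions, and the library's actual internals may differ.

```python
def summarize(values):
    """Summarize equally weighted values as (total_weight, mean, m2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return float(n), mean, m2


def merge(a, b):
    """Combine two partial summaries without revisiting the raw data."""
    w_a, mean_a, m2_a = a
    w_b, mean_b, m2_b = b
    w = w_a + w_b
    if w == 0:
        return 0.0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * w_b / w
    # The cross term accounts for the offset between the two partial means.
    m2 = m2_a + m2_b + delta * delta * w_a * w_b / w
    return w, mean, m2


# Pretend each list lives on a different process.
part_a = [1.0, 2.0, 3.0]
part_b = [10.0, 20.0]

merged = merge(summarize(part_a), summarize(part_b))
direct = summarize(part_a + part_b)  # reference: all data in one place
```

Here `merged` and `direct` agree to floating-point precision, which is what lets each process accumulate independently before results are gathered.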

By default the module looks for the package “Numba” and uses its just-in-time compilation to speed up this class. To disable this, export the environment variable PAR_STATS_NO_JIT=1.

Attributes

size: int: number of pixels or bins
sparse: bool: whether sparse representations of arrays are used

Methods

`add_data`(bin, values[, weights])	Add a chunk of data in the same bin.
`add_datum`(bin, value[, weight])	Add a single data point to the sum.
`collect`([comm, mode])	Finalize the statistics calculation, collecting together results from multiple processes.
`run`(iterator[, comm, mode])	Run the whole life cycle on an iterator returning data chunks.

add_data(bin, values, weights=None)

Add a chunk of data in the same bin.

Add a set of values assigned to a given bin or pixel. Weights may be supplied; if they are not, each value is given a weight of 1.

Parameters

bin: int: The bin or pixel for these values
values: sequence: A sequence (e.g. array or list) of values assigned to this bin
weights: sequence, optional: A sequence (e.g. array or list) of weights per value

add_datum(bin, value, weight=1)

Add a single data point to the sum.

Parameters

bin: int: Index of the bin or pixel this value applies to
value: float: Value for this bin to accumulate
weight: float: Optional, default=1, a weight for this data point

collect(comm=None, mode='gather')

Finalize the statistics calculation, collecting together results from multiple processes.

If mode is set to “allgather” then every calling process will return the same data. Otherwise the non-root processes will return None for all the values.

You can only call this once, when you’ve finished adding data with add_data or add_datum. After that, the internal data is deleted.

Parameters

comm: MPI Communicator, optional
mode: string, optional: ‘gather’ (default), or ‘allgather’

Returns

weight: array or SparseArray: The total weight or count in each bin
mean: array or SparseArray: An array of the computed mean for each bin
variance: array or SparseArray: An array of the computed variance for each bin

run(iterator, comm=None, mode='gather')

Run the whole life cycle on an iterator returning data chunks.

This is equivalent to calling add_data repeatedly and then collect.

Parameters

iterator: iterator: Iterator yielding (bin, values) or (bin, values, weights)
comm: MPI comm, optional: The comm, or None for serial
mode: str, optional: “gather” or “allgather”

Returns

weight: array or SparseArray: The total weight or count in each bin
mean: array or SparseArray: An array of the computed mean for each bin
variance: array or SparseArray: An array of the computed variance for each bin
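As an illustration of the kind of iterator run expects, the generator below turns parallel sequences of bin indices and values into (bin, values) chunks. The helper name `chunks_by_bin` is hypothetical and not part of the library; this is one possible way to build such an iterator, assuming the data already fits in memory.

```python
def chunks_by_bin(bins, values):
    """Yield (bin, values) tuples, one per distinct bin index."""
    grouped = {}
    for b, v in zip(bins, values):
        # Collect all values that fall in the same bin.
        grouped.setdefault(b, []).append(v)
    for b, vals in grouped.items():
        yield b, vals


bins = [0, 2, 0, 2, 2]
values = [1.0, 5.0, 3.0, 7.0, 9.0]

# Materialized here only for inspection; run() would consume the generator lazily.
chunks = list(chunks_by_bin(bins, values))
```

A generator like this could be passed as the iterator argument. For data sets too large to hold in memory, the iterator would more naturally yield chunks as they are read from disk, since the same bin may appear in more than one chunk.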