Parallel Mean & Variance Calculation
- class parallel_statistics.ParallelMeanVariance(size, sparse=False)
ParallelMeanVariance is a parallel and incremental calculator for mean and variance statistics. “Incremental” means that it does not need to read the entire data set at once, and requires only a single pass through the data.
The calculator is designed to work on data in a collection of different bins, for example a map (where the bins are pixels).
The usual life-cycle of this class is:
- create an instance of the class (on each process if in parallel)
- repeatedly call add_data or add_datum on it to add new data points
- call collect (supplying an MPI communicator if in parallel)
You can also call the run method with an iterator to combine these steps.
If only a few indices in the data are expected to be used, the sparse option can be set. Data is then represented and returned in a sparse form, which uses less memory and is faster below a certain size.
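The memory argument for a sparse representation can be illustrated with plain Python containers. This is only a sketch of the general idea: the dict used here is a stand-in chosen for illustration, and nothing in it reflects how the library's SparseArray is actually implemented.

```python
import sys

size = 1_000_000          # total number of bins (hypothetical)
hits = [3, 500_000, 42]   # the few bins that actually receive data

# Dense form: one slot per bin, whether it is used or not.
dense = [0.0] * size
for b in hits:
    dense[b] += 1.0

# Sparse form: only bins that were touched are stored.
sparse = {}
for b in hits:
    sparse[b] = sparse.get(b, 0.0) + 1.0

# Size of the containers themselves, in bytes.
dense_bytes = sys.getsizeof(dense)
sparse_bytes = sys.getsizeof(sparse)
```

With only three bins in use, the sparse container is far smaller than the dense one. The advantage shrinks as more bins are filled.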
Bins that contain no objects are given weight=0, mean=nan, and var=nan.
The algorithm here is based on Schubert & Gertz 2018, Numerically Stable Parallel Computation of (Co-)Variance.
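To give a feel for what a single-pass, weighted update looks like, here is a minimal sketch for one bin. It is a textbook-style incremental update, not the library's code: the class name RunningBinStats is hypothetical, and the variance is assumed to be normalised by the total weight.

```python
import math


class RunningBinStats:
    """Hypothetical single-bin accumulator, for illustration only."""

    def __init__(self):
        self.weight = 0.0  # running sum of weights
        self.mean = 0.0    # running weighted mean
        self.m2 = 0.0      # running sum of w * (x - mean)^2

    def add(self, value, weight=1.0):
        if weight == 0:
            return  # a zero-weight point changes nothing
        self.weight += weight
        delta = value - self.mean
        self.mean += delta * weight / self.weight
        # Combines the deviation from the old mean with the deviation
        # from the new mean, avoiding a sum-of-squares subtraction.
        self.m2 += weight * delta * (value - self.mean)

    def result(self):
        # Mirror the documented convention for empty bins.
        if self.weight == 0:
            return 0.0, math.nan, math.nan
        # Assumption: variance is normalised by the total weight.
        return self.weight, self.mean, self.m2 / self.weight


stats = RunningBinStats()
for x, w in [(1.0, 1.0), (2.0, 2.0), (4.0, 1.0)]:
    stats.add(x, w)
weight, mean, var = stats.result()
```

Each data point is seen once and then discarded, which is what allows the calculation to run without loading the full data set.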
By default the module looks for the package “Numba” and uses its just-in-time compilation to speed up this class. To disable this, export the environment variable PAR_STATS_NO_JIT=1.
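The variable can also be set from within Python rather than in the shell. The sketch below assumes the flag is read when the package is first imported, so it sets the variable beforehand; that timing is an assumption, not something stated here.

```python
import os

# Assumption: the flag is checked at import time, so set it first.
os.environ["PAR_STATS_NO_JIT"] = "1"

# import parallel_statistics  # a later import would then see the flag
```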
- Attributes
- size: int
number of pixels or bins
- sparse: bool
whether sparse representations of arrays are used
Methods
- add_data(bin, values[, weights])
Add a chunk of data in the same bin.
- add_datum(bin, value[, weight])
Add a single data point to the sum.
- collect([comm, mode])
Finalize the statistics calculation, collecting together results from multiple processes.
- run(iterator[, comm, mode])
Run the whole life cycle on an iterator returning data chunks.
- add_data(bin, values, weights=None)
Add a chunk of data in the same bin.
Add a set of values assigned to a given bin or pixel. Weights may be supplied; if they are not, each value is given a weight of 1.
- Parameters
- bin: int
The bin or pixel for these values
- values: sequence
A sequence (e.g. array or list) of values assigned to this bin
- weights: sequence, optional
A sequence (e.g. array or list) of weights per value
- add_datum(bin, value, weight=1)
Add a single data point to the sum.
- Parameters
- bin: int
Index of the bin or pixel this value applies to
- value: float
Value for this bin to accumulate
- weight: float
Optional, default=1, a weight for this data point
- collect(comm=None, mode='gather')
Finalize the statistics calculation, collecting together results from multiple processes.
If mode is set to “allgather” then every calling process will return the same data. Otherwise the non-root processes will return None for all the values.
You can only call this once, when you’ve finished calling add_data. After that, internal data is deleted.
- Parameters
- comm: MPI Communicator, optional
- mode: string, optional
‘gather’ (default), or ‘allgather’
- Returns
- weight: array or SparseArray
The total weight or count in each bin
- mean: array or SparseArray
An array of the computed mean for each bin
- variance: array or SparseArray
An array of the computed variance for each bin
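The step of combining results from several processes can be sketched without MPI by merging per-worker summaries for a single bin. This uses the standard pairwise merge formula for weighted means and sums of squared deviations. The function names are hypothetical, the "workers" are simulated with plain lists, and the variance is assumed to be normalised by the total weight.

```python
import math


def partial_stats(values, weights):
    """Return (weight, mean, m2) for one worker's share of a bin."""
    w_sum = sum(weights)
    if w_sum == 0:
        return 0.0, 0.0, 0.0
    mean = sum(w * x for x, w in zip(values, weights)) / w_sum
    m2 = sum(w * (x - mean) ** 2 for x, w in zip(values, weights))
    return w_sum, mean, m2


def merge(a, b):
    """Combine two (weight, mean, m2) triples without revisiting data."""
    wa, mean_a, m2a = a
    wb, mean_b, m2b = b
    w = wa + wb
    if w == 0:
        return 0.0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * wb / w
    m2 = m2a + m2b + delta * delta * wa * wb / w
    return w, mean, m2


def finalize(state):
    """Turn a (weight, mean, m2) triple into (weight, mean, variance)."""
    w, mean, m2 = state
    if w == 0:
        return 0.0, math.nan, math.nan
    return w, mean, m2 / w


# Two simulated workers hold different samples for the same bin.
worker_a = partial_stats([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
worker_b = partial_stats([10.0, 20.0], [2.0, 1.0])
weight, mean, var = finalize(merge(worker_a, worker_b))
```

Only three numbers per bin need to be exchanged between processes, regardless of how many data points each one has seen.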
- run(iterator, comm=None, mode='gather')
Run the whole life cycle on an iterator returning data chunks.
This is equivalent to calling add_data repeatedly and then collect.
- Parameters
- iterator: iterator
Iterator yielding (bin, values) or (bin, values, weights)
- comm: MPI comm, optional
The comm, or None for serial
- mode: str, optional
“gather” or “allgather”
- Returns
- weight: array or SparseArray
The total weight or count in each bin
- mean: array or SparseArray
An array of the computed mean for each bin
- variance: array or SparseArray
An array of the computed variance for each bin
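An iterator of the shape described for run might be built from flat per-object arrays as in the sketch below. The helper names chunks_by_bin and unpack are hypothetical, and the code only demonstrates the two tuple shapes; it does not call the library.

```python
def chunks_by_bin(bins, values, weights=None):
    """Yield (bin, values) or (bin, values, weights), one tuple per bin."""
    grouped = {}
    for i, b in enumerate(bins):
        grouped.setdefault(b, []).append(i)
    for b, idx in grouped.items():
        vals = [values[i] for i in idx]
        if weights is None:
            yield b, vals
        else:
            yield b, vals, [weights[i] for i in idx]


def unpack(chunk):
    """Fill in unit weights when a chunk has only two elements."""
    if len(chunk) == 2:
        b, vals = chunk
        return b, vals, [1.0] * len(vals)
    return chunk


chunks = list(chunks_by_bin([0, 2, 0, 2, 2], [1.0, 5.0, 3.0, 6.0, 7.0]))

# A generator like chunks_by_bin(...) is the kind of object that could
# be handed to run() in place of repeated add_data calls.
```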