pyinterp.StreamingHistogram

class pyinterp.StreamingHistogram(values: Union[dask.array.core.Array, numpy.ndarray], weights: Optional[Union[dask.array.core.Array, numpy.ndarray]] = None, axis: Optional[Union[int, Iterable[int]]] = None, bin_count: Optional[int] = None, dtype: Optional[numpy.dtype] = None)

Bases: object

Streaming histogram.

The bins in the histogram have no predefined size: as values are pushed into the histogram, bins are added, then merged as soon as their number exceeds the maximum allowed capacity. A particularly useful property of streaming histograms is that they can approximate quantiles without sorting (or even storing) the individual values. Histograms can also be built independently and then merged, which makes them usable with Dask.
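The add-then-merge behaviour described above can be sketched in pure Python. This is an illustrative toy in the spirit of the Ben-Haim and Tom-Tov paper cited below, not pyinterp's implementation: the class name TinyStreamingHistogram, its method names, and the simplified linear-interpolation quantile rule are all assumptions made for this example.

```python
from bisect import insort


class TinyStreamingHistogram:
    """Toy streaming histogram (hypothetical; not pyinterp's code)."""

    def __init__(self, bin_count=100):
        self.bin_count = bin_count
        self.bins = []  # sorted list of [centroid, weight]

    def push(self, value, weight=1.0):
        if value != value:  # NaN is the only value not equal to itself
            return
        insort(self.bins, [float(value), float(weight)])
        # Over capacity: merge the two adjacent bins with the smallest gap.
        while len(self.bins) > self.bin_count:
            i = min(range(len(self.bins) - 1),
                    key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
            (c1, w1), (c2, w2) = self.bins[i], self.bins[i + 1]
            w = w1 + w2
            # The merged centroid is the weighted mean of the two centroids.
            self.bins[i:i + 2] = [[(c1 * w1 + c2 * w2) / w, w]]

    def count(self):
        return sum(w for _, w in self.bins)

    def mean(self):
        return sum(c * w for c, w in self.bins) / self.count()

    def quantile(self, q=0.5):
        """Simplified estimate: interpolate linearly between centroids."""
        if not self.bins:
            raise ValueError("empty histogram")
        target = q * self.count()
        cumulative = 0.0
        prev_c = prev_pos = None
        for c, w in self.bins:
            pos = cumulative + w / 2  # centroid sits mid-way through its weight
            if target <= pos:
                if prev_c is None:
                    return c
                frac = (target - prev_pos) / (pos - prev_pos)
                return prev_c + frac * (c - prev_c)
            prev_c, prev_pos = c, pos
            cumulative += w
        return self.bins[-1][0]


# Ten values squeezed into at most five bins.
hist = TinyStreamingHistogram(bin_count=5)
for v in range(1, 11):
    hist.push(v)

# Nine values with room to spare: no merging takes place.
exact = TinyStreamingHistogram(bin_count=100)
for v in range(1, 10):
    exact.push(v)
```

Because each merge replaces two centroids by their weighted mean, the total weight and the mean are preserved, while quantiles become approximations whose quality depends on the number of bins kept.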

See also

Yael Ben-Haim and Elad Tom-Tov, A Streaming Parallel Decision Tree Algorithm, Journal of Machine Learning Research, 11, 28, 849-872, http://jmlr.org/papers/v11/ben-haim10a.html

Note

If you do not need to estimate the quantiles of the dataset, use the DescriptiveStatistics class, which gives more accurate results.

__init__(values: Union[dask.array.core.Array, numpy.ndarray], weights: Optional[Union[dask.array.core.Array, numpy.ndarray]] = None, axis: Optional[Union[int, Iterable[int]]] = None, bin_count: Optional[int] = None, dtype: Optional[numpy.dtype] = None) → None

Initializes a new histogram.

Parameters
  • values (numpy.ndarray, dask.Array) –

    Array containing numbers whose statistics are desired.

    Note

    NaNs are automatically ignored.

  • weights (numpy.ndarray, dask.Array, optional) – An array of weights associated with the values. If not provided, all values are assumed to have equal weight.

  • axis (int, iterable, optional) – Axis or axes along which to compute the statistics. If not provided, the statistics are computed over the flattened array.

  • bin_count (int, optional) – The maximum number of bins to use in the histogram. If the number of bins exceeds the number of values, the histogram will be trimmed. Defaults to None, which sets the number of bins to 100.

  • dtype (numpy.dtype, optional) – Data type of the returned array. By default, the data type is numpy.float64.
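To make the roles of values, weights and axis concrete, the following plain NumPy sketch computes a weighted mean that skips NaNs. It does not call pyinterp: the function name weighted_mean_ignoring_nan is hypothetical, and dropping the weight of a NaN value together with the value is an assumption about the intended semantics, not something stated above.

```python
import numpy as np


def weighted_mean_ignoring_nan(values, weights=None, axis=None):
    """Hypothetical helper: weighted mean over `axis`, skipping NaNs."""
    values = np.asarray(values, dtype=np.float64)
    if weights is None:
        # No weights given: every value counts equally.
        weights = np.ones_like(values)
    else:
        weights = np.broadcast_to(
            np.asarray(weights, dtype=np.float64), values.shape)
    valid = ~np.isnan(values)
    # Assumption: a NaN value contributes neither its value nor its weight.
    w = np.where(valid, weights, 0.0)
    v = np.where(valid, values, 0.0)
    # axis=None reduces over the flattened array; an int or a tuple of
    # ints reduces along the given axis or axes.
    return (v * w).sum(axis=axis) / w.sum(axis=axis)


data = np.array([[1.0, np.nan, 3.0],
                 [4.0, 5.0, np.nan]])
overall = weighted_mean_ignoring_nan(data)          # flattened array
per_row = weighted_mean_ignoring_nan(data, axis=1)  # one result per row
```

With axis=1 the result has one entry per row, which mirrors how a per-axis statistic keeps the remaining dimensions of the input.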

Methods

StreamingHistogram.bins()

Returns the histogram bins.

StreamingHistogram.count()

Returns the count of samples.

StreamingHistogram.kurtosis()

Returns the kurtosis of samples.

StreamingHistogram.max()

Returns the maximum of samples.

StreamingHistogram.mean()

Returns the mean of samples.

StreamingHistogram.min()

Returns the minimum of samples.

StreamingHistogram.quantile([q])

Returns the q-th quantile of samples.

StreamingHistogram.size()

Returns the number of bins allocated to calculate the histogram.

StreamingHistogram.skewness()

Returns the skewness of samples.

StreamingHistogram.std()

Returns the standard deviation of samples.

StreamingHistogram.sum_of_weights()

Returns the sum of weights.

StreamingHistogram.var()

Returns the variance of samples.

StreamingHistogram.__iadd__(other)

Adds a new histogram to the current one.
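Merging is what allows histograms built on separate chunks to be combined afterwards. A minimal sketch of the idea, assuming each histogram is represented as a list of (centroid, weight) pairs: the function merge_bins and this representation are hypothetical and are not how pyinterp stores its bins.

```python
def merge_bins(a, b, bin_count):
    """Merge two lists of (centroid, weight) bins into at most bin_count bins."""
    bins = sorted(list(a) + list(b))
    while len(bins) > bin_count:
        # Find the adjacent pair of bins whose centroids are closest.
        i = min(range(len(bins) - 1),
                key=lambda k: bins[k + 1][0] - bins[k][0])
        (c1, w1), (c2, w2) = bins[i], bins[i + 1]
        w = w1 + w2
        # Replace the pair with a single bin at their weighted mean.
        bins[i:i + 2] = [((c1 * w1 + c2 * w2) / w, w)]
    return bins


left = [(1.0, 1.0), (2.0, 1.0), (10.0, 2.0)]
right = [(1.5, 2.0), (11.0, 1.0)]
merged = merge_bins(left, right, bin_count=3)
```

The total weight is unchanged by the merge, so counts accumulated on separate chunks add up exactly; only the bin positions are approximated.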