Descriptive Statistics

NumPy offers many statistical functions, but obtaining several statistics from the same array requires a separate pass over the data for each one. This example shows how to use the DescriptiveStatistics class to obtain several statistics in a single pass. The calculation algorithm is also incremental and more numerically stable.

Note

Pébay, P., Terriberry, T.B., Kolla, H. et al. Numerically stable, scalable formulas for parallel and online computation of higher-order multivariate central moments with arbitrary weights. Comput Stat 31, 1305–1325, 2016, https://doi.org/10.1007/s00180-015-0637-z
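To illustrate what "incremental" means here, the following is a minimal sketch of a single-pass update for the count, mean, variance, minimum and maximum, in the style of Welford's algorithm. It is a simplified illustration, not the implementation used by pyinterp, which follows the paper cited in the note and also handles weights and higher-order moments. The class name `RunningStats` is hypothetical.

```python
import math


class RunningStats:
    """Hypothetical helper: single-pass count, mean, variance, min and max."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def push(self, x):
        """Update every statistic with one new sample."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        # The second factor uses the updated mean, which avoids the
        # cancellation errors of the naive sum-of-squares formula.
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    def var(self):
        """Population variance (divides by the number of samples)."""
        return self.m2 / self.count if self.count else math.nan


stats = RunningStats()
for sample in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.push(sample)
```

Every statistic is updated as each sample arrives, so the data is read only once regardless of how many statistics are requested.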

import dask.array
import numpy
import pyinterp

Create a random array
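A minimal sketch of this step, assuming an array of shape (2, 4, 6, 8). That shape is an assumption, chosen because it is consistent with the count of 384 reported later in this example and with the shape of the weights used in the last section.

```python
import numpy

# Assumed shape: 2 * 4 * 6 * 8 = 384 values drawn uniformly from [0, 1).
values = numpy.random.random_sample((2, 4, 6, 8))
```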

Create a DescriptiveStatistics object.

ds = pyinterp.DescriptiveStatistics(values)

The constructor will calculate the statistical variables on the provided data. The calculated variables are stored in the instance and can be accessed using different methods:

* mean
* var
* std
* skewness
* kurtosis
* min
* max
* sum
* sum_of_weights
* count

ds.count()

Out:

array([384], dtype=uint64)

ds.mean()

Out:

array([0.50090009])

The array method returns a structured NumPy array containing all the calculated statistical variables.

ds.array()

Out:

array([(384, -1.22688568, 0.98888678, 0.50090009, 0.00047593, -0.06125767, 384., 192.34563289, 0.08178515)],
      dtype=[('count', '<u8'), ('kurtosis', '<f8'), ('max', '<f8'), ('mean', '<f8'), ('min', '<f8'), ('skewness', '<f8'), ('sum_of_weights', '<f8'), ('sum', '<f8'), ('var', '<f8')])
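Individual statistics can be read from a structured array by field name. The sketch below rebuilds, with NumPy only, an array with the same dtype and values as the output above, so that it runs without pyinterp, and then reads a few fields. The variable names are chosen for illustration.

```python
import numpy

# Same field layout as the output of ds.array() shown above.
dtype = [('count', '<u8'), ('kurtosis', '<f8'), ('max', '<f8'),
         ('mean', '<f8'), ('min', '<f8'), ('skewness', '<f8'),
         ('sum_of_weights', '<f8'), ('sum', '<f8'), ('var', '<f8')]
stats = numpy.array([(384, -1.22688568, 0.98888678, 0.50090009, 0.00047593,
                      -0.06125767, 384., 192.34563289, 0.08178515)],
                    dtype=dtype)

# Fields are accessed by name; each access returns an array of shape (1,).
mean = stats['mean']
# The standard deviation is not a field here, but it follows from 'var'.
std = numpy.sqrt(stats['var'])
# Consistency check: mean * sum_of_weights matches 'sum' up to the
# rounding of the printed values.
total = stats['mean'] * stats['sum_of_weights']
```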

As with NumPy, statistics can be computed along one or more axes.

ds = pyinterp.DescriptiveStatistics(values, axis=(1, 2))
ds.mean()

Out:

array([[0.46615104, 0.48072785, 0.4498534 , 0.55317595, 0.53770549,
        0.53880215, 0.43562813, 0.47080196],
       [0.47058377, 0.48209916, 0.4537675 , 0.57201367, 0.58647016,
        0.61856793, 0.4369592 , 0.46109401]])
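For comparison, NumPy reductions accept the same kind of `axis` tuple. A minimal sketch, assuming `values` has shape (2, 4, 6, 8): reducing over axes 1 and 2 leaves the (2, 8) shape seen in the output above. Each NumPy call below makes its own pass over the data.

```python
import numpy

values = numpy.random.random_sample((2, 4, 6, 8))  # assumed shape

# Each reduction collapses axes 1 and 2, leaving a (2, 8) result.
mean = values.mean(axis=(1, 2))
var = values.var(axis=(1, 2))
# Number of samples that contribute to each output cell: 4 * 6.
count = values.shape[1] * values.shape[2]
```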

The class can also process a Dask array. In this case, calling the constructor triggers the calculation.

ds = pyinterp.DescriptiveStatistics(dask.array.from_array(values,
                                                          chunks=(2, 2, 2, 2)),
                                    axis=(1, 2))
ds.mean()

Out:

array([[0.46615104, 0.48072785, 0.4498534 , 0.55317595, 0.53770549,
        0.53880215, 0.43562813, 0.47080196],
       [0.47058377, 0.48209916, 0.4537675 , 0.57201367, 0.58647016,
        0.61856793, 0.4369592 , 0.46109401]])
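Incremental statistics suit chunked arrays because partial results computed on separate chunks can be merged without revisiting the data. The sketch below shows the standard pairwise formula for combining the count, mean and sum of squared deviations of two chunks. It is an illustration only and makes no claim about how pyinterp combines Dask chunks; the function names `partial` and `merge` are hypothetical.

```python
def partial(chunk):
    """Return (count, mean, m2) for one chunk.

    m2 is the sum of squared deviations from the chunk mean.
    """
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def merge(a, b):
    """Combine two partial results without touching the original samples."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
# Split into two chunks, reduce each one, then merge the partial results.
n, mean, m2 = merge(partial(data[:3]), partial(data[3:]))
variance = m2 / n  # population variance of the full data set
```

The merged result matches what a single pass over the whole list would give, which is why the chunked and in-memory outputs above agree.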

Finally, it’s possible to calculate weighted statistics.

weights = numpy.random.random_sample((2, 4, 6, 8))
ds = pyinterp.DescriptiveStatistics(values, weights=weights, axis=(1, 2))
ds.mean()

Out:

array([[0.46168515, 0.38710785, 0.47463726, 0.57904657, 0.52838791,
        0.52150582, 0.44790048, 0.48613341],
       [0.44901263, 0.45773856, 0.4094013 , 0.59061284, 0.58009124,
        0.60937434, 0.44758103, 0.48530659]])
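As a cross-check, the weighted mean along the same axes can be written directly with NumPy as the sum of `values * weights` divided by the sum of the weights. A minimal sketch, again assuming both arrays have shape (2, 4, 6, 8):

```python
import numpy

values = numpy.random.random_sample((2, 4, 6, 8))  # assumed shape
weights = numpy.random.random_sample((2, 4, 6, 8))

# Weighted mean over axes 1 and 2: sum(w * x) / sum(w).
sum_of_weights = weights.sum(axis=(1, 2))
weighted_mean = (values * weights).sum(axis=(1, 2)) / sum_of_weights
```

With random inputs the numbers differ from run to run, so only the shape, (2, 8), is expected to match the output above.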
