GitHub - heal-research/vstat: SIMD-enabled descriptive statistics (mean, variance, covariance, correlation)

Vectorized statistics using SIMD primitives

Introduction

is a C++17 library of computationally efficient methods for calculating sample statistics (mean, variance, covariance, correlation).

the implementation builds upon the SIMD abstraction layer provided by the EVE [1]
it uses a data-parallel Youngs and Cramer [2] algorithm for numerically stable computations of sums and sums-of-squares.
the results from independent data partitions are combined with the approach by Schubert and Gertz [3].
the methods are validated for correctness against statistical methods from the GNU Scientific Library [4].

Usage

To use this library you simply need to copy the contents of the include folder inside your project, and then #include <vstat.hpp>. Defining VSTAT_NAMESPACE before inclusion will allow you to set a custom namespace for the library.

Two convenience methods are provided for batch data:

univariate::accumulate for univariate statistics (mean, variance, standard deviation)
bivariate::accumulate for bivariate statistics (covariance, correlation)

The methods return a statistics object which contains all the stat values. For example:

std::vector<float> values{ 1.0, 2.0, 3.0, 4.0 };
std::vector<float> weights{ 2.0, 4.0, 6.0, 8.0 };

// unweighted data
auto stats = univariate::accumulate<float>(values.begin(), values.end());
std::cout << "stats:\n" << stats << "\n";

count:                  4
sum:                    10
ssr:                    5
mean:                   2.5
variance:               1.25
sample variance:        1.66667

// weighted data
auto stats = univariate::accumulate<float>(values.begin(), values.end(), weights.begin());
std::cout << "stats:\n" << stats << "\n";

count:                  20
sum:                    60
ssr:                    20
mean:                   3
variance:               1
sample variance:        1.05263

Besides iterators, it is also possible to provide raw pointers:

float x[] = { 1., 1., 2., 6. };
float y[] = { 2., 4., 3., 1. };
size_t n = std::size(x);

auto stats = bivariate::accumulate<float>(x, y, n);
std::cout << "stats:\n" << stats << "\n";

// results
count:                  4
sum_x:                  10
ssr_x:                  17
mean_x:                 2.5
variance_x:             4.25
sample variance_x:      5.66667
sum_y:                  10
ssr_y:                  5
mean_y:                 2.5
variance_y:             1.25
sample variance_y:      1.66667
correlation:            -0.759257
covariance:             -1.75
sample covariance:      -2.33333

It is also possible to use projections to aggregate stats over object properties:

struct Foo {
    float value;
};

Foo foos[] = { {1}, {3}, {5}, {2}, {8} };
auto stats = univariate::accumulate<float>(foos, std::size(foos), [](auto const& foo) { return foo.value; });
std::cout << "stats:\n" << stats << "\n";

// results
count:                  5
sum:                    19
ssr:                    30.8
mean:                   3.8
variance:               6.16
sample variance:        7.7

struct Foo {
    float value;
};

struct Bar {
    int value;
};

Foo foos[] = { {1}, {3}, {5}, {2}, {8} };
Bar bars[] = { {3}, {2}, {1}, {4}, {11} };

auto stats = bivariate::accumulate<float>(foos, bars, std::size(foos), [](auto const& foo) { return foo.value; },
                                                                       [](auto const& bar) { return bar.value; });
std::cout << "stats:\n" << stats << "\n";

// results
count:                  5
sum_x:                  19
ssr_x:                  30.8
mean_x:                 3.8
variance_x:             6.16
sample variance_x:      7.7
sum_y:                  21
ssr_y:                  62.8
mean_y:                 4.2
variance_y:             12.56
sample variance_y:      15.7
correlation:            0.686676
covariance:             6.04
sample covariance:      7.55

The methods above accept a batch of data and calculate relevant statistics. If the data is streaming, then one can also use accumulators. The accumulator is a lower-level object that is able to perform calculations online as new data arrives:

univariate_accumulator<float> acc;
acc(1.0);
acc(2.0);
acc(3.0);
acc(4.0);
auto stats = univariate_statistics(acc);
std::cout << "stats:\n" << stats << "\n";

Count:                  4
Sum:                    10
Sum of squares:         5
Mean:                   2.5
Variance:               1.25
Sample variance:        1.66667

The template parameter tells the accumulator how to represent data internally.

if a scalar type is provided (e.g. float or double), the accumulator will perform all operations with scalars (i.e., no SIMD).
if a SIMD-type is provided (e.g., eve::wide) then the accumulator will perform data-parallel operations

This allows the user to combine accumulators, for example using a SIMD-enabled accumulator to process the bulk of the data and a scalar accumulator for the left-over points.

Available statistics

univariate

struct univariate_statistics {
    double count;
    double sum;
    double ssr;
    double mean;
    double variance;
    double sample_variance;
};

bivariate

struct bivariate_statistics {
    double count;
    double sum_x;
    double sum_y;
    double ssr_x;
    double ssr_y;
    double sum_xy;
    double mean_x;
    double mean_y;
    double variance_x;
    double variance_y;
    double sample_variance_x;
    double sample_variance_y;
    double correlation;
    double covariance;
    double sample_covariance;
};

Benchmarks

The following libraries have been used for performance comparison in the univariate (variance) and bivariate (covariance) case:

Methodology

we generate 1M values uniformly distributed between [-1, 1] and save them into a double and a float array
increase the data size in 100k increments and benchmark the performance for each method using nanobench

Notes

we did not use MKL as a backend for numpy and gsl (expect MKL performance to be higher)
linasm methods for variance and covariance require precomputed array means, so means computation is factored into the benchmarks
hardware: Ryzen 9 5950X

Acknowledgements

[1] Expressive Vector Engine

[2] Youngs and Cramer - Some Results Relevant to Choice of Sum and Sum-of-Product Algorithms

[3] Schubert and Gertz - Numerically stable parallel computation of (co-)variance

[4] GNU Scientific Library

Name	Name	Last commit message	Last commit date
Latest commit foolnotion update flake Apr 6, 2025 413cbd8 · Apr 6, 2025 History 90 Commits
.github/workflows	.github/workflows	CI: add macos build	Jun 2, 2024
cmake	cmake	vendor in CPM.cmake for easier deps management	Jun 3, 2024
docs	docs	add documentation support using doxygen-awesome	May 29, 2024
include/vstat	include/vstat	add some small doxygen comments	Jun 1, 2024
logo	logo	implement accumulators, update readme and tests	Mar 24, 2021
src	src	implement python module using nanobind	May 29, 2024
test	test	update unit tests	May 29, 2024
vstat	vstat	use scikit-build to build wheels	Jun 2, 2024
.clang-format	.clang-format	restructure project	Nov 15, 2021
.clang-tidy	.clang-tidy	simplify usage of accumulators	Jul 21, 2022
.codespellrc	.codespellrc	restructure project	Nov 15, 2021
.envrc	.envrc	update .envrc and .gitignore	May 29, 2024
.gitignore	.gitignore	update .envrc and .gitignore	May 29, 2024
BUILDING.md	BUILDING.md	restructure project	Nov 15, 2021
CMakeLists.txt	CMakeLists.txt	update build, use CPM.cmake to manage deps	Jun 2, 2024
CMakePresets.json	CMakePresets.json	cmake: update presets version	May 29, 2024
CODE_OF_CONDUCT.md	CODE_OF_CONDUCT.md	restructure project	Nov 15, 2021
CONTRIBUTING.md	CONTRIBUTING.md	restructure project	Nov 15, 2021
HACKING.md	HACKING.md	restructure project	Nov 15, 2021
LICENSE	LICENSE	update license headers and some unit tests	May 12, 2021
README.md	README.md	README.md: add CI badges	Jun 2, 2024
flake.lock	flake.lock	update flake	Apr 6, 2025
flake.nix	flake.nix	update flake	Apr 6, 2025
pyproject.toml	pyproject.toml	use scikit-build to build wheels	Jun 2, 2024
setup.py	setup.py	use scikit-build to build wheels	Jun 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Vectorized statistics using SIMD primitives

Introduction

Usage

Available statistics

Benchmarks

Methodology

Notes

Acknowledgements

About

Releases

Packages

Contributors 2

Languages

License

heal-research/vstat

Folders and files

Latest commit

History

Repository files navigation

Vectorized statistics using SIMD primitives

Introduction

Usage

Available statistics

Benchmarks

Methodology

Notes

Acknowledgements

About

Topics

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages