Commit ab7a454

tbenthompson and xhochy authored

Rename (#124)

* Reorganize documentation.
* Benchmarks docs and update changelog.
* API docs upgrade, improved the motivation in the README.
* visualize_benchmark jupytext
* Fixing plotting in benchmarks notebook.
* Unrestricted sparse matrix optimization.
* Changelog
* Handle out parameter in unrestricted.
* Headline figure.
* Changelog and fix test errors.
* Fix doctest.
* Rename.
* Changelog.
* Add conda-forge channel to macos tests.
* Ensure the correct macos deployment target is set (#123)

Co-authored-by: Uwe L. Korn <[email protected]>

1 parent 2ca4723 commit ab7a454


67 files changed: +857 −673 lines

.flake8 (+2)

@@ -11,6 +11,8 @@ ignore =
     D104,
     D100,
     D105, # missing docstring in magic method, unecessary
+    D205,
+    D400
 max-line-length = 88
 max-complexity = 18
 select = B,C,E,F,W,T4,B9,D

.github/CODEOWNERS (+1 −1)

@@ -2,7 +2,7 @@
 * @tbenthompson @MarcAntoineSchmidtQC
 
 # Core
-/src/quantcore/matrix/ @MarcAntoineSchmidtQC
+/src/tabmat/ @MarcAntoineSchmidtQC
 
 # Cython / C++
 *.pyx @tbenthompson

.github/workflows/macos.sh (+1 −1)

@@ -5,9 +5,9 @@ set -exo pipefail
 source ~/.profile
 mamba install -y yq jq
 
-mamba install -y yq
 yq -Y ". + {dependencies: [.dependencies[], \"python=${PYTHON_VERSION}\"] }" environment.yml > /tmp/environment.yml
 mamba env create -f /tmp/environment.yml
 conda activate $(yq -r .name environment.yml)
+export MACOSX_DEPLOYMENT_TARGET=10.9
 pip install --no-use-pep517 --no-deps --disable-pip-version-check -e .
 pytest tests --doctest-modules src/
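The `yq -Y` line in this script rewrites `environment.yml` so the dependency list gains a pinned `python` entry. A rough stdlib-only sketch of the same transformation, expressed on an already-parsed document (the toy `env` dict and version string are hypothetical stand-ins; the real `environment.yml` contents are not shown in this diff):

```python
# Mimic the jq program ". + {dependencies: [.dependencies[], \"python=${PYTHON_VERSION}\"]}"
# applied to a parsed environment.yml. Toy contents; hypothetical.
python_version = "3.9"

env = {"name": "tabmat", "dependencies": ["numpy", "scipy"]}

# jq's [.dependencies[], "python=..."] splats the old list and appends one entry;
# ". + {...}" merges it back, leaving the original document untouched otherwise.
pinned = {**env, "dependencies": [*env["dependencies"], f"python={python_version}"]}

print(pinned["dependencies"])  # ['numpy', 'scipy', 'python=3.9']
```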

.github/workflows/tests-macos.yml (+1 −1)

@@ -25,7 +25,7 @@ jobs:
       - uses: conda-incubator/setup-miniconda@35d1405e78aa3f784fe3ce9a2eb378d5eeb62169
         with:
           miniforge-variant: Mambaforge
-          miniforge-version: 4.10.0-0
+          miniforge-version: 4.10.3-6
           use-mamba: true
       - name: Run Unit Tests
         shell: bash -l {0}

.github/workflows/tests-win-master.yml (+1 −1)

@@ -29,7 +29,7 @@ jobs:
           miniforge-version: 4.10.0-0
           use-mamba: true
           environment-file: environment-win.yml
-          activate-environment: quantcore.matrix
+          activate-environment: tabmat
       - name: Run Unit Tests
         shell: pwsh
         run: |

.github/workflows/tests-win.yml (+1 −1)

@@ -31,7 +31,7 @@ jobs:
           miniforge-version: 4.10.0-0
           use-mamba: true
           environment-file: environment-win.yml
-          activate-environment: quantcore.matrix
+          activate-environment: tabmat
       - name: Run Unit Tests
         shell: pwsh
         run: |

CHANGELOG.rst (+16 −7)

@@ -10,18 +10,27 @@ Changelog
 Unreleased
 ----------
 
+**Breaking changes**:
+
+- The package has been renamed to ``tabmat``. CELEBRATE!
+- The :func:`one_over_var_inf_to_val` function has been made private.
+- The :func:`csc_to_split` function has been re-named to :func:`tabmat.from_csc` to match the :func:`tabmat.from_pandas` function.
+- The :meth:`tabmat.MatrixBase.get_col_means` and :meth:`tabmat.MatrixBase.get_col_stds` methods have been made private.
+- The :meth:`cross_sandwich` method has also been made private.
+
 **Bug fix**
 
 - :func:`StandardizedMatrix.transpose_matvec` was giving the wrong answer when the `out` parameter was provided. This is now fixed.
 - :func:`SplitMatrix.__repr__` now calls the `__repr__` method of component matrices instead of `__str__`.
 
 **Other changes**
 
-- Optimized the :meth:`quantcore.matrix.SparseMatrix.matvec` and :meth:`quantcore.matrix.SparseMatrix.tranpose_matvec` for when ``rows`` and ``cols`` are None.
+- Optimized the :meth:`tabmat.SparseMatrix.matvec` and :meth:`tabmat.SparseMatrix.tranpose_matvec` for when ``rows`` and ``cols`` are None.
 - Implemented :func:`CategoricalMatrix.__rmul__`
+- Reorganizing the documentation and updating the text to match the current API.
 - Enable indexing the rows of a ``CategoricalMatrix``. Previously :func:`CategoricalMatrix.__getitem__` only supported column indexing.
 - Allow creating a ``SplitMatrix`` from a list of any ``MatrixBase`` objects including another ``SplitMatrix``.
-- Reduced memory usage in :meth:`quantcore.matrix.SplitMatrix.matvec`.
+- Reduced memory usage in :meth:`tabmat.SplitMatrix.matvec`.
 
 2.0.3 - 2021-07-15
 ------------------

@@ -57,16 +66,16 @@ Split matrices now also work on Windows.
 
 **Breaking changes**:
 
-We renamed several public functions to make them private. These include functions in :mod:`quantcore.matrix.benchmark` that are unlikely to be used outside of this package as well as
+We renamed several public functions to make them private. These include functions in :mod:`tabmat.benchmark` that are unlikely to be used outside of this package as well as
 
-- :func:`quantcore.matrix.dense_matrix._matvec_helper`
-- :func:`quantcore.matrix.sparse_matrix._matvec_helper`.
-- :func:`quantcore.matrix.split_matrix._prepare_out_array`.
+- :func:`tabmat.dense_matrix._matvec_helper`
+- :func:`tabmat.sparse_matrix._matvec_helper`.
+- :func:`tabmat.split_matrix._prepare_out_array`.
 
 
 **Other changes**:
 
-- We removed the dependency on ``sparse_dot_mkl``. We now use :func:`scipy.sparse.csr_matvec` instead of :func:`sparse_dot_mkl.dot_product_mkl` on all platforms, because the former suffered from poor performance, especially on narrow problems. This also means that we removed the function :func:`quantcore.matrix.sparse_matrix._dot_product_maybe_mkl`.
+- We removed the dependency on ``sparse_dot_mkl``. We now use :func:`scipy.sparse.csr_matvec` instead of :func:`sparse_dot_mkl.dot_product_mkl` on all platforms, because the former suffered from poor performance, especially on narrow problems. This also means that we removed the function :func:`tabmat.sparse_matrix._dot_product_maybe_mkl`.
 - We updated the pre-commit hooks and made sure the code is line with the new hooks.

README.md (+33 −210)

@@ -1,6 +1,6 @@
 # Efficient matrix representations for working with tabular data
 
-![CI](https://github.com/Quantco/quantcore.matrix/workflows/CI/badge.svg)
+![CI](https://github.com/Quantco/tabmat/workflows/CI/badge.svg)
 
 ## Installation
 For development, you should do an editable installation:
@@ -11,240 +11,63 @@ conda config --add channels conda-forge
 # And install pre-commit
 conda install -y pre-commit
 
-git clone [email protected]:Quantco/quantcore.matrix.git
-cd quantcore.matrix
+git clone [email protected]:Quantco/tabmat.git
+cd tabmat
 
 # Set up our pre-commit hooks for black, mypy, isort and flake8.
 pre-commit install
 
-# Set up the ***REMOVED*** conda channel. For the password, substitute in the correct password. You should be able to get the password by searching around on slack
-conda config --system --prepend channels ***REMOVED***
-conda config --system --set custom_channels.***REMOVED*** https://***REMOVED***:password@conda.***REMOVED***
-
-# Set up a conda environment with name "quantcore.matrix"
+# Set up a conda environment with name "tabmat"
 conda install mamba=0.2.12
 mamba env create
 
 # Install this package in editable mode.
-conda activate quantcore.matrix
+conda activate tabmat
 pip install --no-use-pep517 --disable-pip-version-check -e .
 ```
 
+<img src="docs/_static/headline.png" width="600px">
+
 ## Use case
-Data used in economics, actuarial science, and many other fields is often tabular,
-containing rows and columns. Further properties are also common:
-- Tabular data often contains categorical data,
-  often represented after processing as many columns of indicator values
-  created by "one-hot encoding."
-- It often contains a mix of dense columns and sparse columns,
-  perhaps due to one-hot encoding.
-- It often is very sparse.
-
-High-performance statistical applications often require fast computation of certain
-operations, such as
-- Operating on one column at a time
-- Computing "sandwich products" of the data, `transpose(X) @ diag(d) @ X`. A sandwich
-  product shows up in the solution to Weighted Least Squares, as well as in the Hessian
-  of the likelihood in Generalized Linear Models such as Poisson regression.
-- Matrix-vector products
-
-Additionally, it is often desirable to normalize predictors for greater optimizer
-efficiency and numerical stability in
-Coordinate Descent and in other machine learning algorithms.
+
+TL;DR: We provide matrix classes for efficiently building statistical algorithms with data that is partially dense, partially sparse and partially categorical.
+
+Data used in economics, actuarial science, and many other fields is often tabular, containing rows and columns. Further common properties are also common:
+- It often is **very sparse**.
+- It often contains **a mix of dense and sparse** columns.
+- It often contains **categorical data**, processed into many columns of indicator values created by "one-hot encoding."
+
+High-performance statistical applications often require fast computation of certain operations, such as
+- Computing **sandwich products** of the data, ``transpose(X) @ diag(d) @ X``. A sandwich product shows up in the solution to weighted least squares, as well as in the Hessian of the likelihood in generalized linear models such as Poisson regression.
+- **Matrix-vector products**, possibly on only a subset of the rows or columns. For example, when limiting computation to an "active set" in a L1-penalized coordinate descent implementation, we may only need to compute a matrix-vector product on a small subset of the columns.
+- Computing all operations on **standardized predictors** which have mean zero and standard deviation one. This helps with numerical stability and optimizer efficiency in a wide range of machine learning algorithms.
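The sandwich product named above can be written directly in NumPy. A small self-checking sketch with made-up data (plain NumPy, not this library's optimized kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))  # toy data matrix
d = rng.random(6)                # per-row weights

# transpose(X) @ diag(d) @ X, without materializing the n x n diagonal matrix:
sandwich = X.T @ (X * d[:, None])

# Same result via the literal definition.
reference = X.T @ np.diag(d) @ X
assert np.allclose(sandwich, reference)
```

Scaling the rows of `X` by `d` and multiplying costs O(nk + nk²) instead of the O(n²) storage the explicit diagonal would need.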
 
 ## This library and its design
 
-We designed this library with these use cases in mind. We built this library first for
-estimating Generalized Linear Models, but expect it will be useful in a variety of
-econometric and statistical use cases. This library was borne out of our need for
-speed, and its unified API is motivated by the annoyance by having to write repeated
-checks for which type of matrix-like object you are operating on.
+We designed this library with the above use cases in mind. We built this library first for estimating generalized linear models, but expect it will be useful in a variety of econometric and statistical use cases. This library was borne out of our need for speed, and its unified API is motivated by the desire to work with a unified matrix API internal to our statistical algorithms.
 
 Design principles:
 - Speed and memory efficiency are paramount.
-- You don't need to sacrifice functionality by using this library: DenseMatrix
-  and SparseMatrix subclass Numpy arrays and Scipy csc sparse matrices, respectively,
-  and inherit their behavior wherever it is not improved on.
-- As much as possible, syntax follows Numpy syntax, and dimension-reducing
-  operations (like `sum`) return Numpy arrays, following Numpy dimensions
-  about the dimensions of results. The aim is to make these classes
-  as close as possible to being drop-in replacements for numpy ndarray.
-  This is not always possible, however, due to the differing APIs of numpy ndarray
-  and scipy sparse.
+- You don't need to sacrifice functionality by using this library: `DenseMatrix` and `SparseMatrix` subclass `np.ndarray` and `scipy.sparse.csc_matrix` respectively, and inherit behavior from those classes wherever it is not improved on.
+- As much as possible, syntax follows NumPy syntax, and dimension-reducing operations (like `sum`) return NumPy arrays, following NumPy dimensions about the dimensions of results. The aim is to make these classes as close as possible to being drop-in replacements for ``numpy.ndarray``. This is not always possible, however, due to the differing APIs of ``numpy.ndarray`` and ``scipy.sparse``.
 - Other operations, such as `toarray`, mimic Scipy sparse syntax.
-- All matrix classes support matrix products, sandwich products, and `getcol`.
+- All matrix classes support matrix-vector products, sandwich products, and `getcol`.
+
 Individual subclasses may support significantly more operations.
 
 ## Matrix types
-- `DenseMatrix` represents dense matrices, subclassing numpy nparray.
-  It additionally supports methods `getcol`, `toarray`, `sandwich`, `standardize`,
-  and `unstandardize`.
-- `SparseMatrix` represents column-major sparse data, subclassing
-  `scipy.sparse.csc_matrix`. It additionally supports methods `sandwich`
-  and `standardize`, and it's `dot` method (e.g. `@`) calls MKL's sparse dot product
-  in the case of matrix-vector products, which is faster.
-- `ColScaledSpMat` represents the sum of an n x k sparse matrix and a matrix
-  of the form `ones((n, 1)) x shift`, where `shift` is `1 x k`. In other words,
-  a matrix with a column-specific shifter applied. Such a matrix is dense, but
-  `ColScaledSpMat` represents the sparse matrix and `shift` separately, allowing for
-  efficient storage and computations.
-- `SplitMatrix` represents matrices with both sparse and dense parts, allowing for
-  a significant speedup in matrix multiplications.
-
-## Benchmarks
-To generate the data to run all benchmarks, run
-`python src/quantcore/matrix/benchmark/generate_matrices.py`.
-
-For more info on the benchmark CLI:
-`python src/quantcore/matrix/benchmark/main.py --help`.
-
-## Categorical data
-One-hot encoding a feature creates a sparse matrix that has some special properties:
-All of its nonzero elements are ones, and since each element starts a new row, it's `indptr`,
-which indicates where rows start and end, will increment by 1 every time.
-
-### Storage
-#### csr
-```
->>> import numpy as np
->>> from scipy import sparse
->>> import pandas as pd
-
->>> arr = [1, 0, 1]
->>> dummies = pd.get_dummies(arr)
->>> csr = sparse.csr_matrix(dummies.values)
->>> csr.data
-array([1, 1, 1], dtype=uint8)
->>> csr.indices
-array([1, 0, 1], dtype=int32)
->>> csr.indptr
-array([0, 1, 2, 3], dtype=int32)
-```
-
-The size of this matrix, if the original array is of length `n`, is `n` bytes for the
-data (stored as quarter-precision integers), `4n` for `indices`, and `4(n+1)` for
-`indptr`. However, if we know the matrix results from one-hot encoding, we only need to
-store the `indices`, so we can reduce memory usage to slightly less than 4/9 of the
-original.
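The storage argument in the deleted README text can be made concrete without pandas or scipy: for a one-hot CSR matrix, `data` and `indptr` are fully determined, so `indices` alone reconstructs everything. A sketch with the same toy `arr = [1, 0, 1]` example:

```python
import numpy as np

# One-hot encoding of arr = [1, 0, 1]: row i has a single 1 in column indices[i].
indices = np.array([1, 0, 1], dtype=np.int32)
n = len(indices)

# The other two CSR arrays carry no information for one-hot data:
data = np.ones(n, dtype=np.uint8)          # always all ones
indptr = np.arange(n + 1, dtype=np.int32)  # always 0, 1, 2, ..., n

# Full CSR storage: 1 byte/row of data + 4 bytes/row of indices + 4*(n+1) of indptr.
full_bytes = data.nbytes + indices.nbytes + indptr.nbytes
indices_only_bytes = indices.nbytes
print(full_bytes, indices_only_bytes)  # 31 12
```

For n = 3 that is 31 bytes down to 12; asymptotically 9n + 4 bytes down to 4n, the "slightly less than 4/9" figure above.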
-
-#### csc storage
-The case is not quite so simple for csc (column-major) sparse matrices.
-However, we still do not need to store the data.
-
-```
->>> import numpy as np
->>> from scipy import sparse
->>> import pandas as pd
-
->>> arr = [1, 0, 1]
->>> dummies = pd.get_dummies(arr)
->>> csc = sparse.csc_matrix(dummies.values)
->>> csc.data
-array([1, 1, 1], dtype=uint8)
->>> csc.indices
-array([1, 0, 2], dtype=int32)
->>> csc.indptr
-array([0, 1, 3], dtype=int32)
-```
-
-### Computations
-
-#### Matrix multiplication
-
-A general sparse CSR matrix-vector products in psedocode,
-modeled on [scipy sparse](https://github.com/scipy/scipy/blob/1dc960a33b000b95b1e399582c154efc0360a576/scipy/sparse/sparsetools/csr.h#L1120):
-```
->>> def matvec(mat, vec):
->>>     n_row = mat.shape[0]
->>>     res = np.zeros(n_row)
->>>     for i in range(n_row):
->>>         for j in range(mat.indptr[i], mat.indptr[i+1]):
->>>             res[i] += mat.data[j] * vec[mat.indices[j]]
->>>     return res
-```
-With a CSR categorical matrix, `data` is all 1 and `j` always equals `i`, so we can
-simplify this function to be
-```
->>> def matvec(mat, vec):
->>>     n_row = mat.shape[0]
->>>     res = np.zeros(n_row)
->>>     for i in range(n_row):
->>>         res[i] = vec[mat.indices[j]]
->>>     return res
-```
-The original function involved `6N` lookups, `N` multiplications, and `N` additions,
-while the new function involves only `3N` lookups. It thus has the potential to be
-significantly faster.
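A runnable version of the two matvec routines in the deleted text above (note: the simplified pseudocode's `res[i] = vec[mat.indices[j]]` has a stray `j`; for a one-hot row it should index with `i`). The toy one-hot matrix is the `[1, 0, 1]` example from the storage section:

```python
import numpy as np
from types import SimpleNamespace

def matvec_generic(mat, vec):
    # General CSR matrix-vector product, as in the scipy-style pseudocode.
    n_row = len(mat.indptr) - 1
    res = np.zeros(n_row)
    for i in range(n_row):
        for j in range(mat.indptr[i], mat.indptr[i + 1]):
            res[i] += mat.data[j] * vec[mat.indices[j]]
    return res

def matvec_categorical(mat, vec):
    # One-hot shortcut: data is all ones and indptr is 0..n, so row i's only
    # nonzero sits at position i (the pseudocode's stray `j` becomes `i`).
    return vec[mat.indices]

# One-hot encoding of [1, 0, 1] in CSR form.
mat = SimpleNamespace(
    data=np.ones(3),
    indices=np.array([1, 0, 1]),
    indptr=np.arange(4),
)
vec = np.array([10.0, 20.0])
assert np.array_equal(matvec_generic(mat, vec), matvec_categorical(mat, vec))
```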
-#### sandwich: X.T @ diag(d) @ X
+- `DenseMatrix` represents dense matrices, subclassing numpy nparray. It additionally supports methods `getcol`, `toarray`, `sandwich`, `standardize`, and `unstandardize`.
+- `SparseMatrix` represents column-major sparse data, subclassing `scipy.sparse.csc_matrix`. It additionally supports methods `sandwich` and `standardize`.
+- `CategoricalMatrix` represents one-hot encoded categorical matrices. Because all the non-zeros in these matrices are ones and because each row has only one non-zero, the data can be represented and multiplied much more efficiently than a generic sparse matrix.
+- `SplitMatrix` represents matrices with both dense, sparse and categorical parts, allowing for a significant speedup in matrix multiplications.
+- `StandardizedMatrix` efficiently and sparsely represents a matrix that has had its column normalized to have mean zero and variance one. Even if the underlying matrix is sparse, such a normalized matrix will be dense. However, by storing the scaling and shifting factors separately, `StandardizedMatrix` retains the original matrix sparsity.
 
-![Narrow data set](images/narrow_data_sandwich.png)
-![Medium-width data set](images/intermediate_data_sandwich.png)
 ![Wide data set](images/wide_data_sandwich.png)
 
-Sandwich products can be computed very efficiently.
-```
-sandwich(X, d)[i, j] = sum_k X[k, i] d[k] X[k, j]
-```
-If `i != j`, `sum_k X[k, i] d[k] X[k, j]` = 0. In other words, since
-categorical matrices have only one nonzero per row, the sandwich product is diagonal.
-If `i = j`,
-```
-sandwich(X, d)[i, j] = sum_k X[k, i] d[k] X[k, i]
-                     = sum_k X[k, i] d[k]
-                     = d[X[:, i]].sum()
-                     = (X.T @ d)[i]
-```
-
-So `sandwich(X, d) = diag(X.T @ d)`. This will be especially efficient if `X` is
-available in CSC format. Pseudocode for this sandwich product is
-```
-res = np.zeros(n_cols)
-for i in range(n_cols):
-    for j in range(X.indptr[i], X.indptr[i + 1]):
-        val += d[indices[j]]
-return np.diag(res)
-```
-
-This function is ext/categorical/sandwich_categorical
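Since `sandwich(X, d) = diag(X.T @ d)` for one-hot `X`, the whole product reduces to one weighted sum per category. A NumPy sketch with made-up indices, verified against the dense definition (the deleted pseudocode's `val +=` accumulator should read `res[i] +=`):

```python
import numpy as np

# One-hot matrix stored as one category index per row (toy data).
indices = np.array([1, 0, 1, 2, 0])
d = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
n_cols = 3

# Diagonal entries: sum of d over the rows belonging to each category.
diag = np.bincount(indices, weights=d, minlength=n_cols)

# Check against the dense definition X.T @ diag(d) @ X.
X = np.eye(n_cols)[indices]
assert np.allclose(np.diag(diag), X.T @ np.diag(d) @ X)
print(diag)
```

`np.bincount` does the scatter-sum in one pass over the rows, which is the same access pattern as the CSC pseudocode above.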
-
-#### Cross-sandwich: X.T @ diag(d) @ Y, Y categorical
-If X and Y are different categorical matrices in csr format,
-X.T @ diag(d) @ Y is given by
-```
-res = np.zeros((X.shape[1], Y.shape[1]))
-for k in range(len(d)):
-    res[X.indices[k], Y.indices[k]] += d[k]
-```
-So the result will be sparse with at most N elements.
-This function is given by `ext/split/_sandwich_cat_cat`.
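The cat-cat loop above vectorizes with `np.add.at` (unbuffered scatter-add, so repeated index pairs accumulate correctly). A sketch with toy indices, verified against the dense formula:

```python
import numpy as np

x_indices = np.array([0, 1, 1, 2])  # categories of X, one per row (toy data)
y_indices = np.array([1, 1, 0, 1])  # categories of Y, one per row (toy data)
d = np.array([1.0, 2.0, 3.0, 4.0])

res = np.zeros((3, 2))
# Equivalent to: for k in range(len(d)): res[x_indices[k], y_indices[k]] += d[k]
np.add.at(res, (x_indices, y_indices), d)

X = np.eye(3)[x_indices]
Y = np.eye(2)[y_indices]
assert np.allclose(res, X.T @ np.diag(d) @ Y)
```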
-
-#### Cross-sandwich: X.T @ diag(d) @ Y, Y dense
-```
-res = np.zeros((X.shape[1], Y.shape[1]))
-for k in range(n_rows):
-    for j in range(Y.shape[1]):
-        res[X.indices[k], j] += d[k] * Y[k, j]
-```
-This is `ext/split/sandwich_cat_dense`
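The cat-dense case is likewise a scatter-add, this time of whole weighted rows of `Y` into the rows of the result. A sketch with toy data, again checked against the dense formula:

```python
import numpy as np

x_indices = np.array([0, 1, 1, 2])  # categorical X as one index per row (toy data)
Y = np.arange(8.0).reshape(4, 2)    # toy dense matrix
d = np.array([1.0, 2.0, 3.0, 4.0])

res = np.zeros((3, 2))
# Each row k of Y, scaled by d[k], lands in result row x_indices[k].
np.add.at(res, x_indices, d[:, None] * Y)

X = np.eye(3)[x_indices]
assert np.allclose(res, X.T @ np.diag(d) @ Y)
```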
-
-
-## Performance
-Dense matrix, 100k x 1k:
-
-![dense_bm](src/quantcore/matrix/benchmark/dense_times.png)
-
-One-hot encoded categorical variable, 1M x 100k:
-
-![cat_bm](src/quantcore/matrix/benchmark/one_cat_times.png)
-
-Sparse matrix, 1M x 1k:
-
-![sparse_bm](src/quantcore/matrix/benchmark/sparse_times.png)
-
-Two categorical matrices, 1M x 2k:
+## Benchmarks
 
-![two_cat_bm](src/quantcore/matrix/benchmark/two_cat_times.png)
+[See here for detailed benchmarking.](https://docs.dev.***REMOVED***/***REMOVED***/Quantco/tabmat/latest/benchmarks.html)
 
-Two categorical matrices plus a dense matrix, 1M x 2k+:
+## API documentation
 
-![two_cat_plus_dense_bm](src/quantcore/matrix/benchmark/dense_cat_times.png)
+[See here for detailed API documentation.](https://docs.dev.***REMOVED***/***REMOVED***/Quantco/tabmat/latest/api/modules.html)

build_and_launch (+1 −1)

@@ -4,6 +4,6 @@ set -e
 CONDA_BASE=$(conda info --base)
 source ${CONDA_BASE}/etc/profile.d/conda.sh
 
-conda activate quantcore.matrix
+conda activate tabmat
 python setup.py build_ext --inplace
 exec "$@"

0 commit comments
