CHANGELOG.rst
@@ -10,18 +10,27 @@ Changelog
Unreleased
----------
**Breaking changes**:
- The package has been renamed to ``tabmat``. CELEBRATE!
- The :func:`one_over_var_inf_to_val` function has been made private.
- The :func:`csc_to_split` function has been renamed to :func:`tabmat.from_csc` to match the :func:`tabmat.from_pandas` function.
- The :meth:`tabmat.MatrixBase.get_col_means` and :meth:`tabmat.MatrixBase.get_col_stds` methods have been made private.
- The :meth:`cross_sandwich` method has also been made private.
**Bug fix**
- :func:`StandardizedMatrix.transpose_matvec` was giving the wrong answer when the `out` parameter was provided. This is now fixed.
- :func:`SplitMatrix.__repr__` now calls the `__repr__` method of component matrices instead of `__str__`.
**Other changes**
- Optimized the :meth:`tabmat.SparseMatrix.matvec` and :meth:`tabmat.SparseMatrix.transpose_matvec` methods for the case where ``rows`` and ``cols`` are None.
- Implemented :func:`CategoricalMatrix.__rmul__`.
- Reorganized the documentation and updated the text to match the current API.
- Enable indexing the rows of a ``CategoricalMatrix``. Previously :func:`CategoricalMatrix.__getitem__` only supported column indexing.
- Allow creating a ``SplitMatrix`` from a list of any ``MatrixBase`` objects including another ``SplitMatrix``.
- Reduced memory usage in :meth:`tabmat.SplitMatrix.matvec`.
2.0.3 - 2021-07-15
------------------
@@ -57,16 +66,16 @@ Split matrices now also work on Windows.
**Breaking changes**:
We renamed several public functions to make them private. These include functions in :mod:`tabmat.benchmark` that are unlikely to be used outside of this package as well as
- We removed the dependency on ``sparse_dot_mkl``. We now use :func:`scipy.sparse.csr_matvec` instead of :func:`sparse_dot_mkl.dot_product_mkl` on all platforms, because the latter suffered from poor performance, especially on narrow problems. This also means that we removed the function :func:`tabmat.sparse_matrix._dot_product_maybe_mkl`.
- We updated the pre-commit hooks and made sure the code is in line with the new hooks.
# Set up our pre-commit hooks for black, mypy, isort and flake8.
pre-commit install
TL;DR: We provide matrix classes for efficiently building statistical algorithms with data that is partially dense, partially sparse and partially categorical.

Data used in economics, actuarial science, and many other fields is often tabular, containing rows and columns. Several further properties are also common:
- It is often **very sparse**.
- It often contains **a mix of dense and sparse** columns.
- It often contains **categorical data**, processed into many columns of indicator values created by "one-hot encoding."

High-performance statistical applications often require fast computation of certain operations, such as
- Computing **sandwich products** of the data, ``transpose(X) @ diag(d) @ X``. A sandwich product shows up in the solution to weighted least squares, as well as in the Hessian of the likelihood in generalized linear models such as Poisson regression (see the sketch after this list).
- **Matrix-vector products**, possibly on only a subset of the rows or columns. For example, when limiting computation to an "active set" in an L1-penalized coordinate descent implementation, we may only need to compute a matrix-vector product on a small subset of the columns.
- Computing all operations on **standardized predictors**, which have mean zero and standard deviation one. This helps with numerical stability and optimizer efficiency in a wide range of machine learning algorithms.
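As a quick illustration of the sandwich product, here is a minimal NumPy sketch (not this library's optimized implementation, which avoids densifying):

```python
import numpy as np

X = np.array([[1.0, 0.0], [2.0, 3.0], [0.0, 1.0]])
d = np.array([0.5, 1.0, 2.0])

# Naive sandwich product: transpose(X) @ diag(d) @ X.
naive = X.T @ np.diag(d) @ X

# Same result without materializing diag(d): scale the rows of X by d first.
fast = X.T @ (d[:, None] * X)

assert np.allclose(naive, fast)
```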
## This library and its design
We designed this library with the above use cases in mind. We built this library first for estimating generalized linear models, but expect it will be useful in a variety of econometric and statistical use cases. This library was borne out of our need for speed, and its unified API is motivated by the desire to work with a unified matrix API internal to our statistical algorithms.
Design principles:
- Speed and memory efficiency are paramount.
- You don't need to sacrifice functionality by using this library: `DenseMatrix` and `SparseMatrix` subclass `np.ndarray` and `scipy.sparse.csc_matrix` respectively, and inherit behavior from those classes wherever it is not improved on.
- As much as possible, syntax follows NumPy syntax, and dimension-reducing operations (like `sum`) return NumPy arrays, following NumPy conventions about the dimensions of results. The aim is to make these classes as close as possible to being drop-in replacements for `numpy.ndarray`. This is not always possible, however, due to the differing APIs of `numpy.ndarray` and `scipy.sparse`.
- Other operations, such as `toarray`, mimic SciPy sparse syntax.
- All matrix classes support matrix-vector products, sandwich products, and `getcol`.
Individual subclasses may support significantly more operations.
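For example, here is a hedged sketch of how the pieces might fit together; `tabmat.from_pandas` is the entry point named in the changelog above, but the exact signatures may differ from this illustration:

```python
import numpy as np
import pandas as pd
import tabmat

# A small frame mixing a dense numeric column with a categorical column.
df = pd.DataFrame(
    {
        "age": [25.0, 32.0, 47.0],
        "region": pd.Categorical(["north", "south", "north"]),
    }
)

# from_pandas chooses an efficient representation per column,
# typically returning a SplitMatrix with dense and categorical parts.
X = tabmat.from_pandas(df)

d = np.array([0.5, 1.0, 2.0])
v = np.ones(X.shape[1])

h = X.sandwich(d)  # transpose(X) @ diag(d) @ X
y = X.matvec(v)    # X @ v
```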
## Matrix types

- `DenseMatrix` represents dense matrices, subclassing `numpy.ndarray`. It additionally supports methods `getcol`, `toarray`, `sandwich`, `standardize`, and `unstandardize`.
- `SparseMatrix` represents column-major sparse data, subclassing `scipy.sparse.csc_matrix`. It additionally supports methods `sandwich` and `standardize`.
- `CategoricalMatrix` represents one-hot encoded categorical matrices. Because all the non-zeros in these matrices are ones and because each row has only one non-zero, the data can be represented and multiplied much more efficiently than a generic sparse matrix.
- `SplitMatrix` represents matrices with dense, sparse, and categorical parts, allowing for a significant speedup in matrix multiplications.
- `StandardizedMatrix` efficiently and sparsely represents a matrix that has had its columns normalized to have mean zero and variance one. Even if the underlying matrix is sparse, such a normalized matrix will be dense. However, by storing the scaling and shifting factors separately, `StandardizedMatrix` retains the original matrix's sparsity.

## Categorical matrices

One-hot encoding a feature creates a sparse matrix that has some special properties: all of its nonzero elements are ones, and since each row contains exactly one nonzero element, its `indptr`, which indicates where rows start and end, will increment by 1 every time.
### Storage
#### csr storage
```
>>> import numpy as np
>>> from scipy import sparse
>>> import pandas as pd

>>> arr = [1, 0, 1]
>>> dummies = pd.get_dummies(arr)
>>> csr = sparse.csr_matrix(dummies.values)
>>> csr.data
array([1, 1, 1], dtype=uint8)
>>> csr.indices
array([1, 0, 1], dtype=int32)
>>> csr.indptr
array([0, 1, 2, 3], dtype=int32)
```
The size of this matrix, if the original array is of length `n`, is `n` bytes for the data (stored as 8-bit unsigned integers), `4n` for `indices`, and `4(n+1)` for `indptr`. However, if we know the matrix results from one-hot encoding, we only need to store the `indices`, so we can reduce memory usage to slightly less than 4/9 of the original.
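To make this concrete, here is a small sketch showing that `indices` alone determines the matrix (illustrative only; the library's `CategoricalMatrix` stores this more carefully):

```python
import numpy as np

# Row i of a one-hot CSR matrix has its single 1 in column indices[i];
# `data` is all ones and `indptr` is just arange(n + 1), so neither needs storing.
indices = np.array([1, 0, 1], dtype=np.int32)

n_rows = indices.shape[0]
n_cols = indices.max() + 1
dense = np.zeros((n_rows, n_cols), dtype=np.uint8)
dense[np.arange(n_rows), indices] = 1
print(dense)
# [[0 1]
#  [1 0]
#  [0 1]]
```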
#### csc storage
The case is not quite so simple for csc (column-major) sparse matrices. However, we still do not need to store the data.
```
>>> import numpy as np
>>> from scipy import sparse
>>> import pandas as pd

>>> arr = [1, 0, 1]
>>> dummies = pd.get_dummies(arr)
>>> csc = sparse.csc_matrix(dummies.values)
>>> csc.data
array([1, 1, 1], dtype=uint8)
>>> csc.indices
array([1, 0, 2], dtype=int32)
>>> csc.indptr
array([0, 1, 3], dtype=int32)
```
### Computations
#### Matrix multiplication
A general sparse CSR matrix-vector product in pseudocode, modeled on [scipy sparse](https://github.com/scipy/scipy/blob/1dc960a33b000b95b1e399582c154efc0360a576/scipy/sparse/sparsetools/csr.h#L1120):
```
def matvec(mat, vec):
    n_row = mat.shape[0]
    res = np.zeros(n_row)
    for i in range(n_row):
        # Walk the nonzeros of row i.
        for j in range(mat.indptr[i], mat.indptr[i + 1]):
            res[i] += mat.data[j] * vec[mat.indices[j]]
    return res
```
With a CSR categorical matrix, `data` is all 1 and `j` always equals `i`, so we can simplify this function to be
```
def matvec(mat, vec):
    n_row = mat.shape[0]
    res = np.zeros(n_row)
    for i in range(n_row):
        # data[j] is always 1 and j == i, so the inner loop collapses.
        res[i] = vec[mat.indices[i]]
    return res
```
The original function involved `6N` lookups, `N` multiplications, and `N` additions, while the new function involves only `3N` lookups. It thus has the potential to be significantly faster.
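In vectorized NumPy, the simplified loop collapses to a single fancy-indexing operation (an illustrative sketch, not the library's C implementation):

```python
import numpy as np

def categorical_matvec(indices, vec):
    # Row i has its single 1 in column indices[i], so (X @ vec)[i] = vec[indices[i]].
    return vec[indices]
```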
#### sandwich: X.T @ diag(d) @ X
Sandwich products can be computed very efficiently.
```
sandwich(X, d)[i, j] = sum_k X[k, i] d[k] X[k, j]
```
If `i != j`, `sum_k X[k, i] d[k] X[k, j]` = 0: since categorical matrices have only one nonzero per row, `X[k, i]` and `X[k, j]` are never both nonzero, so the sandwich product is diagonal. If `i = j`,
```
sandwich(X, d)[i, i] = sum_k X[k, i] d[k] X[k, i]
                     = sum_k X[k, i] d[k]
                     = d[X[:, i]].sum()
                     = (X.T @ d)[i]
```
So `sandwich(X, d) = diag(X.T @ d)`. This will be especially efficient if `X` is available in CSC format. Pseudocode for this sandwich product is
```
res = np.zeros(n_cols)
for i in range(n_cols):
    for j in range(X.indptr[i], X.indptr[i + 1]):
        # Accumulate d over the rows whose nonzero falls in column i.
        res[i] += d[X.indices[j]]
return np.diag(res)
```
This function is implemented in `ext/categorical/sandwich_categorical`.
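The same computation can be sketched in vectorized NumPy with `np.bincount` (illustrative; the optimized version lives in the C extension named above):

```python
import numpy as np

def sandwich_categorical(indices, d, n_cols):
    # Accumulate d[k] into the column holding row k's single nonzero.
    # The result is the diagonal of transpose(X) @ diag(d) @ X.
    return np.bincount(indices, weights=d, minlength=n_cols)
```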
#### Cross-sandwich: X.T @ diag(d) @ Y, Y categorical
If X and Y are different categorical matrices in csr format, X.T @ diag(d) @ Y is given by
```
res = np.zeros((X.shape[1], Y.shape[1]))
for k in range(len(d)):
    # Row k contributes d[k] at (X's nonzero column, Y's nonzero column).
    res[X.indices[k], Y.indices[k]] += d[k]
```
So the result will be sparse with at most N elements.
This function is implemented in `ext/split/_sandwich_cat_cat`.
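A vectorized NumPy sketch of the same accumulation (illustrative only):

```python
import numpy as np

def sandwich_cat_cat(x_indices, y_indices, d, x_ncols, y_ncols):
    res = np.zeros((x_ncols, y_ncols))
    # np.add.at accumulates correctly even when an (i, j) pair repeats,
    # unlike plain fancy-indexed assignment.
    np.add.at(res, (x_indices, y_indices), d)
    return res
```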