[Book 1] Ch.1 potential typo: Normalizing TF-IDF per row vs column

On p.25 of Book 1 (in the latest available online version, dated June 2023), Section 1.5.4.2 states that we often normalize each row of the TF-IDF matrix. According to the definition of TF-IDF in the book, $\text{TF-IDF}_{ij}$ refers to the (IDF-weighted) frequency of the $i$-th term in the $j$-th document, so rows index terms and columns index documents. Normalizing each row therefore corresponds to comparing (the occurrences of) all the words on the same scale.

I wonder whether we actually want to normalize each *column*, instead of each *row*, of the TF-IDF matrix. That would correspond to comparing all the documents on the same scale, regardless of their lengths.
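For concreteness, here is a minimal NumPy sketch of the two options. It assumes L2 normalization and the terms × documents layout described above; the toy matrix and the helper `l2_normalize` are made up for illustration and are not taken from the book.

```python
import numpy as np

# Hypothetical toy TF-IDF matrix: rows are terms (i), columns are documents (j).
tfidf = np.array([
    [3.0, 0.0, 1.0],
    [4.0, 2.0, 0.0],
])

def l2_normalize(matrix, axis):
    """Scale vectors to unit L2 norm; axis=1 normalizes rows, axis=0 normalizes columns."""
    norms = np.linalg.norm(matrix, axis=axis, keepdims=True)
    norms = np.where(norms == 0, 1.0, norms)  # leave all-zero vectors unchanged
    return matrix / norms

# Row normalization: every term vector gets unit norm (terms on the same scale).
row_normalized = l2_normalize(tfidf, axis=1)

# Column normalization: every document vector gets unit norm
# (documents on the same scale, regardless of length).
col_normalized = l2_normalize(tfidf, axis=0)

print(np.linalg.norm(row_normalized, axis=1))  # per-term norms
print(np.linalg.norm(col_normalized, axis=0))  # per-document norms
```

With this layout, only the column-normalized version makes dot products between document vectors equal to cosine similarities, which is the length-independent comparison I have in mind.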

![Screenshot 2024-04-23 at 11 45 23](https://github.com/probml/pyprobml/assets/89930807/1419478c-fc2c-4b67-ad79-09d5ba84d714)

Also, there is a minor notational inconsistency in the following Sec. 1.5.4.3. Previously, the size of the vocabulary was denoted by $D$ (as in most of the book), while here the text switches to $V$, which is not defined.


[Book 1] Ch.1 potential typo: Normalizing TF-IDF per row vs column #1115
