Count tokens when embedding #77

Open
@enjalot

It could be helpful for users to see how many tokens are in their dataset, and how many are in a given cluster.

We can just record the number of tokens encoded for each row during the embedding step.
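As a rough sketch of what that could look like (not the project's actual code): count tokens per row at tokenization time, then sum them for the dataset and per cluster. The function names `count_tokens` and `tokens_per_cluster` are hypothetical, and `whitespace_tokenize` is only a stand-in for whatever tokenizer the embedding model really uses.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def count_tokens(texts: Sequence[str], tokenize: Callable[[str], list]) -> List[int]:
    """Return one token count per input text, in the same order."""
    return [len(tokenize(text)) for text in texts]


def tokens_per_cluster(token_counts: Sequence[int], cluster_ids: Sequence[int]) -> Dict[int, int]:
    """Sum token counts for each cluster id (one cluster id per row)."""
    totals: Dict[int, int] = defaultdict(int)
    for count, cluster in zip(token_counts, cluster_ids):
        totals[cluster] += count
    return dict(totals)


def whitespace_tokenize(text: str) -> list:
    # Stand-in only: a real embedding model would use its own tokenizer.
    return text.split()


texts = ["the quick brown fox", "hello world", "one"]
cluster_ids = [0, 1, 0]  # hypothetical cluster assignment per row

token_counts = count_tokens(texts, whitespace_tokenize)
dataset_total = sum(token_counts)
cluster_totals = tokens_per_cluster(token_counts, cluster_ids)
```

Keeping one count per row (rather than only a grand total) is what makes the per-cluster number cheap to compute later.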

We will need to consider that someone importing embeddings may not have recorded token counts, so surfacing them in the UI would be optional.

The token count could be stored in a parallel array in the h5 file, and later turned into a column in the scope parquet.
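A minimal sketch of the storage idea, assuming `h5py`, `numpy` and `pandas` are available. The dataset names `embeddings` and `token_counts`, the column name `token_count`, and both function names are assumptions made for illustration, not the project's existing layout. The token counts are optional on write, and the column is simply skipped on read when the file has none, which covers imported embeddings.

```python
from typing import Optional, Sequence

import h5py
import numpy as np
import pandas as pd


def write_embeddings(path: str, embeddings: np.ndarray,
                     token_counts: Optional[Sequence[int]] = None) -> None:
    """Write embeddings and, if available, a parallel array of token counts."""
    with h5py.File(path, "w") as f:
        f.create_dataset("embeddings", data=embeddings)
        if token_counts is not None:
            counts = np.asarray(token_counts, dtype=np.int32)
            if len(counts) != len(embeddings):
                raise ValueError("token_counts must have one entry per embedding")
            f.create_dataset("token_counts", data=counts)


def add_token_count_column(df: pd.DataFrame, path: str) -> pd.DataFrame:
    """Return a copy of df with a token_count column, if the h5 file has one."""
    with h5py.File(path, "r") as f:
        if "token_counts" not in f:
            # Imported embeddings may not carry token counts; leave df as-is.
            return df
        counts = f["token_counts"][:]
    if len(counts) != len(df):
        raise ValueError("token_counts length does not match the number of rows")
    out = df.copy()
    out["token_count"] = counts
    return out
```

The resulting frame could then be written with `DataFrame.to_parquet`, and the UI can check for the presence of the `token_count` column before showing anything.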
