Description
Some ideas for a potential "smart data augmentation" tool that could be built on top of Open Datasets.
The idea is to pass your data through a set of "checks" or "matches". You get back a bunch of extra columns that might be relevant. These are derived from all the open datasets.
The matching is done by an LLM. It receives every column name and a sample of values, and tries to match each one with known relevant columns¹.
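As a rough sketch of what the LLM could receive, the snippet below builds one request per column, containing the column name and a small sample of distinct values. The names `sample_values` and `build_match_request`, the dict-of-lists table shape, and the sample size are assumptions made for illustration, not part of any existing tool.

```python
import json


def sample_values(values, sample_size=5):
    """Return up to sample_size distinct, non-empty values in first-seen order."""
    seen = []
    for value in values:
        if value is None or value == "":
            continue
        if value not in seen:
            seen.append(value)
        if len(seen) == sample_size:
            break
    return seen


def build_match_request(table, sample_size=5):
    """Build one JSON-serializable entry per column: its name plus a value sample."""
    return [
        {"column": name, "sample": sample_values(values, sample_size)}
        for name, values in table.items()
    ]


# Hypothetical user table, represented as column name -> list of values.
table = {
    "country": ["France", "Spain", "France", None, "Italy"],
    "year": [2019, 2020, 2020, 2021, 2021],
}
requests = build_match_request(table, sample_size=3)
print(json.dumps(requests, indent=2))
```

Sending only a handful of distinct values keeps the prompt small while still giving the model a sense of what the column contains.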
Additionally, suggest some LLM-derived columns from existing ones (e.g., a Country column passed through a "Capital" prompt), or let the user set a custom prompt to "augment" one of the columns. This won't use any real data but could be useful (e.g., to classify the sentiment of a text).
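A minimal sketch of the prompt-based augmentation, assuming the LLM is exposed as a plain callable that takes a prompt string and returns a string. `augment_column` and `fake_llm` are hypothetical names; `fake_llm` is a stub standing in for a real model call so the example runs offline.

```python
def augment_column(values, prompt_template, llm):
    """Derive a new column by running each distinct value through a prompt.

    Results are cached per distinct value, so repeated values cost one call.
    """
    cache = {}
    derived = []
    for value in values:
        if value not in cache:
            cache[value] = llm(prompt_template.format(value=value))
        derived.append(cache[value])
    return derived


def fake_llm(prompt):
    """Stub standing in for a real LLM call (assumption for this sketch)."""
    known_capitals = {"France": "Paris", "Spain": "Madrid"}
    for country, capital in known_capitals.items():
        if country in prompt:
            return capital
    return "unknown"


countries = ["France", "Spain", "France", "Atlantis"]
capitals = augment_column(
    countries,
    "What is the capital of {value}? Answer with the city name only.",
    fake_llm,
)
print(capitals)
```

A custom user prompt would work the same way: only `prompt_template` changes, for example a template asking for the sentiment of `{value}`.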
¹ To make it fast, each column could be processed in parallel. The samples could be embedded and used to retrieve similar columns in the Open Datasets space. The same could be done at the column level.
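The footnote's retrieval idea could look roughly like this. Everything here is an assumption made for illustration: `embed` is a toy character-trigram embedding standing in for a real embedding model, `CATALOG` is a made-up stand-in for indexed Open Datasets columns, and a thread pool handles the per-column parallelism.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def embed(text):
    """Toy embedding: character-trigram counts (stand-in for a real model)."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def column_text(name, sample):
    """Flatten a column name and its sampled values into one string."""
    return name + " " + " ".join(str(v) for v in sample)


# Hypothetical catalog of known columns with sample values.
CATALOG = {
    "world_bank.country_name": ["France", "Spain", "Italy", "Germany"],
    "weather.city": ["Paris", "Madrid", "Rome", "Berlin"],
    "calendar.year": [2018, 2019, 2020, 2021],
}
CATALOG_VECTORS = {
    key: embed(column_text(key, sample)) for key, sample in CATALOG.items()
}


def top_matches(column, k=2):
    """Return the column name and the k most similar catalog columns."""
    name, sample = column
    query = embed(column_text(name, sample))
    scored = [(cosine(query, vec), key) for key, vec in CATALOG_VECTORS.items()]
    return name, [key for _, key in sorted(scored, reverse=True)[:k]]


user_columns = [
    ("country", ["France", "Italy", "Portugal"]),
    ("yr", [2019, 2020]),
]
# Each column is matched independently, so the work parallelizes trivially.
with ThreadPoolExecutor() as pool:
    matches = dict(pool.map(top_matches, user_columns))
print(matches)
```

The retrieved candidates could then be passed to the LLM as a shortlist, so it only has to confirm or reject a few likely matches instead of searching the whole catalog.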