---
title: "Taking a random cube for a walk and making it talk"
author: "Cody Peterson"
date: "2024-09-26"
image: thumbnail.png
categories:
  - blog
  - duckdb
  - udfs
---

***Synthetic data with Ibis, DuckDB, Python UDFs, and Faker.***

To follow along, install the required libraries:

```bash
pip install 'ibis-framework[duckdb]' faker plotly
```

## A random cube

We'll generate a random cube of data with Ibis (default DuckDB backend) and
visualize it as a 3D line plot:

```{python}
#| code-fold: true
#| code-summary: "Show me the code!"
import ibis  # <1>
import ibis.selectors as s
import plotly.express as px  # <1>

ibis.options.interactive = True  # <2>
ibis.options.repr.interactive.max_rows = 5  # <2>

con = ibis.connect("duckdb://synthetic.ddb")  # <3>

if "source" in con.list_tables():
    t = con.table("source")  # <4>
else:
    lookback = ibis.interval(days=1)  # <5>
    step = ibis.interval(seconds=1)  # <5>

    t = (
        (
            ibis.range(  # <6>
                ibis.now() - lookback,
                ibis.now(),
                step=step,
            )  # <6>
            .unnest()  # <7>
            .name("timestamp")  # <8>
            .as_table()  # <9>
        )
        .mutate(
            index=(ibis.row_number().over(order_by="timestamp")),  # <10>
            **{col: 2 * (ibis.random() - 0.5) for col in ["a", "b", "c"]},  # <11>
        )
        .mutate(color=ibis._["index"].histogram(nbins=8))  # <12>
        .drop("index")  # <13>
        .relocate("timestamp", "color")  # <14>
        .order_by("timestamp")  # <15>
    )

    t = con.create_table("source", t.to_pyarrow())  # <16>

c = px.line_3d(  # <17>
    t,
    x="a",
    y="b",
    z="c",
    color="color",
    hover_data=["timestamp"],
)  # <17>
c
```

1. Import the necessary libraries.
2. Enable interactive mode for Ibis.
3. Connect to an on-disk DuckDB database.
4. Load the table if it already exists.
5. Define the time range and step for the data (see the sketch after this list for scaling these up).
6. Create the array of timestamps.
7. Unnest the array to a column.
8. Name the column "timestamp".
9. Convert the column into a table.
10. Create a monotonically increasing index column.
11. Create three columns of random numbers uniformly distributed on [-1, 1).
12. Create a color column based on the index (to help visualize the time series).
13. Drop the index column.
14. Rearrange the columns.
15. Order the table by timestamp.
16. Store the table in the on-disk database.
17. Create a 3D line plot of the data.
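
The `lookback` and `step` intervals control how much data gets generated. A
minimal sketch of scaling the range up (the values here are illustrative; pick
whatever window and resolution you need):

```python
# One week of data at one-second resolution yields ~604,800 rows.
lookback = ibis.interval(days=7)
step = ibis.interval(seconds=1)
```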

## Walking

We have a random cube of data:

```{python}
t
```

But we need to make it [walk](https://en.wikipedia.org/wiki/Random_walk). We'll
use a window function to calculate the cumulative sum of each column:

::: {.panel-tabset}

## Without column selectors

```{python}
window = ibis.window(order_by="timestamp", preceding=None, following=0)
walked = t.select(
    "timestamp",
    "color",
    a=t["a"].sum().over(window),
    b=t["b"].sum().over(window),
    c=t["c"].sum().over(window),
).order_by("timestamp")
walked
```

## With column selectors

```{python}
window = ibis.window(order_by="timestamp", preceding=None, following=0)
walked = t.select(
    "timestamp",
    "color",
    s.across(
        s.c("a", "b", "c"),  # <1>
        ibis._.sum().over(window),  # <2>
    ),
).order_by("timestamp")
walked
```

1. Alternatively, you can use `s.of_type(float)` to select all float columns (see the sketch after this tabset).
2. Use `ibis._` to build a deferred expression that is applied to each selected column.

:::

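The `s.of_type(float)` variant mentioned above would look like this (a small
sketch; it assumes `"a"`, `"b"`, and `"c"` are the only float columns in the
table):

```python
# Select the float columns by type instead of by name.
walked = t.select(
    "timestamp",
    "color",
    s.across(s.of_type(float), ibis._.sum().over(window)),
).order_by("timestamp")
```
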
While the first few rows may look similar to the cube, the 3D line plot does
not:

```{python}
#| code-fold: true
#| code-summary: "Show me the code!"
c = px.line_3d(
    walked,
    x="a",
    y="b",
    z="c",
    color="color",
    hover_data=["timestamp"],
)
c
```

## Talking

We've made our random cube and we've made it walk, but now we want to make it
talk. At this point, you might be questioning the utility of this blog post --
what are we doing and why? The purpose is to demonstrate generating synthetic
data that can look realistic. We achieve this by building in randomness (e.g. a
random walk can be used to simulate stock prices) and also by using that
randomness to inform the generation of non-numeric synthetic data (e.g. the
ticker symbol of a stock).
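
To make the stock example concrete, here is a toy sketch of deriving a
categorical column from the walk (the ticker symbols are made up):

```python
# Pick a fake ticker based on the walk's color bucket (illustrative only).
labeled = walked.mutate(
    ticker=(walked["color"] % 2 == 0).ifelse("ABC", "XYZ"),
)
```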

### Faking it

Let's demonstrate this concept by pretending we have an application where users
can review the location they're visiting. Each review stores the user's name,
comment, location, and device info alongside a timestamp.

[Faker](https://github.com/joke2k/faker) is a commonly used Python library for
generating fake data. We'll use it to generate fake names, comments, locations,
and device info for our reviews:

```{python}
from faker import Faker

fake = Faker()

res = (
    fake.name(),
    fake.sentence(),
    fake.location_on_land(),
    fake.user_agent(),
    fake.ipv4(),
)
res
```
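
If you want the fake values to be reproducible across runs, Faker supports
seeding its random generator (the seed value below is arbitrary):

```python
# Seed Faker so repeated runs produce the same fake records.
Faker.seed(42)
fake = Faker()
```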

We can use our random numbers to influence the fake data generation in a Python
UDF:

```{python}
#| echo: false
#| code-fold: true
con.raw_sql("set enable_progress_bar = false;");
```

```{python}
#| code-fold: true
#| code-summary: "Show me the code!"
import ibis.expr.datatypes as dt

from datetime import datetime, timedelta

ibis.options.repr.interactive.max_length = 5

record_schema = dt.Struct(
    {
        "timestamp": datetime,
        "name": str,
        "comment": str,
        "location": list[str],
        "device": dt.Struct(
            {
                "browser": str,
                "ip": str,
            }
        ),
    }
)


@ibis.udf.scalar.python
def faked_batch(
    timestamp: datetime,
    a: float,
    b: float,
    c: float,
    batch_size: int = 8,
) -> dt.Array(record_schema):
    """
    Generate records of fake data.
    """
    value = (a + b + c) / 3

    res = [
        {
            "timestamp": timestamp + timedelta(seconds=0.1 * i),
            "name": fake.name() if value >= 0.5 else fake.first_name(),
            "comment": fake.sentence(),
            "location": fake.location_on_land(),
            "device": {
                "browser": fake.user_agent(),
                "ip": fake.ipv4() if value >= 0 else fake.ipv6(),
            },
        }
        for i in range(batch_size)
    ]

    return res


if "faked" in con.list_tables():
    faked = con.table("faked")
else:
    faked = (
        t.mutate(
            faked=faked_batch(t["timestamp"], t["a"], t["b"], t["c"]),
        )
        .select(
            "a",
            "b",
            "c",
            ibis._["faked"].unnest(),
        )
        .unpack("faked")
        .drop("a", "b", "c")
    )

    faked = con.create_table("faked", faked)

faked
```

And now we have a "realistic" dataset of fake reviews matching our desired
schema. You can adjust this to match the schema and expected distributions of
your own data and scale it up as needed.
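
A couple of quick sanity checks on what was generated (these use standard Ibis
table methods; nothing here is specific to the UDF above):

```python
# How many fake reviews were produced, and how many distinct reviewer names.
faked.count()
faked["name"].nunique()
```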

### GenAI/LLMs

The names and locations from Faker are bland and unrealistic. The comments are
nonsensical. ~~And most importantly, we haven't filled our quota for blogs
mentioning AI.~~ You could [use language models in Ibis UDFs to generate more
realistic synthetic data](../lms-for-data/index.qmd), and "open source" language
models let you do this locally for free. A rough sketch follows; a full
implementation is left as an exercise for the reader.
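
As a sketch (the model name is a placeholder, and in practice you'd want
batching, prompt tuning, and error handling), a local text-generation model
from Hugging Face `transformers` could stand in for `fake.sentence()` when
generating comments:

```python
import ibis
from transformers import pipeline

# Placeholder model; any small local instruct model works in principle.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")


@ibis.udf.scalar.python
def fake_comment(name: str, place: str) -> str:
    """Generate a short, more realistic review comment locally."""
    prompt = f"Write a one-sentence review by {name} of a visit to {place}:"
    out = generator(prompt, max_new_tokens=30)
    return out[0]["generated_text"]
```

You could call a UDF like this in place of `fake.sentence()` inside
`faked_batch`, or run it as a second pass over the `faked` table.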

## Next steps

If you've followed along, you have a `synthetic.ddb` file with a couple of
tables:

```{python}
con.list_tables()
```

We can estimate the size of the data generated:

```{python}
import os

size_in_mbs = os.path.getsize("synthetic.ddb") / (1024 * 1024)
print(f"synthetic.ddb: {size_in_mbs:.2f} MBs")
```

You can build from here to generate realistic synthetic data at any scale for
any use case.
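
For instance, a minimal way to hand the synthetic tables off to other tools is
to export them from DuckDB (the file names here are arbitrary):

```python
# Write the generated tables out as Parquet for use elsewhere.
con.table("source").to_parquet("source.parquet")
con.table("faked").to_parquet("faked_reviews.parquet")
```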