Commit ea16887 (1 parent df5c2bf): docs(blog): walking talking cube (#10160)

3 files changed: +326 additions, 0 deletions

docs/_freeze/posts/walking-talking-cube/index/execute-results/html.json (16 additions)
---
title: "Taking a random cube for a walk and making it talk"
author: "Cody Peterson"
date: "2024-09-26"
image: thumbnail.png
categories:
    - blog
    - duckdb
    - udfs
---

***Synthetic data with Ibis, DuckDB, Python UDFs, and Faker.***

To follow along, install the required libraries:

```bash
pip install 'ibis-framework[duckdb]' faker plotly
```

## A random cube

We'll generate a random cube of data with Ibis (default DuckDB backend) and
visualize it as a 3D line plot:

```{python}
#| code-fold: true
#| code-summary: "Show me the code!"
import ibis  # <1>
import ibis.selectors as s
import plotly.express as px  # <1>

ibis.options.interactive = True  # <2>
ibis.options.repr.interactive.max_rows = 5  # <2>

con = ibis.connect("duckdb://synthetic.ddb")  # <3>

if "source" in con.list_tables():
    t = con.table("source")  # <4>
else:
    lookback = ibis.interval(days=1)  # <5>
    step = ibis.interval(seconds=1)  # <5>

    t = (
        (
            ibis.range(  # <6>
                ibis.now() - lookback,
                ibis.now(),
                step=step,
            )  # <6>
            .unnest()  # <7>
            .name("timestamp")  # <8>
            .as_table()  # <9>
        )
        .mutate(
            index=(ibis.row_number().over(order_by="timestamp")),  # <10>
            **{col: 2 * (ibis.random() - 0.5) for col in ["a", "b", "c"]},  # <11>
        )
        .mutate(color=ibis._["index"].histogram(nbins=8))  # <12>
        .drop("index")  # <13>
        .relocate("timestamp", "color")  # <14>
        .order_by("timestamp")  # <15>
    )

    t = con.create_table("source", t.to_pyarrow())  # <16>

c = px.line_3d(  # <17>
    t,
    x="a",
    y="b",
    z="c",
    color="color",
    hover_data=["timestamp"],
)  # <17>
c
```

1. Import the necessary libraries.
2. Enable interactive mode for Ibis.
3. Connect to an on-disk DuckDB database.
4. Load the table if it already exists.
5. Define the time range and step for the data.
6. Create the array of timestamps.
7. Unnest the array to a column.
8. Name the column "timestamp".
9. Convert the column into a table.
10. Create a monotonically increasing index column.
11. Create three columns of random numbers.
12. Create a color column based on the index (to help visualize the time series).
13. Drop the index column.
14. Rearrange the columns.
15. Order the table by timestamp.
16. Store the table in the on-disk database.
17. Create a 3D line plot of the data.

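As an aside, the timestamp scaffold built in steps 5-9 is conceptually just a per-second range over the lookback window. A stdlib-only sketch of the same construction (using a fixed `now` for reproducibility, since the post's `ibis.now()` is nondeterministic):

```python
from datetime import datetime, timedelta

# Fixed "now" so the sketch is reproducible (the post uses ibis.now()).
now = datetime(2024, 9, 26)
lookback = timedelta(days=1)
step = timedelta(seconds=1)

# One row per second over the last day: 86,400 timestamps.
n = int(lookback / step)
timestamps = [now - lookback + i * step for i in range(n)]
```

The range is half-open: it starts at `now - lookback` and stops one step short of `now`.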
## Walking

We have a random cube of data:

```{python}
t
```

But we need to make it [walk](https://en.wikipedia.org/wiki/Random_walk). We'll
use a window function to calculate the cumulative sum of each column:

::: {.panel-tabset}

## Without column selectors

```{python}
window = ibis.window(order_by="timestamp", preceding=None, following=0)
walked = t.select(
    "timestamp",
    "color",
    a=t["a"].sum().over(window),
    b=t["b"].sum().over(window),
    c=t["c"].sum().over(window),
).order_by("timestamp")
walked
```

## With column selectors

```{python}
window = ibis.window(order_by="timestamp", preceding=None, following=0)
walked = t.select(
    "timestamp",
    "color",
    s.across(
        s.c("a", "b", "c"),  # <1>
        ibis._.sum().over(window),  # <2>
    ),
).order_by("timestamp")
walked
```

1. Alternatively, you can use `s.of_type(float)` to select all float columns.
2. Use `ibis._` to build a deferred column expression.

:::

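Either way, the window frame (unbounded preceding through the current row) turns each column into a running total, and a running total of random steps is exactly a random walk. A stdlib-only sketch of the same idea:

```python
import random
from itertools import accumulate

random.seed(0)

# Steps uniform in [-1, 1), like 2 * (ibis.random() - 0.5) in the post.
steps = [2 * (random.random() - 0.5) for _ in range(1000)]

# Cumulative sum of steps = position of the random walk at each tick.
walk = list(accumulate(steps))
```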
While the first few rows may look similar to the cube, the 3D line plot does
not:

```{python}
#| code-fold: true
#| code-summary: "Show me the code!"
c = px.line_3d(
    walked,
    x="a",
    y="b",
    z="c",
    color="color",
    hover_data=["timestamp"],
)
c
```

## Talking

We've made our random cube and we've made it walk, but now we want to make it
talk. At this point, you might be questioning the utility of this blog post --
what are we doing and why? The purpose is to demonstrate generating synthetic
data that can look realistic. We achieve this by building in randomness (e.g. a
random walk can be used to simulate stock prices) and also by using that
randomness to inform the generation of non-numeric synthetic data (e.g. the
ticker symbol of a stock).

### Faking it

Let's demonstrate this concept by pretending we have an application where users
can review a location they're at. The user's name, comment, location, and device
info are stored in our database for their review at a given timestamp.

[Faker](https://github.com/joke2k/faker) is a commonly used Python library for
generating fake data. We'll use it to generate fake names, comments, locations,
and device info for our reviews:

```{python}
from faker import Faker

fake = Faker()

res = (
    fake.name(),
    fake.sentence(),
    fake.location_on_land(),
    fake.user_agent(),
    fake.ipv4(),
)
res
```

We can use our random numbers to influence the fake data generation in a Python
UDF:

```{python}
#| echo: false
#| code-fold: true
con.raw_sql("set enable_progress_bar = false;");
```

```{python}
#| code-fold: true
#| code-summary: "Show me the code!"
import ibis.expr.datatypes as dt

from datetime import datetime, timedelta

ibis.options.repr.interactive.max_length = 5

record_schema = dt.Struct(
    {
        "timestamp": datetime,
        "name": str,
        "comment": str,
        "location": list[str],
        "device": dt.Struct(
            {
                "browser": str,
                "ip": str,
            }
        ),
    }
)


@ibis.udf.scalar.python
def faked_batch(
    timestamp: datetime,
    a: float,
    b: float,
    c: float,
    batch_size: int = 8,
) -> dt.Array(record_schema):
    """Generate records of fake data."""
    value = (a + b + c) / 3

    res = [
        {
            "timestamp": timestamp + timedelta(seconds=0.1 * i),
            "name": fake.name() if value >= 0.5 else fake.first_name(),
            "comment": fake.sentence(),
            "location": fake.location_on_land(),
            "device": {
                "browser": fake.user_agent(),
                "ip": fake.ipv4() if value >= 0 else fake.ipv6(),
            },
        }
        for i in range(batch_size)
    ]

    return res


if "faked" in con.list_tables():
    faked = con.table("faked")
else:
    faked = (
        t.mutate(
            faked=faked_batch(t["timestamp"], t["a"], t["b"], t["c"]),
        )
        .select(
            "a",
            "b",
            "c",
            ibis._["faked"].unnest(),
        )
        .unpack("faked")
        .drop("a", "b", "c")
    )

    faked = con.create_table("faked", faked)

faked
```
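The conditioning inside `faked_batch` boils down to thresholding the averaged signal `value`. A stdlib-only sketch of that pattern (the field names here are illustrative, not from the post):

```python
import random

random.seed(0)


def fake_record(value: float) -> dict:
    """Pick categorical fake fields based on a numeric signal,
    mirroring the thresholds in faked_batch."""
    return {
        # more "formal" records when the averaged signal is high
        "name_style": "full" if value >= 0.5 else "first_only",
        "ip_version": "ipv4" if value >= 0 else "ipv6",
    }


# Drive the categorical choices from uniform values in [-1, 1).
records = [fake_record(2 * (random.random() - 0.5)) for _ in range(1000)]
ipv4_share = sum(r["ip_version"] == "ipv4" for r in records) / len(records)
```

Because the signal is uniform on [-1, 1), roughly half the records end up with IPv4 addresses, so the categorical fields inherit the randomness of the numeric columns.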

And now we have a "realistic" dataset of fake reviews matching our desired
schema. You can adjust this to match the schema and expected distributions of
your own data and scale it up as needed.

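To match expected distributions, one common approach is to draw field values from a distribution fitted to real data. A stdlib-only sketch with hypothetical parameters (a review score roughly normal around 4.1 with standard deviation 0.6, clipped to a 1-5 scale):

```python
import random
import statistics

random.seed(1)

# Hypothetical fitted parameters for illustration only.
MU, SIGMA = 4.1, 0.6

# Draw scores from the fitted normal, clipped to the valid 1-5 range.
scores = [min(5.0, max(1.0, random.gauss(MU, SIGMA))) for _ in range(10_000)]

mean_score = statistics.fmean(scores)
```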
### GenAI/LLMs

The names and locations from Faker are bland and unrealistic. The comments are
nonsensical. ~~And most importantly, we haven't filled our quota for blogs
mentioning AI.~~ You could [use language models in Ibis UDFs to generate more
realistic synthetic data](../lms-for-data/index.qmd). We could use "open source"
language models to do this locally for free; that's left as an exercise for the
reader.

## Next steps

If you've followed along, you have a `synthetic.ddb` file with a couple of
tables:

```{python}
con.list_tables()
```

We can estimate the size of the data generated:

```{python}
import os

size_in_mbs = os.path.getsize("synthetic.ddb") / (1024 * 1024)
print(f"synthetic.ddb: {size_in_mbs:.2f} MBs")
```

You can build from here to generate realistic synthetic data at any scale for
any use case.