---
title: Querying every file in every release on the Python Package Index (redux)
author: Gil Forsyth
date: 2023-11-15
categories:
  - blog
---

Seth Larson wrote a great [blog
post](https://sethmlarson.dev/security-developer-in-residence-weekly-report-18)
on querying a PyPI dataset to look for trends in the use of memory-safe
languages in Python.

Check out Seth's article for more information on the dataset (and
it's a good read!). It caught our eye because it makes use of
[DuckDB](https://duckdb.org/) to clean the data for analysis.

That's right up our alley here in Ibis land, so let's see if we can duplicate
Seth's results (and then continue on to plot them!).

## Grab the data (locations)

Seth showed (and then safely decomposed) a nested `curl` statement and that's
always viable -- but we're in Python land, so why not grab the filenames using
`urllib3`?

```{python}
import urllib3

http = urllib3.PoolManager()

resp = http.request("GET", "https://github.com/pypi-data/data/raw/main/links/dataset.txt")

parquet_files = resp.data.decode().split()
parquet_files
```

## Grab the data

Now we're ready to get started with Ibis!

DuckDB is clever enough to grab only the parquet metadata. This means we can
use `read_parquet` to create a lazy view of the parquet files and then build up
our expression without downloading everything beforehand!

```{python}
import ibis
from ibis import _  # <1>

ibis.options.interactive = True
```

1. See https://ibis-project.org/how-to/analytics/chain_expressions.html for docs
   on the deferred operator!

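If the deferred `_` feels magical, here's a toy sketch of the idea (a hypothetical mini-implementation for illustration only, not Ibis internals): a deferred object records attribute accesses now and replays them against a concrete object later.

```python
# Toy illustration of a "deferred" reference (hypothetical -- this is
# NOT how Ibis implements `_`): record operations, apply them later.
class Deferred:
    def __init__(self, ops=()):
        self.ops = ops

    def __getattr__(self, name):
        # record an attribute access, e.g. `_.path`
        return Deferred(self.ops + (("getattr", name),))

    def resolve(self, obj):
        # replay the recorded accesses against a concrete object
        for _kind, name in self.ops:
            obj = getattr(obj, name)
        return obj


_toy = Deferred()
expr = _toy.path  # nothing happens yet -- the access is just recorded


class Row:
    path = "src/lib.rs"


print(expr.resolve(Row()))  # prints: src/lib.rs
```

Ibis's real `_` records whole expression trees (method calls, operators, and so on), but the record-then-replay idea is the same.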
Create a DuckDB connection:

```{python}
con = ibis.duckdb.connect()
```

And load up one of the files (we can run the full query after)!

```{python}
pypi = con.read_parquet(parquet_files[0], table_name="pypi")
```

```{python}
pypi.schema()
```

## Query crafting

Let's break down what we're looking for. As a high-level view of the use of
compiled languages, Seth is using file extensions as an indicator that a given
filetype is used in a Python project.

The dataset we're using has _every file in every project_ -- what criteria should we use?

We can follow Seth's lead and look for:

1. A file extension that is one of: `asm`, `cc`, `cpp`, `cxx`, `h`, `hpp`, `rs`, `go`, and variants of `F90`, `f90`, etc...
   That is, C, C++, Assembly, Rust, Go, and Fortran.
2. We exclude matches where the file path is within the `site-packages/` directory.
3. We exclude matches that are in directories used for testing.
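Before handing the extension pattern to the database, we can sanity-check it with Python's built-in `re` module (the paths below are invented examples, not rows from the dataset):

```python
import re

# The extension pattern used in the filter below, checked against a few
# made-up file paths.
pattern = re.compile(r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$")

candidates = [
    "pkg/src/lib.rs",         # Rust
    "pkg/native/impl.cpp",    # C++
    "pkg/fortran/solver.f90", # Fortran
    "pkg/module.py",          # pure Python -- should NOT match
    "README.md",              # docs -- should NOT match
]
matches = [p for p in candidates if pattern.search(p)]
print(matches)  # only the compiled-language paths survive
```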
```{python}
expr = pypi.filter(
    [
        _.path.re_search(r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"),
        ~_.path.re_search(r"(^|/)test(|s|ing)"),
        ~_.path.contains("/site-packages/"),
    ]
)
expr
```

That _could_ be right -- we can peek at the filename at the end of the `path` column to do a quick check:

```{python}
expr.path.split("/")[-1]
```

Ok! Next up, we want to group the matches by:

1. The month that the package / file was published.
   For this, we can use the `truncate` method and ask for month as our truncation window.
2. The file extension of the file used.

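Truncating to month just snaps each timestamp to the first instant of its month; in plain Python (with a made-up upload time) that looks like:

```python
from datetime import datetime

# An invented upload timestamp, truncated to its month the stdlib way.
uploaded_on = datetime(2023, 11, 15, 13, 37, 42)
month = uploaded_on.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
print(month)  # 2023-11-01 00:00:00
```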
```{python}
expr.group_by(
    month=_.uploaded_on.truncate("M"),
    ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1),
).aggregate()
```

That looks promising. Now we need to grab the package names that correspond to a
given file extension in a given month and deduplicate them. And to match Seth's
results, we'll also sort by the month in descending order:

```{python}
expr = (
    expr.group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1),
    )
    .aggregate(projects=_.project_name.collect().unique())
    .order_by(_.month.desc())
)

expr
```

## Massage and plot

Let's continue and see what our results look like.

We'll do a few things:

1. Combine all of the C and C++ extensions into a single group by renaming them all.
2. Count the number of distinct entries in each group.
3. Plot the results!

144+
```{python}
145+
collapse_names = expr.mutate(
146+
ext=_.ext.re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
147+
.replace("rs", "Rust")
148+
.replace("go", "Go")
149+
.replace("asm", "Assembly"),
150+
)
151+
152+
collapse_names
153+
```
154+
Note that now we need to de-duplicate again, since we might've had separate
unique entries for both an `h` and a `c` file extension, and we don't want to
double-count!

We could rewrite our original query and include the renames in the original
`group_by` (this would be the smart thing to do), but let's push on and see if
we can make this work.

The `projects` column is now a column of string arrays, so we want to collect
all of the arrays in each group. This will give us a "list of lists"; then we'll
`flatten` that list and call `unique().length()` as before.

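In plain Python terms, the flatten-then-deduplicate step amounts to something like this (project names invented):

```python
from itertools import chain

# One group's `projects` column: several lists of project names, one per
# original (pre-rename) extension. Flatten, then count distinct projects.
projects = [["numpy", "scipy"], ["scipy", "pandas"]]
flattened = list(chain.from_iterable(projects))
project_count = len(set(flattened))
print(project_count)  # 3 -- "scipy" is only counted once
```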
DuckDB has a `flatten` function, but it isn't exposed in Ibis (yet!).

We'll use a handy bit of Ibis magic to define a `builtin` UDF that will map directly
onto the underlying DuckDB function (what!? See
[here](https://ibis-project.org/how-to/extending/builtin.html#duckdb) for more
info):

```{python}
@ibis.udf.scalar.builtin
def flatten(x: list[list[str]]) -> list[str]:
    ...


collapse_names = collapse_names.group_by(["month", "ext"]).aggregate(
    projects=flatten(_.projects.collect())
)

collapse_names
```

We could have included the `unique().length()` in the `aggregate` call, but
sometimes it's good to check that your slightly off-kilter idea has worked (and
it has!).

```{python}
collapse_names = collapse_names.select(
    _.month, _.ext, project_count=_.projects.unique().length()
)

collapse_names
```

Now that the data are tidied, we can pass our expression directly to Altair and see what it looks like!

```{python}
import altair as alt

chart = (
    alt.Chart(collapse_names)
    .mark_line()
    .encode(x="month", y="project_count", color="ext")
    .properties(width=600, height=300)
)
chart
```

That looks good, but it definitely doesn't match the plot from Seth's post:

![upstream plot](upstream_plot.png)

Our current plot is only showing the results from a subset of the available
data. Now that our expression is complete, we can re-run it on the full dataset
and compare.

## The full run

To recap -- we pulled a lazy view of a single parquet file from the `pypi-data`
repo, filtered for all the files that contain file extensions we care about,
then grouped them all together to get counts of the various filetypes used
across projects by month.

Here's the entire query chained together into a single command, now running on
all of the `parquet` files we have access to:

```{python}
pypi = con.read_parquet(parquet_files, table_name="pypi")

full_query = (
    pypi.filter(
        [
            _.path.re_search(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~_.path.re_search(r"(^|/)test(|s|ing)"),
            ~_.path.contains("/site-packages/"),
        ]
    )
    .group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1),
    )
    .aggregate(projects=_.project_name.collect().unique())
    .order_by(_.month.desc())
    .mutate(
        ext=_.ext.re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
        .replace("rs", "Rust")
        .replace("go", "Go")
        .replace("asm", "Assembly"),
    )
    .group_by(["month", "ext"])
    .aggregate(project_count=flatten(_.projects.collect()).unique().length())
)
chart = (
    alt.Chart(full_query)
    .mark_line()
    .encode(x="month", y="project_count", color="ext")
    .properties(width=600, height=300)
)
chart
```