---
title: Export the dataframe
order: 2
---


In the [previous section](explore-as-dataframe.md), we explored some face tracking data using the dataframe view. In this section, we will see how to use the dataframe API of the Rerun SDK to export the same data into a [Pandas](https://pandas.pydata.org) dataframe for further inspection and processing.

## Load the recording

The dataframe SDK loads data from an `.rrd` file.
The first step is thus to save our recording as an `.rrd` file, which can be done from the Rerun menu:

<picture style="zoom: 0.5">
  <img src="https://static.rerun.io/save_recording/ece0f887428b1800a305a3e30faeb57fa3d77cd8/full.png" alt="">
  <source media="(max-width: 480px)" srcset="https://static.rerun.io/save_recording/ece0f887428b1800a305a3e30faeb57fa3d77cd8/480w.png">
</picture>

We can then load the recording in a Python script as follows:

```python
import rerun as rr
import numpy as np  # We'll need this later.

# Load the recording
recording = rr.dataframe.load_recording("face_tracking.rrd")
```
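
If you are unsure which timelines and entities a recording contains, its schema can be inspected before building a query. Here is a minimal sketch, assuming the `schema()` accessor and its column listings from the current Python SDK (check the SDK reference for the exact names):

```python
# List the index (timeline) columns and the component columns available
# in this recording. (Method names are assumptions; see the SDK docs.)
schema = recording.schema()
print(schema.index_columns())
print(schema.component_columns())
```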


## Query the data

Once we have loaded a recording, we can query it to extract some data. Here is how it is done:

```python
# query the recording into a PyArrow table
view = recording.view(
    index="frame_nr",
    contents="/blendshapes/0/jawOpen"
)
table = view.select().read_all()
```

A lot is happening here, so let's go through it step by step:
1. We first create a _view_ into the recording. The view specifies which index column we want to use (in this case the `"frame_nr"` timeline) and which other content we want to consider (here, only the `/blendshapes/0/jawOpen` entity). The view defines a subset of all the data contained in the recording, where each row has a unique value for the index, and columns are filtered based on the value(s) provided as the `contents` argument.
2. A view can then be queried. Here we use the simplest possible form of querying by calling `select()`: no filtering is applied, and all view columns are selected. The result thus corresponds to the entire view.
3. The object returned by `select()` is a [`pyarrow.RecordBatchReader`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html). This is essentially an iterator over the stream of [`pyarrow.RecordBatch`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow-recordbatch)es containing the query data (these batches can also be consumed one by one, as sketched right after this list).
4. Finally, we use the [`pyarrow.RecordBatchReader.read_all()`](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html#pyarrow.RecordBatchReader.read_all) method to read all record batches into a single [`pyarrow.Table`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table).
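
Because the reader yields record batches incrementally, a query can also be consumed without materializing it all at once. Here is a minimal sketch of that pattern (the per-batch processing is just a placeholder):

```python
# Stream the query results batch by batch; this keeps memory usage
# bounded even for very large recordings.
for batch in view.select():
    print(batch.num_rows)  # replace with your own per-batch processing
```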

**Note**: queries can be further narrowed by filtering rows and/or selecting a subset of the view columns. See the reference documentation for more information.

<!-- TODO(#7499): add a link to the reference documentation -->
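
For instance, a view can be restricted to a range of index values before selecting. Here is a minimal sketch, assuming the view's `filter_range_sequence()` method from the current Python API (see the reference documentation for the full set of filtering and selection options):

```python
# Keep only the rows whose frame_nr falls within [0, 100].
first_frames = view.filter_range_sequence(0, 100)
table_first = first_frames.select().read_all()
```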

Let's have a look at the resulting table:

```python
print(table)
```

Here is the result:
```
pyarrow.Table
frame_nr: int64
frame_time: timestamp[ns]
log_tick: int64
log_time: timestamp[ns]
/blendshapes/0/jawOpen:Scalar: list<item: double>
  child 0, item: double
----
frame_nr: [[0],[1],...,[412],[413]]
frame_time: [[1970-01-01 00:00:00.000000000],[1970-01-01 00:00:00.040000000],...,[1970-01-01 00:00:16.480000000],[1970-01-01 00:00:16.520000000]]
log_tick: [[34],[92],...,[22077],[22135]]
log_time: [[2024-10-13 08:26:46.819571000],[2024-10-13 08:26:46.866358000],...,[2024-10-13 08:27:01.722971000],[2024-10-13 08:27:01.757358000]]
/blendshapes/0/jawOpen:Scalar: [[[0.03306490555405617]],[[0.03812221810221672]],...,[[0.06996039301156998]],[[0.07366073131561279]]]
```

Again, this is a [PyArrow](https://arrow.apache.org/docs/python/index.html) table containing the result of our query. Further exploring Arrow structures is beyond the scope of this guide. Yet, it is a reminder that Rerun natively stores (and returns) data in Arrow format. As such, it interoperates efficiently with other Arrow-native and/or compatible tools such as [Polars](https://pola.rs) or [DuckDB](https://duckdb.org).
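
As a quick illustration of that interoperability, the same Arrow table converts to a Polars dataframe in a single call. Here is a minimal sketch, assuming the `polars` package is installed:

```python
import polars as pl

# Build a Polars dataframe from the Arrow table; Polars is Arrow-native,
# so this typically avoids copying the underlying data.
df_polars = pl.from_arrow(table)
print(df_polars)
```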


## Create a Pandas dataframe

Before exploring the data further, let's convert the table to a Pandas dataframe:

```python
df = table.to_pandas()
```

Alternatively, the dataframe can be created directly, without using the intermediate PyArrow table:

```python
df = view.select().read_pandas()
```


## Inspect the dataframe

Let's have a first look at this dataframe:

```python
print(df)
```

Here is the result:

<!-- NOLINT_START -->

```
     frame_nr              frame_time  log_tick                   log_time /blendshapes/0/jawOpen:Scalar
0           0 1970-01-01 00:00:00.000        34 2024-10-13 08:26:46.819571         [0.03306490555405617]
1           1 1970-01-01 00:00:00.040        92 2024-10-13 08:26:46.866358         [0.03812221810221672]
2           2 1970-01-01 00:00:00.080       150 2024-10-13 08:26:46.899699        [0.027743922546505928]
3           3 1970-01-01 00:00:00.120       208 2024-10-13 08:26:46.934704        [0.024137917906045914]
4           4 1970-01-01 00:00:00.160       266 2024-10-13 08:26:46.967762        [0.022867577150464058]
..        ...                     ...       ...                        ...                           ...
409       409 1970-01-01 00:00:16.360     21903 2024-10-13 08:27:01.619732         [0.07283800840377808]
410       410 1970-01-01 00:00:16.400     21961 2024-10-13 08:27:01.656455         [0.07037288695573807]
411       411 1970-01-01 00:00:16.440     22019 2024-10-13 08:27:01.689784         [0.07556036114692688]
412       412 1970-01-01 00:00:16.480     22077 2024-10-13 08:27:01.722971         [0.06996039301156998]
413       413 1970-01-01 00:00:16.520     22135 2024-10-13 08:27:01.757358         [0.07366073131561279]

[414 rows x 5 columns]
```
| 122 | + |
| 123 | +<!-- NOLINT_END --> |
| 124 | + |
| 125 | +We can make several observations from this output. |

- The first four columns are timeline columns. These are the various timelines the data is logged to in this recording.
- The last column is named `/blendshapes/0/jawOpen:Scalar`. This is what we call a _component column_, and it corresponds to the [Scalar](../../reference/types/components/scalar.md) component logged to the `/blendshapes/0/jawOpen` entity.
- Each row in the `/blendshapes/0/jawOpen:Scalar` column consists of a _list_ of (typically one) scalar.

This last point may come as a surprise, but it is a consequence of Rerun's data model, where components are always stored as arrays. This makes it possible, for example, to log an entire point cloud with the [`Points3D`](../../reference/types/archetypes/points3d.md) archetype under a single entity and at a single timestamp.
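
A quick way to observe this list structure is to count how many values each cell of the component column holds. Here is a minimal Pandas sketch:

```python
# Count the number of scalars per cell: we expect mostly 1 (one value
# per frame) and 0 for the frames where no face was detected.
print(df["/blendshapes/0/jawOpen:Scalar"].apply(len).value_counts())
```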

Let's explore this further, recalling that, in our recording, no face was detected at around frame #170:

```python
print(df["/blendshapes/0/jawOpen:Scalar"][160:180])
```

Here is the result:

```
160      [0.0397215373814106]
161    [0.037685077637434006]
162      [0.0402931347489357]
163     [0.04329492896795273]
164      [0.0394592322409153]
165    [0.020853394642472267]
166                         []
167                         []
168                         []
169                         []
170                         []
171                         []
172                         []
173                         []
174                         []
175                         []
176                         []
177                         []
178                         []
179                         []
Name: /blendshapes/0/jawOpen:Scalar, dtype: object
```

We note that the data contains empty lists when no face is detected. This is because the blendshapes entities are [`Clear`](../../reference/types/archetypes/clear.md)ed when the face is lost: the clear applies to the corresponding timestamps and to all subsequent timestamps, until a new value is logged.

While this data representation is useful in general, a flat floating-point representation with NaN for missing values is typically more convenient for scalar data. This is achieved using the [`explode()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html) method:

```python
# Flatten the list column into a plain float column; empty lists become NaN.
df["jawOpen"] = df["/blendshapes/0/jawOpen:Scalar"].explode().astype(float)
print(df["jawOpen"][160:180])
```

Here is the result:

```
160    0.039722
161    0.037685
162    0.040293
163    0.043295
164    0.039459
165    0.020853
166         NaN
167         NaN
168         NaN
169         NaN
170         NaN
171         NaN
172         NaN
173         NaN
174         NaN
175         NaN
176         NaN
177         NaN
178         NaN
179         NaN
Name: jawOpen, dtype: float64
```

This confirms that the newly created `"jawOpen"` column contains regular 64-bit floats, with missing values represented as NaN.

_Note_: should you want to filter out the NaNs, you may use the [`dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) method.
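
For example, here is a one-line sketch (`df_valid` is just an illustrative name):

```python
# Keep only the rows where a scalar value is present.
df_valid = df.dropna(subset=["jawOpen"])
```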

## Next steps

With this, we are ready to analyze the data and log the results back to the Rerun viewer, which is covered in the [next section](analyze-and-log.md) of this guide.