Simplify DataFrame interface? #28


Open
altavir opened this issue Jun 28, 2021 · 18 comments
Labels
research This requires a deeper dive to gather a better understanding
Milestone

Comments

@altavir

altavir commented Jun 28, 2021

The DataFrame primary interface seems over-complicated. A lot of methods have only default implementations and could be moved to extensions. I propose simplifying it significantly, like I've done here. That would make it simpler to add and maintain features; for example, it would allow adding row-based DataFrames.
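One way to picture such a minimal core (hypothetical toy types for illustration, not the library's actual interfaces) is a small interface plus extension functions for everything that has a default implementation:

```kotlin
// Toy sketch of a minimal column-based core: the interface only exposes
// raw structure, and derived operations live outside it as extensions.
interface Column<T> {
    val name: String
    val size: Int
    operator fun get(index: Int): T
}

interface DataFrame {
    val columns: List<Column<*>>
    val rowsNum: Int
}

// Derived operations as extensions, not interface members:
fun DataFrame.getColumnOrNull(name: String): Column<*>? =
    columns.firstOrNull { it.name == name }

fun DataFrame.columnNames(): List<String> = columns.map { it.name }

// A trivial list-backed implementation for illustration:
class ListColumn<T>(override val name: String, private val data: List<T>) : Column<T> {
    override val size: Int get() = data.size
    override fun get(index: Int): T = data[index]
}

class SimpleFrame(override val columns: List<Column<*>>) : DataFrame {
    override val rowsNum: Int get() = columns.firstOrNull()?.size ?: 0
}
```

With a core this small, an alternative (e.g. row-based or remote) implementation only has to satisfy the two interfaces; all extensions keep working.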

@nikitinas
Contributor

I agree. We've just finished experimenting with the public DataFrame API and have started a global code review and cleanup.

@altavir
Author

altavir commented Jul 1, 2021

Nice to hear that. I would be willing to help if you want. I've started writing a PR, but hit a general design problem: a column being a DataFrame, with a lot of methods and iterators intermixing because of that. I think there must be a clean separation of entities, so one can choose what to iterate over, and DataFrame itself should not be iterable.

@nikitinas
Contributor

@altavir Please take a look at the updated DataFrame interface. I've converted most methods into extensions, and now it looks rather minimalistic.

@nikitinas
Contributor

nikitinas commented Dec 14, 2021

Regarding your statement that DataFrame should not be iterable: currently it doesn't implement the Iterable interface, but it supports the iterator() operator, which allows it to be used in a for loop.

Iteration over a DataFrame is generally ambiguous and can have different meanings: iteration over rows, columns, or values. This is why it has forEachRow and forEachColumn operations instead of just forEach. But in the case of a for loop we have only two options: either prohibit such usage or allow some default behavior. In my opinion, default behavior is better than nothing, which is why for iterates over rows, similar to Kotlin collections.
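Roughly, this can be sketched with toy types (not the library's actual implementation): an iterator() operator gives for loops a default row-wise behavior, while explicit helpers disambiguate rows vs columns.

```kotlin
// Toy frame: columns stored as a name-to-values map.
class Frame(val columns: Map<String, List<Any?>>) {
    val rowsNum: Int get() = columns.values.firstOrNull()?.size ?: 0

    // iterator() operator: `for (row in frame)` iterates rows by default,
    // without the class implementing Iterable.
    operator fun iterator(): Iterator<Map<String, Any?>> =
        (0 until rowsNum).asSequence()
            .map { i -> columns.mapValues { (_, col) -> col[i] } }
            .iterator()
}

// Explicit, unambiguous iteration helpers:
fun Frame.forEachRow(action: (Map<String, Any?>) -> Unit) {
    for (row in this) action(row)
}

fun Frame.forEachColumn(action: (String, List<Any?>) -> Unit) {
    columns.forEach { (name, values) -> action(name, values) }
}
```

The iterator() operator gives for loop support without advertising Iterable in the type hierarchy, which is the compromise described above.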

@altavir
Author

altavir commented Dec 15, 2021

@nikitinas It is still not possible to provide an external implementation of DataFrame.

I would recommend using more basic access methods in classes and moving all advanced DSL to extensions or helpers.

@nikitinas
Contributor

nikitinas commented Dec 15, 2021

The current design doesn't support external implementations of the DataFrame interface, though this could change in the future.

As of now, everything is bound to the following data model:

  • DataFrame is a list of columns
  • DataColumn can be one of three kinds: ValueColumn, ColumnGroup or FrameColumn.

Implementations of all DataFrame operations rely heavily on this data model, because they return a new DataFrame instance created from a new list of columns. That's why any new implementation of DataFrame would also require many changes to the implementations of all operations. Could you please explain the case in which you need another DataFrame implementation?

The major extension point is expected to be ValueColumn, which is the actual data storage. It can be extended to support primitive types for better performance. You can also provide your own List implementation over your columns with data and use it in the existing ValueColumnImpl implementation. In particular, this allows creating a column-based DataFrame wrapper over some other row-based data structure. Native support for row-based DataFrames is not on the roadmap, because it would require a totally different implementation of all DataFrame operations.
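The data model described above can be sketched roughly like this (toy types only; the real DataColumn hierarchy differs in detail):

```kotlin
// Toy sketch of the three column kinds named above.
sealed interface DataColumn {
    val name: String
}

// Leaf storage for plain values -- the main extension point:
class ValueColumn<T>(override val name: String, val values: List<T>) : DataColumn

// A named group of nested columns:
class ColumnGroup(override val name: String, val columns: List<DataColumn>) : DataColumn

// A column whose every cell is itself a list of columns, i.e. a nested frame:
class FrameColumn(override val name: String, val frames: List<List<DataColumn>>) : DataColumn

// A DataFrame is then just a list of top-level columns:
class Frame(val columns: List<DataColumn>)
```

Because every operation produces a new Frame from a new list of columns, swapping in a different ValueColumn backing store is easy, while replacing Frame itself is not.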

@altavir
Author

altavir commented Dec 15, 2021

I think that extensibility is very important. We would like to be able to use different data sources, such as databases, streaming file reads, and remote data. That won't be possible if the API is locked to the current DataFrame internals.

A good test would be an example implementation of new data source integration, which does not depend on internals.

I've managed to implement a column here: https://github.com/mipt-npm/tables-kt/blob/master/tables-kt-dataframe/src/main/kotlin/space/kscience/dataforge/dataframe/TableAsDataFrame.kt, but did not manage to do a DataFrame. Is it possible to create a DataFrame from existing columns, bypassing builders?

@nikitinas
Contributor

nikitinas commented Dec 15, 2021

You can use the toDataFrame() extension for Iterable<DataColumn> or dataFrameOf(columns) to create a DataFrame from a list of columns.

https://kotlin.github.io/dataframe/createdataframe.html#todataframe

@nikitinas
Contributor

nikitinas commented Dec 15, 2021

Integration of external data sources would require reimplementation of all operations; otherwise it would be just a simple data viewer. Originally the library design was focused on a great user experience for in-memory data wrangling, but I agree that we can now consider support for external data sources as well. We can choose some particular external source and think about how we can extend DataFrame for it.

@holgerbrandl

To me the most obvious candidates are databases. In dbplyr (note the b) they managed to keep 90% of the dplyr API while enabling server-side execution. It's just amazing. I use this almost daily to break down big data sets server-side before pulling them to my local computer. See https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html for their great intro.

However, I totally agree with @nikitinas that a well-designed API should come first before considering such options/extensions.

@altavir
Author

altavir commented Dec 22, 2021

@holgerbrandl This is what I was talking about. Here is the integration of Tables.kt with Exposed: https://github.com/mipt-npm/tables-kt/tree/master/tables-kt-exposed/src/main/kotlin/space/kscience/dataforge/exposed. And I've added the same for direct CSV.

I am not sure I understand why internals are important for data wrangling. It would be nice to have a seminar on that.

@holgerbrandl

From what I understand, the Exposed example creates a wrapper around a single table, but I think it does not allow mapping compute constructs into the DB. This is essentially the superpower of dbplyr: we can do

eqDim = tbl(con, in_schema("Foo", "Equipment")) %>% # create ref to a db-table
  inner_join(tbl(con, in_schema("Foo", "EquipmentName"))) %>% # create ref to another db-table and perform join (without doing it yet)
  select(EquipmentId, Equipment, Location, Operator) %>%
  group_by(Equipment) %>%
  mutate(..) %>% # not fully worked out, but I guess this part is clear
  filter(..) %>% # nothing has been queried or computed until this point. Similar to a Kotlin Sequence
  collect() # here dbplyr will compile a query, run it against the db, and pull the result

I guess you agree that this goes beyond your example, as it requires a local query builder & optimizer (which is part of dbplyr). Most importantly, this is exactly the same API as for local dplyr computation. By simply moving up the collect() we would turn this into an eager-compute chain. Clearly, this is very advanced and took the R community years to develop, so it's just meant as a pointer/inspiration regarding later API requirements.
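For illustration, a deferred chain like this could be sketched in Kotlin as follows (toy types; a real implementation would compile the recorded steps to SQL instead of replaying them in memory):

```kotlin
// Toy lazy frame: each chained call only records a step; collect() runs them.
// Here execution just replays the steps against an in-memory list, but the
// recorded steps are exactly what a query builder would translate to SQL.
class LazyFrame<T>(
    private val source: () -> List<T>,
    private val steps: List<(List<T>) -> List<T>> = emptyList()
) {
    fun filter(predicate: (T) -> Boolean): LazyFrame<T> {
        val step: (List<T>) -> List<T> = { rows -> rows.filter(predicate) }
        return LazyFrame(source, steps + step)
    }

    fun <R : Comparable<R>> sortedBy(selector: (T) -> R): LazyFrame<T> {
        val step: (List<T>) -> List<T> = { rows -> rows.sortedBy(selector) }
        return LazyFrame(source, steps + step)
    }

    // Nothing touches the source until collect():
    fun collect(): List<T> = steps.fold(source()) { acc, step -> step(acc) }
}
```

As with dbplyr, moving collect() earlier in the chain would turn the rest into eager local computation.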

Indeed, only very few people care about the internals. From our discussion yesterday, I found the DataFrame interface very simple (except maybe the column index lookup, which you explained is necessary for performance reasons).

@altavir
Author

altavir commented Dec 22, 2021

I just gave a simple example; it is not hard to add joins as well. The DataFrame user API is fine, but do we need to have all those things locked to the implementation? Or is it possible to add helpers for them?

@nikitinas
Contributor

nikitinas commented Dec 23, 2021

Considering a DataFrame API over a DB, we have several issues:

  1. Row-based computations. The DataFrame API uses row-based expressions in many operations, e.g. filter. These expressions cannot be mapped to SQL, because they are just Kotlin lambdas executed for every row. On the other hand, they give the flexibility to use any Kotlin functions in a DataRow context and integrate smoothly with the Kotlin stdlib. This is a trade-off between API convenience and implementation extensibility. It could be solved by splitting the DataFrame API into local and remote sets of operations, prohibiting row-based computations in the remote case. As an example of such API separation we have the filter and filterBy operations: filter accepts a row-based lambda (similar to the stdlib), while filterBy selects a Boolean column that can also be created by any column arithmetic.

  2. Query optimization and deferred computation. As @holgerbrandl mentioned, efficient SQL execution requires bundling several operations into one transaction. This could be solved by a lazy implementation of the DataFrame interface that collects all applied operations and performs them either at an explicit collect() or at the first operation that is not mappable to SQL.

  3. Writing to the DB. If we consider not only queries but also write operations, and want to use the DataFrame API for SQL table updates, creation of new tables, table schema modification, etc. without any data transfer to the local machine, it will be even more challenging, but still possible.

So, it's possible. It just requires some work to be done.

I suggest starting with a much simpler approach that solves the first two problems and allows using data from a DB: we can add an adapter for Exposed that converts a Query into a DataFrame. This would be an extension .toDataFrame() that doesn't require any internal changes to the DataFrame implementation.
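The filter/filterBy distinction in point 1 can be illustrated with toy types (hypothetical names mirroring the comment, not the library's actual signatures):

```kotlin
// Toy row and frame types.
data class Row(val values: Map<String, Any?>)
class Frame(val rows: List<Row>)

// Row-based filter: an arbitrary Kotlin lambda per row. Maximally flexible,
// but opaque -- a remote backend cannot translate it to SQL.
fun Frame.filter(predicate: (Row) -> Boolean): Frame =
    Frame(rows.filter(predicate))

// Column-based filterBy: a Boolean mask column. It is plain data that a
// remote backend could inspect, and that column arithmetic could produce.
fun Frame.filterBy(mask: List<Boolean>): Frame =
    Frame(rows.filterIndexed { i, _ -> mask[i] })
```

The remote-capable API would keep only the second kind of operation, since its arguments are inspectable data rather than opaque code.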

@GavinRay97

GavinRay97 commented Apr 25, 2022

FWIW, the book "How Query Engines Work" by Andy Grove (Apache DataFusion/Ballista/Arrow-Rust author) covers building essentially the above -- in Kotlin

It walks through building first a DataFrame class, and an expression AST, then teaches you how to add a query planner/optimizer on top of it. Fantastic book.

Just in case anyone else finds this repo and is interested in similar topics = )

@holgerbrandl

Great pointer, thanks for sharing.

@nikitinas
Contributor

Thank you for the link. Our current idea is to build on top of Exposed, which already supports SQL query optimization.

@zaleslaw zaleslaw self-assigned this Jan 31, 2023
@zaleslaw zaleslaw added the research This requires a deeper dive to gather a better understanding label Apr 25, 2023
@zaleslaw zaleslaw added this to the Backlog milestone Apr 25, 2023
@zaleslaw zaleslaw removed their assignment Apr 25, 2023
@zaleslaw zaleslaw self-assigned this Jun 22, 2023
@zaleslaw zaleslaw modified the milestones: Backlog, 0.12.0 Jun 22, 2023
@Jolanrensen
Collaborator

Jolanrensen commented Jun 22, 2023

For simple, untyped/unchecked conversion from Exposed Queries to DataFrames, something as simple as this already works:

import org.jetbrains.exposed.sql.*
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.toDataFrame

// Try to get a proper name instead of something like $Line23432.Albums.artistId
// Needs to be expanded
val Expression<*>.readableName: String
    get() = when (this) {
        is Column<*> -> name
        is ExpressionAlias<*> -> alias
        is BiCompositeColumn<*, *, *> -> getRealColumns().joinToString("_") { it.readableName }
        else -> toString()
    }

// Simply retrieve the entire Query and convert the rows to columns
fun Iterable<ResultRow> /* Query */.toDataFrame(): DataFrame<*> {
    val map = mutableMapOf<String, MutableList<Any?>>()
    forEach { row ->
        for (expression in row.fieldIndex.keys) {
            map.getOrPut(expression.readableName) {
                mutableListOf()
            } += row[expression]
        }
    }

    return map.toDataFrame()
}

Now, of course, this pulls the entire query into memory, where DataFrame operates. If you don't want that, the operation should be batched into multiple DataFrames.
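The batching idea can be sketched generically (here `convert` stands in for a rows-to-DataFrame conversion like the one above; the helper name is hypothetical):

```kotlin
// Convert a stream of result rows chunk by chunk instead of materializing
// everything at once; each chunk becomes one frame (or any other result).
fun <T, R> Sequence<T>.inBatches(size: Int, convert: (List<T>) -> R): Sequence<R> =
    chunked(size).map(convert)
```

Because Sequence is lazy, only one chunk of rows needs to be resident at a time, which keeps memory bounded for large queries.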

@zaleslaw zaleslaw removed their assignment Jun 23, 2023
@zaleslaw zaleslaw modified the milestones: 0.12.0, Backlog Oct 9, 2023
Projects
None yet
Development

No branches or pull requests

6 participants