Simplify DataFrame interface? #28


Open
altavir opened this issue Jun 28, 2021 · 18 comments
Labels
research This requires a deeper dive to gather a better understanding
Milestone

Comments

@altavir

altavir commented Jun 28, 2021

The DataFrame primary interface seems over-complicated. A lot of methods have only default implementations and could be moved to extensions. I propose simplifying it significantly, like I've done here. That would make it simpler to add and maintain features; for example, it would allow adding row-based DataFrames.
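One way to picture such a minimal core (hypothetical toy types for illustration, not the library's actual interfaces) is a small interface plus extension functions for everything that has a default implementation:

```kotlin
// Toy sketch of a minimal column-based core: the interface only exposes
// raw structure, and derived operations live outside it as extensions.
interface Column<T> {
    val name: String
    val size: Int
    operator fun get(index: Int): T
}

interface DataFrame {
    val columns: List<Column<*>>
    val rowsNum: Int
}

// Derived operations as extensions, not interface members:
fun DataFrame.getColumnOrNull(name: String): Column<*>? =
    columns.firstOrNull { it.name == name }

fun DataFrame.columnNames(): List<String> = columns.map { it.name }

// A trivial list-backed implementation for illustration:
class ListColumn<T>(override val name: String, private val data: List<T>) : Column<T> {
    override val size: Int get() = data.size
    override fun get(index: Int): T = data[index]
}

class SimpleFrame(override val columns: List<Column<*>>) : DataFrame {
    override val rowsNum: Int get() = columns.firstOrNull()?.size ?: 0
}
```

With a core this small, an alternative (e.g. row-based or remote) implementation only has to satisfy the two interfaces; all extensions keep working.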

@nikitinas
Contributor

I agree. We've just finished experimenting with the public DataFrame API and have started a global code review and cleanup.

@altavir
Author

altavir commented Jul 1, 2021

Nice to hear that. I would be willing to help if you want. I've started writing a PR, but hit a general design problem: a column being a DataFrame, with a lot of methods and iterators intermixing because of that. I think there must be a clean separation of entities, so one can choose what to iterate over, and DataFrame itself should not be iterable.

@nikitinas
Contributor

@altavir Please take a look at the updated DataFrame interface. I've converted most methods into extensions, and now it looks rather minimalistic.

@nikitinas
Contributor

nikitinas commented Dec 14, 2021

Regarding your statement that DataFrame should not be iterable: currently it doesn't implement the Iterable interface, but it supports the iterator() operator, which allows it to be used in a for loop.

Iteration over a DataFrame is generally ambiguous and can have different meanings: iteration over rows, columns, or values. This is why it has forEachRow and forEachColumn operations instead of just forEach. But in the case of a for loop we have only two options: either prohibit such usage or allow some default behavior. In my opinion, default behavior is better than nothing, which is why for iterates over rows, similar to Kotlin collections.
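Roughly, this can be sketched with toy types (not the library's actual implementation): an iterator() operator gives for loops a default row-wise behavior, while explicit helpers disambiguate rows vs columns.

```kotlin
// Toy frame: columns stored as a name-to-values map.
class Frame(val columns: Map<String, List<Any?>>) {
    val rowsNum: Int get() = columns.values.firstOrNull()?.size ?: 0

    // iterator() operator: `for (row in frame)` iterates rows by default,
    // without the class implementing Iterable.
    operator fun iterator(): Iterator<Map<String, Any?>> =
        (0 until rowsNum).asSequence()
            .map { i -> columns.mapValues { (_, col) -> col[i] } }
            .iterator()
}

// Explicit, unambiguous iteration helpers:
fun Frame.forEachRow(action: (Map<String, Any?>) -> Unit) {
    for (row in this) action(row)
}

fun Frame.forEachColumn(action: (String, List<Any?>) -> Unit) {
    columns.forEach { (name, values) -> action(name, values) }
}
```

The iterator() operator gives for loop support without advertising Iterable in the type hierarchy, which is the compromise described above.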

@altavir
Author

altavir commented Dec 15, 2021

@nikitinas It is still not possible to provide an external implementation of DataFrame.

I would recommend using more basic access methods in classes and moving all advanced DSL to extensions or helpers.

@nikitinas
Contributor

nikitinas commented Dec 15, 2021

The current design doesn't support external implementations of the DataFrame interface, though this could change in the future.

As of now, everything is bound to the following data model:

  • DataFrame is a list of columns
  • DataColumn can be one of three kinds: ValueColumn, ColumnGroup or FrameColumn.

Implementations of all DataFrame operations rely heavily on this data model, because they return a new DataFrame instance created from a new list of columns. That's why any new implementation of DataFrame would also require many changes to the implementations of all operations. Could you please explain the case in which you need another DataFrame implementation?

The major extension point is expected to be ValueColumn, which is the actual data storage. It can be extended to support primitive types for better performance. You can also provide your own List implementation over your columns with data and use it in the existing ValueColumnImpl implementation. In particular, this allows creating a column-based DataFrame wrapper over some other row-based data structure. Native support for row-based DataFrames is not on the roadmap, because it would require a totally different implementation of all DataFrame operations.
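The data model described above can be sketched roughly like this (toy types only; the real DataColumn hierarchy differs in detail):

```kotlin
// Toy sketch of the three column kinds named above.
sealed interface DataColumn {
    val name: String
}

// Leaf storage for plain values -- the main extension point:
class ValueColumn<T>(override val name: String, val values: List<T>) : DataColumn

// A named group of nested columns:
class ColumnGroup(override val name: String, val columns: List<DataColumn>) : DataColumn

// A column whose every cell is itself a list of columns, i.e. a nested frame:
class FrameColumn(override val name: String, val frames: List<List<DataColumn>>) : DataColumn

// A DataFrame is then just a list of top-level columns:
class Frame(val columns: List<DataColumn>)
```

Because every operation produces a new Frame from a new list of columns, swapping in a different ValueColumn backing store is easy, while replacing Frame itself is not.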

@altavir
Author

altavir commented Dec 15, 2021

I think that extensibility is very important. We would like to be able to use different data sources, such as databases, streaming file reads, and remote data. That won't be possible if the API is locked to the current DataFrame internals.

A good test would be an example implementation of new data source integration, which does not depend on internals.

I've managed to implement a column here: https://github.com/mipt-npm/tables-kt/blob/master/tables-kt-dataframe/src/main/kotlin/space/kscience/dataforge/dataframe/TableAsDataFrame.kt, but did not manage to do a DataFrame. Is it possible to create a DataFrame from existing columns, bypassing builders?

@nikitinas
Contributor

nikitinas commented Dec 15, 2021

You can use the toDataFrame() extension for Iterable<DataColumn> or dataFrameOf(columns) to create a DataFrame from a list of columns.

https://kotlin.github.io/dataframe/createdataframe.html#todataframe

@nikitinas
Contributor

nikitinas commented Dec 15, 2021

Integration of external data sources would require reimplementation of all operations; otherwise it would be just a simple data viewer. Originally the library design was focused on a great user experience for in-memory data wrangling, but I agree that we can now consider support for external data sources as well. We can choose some particular external source and think about how we can extend DataFrame for it.

@holgerbrandl

To me the most obvious candidates are databases. In dbplyr (note the b) they managed to keep 90% of the dplyr API while enabling server-side execution. It's just amazing. I use this almost daily to break down big data sets server-side before pulling them to my local computer. See https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html for their great intro.

However, I totally agree with @nikitinas that a well-designed API should come first before considering such options/extensions.

@altavir
Author

altavir commented Dec 22, 2021

@holgerbrandl This is what I was talking about. Here is the integration of Tables.kt with Exposed: https://github.com/mipt-npm/tables-kt/tree/master/tables-kt-exposed/src/main/kotlin/space/kscience/dataforge/exposed. And I've added the same for direct CSV.

I am not sure I understand why internals are important for data wrangling. It would be nice to have a seminar on that.

@holgerbrandl

From what I understand, the Exposed example creates a wrapper around a single table, but I think it does not allow mapping compute constructs into the DB. This is essentially the superpower of dbplyr: we can do

eqDim = tbl(con, in_schema("Foo", "Equipment")) %>% # create ref to a db-table
  inner_join(tbl(con, in_schema("Foo", "EquipmentName"))) %>% # create ref to another db-table and perform join (without doing it yet)
  select(EquipmentId, Equipment, Location, Operator) %>%
  group_by(Equipment) %>%
  mutate(..) %>% # not fully worked out, but I guess this part is clear
  filter(..) %>% # nothing has been queried or computed until this point. Similar to a Kotlin Sequence
  collect() # here dbplyr will compile a query, run it against the db, and pull the result

I guess you agree that this goes beyond your example, as it requires a local query builder & optimizer (which is part of dbplyr). Most importantly, this is exactly the same API as for local dplyr computation. By simply moving up the collect() we would turn this into an eager-compute chain. Clearly, this is very advanced and took the R community years to develop, so it's just meant as a pointer/inspiration regarding later API requirements.
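For illustration, a deferred chain like this could be sketched in Kotlin as follows (toy types; a real implementation would compile the recorded steps to SQL instead of replaying them in memory):

```kotlin
// Toy lazy frame: each chained call only records a step; collect() runs them.
// Here execution just replays the steps against an in-memory list, but the
// recorded steps are exactly what a query builder would translate to SQL.
class LazyFrame<T>(
    private val source: () -> List<T>,
    private val steps: List<(List<T>) -> List<T>> = emptyList()
) {
    fun filter(predicate: (T) -> Boolean): LazyFrame<T> {
        val step: (List<T>) -> List<T> = { rows -> rows.filter(predicate) }
        return LazyFrame(source, steps + step)
    }

    fun <R : Comparable<R>> sortedBy(selector: (T) -> R): LazyFrame<T> {
        val step: (List<T>) -> List<T> = { rows -> rows.sortedBy(selector) }
        return LazyFrame(source, steps + step)
    }

    // Nothing touches the source until collect():
    fun collect(): List<T> = steps.fold(source()) { acc, step -> step(acc) }
}
```

As with dbplyr, moving collect() earlier in the chain would turn the rest into eager local computation.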

Indeed, only very few people care about the internals. From our discussion yesterday, I found the DataFrame interface very simple (except maybe the column index lookup, which you explained is necessary for performance reasons).

@altavir
Author

altavir commented Dec 22, 2021

I just gave a simple example; it is not hard to add joins as well. The DataFrame user API is fine, but do we need to have all those things locked to the implementation? Or is it possible to add helpers for them?

@nikitinas
Contributor

nikitinas commented Dec 23, 2021

Considering a DataFrame API over a DB, we have several issues:

  1. Row-based computations. The DataFrame API uses row-based expressions in many operations, e.g. filter. These expressions cannot be mapped to SQL, because they are just Kotlin lambdas executed for every row. On the other hand, they give the flexibility to use any Kotlin functions in a DataRow context and integrate smoothly with the Kotlin stdlib. This is a trade-off between API convenience and implementation extensibility. It could be solved by splitting the DataFrame API into local and remote sets of operations, prohibiting row-based computations in the remote case. As an example of such API separation we have the filter and filterBy operations: filter accepts a row-based lambda (similar to the stdlib), while filterBy selects a Boolean column that can also be created by any column arithmetic.

  2. Query optimization and deferred computation. As @holgerbrandl mentioned, efficient SQL execution requires bundling several operations into one transaction. This could be solved by a lazy implementation of the DataFrame interface that collects all applied operations and performs them either at an explicit collect() or at the first operation that is not mappable to SQL.

  3. Writing to the DB. If we consider not only queries but also write operations, and want to use the DataFrame API for SQL table updates, creation of new tables, table schema modification, etc. without any data transfer to the local machine, it will be even more challenging, but still possible.

So, it's possible. It just requires some work to be done.

I suggest starting with a much simpler approach that solves the first two problems and allows using data from a DB: we can add an adapter for Exposed that converts a Query into a DataFrame. This would be an extension .toDataFrame() that doesn't require any internal changes to the DataFrame implementation.
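The filter/filterBy distinction in point 1 can be illustrated with toy types (hypothetical names mirroring the comment, not the library's actual signatures):

```kotlin
// Toy row and frame types.
data class Row(val values: Map<String, Any?>)
class Frame(val rows: List<Row>)

// Row-based filter: an arbitrary Kotlin lambda per row. Maximally flexible,
// but opaque -- a remote backend cannot translate it to SQL.
fun Frame.filter(predicate: (Row) -> Boolean): Frame =
    Frame(rows.filter(predicate))

// Column-based filterBy: a Boolean mask column. It is plain data that a
// remote backend could inspect, and that column arithmetic could produce.
fun Frame.filterBy(mask: List<Boolean>): Frame =
    Frame(rows.filterIndexed { i, _ -> mask[i] })
```

The remote-capable API would keep only the second kind of operation, since its arguments are inspectable data rather than opaque code.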

@GavinRay97

GavinRay97 commented Apr 25, 2022

FWIW, the book "How Query Engines Work" by Andy Grove (Apache DataFusion/Ballista/Arrow-Rust author) covers building essentially the above -- in Kotlin

It walks through building first a DataFrame class, and an expression AST, then teaches you how to add a query planner/optimizer on top of it. Fantastic book.

Just in case anyone else finds this repo and is interested in similar topics = )

@holgerbrandl

Great pointer, thanks for sharing.

@nikitinas
Contributor

Thank you for the link. Our current idea is to build on top of Exposed, which already supports SQL query optimization.

@zaleslaw zaleslaw self-assigned this Jan 31, 2023
@zaleslaw zaleslaw added the research This requires a deeper dive to gather a better understanding label Apr 25, 2023
@zaleslaw zaleslaw added this to the Backlog milestone Apr 25, 2023
@zaleslaw zaleslaw removed their assignment Apr 25, 2023
@zaleslaw zaleslaw self-assigned this Jun 22, 2023
@zaleslaw zaleslaw modified the milestones: Backlog, 0.12.0 Jun 22, 2023
@Jolanrensen
Collaborator

Jolanrensen commented Jun 22, 2023

For simple, untyped/unchecked conversion from Exposed Queries to DataFrames, something as simple as this already works:

import org.jetbrains.exposed.sql.*
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.toDataFrame

// Try to get a proper name instead of something like $Line23432.Albums.artistId
// Needs to be expanded
val Expression<*>.readableName: String
    get() = when (this) {
        is Column<*> -> name
        is ExpressionAlias<*> -> alias
        is BiCompositeColumn<*, *, *> -> getRealColumns().joinToString("_") { it.readableName }
        else -> toString()
    }

// Simply retrieve the entire Query and convert the rows to columns
fun Iterable<ResultRow> /* Query */.toDataFrame(): DataFrame<*> {
    val map = mutableMapOf<String, MutableList<Any?>>()
    forEach { row ->
        for (expression in row.fieldIndex.keys) {
            map.getOrPut(expression.readableName) {
                mutableListOf()
            } += row[expression]
        }
    }

    return map.toDataFrame()
}

Now, of course, this pulls the entire query into memory, where DataFrame operates. If you don't want that, the operation should be batched into multiple DataFrames.
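The batching idea can be sketched generically (here `convert` stands in for a rows-to-DataFrame conversion like the one above; the helper name is hypothetical):

```kotlin
// Convert a stream of result rows chunk by chunk instead of materializing
// everything at once; each chunk becomes one frame (or any other result).
fun <T, R> Sequence<T>.inBatches(size: Int, convert: (List<T>) -> R): Sequence<R> =
    chunked(size).map(convert)
```

Because Sequence is lazy, only one chunk of rows needs to be resident at a time, which keeps memory bounded for large queries.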

@zaleslaw zaleslaw removed their assignment Jun 23, 2023
@zaleslaw zaleslaw modified the milestones: 0.12.0, Backlog Oct 9, 2023
Projects
None yet
Development

No branches or pull requests

6 participants