Simplify DataFrame interface? #28
I agree. We've just finished experimenting with the public DataFrame API and have started a global code review and cleanup.
Nice to hear that. I would be willing to help if you want. I've started to write a PR, but hit a general design problem: the column being a DataFrame, and a lot of methods and iterators intermixing because of that. I think there must be a clean separation of entities, so one could choose what they iterate over, and DataFrame itself should not be iterable.
@altavir Please take a look at the updated
Regarding your statement that iteration over
@nikitinas It is still not possible to do an external implementation of DataFrame.
I would recommend using more basic access methods in the classes and moving all the advanced DSL to extensions or helpers.
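The recommendation above can be sketched in plain Kotlin. All names here (`MiniDataFrame`, `MiniColumn`, etc.) are hypothetical illustrations, not the library's API: the interface is kept down to raw access, and convenience helpers live in extensions that any implementation, including an external one, inherits for free.

```kotlin
// Hypothetical minimal core: implementations only provide raw access.
interface MiniColumn<out T> {
    val name: String
    val size: Int
    operator fun get(index: Int): T
}

interface MiniDataFrame {
    val columns: List<MiniColumn<Any?>>
    val rowCount: Int
}

// Advanced helpers live outside the interface, as extensions.
fun MiniDataFrame.column(name: String): MiniColumn<Any?> =
    columns.first { it.name == name }

fun MiniDataFrame.rowAsMap(index: Int): Map<String, Any?> =
    columns.associate { it.name to it[index] }

// A trivial in-memory implementation, to show that an external one can plug in.
class ListColumn<T>(override val name: String, private val data: List<T>) : MiniColumn<T> {
    override val size: Int get() = data.size
    override fun get(index: Int): T = data[index]
}

class SimpleDataFrame(override val columns: List<MiniColumn<Any?>>) : MiniDataFrame {
    override val rowCount: Int get() = columns.firstOrNull()?.size ?: 0
}
```

With this shape, a database-backed or streaming column only has to implement `MiniColumn`, and everything built as an extension keeps working.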
The current design doesn't support external implementations of DataFrame. As for now, everything is bound to the following data model:
Implementation of all

The major extension point is expected to be
I think that extensibility is very important. We would like to be able to use different data sources, such as databases, streaming file reads, and remote data. And that won't be possible if the API is locked to the current DataFrame internals. A good test would be an example implementation of a new data source integration that does not depend on internals. I've managed to implement a column here: https://github.com/mipt-npm/tables-kt/blob/master/tables-kt-dataframe/src/main/kotlin/space/kscience/dataforge/dataframe/TableAsDataFrame.kt, but did not manage to do a DataFrame. Is it possible to create a DataFrame from existing columns, bypassing builders?
You can use https://kotlin.github.io/dataframe/createdataframe.html#todataframe
Integration of external data sources will require reimplementation of all operations; otherwise it will be just a simple data viewer. Originally, the library design was focused on a great user experience for in-memory data wrangling, but I agree that we can now consider external data source support as well. We can choose some particular external source and think about how we can extend
To me, the most obvious candidates are databases. In dbplyr (note the b) they managed to keep 90% of the dplyr API while enabling server-side execution. It's just amazing. I use this almost daily to break down big data sets on the server side before pulling them to my local computer. Just see https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html for their great intro. However, I totally agree with @nikitinas that a well-designed API should come first, before considering such options/extensions.
@holgerbrandl This is what I was talking about. Here is the integration of Tables.kt with Exposed: https://github.com/mipt-npm/tables-kt/tree/master/tables-kt-exposed/src/main/kotlin/space/kscience/dataforge/exposed. And I've added the same for direct CSV. I am not sure I understand why internals are important for data wrangling. It would be nice to have a seminar on that.
From what I understand, the Exposed example is creating a wrapper around a single table. But I think it does not allow mapping compute constructs into the db. This is essentially the superpower of dbplyr: we can do

```r
eqDim = tbl(con, in_schema("Foo", "Equipment")) %>%           # create ref to a db-table
  inner_join(tbl(con, in_schema("Foo", "EquipmentName"))) %>% # create ref to another db-table and perform join (without doing it yet)
  select(EquipmentId, Equipment, Location, Operator) %>%
  group_by(Equipment) %>%
  mutate(..) %>%  # not fully worked out, but I guess this part is clear
  filter(..) %>%  # nothing has been queried or computed until this point. Similar to a Kotlin Sequence
  collect()       # here dbplyr will compile a query, run it against the db, and pull the result
```

I guess you agree that this goes beyond your example, as this requires a local query builder & optimizer (which is part of dbplyr). Most importantly, this is exactly the same API as for local dplyr computation. By simply moving up the

Indeed, only very few people care about the internals. From our discussion yesterday, I found the
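The deferred-execution idea behind that pipeline (record operations, touch nothing until `collect()`) can be sketched in Kotlin. `LazyFrame` and `Row` are hypothetical names for illustration, not part of DataFrame or dbplyr, and a real implementation would compile the recorded steps to SQL rather than interpret them in memory:

```kotlin
// Hypothetical sketch of a dbplyr-style lazy pipeline: each step records an
// operation; the source is only read when collect() is called.
data class Row(val values: Map<String, Any?>)

class LazyFrame(
    private val source: () -> List<Row>,
    private val ops: List<(List<Row>) -> List<Row>> = emptyList()
) {
    private fun withOp(op: (List<Row>) -> List<Row>) = LazyFrame(source, ops + op)

    fun filter(predicate: (Row) -> Boolean) = withOp { rows -> rows.filter(predicate) }

    fun select(vararg names: String) = withOp { rows ->
        rows.map { Row(it.values.filterKeys { k -> k in names }) }
    }

    // Only here is the source actually read and the recorded pipeline applied.
    fun collect(): List<Row> = ops.fold(source()) { acc, op -> op(acc) }
}
```

The design choice mirrors the R example: building the pipeline is cheap and side-effect free, so a server-side backend could inspect `ops` and translate the whole chain into one query before execution.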
I just gave a simple example. It is not hard to add joins as well. The DataFrame user API is fine, but do we need to have all those things locked into the implementation? Or is it possible to add helpers for them?
Considering a DataFrame API over a DB, we have several issues:
So, it's possible; it just requires some work to be done. I suggest starting with a much simpler approach that will solve the first two problems and will allow using data from the DB: we can add an adapter for Exposed that converts a Query into a DataFrame. This will be an extension
FWIW, the book "How Query Engines Work" by Andy Grove (Apache DataFusion/Ballista/Arrow-Rust author) covers building essentially the above, in Kotlin. It walks through building first a

Just in case anyone else finds this repo and is interested in similar topics = )
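For the flavor of the book's approach, here is a hypothetical miniature of the logical-plan idea: a query is a tree of plan nodes that can be inspected or optimized before execution. The node names echo common query-engine concepts; the in-memory interpreter is my own simplification, not the book's code.

```kotlin
// Hypothetical miniature of a logical query plan: the query is data (a tree),
// so it can be rewritten or pushed down before anything executes.
sealed class LogicalPlan
data class Scan(val rows: List<Map<String, Any?>>) : LogicalPlan()
data class Filter(val input: LogicalPlan, val predicate: (Map<String, Any?>) -> Boolean) : LogicalPlan()
data class Projection(val input: LogicalPlan, val columns: List<String>) : LogicalPlan()

// A naive interpreter standing in for a real physical execution layer.
fun execute(plan: LogicalPlan): List<Map<String, Any?>> = when (plan) {
    is Scan -> plan.rows
    is Filter -> execute(plan.input).filter(plan.predicate)
    is Projection -> execute(plan.input).map { row -> row.filterKeys { it in plan.columns } }
}
```

Because the plan is just a data structure, a DB-backed source could translate the same tree into SQL instead of interpreting it locally, which is the extensibility point discussed above.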
Great pointer, thanks for sharing.
Thank you for the link. Our current idea is to sit on top of Exposed, which already supports SQL query optimization.
For simple, untyped/unchecked conversion from Exposed Queries to DataFrames, something as simple as this already works:

```kotlin
// Try to get a proper name instead of something like $Line23432.Albums.artistId
// Needs to be expanded
val Expression<*>.readableName: String
    get() = when (this) {
        is Column<*> -> name
        is ExpressionAlias<*> -> alias
        is BiCompositeColumn<*, *, *> -> getRealColumns().joinToString("_") { it.readableName }
        else -> toString()
    }

// Simply retrieve the entire Query and convert the rows to columns
fun Iterable<ResultRow> /* Query */.toDataFrame(): DataFrame<*> {
    val map = mutableMapOf<String, MutableList<Any?>>()
    forEach { row ->
        for (expression in row.fieldIndex.keys) {
            map.getOrPut(expression.readableName) { mutableListOf() } += row[expression]
        }
    }
    return map.toDataFrame()
}
```

Now of course, this pulls the entire query into memory, where DF operates. If you don't want that, the operation should be batched into multiple dataframes.
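One hedged way to get that batching, in plain Kotlin with no Exposed dependency assumed: chunk the row stream and convert each chunk separately, so only one batch is materialized at a time. `toFrameBatches` and `convert` are hypothetical names standing in for a rows-to-dataframe conversion like the one above.

```kotlin
// Hypothetical helper: split a (possibly large) row stream into fixed-size
// batches and convert each batch independently. Because this operates on a
// lazy Sequence, only the current batch is held in memory.
fun <R, F> Sequence<R>.toFrameBatches(
    batchSize: Int,
    convert: (List<R>) -> F
): Sequence<F> = chunked(batchSize).map(convert)
```

Usage might look like `rows.asSequence().toFrameBatches(10_000) { it.toDataFrame() }`, assuming the conversion sketched above.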
The DataFrame primary interface seems to be over-complicated. A lot of methods have only default implementations and could be moved to extensions. I propose simplifying it significantly, as I've done here. That would allow features to be added and maintained in a simpler way. For example, it would allow the addition of row-based DataFrames.