inefficient data.table code in benchmarks #31

Open
@BenoitLondon

The data.table code in the benchmarks is a bit unfair to data.table.

In the first example,

robject_dt <- function() {
  
  as.data.table(DataMultiTypes)[
    
    colInt > 2000 & colInt < 8000
    
  ][, .(min_colInt = min(colInt),
        mean_colInt = mean(colInt),
        max_colInt = max(colInt),
        min_colNum = min(colNum),
        mean_colNum = mean(colNum),
        max_colNum = max(colNum)),
    
    by = colString
  ]
}

as.data.table makes a full copy of the data every time the function is called. To make a fair comparison with polars, you could build the data.table beforehand, outside the timed function.
With that change, data.table gets closer to dplyr in my benchmark.
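
To illustrate what I mean, here is a minimal sketch of the "convert once, then time only the query" variant. The small synthetic `DataMultiTypes` below is just a stand-in I made up so the snippet runs on its own, and the names `DataMultiTypes_dt` and `robject_dt_prebuilt` are hypothetical, not taken from the benchmark repo.

```r
library(data.table)

# Hypothetical stand-in for the benchmark's DataMultiTypes data frame
set.seed(1)
n <- 1e5
DataMultiTypes <- data.frame(
  colInt    = sample.int(10000L, n, replace = TRUE),
  colNum    = runif(n),
  colString = sample(c("a", "b", "c"), n, replace = TRUE)
)

# Convert once, outside the function that gets timed
DataMultiTypes_dt <- as.data.table(DataMultiTypes)

# Same filter and aggregation as before, minus the per-call copy
robject_dt_prebuilt <- function() {
  DataMultiTypes_dt[
    colInt > 2000 & colInt < 8000
  ][, .(min_colInt = min(colInt),
        mean_colInt = mean(colInt),
        max_colInt = max(colInt),
        min_colNum = min(colNum),
        mean_colNum = mean(colNum),
        max_colNum = max(colNum)),
    by = colString
  ]
}

res <- robject_dt_prebuilt()
```

If modifying the original object in place is acceptable, `setDT()` is another option, since it converts a data.frame to a data.table by reference instead of copying it.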

In the CSV example you do not need as.data.table at all, since fread already returns a data.table.
Without it, the data.table version is about 2.5x faster than dplyr (on my machine, with 10 threads for data.table) and probably beats polars (eager).
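
As a small sketch of the CSV case: the temporary file and its columns below are made up for illustration, and the aggregation is shortened to one column. The point is only that the result of `fread` can be queried directly.

```r
library(data.table)

# Hypothetical CSV written to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
fwrite(
  data.frame(
    colInt    = c(1000L, 3000L, 5000L, 9000L),
    colString = c("a", "a", "b", "b")
  ),
  csv_path
)

# fread() returns a data.table by default, so no as.data.table() is needed
csv_dt <- fread(csv_path)

# Filter and aggregate directly on the fread() result
res_csv <- csv_dt[
  colInt > 2000 & colInt < 8000,
  .(mean_colInt = mean(colInt)),
  by = colString
]
```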

I could not run the polars code, as it was throwing errors like:

syntax error: days is not a method/attribute of the class RPolarsExprDTNameSpace 
       when calling method:
       (pl$col("colDate2") - pl$col("colDate1"))$dt$days
