WIP: POC Redesign extraction #256

Open

wants to merge 100 commits into main
Conversation


@llrs-roche llrs-roche commented Mar 5, 2025

Linked to insightsengineering/NEST-roadmap#36

Pull Request

Redesign extraction

This branch/PR is a POC of a specification for the user input to be extracted.
See the added tests for how it works (currently most, if not all, of them should pass); a simple example of a specification:

spec <- c(datasets("df"), variables(where(is.factor)))

Features:

  • Specifications can be composed, which allows reusing them (see the sketch after this list).
  • Three different steps: specification, validation and extraction.
  • Supporting a new object class is as easy as adding a new extraction method (if the default doesn't work) so that user-supplied functions work on that class.
  • Possible to set defaults for the choices (see TODO list).
  • Simple to define and reuse if the user gives each specification a good name.
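
A minimal sketch of such composition, reusing one dataset specification for two inputs (the dataset and variable names are illustrative, using the constructors shown above):

# Reuse a dataset specification for two different variable inputs
base_adsl <- datasets("ADSL")
arm_spec <- c(base_adsl, variables("ARM"))
armcd_spec <- c(base_adsl, variables("ARMCD"))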

Examples

Resolve input to be used by module developer

The module_input_ui and module_input_server functions help the developer get the required data.

devtools::load_all(".")
library("shiny")
library("teal")
options(shiny.reactlog = TRUE)


# UI function for the custom histogram module
histogram_module_ui <- function(id) {
  ns <- shiny::NS(id)
  shiny::tagList(
    shiny::selectInput(
      ns("dataset"),
      "Select Dataset",
      choices = NULL
    ),
    shiny::selectInput(
      ns("variable"),
      "Select Variable",
      choices = NULL
    ),
    shiny::plotOutput(ns("histogram_plot")),
    shiny::verbatimTextOutput(ns("plot_code")) # To display the reactive plot code
  )
}

# Server function for the custom histogram module with injected variables in within()
histogram_module_server <- function(id, data, spec) {
  moduleServer(id, function(input, output, session) {
    # Update dataset choices based on available datasets in teal_data
    spec_resolved <- teal.transform::resolver(spec, data())

    shiny::updateSelectInput(
      session,
      "dataset",
      choices = spec_resolved$datasets$names,
      selected = spec_resolved$datasets$select
    )
    react_dataset <- reactive({
      req(input$dataset)
      req(input$dataset %in% spec_resolved$datasets$names)
      if (!input$dataset %in% spec_resolved$datasets$select) {
        spec_resolved |>
          update_spec("datasets", input$dataset) |>
          teal.transform::resolver(data())
      } else {
        spec_resolved
      }
    })

    react_variable <- reactive({
      req(input$variable, react_dataset())
      spec_resolved <- react_dataset()
      req(input$variable %in% spec_resolved$variables$names)
      if (!input$variable %in% spec_resolved$variables$select) {
        react_dataset() |>
          update_spec("variables", input$variable) |>
          teal.transform::resolver(data())
      } else {
        spec_resolved
      }
    })

    observe({
      spec_resolved <- req(react_dataset())
      shiny::updateSelectInput(
        session,
        "variable",
        choices = spec_resolved$variables$names,
        selected = spec_resolved$variables$select
      )
    })

    # Create a reactive `teal_data` object with the histogram plot
    result <- reactive({
      spec_resolved <- req(react_variable())
      # Create a new teal_data object with the histogram plot
      new_data <- within(
        data(),
        {
          my_plot <- hist(
            input_dataset[[input_vars]],
            las = 1,
            main = paste("Histogram of", input_vars),
            xlab = input_vars,
            col = "lightblue",
            border = "black"
          )
        },
        input_dataset = as.name(as.character(spec_resolved$datasets$select)), # Replace `input_dataset` with the selected dataset
        input_vars = as.character(spec_resolved$variables$select) # Replace `input_vars` with the selected variable
      )
      new_data
    })

    # Render the histogram from the updated teal_data object
    output$histogram_plot <- shiny::renderPlot({
      req(result())
      result()[["my_plot"]] # Access and render the plot stored in `new_data`
    })

    # Reactive expression to get the generated code for the plot
    output$plot_code <- shiny::renderText({
      teal.code::get_code(result()) # Retrieve and display the code for the updated `teal_data` object
    })
  })
}

# Custom histogram module creation
create_histogram_module <- function(label = "Histogram Module", spec) {

  teal::module(
    label = label,
    ui = histogram_module_ui,
    server = histogram_module_server,
    server_args = list(spec = spec),
    datanames = "all"
  )
}

# Initialize the teal app
app <- init(
  data = teal_data(IRIS = iris, MTCARS = mtcars),
  modules = modules(create_histogram_module(spec = c(datasets(where(is.data.frame)),
                                                     variables(where(is.numeric)))))
)

runApp(app)

Merge arbitrary data

The merge_module_srv accepts the inputs and merges arbitrary data processed by the module_input_server; this is done thanks to extract_input, which is responsible for obtaining the requested data from data().

devtools::load_all(".")
library("teal")
library("shiny")
library("dplyr")

# Define a custom module
mod <- function(label, x, y) {
  module(
    ui = function(id, x, y) {
      ns <- NS(id)
      div(
        module_input_ui(ns("x"), "x", x),
        module_input_ui(ns("y"), "y", y),
        shiny::tagList(
          shiny::textOutput(ns("text")),
          shiny::tableOutput(ns("table"))
        )
      )
    },
    server = function(id, data, x, y) {
      moduleServer(id, function(input, output, session) {
        x_in <- module_input_server("x", x, data)
        y_in <- module_input_server("y", y, data)

        merged_data <- reactive({
          req(x_in(), y_in())
          data_to_merge <- list(x_in(), y_in())
          m <- merge_call_multiple(data_to_merge, merge_function = "dplyr::full_join", data = data())
        })

        output$table <- shiny::renderTable({
          merged_data()[["ANL"]]
        })
      })
    },
    ui_args = list(x = x, y = y),
    server_args = list(x = x, y = y)
  )
}

data <- within(teal.data::teal_data(), {
  df1 <- data.frame(id = 1:10, var1 = letters[1:10], var2 = letters[1:10], var3 = 1:10)
  df2 <- data.frame(id = rep(1:10, 2), var2 = letters[1:20], var3 = factor(LETTERS[1:20]), var4 = LETTERS[1:20])
})

ids_df1 <- c("id", "var1", "var2")
ids_df2 <- c("id", "var2")

app <- init(
  data = data,
  modules = modules(
    mod(x = c(datasets("df1"), variables(ids_df1, ids_df1)),
        y = c(datasets("df2"), variables(ids_df2, ids_df2)))
  )
)

shinyApp(app$ui, app$server)
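
For orientation, the merged ANL in this example should be roughly equivalent to the following (a sketch assuming the columns shared between the two selections act as join keys):

ANL <- dplyr::full_join(
  df1[, c("id", "var1", "var2")],
  df2[, c("id", "var2")],
  by = c("id", "var2")
)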

MAE

devtools::load_all(".")
library("DFplyr")

tda <- within(teal.data::teal_data(), {
  library("MultiAssayExperiment")
  m <- diag(5)
  i <- iris
  data(miniACC, envir = environment())
  mae2 <- hermes::multi_assay_experiment
})


r <- resolver(
  c(
    datasets(where(function(x) is(x, "MultiAssayExperiment"))),
    mae_colData(where(is.numeric))
  ),
  tda
)

tda[[r[[1]]$select]] |>
  colData() |>
  select(r[[2]]$names)

TODO

  • Data extraction and preparation design.

  • UI update based on partial resolution.

  • Check it is simple for app developers.

  • Use tidyselect instead of own operators and resolution.

  • Update the API to not rely on operators from this package (rely only on those from tidyselect).

  • Test extraction on non data.frame objects: MAE objects/matrices, lists...

  • Check missing corners: test the required changes on current modules:
    tmg: tm_missing_data, tm_g_distribution, and tm_data_table;
    tmc: tm_a_mmrm and tm_t_events.
    Only up to the server body; check the defaults.
    Estimate: 8 hours per module, mmrm = 10 hours.

  • Verify value specification: Some modules request variable A vs other variables or a numeric input.

  • Test the data extraction and merging in the module (ANL).
  • With different data types: data.frame, matrices, MAE, SE.
    Estimate: 20 hours.
  • Verify bookmarking works

  • Verify it works on multiple data merges from multiple sources (cf. [Bug]: Forced merge precludes using modules without joining keys #258).

  • Check it is simple for module developers (less important).

  • UI validation: remove unnecessary input UI or gray it out when no input is possible (just one choice). Example: a selection that allows choosing data.frames that have column A; there is no need to select a variable.

  • Documentation/help

  • Check transition/notify dependencies/users

  • Update module packages to use that feature

    • teal.modules.general
    • teal.modules.clinical
    • Other
  • Set sensible defaults for specifications.
    What should the constructor have as a default for the extract spec (i.e. is it a data.frame? take all numeric columns? select only the first one, with all available options)?
    Each module may have different requirements.

@chlebowa
Contributor

chlebowa commented Mar 5, 2025

spec <- datasets("df") & variables(is.factor)

This would specify "factor columns from df", correct? Variables are specified with predicate functions?
Currently variable choices are specified with functions that act on the whole dataset, which allows for more flexibility, and which was deemed necessary in preliminary discussions. If one wants to use all columns whose names are in all caps, it cannot be done by applying predicates to columns.
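
For illustration (a hypothetical helper, not the package API), a chooser that receives the whole dataset can do this, whereas a per-column predicate cannot:

all_caps_columns <- function(data) {
  # Select columns whose names are unchanged by toupper(), i.e. all-caps names
  names(data)[names(data) == toupper(names(data))]
}
all_caps_columns(data.frame(A = 1, Ab = 2, B = 3))
#> [1] "A" "B"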

@llrs-roche
Author

llrs-roche commented Mar 5, 2025

This would specify "factor columns from df", correct?

Yes, I'm glad it is easy to understand.

Variables are specified with predicate functions? Currently variable choices are specified with functions that act on the whole dataset, which allows for more flexibility, and which was deemed necessary in preliminary discussions. If one wants to use all columns whose names are in all caps, it cannot be done by applying predicates to columns.

Not sure I fully understand your question. Variables can also be manually set, for example variables("A"). But filtering columns whose names are all in caps is currently possible:

td <- within(teal.data::teal_data(), {
  df <- data.frame(A = as.factor(LETTERS[1:5]), Ab = letters[1:5])
  m <- matrix()
})
spec <- datasets("df") & variables(function(x){x==toupper(x)})
resolver(spec, td) #no print method available yet, but basically resolves to df dataset and A variable

The only limitation is that the function provided should return a logical vector. In terms of implementation, the function is applied to the names and to the values, precisely to accommodate cases like this.
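
For illustration, a minimal sketch of how such a dual application could behave (a hypothetical helper, not the actual implementation):

select_variables <- function(df, predicate) {
  # Try the predicate on the column names and on the column values;
  # keep a column if either application returns TRUE.
  by_name <- vapply(names(df), function(nm) isTRUE(all(predicate(nm))), logical(1))
  by_value <- vapply(df, function(col) isTRUE(all(predicate(col))), logical(1))
  names(df)[by_name | by_value]
}
df <- data.frame(A = factor(letters[1:5]), Ab = letters[1:5])
select_variables(df, is.factor)                   # matches A by value
#> [1] "A"
select_variables(df, function(x) x == toupper(x)) # matches A by name
#> [1] "A"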

@chlebowa
Contributor

chlebowa commented Mar 5, 2025

spec <- datasets("df") & variables(function(x){x==toupper(x)})

I don't get this one. I would think it would return columns where the values are in all caps.

@chlebowa
Contributor

chlebowa commented Mar 5, 2025

In terms of implementation, the function is applied to the names and to the values, precisely to accommodate cases like this.

The same function is applied to column names and column values? 🤔

@llrs-roche
Author

spec <- datasets("df") & variables(function(x){x==toupper(x)})

I don't get this one. I would think it would return columns where the values are in all caps.

It could mean both things imho. But here it is specifying variables, not values. For filtering by value there is values() (still a work in progress).
Perhaps this other example is more representative of what I meant (I didn't realize the confusion when I reused an example with capital letters):

td <- within(teal.data::teal_data(), {
    df <- data.frame(A = as.factor(letters[1:5]), Ab = letters[1:5])
    m <- matrix()
})
spec <- datasets("df") & variables(function(x){x==toupper(x)})
resolver(spec, td)

It still selects A, as the name is in capital letters.

The same function is applied to column names and column values? 🤔

Yes; if there is a single argument it should accept is.factor and all_caps <- function(x){x == toupper(x)}, and also filter by names with functions like starts_with(). I hope this helps.

@chlebowa
Contributor

chlebowa commented Mar 5, 2025

Thanks, it does explain things a bit.

@llrs-roche
Author

llrs-roche commented Mar 7, 2025

I updated the branch to make it easy to reflect updates on selections from the possible choices. Besides some internal functions, it now exports a new key function, update_spec(), to check whether updated selections invalidate other selections (if the dataset changes, the variables should change too).
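
In isolation, the update pattern is roughly the following (a sketch based on the histogram example above, run inside a reactive context):

spec_resolved <- teal.transform::resolver(spec, data()) # initial resolution
spec_updated <- spec_resolved |>
  update_spec("datasets", input$dataset) |>   # the user picked another dataset
  teal.transform::resolver(data())            # re-resolve so dependent selections are revalidated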

Here is a full example with a shiny module and teal:

library("teal")
library("shiny")

# Define a custom module
mod <- function(label, x, y) {
  module(
    ui = function(id, x, y) {
      ns <- NS(id)
      div(
        module_input_ui(ns("x"), "x", x),
        module_input_ui(ns("y"), "y", y),
        shiny::tagList(
          shiny::textOutput(ns("text"))
        )
      )
    },
    server = function(id, data, x, y) {
      moduleServer(id, function(input, output, session) {
        x_in <- module_input_server("x", x, data)
        y_in <- module_input_server("y", y, data)

        output$text <- shiny::renderText({
          req(x_in(), y_in())
          l <- lapply(
            list(x = x_in, y = y_in),
            function(sel) {
              if (sum(lengths(sel()))) {
                paste0(
                  "Object: ", sel()$datasets, "\nVariables: ",
                  paste0(sel()$variables, collapse = ", ")
                )
              } else {
                "No info??"
              }
            }
          )
          unlist(l)
        })
      })
    },
    ui_args = list(x = x, y = y),
    server_args = list(x = x, y = y)
  )
}

Here I create three different specifications; you can pass them to the app and they will work:

iris_numeric <- datasets("IRIS") & variables(is.numeric)
mtcars_char <- datasets("MTCARS") & variables(is.character)
iris_numeric_or_mtcars_char <- iris_numeric | mtcars_char

app <- init(
  data = teal.data::teal_data(IRIS = iris, MTCARS = mtcars),
  modules = modules(
    mod(x = datasets("IRIS") & variables(is.factor),
        y = datasets("MTCARS") & variables(is.numeric))
  )
)

shinyApp(app$ui, app$server)

@llrs-roche
Author

I've been experimenting a bit and it works well for some simple cases:

env <- within(teal.data::teal_data(), {
  iris <- iris
  iris2 <- cbind(iris, A = 1)
  m <- diag(5)
  d <- cbind(m, B = 1)
})

However it easily fails when multiple inputs are acceptable:

env2 <- within(env, {
  iris2$a <- iris2$Species
})
# Get the levels of the factors:
sapply(env2, function(x) {
  if (is.data.frame(x) & any(sapply(x, is.factor))) {
    levels(x[, sapply(x, is.factor)])
  }
})

We would still need to ensure that the output is valid within some limits (only allow named logical values? only allow the names of data objects, variables, or values?).

When multiple options are valid in the previous step, evaluating further steps could create confusion about how to limit them to those selected by the user, or about the structure returned (a list of lists of lists). I am thinking, for example, of allowing selection of all the levels from variables that start with ARM, but where at most one variable can be selected.

This direct evaluation makes it a tad harder to compose the extraction. Say we want the variable ARM of ADSL and then, in a different input, ARMCD. With the current design we can reuse the dataset specification and append the variable specification. With a direct function, app developers will need to do that themselves:

adam <- within(teal.data::teal_data(), {
  penguins <- penguins 
  ADSL <- data.frame(ARM = 1, ARMCD = 2)
})

is_ADSL <- function(x) {
  x == "ADSL"
}

is_ARM <- function(x) {
  x == "ARM"
}

is_ARMCD <- function(x) {
  x == "ARMCD"
}

sapply(names(adam), function(x, env){
  if (is_ADSL(x)) {
    nam <- names(env[[x]])
    nam[is_ARM(nam)]
  }
}, env = adam)
#> $ADSL
#> [1] "ARM"
#> 
#> $penguins
#> NULL

sapply(names(adam), function(x, env){
  if (is_ADSL(x)) {
    nam <- names(env[[x]])
    nam[is_ARMCD(nam)]
  }
}, env = adam)
#> $ADSL
#> [1] "ARMCD"
#> 
#> $penguins
#> NULL

Created on 2025-04-23 with reprex v2.1.1
(Here I created helper functions because I expect this will be repeated for multiple inputs in a real app.)

We might still need to evaluate the function several times in case the data has changed due to user-updated selections, so the benefit of having a single eager function is diminished.

On the plus side, I like that this would make it a lot easier to support new/different classes (in the current implementation I think there is room for improvement in supporting them). But I don't think there will be requests/interest for many new classes (Bioconductor's emphasis on reusing classes will make it easy to support different classes that inherit from a single one).

But let's see the team's and @gogonzo's feedback.

@pawelru
Contributor

pawelru commented Apr 23, 2025

When it comes to readability, please consider the purrr::mapX() family of functions, which allows looping through multiple objects (of the same length). In our case that would be env and the names of its objects.

r$> purrr::map2_chr(.x = names(as.list(env)), .y = as.list(env), .f = \(name, value) `if`(is.data.frame(value) && "Species" %in% names(value), name, NA_character_))
[1] NA     "iris"

Then we would require a function of \(name, value) ..., which could be a little bit clearer and easier for the app developer.

We would still need to ensure that the output is valid within some limits (only allow named logical values? only allow the names of data objects, variables, or values?).

Yes. We can make a simple decorator with some "validation" on the returned value. That validation might need some additional information though.
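
For example, a minimal sketch of such a decorator (names and checks are illustrative):

with_validation <- function(selector, allowed_names) {
  function(...) {
    result <- selector(...)
    # Only accept character results that are among the known names
    stopifnot(is.character(result), all(result %in% allowed_names))
    result
  }
}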

We might still need to evaluate the function several times in case the data has changed due to user-updated selections, so the benefit of having a single eager function is diminished.

Definitely yes. Note the flow: filter panel -> encoding -> output. Encoding needs to be reactive to changes in the filter panel.
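
Roughly (a sketch reusing the resolver pattern from the PR description):

spec_resolved <- reactive({
  teal.transform::resolver(spec, data()) # re-runs whenever the filtered data() changes
})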

Say we want the variable ARM of ADSL and then, in a different input, ARMCD. With the current design we can reuse the dataset specification and append the variable specification.

I was thinking about this, as it might be one of the biggest challenges here. Recently I have been interacting more with configuration files, in which things are more explicit. How about enabling different logic depending on the dataset selected, via a simple case-when clause?

\(dataname, value) {
  case_when(
    dataname
    "ADSL" = <tidyselection for ADSL>
    "ADTTE" = ...
    ...
  )
}

@chlebowa
Contributor

When it comes to readability, please consider the purrr::mapX() family of functions, which allows looping through multiple objects (of the same length). In our case that would be env and the names of its objects.

mapply is quite enough for this purpose, IMHO. purrr is a can of worms best left unopened.
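
For comparison, the same selection with base mapply (equivalent to the purrr call above):

mapply(
  function(name, value) {
    if (is.data.frame(value) && "Species" %in% names(value)) name else NA_character_
  },
  names(as.list(env)),
  as.list(env)
)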

@llrs-roche
Author

Then we would require a function of (name, value) ..., which could be a little bit clearer and easier for the app developer.

One piece of feedback from Nina was whether there would be a sensible default for app developers. I guess if we go with this implementation it could be even simpler, as the app developer could pick the first of whatever makes sense for their modules.

Yes. We can make a simple decorator with some "validation" on the returned value. That validation might need some additional information though.

This validation is what led to multiple methods being required for new classes in the current design. Accessing and extracting is one thing, but then we need to make sure the values retrieved make sense.

How about enabling different logic depending on the dataset selected via simple case when clause?

Having a switch clause won't help reduce the length of the code when the input shown to the users is from the same dataset. Soon we will have requests to provide a template/default or to simplify that part of the API. Even if we didn't, these switch clauses would be repeated for multiple user inputs and could become too verbose imho.

PS: @chlebowa feel free to give some feedback about the design itself too. I'm not sure if you like the dual purpose of the functions provided in the current implementation, or whether you would prefer, for easy extension, having direct access to the data.

PS2: As the increment is about to finish, we'll need to decide soon where this is heading and what the next steps are.

Contributor

@gogonzo gogonzo left a comment

Removing the logical operators implemented on the type classes will simplify the code in multiple places. In the POC we can still use c() in the examples and the app wouldn't fail. The API should only be a replacement for the "explicit" class specification.

# make sure this works
c(
  datasets(is.data.frame, select = "df1"),
  variables(is.numeric)
)

# this is just an alternative syntax to above  
datasets(is.data.frame, select = "df1") & variables(is.numeric)

Additionally, we don't have a transform constructor and it is only possible to create a "transform" using c() or logical operators, which is the opposite of how the design should be implemented. The problem is that transform is already a base method and we can't create a function with a signature other than transform(_data, ...). Probably we need another name for this class.
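
For instance, a hypothetical explicit constructor could look like this (the name and signature are placeholders):

new_spec <- function(...) {
  structure(list(...), class = "transform")
}
# c() and `&` would then just be alternative syntax on top of it:
new_spec(datasets(is.data.frame, select = "df1"), variables(is.numeric))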

Comment on lines +1 to +11
helper_input <- function(id,
                         label,
                         multiple = FALSE) {
  shiny::selectInput(
    id,
    label,
    choices = NULL,
    selected = NULL,
    multiple = multiple
  )
}
Contributor

it is short enough to use it directly

R/types.R Outdated
"Invalid selection" = check_input(type)
)
if (is.function(x)) {
x <- list(x)
Contributor

why is it a list?


Author

I had some challenges combining different types when a single function is provided, and this was the simplest way to solve them that I found. If we go with direct access to the qenv with tidyselect helpers, we will need to replace that and capture the expression or something similar.

R/resolver.R Outdated

# Evaluate if the function applied to the data
# but we need to return the name of the data received
functions_data <- function(spec_criteria, names_data, data) {
Contributor

functions_names and functions_data could be replaced to handle tidyselect syntax. The implementation is pretty simple:

foo <- function(data, ...) {
  tidyselect::eval_select(rlang::expr(c(...)), data)
}

iris |> foo(starts_with("Sepal"))
# Sepal.Length  Sepal.Width 
#           1            2 

iris |>  foo(where(is.numeric))
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#           1            2            3            4

The remaining question is how to implement eval_select for datasets or for other classes. Maybe we should have our own "eval_select" equivalent to handle multiple types (qenv, data.frame).
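
One possible direction, sketched with S3 dispatch (hypothetical names; the qenv is coerced with as.list() as in the example below):

eval_select_any <- function(x, ...) UseMethod("eval_select_any")
eval_select_any.data.frame <- function(x, ...) tidyselect::eval_select(rlang::expr(c(...)), x)
eval_select_any.list <- function(x, ...) tidyselect::eval_select(rlang::expr(c(...)), x)
eval_select_any.qenv <- function(x, ...) tidyselect::eval_select(rlang::expr(c(...)), as.list(x))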

Contributor

@gogonzo gogonzo commented Apr 24, 2025

This works for example:

teal.code::qenv() |> within({
  aa <- 1
  ab <- 2
  ba <- 3
  bb <- NULL
})  |> as.list() |> foo(starts_with("a") | where(is.null), -aa)
# ab bb 
#  4  2 

Author

This seems much cleaner than what I implemented 👍👍👍. The exploration for other classes is worth it I think, especially seeing that it already works with lists (which I thought wouldn't work). Extending to multiple classes might then be simpler.

@@ -0,0 +1,134 @@
#' @export
Ops.transform <- function(e1, e2) {
Contributor

After digging into the implementation I think the tidyselect-inspired syntax is a little bit off from the original. I think tidyselect makes sense for the names argument, e.g. datasets(names = starts_with(...)).
Tidyselect allows combining different "selection helpers" with logical operators, like datasets(starts_with("Sepal") | !ends_with("Width")). I find combining "type" objects (datasets, variables, ...) with logical operators complicated and confusing. For example, what would be the result of:

datasets(is.data.frame, select = "iris") & variables(starts_with("Sepal")) &
!datasets(is.data.frame, select = "mtcars") & variables(is.numeric)

I think it doesn't make sense to copy the tidyselect syntax and maintain the operators on our side. And I would stick to using operators for a single argument (names) rather than for the whole "transform" object. Maintaining Ops for the whole transform class might be (and already is) very complex.

Author

After digging into the implementation I think the tidyselect-inspired syntax is a little bit off from the original. I think tidyselect makes sense for the names argument, e.g. datasets(names = starts_with(...)).

So you wouldn't allow tidyselect syntax on the select = argument? And I take it you would rename x to names on the types 👍.

Tidyselect allows combining different "selection helpers" with logical operators, like datasets(starts_with("Sepal") | !ends_with("Width")). I find combining "type" objects (datasets, variables, ...) with logical operators complicated and confusing. For example, what would be the result of:

datasets(is.data.frame, select = "iris") & variables(starts_with("Sepal")) &
!datasets(is.data.frame, select = "mtcars") & variables(is.numeric)

The negation currently only takes into account the specification in the x argument (as you can see with (!datasets(is.data.frame, select = "mtcars"))), so this would theoretically result in an empty selection, as it would select all data.frames and then exclude them.
Given the precedence of !, it is only applied to the datasets, so it would be possible to select numeric variables and all those that start with Sepal from the selected data.frames (none in this case).
Currently there is no method for ! on the "transform" class, i.e. this returns an error: !(datasets(is.data.frame, select = "mtcars") & variables(is.numeric)).

datasets(is.data.frame, select = "iris") & variables(starts_with("Sepal")) &
!datasets(is.data.frame, select = "mtcars") & variables(is.numeric)

I think it doesn't make sense to copy the tidyselect syntax and maintain the operators on our side.

The problem with relying on tidyselect is that it only applies to data.frames; it won't work with other classes, not even matrices. If this is okay, and we apply some tricks as Pawel suggested above for other methods, I agree this is in principle less work for us.

And I would stick to using operators for a single argument (names) rather than for the whole "transform" object. Maintaining Ops for the whole transform class might be (and already is) very complex.

I agree that the operators are more complex than I initially thought. This is partly due to the operators working by class. If we keep them for the names argument and (only?) allow tidyselect helpers, then we don't even need to do that ourselves.

Author

@llrs-roche llrs-roche left a comment

You made me realize that tidyselect is easier and a good path forward 👍

Removing the logical operators implemented on the type classes will simplify the code in multiple places.

Will do that, as they will only be applied within the tidyselect environment.

In the POC we can still use c() in the examples and the app wouldn't fail. The API should only be a replacement for the "explicit" class specification.

# make sure this works
c(
  datasets(is.data.frame, select = "df1"),
  variables(is.numeric)
)

# this is just an alternative syntax to above  
datasets(is.data.frame, select = "df1") & variables(is.numeric)

Additionally, we don't have a transform constructor and it is only possible to create a "transform" using c() or logical operators, which is the opposite of how the design should be implemented.

Sorry, I first thought about the grammar and then about the implementation, and this is reflected in the code.

The problem is that transform is already a base method and we can't create a function with a signature other than transform(_data, ...). Probably we need another name for this class.

What would be a good name? Some ideas:

  • specification, but it can be technical and doesn't explain what it does;
  • selection, closer to dplyr::select, which might help users;
  • input, closer to shiny/web development.

Contributor

@chlebowa chlebowa left a comment

Just some minor comments b/c I got a commit notification.

Development

Successfully merging this pull request may close these issues.

Design data extract and data merge