fix(ibis): refactor the available function list #1112


Merged: 24 commits, Mar 31, 2025

Conversation

@goldmedal (Contributor) commented Mar 26, 2025

Description

  • Get the function information from INFORMATION_SCHEMA.
  • Enhance the type signature registering.
  • Remove the duplicate remote functions for each DataSource.
    • Add a Python script for it: ibis-server/tools/remote_function_check.py
  • Fix the type name and format for each DataSource.
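The first bullet can be sketched in miniature: pull routine metadata out of INFORMATION_SCHEMA and fold it into a function catalog. This is a hedged illustration, not the PR's actual implementation; the query text, column names, and the simulated rows below are assumptions.

```python
# Hedged sketch of the INFORMATION_SCHEMA approach described above.
# The query and column names are illustrative; rows are simulated below.
INFO_SCHEMA_QUERY = """
SELECT routine_name, data_type
FROM information_schema.routines
WHERE routine_type = 'FUNCTION'
"""

def collect_functions(rows):
    """Fold (name, return_type) rows into a catalog, keeping the first
    return type seen for each name (i.e. dropping duplicate entries)."""
    catalog = {}
    for name, return_type in rows:
        catalog.setdefault(name, return_type)
    return catalog

# Simulated rows standing in for the result of INFO_SCHEMA_QUERY.
rows = [("abs", "numeric"), ("to_base64", "text"), ("abs", "numeric")]
catalog = collect_functions(rows)
```

Deduplicating on name here mirrors the PR's goal of removing duplicate remote functions per DataSource.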

Summary by CodeRabbit

  • New Features

    • Expanded support for SQL functions across various database integrations, with updated descriptions and type specifications for improved query accuracy.
    • Introduced a utility tool to validate and maintain the function catalog.
  • Improvements

    • Enhanced performance and reliability through asynchronous processing and refined error handling.
    • Extended public API exposure (re-exporting the DataFusion prelude and arrow modules) to unlock broader analytical capabilities.
    • Significant increase in the number of available functions, reflecting expanded functionality.

coderabbitai bot commented Mar 26, 2025

Walkthrough

This pull request updates several test files in the ibis-server by modifying function test assertions, expected values, and removing CSV-based test functions. The constant DATAFUSION_FUNCTION_COUNT has been increased significantly. A new tool script for duplicate function checking in CSV files is introduced. In addition, core modules in wren-core-py and wren-core have been enhanced to support asynchronous processing, improved error handling with the introduction of try_map_data_type, and refined remote function and UDF handling.

Changes

File(s) and change summary:

  • ibis-server/tests/conftest.py: Updated constant: DATAFUSION_FUNCTION_COUNT changed from 273 to 25941.
  • ibis-server/tests/routers/v3/connector/*/test_functions.py: Modified test_function_list: updated filtering logic, expected function name, description, parameter types, and return types; removed the test_functions function in several connectors (BigQuery, ClickHouse, MSSQL, MySQL, Postgres, Trino). In Snowflake, assertions and remote path checks are commented out with new expectations.
  • ibis-server/tools/remote_function_check.py: Introduced a new script that checks for duplicate functions in a CSV file, creates a backup (.bkup), and writes only valid function entries.
  • wren-core-py/src/context.rs, wren-core-py/src/errors.rs, wren-core-py/src/remote_functions.rs: In context.rs, removed the remote_functions field and added a new runtime field; updated initialization and asynchronous operation methods. In errors.rs, reintroduced the ParseIntError import. In remote_functions.rs, expanded derive attributes and enhanced conversion and display logic for PyRemoteFunction.
  • wren-core-py/tests/test_modeling_core.py: Revised SQL queries and assertions (e.g., add_two now takes two parameters), increased the expected function count, and added validations for new functions.
  • wren-core/core/src/lib.rs, wren-core/core/src/logical_plan/analyze/plan.rs, wren-core/core/src/logical_plan/utils.rs, wren-core/core/src/mdl/function.rs, wren-core/core/src/mdl/mod.rs, wren-core/core/src/mdl/utils.rs: Broadened exports in lib.rs to include all DataFusion prelude and arrow modules. Replaced map_data_type with try_map_data_type in logical plan modules and MDL utilities for better error handling. Enhanced the RemoteFunction struct and UDF conversions by introducing a new ReturnType enum and signature methods.
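The duplicate-check tool's described behavior (back up the CSV, keep only function rows not already provided by wren_core) can be sketched as follows. This is a minimal stand-in, not the actual script: the real tool reads the default names from wren_core, while here `DEFAULT_NAMES` is a hypothetical set, and the function name is assumed to live in the second CSV column.

```python
# Minimal sketch of the duplicate-check tool described above
# (ibis-server/tools/remote_function_check.py). The real script gets the
# default function names from wren_core; DEFAULT_NAMES is a stand-in.
import csv
import os
import shutil
import tempfile

DEFAULT_NAMES = {"abs"}  # assumption: functions already registered by default

def dedupe_function_csv(path, default_names):
    """Back up `path` to `path + '.bkup'`, then rewrite it keeping only
    functions whose name (second column) is not in `default_names`.
    Returns the list of removed function names."""
    shutil.copy(path, path + ".bkup")
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    removed = [r[1] for r in body if r[1] in default_names]
    kept = [r for r in body if r[1] not in default_names]
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows([header] + kept)
    return removed

# Demo on a throwaway CSV file.
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "functions.csv")
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows(
        [["type", "name"], ["scalar", "abs"], ["scalar", "to_base64"]]
    )
removed = dedupe_function_csv(csv_path, DEFAULT_NAMES)
backup_exists = os.path.exists(csv_path + ".bkup")
with open(csv_path, newline="") as f:
    remaining = list(csv.reader(f))
```

Writing the backup before rewriting means a failed run leaves the original list recoverable from the .bkup file.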

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant PySessionContext
    participant Runtime
    participant Database
    Client->>PySessionContext: new(mdl_base64, remote_functions_path)
    PySessionContext->>Runtime: Initialize asynchronous runtime
    PySessionContext->>Runtime: Register remote functions asynchronously
    Client->>PySessionContext: transform_sql(sql)
    PySessionContext->>Runtime: Execute SQL transformation asynchronously
    Runtime->>Database: Query registered functions (if needed)
    Database-->>Runtime: Return query results
    Runtime-->>PySessionContext: Return transformed SQL/result
    PySessionContext-->>Client: Return transformed SQL/result

Possibly related PRs

Suggested labels

dependencies

Suggested reviewers

  • wwwy3y3

Poem

I’m a little rabbit, hopping along the code,
Found changes in tests and core—what a load!
New async paths, errors well-handled with care,
And CSV checks to make our functions fair.
With a twitch of my ears and a happy little hop,
I celebrate these tweaks until bugs all stop! 🐇🌟


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a965395 and 7ac4ccb.

📒 Files selected for processing (1)
  • ibis-server/tests/routers/v3/connector/snowflake/test_functions.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • ibis-server/tests/routers/v3/connector/snowflake/test_functions.py
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: cargo test (amd64)
  • GitHub Check: ci
  • GitHub Check: test
  • GitHub Check: cargo test (win64)


@github-actions github-actions bot added core rust Pull requests that update Rust code python Pull requests that update Python code labels Mar 26, 2025
@goldmedal (Contributor, Author) commented:

We still have the BigQuery secret key issue for the CI. I tested BigQuery on my laptop, and it worked well.

> just test bigquery 
poetry run pytest -m 'bigquery'
================================================================================================================= test session starts ==================================================================================================================
platform darwin -- Python 3.11.11, pytest-8.3.5, pluggy-1.5.0
configfile: pyproject.toml
plugins: anyio-4.8.0
collected 288 items / 256 deselected / 32 selected                                                                                                                                                                                                     

tests/routers/v2/connector/test_bigquery.py ....................                                                                                                                                                                                 [ 62%]
tests/routers/v3/connector/bigquery/test_functions.py ...                                                                                                                                                                                        [ 71%]
tests/routers/v3/connector/bigquery/test_query.py .........  

@goldmedal goldmedal marked this pull request as ready for review March 28, 2025 10:29
@goldmedal goldmedal requested a review from douenergy March 28, 2025 10:30
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (16)
wren-core/core/src/lib.rs (1)

4-4:

❓ Verification inconclusive

Expanded public API with broader datafusion exports

You've changed from selectively exporting SessionContext to re-exporting all items from datafusion::arrow::* and datafusion::prelude::*. While this makes more functionality available, it could lead to:

  1. Potential naming conflicts with other modules
  2. Tighter coupling to datafusion's API (changes in datafusion could directly impact users)
  3. Less clear API boundaries

Consider a more selective approach or documenting the expanded public API:

- pub use datafusion::arrow::*;
+ // Export specific types from datafusion::arrow that are needed
+ pub use datafusion::arrow::{array, compute, datatypes, record_batch};

- pub use datafusion::prelude::*;
+ // Export commonly used types from datafusion::prelude
+ pub use datafusion::prelude::{SessionContext, col, lit, when, SessionConfig};

Let's check if any of the exported symbols might conflict with other common names:

Also applies to: 7-7


🏁 Script executed:

#!/bin/bash
# Check for potential naming conflicts in exported symbols
echo "Checking for common names that might conflict..."

# Get symbols from datafusion::prelude
cargo doc --no-deps --open &> /dev/null
echo "Common names from datafusion::prelude to watch for conflicts:"
cargo rustdoc -- --document-private-items 2>/dev/null
find target/doc -name "*.html" | grep -i "datafusion.*prelude" | xargs grep -l "pub fn \|pub struct \|pub enum" | xargs grep -o "pub [a-z]* [A-Za-z0-9_]*" | grep -E "pub (fn|struct|enum) (error|result|option|string|vec|map|set|path|file|io|read|write|context|config|session)"

Length of output: 636


Attention: Verify Exported Symbols and API Boundaries

The recent change re-exports all symbols from both datafusion::arrow and datafusion::prelude, which may expose the API to potential naming conflicts, tighter coupling with DataFusion’s internals, and a less clear public API surface. We attempted an automated check for common naming conflicts using documentation generated from DataFusion's prelude, but the script did not yield any output (the target/doc directory was not found). This suggests that automated validation was inconclusive, so please verify manually if any of the re-exported symbols conflict with common names used elsewhere in the codebase.

Recommendations:

  • Review Naming Conflicts: Manually inspect the exported symbols to ensure they don’t conflict with common names (e.g., context, config, session).
  • Consider a Selective Export: To minimize risks, export only those components that are actually required. For example:
    - pub use datafusion::arrow::*;
    + // Export specific items needed from datafusion::arrow:
    + pub use datafusion::arrow::{array, compute, datatypes, record_batch};
    
    - pub use datafusion::prelude::*;
    + // Export selected functionality from datafusion::prelude:
    + pub use datafusion::prelude::{SessionContext, col, lit, when, SessionConfig};
  • Documentation: Clearly document the expanded public API and note the rationale behind the selective exports.

Please manually verify the exported symbols for potential conflicts and evaluate whether a more selective export approach would reduce coupling with DataFusion’s API.

ibis-server/tools/remote_function_check.py (3)

1-43: New utility script for duplicate function detection looks good, with minor improvement opportunities

This script efficiently identifies and removes duplicate functions from CSV files by comparing against functions available in wren_core. It supports the PR's goal of eliminating duplicate remote functions.

Consider these improvements:

  1. Add error handling for file operations (file not found, permission errors)
  2. Replace print statements with proper logging
  3. Make the CSV structure parsing more robust (currently assumes function name is in second column)
-import argparse
-import os
-import wren_core
+import argparse
+import logging
+import os
+import wren_core
+from pathlib import Path
+
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)

# Later in code:
-print("Default Function count: ", len(function_names))
-print("Function is already in the function list:")
+logger.info(f"Default Function count: {len(function_names)}")
+logger.info("Function is already in the function list:")

# Add error handling:
+try:
    with open(args.path, "r") as file:
        # ...
+except FileNotFoundError:
+    logger.error(f"File not found: {args.path}")
+    exit(1)
+except PermissionError:
+    logger.error(f"Permission denied: {args.path}")
+    exit(1)
🧰 Tools
🪛 Ruff (0.8.2)

1-1: File ibis-server/tools/remote_function_check.py is part of an implicit namespace package. Add an __init__.py.

(INP001)


23-23: print found

(T201)


24-24: print found

(T201)


27-27: Unnecessary open mode parameters

Remove open mode parameters

(UP015)


42-42: print found

(T201)


27-27: Remove unnecessary file mode parameter

The default mode for open() is already 'r' (read), so specifying it is redundant.

-with open(args.path, "r") as file:  
+with open(args.path) as file:  
🧰 Tools
🪛 Ruff (0.8.2)

27-27: Unnecessary open mode parameters

Remove open mode parameters

(UP015)


1-6: Add __init__.py to avoid implicit namespace package warning

Static analysis detected that this file is part of an implicit namespace package.

Create an empty __init__.py file in the tools directory to resolve this warning.

🧰 Tools
🪛 Ruff (0.8.2)

1-1: File ibis-server/tools/remote_function_check.py is part of an implicit namespace package. Add an __init__.py.

(INP001)

ibis-server/tests/routers/v3/connector/snowflake/test_functions.py (1)

56-65: Database-specific function testing looks good, but consider checking result length

The test now properly checks for Snowflake-specific function "is_null_value" instead of generic "abs", which aligns with the PR goal to improve DataSource-specific function handling.

Consider re-enabling the count assertion to verify the total number of functions:

-    # assert len(result) == DATAFUSION_FUNCTION_COUNT + 60
+    assert len(result) == DATAFUSION_FUNCTION_COUNT + 60  # Adjust number if needed
wren-core-py/src/remote_functions.rs (4)

66-75: Semantic clarity on variable naming.
Within .and_then(|types| ...), you're processing parameter names but calling the variable types. For clarity, consider updating the variable name. Otherwise, the logic to flatten and filter out empty strings is sound.

 .map(|names| {
     names
         .into_iter()
         .flatten()
         .collect::<Vec<String>>()
         .join(",")
 })
-.and_then(|types| if types.is_empty() { None } else { Some(types) });
+.and_then(|joined_names| if joined_names.is_empty() { None } else { Some(joined_names) });
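The Rust chain above (flatten the optional names, join with commas, map an empty result to None) translates directly into a small Python equivalent; this is an illustration of the discussed logic, not code from the PR.

```python
# Python rendering of the flatten/join/empty-check chain discussed above
# for PyRemoteFunction param names: drop missing entries, join with ',',
# and map an empty result to None (mirroring the `.and_then(...)` step).
def join_param_names(names):
    joined = ",".join(n for n in (names or []) if n)
    return joined or None

full = join_param_names(["x", None, "y"])
empty = join_param_names([None, ""])
missing = join_param_names(None)
```

The `joined or None` step is the moral equivalent of the `.and_then(...)` that converts an empty joined string into `None`.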

86-90: Empty string handling for return_type.
Converting the empty string into None is a neat solution. If whitespaces are possible in the CSV, you may want to trim them first, but that’s optional.


109-116: Potential whitespace trimming.
When building Vec<Option<String>> for param names, consider trimming each piece to avoid storing accidental whitespace.

.map(|name| {
     let trimmed = name.trim();
     if trimmed.is_empty() {
         None
     } else {
         Some(trimmed.to_string())
     }
})

121-128: Param type parsing consistency.
This is the same approach used for param names. Trimming may also be beneficial here for cleanliness.

wren-core-py/src/context.rs (2)

62-62: Unwrap risk in default.
Using Runtime::new().unwrap() can panic on failure. Consider .expect(...) or error handling if the runtime fails to initialize.


238-305: Async function “get_regietered_functions” with potential typo.
The method’s name is misspelled (“regietered”). Consider renaming to get_registered_functions for clarity. Otherwise, the batch parsing logic is solid.

-async fn get_regietered_functions(
+async fn get_registered_functions(
wren-core/core/src/mdl/function.rs (5)

29-55: Handling fallback signatures on unrecognized parameter types.

The new get_signature method and transform_param_type function effectively parse and apply exact signatures when param_types is fully known. However, note that if any parameter type parsing fails, the entire signature falls back to [Nullary, VariadicAny]. Consider logging or otherwise flagging partial mismatches, so users can diagnose unexpected fallback signatures.
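The all-or-nothing fallback described above can be made concrete with a short sketch. The `KNOWN_TYPES` set and the tuple return shape are illustrative assumptions standing in for the real type mapper and DataFusion signature types.

```python
# Sketch of the fallback behaviour described above: an Exact signature is
# built only when every parameter type parses; any single failure makes
# the whole signature fall back to the permissive [Nullary, VariadicAny].
# KNOWN_TYPES stands in for the real type mapper in wren-core.
KNOWN_TYPES = {"int32", "utf8", "float64", "binary"}

def get_signature(param_types):
    parsed = []
    for t in param_types or []:
        if not t or t.lower() not in KNOWN_TYPES:
            return ("fallback", ["Nullary", "VariadicAny"])  # poisons all
        parsed.append(t.lower())
    return ("exact", parsed)

exact = get_signature(["Int32", "Utf8"])
fallback = get_signature(["Int32", "geometry"])  # one unknown type
```

As the comment suggests, logging at the point where the fallback branch fires would make unexpected permissive signatures much easier to diagnose.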


89-151: Introducing the ReturnType enum and related logic.

This addition elegantly handles specialized return types, skipping complex nested array logic for now. For future extensibility, consider whether deeper nesting or advanced type resolution might be needed (e.g., arrays of arrays). Otherwise, this approach is concise and covers common cases.


178-190: Conversion with potential panic on invalid return type.

impl From<RemoteFunction> for ByPassScalarUDF panics if ReturnType::from_str fails, aligning with the chosen "panic on invalid input" strategy. If more graceful handling is desired, consider returning a Result<Self> instead to propagate errors.


261-269: Consistent panic on invalid return type for Aggregate UDF.

As with the scalar variant, the logic is clear but might be improved by propagating errors if dynamic input is expected.


466-640: Extensive coverage of remote-to-bypass function testing.

These tests thoroughly validate various parameter name/type scenarios and return-type enum variants. To further enhance coverage, consider adding negative tests that confirm how errors/panics are handled upon invalid input (e.g., unrecognized data types).

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a70c850 and a965395.

⛔ Files ignored due to path filters (11)
  • ibis-server/resources/function_list/bigquery.csv is excluded by !**/*.csv
  • ibis-server/resources/function_list/canner.csv is excluded by !**/*.csv
  • ibis-server/resources/function_list/clickhouse.csv is excluded by !**/*.csv
  • ibis-server/resources/function_list/mssql.csv is excluded by !**/*.csv
  • ibis-server/resources/function_list/mysql.csv is excluded by !**/*.csv
  • ibis-server/resources/function_list/postgres.csv is excluded by !**/*.csv
  • ibis-server/resources/function_list/snowflake.csv is excluded by !**/*.csv
  • ibis-server/resources/function_list/trino.csv is excluded by !**/*.csv
  • wren-core-py/Cargo.lock is excluded by !**/*.lock
  • wren-core-py/poetry.lock is excluded by !**/*.lock
  • wren-core-py/tests/functions.csv is excluded by !**/*.csv
📒 Files selected for processing (20)
  • ibis-server/tests/conftest.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/bigquery/test_functions.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/canner/test_functions.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/clickhouse/test_functions.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/mssql/test_functions.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/mysql/test_functions.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/postgres/test_functions.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/snowflake/test_functions.py (1 hunks)
  • ibis-server/tests/routers/v3/connector/trino/test_functions.py (1 hunks)
  • ibis-server/tools/remote_function_check.py (1 hunks)
  • wren-core-py/src/context.rs (7 hunks)
  • wren-core-py/src/errors.rs (1 hunks)
  • wren-core-py/src/remote_functions.rs (3 hunks)
  • wren-core-py/tests/test_modeling_core.py (2 hunks)
  • wren-core/core/src/lib.rs (1 hunks)
  • wren-core/core/src/logical_plan/analyze/plan.rs (5 hunks)
  • wren-core/core/src/logical_plan/utils.rs (9 hunks)
  • wren-core/core/src/mdl/function.rs (7 hunks)
  • wren-core/core/src/mdl/mod.rs (4 hunks)
  • wren-core/core/src/mdl/utils.rs (3 hunks)
🧰 Additional context used
🧬 Code Definitions (6)
ibis-server/tests/routers/v3/connector/snowflake/test_functions.py (9)
ibis-server/tests/routers/v3/connector/canner/test_functions.py (1)
  • set_remote_function_list_path (36-40)
ibis-server/tests/routers/v3/connector/mssql/test_functions.py (1)
  • set_remote_function_list_path (36-40)
ibis-server/tests/routers/v3/connector/mysql/test_functions.py (1)
  • set_remote_function_list_path (37-41)
ibis-server/tests/routers/v3/connector/postgres/test_functions.py (1)
  • set_remote_function_list_path (38-42)
ibis-server/tests/routers/v3/connector/clickhouse/test_functions.py (1)
  • set_remote_function_list_path (36-40)
ibis-server/tests/routers/v3/connector/trino/test_functions.py (1)
  • set_remote_function_list_path (36-40)
ibis-server/tests/routers/v3/connector/bigquery/conftest.py (1)
  • set_remote_function_list_path (33-37)
ibis-server/app/config.py (1)
  • set_remote_function_list_path (70-71)
ibis-server/tests/conftest.py (1)
  • client (18-23)
wren-core/core/src/logical_plan/analyze/plan.rs (1)
wren-core/core/src/logical_plan/utils.rs (1)
  • try_map_data_type (97-102)
wren-core-py/src/context.rs (3)
wren-core/core/src/mdl/function.rs (14)
  • name (216-218)
  • name (279-281)
  • name (342-344)
  • new (165-175)
  • new (248-258)
  • new (311-321)
  • from (179-188)
  • from (262-271)
  • from (325-334)
  • param_types (46-49)
  • return_type (224-226)
  • return_type (287-289)
  • from_str (79-86)
  • from_str (119-129)
wren-core/core/src/mdl/mod.rs (3)
  • new (117-172)
  • new (421-423)
  • register_remote_function (383-408)
wren-core-py/src/remote_functions.rs (2)
  • from (65-99)
  • from (103-139)
wren-core-py/tests/test_modeling_core.py (2)
wren-core-py/src/context.rs (3)
  • transform_sql (139-148)
  • functions (93-96)
  • get_available_functions (151-160)
ibis-server/app/routers/v3/connector.py (1)
  • functions (104-115)
wren-core/core/src/mdl/mod.rs (5)
wren-core/core/src/logical_plan/utils.rs (1)
  • try_map_data_type (97-102)
wren-core/core/src/mdl/context.rs (1)
  • new (228-231)
wren-core/core/src/mdl/function.rs (3)
  • new (165-175)
  • new (248-258)
  • new (311-321)
wren-core/core/src/logical_plan/analyze/model_anlayze.rs (2)
  • new (67-75)
  • new (788-796)
wren-core/core/src/mdl/lineage.rs (1)
  • new (26-37)
wren-core/core/src/mdl/function.rs (1)
wren-core/core/src/logical_plan/utils.rs (1)
  • map_data_type (104-171)
🪛 Ruff (0.8.2)
ibis-server/tools/remote_function_check.py

1-1: File ibis-server/tools/remote_function_check.py is part of an implicit namespace package. Add an __init__.py.

(INP001)


23-23: print found

(T201)


24-24: print found

(T201)


27-27: Unnecessary open mode parameters

Remove open mode parameters

(UP015)


42-42: print found

(T201)

🔇 Additional comments (67)
ibis-server/tests/conftest.py (1)

14-14:

❓ Verification inconclusive

Significant increase in DATAFUSION_FUNCTION_COUNT constant

The constant has been increased from 273 to 25941, which is a substantial change. This appears to align with retrieving function information from INFORMATION_SCHEMA as mentioned in the PR description, rather than manually defining functions.

Please verify that this dramatic increase is expected and won't negatively impact test performance:


🏁 Script executed:

#!/bin/bash
# Check if test performance is affected by the large function count
time=$(python -c "
import time
start = time.time()
import sys
sys.path.append('ibis-server')
from tests.conftest import DATAFUSION_FUNCTION_COUNT
print(f'Loading {DATAFUSION_FUNCTION_COUNT} functions took {time.time() - start:.2f} seconds')
")
echo "$time"

Length of output: 521


Review Verification: Manual Check Required for Dependency Issue

The test performance verification did not complete as expected because the environment is missing the asgi_lifespan module (error: ModuleNotFoundError: No module named 'asgi_lifespan'). The dramatic increase in DATAFUSION_FUNCTION_COUNT from 273 to 25941 appears intentional—aligning with the new approach of retrieving function data from INFORMATION_SCHEMA—but its impact on test performance remains unverified.

  • Action Items:
    • Ensure that the required asgi_lifespan module is installed or properly configured for the test environment.
    • Manually verify that loading 25,941 functions does not degrade test performance when the dependencies are correctly set up.
wren-core-py/src/errors.rs (1)

4-4: Reintroducing ParseIntError import

The import for ParseIntError has been added back, which is correct since the code already has an implementation for From<ParseIntError> at lines 64-68. This ensures the error handling for integer parsing works properly.

ibis-server/tests/routers/v3/connector/canner/test_functions.py (1)

56-65:

✅ Verification successful

Updated function testing and reduced count

The test now:

  1. Expects 28 additional functions (down from 30)
  2. Tests the "to_base64" function instead of "abs"
  3. Has updated the expected function description and return type

These changes align with the PR's goal of eliminating duplicate remote functions for each DataSource.

Let's verify if other test files have been consistently updated with the new function pattern:


🏁 Script executed:

#!/bin/bash
# Check if all connector test files have been updated to test "to_base64" instead of "abs"
echo "Checking for consistent function test patterns across connector tests..."

# Look for test files that might still be using "abs" function
grep -r "x\[\"name\"\] == \"abs\"" --include="test_*.py" ibis-server/tests/routers/v3/connector/

# Check other connectors for consistent handling of to_base64
echo -e "\nChecking to_base64 implementation across connector test files:"
grep -r "x\[\"name\"\] == \"to_base64\"" --include="test_*.py" ibis-server/tests/routers/v3/connector/

Length of output: 762


Consistent test updates verified across connectors

The updated test in ibis-server/tests/routers/v3/connector/canner/test_functions.py now correctly expects 28 additional functions, tests the "to_base64" function (with the updated description and return type), and is consistent with similar changes in other connectors (e.g., in the Trino tests). Shell script results confirmed that there are no remaining tests still using the "abs" function and that the new "to_base64" pattern is applied uniformly.

ibis-server/tests/routers/v3/connector/mssql/test_functions.py (1)

57-65: Function test now uses database-specific 'sysdatetime' instead of generic 'floor'

The test now validates a MSSQL-specific function with appropriate type information, aligning with the PR's goal to improve function information retrieval and type signatures for each DataSource.

ibis-server/tests/routers/v3/connector/mysql/test_functions.py (1)

58-66: Database-specific function test updated correctly

The test now validates MySQL-specific function "lcase" with appropriate type information (Utf8 instead of int), which aligns with the PR goal of improving function registration and type signatures for each DataSource.

wren-core/core/src/mdl/mod.rs (4)

1-1: Import of try_map_data_type is consistent with the implementation changes.

The import of try_map_data_type from logical_plan::utils aligns with the function's usage throughout the file. This change is part of the refactoring effort to improve error handling for data type mapping.


222-222: Enhanced error handling with try_map_data_type.

Replacing map_data_type with try_map_data_type improves robustness by providing a fallback to Utf8 type when parsing fails, as shown in the implementation from the relevant code snippets.


391-392: Consistent application of try_map_data_type across all function types.

The change consistently applies try_map_data_type for all three function types (Scalar, Aggregate, Window), ensuring uniform error handling behavior across different function registrations.

Also applies to: 397-398, 403-404


1310-1311: Improved error message specificity for struct field access.

The error message has been updated to provide more detailed information about type mismatches when accessing struct fields. The new message explicitly mentions that the operation is only valid for Struct, Map, or Null types, helping developers identify the issue more quickly.

ibis-server/tests/routers/v3/connector/trino/test_functions.py (1)

57-65: Test updated to verify to_base64 function instead of array_distinct.

The test has been updated to verify the "to_base64" function instead of "array_distinct", with updated type signatures that match the correct specification. This aligns with the PR objective to improve type signatures and ensure consistency in function definitions.

The parameter type changed from "array" to "Binary" and return type from "array" to "Utf8", which is the correct signature for a base64 encoding function.
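For intuition, the corrected Binary-to-Utf8 shape matches how standard base64 encoding behaves in general: bytes in, text out. A quick stdlib illustration (not Trino's implementation):

```python
# Standard base64 mirrors the reviewed signature: bytes (Binary) in,
# ASCII text (Utf8) out.
import base64

encoded = base64.b64encode(b"wren").decode("ascii")
```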

ibis-server/tests/routers/v3/connector/postgres/test_functions.py (1)

65-67: Updated extract function type signatures to use precise Arrow types.

The parameter and return types for the "extract" function have been updated to use more precise Arrow type definitions:

  • Parameter types changed from "text,timestamp" to "Utf8,Timestamp(Nanosecond, None)"
  • Return type changed from "numeric" to "Decimal128(38, 10)"

These changes reflect a move towards more accurate type representations that match the Arrow data model, which is essential for proper type handling across systems.

ibis-server/tests/routers/v3/connector/clickhouse/test_functions.py (1)

57-65: Test updated to verify uniq aggregate function instead of abs scalar function.

The test case has been modified to check for the "uniq" aggregate function instead of the "abs" scalar function, with appropriate updates to the function properties:

  1. Function type changed from "scalar" to "aggregate"
  2. Description updated to correctly describe the HyperLogLog-based cardinality estimation
  3. Parameter types and return types set to None, which is appropriate for this function in ClickHouse

This change better represents the ClickHouse-specific functions in the test suite and aligns with the PR objective to improve function definitions across different data sources.

ibis-server/tests/routers/v3/connector/bigquery/test_functions.py (2)

48-55: Enhanced test precision for string_agg function

The test now properly filters for the specific string_agg function with LargeUtf8,LargeUtf8 parameter types, which makes the test more precise and less prone to errors if multiple variants of this function exist.


57-63: Updated string_agg function attributes to match actual implementation

The function metadata has been updated with a more descriptive explanation, correct parameter names, and accurate type information that aligns with the actual BigQuery implementation.

wren-core/core/src/logical_plan/utils.rs (5)

95-102: Improved error handling with fallback for unsupported data types

The new try_map_data_type function gracefully handles type mapping failures by providing a fallback to DataType::Utf8 when a data type can't be parsed. This prevents errors from propagating and makes the system more robust when dealing with unknown or unsupported types.
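A minimal Python analog of this fallback behavior (the actual implementation is Rust; the type table here is an assumption for illustration):

```python
def try_map_data_type(type_name, known_types):
    """Map a source type name to an Arrow type name, falling back to Utf8."""
    mapped = known_types.get(type_name.lower())
    if mapped is None:
        # Unknown or unsupported type: degrade to Utf8 instead of raising.
        return "Utf8"
    return mapped

KNOWN_TYPES = {
    "int4": "Int32",
    "text": "Utf8",
    "numeric": "Decimal128(38, 10)",
}

print(try_map_data_type("INT4", KNOWN_TYPES))   # Int32
print(try_map_data_type("jsonb", KNOWN_TYPES))  # Utf8 (fallback)
```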


106-107: Better variable naming for improved clarity

Introducing lower_data_type for the lowercase version of the data type string improves code readability by making it clear which value is being used in subsequent operations.


147-148: Added support for XML and JSONB data types

Support for additional data types enhances compatibility with various database systems. The appropriate fallback types are chosen for data types not natively supported by Arrow.


165-168: Fixed case sensitivity in the DataType parsing fallback

The code now passes the original data type string to DataType::from_str() instead of the lowercased version, which could fail to parse case-sensitive type names.


437-457: Updated test cases to use try_map_data_type

Test cases have been appropriately updated to use the new error-handling function while maintaining the same assertions, ensuring that the existing functionality continues to work correctly.

wren-core/core/src/mdl/utils.rs (2)

215-216: Consistent adoption of error-handled type mapping

Replacing map_data_type with try_map_data_type in the field creation function improves error handling for column type mapping, making the system more resilient when processing columns with unsupported data types.


233-240: Enhanced robustness in remote field handling

The remote field creation now uses the error-handled type mapping function, reducing the risk of failures when processing data types from remote data sources that might have type systems incompatible with Arrow.

wren-core/core/src/logical_plan/analyze/plan.rs (6)

23-23: Updated import to use the error-handling version of map_data_type

The import has been updated to use try_map_data_type instead of map_data_type, aligning with the refactoring effort to improve error handling throughout the codebase.


230-233: Improved error handling in model plan node building

Using try_map_data_type in the field creation process for the model plan node ensures that the system can handle unexpected data types gracefully without failing the entire plan construction.


767-770: Enhanced robustness in model source node field creation

The model source node now uses the error-handled type mapping function when creating fields, ensuring that wildcard selections can handle columns with unsupported data types.


807-810: Consistent error handling across field creation

Using try_map_data_type consistently in all model source node field creation paths ensures uniform behavior when handling data types in both wildcard and explicit column scenarios.


901-903: Improved calculation plan node field creation

The calculation plan node now uses error-handled type mapping for the calculation column's field creation, preventing failures when dealing with complex calculated expressions that might result in unsupported data types.


906-908: Consistent error handling for primary key fields

Using try_map_data_type for primary key field creation in calculation plan nodes ensures that the join operations will work correctly even when primary keys have unusual or complex data types.

wren-core-py/src/remote_functions.rs (5)

22-22: Non-issue import addition.
Adding std::fmt::Display is appropriate for implementing the Display trait below.


27-27: Extended trait derivation.
Deriving Eq, PartialEq, Hash, and Debug on PyRemoteFunction is useful for hashing, set membership checks, and debugging.


77-85: Consistent parameter type flattening.
This mirrors the parameter names logic and is equally valid. No concerns.
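Flattening an Option<Vec<Option<String>>> into a comma-separated string can be sketched in Python terms (with None standing in for Rust's None at both levels; this is a sketch of the idea, not the crate's code):

```python
def flatten_types(param_types):
    """Join optional type names with commas; missing entries become empty strings."""
    if param_types is None:
        return None
    return ",".join(t if t is not None else "" for t in param_types)

print(flatten_types(["Int32", None, "Utf8"]))  # Int32,,Utf8
print(flatten_types(None))                     # None
```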


94-94: Properly using the transformed return_type.
Aligns with the new handling for empty versus non-empty strings.


141-155: Useful Display implementation.
This user-friendly string representation will help with logging and debugging. No issues.

wren-core-py/tests/test_modeling_core.py (10)

97-101: Updated function call for two parameters.
Using both c_custkey arguments confirms the function signature. The rewritten SQL assertion is correct.


109-109: Large function list size.
Raising the count to 25948 might affect performance. Verify this aligns with new function entries and ensure no unintentional duplicates.
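The duplicate check performed by the PR's new tool (ibis-server/tools/remote_function_check.py) could work roughly like this; the column names are assumptions, not the tool's actual schema:

```python
from collections import Counter

def find_duplicate_functions(rows):
    """Return (function_type, name, param_types) keys appearing more than once."""
    keys = [(r["function_type"], r["name"], r.get("param_types") or "")
            for r in rows]
    return [key for key, count in Counter(keys).items() if count > 1]

rows = [
    {"function_type": "scalar", "name": "abs", "param_types": "Int32"},
    {"function_type": "scalar", "name": "abs", "param_types": "Int32"},
    {"function_type": "aggregate", "name": "sum", "param_types": "Int64"},
]
print(find_duplicate_functions(rows))  # [('scalar', 'abs', 'Int32')]
```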


112-116: Consistent transformation verification.
Ensures the generated SQL handles both parameters correctly. LGTM.


121-121: Additional function count check.
25941 closely matches the previous large count; confirm correctness so no duplication or missing entries exist.


131-131: Specifying return type as Int32.
Reflects refined data type handling. The test matches recent changes.


133-133: Correct parameter types.
“Int32,Int32” is consistent with the new approach.


141-148: Added scalar function “add_custom”.
Parameters are listed in CSV but param_names is intentionally None. This is fine if it’s expected.


149-156: Validating array-based function.
A function returning None for both param and return types suggests it’s untyped or determined at runtime. No immediate concerns.


157-164: Testing function with no param names and no return type.
This aligns with the possibility of dynamic type resolution in the system.


165-175: Function missing param type in CSV.
The test clarifies how a missing param type yields None. Implementation is logically consistent with your data flow.

wren-core-py/src/context.rs (12)

25-25: Added import for FromStr.
Enables parsing of string-based function types. Straightforward addition.


27-28: New imports for vec and Runtime.
These support your asynchronous patterns. No concerns.


31-31: Importing GenericStringType.
Likely needed for reading string columns from DataFusion. Looks good.


38-41: Expanded imports from wren_core.
Brings in mdl, AggregateUDF, etc. to manage user-defined functions. Safe changes.


48-48: Storing the runtime in the session context.
This design allows async operations. Straightforward.


86-87: Information Schema enabled.
Allows queries against information_schema for function metadata, which matches your function retrieval logic.


88-88: Runtime creation with error handling.
map_err(CoreError::from)? is good for safer initialization.


90-99: Asynchronous retrieval of registered functions.
Good approach to check for duplicates before re-registering. Minimizes overhead.
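The skip-if-already-registered pattern can be sketched as follows; the context object here is a stand-in, not DataFusion's API:

```python
class FakeContext:
    """Stand-in for a session context that tracks registered function names."""
    def __init__(self):
        self.registered = set()

    def register_udf(self, name):
        self.registered.add(name.lower())

def register_remote_functions(ctx, remote_function_names):
    """Register only functions the context does not already provide."""
    added = []
    for name in remote_function_names:
        if name.lower() in ctx.registered:
            continue  # a built-in already covers this name; avoid a duplicate
        ctx.register_udf(name)
        added.append(name)
    return added

ctx = FakeContext()
ctx.register_udf("abs")
print(register_remote_functions(ctx, ["abs", "to_base64"]))  # ['to_base64']
```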


115-115: Consistent runtime usage.
Keeps the runtime in sync when an MDL is provided. No issues.


140-146: Async transform flow.
block_on(mdl::transform_sql_with_ctx(...)) is consistent with the new runtime-based context.


152-159: Async function list retrieval.
Modeling functions as PyRemoteFunction after awaiting the DataFusion queries is orderly.


203-216: Registering remote functions by function type.
Good approach to unify the code path for scalar, aggregate, and window UDFs.

wren-core/core/src/mdl/function.rs (13)

3-3: Consolidate and standardize new imports.

These import statements for internal_err, not_impl_err, and items from datafusion::logical_expr as well as the map_data_type utility appear consistent and purposeful. They unify error handling and provide necessary functions for the new return-type logic.

Also applies to: 8-10, 17-18


24-25: Param fields allow for partial undefined parameters.

Expanding param_names and param_types to Option<Vec<Option<String>>> provides flexibility for scenarios where certain parameter names or types may remain unspecified. This design choice is coherent, but ensure that downstream logic (like doc generation and signature construction) gracefully handles these nested None values.


57-57: Added PartialEq to derive macros.

Adding PartialEq for the enum or struct is beneficial for test assertions and comparisons. No concerns here.


66-66: FunctionType display implementation unchanged.

The impl Display for FunctionType remains straightforward, mirroring the enum variants. No issues noted.


80-80: String matching in from_str aligns with the existing pattern.

The match s.to_lowercase().as_str() approach is consistent across similar code. Approved.


159-159: Embedding ReturnType and doc:

Storing ReturnType and doc directly in ByPassScalarUDF is consistent with the broader refactor design. The approach is clear and maintainable.

Also applies to: 161-161


191-209: Building documentation from partially known parameters.

Using param_types strings as argument descriptions and defaulting to an empty string for missing elements ensures consistent doc generation. This looks good, and the logic handles None gracefully.
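Pairing partially known parameter names with type strings while filling the gaps with defaults might look like this sketch (the default naming scheme is an assumption):

```python
def build_doc_arguments(param_names, param_types):
    """Pair parameter names with type strings, filling gaps with defaults."""
    names = param_names or []
    types = param_types or []
    count = max(len(names), len(types))
    args = []
    for i in range(count):
        name = names[i] if i < len(names) and names[i] is not None else f"arg_{i}"
        type_str = types[i] if i < len(types) and types[i] is not None else ""
        args.append((name, type_str))
    return args

print(build_doc_arguments(["x", None], ["Int32", "Utf8", None]))
# [('x', 'Int32'), ('arg_1', 'Utf8'), ('arg_2', '')]
```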


224-225: ByPassScalarUDF return type forwarding.

The return_type method simply defers to self.return_type.to_data_type(arg_types), which is coherent with the new ReturnType abstraction.


232-234: Documentation forwarding.

Returning self.doc.as_ref() is straightforward. No concerns here.


242-242: Embedding ReturnType and doc in ByPassAggregateUDF.

Adopting the same design pattern for aggregates ensures consistency across UDF implementations.

Also applies to: 244-244


287-288: Aggregate UDF return type logic is straightforward.

The code properly delegates to ReturnType::to_data_type(arg_types).


295-297: Aggregate UDF documentation retrieval.

Again, a simple reference to self.doc, consistent with previous patterns.


357-360: Window function field logic.

The field method uses to_data_type(field_args.input_types()) to derive the return type, which is aligned with the rest of the code.

douenergy merged commit 52f9163 into Canner:main on Mar 31, 2025
15 of 16 checks passed