Skip to content

feat(ibis): introduce minio connector #1048

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Feb 3, 2025
Merged

Conversation

goldmedal
Copy link
Contributor

@goldmedal goldmedal commented Feb 3, 2025

Description

  • Introduce the minio File connector.
  • The usage is the same as the local and s3 file connector but the connection info differs.

URL

/v2/connector/minio_file
/v3/connector/minio_file

Connection Info

{
        "url": "/tpch/data",
        "format": "parquet",
        "bucket": "s3-bucket-name",
        "endpoint": "localhost:32821",
        "access_key": "minioadmin",
        "secret_key": "minioadmin"
}
  • url: The root path of the dataset. (It doesn't include the bucket name)
  • format: The specific file format.
  • bucket: The bucket name.
  • endpoint: The endpoint of minio.
  • access_key: The account of minio
  • secret_kety: The password of minio

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features

    • Extended data source support to include Minio file connections.
    • Introduced mapping and metadata handling for Minio storage, enhancing querying and connection functionalities.
  • Tests

    • Added comprehensive asynchronous tests to verify Minio connectivity, metadata retrieval, and proper error handling.
  • Chores

    • Updated development dependencies and test markers to support Minio integration in the testing framework.

Copy link

coderabbitai bot commented Feb 3, 2025

Walkthrough

The changes introduce support for Minio as a file connection type across the application. Modifications include updates to the rewriter to recognize Minio file sources, new DTO and connection info classes for Minio, enhancements in the connector and metadata logic to handle Minio-specific operations, and a new utility function for initializing DuckDB with Minio settings. Additionally, dependency and testing configurations were updated to include Minio, along with a comprehensive suite of tests validating the new functionality.

Changes

File(s) Change Summary
ibis-server/app/mdl/rewriter.py Modified _get_write_dialect to include DataSource.minio_file returning "duckdb".
ibis-server/app/model/__init__.py Added QueryMinioFileDTO and MinioFileConnectionInfo; updated ConnectionInfo type.
ibis-server/app/model/connector.py Enhanced DuckDBConnector to support MinioFileConnectionInfo with specific error handling using IOException.
ibis-server/app/model/data_source.py Added minio_file entry to DataSource and mapped it to QueryMinioFileDTO in DataSourceExtension.
ibis-server/app/model/metadata/factory.py Updated factory mapping to include DataSource.minio_file and added MinioFileMetadata class with methods for connection, versioning, operator initialization, and path formatting.
ibis-server/app/model/metadata/object_storage.py Introduced MinioFileMetadata class and associated methods for Minio-specific metadata handling.
ibis-server/app/model/utils.py Introduced init_duckdb_minio for initializing a DuckDB connection using Minio settings.
ibis-server/pyproject.toml Added minio to test dependencies and introduced a new pytest marker minio_file.
ibis-server/tests/routers/v2/connector/test_minio_file.py Created comprehensive tests for querying, dry runs, metadata retrieval, error handling, and listing files using the Minio connector.

Suggested labels

ci

Suggested reviewers

  • onlyjackfrost
  • wwwy3y3

Poem

Oh, what a joyful sight,
A rabbit hops with coding might,
Minio joins the file brigade,
Through tests and queries, all well-laid,
With carrots 🥕 and hops so free,
Our code now sings in harmony!
Hop on, let’s celebrate this spree!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 52f00d3 and 9297efa.

⛔ Files ignored due to path filters (1)
  • ibis-server/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • ibis-server/pyproject.toml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • ibis-server/pyproject.toml
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@github-actions github-actions bot added ibis dependencies Pull requests that update a dependency file python Pull requests that update Python code labels Feb 3, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (6)
ibis-server/app/model/utils.py (1)

33-33: Consider making the region configurable.

The region is hardcoded to 'ap-northeast-1'. While this might work for Minio, consider making it configurable through MinioFileConnectionInfo for better flexibility.

-        REGION 'ap-northeast-1'
+        REGION '{connection_info.region.get_secret_value() if hasattr(connection_info, "region") else "ap-northeast-1"}'
ibis-server/app/model/__init__.py (1)

63-64: LGTM! Well-structured Minio connection classes.

The implementation follows the established patterns and properly handles Minio-specific requirements.

Consider adding validation for the endpoint format to ensure it's a valid host:port combination. Example:

from pydantic import validator
import re

class MinioFileConnectionInfo(BaseModel):
    # ... existing fields ...
    
    @validator('endpoint')
    def validate_endpoint(cls, v):
        if not re.match(r'^[a-zA-Z0-9.-]+(?::\d+)?$', v.get_secret_value()):
            raise ValueError('endpoint must be in format host[:port]')
        return v

Also applies to: 169-181, 194-194

ibis-server/app/model/connector.py (1)

174-175: Consider consolidating error handling for IOException.

The same error handling for IOException is duplicated in both query and dry_run methods.

Consider extracting the error handling into a helper method:

+def _handle_io_exception(e: IOException, is_dry_run: bool) -> None:
+    error_class = QueryDryRunError if is_dry_run else UnprocessableEntityError
+    raise error_class(f"Failed to execute query: {e!s}")
+
 def query(self, sql: str, limit: int) -> pd.DataFrame:
     try:
         return self.connection.execute(sql).fetch_df().head(limit)
     except IOException as e:
-        raise UnprocessableEntityError(f"Failed to execute query: {e!s}")
+        self._handle_io_exception(e, is_dry_run=False)
     except HTTPException as e:
         raise UnprocessableEntityError(f"Failed to execute query: {e!s}")

 def dry_run(self, sql: str) -> None:
     try:
         self.connection.execute(sql)
     except IOException as e:
-        raise QueryDryRunError(f"Failed to execute query: {e!s}")
+        self._handle_io_exception(e, is_dry_run=True)
     except HTTPException as e:
         raise QueryDryRunError(f"Failed to execute query: {e!s}")

Also applies to: 182-183

ibis-server/app/model/metadata/object_storage.py (1)

223-227: Consider using URL construction utility.

Manual string concatenation for URLs can be error-prone.

Consider using urllib.parse:

+from urllib.parse import urlunparse
+
 def _get_dal_operator(self):
     info: MinioFileConnectionInfo = self.connection_info
-    if info.ssl_enabled:
-        endpoint = f"https://{info.endpoint.get_secret_value()}"
-    else:
-        endpoint = f"http://{info.endpoint.get_secret_value()}"
+    scheme = "https" if info.ssl_enabled else "http"
+    endpoint = urlunparse((scheme, info.endpoint.get_secret_value(), "", "", "", ""))
ibis-server/tests/routers/v2/connector/test_minio_file.py (2)

89-89: Consider using a variable for Minio version.

The Minio version is hardcoded in the image tag.

Consider extracting it to a constant:

+MINIO_VERSION = "RELEASE.2024-12-18T13-15-44Z"
+
 @pytest.fixture(scope="module")
 def minio(request) -> MinioContainer:
     logger.info("Starting Minio container")
-    minio = MinioContainer(image="minio/minio:RELEASE.2024-12-18T13-15-44Z")
+    minio = MinioContainer(image=f"minio/minio:{MINIO_VERSION}")

93-149: Consider using a helper function for file uploads.

The file upload logic is repetitive.

Consider extracting it to a helper function:

+def _upload_test_files(client, bucket: str) -> None:
+    test_files = [
+        ("tpch/data/orders.parquet", "tests/resource/tpch/data/orders.parquet"),
+        ("tpch/data/customer.parquet", "tests/resource/tpch/data/customer.parquet"),
+        # ... add other files
+    ]
+    for bucket_path, local_path in test_files:
+        client.fput_object(bucket, bucket_path, local_path)
+
 @pytest.fixture(scope="module")
 def minio(request) -> MinioContainer:
     logger.info("Starting Minio container")
     minio = MinioContainer(image="minio/minio:RELEASE.2024-12-18T13-15-44Z")
     minio.start()
     client = minio.get_client()
     client.make_bucket(bucket)
-    client.fput_object(...)  # Remove all individual upload calls
+    _upload_test_files(client, bucket)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3cc2162 and 52f00d3.

⛔ Files ignored due to path filters (1)
  • ibis-server/poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • ibis-server/app/mdl/rewriter.py (1 hunks)
  • ibis-server/app/model/__init__.py (3 hunks)
  • ibis-server/app/model/connector.py (3 hunks)
  • ibis-server/app/model/data_source.py (3 hunks)
  • ibis-server/app/model/metadata/factory.py (2 hunks)
  • ibis-server/app/model/metadata/object_storage.py (3 hunks)
  • ibis-server/app/model/utils.py (2 hunks)
  • ibis-server/pyproject.toml (2 hunks)
  • ibis-server/tests/routers/v2/connector/test_minio_file.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: ci
🔇 Additional comments (10)
ibis-server/app/model/metadata/factory.py (1)

8-12: LGTM! Clean integration of MinioFileMetadata.

The changes follow the established pattern for object storage metadata integration.

Also applies to: 28-28

ibis-server/app/model/utils.py (1)

25-46: LGTM! Well-structured Minio initialization.

The implementation follows the established pattern of the S3 initialization while properly handling Minio-specific configurations:

  • Secret creation with proper error handling
  • Minio-specific endpoint configuration
  • Path-style URL configuration
  • SSL configuration based on user preference
ibis-server/app/mdl/rewriter.py (1)

75-79: LGTM! Consistent dialect handling for Minio.

The change appropriately groups Minio with other file-based data sources that use the DuckDB dialect.

ibis-server/app/model/connector.py (2)

41-45: LGTM! Clean grouping of file-based data sources.

The grouping of local_file, s3_file, and minio_file data sources is logical and maintainable.


168-169: LGTM! Proper initialization for Minio connections.

The Minio initialization follows the same pattern as S3, maintaining consistency in the codebase.

ibis-server/app/model/data_source.py (1)

50-50: LGTM! Clean integration of Minio as a data source.

The Minio file source is properly integrated into both DataSource and DataSourceExtension enums.

Also applies to: 76-76

ibis-server/app/model/metadata/object_storage.py (1)

207-213: LGTM! Clean implementation of MinioFileMetadata.

The class properly extends ObjectStorageMetadata and implements the required methods.

ibis-server/tests/routers/v2/connector/test_minio_file.py (1)

548-558: LGTM! Clean helper function for connection info.

The _to_connection_info helper function properly encapsulates the connection info creation logic.

ibis-server/pyproject.toml (2)

45-45: MinIO Dependency Inclusion in Dev Dependencies
The addition of "minio" to the testcontainers extras is aligned with the PR objective of introducing MinIO support. This dependency will aid in setting up an appropriate test environment for MinIO-related features. Ensure that the version choice (implicit here) remains compatible with your testcontainers setup.


69-69: Addition of MinIO Marker for Pytest
The new marker "minio_file: mark a test as a minio file test" in the [tool.pytest.ini_options] section is a clear and useful addition. It allows you to easily filter and manage tests that are specific to MinIO functionality, which is valuable for continuous integration and selective test runs.

@goldmedal goldmedal requested a review from wwwy3y3 February 3, 2025 05:12
@goldmedal goldmedal merged commit 9cf19fd into Canner:main Feb 3, 2025
7 checks passed
@goldmedal goldmedal deleted the feat/minio-2 branch February 3, 2025 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file ibis python Pull requests that update Python code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants