feat(source-file): Add custom http proxy support #62451

Open
wants to merge 19 commits into base: master
Changes from 16 commits
19 commits
133f35d
feat(source-file): Add HTTP proxy URL and custom CA certificate support
devin-ai-integration[bot] Jun 11, 2025
5997150
fix: Bump version to 0.6.0 and apply pre-commit formatting fixes
devin-ai-integration[bot] Jun 11, 2025
921a0a3
docs(source-file): Add changelog entry for version 0.6.0
devin-ai-integration[bot] Jun 11, 2025
8c2bc2a
Merge branch 'master' into devin/1749618047-add-http-proxy-support
aaronsteers Jun 13, 2025
35b527a
docs: Add proxy investigation plan and test files for debugging
devin-ai-integration[bot] Jun 28, 2025
2523de8
add working proxy script, do some clean up
aaronsteers Jun 30, 2025
389eadb
refactored implementation
aaronsteers Jun 30, 2025
3d860b7
Delete airbyte-integrations/connectors/source-file/test_direct_config…
aaronsteers Jun 30, 2025
711d7fd
Update airbyte-integrations/connectors/source-file/integration_tests/…
aaronsteers Jun 30, 2025
2c7d860
Apply suggestions from code review
aaronsteers Jun 30, 2025
03256f9
Merge remote-tracking branch 'origin/master' into aj/feat/source-file…
aaronsteers Jun 30, 2025
1f23fba
poetry lock
aaronsteers Jun 30, 2025
ad20ad6
fix
aaronsteers Jun 30, 2025
eec798f
misc updates/fixes
aaronsteers Jun 30, 2025
92f1c39
add comment to clarify random ip
aaronsteers Jun 30, 2025
e0ff9bc
fix(source-file): Update proxy unit tests to match environment variab…
devin-ai-integration[bot] Jun 30, 2025
54dd48b
create as combined certs bundle including built-in system certs
aaronsteers Jul 1, 2025
c26570d
Update airbyte-integrations/connectors/source-file/integration_tests/…
aaronsteers Jul 1, 2025
3b55028
Merge branch 'master' into aj/feat/source-file/add-custom-proxy-support
aaronsteers Jul 4, 2025
@@ -0,0 +1,42 @@
# Copyright (c) 2025 Airbyte, Inc., all rights reserved.
"""A mitmproxy intercept script.

This proves whether the proxy is working by intercepting a specific URL
and modifying the response to return a different CSV.

Usage:
```bash
# First launch the proxy server:
uvx --from=mitmproxy mitmdump --listen-port 8080 -s integration_tests/proxy_intercept_script.py

# If the secrets file doesn't exist, create it and open it in an editor to provide the proxy CA cert:
cp integration_tests/proxy_test_config.json.template secrets/proxy_test_config.json
code secrets/proxy_test_config.json

# Now launch the connector:
poetry run python main.py discover --config secrets/proxy_test_config.json
```
"""

from mitmproxy import http


def response(flow: http.HTTPFlow) -> None:
    """Intercept ALL httpbin requests and return modified CSV data in place of the base64 payload."""
    if "httpbin.org" in flow.request.pretty_host:
        modified_csv = "intercepted_column,proxy_status\nproxy,INTERCEPTED\ntest,SUCCESS\nverification,CONFIRMED"
        flow.response.text = modified_csv
        flow.response.headers["content-type"] = "text/csv"
        flow.response.status_code = 200

        print("🎯 PROXY INTERCEPTED REQUEST!")
        print(f" URL: {flow.request.pretty_url}")
        print(f" Method: {flow.request.method}")
        print(f" User-Agent: {flow.request.headers.get('User-Agent', 'Not set')}")
        print(f" Modified response: {modified_csv}")
        print("=" * 60)


def request(flow: http.HTTPFlow) -> None:
    """Log ALL requests to prove proxy is receiving traffic."""
    print(f"📡 PROXY RECEIVED REQUEST: {flow.request.method} {flow.request.pretty_url}")
    print(f" Headers: {dict(flow.request.headers)}")
@@ -0,0 +1,10 @@
{
  "dataset_name": "proxy_investigation_test",
  "format": "csv",
  "url": "https://httpbin.org/base64/a2V5LHZhbHVlCmZvbyxiYXIKYW5zd2VyLDQyCnF1ZXN0aW9uLHdobyBrbm93cw==",
  "provider": {
    "storage": "HTTPS"
  },
  "http_proxy": {
    "proxy_url": "http://localhost:8080",
    "proxy_ca_certificate": "-----BEGIN CERTIFICATE-----\n...\n-----END CERTIFICATE-----\n"
  }
}
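
A minimal sketch of how the outcome of this test can be read, assuming only the values visible in this diff: the base64 segment of the `url` above and the replacement CSV hard-coded in the intercept script. The helper names `csv_header` and `was_intercepted` are hypothetical and are not part of the PR.

```python
import base64
import csv
import io

# Payload segment copied from the "url" field of the test config.
PAYLOAD_B64 = "a2V5LHZhbHVlCmZvbyxiYXIKYW5zd2VyLDQyCnF1ZXN0aW9uLHdobyBrbm93cw=="
# Replacement body hard-coded in the intercept script.
INTERCEPTED_CSV = "intercepted_column,proxy_status\nproxy,INTERCEPTED\ntest,SUCCESS\nverification,CONFIRMED"


def csv_header(text: str) -> list:
    """Return the first row (the column names) of a CSV document."""
    return next(csv.reader(io.StringIO(text)))


def was_intercepted(body: str) -> bool:
    """Hypothetical helper: True if the body carries the proxy's marker columns."""
    return csv_header(body) == csv_header(INTERCEPTED_CSV)


# What httpbin would return if the request did NOT pass through the proxy.
direct_body = base64.b64decode(PAYLOAD_B64).decode("utf-8")

print(csv_header(direct_body))           # ['key', 'value']
print(was_intercepted(direct_body))      # False
print(was_intercepted(INTERCEPTED_CSV))  # True
```

If `discover` reports the columns `key` and `value`, the request bypassed the proxy; if it reports `intercepted_column` and `proxy_status`, the proxy handled it.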
2 changes: 1 addition & 1 deletion airbyte-integrations/connectors/source-file/metadata.yaml
@@ -10,7 +10,7 @@ data:
   connectorSubtype: file
   connectorType: source
   definitionId: 778daa7c-feaf-4db6-96f3-70fd645acc77
-  dockerImageTag: 0.5.35
+  dockerImageTag: 0.6.0
   dockerRepository: airbyte/source-file
   documentationUrl: https://docs.airbyte.com/integrations/sources/file
   githubIssueLabel: source-file
213 changes: 181 additions & 32 deletions airbyte-integrations/connectors/source-file/poetry.lock

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion airbyte-integrations/connectors/source-file/pyproject.toml
@@ -3,7 +3,7 @@ requires = ["poetry-core>=1.0.0"]
 build-backend = "poetry.core.masonry.api"

 [tool.poetry]
-version = "0.5.35"
+version = "0.6.0"
 name = "source-file"
 description = "Source implementation for File"
 authors = ["Airbyte <[email protected]>"]
@@ -47,6 +47,7 @@ pytest-mock = "^3.6.1"
 pytest = "^8.0.0"
 requests-mock = "^1.9.3"
 pytest-docker = "==3.0.0"
+ruff = "^0.12.1"


 [tool.poe]
@@ -39,6 +39,7 @@
 from airbyte_cdk.models import AirbyteStream, FailureType, SyncMode
 from airbyte_cdk.utils import AirbyteTracedException, is_cloud_environment

+from .proxy import configure_custom_http_proxy
 from .utils import LOCAL_STORAGE_NAME, backoff_handler


@@ -259,7 +260,21 @@ class Client:
     CSV_CHUNK_SIZE = 10_000
     binary_formats = {"excel", "excel_binary", "feather", "parquet", "orc", "pickle"}

-    def __init__(self, dataset_name: str, url: str, provider: dict, format: str = None, reader_options: dict = None):
+    def __init__(
+        self,
+        dataset_name: str,
+        url: str,
+        provider: dict,
+        format: str | None = None,
+        reader_options: dict | None = None,
+        http_proxy: dict | None = None,
+    ):
+        if http_proxy:
+            configure_custom_http_proxy(
+                http_proxy_config=http_proxy,
+                logger=logger,
+            )
+
         self._dataset_name = dataset_name
         self._url = url
         self._provider = provider
121 changes: 121 additions & 0 deletions airbyte-integrations/connectors/source-file/source_file/proxy.py
@@ -0,0 +1,121 @@
# Copyright (c) 2025 Airbyte, Inc., all rights reserved.
"""Proxy config constants and helper functions."""

import os
import tempfile
from logging import Logger
from pathlib import Path


# Constants for proxy configuration keys
PROXY_PARENT_CONFIG_KEY = "http_proxy"
PROXY_URL_CONFIG_KEY = "proxy_url"
PROXY_CA_CERTIFICATE_CONFIG_KEY = "proxy_ca_certificate"


# Our hard-coded exclude list:
AIRBYTE_NO_PROXY_ENTRIES = [
    # Local and loopback
    "localhost",
    "127.0.0.1",
    "*.local",
    # Cloud metadata endpoints
    "169.254.169.254",  # Special link-local IP for metadata servers (AWS, Azure, etc.)
    "metadata.google.internal",  # GCP
    # Airbyte control/telemetry
    "*.airbyte.io",
    "*.airbyte.com",
    "connectors.airbyte.com",
    # Third-party telemetry
    "sentry.io",
    "api.segment.io",
    "*.sentry.io",
    "*.datadoghq.com",
    "app.datadoghq.com",
]


def _get_no_proxy_entries_from_env_var() -> list[str]:
    """Return a list of entries from the NO_PROXY environment variable."""
    if "NO_PROXY" in os.environ:
        return [x.strip() for x in os.environ["NO_PROXY"].split(",") if x.strip()]

    return []


def _get_no_proxy_string() -> str:
    """Return a string to be used as the NO_PROXY environment variable.

    This ensures that requests to these hosts bypass the proxy.
    """
    # Merge and dedupe our hardcoded list with any already-set `NO_PROXY` env var
    return ",".join(
        filter(
            None,  # Remove any None/Falsey values
            list(
                set(
                    # Combine and dedupe:
                    _get_no_proxy_entries_from_env_var() + AIRBYTE_NO_PROXY_ENTRIES
                )
            ),
        )
    )


def _install_ca_certificate(ca_cert_file_text: str) -> Path:
    """Install the CA certificate for the proxy.

    This involves saving the text to a local file and then setting
    the appropriate environment variables to use this certificate.

    Returns the path to the temporary CA certificate file.
    """
    with tempfile.NamedTemporaryFile(
        mode="w",
        delete=False,
        prefix="airbyte-custom-ca-cert-",
        suffix=".pem",
        encoding="utf-8",
    ) as temp_file:
        temp_file.write(ca_cert_file_text)
        temp_file.flush()

    os.environ["REQUESTS_CA_BUNDLE"] = temp_file.name
    os.environ["CURL_CA_BUNDLE"] = temp_file.name
    os.environ["SSL_CERT_FILE"] = temp_file.name

    return Path(temp_file.name).absolute()


def configure_custom_http_proxy(
    http_proxy_config: dict[str, str],
    *,
    logger: Logger,
    proxy_url: str | None = None,
    ca_cert_file_text: str | None = None,
) -> None:
    """Initialize the proxy environment variables.

    `http_proxy_config` is the connector's "http_proxy" config entry. It will be
    scanned for proxy config settings.

    If `proxy_url` and/or `ca_cert_file_text` are provided, they will override the
    values in `http_proxy_config`.

    The function will no-op if neither input option provides a proxy URL.
    """
    proxy_url = proxy_url or http_proxy_config.get(PROXY_URL_CONFIG_KEY)
    ca_cert_file_text = ca_cert_file_text or http_proxy_config.get(PROXY_CA_CERTIFICATE_CONFIG_KEY)

    if proxy_url:
        logger.info(f"Using custom proxy URL: {proxy_url}")

        if ca_cert_file_text:
            # Install the CA certificate if provided, and set CA-related env vars:
            cert_file_path = _install_ca_certificate(ca_cert_file_text)
            logger.info(f"Using custom installed CA certificate: {cert_file_path!s}")

        # Set the remaining proxy config env vars:
        os.environ["NO_PROXY"] = _get_no_proxy_string()
        os.environ["HTTP_PROXY"] = proxy_url
        os.environ["HTTPS_PROXY"] = proxy_url
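
The module above works entirely through environment variables, so its effect depends on how each HTTP client reads them. A minimal sketch, using only the standard library's `urllib.request` as one example of such a client: `merge_no_proxy` is a hypothetical, order-preserving variant of the merge step (not the PR's function), and the host names are illustrative. How entries with a leading `*.` are interpreted varies by client, so the sketch only checks exact host matches.

```python
import os
import urllib.request
from unittest.mock import patch


def merge_no_proxy(existing: str, defaults: list) -> str:
    """Hypothetical helper: merge NO_PROXY entries, dropping blanks and duplicates.

    dict.fromkeys keeps first-seen order, so the result is deterministic.
    """
    entries = [item.strip() for item in existing.split(",")] + defaults
    return ",".join(dict.fromkeys(entry for entry in entries if entry))


# Illustrative values only; the proxy URL is one of the spec examples.
proxy_env = {
    "HTTP_PROXY": "http://proxy.company.com:8080",
    "HTTPS_PROXY": "http://proxy.company.com:8080",
    "NO_PROXY": merge_no_proxy("internal.example, localhost", ["localhost", "127.0.0.1"]),
}

with patch.dict(os.environ, proxy_env, clear=True):
    # urllib reads the *_PROXY variables from the environment at call time.
    detected = urllib.request.getproxies_environment()
    bypass_localhost = urllib.request.proxy_bypass_environment("localhost")
    bypass_remote = urllib.request.proxy_bypass_environment("example.com")

print(proxy_env["NO_PROXY"])            # internal.example,localhost,127.0.0.1
print(detected["https"])                # http://proxy.company.com:8080
print(bypass_localhost, bypass_remote)  # True False
```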
26 changes: 26 additions & 0 deletions airbyte-integrations/connectors/source-file/source_file/spec.json
@@ -250,6 +250,32 @@
             }
           }
         ]
+      },
+      "http_proxy": {
+        "type": "object",
+        "title": "HTTP Proxy Configuration",
+        "description": "Configuration for using a custom HTTP proxy to access remote files.",
+        "properties": {
+          "proxy_url": {
+            "order": 1,
+            "type": "string",
+            "title": "HTTP Proxy URL",
+            "description": "HTTP/HTTPS proxy URL for accessing remote files. Format: http://proxy-host:port or https://proxy-host:port",
+            "examples": [
+              "http://proxy.company.com:8080",
+              "https://secure-proxy.company.com:3128"
+            ]
+          },
+          "proxy_ca_certificate": {
+            "order": 2,
+            "type": "string",
+            "title": "Proxy CA Certificate",
+            "description": "Custom CA certificate for communicating with the custom Proxy URL. Provide the full certificate in PEM format, beginning with '-----BEGIN CERTIFICATE-----' and ending with '-----END CERTIFICATE-----'. Ignored if Proxy URL is not set.",
+            "airbyte_secret": true,
+            "writeOnly": true,
+            "multiline": true
+          }
+        }
       }
     }
   }
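
The `proxy_ca_certificate` description asks for a full PEM block. The following is a rough sketch of two related ideas: a shape check for that block, and writing the custom certificate into a combined bundle alongside an existing one, which is the approach named by the later commit "create as combined certs bundle including built-in system certs" (not part of the 16 commits shown here). The helper names, file names, and placeholder certificate bodies are all hypothetical; this is not the PR's implementation.

```python
import tempfile
from pathlib import Path
from typing import Optional

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"
PEM_END = "-----END CERTIFICATE-----"


def looks_like_pem_certificate(text: str) -> bool:
    """Cheap shape check only; it does not parse or verify the certificate."""
    stripped = text.strip()
    return stripped.startswith(PEM_BEGIN) and stripped.endswith(PEM_END)


def write_combined_bundle(custom_cert: str, base_bundle: Optional[Path], out_dir: Path) -> Path:
    """Write the base bundle (if present) followed by the custom certificate."""
    if not looks_like_pem_certificate(custom_cert):
        raise ValueError("Expected a PEM certificate block")
    parts = []
    if base_bundle is not None and base_bundle.is_file():
        parts.append(base_bundle.read_text(encoding="utf-8").strip())
    parts.append(custom_cert.strip())
    out_path = out_dir / "combined-ca-bundle.pem"
    out_path.write_text("\n".join(parts) + "\n", encoding="utf-8")
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Stand-in for a real system bundle; the body is a placeholder, not a real cert.
    system_bundle = tmp_dir / "system.pem"
    system_bundle.write_text(f"{PEM_BEGIN}\nsystem\n{PEM_END}\n", encoding="utf-8")
    bundle = write_combined_bundle(f"{PEM_BEGIN}\ncustom\n{PEM_END}", system_bundle, tmp_dir)
    bundle_text = bundle.read_text(encoding="utf-8")

print(bundle_text.count(PEM_BEGIN))  # 2
```

Pointing the CA-bundle variables at a file that holds only the proxy's certificate would leave clients without the usual public roots, which is presumably why a combined bundle is useful.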
@@ -0,0 +1,118 @@
# Copyright (c) 2025 Airbyte, Inc., all rights reserved.

import os
from pathlib import Path
from unittest.mock import Mock, mock_open, patch

import pytest
from source_file.client import Client, URLFile

from airbyte_cdk.entrypoint import logger


class TestProxyCertificateSupport:
    """Test proxy and certificate support for HTTPS provider"""

    def test_https_with_proxy_only(self):
        """Test HTTPS provider with proxy_url configuration"""
        http_proxy_config = {"proxy_url": "http://proxy.company.com:8080"}

        with patch.dict("os.environ", {}, clear=True), patch("source_file.client.configure_custom_http_proxy") as mock_configure:
            client = Client(
                dataset_name="test", url="https://example.com/test.csv", provider={"storage": "HTTPS"}, http_proxy=http_proxy_config
            )

            mock_configure.assert_called_once_with(http_proxy_config=http_proxy_config, logger=logger)

    def test_https_with_certificate_only(self):
        """Test HTTPS provider with ca_certificate configuration"""
        test_cert = "-----BEGIN CERTIFICATE-----\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA...\n-----END CERTIFICATE-----"
        http_proxy_config = {"proxy_ca_certificate": test_cert}

        with (
            patch.dict("os.environ", {}, clear=True),
            patch("source_file.proxy._install_ca_certificate") as mock_install,
            patch("source_file.client.configure_custom_http_proxy") as mock_configure,
        ):
            mock_install.return_value = Path("/tmp/test_cert.pem")

            client = Client(
                dataset_name="test", url="https://example.com/test.csv", provider={"storage": "HTTPS"}, http_proxy=http_proxy_config
            )

            mock_configure.assert_called_once_with(http_proxy_config=http_proxy_config, logger=logger)

    def test_https_with_proxy_and_certificate(self):
        """Test HTTPS provider with both proxy_url and ca_certificate"""
        test_cert = "-----BEGIN CERTIFICATE-----\ntest\n-----END CERTIFICATE-----"
        http_proxy_config = {"proxy_url": "https://secure-proxy.company.com:3128", "proxy_ca_certificate": test_cert}

        with patch.dict("os.environ", {}, clear=True), patch("source_file.client.configure_custom_http_proxy") as mock_configure:
            client = Client(
                dataset_name="test", url="https://example.com/test.csv", provider={"storage": "HTTPS"}, http_proxy=http_proxy_config
            )

            mock_configure.assert_called_once_with(http_proxy_config=http_proxy_config, logger=logger)

    def test_https_without_proxy_or_certificate(self):
        """Test HTTPS provider without proxy or certificate (regression test)"""
        with patch.dict("os.environ", {}, clear=True), patch("source_file.client.configure_custom_http_proxy") as mock_configure:
            client = Client(dataset_name="test", url="https://example.com/test.csv", provider={"storage": "HTTPS"}, http_proxy=None)

            mock_configure.assert_not_called()

    def test_https_with_user_agent_and_proxy(self):
        """Test HTTPS provider with user_agent and proxy_url"""
        http_proxy_config = {"proxy_url": "http://proxy.test.com:8080"}

        with (
            patch.dict("os.environ", {"AIRBYTE_VERSION": "1.2.3"}, clear=True),
            patch("source_file.client.configure_custom_http_proxy") as mock_configure,
        ):
            client = Client(
                dataset_name="test",
                url="https://example.com/test.csv",
                provider={"storage": "HTTPS", "user_agent": True},
                http_proxy=http_proxy_config,
            )

            mock_configure.assert_called_once_with(http_proxy_config=http_proxy_config, logger=logger)

    def test_certificate_installation(self):
        """Test certificate installation creates temporary file and sets environment variables"""
        test_cert = "-----BEGIN CERTIFICATE-----\ntest\n-----END CERTIFICATE-----"

        with patch("tempfile.NamedTemporaryFile") as mock_temp_file, patch.dict("os.environ", {}, clear=True):
            mock_file = mock_open()
            mock_temp_file.return_value.__enter__.return_value = mock_file.return_value
            mock_file.return_value.name = "/tmp/test_cert.pem"

            from source_file.proxy import _install_ca_certificate

            result_path = _install_ca_certificate(test_cert)

            mock_file.return_value.write.assert_called_once_with(test_cert)
            mock_file.return_value.flush.assert_called_once()

            assert os.environ.get("REQUESTS_CA_BUNDLE") == "/tmp/test_cert.pem"
            assert os.environ.get("CURL_CA_BUNDLE") == "/tmp/test_cert.pem"
            assert os.environ.get("SSL_CERT_FILE") == "/tmp/test_cert.pem"

    def test_proxy_environment_variables_set(self):
        """Test that proxy configuration sets the correct environment variables"""
        http_proxy_config = {
            "proxy_url": "http://proxy.test.com:8080",
            "proxy_ca_certificate": "-----BEGIN CERTIFICATE-----\ntest\n-----END CERTIFICATE-----",
        }

        with patch.dict("os.environ", {}, clear=True), patch("source_file.proxy._install_ca_certificate") as mock_install:
            mock_install.return_value = Path("/tmp/test_cert.pem")

            from source_file.proxy import configure_custom_http_proxy

            configure_custom_http_proxy(http_proxy_config=http_proxy_config, logger=logger)

            assert os.environ.get("HTTP_PROXY") == "http://proxy.test.com:8080"
            assert os.environ.get("HTTPS_PROXY") == "http://proxy.test.com:8080"
            assert "NO_PROXY" in os.environ
            mock_install.assert_called_once_with("-----BEGIN CERTIFICATE-----\ntest\n-----END CERTIFICATE-----")
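
These tests rely on `patch.dict(..., clear=True)` to keep environment changes from leaking between test cases. A small sketch of that isolation pattern, where `set_proxy_env` is a hypothetical stand-in for any code that mutates `os.environ`:

```python
import os
from unittest.mock import patch


def set_proxy_env(proxy_url: str) -> None:
    """Hypothetical stand-in for code that writes proxy settings to os.environ."""
    os.environ["HTTP_PROXY"] = proxy_url
    os.environ["HTTPS_PROXY"] = proxy_url


value_before = os.environ.get("HTTPS_PROXY")

with patch.dict(os.environ, {}, clear=True):
    # Inside the block the environment starts empty and may be mutated freely.
    set_proxy_env("http://proxy.test.com:8080")
    value_inside = os.environ.get("HTTPS_PROXY")

# On exit, patch.dict restores the environment to its prior state.
value_after = os.environ.get("HTTPS_PROXY")

print(value_inside)                 # http://proxy.test.com:8080
print(value_after == value_before)  # True
```

Because the environment is restored on exit, assertions about variables set by the code under test need to run inside the `with` block.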
1 change: 1 addition & 0 deletions docs/integrations/sources/file.md
@@ -298,6 +298,7 @@ In order to read large files from a remote location, this connector uses the [sm

| Version | Date | Pull Request | Subject |
| :------ | :--------- | :------------------------------------------------------- | :------------------------------------------------------------------------------------------------------ |
| 0.6.0 | 2025-07-02 | [61521](https://github.com/airbytehq/airbyte/pull/61521) | Add support for custom HTTP proxy URL and custom proxy CA certificate |
| 0.5.35 | 2025-06-28 | [62343](https://github.com/airbytehq/airbyte/pull/62343) | Update dependencies |
| 0.5.34 | 2025-06-22 | [61283](https://github.com/airbytehq/airbyte/pull/61283) | Update dependencies |
| 0.5.33 | 2025-05-27 | [60869](https://github.com/airbytehq/airbyte/pull/60869) | Update dependencies |