Storing large access data externally #6199

Merged
31 commits merged into main on Jun 11, 2025

Conversation

@galvana (Contributor) commented Jun 4, 2025

Closes ENG-684

Description Of Changes

This PR introduces an automatic external storage fallback system to handle data that exceeds PostgreSQL's 1GB column limit. The solution centers on a new EncryptedLargeDataDescriptor that can be applied seamlessly to SQLAlchemy model columns. This descriptor maintains backward compatibility by storing small data directly in the database as before, while automatically detecting large datasets and transparently offloading them to encrypted external storage through configurable backends (local, S3, or GCS).

Example log line when a value is offloaded:

RequestTask.data_for_erasures: Data size (1,111,650,541 bytes) exceeds threshold (671,088,640 bytes), storing externally

EncryptedLargeDataDescriptor

  • Generic Python descriptor that can be applied to any encrypted SQLAlchemy column
    • Currently applied to RequestTask.access_results, RequestTask.data_for_erasures, and PrivacyRequest.filtered_final_upload
  • Automatically determines whether to store data in database or external storage based on size
  • Provides transparent getter/setter interface that maintains existing API compatibility
  • Handles metadata tracking for externally stored files
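
To illustrate the idea (not the actual fides implementation), here is a minimal sketch of such a descriptor, assuming a JSON-serializable value, a local-file backend, and a hypothetical size threshold; the real EncryptedLargeDataDescriptor adds encryption, configurable backends, and metadata tracking:

import json
import uuid
from pathlib import Path
from typing import Any, Optional, Type

EXTERNAL_DIR = Path("/tmp/large_data")   # hypothetical location for offloaded values
THRESHOLD_BYTES = 640 * 1024 * 1024      # hypothetical ~640 MB threshold


class LargeDataDescriptor:
    """Keep small values on the instance; offload large ones to a file and keep a pointer."""

    def __set_name__(self, owner: Type, name: str) -> None:
        self.field_name = name
        self.private_name = f"_{name}"

    def __get__(self, instance: Any, owner: Optional[Type] = None) -> Any:
        if instance is None:
            return self
        stored = getattr(instance, self.private_name, None)
        if isinstance(stored, dict) and "external_path" in stored:
            # Transparently read the externally stored payload back
            return json.loads(Path(stored["external_path"]).read_text())
        return stored

    def __set__(self, instance: Any, value: Any) -> None:
        serialized = json.dumps(value)
        if len(serialized.encode("utf-8")) > THRESHOLD_BYTES:
            EXTERNAL_DIR.mkdir(parents=True, exist_ok=True)
            path = EXTERNAL_DIR / f"{self.field_name}-{uuid.uuid4()}.json"
            path.write_text(serialized)
            # Only a small pointer stays in the "column"
            setattr(instance, self.private_name, {"external_path": str(path)})
        else:
            setattr(instance, self.private_name, value)


class RequestTaskLike:                    # stand-in for a SQLAlchemy model
    access_results = LargeDataDescriptor()

Callers just read and write task.access_results; whether the value lives in the database or in external storage stays hidden behind the descriptor.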

AES GCM Encryption Utilities

  • Leverages existing SQLAlchemy-Utils AES GCM encryption for consistency
  • Introduces new memory-efficient chunked processing using Python's cryptography library
  • Processes large datasets in batches to avoid memory overflow
  • Maintains cross-compatibility between both encryption implementations
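
As a rough sketch of chunked AES GCM with the cryptography library (the chunk size and the nonce-prefix/tag-suffix file layout below are assumptions for illustration, not the fides on-disk format):

import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

CHUNK_SIZE = 8 * 1024 * 1024  # process 8 MB at a time to bound memory use


def encrypt_file_chunked(key: bytes, src_path: str, dst_path: str) -> None:
    nonce = os.urandom(12)
    encryptor = Cipher(algorithms.AES(key), modes.GCM(nonce)).encryptor()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(nonce)                         # prepend nonce so decryption can recover it
        while chunk := src.read(CHUNK_SIZE):
            dst.write(encryptor.update(chunk))   # stream ciphertext chunk by chunk
        encryptor.finalize()
        dst.write(encryptor.tag)                 # append the 16-byte GCM tag


def decrypt_file_chunked(key: bytes, src_path: str) -> bytes:
    with open(src_path, "rb") as src:
        data = src.read()
    nonce, ciphertext, tag = data[:12], data[12:-16], data[-16:]
    decryptor = Cipher(algorithms.AES(key), modes.GCM(nonce, tag)).decryptor()
    return decryptor.update(ciphertext) + decryptor.finalize()

Streaming through update() keeps peak memory roughly bounded by the chunk size while still producing a single authenticated ciphertext.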

ExternalStorageService

  • Unified interface supporting multiple storage backends (local filesystem, S3, GCS)
  • Handles consistent file naming and organization across different storage types
  • Manages complete lifecycle: store, retrieve, and cleanup operations
  • Provides encrypted file storage with proper key management and metadata tracking
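
A rough sketch of the unified-interface shape with just a local-filesystem backend (class and method names are illustrative, not the actual fides API; S3 and GCS backends would implement the same interface):

from abc import ABC, abstractmethod
from pathlib import Path


class StorageBackend(ABC):
    @abstractmethod
    def put(self, file_key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, file_key: str) -> bytes: ...

    @abstractmethod
    def delete(self, file_key: str) -> None: ...


class LocalBackend(StorageBackend):
    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def put(self, file_key: str, data: bytes) -> None:
        path = self.root / file_key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, file_key: str) -> bytes:
        return (self.root / file_key).read_bytes()

    def delete(self, file_key: str) -> None:
        (self.root / file_key).unlink(missing_ok=True)


class ExternalStorageServiceSketch:
    """Store / retrieve / cleanup against whichever backend is configured."""

    def __init__(self, backend: StorageBackend) -> None:
        self.backend = backend

    def store(self, file_key: str, data: bytes) -> None:
        self.backend.put(file_key, data)   # encryption would happen before this call

    def retrieve(self, file_key: str) -> bytes:
        return self.backend.get(file_key)

    def cleanup(self, file_key: str) -> None:
        self.backend.delete(file_key)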

Additional Code Changes

  • Memory-efficient util to estimate the on-disk size of in-memory access results (see the sketch after this list)
  • No longer cache large results in Redis for DSR 3.0
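
As a sketch of how such an estimate can avoid serializing the full payload, one option is to serialize a sample of rows and extrapolate (the sample size and approach here are assumptions, not the actual fides util):

import json
from typing import Any, Dict, List

SAMPLE_SIZE = 100  # hypothetical number of rows to serialize for the estimate


def estimate_access_result_size(rows: List[Dict[str, Any]]) -> int:
    """Estimate the on-disk JSON size of a list of result rows."""
    if not rows:
        return 0
    sample = rows[:SAMPLE_SIZE]
    sample_bytes = len(json.dumps(sample, default=str).encode("utf-8"))
    # Extrapolate from the average row size observed in the sample
    avg_row_bytes = sample_bytes / len(sample)
    return int(avg_row_bytes * len(rows))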

Steps to Confirm

  1. Check out the branch on https://github.com/ethyca/fidesplus/pull/2232
  2. Enable DSR 3.0 by adding FIDES__EXECUTION__USE_DSR_3_0=true to your .env
  3. Make sure local storage is set as the active default storage config with the HTML format
  4. Create a new system with a Large Data Test Connector. The connection test will fail; that's OK
  5. Submit an access request for [email protected]
  6. The encrypted values and final DSR package should be under fidesplus/fides_uploads

This could take a while; it took ~13 minutes from beginning to end.

Pre-Merge Checklist

  • Issue requirements met
  • All CI pipelines succeeded
  • CHANGELOG.md updated
    • Add a db-migration label to the entry if your change includes a DB migration
    • Add a high-risk label to the entry if your change includes a high-risk change (i.e. potential for performance impact or unexpected regression) that should be flagged
    • Updates unreleased work already in Changelog, no new entry necessary
  • Followup issues:
    • Followup issues created
    • No followup issues
  • Database migrations:
    • Ensure that your downrev is up to date with the latest revision on main
    • Ensure that your downgrade() migration is correct and works
      • If a downgrade migration is not possible for this change, please call this out in the PR description!
    • No migrations
  • Documentation:
    • Documentation complete, PR opened in fidesdocs
    • Documentation issue created in fidesdocs
    • If there are any new client scopes created as part of the pull request, remember to update public-facing documentation that references our scope registry
    • No documentation updates required

vercel bot commented Jun 4, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name                   Status                 Updated (UTC)
fides-plus-nightly     ✅ Ready (Inspect)     Jun 11, 2025 9:59pm

1 Skipped Deployment
Name                   Status                 Updated (UTC)
fides-privacy-center   ⬜️ Ignored (Inspect)   Jun 11, 2025 9:59pm

codecov bot commented Jun 8, 2025

Codecov Report

Attention: Patch coverage is 37.73585% with 264 lines in your changes missing coverage. Please review.

Project coverage is 76.69%. Comparing base (4d4d933) to head (c750ecc).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
src/fides/api/service/external_data_storage.py 27.15% 110 Missing ⚠️
...des/api/util/encryption/aes_gcm_encryption_util.py 18.34% 89 Missing ⚠️
...des/api/models/field_types/encrypted_large_data.py 54.21% 35 Missing and 3 partials ⚠️
src/fides/api/util/data_size.py 35.71% 25 Missing and 2 partials ⚠️

❌ Your patch status has failed because the patch coverage (37.73%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (76.69%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.

❗ There is a different number of reports uploaded between BASE (4d4d933) and HEAD (c750ecc).

HEAD has 2 fewer uploads than BASE
Flag    BASE (4d4d933)    HEAD (c750ecc)
        12                10
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #6199       +/-   ##
===========================================
- Coverage   87.14%   76.69%   -10.45%     
===========================================
  Files         433      439        +6     
  Lines       26863    27278      +415     
  Branches     2935     2969       +34     
===========================================
- Hits        23410    20922     -2488     
- Misses       2820     5638     +2818     
- Partials      633      718       +85     

☔ View full report in Codecov by Sentry.

@galvana galvana requested a review from JadeCara June 10, 2025 17:49
@galvana galvana marked this pull request as ready for review June 10, 2025 17:53
@JadeCara (Contributor) left a comment

A few suggestions, no blockers though :) Nice work and so many tests!


return f"{self.model_class}/{instance_id}/{self.field_name}/{timestamp}.txt"

def __get__(self, instance: Any, owner: Type) -> Any:

Contributor:
I don't see owner being used anywhere in this function - can probably be removed from the signature?

Contributor Author (galvana):
This is one of the methods that are part of the descriptor protocol; we need to keep it even if we don't use it: https://docs.python.org/3/reference/datamodel.html#object.__get__
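
For reference, the protocol-defined signature looks like this (illustrative snippet, not code from this PR):

class Descriptor:
    def __get__(self, instance, owner=None):
        # `owner` is the class through which the descriptor was accessed;
        # the protocol passes it whether or not the body uses it.
        if instance is None:
            return self
        return getattr(instance, "_value", None)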

Contributor:
I am not sure if this belongs in models top level? It kind of blends in with the regular models.

Contributor Author (galvana):
I moved it under /models/field_types

)

return is_large

Contributor:
These top two functions might be useful in other areas as well. I wonder if putting them into a data_size.py type util would make them easier to find/use?

Comment on lines 63 to 80
# Get storage config
if storage_key:
    storage_config = (
        db.query(StorageConfig)
        .filter(StorageConfig.key == storage_key)
        .first()
    )
    if not storage_config:
        msg = f"Storage configuration with key '{storage_key}' not found"
        logger.error(msg)
        raise ExternalDataStorageError(msg)
else:
    storage_config = get_active_default_storage_config(db)
    if not storage_config:
        msg = "No active default storage configuration available for large data"
        logger.error(msg)
        raise ExternalDataStorageError(msg)

Contributor:
This function is doing a number of things - this top part might be a good candidate to pull out into its own testable function.

Contributor Author (galvana):
Good call!
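
For readers following along, the extraction might look roughly like this, reusing the names from the quoted snippet and assuming the surrounding module's imports (the exact helper name and placement are hypothetical):

def resolve_storage_config(db: Session, storage_key: Optional[str]) -> StorageConfig:
    """Return the requested StorageConfig, or the active default if no key is given."""
    if storage_key:
        storage_config = (
            db.query(StorageConfig)
            .filter(StorageConfig.key == storage_key)
            .first()
        )
        if not storage_config:
            msg = f"Storage configuration with key '{storage_key}' not found"
            logger.error(msg)
            raise ExternalDataStorageError(msg)
        return storage_config

    storage_config = get_active_default_storage_config(db)
    if not storage_config:
        msg = "No active default storage configuration available for large data"
        logger.error(msg)
        raise ExternalDataStorageError(msg)
    return storage_config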

@galvana galvana merged commit cd2c209 into main Jun 11, 2025
17 checks passed
@galvana galvana deleted the ENG-684-save-large-access-data-externally branch June 11, 2025 22:02
cypress bot commented Jun 11, 2025

fides    Run #12981

Run Properties
Project: fides
Branch Review: main
Run status: Passed #12981
Run duration: 00m 53s
Commit: cd2c20930b: Storing large access data externally (#6199)
Committer: Adrian Galvan
View all properties for this run ↗︎

Test results
Failures: 0
Flaky: 0
Pending (annotated with .skip): 0
Skipped (mocha hook failure): 0
Passing: 5
View all changes introduced in this branch ↗︎

galvana added a commit that referenced this pull request Jun 12, 2025
2 participants