-
Notifications
You must be signed in to change notification settings - Fork 81
Storing large access data externally #6199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
1 Skipped Deployment
|
Codecov ReportAttention: Patch coverage is
❌ Your patch status has failed because the patch coverage (37.73%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files@@ Coverage Diff @@
## main #6199 +/- ##
===========================================
- Coverage 87.14% 76.69% -10.45%
===========================================
Files 433 439 +6
Lines 26863 27278 +415
Branches 2935 2969 +34
===========================================
- Hits 23410 20922 -2488
- Misses 2820 5638 +2818
- Partials 633 718 +85 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few suggestions, no blockers though :) Nice work and so many tests!
|
||
return f"{self.model_class}/{instance_id}/{self.field_name}/{timestamp}.txt" | ||
|
||
def __get__(self, instance: Any, owner: Type) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see owner
being used anywhere in this function - can probably be removed from the signature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of the methods that are part of the descriptor protocol, we need to leave it even if we don't use it https://docs.python.org/3/reference/datamodel.html#object.__get__
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this belongs in models top level? It kind of blends in with the regular models.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved it under /models/field_types
) | ||
|
||
return is_large | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These top two functions might be useful in other areas as well. I wonder if putting them into a data_size.py
type util would make them easier to find/use?
# Get storage config | ||
if storage_key: | ||
storage_config = ( | ||
db.query(StorageConfig) | ||
.filter(StorageConfig.key == storage_key) | ||
.first() | ||
) | ||
if not storage_config: | ||
msg = f"Storage configuration with key '{storage_key}' not found" | ||
logger.error(msg) | ||
raise ExternalDataStorageError(msg) | ||
else: | ||
storage_config = get_active_default_storage_config(db) | ||
if not storage_config: | ||
msg = "No active default storage configuration available for large data" | ||
logger.error(msg) | ||
raise ExternalDataStorageError(msg) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is doing a number of things - this top part might be a good candidate to pull out into its own testable function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call!
fides
|
Project |
fides
|
Branch Review |
main
|
Run status |
|
Run duration | 00m 53s |
Commit |
|
Committer | Adrian Galvan |
View all properties for this run ↗︎ |
Test results | |
---|---|
|
0
|
|
0
|
|
0
|
|
0
|
|
5
|
View all changes introduced in this branch ↗︎ |
Closes ENG-684
Description Of Changes
This PR introduces an automatic external storage fallback system to handle data that exceeds PostgreSQL's 1GB column limit. The solution centers around a new
EncryptedLargeDataDescriptor
that can be seamlessly applied to SQLAlchemy model columns. This descriptor maintains backward compatibility by storing small data directly in the database as before, while automatically detecting large datasets and transparently offloading them to encrypted external storage through configurable backends (local, S3, or GCS).EncryptedLargeDataDescriptor
RequestTask.access_results
,RequestTask.data_for_erasures
, andPrivacyRequest.filtered_final_upload
AES GCM Encryption Utilities
ExternalStorageService
Additional Code Changes
Steps to Confirm
.env
FIDES__EXECUTION__USE_DSR_3_0=true
html
Large Data Test Connector
. The connection test will fail, that's ok[email protected]
fidesplus/fides_uploads
This could take a while, it took ~13 minutes from beginning to end
Pre-Merge Checklist
CHANGELOG.md
updatedmain
downgrade()
migration is correct and works