Add Option to run Incremental Export From Bigtable (Timestamp Filtering) #2233
base: release_2024-11-26-00_RC00
Conversation
It's very difficult to see the diffs =\ Is there anything we can do about it? It seems the formatting is wrong...
v1/pom.xml
Outdated
@@ -852,7 +852,7 @@
<artifactId>protobuf-maven-plugin</artifactId>
<version>0.6.1</version>
<configuration>
<protocArtifact>com.google.protobuf:protoc:${protobuf.version}:exe:${os.detected.classifier}</protocArtifact>
was this done on purpose?
Please also update the integration test to use the filter, and maybe add unit tests for edge cases such as a start timestamp without an end timestamp.
* Dataflow pipeline that exports data from a Cloud Bigtable table to Avro files in GCS. Currently,
* filtering on Cloud Bigtable table is not supported.
* Dataflow pipeline that exports data from a Cloud Bigtable table to Avro files in GCS.
* 2/25 Add filtering rows based on timestamp.
2/25?
groupName = "Source",
optional = true,
regexes = {"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\\.[0-9]+)?Z"},
description = "Start Timestamp in UTC Format (YYYY-MM-DDTHH:MM:SSZ) for exporting ",
something in this line seems broken. If not, please remove the whitespace after "Exporting".
optional = true,
regexes = {"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\\.[0-9]+)?Z"},
description = "End Timestamp in UTC Format (YYYY-MM-DDTHH:MM:SSZ)",
helpText = " Example UTC timestamp 2024-10-27T10:15:30.00Z"
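The validation regex in the snippet above can be exercised directly. A minimal sketch, where the class name `TimestampRegexCheck` is hypothetical but the pattern string is copied verbatim from the diff:

```java
import java.util.regex.Pattern;

public class TimestampRegexCheck {
    // Pattern copied from the template's `regexes` attribute: a full UTC
    // RFC-3339 timestamp with an optional fractional-seconds part.
    static final Pattern UTC_TS = Pattern.compile(
        "[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}(\\.[0-9]+)?Z");

    public static void main(String[] args) {
        // matches() requires the whole string to match, as template
        // parameter validation does.
        System.out.println(UTC_TS.matcher("2024-10-27T10:15:30.00Z").matches()); // true
        System.out.println(UTC_TS.matcher("2024-10-27 10:15:30").matches());     // false
    }
}
```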
missing text in the beginning
Thanks @maheshgoyal15 ! Could you add some e2e tests here to ensure this WAI?
Added test cases for the edge cases and for filtering the rows. Added an optional-parameters section to the Readme.
@ron-gal There were a few delays, but I was able to incorporate all the recommendations you provided. Can you take a look at this PR?
Description:
This pull request introduces a new feature to enable incremental data exports from Google Cloud Bigtable based on timestamp filtering. This addresses a specific customer requirement for efficiently extracting daily data subsets for transfer to a non-production Bigtable instance.
Problem:
The customer needs to export data from their production Bigtable database on a daily basis, filtering it based on a timestamp criterion. Currently, the existing export functionality lacks the ability to perform incremental exports based on timestamps, forcing them to perform full table scans or implement complex custom solutions.
Solution:
This PR implements the following changes to enable incremental exports based on timestamps:
- Two new pipeline options, `--startTimestamp` and `--endTimestamp`.
- A `TimestampRangeFilter` to efficiently retrieve only the required data.
- The `BigtableToAvro.Options` interface has been extended to include `getStartTimestamp()` and `getEndTimestamp()` methods.
- A `timestampConverter` function is implemented to convert the provided UTC timestamp strings into microseconds.
- A `filterProvider` `ValueProvider` is used to construct the `RowFilter` based on the provided start and end timestamps.
- The `BigtableIO.Read` operation now includes a `withRowFilter(filterProvider)` call to apply the timestamp filter.
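The timestamp-conversion step described above can be sketched with the JDK's `java.time` API alone. This is an illustrative standalone version, not the template's actual code; the class and method names are assumptions. Bigtable cell timestamps are expressed in microseconds since the epoch, which is why the conversion targets `ChronoUnit.MICROS`:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class TimestampConverter {
    // Hypothetical helper mirroring the PR's `timestampConverter`:
    // parses a UTC RFC-3339 string and returns microseconds since the
    // epoch, the unit Bigtable uses for cell timestamps.
    static long utcToMicros(String utcTimestamp) {
        Instant instant = Instant.parse(utcTimestamp);
        return ChronoUnit.MICROS.between(Instant.EPOCH, instant);
    }

    public static void main(String[] args) {
        System.out.println(utcToMicros("2024-10-27T10:15:30.00Z"));
    }
}
```

In the actual pipeline the resulting start and end values would feed the `TimestampRangeFilter` built inside `filterProvider` and applied via `withRowFilter(filterProvider)`.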