-
Notifications
You must be signed in to change notification settings - Fork 970
perf: New neo4j csv publisher to improve performance using batched params #1957
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: New neo4j csv publisher to improve performance using batched params #1957
Conversation
…able two or one way relations Signed-off-by: Kristen Armes <[email protected]>
…ogic to reusable function Signed-off-by: Kristen Armes <[email protected]>
Signed-off-by: Kristen Armes <[email protected]>
…d lint Signed-off-by: Kristen Armes <[email protected]>
Signed-off-by: Kristen Armes <[email protected]>
|
||
# CSV HEADER | ||
# A header with this suffix will be pass to Neo4j statement without quote | ||
UNQUOTED_SUFFIX = ':UNQUOTED' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this i think does nothing and is no longer needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into this, it is used to specify fields where the values are not strings, so like "order_pos:UNQUOTED"
would be for numbers 1, 2, etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find a reference for this, could you share. I recall looking this up a year+ ago finding it hasn't been doing anything and wasn't really a thing since maybe the earliest versions of neo4j. I tried just now again, and didn't find anything concrete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind I found it. Clearly I was dumb back then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I just looked more. IIUC: in the publisher, all that's happening is this suffix is being removed so it doesn't end up ingested. At some point, it was doing something, but not anymore (please correct me if I'm wrong). Instead of continuing to support this, instead can we simply remove this behavior from the loader and then, likewise, from all publishers? I did a full code search and I'm quite sure this is safe! Maybe double check though.
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
def get_scope(self) -> str: | ||
return 'publisher.neo4j' | ||
|
||
def _create_indices(self, node_file: str) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idea: allow for list of pre and post-publish actions. and set this as a default pre-publish action
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
What is the minimum Neo4j version required to be able to use this new publisher? I would suggest to document it if there is such requirement |
Hey @ozandogrultan, the Neo4j driver was upgraded in databuilder v7.0.0 (#1938) so it should work for anyone past that version. But I can make a note of it in the description too 👍 |
…nsaction Signed-off-by: Kristen Armes <[email protected]>
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Kristen Armes <[email protected]>
Signed-off-by: Kristen Armes <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amazing. Minor comments raised. Very nearly done.
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Kristen Armes <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😭 🥲 😭
tx.run(stmt, parameters=params) | ||
|
||
|
||
def get_props_body_keys(record_keys: list, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so good
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
databuilder/databuilder/publisher/neo4j_csv_unwind_publisher.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Kristen Armes <[email protected]>
amazing! any perf number before vs after? |
@feng-tao we've seen great improvements in a few tests I've ran, with having deleted the nodes/relations in between runs: Task # 1: 67676 statements Before: Successfully published. Elapsed: 203 seconds After: Successfully published. Elapsed: 8 seconds Task # 2: 237734 statements Before: Successfully published. Elapsed: 876 seconds After: Successfully published. Elapsed: 29 seconds |
nice, it would be great if you could share the change to the slack so that others could leverage :) (I still remembered the old days to take long time to update in the etl job) |
Summary of Changes
unwind
clause in the cypher queries when merging nodes and relations to be able to batch params and improve performancePUBLISH_REVERSE_RELATIONSHIPS
configPRESERVE_ADHOC_UI_DATA
configpublished_tag
property that is set by databuilder.Tests
Basic unit tests for the publisher
Documentation
N/A
CheckList
Make sure you have checked all steps below to ensure a timely review.