Speedup import and add regression check for import time #238
Conversation
Thanks a lot for digging into this, really appreciated! I suspect you're right, we're dependent on upstream PRs being merged to achieve significant improvements here.
Force-pushed from 3147dc5 to 54aaf3f
Force-pushed from 54aaf3f to 1059be9
i think it would be good to add performance regression tests to this PR and run them both on
I will do this :) I was planning to add a runtime unit test as well (perhaps for core parts of pymatgen too); currently we only test the results, and runtime is not covered at all. Thanks again for bringing up this idea!
Regarding the actual test methodology, I haven't found much on best practices for import-time tests. Your method certainly works; what about spawning a subprocess (is there any pitfall with this, like subprocess overhead maybe?), something like:

```python
import subprocess
import time


def measure_import_time(import_command: str) -> float:
    """Time how long an import statement takes in a fresh interpreter."""
    start_time = time.time()
    subprocess.run(["python", "-c", import_command], check=True)
    return time.time() - start_time
```

I hope I'm correct here, I guess there's a slight pitfall with the

Is there any advantage to using relative time (…)?
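Not what the PR ended up using, just one way to address the subprocess-overhead concern mentioned above: time a bare interpreter start as a baseline and subtract it from the measured import time. The helper names (`interpreter_startup_ms`, `import_time_ms`) are hypothetical, not from this PR.

```python
import subprocess
import time


def interpreter_startup_ms() -> float:
    """Measure bare interpreter startup as a baseline (a no-op command)."""
    start = time.perf_counter()
    subprocess.run(["python", "-c", "pass"], check=True)
    return (time.perf_counter() - start) * 1000


def import_time_ms(module_name: str) -> float:
    """Measure import time in a fresh interpreter, minus the startup baseline."""
    baseline = interpreter_startup_ms()
    start = time.perf_counter()
    subprocess.run(["python", "-c", f"import {module_name}"], check=True)
    return (time.perf_counter() - start) * 1000 - baseline


if __name__ == "__main__":
    print(f"pymatviz: {import_time_ms('pymatviz'):.1f} ms")
```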
Force-pushed from 7497edc to 23cbe32
i'm not sure but i think
yes, have a look at this video which explains the virtues of
Thanks! Let's use subprocess for now, this should make things much easier (if there are no pitfalls).
Thanks for sharing, I'll have a look later but I certainly trust your judgement. By the way, how do we regenerate the
Currently around 12% of the import time is spent importing
the command for that is `pytest --store-durations --durations-path tests/.pytest-split-durations`. i should have documented that somewhere. probably best in
Thanks a ton for letting me know! What I don't really understand is how we would generate this runtime record via GitHub CI runners (as opposed to running that command locally)? Especially for
I have a feeling that we may not want this test to run on every commit (I don't expect we would introduce changes that affect import time very often), and it slows everything down. Perhaps we should only run it when merging into main? For this PR, I currently cannot think of more improvements to make, as what's left is pretty much core packages like scipy, pandas and matplotlib. Do we want to perhaps merge it (after I modify it to run on the main branch only)?
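One way to gate the test on the main branch (not necessarily how this PR implements it) is to skip it unless GitHub Actions reports the main ref. `GITHUB_REF` is set by Actions to `refs/heads/main` on pushes to main; the test name below is a placeholder.

```python
import os

import pytest

# GITHUB_REF is provided by GitHub Actions, e.g. "refs/heads/main" for pushes to main.
on_main_branch = os.getenv("GITHUB_REF", "") == "refs/heads/main"


@pytest.mark.skipif(not on_main_branch, reason="import-time regression test only runs on main")
def test_import_time_regression() -> None:
    # placeholder body; the real test would compare measured times against REF_IMPORT_TIME
    assert True
```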
Force-pushed from a36eccc to 5143765
very thankful for this PR! 👍 didn't mean to keep it parked so long
i think generating locally should be fine. what matters shouldn't be absolute values for test run times but how long a test takes relative to the others, which will be similar in CI and locally
even pymatgen only takes 10 min or so to run the whole test suite locally for me. curious how long for you?
is the test slow enough in CI that it would extend the overall time-to-checks-complete if it ran inside its own split? i just tried … is this PR ready to go from your side? are we waiting on a
ah, saw your comment in
Yep, it should take about the same time as pymatgen CI (the macOS runner runs on an M1 chip, each split takes around 1 min), since I'm usually coding on an M3 MacBook Air laptop. I'm just curious, as I think running on a dedicated runner would give better scalability and consistency :)
The original implementation ran 10 repeats, but I reduced that to 3, so it should be good now.
```python
    for module_name in REF_IMPORT_TIME
}

# Print out the import times in a copyable format
print("\nCopyable import time dictionary:")
print("{")
for module_name, import_time in import_times.items():
    print(f' "{module_name}": {import_time:.2f},')
```
This part was intended to generate a directly copyable dict (and human-readable for manual inspection) that we can copy and paste to update the recorded times; printing a dict directly gives a one-liner:
Copyable import time dictionary:
{'pymatviz': 2084.25, 'pymatviz.coordination': 2342.41, 'pymatviz.cumulative': 2299.73, 'pymatviz.histogram': 2443.11, 'pymatviz.phonons': 2235.57, 'pymatviz.powerups': 2172.71, 'pymatviz.ptable': 2286.77, 'pymatviz.rainclouds': 2702.03, 'pymatviz.rdf': 2331.98, 'pymatviz.relevance': 2256.29, 'pymatviz.sankey': 2313.12, 'pymatviz.scatter': 2312.48, 'pymatviz.structure_viz': 2330.39, 'pymatviz.sunburst': 2395.04, 'pymatviz.uncertainty': 2317.87, 'pymatviz.xrd': 2242.09}
i saw that. the thing is, if you paste that printed dict into a python file, ruff will auto-format it to multiple lines, so having easier-to-read code seemed more important
Fair point!
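As an aside for anyone reproducing this locally, the same copy-paste goal can be reached by pretty-printing the dict with the standard library; this is just a sketch using two values from the output above, not what the PR does.

```python
import json

# Example values taken from the printed dict above.
import_times = {"pymatviz": 2084.25, "pymatviz.coordination": 2342.41}

# json.dumps with indent yields output that is both copyable and already
# multi-line, so ruff has nothing left to reflow after pasting.
print(json.dumps(import_times, indent=4))
```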
```diff
@@ -69,12 +66,12 @@ def measure_import_time_in_ms(module_name: str, count: int = 3) -> float:
     """
     total_time = 0.0

     for _ in range(count):
         start_time = time.perf_counter_ns()
```
Thanks for catching this. Using `perf_counter_ns` was intended to avoid float precision loss (it gives time as an int, while `perf_counter` gives a float), but admittedly for our case the error should be negligible as: 1. it's far from the ns level; 2. we're rounding it anyway.
Use perf_counter_ns() to avoid the precision loss caused by the float type.
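For context, a minimal sketch of the int-vs-float distinction being discussed (illustrative only, not code from the PR): `perf_counter_ns()` returns an integer number of nanoseconds, so the subtraction is exact and precision is only lost at the final conversion, whereas `perf_counter()` returns float seconds throughout.

```python
import time

# Integer nanoseconds: the difference is exact; float rounding only enters
# when converting to milliseconds at the end.
start_ns = time.perf_counter_ns()
import json  # stand-in for any module whose import we want to time
elapsed_ms_int = (time.perf_counter_ns() - start_ns) / 1e6

# Float seconds: each call already returns a float, so tiny rounding error can
# creep in, though it is negligible at millisecond scales.
start_s = time.perf_counter()
import csv  # another stand-in module
elapsed_ms_float = (time.perf_counter() - start_s) * 1000

print(f"{elapsed_ms_int:.2f} ms vs {elapsed_ms_float:.2f} ms")
```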
* Add test framework to monitor module import times with regression tests
* Use time.perf_counter for accurate timing
* Implement lazy imports across multiple modules to improve performance:
  - scipy
  - plotly.figure_factory
  - sklearn
  - pymatgen (Structure, NearNeighbors, PhononDos, PhononBands, Composition)
* Add reference import times for all core modules
* Configure tests to run only on main branch
* Add grace and hard thresholds for import time regression

---------

Co-authored-by: Janosh Riebesell <[email protected]>
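The "lazy imports" item above refers to deferring heavy imports until they are actually needed, so that `import pymatviz` itself stays cheap. A generic sketch of the pattern follows; it is illustrative only, and `plot_with_scipy` is a made-up function name, not code from this PR.

```python
def plot_with_scipy(values: list[float]) -> float:
    """Example of a lazy import: scipy is only imported when this function runs,
    so importing the enclosing module does not pay the scipy import cost."""
    from scipy import stats  # deferred import keeps module import time low

    return float(stats.gmean(values))
```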
Summary

* `pymatviz` import cost way too high #209
* `.pytest-split-durations` (how to do this?)

Profile command: `python -X importtime -c "import pymatviz" 2> pmv.log && tuna pmv.log`