-
Notifications
You must be signed in to change notification settings - Fork 2
canonicalize input labels to lowercase before matching during import #107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -147,13 +147,13 @@ function match_edf_label(label, signal_names, channel_name, canonical_names) | |||
# will not match. the fix for this is to preprocess signal headers before | |||
# `plan_edf_to_onda_samples` to normalize known instances (after reviewing the plan) | |||
m = match(r"[\s\[,\]]*(?<signal>.+?)[\s,\]]*\s+(?<spec>.+)"i, label) | |||
if !isnothing(m) && m[:signal] in signal_names | |||
if !isnothing(m) && m[:signal] in Iterators.map(lowercase, signal_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iterators.map
for laziness? At that point I almost wonder if
if !isnothing(m) && m[:signal] in Iterators.map(lowercase, signal_names) | |
if !isnothing(m) && any(==(m[:signal]) \circ lowercase, signal_names) |
would be faster
[noblock]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, good q...I wonder if this is really worth it at all actually. I'll do a bit of benchmarking and try to figure out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quick-and-dirty benchmarks: this method is slower than what's on main by about 20%. using any(==)
is marginally slower still
julia> master
BenchmarkTools.Trial: 1121 samples with 1 evaluation per sample.
Range (min … max): 4.144 ms … 7.371 ms ┊ GC (min … max): 0.00% … 40.79%
Time (median): 4.300 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.459 ms ± 665.456 μs ┊ GC (mean ± σ): 3.99% ± 9.35%
▆▅▂█▄▁ ▁
██████▇▄▆▁▆▆▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▇▇██▆█ █
4.14 ms Histogram: log(frequency) by time 6.92 ms <
Memory estimate: 3.20 MiB, allocs estimate: 68903.
julia> pr
BenchmarkTools.Trial: 873 samples with 1 evaluation per sample.
Range (min … max): 5.326 ms … 8.560 ms ┊ GC (min … max): 0.00% … 33.95%
Time (median): 5.463 ms ┊ GC (median): 0.00%
Time (mean ± σ): 5.722 ms ± 688.035 μs ┊ GC (mean ± σ): 4.51% ± 8.85%
▅█▇█
▆████▆▃▃▃▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▄▄▄▄▃▂▂ ▃
5.33 ms Histogram: frequency by time 7.61 ms <
Memory estimate: 5.83 MiB, allocs estimate: 134608.
julia> any
BenchmarkTools.Trial: 846 samples with 1 evaluation per sample.
Range (min … max): 5.467 ms … 10.615 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.575 ms ┊ GC (median): 0.00%
Time (mean ± σ): 5.909 ms ± 874.562 μs ┊ GC (mean ± σ): 5.50% ± 10.32%
█▅▆
███▆▃▃▂▂▂▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▃▄▃▃▁▂▁▂▁▁▂▂ ▂
5.47 ms Histogram: frequency by time 8.56 ms <
Memory estimate: 5.83 MiB, allocs estimate: 134608.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with the \circ
syntax - some brief internet research suggests it is used for function composition, but I'm not clear how the actual operations differ in a way that would make it faster than Iterators.map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
surprisingly canonicalizing the whole label set ahead of time is even slower
julia> fmap_b = @benchmark plan_edf_to_onda_samples($edf)
BenchmarkTools.Trial: 651 samples with 1 evaluation per sample.
Range (min … max): 7.161 ms … 10.735 ms ┊ GC (min … max): 0.00% … 27.32%
Time (median): 7.358 ms ┊ GC (median): 0.00%
Time (mean ± σ): 7.683 ms ± 903.901 μs ┊ GC (mean ± σ): 4.13% ± 8.41%
▄▄▆█▆▃ ▁▁▁
███████▇▆▆▅▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▇█████▆▆▆ ▇
7.16 ms Histogram: log(frequency) by time 10.3 ms <
Memory estimate: 5.50 MiB, allocs estimate: 118133.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with the
\circ
syntax - some brief internet research suggests it is used for function composition, but I'm not clear how the actual operations differ in a way that would make it faster thanIterators.map
Don't wanna put words in phillip's mouth but I think the idea is that any
definitely will "short circuit" (the first thing that returns true
it stops). I think in
will as well and benchmarking seems to bear that out (at least as far as we can tell :).
I would go ahead and bump julia compat to 1.10 (current LTS) and change the CI matrix to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small performance hit isn't a big deal IMHO -- given that anything here is going to involve a fair amount of I/O, I don't imagine that an extra ms even in something called a few dozen times is going to be the dominant factor
yeah I think it starts to become meaningful if you're like iterating through 10,000s of EDFs, but the run-time is already non-trivial there. |
This will avoid a common footgun where folks put in upper or otherwise mixed
case labels when supplying custom labels (e.g., by copy-pasting the actual
mixed-case label from the EDF header). The EDF labels are always converted to
lowercase before matching, so this change essentially makes label matching
case-invariant.