Skip to content

canonicalize input labels to lowercase before matching during import #107

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Feb 21, 2025

Conversation

kleinschmidt
Copy link
Member

This will avoid a common footgun where folks put in upper or otherwise mixed
case labels when supplying custom labels (e.g., by copy-pasting the actual
mixed-case label from the EDF header). The EDF labels are always converted to
lowercase before matching, so this change essentially makes label matching
case-invariant.

@@ -147,13 +147,13 @@ function match_edf_label(label, signal_names, channel_name, canonical_names)
# will not match. the fix for this is to preprocess signal headers before
# `plan_edf_to_onda_samples` to normalize known instances (after reviewing the plan)
m = match(r"[\s\[,\]]*(?<signal>.+?)[\s,\]]*\s+(?<spec>.+)"i, label)
if !isnothing(m) && m[:signal] in signal_names
if !isnothing(m) && m[:signal] in Iterators.map(lowercase, signal_names)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterators.map for laziness? At that point I almost wonder if

Suggested change
if !isnothing(m) && m[:signal] in Iterators.map(lowercase, signal_names)
if !isnothing(m) && any(==(m[:signal]) \circ lowercase, signal_names)

would be faster

[noblock]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, good q...I wonder if this is really worth it at all actually. I'll do a bit of benchmarking and try to figure out

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick-and-dirty benchmarks: this method is slower than what's on main by about 20%. using any(==) is marginally slower still

julia> master
BenchmarkTools.Trial: 1121 samples with 1 evaluation per sample.
 Range (min  max):  4.144 ms    7.371 ms  ┊ GC (min  max): 0.00%  40.79%
 Time  (median):     4.300 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.459 ms ± 665.456 μs  ┊ GC (mean ± σ):  3.99% ±  9.35%

  ▆▅▂█▄▁                                                    ▁  
  ██████▇▄▆▁▆▆▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▇▇██▆█ █
  4.14 ms      Histogram: log(frequency) by time      6.92 ms <

 Memory estimate: 3.20 MiB, allocs estimate: 68903.

julia> pr
BenchmarkTools.Trial: 873 samples with 1 evaluation per sample.
 Range (min  max):  5.326 ms    8.560 ms  ┊ GC (min  max): 0.00%  33.95%
 Time  (median):     5.463 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.722 ms ± 688.035 μs  ┊ GC (mean ± σ):  4.51% ±  8.85%

   ▅█▇█                                                        
  ▆████▆▃▃▃▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▄▄▄▄▃▂▂ ▃
  5.33 ms         Histogram: frequency by time        7.61 ms <

 Memory estimate: 5.83 MiB, allocs estimate: 134608.

julia> any
BenchmarkTools.Trial: 846 samples with 1 evaluation per sample.
 Range (min  max):  5.467 ms   10.615 ms  ┊ GC (min  max): 0.00%   0.00%
 Time  (median):     5.575 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.909 ms ± 874.562 μs  ┊ GC (mean ± σ):  5.50% ± 10.32%

  █▅▆                                                          
  ███▆▃▃▂▂▂▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▃▄▃▃▁▂▁▂▁▁▂▂ ▂
  5.47 ms         Histogram: frequency by time        8.56 ms <

 Memory estimate: 5.83 MiB, allocs estimate: 134608.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the \circ syntax - some brief internet research suggests it is used for function composition, but I'm not clear how the actual operations differ in a way that would make it faster than Iterators.map

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

surprisingly canonicalizing the whole label set ahead of time is even slower

julia> fmap_b = @benchmark plan_edf_to_onda_samples($edf)
BenchmarkTools.Trial: 651 samples with 1 evaluation per sample.
 Range (min  max):  7.161 ms   10.735 ms  ┊ GC (min  max): 0.00%  27.32%
 Time  (median):     7.358 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.683 ms ± 903.901 μs  ┊ GC (mean ± σ):  4.13% ±  8.41%

  ▄▄▆█▆▃                                              ▁▁▁      
  ███████▇▆▆▅▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▇█████▆▆▆ ▇
  7.16 ms      Histogram: log(frequency) by time      10.3 ms <

 Memory estimate: 5.50 MiB, allocs estimate: 118133.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the \circ syntax - some brief internet research suggests it is used for function composition, but I'm not clear how the actual operations differ in a way that would make it faster than Iterators.map

Don't wanna put words in phillip's mouth but I think the idea is that any definitely will "short circuit" (the first thing that returns true it stops). I think in will as well and benchmarking seems to bear that out (at least as far as we can tell :).

@palday
Copy link
Member

palday commented Feb 21, 2025

I would go ahead and bump julia compat to 1.10 (current LTS) and change the CI matrix to use 'min' instead of an explicit lower bound

@kleinschmidt kleinschmidt requested a review from palday February 21, 2025 20:51
Copy link
Member

@palday palday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small performance hit isn't a big deal IMHO -- given that anything here is going to involve a fair amount of I/O, I don't imagine that an extra ms even in something called a few dozen times is going to be the dominant factor

@kleinschmidt
Copy link
Member Author

small performance hit isn't a big deal IMHO -- given that anything here is going to involve a fair amount of I/O, I don't imagine that an extra ms even in something called a few dozen times is going to be the dominant factor

yeah I think it starts to become meaningful if you're like iterating through 10,000s of EDFs, but the run-time is already non-trivial there.

@kleinschmidt kleinschmidt merged commit a32ffe5 into master Feb 21, 2025
14 checks passed
@kleinschmidt kleinschmidt deleted the dfk/lowercase-labels branch February 21, 2025 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants