canonicalize input labels to lowercase before matching during import #107

kleinschmidt · 2025-02-21T19:40:04Z

This will avoid a common footgun where folks supply custom labels in uppercase
or mixed case (e.g., by copy-pasting the actual mixed-case label from the EDF
header). The EDF labels are always converted to lowercase before matching, so
this change essentially makes label matching case-insensitive.
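
To illustrate the behavior described above, here is a minimal sketch, not the package's actual code. `label_matches`, `header_label`, and `custom_names` are hypothetical names introduced only for this example; the sketch assumes the header side has already been reduced to a bare signal name.

```julia
# Hypothetical stand-in for the matching step (not the real `match_edf_label`).
# `header_label` plays the role of a label read from the EDF header;
# `custom_names` plays the role of user-supplied signal names.
function label_matches(header_label::AbstractString, custom_names)
    # The EDF side is lowercased before matching.
    canonical = lowercase(header_label)
    # Lowercasing the user-supplied side as well makes the comparison
    # independent of the case the user typed.
    return canonical in Iterators.map(lowercase, custom_names)
end

label_matches("EEG", ["EEG", "ecg"])   # mixed-case custom label still matches
label_matches("EEG", ["emg"])          # unrelated label does not
```

Without the `lowercase` on the custom names, the first call would compare `"eeg"` against `"EEG"` and fail, which is the footgun the description refers to.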

palday · 2025-02-21T19:42:35Z

src/import_edf.jl

@@ -147,13 +147,13 @@ function match_edf_label(label, signal_names, channel_name, canonical_names)
    #   will not match.  the fix for this is to preprocess signal headers before
    #   `plan_edf_to_onda_samples` to normalize known instances (after reviewing the plan)
    m = match(r"[\s\[,\]]*(?<signal>.+?)[\s,\]]*\s+(?<spec>.+)"i, label)
-    if !isnothing(m) && m[:signal] in signal_names
+    if !isnothing(m) && m[:signal] in Iterators.map(lowercase, signal_names)


Iterators.map for laziness? At that point I almost wonder if

Suggested change

-    if !isnothing(m) && m[:signal] in Iterators.map(lowercase, signal_names)
+    if !isnothing(m) && any(==(m[:signal]) \circ lowercase, signal_names)

would be faster

[noblock]

yeah, good q...I wonder if this is really worth it at all actually. I'll do a bit of benchmarking and try to figure out

quick-and-dirty benchmarks: this method is slower than what's on main by about 20%. using any(==) is marginally slower still

julia> master
BenchmarkTools.Trial: 1121 samples with 1 evaluation per sample.
 Range (min … max):  4.144 ms … 7.371 ms    ┊ GC (min … max): 0.00% … 40.79%
 Time  (median):     4.300 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.459 ms ± 665.456 μs  ┊ GC (mean ± σ):  3.99% ± 9.35%
 [histogram omitted: log(frequency) by time, 4.14 ms to 6.92 ms]
 Memory estimate: 3.20 MiB, allocs estimate: 68903.

julia> pr
BenchmarkTools.Trial: 873 samples with 1 evaluation per sample.
 Range (min … max):  5.326 ms … 8.560 ms    ┊ GC (min … max): 0.00% … 33.95%
 Time  (median):     5.463 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.722 ms ± 688.035 μs  ┊ GC (mean ± σ):  4.51% ± 8.85%
 [histogram omitted: frequency by time, 5.33 ms to 7.61 ms]
 Memory estimate: 5.83 MiB, allocs estimate: 134608.

julia> any
BenchmarkTools.Trial: 846 samples with 1 evaluation per sample.
 Range (min … max):  5.467 ms … 10.615 ms   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.575 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.909 ms ± 874.562 μs  ┊ GC (mean ± σ):  5.50% ± 10.32%
 [histogram omitted: frequency by time, 5.47 ms to 8.56 ms]
 Memory estimate: 5.83 MiB, allocs estimate: 134608.

I'm not familiar with the \circ syntax - some brief internet research suggests it is used for function composition, but I'm not clear how the actual operations differ in a way that would make it faster than Iterators.map

surprisingly canonicalizing the whole label set ahead of time is even slower

julia> fmap_b = @benchmark plan_edf_to_onda_samples($edf)
BenchmarkTools.Trial: 651 samples with 1 evaluation per sample.
 Range (min … max):  7.161 ms … 10.735 ms   ┊ GC (min … max): 0.00% … 27.32%
 Time  (median):     7.358 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.683 ms ± 903.901 μs  ┊ GC (mean ± σ):  4.13% ± 8.41%
 [histogram omitted: log(frequency) by time, 7.16 ms to 10.3 ms]
 Memory estimate: 5.50 MiB, allocs estimate: 118133.

> I'm not familiar with the \circ syntax - some brief internet research suggests it is used for function composition, but I'm not clear how the actual operations differ in a way that would make it faster than Iterators.map

Don't wanna put words in Phillip's mouth, but I think the idea is that any will definitely "short circuit" (it stops at the first thing that returns true). I think in will as well, and benchmarking seems to bear that out (at least as far as we can tell :).
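
For readers with the same question about `\circ`: in the Julia REPL and most editors, `\circ` followed by Tab produces `∘`, the function composition operator from Base. The sketch below uses made-up labels and a hypothetical counting wrapper (`counting_lowercase`) to show that the two spellings from the suggested change give the same answer and that both stop at the first match. It says nothing about their relative speed.

```julia
signal_names = ["EEG", "ECG", "EMG"]   # made-up example labels
target = "ecg"

# `==(target)` is the one-argument function `x -> x == target`;
# `f ∘ g` is the composition `x -> f(g(x))`.
is_target = ==(target) ∘ lowercase

via_in  = target in Iterators.map(lowercase, signal_names)
via_any = any(is_target, signal_names)

# Hypothetical helper: counts how many labels actually get lowercased,
# so that short-circuiting becomes observable.
calls = Ref(0)
counting_lowercase(s) = (calls[] += 1; lowercase(s))

any(==(target) ∘ counting_lowercase, signal_names)
calls_any = calls[]      # labels visited before `any` stopped

calls[] = 0
target in Iterators.map(counting_lowercase, signal_names)
calls_in = calls[]       # labels visited before `in` stopped
```

Here the match is the second label, so both counters end at 2 rather than 3: neither form lowercases the labels after the first hit.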

Project.toml

palday · 2025-02-21T19:48:11Z

I would go ahead and bump julia compat to 1.10 (current LTS) and change the CI matrix to use 'min' instead of an explicit lower bound

palday

small performance hit isn't a big deal IMHO -- given that anything here is going to involve a fair amount of I/O, I don't imagine that an extra ms even in something called a few dozen times is going to be the dominant factor

kleinschmidt · 2025-02-21T20:52:12Z

> small performance hit isn't a big deal IMHO -- given that anything here is going to involve a fair amount of I/O, I don't imagine that an extra ms even in something called a few dozen times is going to be the dominant factor

yeah I think it starts to become meaningful if you're iterating through tens of thousands of EDFs, but the run-time is already non-trivial there.

kleinschmidt added 2 commits February 21, 2025 14:38

canonicalize input labels to lowercase before matching during import

9663dfe

bump

1493fa8

kleinschmidt requested review from palday and rebareh February 21, 2025 19:40

Merge remote-tracking branch 'origin/master' into dfk/lowercase-labels

956602a

palday reviewed Feb 21, 2025

kleinschmidt added 3 commits February 21, 2025 14:44

use

6c62597

set up tests

4cb9416

targets

b73f5a2

kleinschmidt added 3 commits February 21, 2025 15:23

set up

5ea06de

lts

20b8e66

tidy up

be6247d

kleinschmidt requested a review from palday February 21, 2025 20:51

palday approved these changes Feb 21, 2025

kleinschmidt merged commit a32ffe5 into master Feb 21, 2025
14 checks passed

kleinschmidt deleted the dfk/lowercase-labels branch February 21, 2025 20:52
