Skip to content

Inconsistent and non-deterministic results for find() when there are data points at exactly the threshold distance #4

@jlumpe

Description

@jlumpe

Results of find(VPTree(data, metric), query, r) may include only some of the indices for which metric(query, data[i]) == r. Exactly which subset is included varies between runs of a test script. At first I thought there might be some kind of race condition in _find_threaded but it happens in non-threaded trees as well.

Results are always the same for a given VPTree instance but vary between instances created with the same arguments. Seeding the RNG before construction removes this variation, so it must be due to the random shuffle of the data before building the tree.

Reproducing example:

using VPTrees
using DataStructures: counter


function build_and_find(data, metric, query, r, threaded)
    tree = VPTree(data, metric, threaded=threaded)
    return sort(find(tree, query, r))
end


hamming(a, b) = count_ones(xor(a, b))

query = 0b00000001
data = [
    # 3 @ D == 1
    0b0000000,
    0b0010001,
    0b0000101,
    # 3 @ D == 2
    0b0000100,
    0b0101001,
    0b0010011,
    # 4 @ D == 3
    0b1001000,
    0b0010010,
    0b0011011,
    0b1100101,
    # 10 @ D >= 4
    0b01011000,
    0b11101010,
    0b11011000,
    0b11001110,
    0b00111100,
    0b01100100,
    0b11100000,
    0b00111010,
    0b01001010,
    0b00111010,
]

print(hamming.(query, data))
[1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 6, 5, 6, 5, 4, 4, 5, 4, 5]
counter(build_and_find(data, hamming, query, 3, false) for _ in 1:10000)
DataStructures.Accumulator{Array{Int64,1},Int64} with 11 entries:
  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] => 7616
  [1, 2, 3, 4, 5, 6, 7, 8, 10]    => 232
  [1, 2, 3, 4, 5, 6, 7, 10]       => 83
  [1, 2, 3, 4, 5, 6, 7, 9, 10]    => 585
  [1, 2, 3, 4, 5, 6, 8, 10]       => 13
  [1, 2, 3, 4, 5, 6, 7, 8]        => 47
  [1, 2, 3, 4, 5, 6, 7, 9]        => 15
  [1, 2, 3, 4, 5, 6, 7, 8, 9]     => 474
  [1, 2, 3, 4, 5, 6, 8, 9]        => 241
  [1, 2, 3, 4, 5, 6, 9, 10]       => 26
  [1, 2, 3, 4, 5, 6, 8, 9, 10]    => 668
counter(build_and_find(data, hamming, query, 3, true) for _ in 1:10000)
DataStructures.Accumulator{Array{Int64,1},Int64} with 11 entries:
  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] => 7564
  [1, 2, 3, 4, 5, 6, 7, 8, 10]    => 241
  [1, 2, 3, 4, 5, 6, 7, 10]       => 77
  [1, 2, 3, 4, 5, 6, 7, 9, 10]    => 574
  [1, 2, 3, 4, 5, 6, 8, 10]       => 21
  [1, 2, 3, 4, 5, 6, 7, 8]        => 51
  [1, 2, 3, 4, 5, 6, 7, 9]        => 23
  [1, 2, 3, 4, 5, 6, 7, 8, 9]     => 499
  [1, 2, 3, 4, 5, 6, 8, 9]        => 256
  [1, 2, 3, 4, 5, 6, 9, 10]       => 44
  [1, 2, 3, 4, 5, 6, 8, 9, 10]    => 650

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions