Skip to content

Efficiently store and search for emojis #12189

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
roryabraham opened this issue Oct 27, 2022 · 60 comments
Closed

Efficiently store and search for emojis #12189

roryabraham opened this issue Oct 27, 2022 · 60 comments
Assignees
Labels
Engineering Internal Requires API changes or must be handled by Expensify staff Monthly KSv2 NewFeature Something to build that is a new item.

Comments

@roryabraham
Copy link
Contributor

roryabraham commented Oct 27, 2022

If you haven’t already, check out our contributing guidelines for onboarding and email [email protected] to request to join our Slack channel!


Action Performed:

Start typing :smile

Expected Result:

The soon-to-be-implemented emoji suggestions UI should produce not only results that begin with smile, like smile and smiley, but also things that contain smile as a substring, such as sweat_smile.

Actual Result:

Only emoji that begin with smile will come up in the search results.

Workaround:

n/a

Platform:

All

View all open jobs on GitHub

Upwork Automation - Do Not Edit
  • Upwork Job URL: https://www.upwork.com/jobs/~01b83383f908c1ef35
  • Upwork Job ID: 1613221194285375488
  • Last Price Increase: 2023-01-11
@roryabraham roryabraham added External Added to denote the issue can be worked on by a contributor Engineering Weekly KSv2 NewFeature Something to build that is a new item. labels Oct 27, 2022
@roryabraham roryabraham self-assigned this Oct 27, 2022
@melvin-bot
Copy link

melvin-bot bot commented Oct 27, 2022

Triggered auto assignment to @adelekennedy (External), see https://stackoverflow.com/c/expensify/questions/8582 for more details.

@melvin-bot melvin-bot bot added Daily KSv2 and removed Weekly KSv2 labels Oct 27, 2022
@roryabraham
Copy link
Contributor Author

cc @stitesExpensify in case you want to be co-assigned on this one as well

@melvin-bot
Copy link

melvin-bot bot commented Oct 27, 2022

Triggered auto assignment to Contributor-plus team member for initial proposal review - @rushatgabhane (External)

@melvin-bot melvin-bot bot added the Help Wanted Apply this label when an issue is open to proposals by contributors label Oct 27, 2022
@melvin-bot
Copy link

melvin-bot bot commented Oct 27, 2022

Current assignee @roryabraham is eligible for the External assigner, not assigning anyone new.

@melvin-bot melvin-bot bot changed the title Search for emojis using a suffix tree instead of a trie to get smarter matches [$250] Search for emojis using a suffix tree instead of a trie to get smarter matches Oct 27, 2022
@roryabraham
Copy link
Contributor Author

This is hinted at in the issue title, but I think that a suffix tree will be the appropriate data structure to use for this search, since the emoji dataset is so large (and will likely continue to grow over time)

@roryabraham
Copy link
Contributor Author

Also, note that this does not necessarily have to be on HOLD for #12188, because even though we don't have the UI yet, we should be able to verify the correct behavior with unit tests, which will be a hard requirement for any PR to implement this issue.

@roryabraham
Copy link
Contributor Author

@adelekennedy I think $500 is an appropriate starting value for this issue

@stitesExpensify stitesExpensify self-assigned this Oct 27, 2022
@xyclos
Copy link

xyclos commented Oct 29, 2022

Not sure how to implement a suffix tree, but there are a few really good and fast fuzzy search libs out there. One that comes to mind is fuse.js

const list = [
  {
    "text": "smile",
    "emoji": "🙂"
  },
  {
    "text": "smiley",
    "emoji": "😁"
  },
  {
    "text": "sweat_smile",
    "emoji": "😅"
  },
  {
    "text": "cheese_wedge",
    "emoji": "🧀"
  },
]

const options = {
  includeScore: true,
  shouldSort: true,
  includeMatches: true,
  findAllMatches: true,
  threshold: 0.6,
  distance: 100,
  keys: ["text"]
};

const fuse = new Fuse(list, options);

// Change the pattern
const pattern = "smile"

return fuse.search(pattern)

Result:

[
  {
    "item": {
      "text": "smile",
      "emoji": "🙂"
    },
    "refIndex": 0,
    "matches": [
      {
        "indices": [
          [
            0,
            4
          ]
        ],
        "value": "smile",
        "key": "text"
      }
    ],
    "score": 2.220446049250313e-16
  },
  {
    "item": {
      "text": "smiley",
      "emoji": "😁"
    },
    "refIndex": 1,
    "matches": [
      {
        "indices": [
          [
            0,
            4
          ]
        ],
        "value": "smiley",
        "key": "text"
      }
    ],
    "score": 0.001
  },
  {
    "item": {
      "text": "sweat_smile",
      "emoji": "😅"
    },
    "refIndex": 2,
    "matches": [
      {
        "indices": [
          [
            0,
            0
          ],
          [
            2,
            2
          ],
          [
            6,
            10
          ]
        ],
        "value": "sweat_smile",
        "key": "text"
      }
    ],
    "score": 0.06
  }
]

@leeoniya
Copy link

leeoniya commented Oct 30, 2022

there are a few really good and fast fuzzy search libs out there. One that comes to mind is fuse.js

i wouldn't call Fuse fast or good, exactly. but i've done a pretty broad comparison of fuzzy match libs in the course of building uFuzzy.

https://github.com/leeoniya/uFuzzy#performance

@melvin-bot melvin-bot bot added the Overdue label Oct 30, 2022
@rushatgabhane
Copy link
Member

rushatgabhane commented Oct 30, 2022

@xyclos quick thoughts: I don't think filtering a list of emojis is a good usecase for an approximate string match algorithm.

We're searching on the keywords array. And fuse.js uses the bittap algorithm, which computes similarity between two strings using levenshtein distance (source).

This makes the result set have 522 results when entering smile (for eg: smirk is also in the result set because it has a levenshtein dist of 2 from smile).

Imo, 522 results would confuse the user. So something more predictable and exact would be nice.
What do you think? I'm curious about your thoughts

@melvin-bot melvin-bot bot removed the Overdue label Oct 30, 2022
@rushatgabhane
Copy link
Member

rushatgabhane commented Oct 30, 2022

@leeoniya 1,030 ms for uFuzzy vs 35,600 ms for fuse.js

That's pretty impressive! ✨

image

@stitesExpensify
Copy link
Contributor

I agree with @rushatgabhane about not using a fuzzy search. For reference the Trie which is currently implemented finds emojis in 1-6ms.

I think a suffix tree as mentioned in the OP (which is very similar to a trie) is the correct data structure, and we should stick to that.

@leeoniya
Copy link

👍 excited to see what you folks come up with! i'd be interested to add a Ukkonen suffix tree implementation to the bench table as well. btw, uFuzzy is not based on string distance metrics, since that typically produces garbage results!

out of curiosity i loaded up all the emoji keywords from https://github.com/Expensify/App/blob/main/assets/emojis.js and got 0.3ms matches, or 0.7ms for out-of-order partial terms. this is with 0 mem overhead, since there's no index.

image

@stitesExpensify
Copy link
Contributor

stitesExpensify commented Oct 31, 2022

@leeoniya those results are crazy fast! If you want to create a proposal, that sounds very promising!

I am a bit confused how there is 0 memory overhead since we would still have to create the data structure right? I'm also curious how long it takes to create that structure for our use case (seems like it should be fast using Ukkonen's algo).

@leeoniya
Copy link

leeoniya commented Oct 31, 2022

the data structure is just the haystack string[]. if you dont have this array in mem already and have to concat/extract the keywords arrays, there would be some minimal additional allocs up front, but uFuzzy itself doesnt do anything to your haystack once it's made. the lib is regexp based and just processes that haystack in a linear/dumb fashion. pretty wild how fast it turned out! 🤯

EDIT: to be clear, there is obviously some overhead during searching but it's the lowest i've measured of any lib (about 20mb when searching 162k phrases, that haystack is an additional/persistent 8mb in mem but it's not an extra 8mb cause you at minimum need those strings in memory regardless)

@Karim-30
Copy link
Contributor

Karim-30 commented Nov 2, 2022

@roryabraham we already have this feature because we are inserting the emoji name and its keywords into the Trie

i.e when inserting this emoji

    {
        code: '😀',
        keywords: [
            'face',
            'grin',
        ],
        name: 'grinning',
    },

we aren't inserting a single emoji {name:'grinning', {code: '😀'}}

actually we are inserting something like this

{name:'grinning', {code: '😀'}}
{name:'face', {code: '', suggestions: [name:'grinning', {code: '😀'}]}
{name:'grin', {code: '', suggestions: [name:'grinning', {code: '😀'}]}

so when typing :fa in the composer we will see the grinning emoji

Screen.Recording.2022-11-02.at.12.26.55.PM.mov

@melvin-bot
Copy link

melvin-bot bot commented Jan 11, 2023

Current assignee @rushatgabhane is eligible for the Internal assigner, not assigning anyone new.

@rushatgabhane
Copy link
Member

rushatgabhane commented Jan 11, 2023

Totally agree we should go the design doc route

It'll be great if this can be public. I'd love to review it as well

@stitesExpensify stitesExpensify changed the title [$250] Search for emojis using a suffix tree instead of a trie to get smarter matches Efficiently store and search for emojis Jan 11, 2023
@rushatgabhane
Copy link
Member

Is this still a monthly issue?

@stitesExpensify
Copy link
Contributor

Yes, this is not currently being discussed as we're focused on the emoji reactions doc. I'll move it back to weekly once it becomes a priority again

@melvin-bot melvin-bot bot added the Overdue label Feb 23, 2023
@JmillsExpensify
Copy link

Still lower in priority compared to emoji reactions.

@melvin-bot melvin-bot bot removed the Overdue label Feb 27, 2023
@rushatgabhane
Copy link
Member

We can probably lift the hold on this one?

@stitesExpensify
Copy link
Contributor

I'm going to wait until after emojis is all wrapped up. This is still very low priority IMO since our current system works reasonably well

@rushatgabhane
Copy link
Member

let's do nothing then? what's the problem this issue is tryna solve?

@stitesExpensify
Copy link
Contributor

Our colon code results could be improved quite a bit. Specifically that codes like :smile don't show :sweat_smile

@rushatgabhane
Copy link
Member

okayy good to hear

@stitesExpensify
Copy link
Contributor

Still waiting on this for now since it's just a polish issue

@melvin-bot melvin-bot bot removed the Overdue label Apr 17, 2023
@JmillsExpensify
Copy link

For what it's worth, I have been hitting this one more and more. I agree it's a polish issue, it's probably just very easy to get used to emoji search on other apps, remember those search terms, and then come up short using our search. I'd like to see us re-prioritize this one. Could we open it up?

@JmillsExpensify
Copy link

Still not a priority.

@melvin-bot melvin-bot bot removed the Overdue label May 22, 2023
@melvin-bot melvin-bot bot added the Overdue label Jun 22, 2023
@JmillsExpensify
Copy link

Same

@melvin-bot melvin-bot bot removed the Overdue label Jun 28, 2023
@roryabraham
Copy link
Contributor Author

What do we think here? Should we have an agency pick this one up?

@roryabraham
Copy link
Contributor Author

Said differently, I think we should either move forward with this or just close it

@stitesExpensify
Copy link
Contributor

Chatting with Rory 1:1, I think we should just close this. We can come back to it when we are doing custom emojis which are going to be stored on the server and have a lot of other considerations for search etc, so doing this now would just be doubling our work. Closing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Engineering Internal Requires API changes or must be handled by Expensify staff Monthly KSv2 NewFeature Something to build that is a new item.
Projects
None yet
Development

No branches or pull requests

10 participants