Spans created by detect_multiple_languages_of sometimes skip the last characters #247

Closed
MikaelFDA opened this issue Dec 13, 2024 · 1 comment
Labels
bug Something isn't working

Comments

MikaelFDA commented Dec 13, 2024

Hi!

We have a sentence that mixes several languages and also contains arbitrary characters (numbers, punctuation, etc.).
When these characters are at the end of the text, the returned spans may not cover them, so part of the text is lost.

In my case, this created a mismatch between my original document and the language-annotated document.

Here is an example to reproduce:

import lingua # 2.0.2
from lingua import Language

text = "Salut le début de la phrase est en français, then it's a little bit of english and finally x: \n\n\nx\n\n\n 1 x:4\n 4678 :::\naz 27/05/2024 \n312120 - 2024 page 5/5"

langs = [Language.FRENCH, Language.ENGLISH, Language.SPANISH]
detector = lingua.LanguageDetectorBuilder.from_languages(*langs).build()
spans = detector.detect_multiple_languages_of(text)

print("Text size=", len(text))
print("Last span=", spans[-1].end_index)

for span in spans:
    print(repr(span))

Output:

Text size= 155
Last span= 152

DetectionResult(start_index=0, end_index=45, word_count=9, language=Language.FRENCH)
DetectionResult(start_index=45, end_index=91, word_count=9, language=Language.ENGLISH)
DetectionResult(start_index=91, end_index=152, word_count=6, language=Language.FRENCH)
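
A check along the following lines can flag such a gap programmatically. This is only a sketch: `Span` is a hypothetical stand-in for lingua's `DetectionResult` that carries just the two indices, and `uncovered_ranges` is an illustrative helper, not part of lingua.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for lingua's DetectionResult (indices only)."""
    start_index: int
    end_index: int


def uncovered_ranges(text_length, spans):
    """Return the (start, end) ranges of the text that no span covers."""
    gaps = []
    cursor = 0  # first index not yet covered by any span
    for span in sorted(spans, key=lambda s: s.start_index):
        if span.start_index > cursor:
            gaps.append((cursor, span.start_index))
        cursor = max(cursor, span.end_index)
    if cursor < text_length:
        # Trailing characters that the last span does not reach.
        gaps.append((cursor, text_length))
    return gaps


# Indices taken from the output above.
reported = [Span(0, 45), Span(45, 91), Span(91, 152)]
print(uncovered_ranges(155, reported))  # [(152, 155)]
```

Because the helper only reads `start_index` and `end_index`, it should also accept the objects returned by `detect_multiple_languages_of` directly, assuming they expose those two attributes as the printed output suggests.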

So my question is: is this the expected behavior or a bug?

I expected the "lost" text to be included in the last span.
In my example: DetectionResult(start_index=91, end_index=155, word_count=6, language=Language.FRENCH)
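
As a caller-side workaround, the last span could be stretched to the end of the text after detection. The sketch below is an assumption-laden illustration, not lingua's behavior or its eventual fix: `cover_tail` is a hypothetical helper, it returns plain tuples rather than `DetectionResult` objects, and it assumes the indices are character offsets into the Python string.

```python
from types import SimpleNamespace


def cover_tail(text, spans):
    """Return (start, end, language) tuples with the last span stretched to len(text).

    Hypothetical helper: it does not modify or construct lingua objects.
    """
    padded = [(s.start_index, s.end_index, s.language) for s in spans]
    if padded and padded[-1][1] < len(text):
        start, _, language = padded[-1]
        padded[-1] = (start, len(text), language)
    return padded


# Stand-ins for DetectionResult objects, using the indices reported in this issue.
fake_spans = [
    SimpleNamespace(start_index=0, end_index=45, language="FRENCH"),
    SimpleNamespace(start_index=45, end_index=91, language="ENGLISH"),
    SimpleNamespace(start_index=91, end_index=152, language="FRENCH"),
]
placeholder_text = "x" * 155  # same length as the text in the example

print(cover_tail(placeholder_text, fake_spans)[-1])  # (91, 155, 'FRENCH')
```

With the real library, the call would presumably be `cover_tail(text, detector.detect_multiple_languages_of(text))`. Note that this attributes the trailing characters to the language of the last span, which matches the expectation stated here but may not suit every use case.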

@pemistahl pemistahl changed the title "detect_multiple_languages_of" spans sometimes skip the last caracters Spans created by detect_multiple_languages_of sometimes skip the last characters Dec 30, 2024
@pemistahl pemistahl added the bug Something isn't working label Feb 11, 2025
@pemistahl
Owner

Hi @MikaelFDA, thanks for your report and sorry for my late reply. I confirm that this is a bug. I will provide a fix before creating the next release.
