Spans created by `detect_multiple_languages_of` sometimes skip the last characters #247

MikaelFDA · 2024-12-13T13:33:25Z

Hi!

We have text that mixes several languages with arbitrary characters (numbers, punctuation, etc.).
When these characters are at the end of the text, the returned spans may not cover them, so part of the text is lost.

In my case, this created a mismatch between my original document and the language-annotated document.

Here is an example to reproduce:

import lingua # 2.0.2
from lingua import Language

text = "Salut le début de la phrase est en français, then it's a little bit of english and finally x: \n\n\nx\n\n\n 1 x:4\n 4678 :::\naz 27/05/2024 \n312120 - 2024 page 5/5"

langs = [Language.FRENCH, Language.ENGLISH, Language.SPANISH]
detector = lingua.LanguageDetectorBuilder.from_languages(*langs).build()
spans = detector.detect_multiple_languages_of(text)

print("Text size=", len(text))
print("Last span=", spans[-1].end_index)

for span in spans:
    print(repr(span))

"Text size= 155"
"Last span= 152"

DetectionResult(start_index=0, end_index=45, word_count=9, language=Language.FRENCH)
DetectionResult(start_index=45, end_index=91, word_count=9, language=Language.ENGLISH)
DetectionResult(start_index=91, end_index=152, word_count=6, language=Language.FRENCH)

So my question is: is this the expected behavior or a bug?

I expected the 'lost' text to be included in the last span.
In my example: DetectionResult(start_index=91, end_index=155, word_count=6, language=Language.FRENCH)
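
For anyone hitting the same problem on an affected version, one possible workaround is to extend the last span so it reaches the end of the text. The sketch below is only an illustration under assumptions: `Span` and `cover_full_text` are hypothetical names that are not part of lingua, and `Span` merely mirrors the `start_index`, `end_index` and `language` fields visible in the output above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a detection result (not lingua's class)."""
    start_index: int
    end_index: int
    language: str


def cover_full_text(spans, text_length):
    """Return a new list where the last span is extended to text_length.

    Spans that already reach the end of the text are returned unchanged.
    """
    if not spans:
        return []
    last = spans[-1]
    if last.end_index >= text_length:
        return list(spans)
    # Copy the last span with its end index moved to the end of the text.
    return [*spans[:-1], replace(last, end_index=text_length)]


# Indices taken from the output reported above (text length 155).
spans = [
    Span(0, 45, "FRENCH"),
    Span(45, 91, "ENGLISH"),
    Span(91, 152, "FRENCH"),
]
fixed = cover_full_text(spans, 155)
print(fixed[-1].end_index)
```

With real results, the same idea would apply to the values read from each `DetectionResult`; whether the trailing characters truly belong to the last detected language is an assumption of this sketch.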

pemistahl · 2025-02-11T09:56:05Z

Hi @MikaelFDA, thanks for your report and sorry for my late reply. I confirm that this is a bug. I will provide a fix before creating the next release.

pemistahl changed the title ~~"detect_multiple_languages_of" spans sometimes skip the last caracters~~ Spans created by detect_multiple_languages_of sometimes skip the last characters Dec 30, 2024

pemistahl added the bug label Feb 11, 2025

pemistahl added this to the Lingua 1.4.1 / 2.1.0 milestone Feb 11, 2025

pemistahl added a commit that referenced this issue Feb 11, 2025

Fix incorrect end index in detection result (#247)

c983a30

pemistahl closed this as completed Feb 11, 2025
