Commit ef28e8b

Update release notes
1 parent e9cefee commit ef28e8b

19 files changed: +67 -41 lines changed

README.md

Lines changed: 8 additions & 8 deletions
@@ -790,7 +790,7 @@ including mean, median and standard deviation.
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/lightgreen.png"> 67</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 99</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 96</td>
-<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 96</td>
+<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 97</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 99</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 98</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 99</td>
@@ -3252,8 +3252,8 @@ including mean, median and standard deviation.
 </tr>
 <tr>
 <td>Ukrainian</td>
-<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 92</td>
-<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 86</td>
+<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 94</td>
+<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 88</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 83</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 91</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 95</td>
@@ -3262,7 +3262,7 @@ including mean, median and standard deviation.
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 81</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/lightgreen.png"> 77</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/lightgreen.png"> 78</td>
-<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 84</td>
+<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 85</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/lightgreen.png"> 75</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/lightgreen.png"> 66</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/lightgreen.png"> 78</td>
@@ -3272,7 +3272,7 @@ including mean, median and standard deviation.
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/lightgreen.png"> 62</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/yellow.png"> 46</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/lightgreen.png"> 62</td>
-<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 97</td>
+<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 98</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 92</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 85</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 94</td>
@@ -3282,9 +3282,9 @@ including mean, median and standard deviation.
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 83</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 88</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/lightgreen.png"> 75</td>
-<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 95</td>
-<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 93</td>
-<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 97</td>
+<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 99</td>
+<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 96</td>
+<td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 98</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 100</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 100</td>
 <td><img src="https://raw.githubusercontent.com/pemistahl/lingua-py/pure-python-impl/images/green.png"> 100</td>

RELEASE_NOTES.md

Lines changed: 12 additions & 1 deletion
@@ -1,4 +1,4 @@
-## Lingua 1.3.5 (released on 08 Dec 2023)
+## Lingua 1.3.5 (released on 03 Apr 2024)

 ### Improvements

@@ -8,6 +8,17 @@
   performance was much too slow with the former approach, this change makes
   sense because adding more memory is quite cheap.

+- The language model files are now compressed with the Brotli algorithm which
+  reduces the file size by 15 %, on average.
+
+- The characters `Щщ` are now correctly identified as possible indicators for
+  the Ukrainian language, leading to slightly higher accuracy when identifying
+  Ukrainian texts.
+
+### Miscellaneous
+
+- All dependencies have been updated to their latest versions.
+
 ## Lingua 1.3.4 (released on 07 Nov 2023)

 ### Miscellaneous

accuracy-reports/aggregated-accuracy-values.csv

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@ Belarusian,NaN,NaN,NaN,NaN,76,42,87,99,84,67,86,100,85,69,87,99,85,69,88,98,92,8
 Bengali,NaN,NaN,NaN,NaN,63,19,69,99,99,98,99,99,92,92,88,97,98,94,99,100,98,94,99,100,98,94,99,100,100,100,100,100,100,100,100,100,100,100,100,100
 Bokmal,50,15,45,90,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,13,3,12,23,NaN,NaN,NaN,NaN,69,53,70,85,75,55,77,91,NaN,NaN,NaN,NaN,50,27,47,75,58,39,59,77
 Bosnian,NaN,NaN,NaN,NaN,19,4,15,36,33,19,28,52,5,2,4,8,9,9,10,8,54,54,64,44,65,54,76,64,NaN,NaN,NaN,NaN,29,23,29,36,35,29,35,41
-Bulgarian,68,44,67,91,66,32,72,93,70,45,66,98,67,46,62,93,78,56,81,99,89,80,88,98,92,83,95,99,72,50,68,96,78,56,81,96,87,70,91,99
+Bulgarian,68,44,67,91,66,32,72,93,70,45,66,98,67,46,62,93,78,56,81,99,89,80,88,98,92,83,95,99,72,50,68,97,78,56,81,96,87,70,91,99
 Catalan,59,32,62,81,38,4,30,79,48,19,42,84,38,5,29,81,57,33,57,83,63,42,63,85,66,44,67,88,55,26,52,87,58,33,60,82,70,51,74,87
 Chinese,NaN,NaN,NaN,NaN,33,NaN,2,98,92,92,83,100,96,90,97,100,71,46,68,100,71,46,68,100,71,46,68,100,64,39,55,97,100,100,100,100,100,100,100,100
 Croatian,NaN,NaN,NaN,NaN,51,34,47,73,42,26,42,58,48,16,38,90,47,28,42,72,72,62,79,76,81,64,87,93,73,50,71,98,60,36,57,86,73,53,74,90
@@ -67,7 +67,7 @@ Thai,NaN,NaN,NaN,NaN,100,100,100,100,99,100,100,98,100,100,100,100,100,100,100,1
 Tsonga,NaN,NaN,NaN,NaN,61,19,68,97,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,72,46,73,97,84,66,89,98
 Tswana,NaN,NaN,NaN,NaN,56,17,57,94,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,71,44,73,96,84,65,88,99
 Turkish,76,55,78,96,66,30,71,97,69,41,70,97,67,50,67,84,86,70,88,100,86,70,88,100,86,70,88,100,82,62,84,100,87,71,91,99,94,84,98,100
-Ukrainian,78,62,75,97,77,46,88,99,81,62,83,98,76,54,77,96,91,78,94,100,95,90,95,100,98,94,98,100,83,66,85,97,86,75,92,93,92,84,97,95
+Ukrainian,78,62,75,97,77,46,88,99,81,62,83,98,76,54,77,96,91,78,94,100,95,90,95,100,98,94,98,100,83,66,85,98,88,75,92,96,94,85,98,99
 Urdu,NaN,NaN,NaN,NaN,61,8,75,99,61,39,53,92,58,30,46,99,63,40,50,99,75,59,68,99,80,68,74,99,83,67,84,97,79,65,78,94,90,80,94,96
 Vietnamese,NaN,NaN,NaN,NaN,63,NaN,90,100,66,26,74,99,86,65,93,100,89,71,97,100,89,71,97,100,89,71,97,100,93,81,97,100,87,75,87,98,91,79,94,99
 Welsh,69,58,60,90,72,34,85,98,69,43,66,98,49,11,39,95,64,35,61,96,69,41,71,96,72,46,74,97,85,69,88,99,82,61,87,99,91,78,96,99
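
To see which fields of the Ukrainian row this hunk touches, the old and new rows can be compared position by position. The following is a minimal sketch; `changed_columns` is a hypothetical helper, not part of the repository, and because the CSV header is not shown in the hunk, the sketch reports field positions only (position 0 is the language name).

```python
# Old and new Ukrainian rows, copied from the hunk above.
OLD_ROW = (
    "Ukrainian,78,62,75,97,77,46,88,99,81,62,83,98,76,54,77,96,91,78,94,100,"
    "95,90,95,100,98,94,98,100,83,66,85,97,86,75,92,93,92,84,97,95"
)
NEW_ROW = (
    "Ukrainian,78,62,75,97,77,46,88,99,81,62,83,98,76,54,77,96,91,78,94,100,"
    "95,90,95,100,98,94,98,100,83,66,85,98,88,75,92,96,94,85,98,99"
)


def changed_columns(old_row: str, new_row: str) -> list[tuple[int, str, str]]:
    """Return (field position, old value, new value) for every differing field."""
    old_fields = old_row.split(",")
    new_fields = new_row.split(",")
    if len(old_fields) != len(new_fields):
        raise ValueError("rows have different numbers of fields")
    return [
        (position, old, new)
        for position, (old, new) in enumerate(zip(old_fields, new_fields))
        if old != new
    ]


# Hypothetical usage: list every field that differs between the two rows.
for position, old, new in changed_columns(OLD_ROW, NEW_ROW):
    print(f"field {position}: {old} -> {new}")
```

Seven fields differ, which matches the seven changed Ukrainian cells in the README.md hunks of this commit.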

accuracy-reports/lingua-high-accuracy/Bulgarian.txt

Lines changed: 3 additions & 3 deletions
@@ -1,10 +1,10 @@
 ##### Bulgarian #####

->>> Accuracy on average: 86.80%
+>>> Accuracy on average: 86.70%

 >> Detection of 1000 single words (average length: 8 chars)
-Accuracy: 70.20%
-Erroneously classified as Macedonian: 12.80%, Russian: 9.40%, Serbian: 4.10%, Ukrainian: 1.60%, Kazakh: 0.80%, Belarusian: 0.60%, Mongolian: 0.50%
+Accuracy: 69.90%
+Erroneously classified as Macedonian: 12.80%, Russian: 9.40%, Serbian: 4.10%, Ukrainian: 1.90%, Kazakh: 0.80%, Belarusian: 0.60%, Mongolian: 0.50%

 >> Detection of 1000 word pairs (average length: 17 chars)
 Accuracy: 91.20%

accuracy-reports/lingua-high-accuracy/Russian.txt

Lines changed: 3 additions & 3 deletions
@@ -1,10 +1,10 @@
 ##### Russian #####

->>> Accuracy on average: 89.70%
+>>> Accuracy on average: 89.67%

 >> Detection of 1000 single words (average length: 8 chars)
-Accuracy: 76.50%
-Erroneously classified as Ukrainian: 6.30%, Bulgarian: 5.50%, Serbian: 3.40%, Belarusian: 3.30%, Macedonian: 3.00%, Mongolian: 1.10%, Kazakh: 0.90%
+Accuracy: 76.40%
+Erroneously classified as Ukrainian: 6.40%, Bulgarian: 5.50%, Serbian: 3.40%, Belarusian: 3.30%, Macedonian: 3.00%, Mongolian: 1.10%, Kazakh: 0.90%

 >> Detection of 1000 word pairs (average length: 16 chars)
 Accuracy: 94.80%
Lines changed: 7 additions & 7 deletions
@@ -1,16 +1,16 @@
 ##### Ukrainian #####

->>> Accuracy on average: 92.23%
+>>> Accuracy on average: 93.77%

 >> Detection of 1000 single words (average length: 8 chars)
-Accuracy: 84.40%
-Erroneously classified as Russian: 4.90%, Serbian: 3.40%, Bulgarian: 2.60%, Belarusian: 2.20%, Macedonian: 1.40%, Mongolian: 0.90%, Kazakh: 0.20%
+Accuracy: 85.00%
+Erroneously classified as Russian: 4.50%, Serbian: 3.40%, Bulgarian: 2.40%, Belarusian: 2.20%, Macedonian: 1.40%, Mongolian: 0.90%, Kazakh: 0.20%

 >> Detection of 1000 word pairs (average length: 17 chars)
-Accuracy: 97.30%
-Erroneously classified as Russian: 1.00%, Bulgarian: 0.50%, Serbian: 0.50%, Macedonian: 0.40%, Belarusian: 0.30%
+Accuracy: 97.50%
+Erroneously classified as Russian: 0.80%, Bulgarian: 0.50%, Serbian: 0.50%, Macedonian: 0.40%, Belarusian: 0.30%

 >> Detection of 1000 sentences (average length: 108 chars)
-Accuracy: 95.00%
-Erroneously classified as Kazakh: 4.10%, Belarusian: 0.30%, Macedonian: 0.30%, Russian: 0.30%
+Accuracy: 98.80%
+Erroneously classified as Belarusian: 0.30%, Kazakh: 0.30%, Macedonian: 0.30%, Russian: 0.30%
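
The reported averages in this report are consistent with an unweighted mean of the three category accuracies. That the report generator computes them this way is an assumption; the sketch below only checks that the arithmetic reproduces the numbers shown. `average_accuracy` is a hypothetical helper.

```python
from statistics import mean

# Per-category accuracies (percent) from the Ukrainian report hunk above.
OLD = {"single words": 84.40, "word pairs": 97.30, "sentences": 95.00}
NEW = {"single words": 85.00, "word pairs": 97.50, "sentences": 98.80}


def average_accuracy(per_category: dict[str, float]) -> float:
    """Unweighted mean of the category accuracies, rounded to two decimals."""
    return round(mean(per_category.values()), 2)


old_avg = average_accuracy(OLD)
new_avg = average_accuracy(NEW)
print(f"{old_avg:.2f}% -> {new_avg:.2f}%")  # prints "92.23% -> 93.77%"
```

Most of the gain comes from the sentence category, where the share of Ukrainian sentences misclassified as Kazakh drops from 4.10% to 0.30%.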

accuracy-reports/lingua-low-accuracy/Bulgarian.txt

Lines changed: 3 additions & 3 deletions
@@ -1,10 +1,10 @@
 ##### Bulgarian #####

->>> Accuracy on average: 77.83%
+>>> Accuracy on average: 77.77%

 >> Detection of 1000 single words (average length: 8 chars)
-Accuracy: 56.40%
-Erroneously classified as Macedonian: 13.20%, Russian: 12.50%, Serbian: 6.20%, Kazakh: 3.60%, Ukrainian: 3.50%, Belarusian: 2.40%, Mongolian: 2.20%
+Accuracy: 56.20%
+Erroneously classified as Macedonian: 13.20%, Russian: 12.50%, Serbian: 6.20%, Ukrainian: 3.80%, Kazakh: 3.60%, Belarusian: 2.40%, Mongolian: 2.10%

 >> Detection of 1000 word pairs (average length: 17 chars)
 Accuracy: 80.60%

accuracy-reports/lingua-low-accuracy/Russian.txt

Lines changed: 5 additions & 5 deletions
@@ -1,14 +1,14 @@
 ##### Russian #####

->>> Accuracy on average: 78.47%
+>>> Accuracy on average: 78.37%

 >> Detection of 1000 single words (average length: 8 chars)
-Accuracy: 59.20%
-Erroneously classified as Ukrainian: 8.40%, Macedonian: 8.20%, Bulgarian: 6.60%, Serbian: 5.30%, Belarusian: 4.80%, Mongolian: 4.00%, Kazakh: 3.50%
+Accuracy: 59.10%
+Erroneously classified as Ukrainian: 8.60%, Macedonian: 8.20%, Bulgarian: 6.60%, Serbian: 5.30%, Belarusian: 4.80%, Mongolian: 4.00%, Kazakh: 3.40%

 >> Detection of 1000 word pairs (average length: 16 chars)
-Accuracy: 83.90%
-Erroneously classified as Macedonian: 4.80%, Ukrainian: 4.30%, Bulgarian: 2.60%, Serbian: 1.80%, Mongolian: 1.10%, Belarusian: 1.00%, Kazakh: 0.50%
+Accuracy: 83.70%
+Erroneously classified as Macedonian: 4.80%, Ukrainian: 4.50%, Bulgarian: 2.60%, Serbian: 1.80%, Mongolian: 1.10%, Belarusian: 1.00%, Kazakh: 0.50%

 >> Detection of 1000 sentences (average length: 65 chars)
 Accuracy: 92.30%
Lines changed: 7 additions & 7 deletions
@@ -1,16 +1,16 @@
 ##### Ukrainian #####

->>> Accuracy on average: 86.33%
+>>> Accuracy on average: 87.87%

 >> Detection of 1000 single words (average length: 8 chars)
-Accuracy: 74.70%
-Erroneously classified as Russian: 6.50%, Serbian: 5.40%, Belarusian: 3.90%, Macedonian: 3.80%, Bulgarian: 2.10%, Kazakh: 1.90%, Mongolian: 1.70%
+Accuracy: 75.30%
+Erroneously classified as Russian: 6.20%, Serbian: 5.40%, Belarusian: 3.90%, Macedonian: 3.80%, Kazakh: 1.90%, Bulgarian: 1.80%, Mongolian: 1.70%

 >> Detection of 1000 word pairs (average length: 17 chars)
-Accuracy: 91.60%
-Erroneously classified as Russian: 3.20%, Serbian: 1.90%, Belarusian: 1.00%, Bulgarian: 0.70%, Macedonian: 0.70%, Mongolian: 0.70%, Kazakh: 0.20%
+Accuracy: 91.80%
+Erroneously classified as Russian: 3.00%, Serbian: 1.90%, Belarusian: 1.00%, Bulgarian: 0.70%, Macedonian: 0.70%, Mongolian: 0.70%, Kazakh: 0.20%

 >> Detection of 1000 sentences (average length: 108 chars)
-Accuracy: 92.70%
-Erroneously classified as Kazakh: 4.10%, Macedonian: 2.10%, Russian: 0.40%, Belarusian: 0.30%, Bulgarian: 0.20%, Serbian: 0.20%
+Accuracy: 96.50%
+Erroneously classified as Macedonian: 2.10%, Russian: 0.40%, Belarusian: 0.30%, Kazakh: 0.30%, Bulgarian: 0.20%, Serbian: 0.20%
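
In each category of these reports, the accuracy and the listed misclassification shares appear to add up to 100%. The sketch below parses the new single-word lines of this report and checks that; `parse_accuracy` and `parse_errors` are hypothetical helpers written for this illustration, and the regular expressions assume single-word language names as in the lines shown.

```python
import math
import re

# New single-word lines from the Ukrainian report hunk above.
ACCURACY_LINE = "Accuracy: 75.30%"
ERRORS_LINE = (
    "Erroneously classified as Russian: 6.20%, Serbian: 5.40%, "
    "Belarusian: 3.90%, Macedonian: 3.80%, Kazakh: 1.90%, "
    "Bulgarian: 1.80%, Mongolian: 1.70%"
)


def parse_accuracy(line: str) -> float:
    """Extract the percentage from an 'Accuracy: NN.NN%' line."""
    match = re.fullmatch(r"Accuracy: (\d+(?:\.\d+)?)%", line.strip())
    if match is None:
        raise ValueError(f"not an accuracy line: {line!r}")
    return float(match.group(1))


def parse_errors(line: str) -> dict[str, float]:
    """Map each 'Language: NN.NN%' pair in the line to its percentage."""
    pairs = re.findall(r"(\w+): (\d+(?:\.\d+)?)%", line)
    return {language: float(share) for language, share in pairs}


accuracy = parse_accuracy(ACCURACY_LINE)
errors = parse_errors(ERRORS_LINE)
total = accuracy + sum(errors.values())

# Compare with a tolerance because the shares are floating-point values.
print(math.isclose(total, 100.0, abs_tol=1e-6))
```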

images/plots/barplot-average.png

5 Bytes

images/plots/barplot-sentences.png

55 Bytes

images/plots/barplot-single-words.png

1 Byte

images/plots/barplot-word-pairs.png

-8 Bytes

images/plots/boxplot-average.png

0 Bytes

images/plots/boxplot-sentences.png

26 Bytes

images/plots/boxplot-single-words.png

0 Bytes

images/plots/boxplot-word-pairs.png

0 Bytes

lingua/_constant.py

Lines changed: 10 additions & 1 deletion
@@ -80,7 +80,16 @@
 "ЁёЫыЭэ": frozenset(
     [Language.BELARUSIAN, Language.KAZAKH, Language.MONGOLIAN, Language.RUSSIAN]
 ),
-"ЩщЪъ": frozenset(
+"Щщ": frozenset(
+    [
+        Language.BULGARIAN,
+        Language.KAZAKH,
+        Language.MONGOLIAN,
+        Language.RUSSIAN,
+        Language.UKRAINIAN,
+    ]
+),
+"Ъъ": frozenset(
     [Language.BULGARIAN, Language.KAZAKH, Language.MONGOLIAN, Language.RUSSIAN]
 ),
 "Òò": frozenset(
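
The effect of splitting the `ЩщЪъ` entry can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: the `Language` enum is a reduced stand-in for lingua's own, `candidate_languages` is a hypothetical helper, and intersecting the matching sets is a simplification rather than the library's actual rule engine.

```python
from enum import Enum, auto


class Language(Enum):
    """Reduced stand-in for lingua's Language enum (assumption)."""

    BELARUSIAN = auto()
    BULGARIAN = auto()
    KAZAKH = auto()
    MONGOLIAN = auto()
    RUSSIAN = auto()
    UKRAINIAN = auto()


# The three entries as they read after this commit.
CHARS_TO_LANGUAGES = {
    "ЁёЫыЭэ": frozenset(
        [Language.BELARUSIAN, Language.KAZAKH, Language.MONGOLIAN, Language.RUSSIAN]
    ),
    "Щщ": frozenset(
        [
            Language.BULGARIAN,
            Language.KAZAKH,
            Language.MONGOLIAN,
            Language.RUSSIAN,
            Language.UKRAINIAN,
        ]
    ),
    "Ъъ": frozenset(
        [Language.BULGARIAN, Language.KAZAKH, Language.MONGOLIAN, Language.RUSSIAN]
    ),
}


def candidate_languages(word: str) -> frozenset[Language]:
    """Intersect the language sets of every rule whose characters occur in word.

    Returns all languages when no rule matches (simplified illustration).
    """
    candidates = frozenset(Language)
    for chars, languages in CHARS_TO_LANGUAGES.items():
        if any(char in chars for char in word):
            candidates &= languages
    return candidates


# "плаваща" contains щ but not ъ, so Ukrainian stays a candidate.
print(sorted(lang.name for lang in candidate_languages("плаваща")))
```

With the old combined `ЩщЪъ` entry, a word containing `щ` would have excluded Ukrainian in this sketch, because that entry's language set did not list it.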

tests/test_detector.py

Lines changed: 7 additions & 1 deletion
@@ -463,7 +463,13 @@ def test_language_detection_with_rules(word, expected_language):
 ),
 pytest.param(
     "плаваща",
-    [Language.BULGARIAN, Language.KAZAKH, Language.MONGOLIAN, Language.RUSSIAN],
+    [
+        Language.BULGARIAN,
+        Language.KAZAKH,
+        Language.MONGOLIAN,
+        Language.RUSSIAN,
+        Language.UKRAINIAN,
+    ],
 ),
 pytest.param(
     "довършат",
