Enhancing the Turkish stop word list with additional common words #14549

HakanBayazitHabes · 2025-04-24T11:51:57Z

Description

This pull request proposes the addition of several frequently used Turkish stopwords to the stopwords.txt file. These words are commonly considered non-informative in NLP tasks and are consistent with standard Turkish stopword sets.

Words like şu, şöyle, şayet, and öz were added
Alphabetically sorted for consistency
Based on analysis of multiple Turkish NLP resources

This enhancement improves coverage and ensures better compatibility with text processing tasks involving Turkish.

lucene/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt

stefanvodita

I think we need to know more about how the list came about, and we need some evidence that the new list is better. I'm also unable to tell whether the new words can reasonably be considered stop words, but maybe a Turkish speaker could weigh in.
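One way to gather the kind of evidence requested here is to measure how often each candidate word actually occurs in a sample of Turkish text. The following is a minimal sketch under stated assumptions, not something from this pull request: the class `StopwordEvidence`, the method `documentFrequency`, and the three sample sentences are all hypothetical, and the tokenization is deliberately naive (lowercasing with Turkish locale rules and splitting on non-letters) rather than Lucene's analysis chain.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of Lucene: estimates how common candidate stop words are. */
final class StopwordEvidence {
  // Turkish locale so that 'I' lowercases to dotless 'ı' and 'İ' to 'i'.
  private static final Locale TURKISH = Locale.forLanguageTag("tr");

  /** Returns, for each candidate, the fraction of documents containing it at least once. */
  static Map<String, Double> documentFrequency(List<String> docs, List<String> candidates) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String candidate : candidates) {
      counts.put(candidate, 0);
    }
    for (String doc : docs) {
      // Naive tokenization (assumption): split on anything that is not a Unicode letter.
      Set<String> tokens = new HashSet<>();
      for (String token : doc.toLowerCase(TURKISH).split("[^\\p{L}]+")) {
        if (!token.isEmpty()) {
          tokens.add(token);
        }
      }
      for (Map.Entry<String, Integer> entry : counts.entrySet()) {
        if (tokens.contains(entry.getKey())) {
          entry.setValue(entry.getValue() + 1);
        }
      }
    }
    Map<String, Double> result = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      double fraction = docs.isEmpty() ? 0.0 : entry.getValue() / (double) docs.size();
      result.put(entry.getKey(), fraction);
    }
    return result;
  }

  public static void main(String[] args) {
    // Made-up sample sentences for illustration only; a real study needs a large corpus.
    List<String> docs = List.of(
        "Şu kitap çok güzel.",
        "Şayet gelirse haber ver.",
        "Bu şu demek değil.");
    System.out.println(documentFrequency(docs, List.of("şu", "şayet", "cumbadak")));
  }
}
```

A candidate that appears in a large share of documents is a plausible stop word; one that almost never appears neither helps nor hurts much, which may be a reason to leave it out. Any real comparison would need a representative corpus and, ideally, a retrieval-quality measurement before and after the change.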

stefanvodita · 2025-04-29T09:10:45Z

lucene/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt

@@ -1,39 +1,80 @@
-# Turkish stopwords from LUCENE-559
-# merged with the list from "Information Retrieval on Turkish Texts"
-#   (http://www.users.muohio.edu/canf/papers/JASIST2008offPrint.pdf)


We should have something similar specifying how we sourced the list.

bahadirborasahin · 2025-04-29T10:38:47Z

@stefanvodita

I think we need to know more about how the list came about, and we need some evidence that the new list is better. I'm also unable to tell whether the new words can reasonably be considered stop words, but maybe a Turkish speaker could weigh in.

I had concerns earlier about malformed entries like keţke or onlarýn, but they seem to be fixed in this revision. The suggested words make sense as Turkish stopwords. However, if it matters, some of them occur very rarely, for example cuppadak, cumburlok, and cumbadak.

@HakanBayazitHabes

Based on analysis of multiple Turkish NLP resources

Can we give references to those studies? Specifically, I am wondering whether we should add all kinds of adverbs (zarf) as stopwords, since they can provide context (for example doğru: accurate/true/factual).
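Malformed entries such as keţke or onlarýn, mentioned earlier in this comment, can be caught mechanically by checking every entry against the Turkish alphabet. The sketch below is an illustration only: `StopwordListCheck`, `isWellFormed`, and `findMalformed` are hypothetical names, and the allowed character set (the 29 lowercase Turkish letters plus the circumflexed vowels â, î, û) is an assumption that may need adjusting for the real list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Lucene: flags entries with non-Turkish characters. */
final class StopwordListCheck {
  // Assumption: lowercase Turkish alphabet plus circumflexed vowels used in some words.
  private static final String ALLOWED = "abcçdefgğhıijklmnoöprsştuüvyzâîû";

  /** True if the word is non-empty and every character is in the allowed set. */
  static boolean isWellFormed(String word) {
    if (word.isEmpty()) {
      return false;
    }
    for (int i = 0; i < word.length(); i++) {
      if (ALLOWED.indexOf(word.charAt(i)) < 0) {
        return false;
      }
    }
    return true;
  }

  /** Returns suspicious entries, skipping blank lines and '#' comment lines. */
  static List<String> findMalformed(List<String> lines) {
    List<String> malformed = new ArrayList<>();
    for (String line : lines) {
      String entry = line.trim();
      if (entry.isEmpty() || entry.startsWith("#")) {
        continue;
      }
      if (!isWellFormed(entry)) {
        malformed.add(entry);
      }
    }
    return malformed;
  }

  public static void main(String[] args) {
    List<String> lines = List.of("# sample header", "keşke", "keţke", "onların", "onlarýn", "", "öz");
    System.out.println(findMalformed(lines));
  }
}
```

A check like this only catches encoding damage; it says nothing about whether a well-formed word deserves to be a stop word, which still needs a Turkish speaker or corpus evidence.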

Added some words to stopwords.txt file

484a8f4

github-project-automation bot added this to OpenSearch Lucene & Core Performance Tracking Apr 24, 2025

github-project-automation bot moved this to Open in OpenSearch Lucene & Core Performance Tracking Apr 24, 2025

github-actions bot added the module:analysis label Apr 24, 2025

stefanvodita reviewed Apr 25, 2025

Fix stopwords.txt: Alphabetical order corrected, malformed entries normalized

7dd72de

stefanvodita reviewed Apr 29, 2025
