Skip to content

Remove CJK language setting and fix reading time on mixed text. Fixes #10031 #10032

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

DeinAlptraum
Copy link

@DeinAlptraum DeinAlptraum commented Jun 18, 2022

This removes the hasCJKLanguage setting, instead determining it dynamically on a word-by-word basis by checking for the presence of CJK characters.

This fixes the reading time for mixed texts, by computing separete word counts for CJK and non-CJK texts, then computing their reading times separately via the two formulas, and finally summing them up.

@CLAassistant
Copy link

CLAassistant commented Jun 18, 2022

CLA assistant check
All committers have signed the CLA.

@github-actions
Copy link

This PR has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help.
Please check https://github.com/gohugoio/hugo/blob/master/CONTRIBUTING.md#code-contribution and verify that this code contribution fits with the description. If yes, tell is in a comment.
This PR will automatically be closed in the near future if no further activity occurs. Thank you for all your contributions.

@github-actions github-actions bot added the Stale label Jun 19, 2023
@DeinAlptraum
Copy link
Author

Yes, I believe it fits the description, including closing an open issue.

@github-actions github-actions bot removed the Stale label Jun 20, 2023
@DeinAlptraum DeinAlptraum force-pushed the fix-reading-time-mixed-cjk branch from b200173 to fde4893 Compare May 23, 2025 09:23
@DeinAlptraum DeinAlptraum changed the title Fix reading time on mixed CJK/non-CJK text. Fixes #10031 Remove CJK language setting and fix reading time on mixed CJK/non-CJK text. Fixes #10031 May 23, 2025
@DeinAlptraum DeinAlptraum changed the title Remove CJK language setting and fix reading time on mixed CJK/non-CJK text. Fixes #10031 Remove CJK language setting and fix reading time on mixed text. Fixes #10031 May 23, 2025
@DeinAlptraum
Copy link
Author

I have updated this branch to resolve conflicts, and have also improved the implementation slightly.

@jmooring
Copy link
Member

I'm not a linguistics expert, but my understanding is that CJK languages do have explicit word separators like spaces in English. But we're relying on strings.Fields to separate words, so I'm not sure how this addresses #10031.

cc: @davidsneighbour

@DeinAlptraum
Copy link
Author

@jmooring no, there are no word separators in Chinese/Japanese. The concept of a "word" is not even clearly defined in those languages, which is why counting characters is the best one can do.
I just looked up Korean and it seems that they do use spaces. We could use a different formula here or only treat CJ differently if you prefer.

In either case, I don't think this affects the resolution of #10031: the problem is that, with hasCJKLanguage activated, any reading time is computed as if the text consists of only CJK characters. As an example, a text that contains one CJK character and 10,000 non-CJK words currently results in a reading time of about 20mins (rather than the expected 47mins) because it uses the CJK formula, rather than "501 runes per minute" for the one CJK character and "213 words per minute" for the remaining 10,000 words
Splitting at spaces for CJK languages is not an issue even if they have spaces (like Korean) because we only count the number of runes, so this doesn't affect the result anyway.

This PR resolves that issue with mixed text, by computing separate counts for CJK characters as rune count and everything else via word count. This is still not perfect obviously (is Korean read at a similar speed as Chinese? what about Arabic anyway? etc.) but it's an improvement over the current situation at least, and eliminates an unnecessary setting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants