BudouX implementation in C

This is a simple implementation of BudouX word boundary breaking for East Asian languages. The code aims to be a faithful implementation of the original algorithm and uses the models from the original project.

Usage

char const sentence[] = "私はその人を常に先生と呼んでいた。\n"
                        "だからここでもただ先生と書くだけで本名は打ち明けない。\n"
                        "これは世間を憚かる遠慮というよりも、その方が私にとって自然だからである。";


// Init iterator, does not allocate, no need to tear down.
boundary_iterator_t iter = boundary_iterator_init_ja_utf8(sentence, -1);

// Iterate and print all ranges.
int32_t range_start = 0, range_end = 0;
while (boundary_iterator_next(&iter, &range_start, &range_end)) {
    for (int i = range_start; i < range_end; i++)
        printf("%c", sentence[i]);
    printf("|");
}

Outputs:

私は|その|人を|常に|先生と|呼んで|いた。|
だから|ここでも|ただ先生と|書くだけで|本名は|打ち明けない。|
これは|世間を|憚かる|遠慮と|いうよりも、|その方が|私に|とって|自然だからである。|
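
Judging from the loop in the usage example, which indexes sentence one char at a time, the ranges appear to be byte offsets into the UTF-8 buffer rather than character counts. Under that assumption, a range can be printed or copied in one call instead of byte by byte. The sketch below is not part of the library: copy_range and print_range are hypothetical helper names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of the library.
   Copies the byte range [start, end) of `text` into `out` and
   null-terminates it. Returns the number of bytes copied, or -1
   if `out` is too small to hold the range plus the terminator. */
int32_t copy_range(const char *text, int32_t start, int32_t end,
                   char *out, size_t out_size)
{
    assert(text != NULL && out != NULL);
    assert(start >= 0 && end >= start);

    size_t len = (size_t)(end - start);
    if (len + 1 > out_size)
        return -1;

    memcpy(out, text + start, len);
    out[len] = '\0';
    return (int32_t)len;
}

/* Hypothetical helper, not part of the library.
   Writes one range followed by a '|' separator. The "%.*s" precision
   limits the output to exactly (end - start) bytes, so every byte of a
   multi-byte UTF-8 character is written by the same call. */
void print_range(FILE *stream, const char *text, int32_t start, int32_t end)
{
    fprintf(stream, "%.*s|", (int)(end - start), text + start);
}
```

Inside the while loop of the usage example, the inner for loop could then be replaced by print_range(stdout, sentence, range_start, range_end). This only holds if the iterator never splits a multi-byte character, which the sample output suggests but which is treated here as an assumption.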

Models

The models folder contains JSON files from the BudouX project. They have been converted to C headers using the convert.py script:

python .\convert.py .\models\zh-hant.json zh_hant model_zh_hant.h

Similar Project

budoux-c
