crates/korean/README.md
# ragit-korean

Ragit-korean is a very simple Korean tokenizer.

Ragit used to use [charabia](https://github.com/meilisearch/charabia) to tokenize CJK documents, but charabia caused two problems:

1. Charabia bundles CJK dictionaries in the binary, which makes the binary 70MiB bigger.
2. It silently converts 완성형 (precomposed syllable) Korean to 조합형 (decomposed jamo) Korean, which breaks TF-IDF searches without any warning.
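
The second problem can be illustrated with a minimal sketch. This is not ragit's or charabia's code: `decompose_hangul` is a hypothetical helper that applies the standard Unicode arithmetic decomposition of Hangul syllables, assuming that this is the kind of 완성형 to 조합형 conversion meant above. It shows that the converted text no longer compares equal to the original, so an exact term lookup would miss it.

```rust
// Constants from the Unicode Hangul syllable decomposition algorithm.
const S_BASE: u32 = 0xAC00; // first precomposed syllable, '가'
const L_BASE: u32 = 0x1100; // first leading-consonant jamo
const V_BASE: u32 = 0x1161; // first vowel jamo
const T_BASE: u32 = 0x11A7; // one before the first trailing-consonant jamo
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28;
const N_COUNT: u32 = V_COUNT * T_COUNT; // 588 syllables per leading consonant
const S_COUNT: u32 = 19 * N_COUNT; // 11172 precomposed syllables in total

/// Hypothetical helper: rewrites every precomposed Hangul syllable as its
/// conjoining jamo sequence and passes all other characters through.
fn decompose_hangul(text: &str) -> String {
    let mut out = String::new();
    for c in text.chars() {
        let code = c as u32;
        if (S_BASE..S_BASE + S_COUNT).contains(&code) {
            let s_index = code - S_BASE;
            let l = L_BASE + s_index / N_COUNT;
            let v = V_BASE + (s_index % N_COUNT) / T_COUNT;
            let t = T_BASE + s_index % T_COUNT;
            // All three values stay inside the Hangul Jamo block,
            // so `from_u32` cannot fail here.
            out.push(char::from_u32(l).unwrap());
            out.push(char::from_u32(v).unwrap());
            if t != T_BASE {
                // Only syllables with a final consonant get a third jamo.
                out.push(char::from_u32(t).unwrap());
            }
        } else {
            out.push(c);
        }
    }
    out
}
```

For example, `decompose_hangul("한글")` yields six jamo code points where the input has two, so `"한글" == decompose_hangul("한글")` is false even though both strings render the same on screen. A query term and an indexed term that differ in this way would not match in a lookup that compares strings exactly.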

Chunks
b3b1fafd (1st chunk of `crates/korean/README.md`)
Title: Introduction to Ragit-Korean Tokenizer
Summary
Ragit-korean is a simple Korean tokenizer developed to replace charabia, which had two issues: bundled CJK dictionaries that made the binary 70MiB bigger, and silent conversion of 완성형 Korean to 조합형 Korean, which breaks TF-IDF searches.