Comparing a Thai Words Segmentation Methods in the LST20 Dataset

Krittapol Damrongkamoltip; Khatcha Ruenlek; Wasit Limprasert; Prachya Boonkwan

doi:10.14456/jcct.2024.7

Authors

Krittapol Damrongkamoltip Student, Data Science and Innovation, College of Interdisciplinary Studies, Thammasat University, Pathum Thani 12121, Thailand
Khatcha Ruenlek Student, Data Science and Innovation, College of Interdisciplinary Studies, Thammasat University, Pathum Thani 12121, Thailand https://orcid.org/0009-0008-0666-2634
Wasit Limprasert Assistant Professor, Dr., Data Science and Innovation, College of Interdisciplinary Studies, Thammasat University, Pathum Thani 12121, Thailand
Prachya Boonkwan Lecturer, Dr., Language and Semantic Technology Research Team, National Electronics and Computer Technology Center, Pathum Thani 12121, Thailand

Keywords:

Thai Words, Segmentation Methods, LST20 Dataset

Abstract

In this era of globalization, where information is widely available, organizations increasingly rely on information to enhance their business. Although data is easy to obtain, natural language processing still poses challenges, especially the segmentation of Thai words: Thai text lacks clear word boundaries, which makes it difficult to identify the words in a sentence correctly. This study therefore evaluates the performance of word segmentation methods, covering both dictionary-based methods and methods that learn from data, across six techniques. The goal is to verify the literal-level accuracy and the processing time of each method and technique on the LST20 dataset, which contains 3,745 documents and covers 15 news categories. The results show that learning from data is the more efficient approach.
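To make the two evaluation criteria concrete, the sketch below segments a short string with a dictionary-based longest-matching segmenter, then scores it against a reference segmentation and times it. It is only an illustration under stated assumptions: the toy dictionary, the function names, and the boundary-level F1 score are hypothetical choices made here, not the six techniques or the exact metric used in the study.

```python
import time


def longest_match_segment(text, dictionary):
    """Greedy longest-matching segmentation (illustrative, not the paper's code).

    Characters that start no dictionary word become single-character tokens.
    """
    max_len = max(map(len, dictionary))
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            j = i + 1  # no dictionary word starts here
        tokens.append(text[i:j])
        i = j
    return tokens


def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    ends, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        ends.add(pos)
    return ends


def boundary_f1(gold_tokens, pred_tokens):
    """F1 over word-end positions (an assumed metric for this sketch)."""
    gold, pred = boundaries(gold_tokens), boundaries(pred_tokens)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy dictionary and the classic ambiguous string: "ตา|กลม" vs "ตาก|ลม".
toy_dictionary = {"ตา", "กลม", "ตาก", "ลม"}
gold = ["ตา", "กลม"]  # assumed reference segmentation for this example

start = time.perf_counter()
pred = longest_match_segment("ตากลม", toy_dictionary)
elapsed = time.perf_counter() - start

print(pred, boundary_f1(gold, pred), f"{elapsed:.6f} s")
```

On this ambiguous input the greedy dictionary lookup commits to the longer first word and misses the reference boundary, which is the kind of error that methods learning from annotated data can avoid by using context.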
