Evaluating Sentence Segmentation and Word Tokenization Systems on Estonian Web Texts

    1. [1] University of Tartu, Tartu linn, Estonia

  • Source: Human Language Technologies – The Baltic Perspective: Proceedings of the Ninth International Conference Baltic HLT 2020 / edited by Andrius Utka, Jurgita Vaičenonienė, Jolanta Kovalevskaitė, Danguolė Kalinauskaitė, 2024, ISBN 978-1-64368-116-0, pp. 174-181
  • Language: English
  • Abstract
    • Texts obtained from the web are noisy and do not necessarily follow orthographic rules for sentence and word boundaries. Sentence segmentation and word tokenization systems developed on well-formed texts may therefore perform worse on unedited web texts. In this paper, we first describe the manual annotation of sentence boundaries in an Estonian web dataset and then present evaluation results for three existing sentence segmentation and word tokenization systems on this corpus: EstNLTK, Stanza, and UDPipe. EstNLTK achieves the highest sentence segmentation performance on this dataset, while the sentence segmentation performance of Stanza and UDPipe remains well below the results they obtain on the more well-formed Estonian UD test set.
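
The abstract does not state which metric the evaluation uses. As an illustration only, the sketch below assumes a common boundary-level scheme: each sentence end is mapped to a character offset (whitespace ignored, so differing tokenization does not shift offsets), and the predicted boundaries are scored against the gold ones with precision, recall and F1. The function names `boundary_offsets` and `segmentation_scores` and the sample sentences are hypothetical; they come neither from the paper nor from EstNLTK, Stanza or UDPipe.

```python
def boundary_offsets(sentences):
    """Map each sentence end to a character offset, ignoring all whitespace."""
    offsets = set()
    position = 0
    for sentence in sentences:
        # Count only non-whitespace characters so spacing differences
        # between gold and predicted segmentations do not matter.
        position += len("".join(sentence.split()))
        offsets.add(position)
    return offsets


def segmentation_scores(gold_sentences, predicted_sentences):
    """Return (precision, recall, F1) over sentence-end boundaries."""
    gold = boundary_offsets(gold_sentences)
    predicted = boundary_offsets(predicted_sentences)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical web-style text without sentence-final punctuation.
    gold = ["Tere !", "kuidas sul on", "ok :)"]
    predicted = ["Tere !", "kuidas sul on ok :)"]  # one boundary missed
    print(segmentation_scores(gold, predicted))
```

Both segmentations must cover the same underlying text for the offsets to be comparable. A system that drops or rewrites characters would need an alignment step first, which this sketch does not attempt.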

