POS-tagging a bilingual parallel corpus: methods and challenges

Irene Doval ^[1]
1. [1] Universidade de Santiago de Compostela, Santiago de Compostela, Spain
Source: Research in Corpus Linguistics (RiCL), e-ISSN 2243-4712, No. 5, 2017, pp. 35-46
Language: English

Abstract
- This paper reviews the author’s experience of tokenizing and POS-tagging a bilingual parallel corpus, the PaGeS Corpus, which consists mostly of German and Spanish fiction texts. The work is part of an ongoing effort to annotate the corpus with part-of-speech information, and the study discusses the specific problems encountered so far. On the one hand, tagger performance degrades significantly when applied to fiction; on the other, pre-existing annotation schemes are all language-specific. To improve accuracy during post-editing, the author has developed a common tagset and identified the major error patterns.