Abstract of Dutch Parallel Corpus: A Balanced Copyright-Cleared Parallel Corpus

Lieve Macken, Orphée De Clercq, H. Paulussen

This paper presents the Dutch Parallel Corpus, a high-quality parallel corpus for Dutch, French and English consisting of more than ten million words. The corpus contains five different text types and is balanced with respect to text type and translation direction. All texts included in the corpus have been cleared of copyright. We discuss the importance of parallel corpora in various research domains and contrast the Dutch Parallel Corpus with existing parallel corpora. The Dutch Parallel Corpus distinguishes itself from other parallel corpora by its balanced composition and, thanks to its copyright clearance, by its availability to the wider research community. All texts in the corpus are sentence-aligned and further enriched with basic linguistic annotations (lemmas and word class information). Approximately 25,000 words of the Dutch-English part have been manually aligned at the sub-sentential level. Rich metadata makes the corpus easier to navigate and enables users to select the texts that meet their needs. The entire corpus is released as full texts in XML format and is also available via a web interface, which supports basic and complex search queries and presents the results as parallel concordances. The corpus will be distributed by the Flemish-Dutch Human Language Technology Agency (TST-Centrale).

Article outline

1. Introduction
2. Parallel Corpora in Translation Studies
2.1. Parallel Corpus Projects
3. Corpus Design, Copyright Clearance and Metadata
4. Alignment and Linguistic Annotation
4.1. Sentence Alignment
4.2. Sub-sentential Alignment
4.3. Linguistic Annotation
5. Corpus Exploitation
6. Conclusion
