Challenges of combining structured and unstructured data in corpus development

Tanja Säily [1]; Jukka Tyrkkö [2]
1. [1] University of Helsinki, Helsinki, Finland
2. [2] Linnaeus University, Sweden
Published in: Research in Corpus Linguistics (RiCL), e-ISSN 2243-4712, Vol. 9, No. Extra 1, 2021 (Issue devoted to: "Challenges of combining structured and unstructured data in corpus development"), pp. 1-8
Language: English
Abstract
- Recent advances in the availability of ever larger and more varied electronic datasets, both historical and modern, provide unprecedented opportunities for corpus linguistics and the digital humanities. However, combining unstructured text with images, video and audio, as well as structured metadata, poses a variety of challenges to corpus compilers. This paper presents an overview of the topic to contextualise this special issue of Research in Corpus Linguistics. The aim of the special issue is to highlight some of the challenges faced and solutions developed in several recent and ongoing corpus projects. Rather than providing overall descriptions of corpora, each contributor discusses specific challenges they faced in the corpus development process, summarised in this paper. We hope that the special issue will benefit future corpus projects by providing solutions to common problems and by paving the way for new best practices for the compilation and development of rich-data corpora. We also hope that this collection of articles will help keep the conversation going on the theoretical and methodological challenges of corpus compilation.