Dialnet


The NICT JLE Corpus: Exploiting the language learners' speech database for research and education

  • Authors: Emi Izumi, Kiyotaka Uchimoto, Hitoshi Isahara
  • Source: International journal of the computer, the internet and management, ISSN 0858-7027, Vol. 12, No. 2 (Aug), 2004 (Issue dedicated to: Supplement: Proceedings of the International Conference on eLearning for Knowledge-Based Society), pp. 119-125
  • Language: English
  • Full text not available
  • Abstract
    • This paper introduces a speech corpus of Japanese learners of English, named “The NICT JLE (Japanese Learner of English) Corpus”, by describing its data collection procedures, its annotation schemes, and how the corpus can be exploited for language research and education. The corpus consists of transcripts (2 million words) of audio recordings of an English oral proficiency interview test taken by 1,300 Japanese learners. Among its most distinctive features is the rich information it contains on learners’ proficiency levels and learners’ errors. The learner data have been divided into 9 proficiency levels, and the grammatical and lexical errors in each learner’s utterances have been tagged with an originally designed error tagset. Analyzing the error tag information at each proficiency level should give important clues for describing the developmental stages of learners’ language. In error tagging, we dealt only with the formal aspects of learners’ language, such as grammatical and lexical errors. To examine what cannot be examined with error-tagged data alone, two sub-corpora have been compiled. One is native speakers’ speech data, which is useful for comparing the utterances of native speakers with those of Japanese learners. The other is a back-translation corpus, compiled mainly by guessing what the learners intended to say in the interview and expressing it in Japanese. With the back-translation corpus, we can study how the mother tongue (Japanese) interferes with second language acquisition, and what kinds of things are difficult for Japanese learners to express in English. At NICT, we have been carrying out several analyses and experiments to see to what extent this corpus can be exploited for language research and education. For example, we evaluated the corpus through an experiment on the automatic detection of learners’ errors using the error tag information in the corpus. We did this with a machine learning model based on the Maximum Entropy (ME) technique. Since only a limited amount of error-tagged data was available, we needed to enlarge the training data. We added correct sentences and artificially made errors to the training data and found that this improved accuracy. As a result, we obtained 51% recall and 76% precision for the detection of article errors. We are planning to make this corpus publicly available in September 2004, so that teachers and researchers in many fields can use the data for their own interests, such as second language acquisition research, syllabus and material design, or the development of computerized pedagogical tools, by combining it with NLP (Natural Language Processing) technology.
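The abstract reports recall and precision for article-error detection separately. As a reading aid only, the minimal sketch below combines the two reported figures into an F1 score (their harmonic mean). The F1 value is not stated in the abstract; it is derived here from the reported 76% precision and 51% recall, and the function name `f1_score` is a hypothetical helper introduced for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given in [0, 1]."""
    if precision + recall == 0:
        # Avoid division by zero when nothing was detected correctly.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract for article-error detection.
precision = 0.76
recall = 0.51

# Derived value, not reported by the authors.
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.610"
```

The gap between the two figures suggests that the detector flags article errors fairly reliably when it does flag them, but misses roughly half of the tagged errors.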

