Training deployable general domain MT for a low resource language pair: English–Bangla

  • Authors: Sandipan Dandapat, William Lewis
  • Published in: Proceedings of the 21st Annual Conference of the European Association for Machine Translation: 28-30 May 2018, Universitat d'Alacant, Alacant, Spain / coordinated by Juan Antonio Pérez Ortiz, Felipe Sánchez Martínez, Miquel Esplà Gomis, Maja Popovic, Celia Rico Pérez, André Martins, Joachim Van den Bogaert, Mikel L. Forcada Zubizarreta, 2018, ISBN 978-84-09-01901-4, pp. 109-117
  • Language: English
  • Abstract
    • A large percentage of the world’s population, upwards of 1.5 billion people, speaks a language of the Indian subcontinent. We call these Indic languages here; they comprise languages from both the Indo-European (e.g., Hindi, Bangla, Gujarati) and Dravidian (e.g., Tamil, Telugu, Malayalam) families. A universal characteristic of Indic languages is their complex morphology, which, combined with the general lack of sufficient high-quality parallel data, can make developing machine translation (MT) for these languages difficult. In this paper, we describe our efforts towards developing general-domain English–Bangla MT systems that are deployable to the Web. We initially developed and deployed SMT-based systems, but over time migrated to NMT-based systems. Our initial SMT-based systems had reasonably good BLEU scores; with NMT systems, however, we have gained significant improvement over the SMT baselines. This is achieved using a number of ideas to boost the data store and counter data sparsity: crowd translation of intelligently selected monolingual data (with throughput enhanced by an Input Method Editor (IME) designed specifically for QWERTY keyboard entry for Devanagari-scripted languages), back-translation, different regularization techniques, dataset augmentation, and early stopping.

