Comparison of Pre-trained vs Custom-trained Word Embedding Models for Word Sense Disambiguation

    1. [1] University of Lahore, Pakistan

    2. [2] University of Central Punjab, Pakistan

  • Published in: ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal, ISSN-e 2255-2863, Vol. 12, No. 1, 2023
  • Language: English
  • Abstract
    • The prime objective of word sense disambiguation (WSD) is to develop machines that can automatically recognize the actual meaning (sense) of an ambiguous word in a sentence. WSD can benefit various NLP and HCI tasks. Researchers have explored a wide variety of methods to resolve this sense ambiguity; however, their focus has largely been on English and a few other well-resourced languages. Urdu, with more than 300 million users and a large amount of electronic text available on the web, remains unexplored. In recent years, word embedding methods have proven extremely successful for a variety of Natural Language Processing tasks. This study evaluates, compares, and applies a variety of word embedding approaches to Urdu WSD (both Lexical Sample and All-Words), including pre-trained models (Word2Vec, GloVe, and FastText) as well as custom-trained models (Word2Vec, GloVe, and FastText trained on the Ur-Mono corpus). Two benchmark corpora are used for the evaluation in this study: (1) the UAW-WSD-18 corpus and (2) the ULS-WSD-18 corpus. For the Urdu All-Words WSD task, the top results (Accuracy=60.07 and F1=0.45) were achieved using pre-trained FastText. For the Lexical Sample WSD task, the best results (Accuracy=70.93 and F1=0.60) were achieved using the custom-trained GloVe word embedding method.
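
    The comparison the abstract describes (pre-trained versus custom-trained embeddings applied to WSD) can be illustrated with a minimal Python sketch. The snippet below is not the paper's pipeline; it assumes gensim, a hypothetical corpus file ur_mono_corpus.txt (one whitespace-tokenized sentence per line, in the spirit of a monolingual corpus like Ur-Mono), and a hypothetical sense inventory mapping sense labels to gloss tokens. It trains a custom FastText model, then disambiguates a target word by cosine similarity between an averaged context vector and averaged sense-gloss vectors.

        import numpy as np
        from gensim.models import FastText

        # Custom training on a monolingual corpus (hypothetical path;
        # one whitespace-tokenized sentence per line).
        with open("ur_mono_corpus.txt", encoding="utf-8") as f:
            sentences = [line.split() for line in f]

        model = FastText(sentences=sentences, vector_size=100, window=5,
                         min_count=2, epochs=5)

        def avg_vector(tokens, wv):
            # Mean of the embeddings of the tokens present in the vocabulary.
            vecs = [wv[t] for t in tokens if t in wv]
            return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

        def disambiguate(context_tokens, sense_glosses, wv):
            # Pick the sense whose averaged gloss vector is most
            # cosine-similar to the averaged context vector.
            ctx = avg_vector(context_tokens, wv)
            best_sense, best_score = None, -1.0
            for sense, gloss_tokens in sense_glosses.items():
                g = avg_vector(gloss_tokens, wv)
                denom = np.linalg.norm(ctx) * np.linalg.norm(g)
                score = float(ctx @ g / denom) if denom else 0.0
                if score > best_score:
                    best_sense, best_score = sense, score
            return best_sense

    Swapping the custom model for published pre-trained vectors (for example, loading Facebook's fastText vectors with gensim.models.fasttext.load_facebook_vectors) leaves the disambiguation step unchanged, which is what makes a pre-trained versus custom-trained comparison of the kind the study reports straightforward to set up.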

