With growing evidence of the formulaic nature of human communication, there is an increasing need to extend discourse marker research to functionally analogous multi-word expressions (MWEs). In contrast to the common qualitative approaches to discourse marker identification in corpora, this paper presents a corpus-driven, semi-automatic approach to the identification of multi-word discourse markers (MWDMs) in the reference corpus of spoken Slovene. Using eight statistical measures, we identified 173 structurally fixed discourse-marking MWEs, distinguished by a high number of tokens, a large proportion of grammatical words, and semantic heterogeneity. This list is substantially longer than one that manual inspection of smaller corpus samples would have yielded. Although frequency-based methods produced satisfactory results, the best precision in MWDM identification was achieved with the t-score association measure, while the overall poor performance of the mutual information measure suggests that it is inadequate for the extraction of MWDMs and other MWEs with similar lexical and distributional features.
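The contrast between the t-score and mutual information can be illustrated with a minimal sketch. It uses the standard textbook two-word formulations of both measures, which may differ from the exact variants applied in the paper (the eight measures are not specified here, and MWDMs can be longer than two words). All counts below are invented for illustration and are not taken from the corpus.

```python
import math


def expected_frequency(f1, f2, n):
    """Expected co-occurrence count of two words under independence."""
    return f1 * f2 / n


def t_score(observed, f1, f2, n):
    """Textbook t-score: (O - E) / sqrt(O)."""
    e = expected_frequency(f1, f2, n)
    return (observed - e) / math.sqrt(observed)


def mutual_information(observed, f1, f2, n):
    """Textbook (pointwise) mutual information: log2(O / E)."""
    e = expected_frequency(f1, f2, n)
    return math.log2(observed / e)


# Hypothetical corpus size and counts (assumptions, not corpus data).
N = 1_000_000

# A frequent pair of grammatical words: (observed, freq of word 1, freq of word 2).
frequent = (2000, 20000, 15000)
# A rare pair whose two words occur only together.
rare = (5, 5, 5)

for label, (o, f1, f2) in (("frequent", frequent), ("rare", rare)):
    print(label,
          "t-score:", round(t_score(o, f1, f2, N), 2),
          "MI:", round(mutual_information(o, f1, f2, N), 2))
```

In this toy setting the t-score ranks the frequent grammatical pair above the rare pair, whereas mutual information ranks them the other way round. This is consistent with the reported finding that mutual information performs poorly on expressions made up largely of high-frequency grammatical words.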