An empirical study is presented showing how factors such as co-occurrence frequency, linguistic constraints on the candidate data, and the type of collocation to be identified influence the identification accuracy achieved by a purely frequency-based approach on the one hand and by well-known statistical association measures (mutual information, the Dice coefficient, relative entropy, and the log-likelihood statistic) on the other. The empirical results confirm that the statistical measures are weak at identifying collocations in data with a high proportion of low-frequency items. They also reveal differences between the individual association measures, depending on the class of collocations to be identified, on whether the measures are applied to full-form or base-form data, and on whether the test samples contain low-frequency data.
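
As a rough illustration of the kinds of scores being compared, the sketch below computes raw frequency, mutual information, the Dice coefficient, and a log-likelihood statistic for one candidate word pair from its counts. It is a minimal sketch under stated assumptions, not the study's implementation: the function name `association_scores` and its parameters are hypothetical, mutual information is taken in its pointwise form, the log-likelihood statistic is computed as G² over a 2×2 contingency table, and relative entropy is omitted because the abstract does not say how it is applied.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Score one candidate pair (x, y) from raw counts (hypothetical helper).

    f_xy: number of times x and y co-occur
    f_x, f_y: number of pairs containing x, respectively y
    n: total number of candidate pairs in the sample
    """
    # Pointwise mutual information: log2( P(x,y) / (P(x) * P(y)) )
    pmi = math.log2((f_xy * n) / (f_x * f_y))

    # Dice coefficient: 2 * f(x,y) / (f(x) + f(y))
    dice = 2 * f_xy / (f_x + f_y)

    # Log-likelihood ratio (G^2) over the 2x2 contingency table.
    observed = [
        f_xy,                    # x and y
        f_x - f_xy,              # x without y
        f_y - f_xy,              # y without x
        n - f_x - f_y + f_xy,    # neither
    ]
    expected = [
        f_x * f_y / n,
        f_x * (n - f_y) / n,
        (n - f_x) * f_y / n,
        (n - f_x) * (n - f_y) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

    return {"frequency": f_xy, "pmi": pmi, "dice": dice, "log_likelihood": g2}


# Invented counts, for illustration only.
frequent = association_scores(f_xy=10, f_x=20, f_y=20, n=1000)
hapax = association_scores(f_xy=1, f_x=1, f_y=1, n=1000)
print(frequent)
print(hapax)
```

With these invented counts, the pair seen only once receives a higher pointwise mutual information score than the pair seen ten times. This is one way sparse counts can distort a statistical ranking, and it is in line with the weakness on low-frequency data that the study reports.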