Automatic text re-use detection is the task of determining whether a text has been produced by using another as its source. Plagiarism, the unacknowledged re-use of text, is probably the best-known kind of re-use. Favoured by easy access to information through electronic media, plagiarism has risen in recent years, demanding the attention of experts in text analysis.
Automatic text re-use detection takes advantage of natural language processing and information retrieval technology to compare thousands of documents in search of the potential source of a presumed case of re-use. Machine translation technology can be used to uncover cases of cross-language text re-use. By exploiting such technology, thousands of exhaustive comparisons become possible, even across languages, something that would be infeasible to do manually.
In this dissertation we pay special attention to three types of text re-use, namely: (i) cross-language text re-use, (ii) paraphrase text re-use, and (iii) mono- and cross-language re-use within and from Wikipedia.
In the case of cross-language text re-use, we propose a cross-language similarity assessment model based on statistical machine translation. The model is exhaustively compared against the other models available to date and proves to be one of the best options when looking for exact translations, regardless of whether they were created automatically or manually.
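To make this concrete, the following is a minimal sketch of a translation-probability-based similarity between a suspicious document and a candidate source in another language. The function, the normalisation, and the toy probability table are illustrative assumptions only; the actual model in the dissertation, grounded in statistical machine translation, may differ in its probability estimation and length modelling.

from collections import Counter

def cl_similarity(doc_src, doc_tgt, trans_prob):
    # Hypothetical cross-language similarity: accumulate the word
    # translation probability mass between the two bags of words.
    # trans_prob[(src_word, tgt_word)] -> p(tgt_word | src_word),
    # e.g. estimated with IBM Model 1 on a parallel corpus.
    src_counts = Counter(doc_src.lower().split())
    tgt_counts = Counter(doc_tgt.lower().split())
    score = 0.0
    for s, c_s in src_counts.items():
        for t, c_t in tgt_counts.items():
            score += c_s * c_t * trans_prob.get((s, t), 0.0)
    # Normalise by the product of document lengths so that scores
    # for documents of different sizes remain comparable.
    norm = sum(src_counts.values()) * sum(tgt_counts.values())
    return score / norm if norm else 0.0

# Toy usage with a hand-crafted probability table (illustrative only).
probs = {("house", "casa"): 0.9, ("the", "la"): 0.5}
print(cl_similarity("the house", "la casa", probs))  # -> 0.35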
In the case of paraphrase, the core of plagiarism, we investigate which types of paraphrase plagiarism are the most difficult to detect. To the best of our knowledge, ours is the first analysis of plagiarism detection from the perspective of paraphrasing.
Among our insights, we find that lexical changes are the most common paraphrasing strategy when plagiarising. These findings should be integrated into the next generation of plagiarism detectors.
Finally, in the case of Wikipedia, we explore the encyclopedia as a multi-authoring framework in which texts are re-used both across versions of the same article and across languages. Our analysis of multilingualism shows that Wikipedia editions in less-resourced languages tend to be more closely connected to other editions. We also investigate the feasibility of extracting parallel fragments from Wikipedia in order to (i) detect cases of cross-language re-use within the encyclopedia and (ii) enrich our cross-language similarity assessment model.
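As an illustration of how candidate parallel fragments might be filtered, the sketch below combines a sentence length ratio test with a bilingual-dictionary overlap test. Both thresholds, the function name, and the dictionary format are assumptions made for the example; they are not the specific extraction procedure used in the dissertation.

def is_parallel_candidate(sent_src, sent_tgt, bilingual_dict,
                          max_len_ratio=2.0, min_overlap=0.3):
    # Crude filter for candidate parallel sentence pairs: the two
    # sentences must have comparable lengths, and a minimum fraction
    # of source words must have a dictionary translation on the
    # target side.
    src = sent_src.lower().split()
    tgt = set(sent_tgt.lower().split())
    if not src or not tgt:
        return False
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    if ratio > max_len_ratio:
        return False
    translated = sum(1 for w in src
                     if any(t in tgt for t in bilingual_dict.get(w, [])))
    return translated / len(src) >= min_overlap

# Toy usage with a two-entry dictionary (illustrative only).
lex = {"la": ["the"], "casa": ["house"]}
print(is_parallel_candidate("la casa", "the house", lex))  # -> True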
To empirically validate our models, we perform millions of mono- and cross-language text comparisons on the basis of different text representations and similarity measures. In many cases we do so on corpora we generated ourselves, which are now freely available to interested researchers.
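As an example of one such representation and measure pairing, the sketch below compares two texts as sets of character 3-grams under the Jaccard coefficient, a common baseline in text re-use detection. The value of n and the choice of measure are assumptions for the example rather than the dissertation's specific configuration.

def char_ngrams(text, n=3):
    # Normalise whitespace and return the set of character n-grams.
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    # Jaccard coefficient: size of intersection over size of union.
    union = len(a | b)
    return len(a & b) / union if union else 0.0

d1 = "Automatic text re-use detection compares documents."
d2 = "Detecting text re-use automatically requires comparing documents."
print(jaccard(char_ngrams(d1), char_ngrams(d2)))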