
Dialnet


Information retrieval based on DOM trees

  • Authors: Julián Alarte Aleixandre
  • Thesis supervisors: Josep Francesc Silva Galiana (supervisor)
  • Defense: Universitat Politècnica de València (Spain), 2023
  • Language: English
  • Thesis examination committee: Francisco Javier Oliver Villarroya (chair), Pascual Julián Iranzo (secretary), José Ángel Olivas Varela (member)
  • Doctoral program: Programa de Doctorado en Informática por la Universitat Politècnica de València
  • Links
    • Open-access thesis at: RiuNet
  • Abstract
    • For several years, the amount of information available on the Web has been growing exponentially. Every day, a huge amount of data is generated and made immediately available on the Web. Indexers and crawlers browse the Web daily to find newly added information and make it available to answer users' search queries. However, the amount of information is so large that it must be preprocessed. Given that users are only interested in relevant information, indexers and crawlers need not process the boilerplate, redundant, or useless elements of web pages. Processing such irrelevant elements leads to an unnecessary waste of resources such as storage space, runtime, and bandwidth. Different studies have shown that between 40% and 50% of the data on the Web are noisy elements. For this reason, several techniques focused on detecting both relevant and irrelevant data have been developed over the last 20 years. The problems of identifying the relevant content of a web page, its template, its menu, etc. can be approached in various ways, and for this reason there exist completely different techniques to address them. This thesis focuses on the development of information retrieval techniques based on DOM trees. Its goal is to detect different parts of a web page, such as the main content, the template, and the main menu. Most existing techniques focus on detecting text inside the main content of web pages, mainly by removing the template or by inferring the main content. The techniques proposed in this thesis not only extract text by eliminating the template or inferring the main content, but also extract any other relevant information from web pages, such as images, animations, and videos. Our techniques are useful not only for indexers and crawlers but also for users browsing the Web. For instance, for users with functional diversity (such as blindness), removing noisy elements can make web pages easier to read (or listen to). To make the techniques broadly accessible, we have implemented them as browser extensions that are compatible with Mozilla-based and Chromium-based browsers. In addition, these tools are publicly available, so anyone interested can access them and continue the research.

