Information extraction from template-generated hidden web documents

Yih-Ling Hedley; Muhammad Younas; Anne James; Mark Sanderson


Hedley, Yih-Ling [1]; Younas, Muhammad [1]; James, Anne [1]; Sanderson, Mark [2]
1. [1] Coventry University, United Kingdom
2. [2] University of Sheffield, United Kingdom
Published in: Proceedings of the IADIS International Conference WWW/INTERNET 2004: Madrid, Spain, October 6-9, 2004 / coordinated by Pedro Isaías, Nitya Karmakar, Vol. 1, 2004 (Full papers), ISBN 972-99353-0-0, pp. 627-634
Language: English
Abstract
- Most of the information on the Web is stored in document databases and is not indexed by general-purpose search engines (such as Google and Yahoo). These databases, referred to as Hidden Web databases, dynamically generate a list of documents in response to a user query. Such documents are typically presented to users as template-generated Web pages. This paper presents a new approach that identifies Web page templates in order to extract query-related information from documents. We propose two forms of representation to analyse the content of a document: Text with Immediate Adjacent Tag Segments (TIATS) and Text with Neighbouring Adjacent Tag Segments (TNATS). Our techniques exploit the tag structures that surround the textual contents of documents to detect Web page templates and thereby extract query-related information. Experimental results demonstrate that, of the two representations, TNATS detects Web page templates more effectively and extracts information with high recall and precision.
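The abstract does not define TIATS or TNATS in detail, so the following is only a minimal sketch of the general idea it describes: pairing each piece of text with the tags that surround it, then treating text that recurs with the same tag context on another page from the same database as template content. The names `SegmentParser`, `text_with_adjacent_tags`, `query_related_text`, and the two sample pages are hypothetical, and the exact-match comparison is an assumption for illustration, not the authors' method.

```python
from html.parser import HTMLParser


class SegmentParser(HTMLParser):
    """Collects an ordered stream of ("tag", markup) and ("text", content) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this sketch; only the tag name is kept.
        self.tokens.append(("tag", f"<{tag}>"))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", f"</{tag}>"))

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text:
            self.tokens.append(("text", text))


def text_with_adjacent_tags(html):
    """Return (tags_before, text, tags_after) for every text segment.

    tags_before / tags_after are the unbroken runs of tags directly
    adjacent to the text (an assumed reading of "adjacent tag segments").
    """
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    tokens = parser.tokens
    segments = []
    for i, (kind, value) in enumerate(tokens):
        if kind != "text":
            continue
        before, j = [], i - 1
        while j >= 0 and tokens[j][0] == "tag":
            before.insert(0, tokens[j][1])
            j -= 1
        after, j = [], i + 1
        while j < len(tokens) and tokens[j][0] == "tag":
            after.append(tokens[j][1])
            j += 1
        segments.append(("".join(before), value, "".join(after)))
    return segments


def query_related_text(page, reference_page):
    """Keep text whose (tags, text, tags) triple is absent from the reference page."""
    template = set(text_with_adjacent_tags(reference_page))
    return [
        text
        for before, text, after in text_with_adjacent_tags(page)
        if (before, text, after) not in template
    ]


# Two hypothetical result pages generated from the same template.
PAGE_A = ("<html><body><h1>Search results</h1><ul><li>Heart disease study</li></ul>"
          "<p>Contact us</p></body></html>")
PAGE_B = ("<html><body><h1>Search results</h1><ul><li>Web mining survey</li></ul>"
          "<p>Contact us</p></body></html>")
```

Calling `query_related_text(PAGE_A, PAGE_B)` keeps only the list item that differs between the two pages, while the shared heading and footer are discarded as template text. A realistic detector would need to compare more than two pages and tolerate small variations in the template.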