
Dialnet


Abstract of Efficient Blocking Method for a Large Scale Citation Matching

Mateusz Fedoryszak, Lukasz Bolikowski

  • Blocking is usually the first stage of record deduplication: roughly similar entities are grouped into blocks, and more exact clustering is then performed within each block. We present a blocking method for citation matching based on hash functions and outline a blocking workflow implemented in Apache Hadoop. Several hash functions are proposed and compared, with particular attention to the feasibility of using them on big data. The possibility of combining multiple hash functions is also investigated. Finally, some technical details of the full citation matching workflow implementation are described.
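
The idea of hash-based blocking can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the two hash functions (`author_year_hash`, `long_token_year_hash`), the helper names, and the sample citations are all hypothetical, and a plain dictionary stands in for the group-by-key step that a Hadoop job would perform. Each hash function maps a citation string to one or more block keys; citations sharing a key become candidate pairs for the later, more exact matching step, and hash functions are combined here simply by taking the union of their candidate pairs.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lowercase alphanumeric tokens of a citation string."""
    return re.findall(r"[a-z0-9]+", text.lower())


def find_year(toks):
    """First token that looks like a year (1900-2099), or '' if none."""
    return next((t for t in toks if re.fullmatch(r"(?:19|20)\d{2}", t)), "")


def author_year_hash(citation):
    """Hypothetical hash: first alphabetic token of length >= 3 plus the year."""
    toks = tokens(citation)
    word = next((t for t in toks if t.isalpha() and len(t) >= 3), "")
    return [f"{word}#{find_year(toks)}"]


def long_token_year_hash(citation):
    """Hypothetical hash: one key per long token (length >= 5), tagged with the year."""
    toks = tokens(citation)
    year = find_year(toks)
    return [f"{year}#{t}" for t in toks if t.isalpha() and len(t) >= 5]


def build_blocks(citations, hash_fn):
    """Group citation ids by block key (in-memory stand-in for a group-by-key job)."""
    blocks = defaultdict(set)
    for cid, text in citations.items():
        for key in hash_fn(text):
            blocks[key].add(cid)
    return blocks


def candidate_pairs(blocks):
    """All unordered id pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up citations, for illustration only.
citations = {
    "c1": "Kowalski J., Nowak A. Hash based blocking of records. 2012",
    "c2": "J. Kowalski and A. Nowak (2012) Hash-based blocking of records",
    "c3": "Smith P. Graph theory notes. 1999",
}

pairs_a = candidate_pairs(build_blocks(citations, author_year_hash))
pairs_b = candidate_pairs(build_blocks(citations, long_token_year_hash))
combined = pairs_a | pairs_b  # combining hash functions by union of candidates
print(sorted(combined))
```

Taking the union raises recall at the cost of more candidate pairs; taking the intersection would do the opposite. Which trade-off suits a given data set is an empirical question that this sketch does not answer.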

