Slimane Bellaouar, Hadda Cherroun, Djelloul Ziadi
Kernel methods are powerful tools in machine learning, but to be practical they must be computationally efficient. This paper builds on our previous work, which proposed a list-based approach to computing the string subsequence kernel (SSK) efficiently. Here we present a novel geometric approach; our main idea is that the SSK computation reduces to the range query problem. We start by constructing a match list $$L(s,t)=\left\{ (i,j):s_{i}=t_{j}\right\}$$, where s and t are the strings to be compared; such a match list contains only the data that contribute to the result. To compute the SSK efficiently, we extend the layered range tree data structure to a layered range sum tree, a range-aggregation data structure. The SSK computation takes $$O(p|L|\log |L|)$$ time and $$O(|L|\log |L|)$$ space, where |L| is the size of the match list and p is the subsequence length of the SSK. We present an empirical evaluation of our approach against the dynamic programming and sparse dynamic programming approaches, both on synthetically generated data and on newswire article data. Experimental results show that our approach is efficient for large alphabets, except on very short strings, so it can be used in applications such as text categorization, information extraction and music genre classification. Moreover, the proposed approach also outperforms the sparse dynamic programming approach on long strings.
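To make the reduction concrete, the following is a minimal Python sketch of the match list and of an SSK computed only over matched pairs. It is not the layered range sum tree from the paper: the sum over earlier matches (i', j') with i' < i and j' < j is evaluated by a naive scan, which costs O(p|L|^2) and is exactly the dominance range sum that the range-aggregation structure is meant to speed up. The function names `match_list` and `ssk_from_matches`, the 0-based indices, and the decay parameter `lam` (the usual SSK gap penalty, which the abstract does not mention) are assumptions made for illustration.

```python
def match_list(s, t):
    """Return L(s, t): all index pairs (i, j) with s[i] == t[j] (0-based)."""
    return [(i, j) for i, a in enumerate(s) for j, b in enumerate(t) if a == b]


def ssk_from_matches(s, t, p, lam):
    """Unnormalized SSK for subsequences of length p, using only the match list.

    Assumption: `lam` is the standard SSK decay factor, so a common
    subsequence spanning l_s positions in s and l_t positions in t
    contributes lam ** (l_s + l_t).
    """
    matches = match_list(s, t)
    # weight[(i, j)]: total contribution of common subsequences of the
    # current length whose last characters are s[i] and t[j].
    weight = {m: lam ** 2 for m in matches}  # length-1 subsequences
    for _ in range(p - 1):
        extended = {}
        for (i, j) in matches:
            total = 0.0
            # Naive dominance range sum over matches strictly before (i, j)
            # in both coordinates; a range sum tree would replace this scan.
            for (a, b), w in weight.items():
                if a < i and b < j:
                    total += w * lam ** ((i - a - 1) + (j - b - 1))
            extended[(i, j)] = lam ** 2 * total
        weight = extended
    return sum(weight.values())
```

For example, `ssk_from_matches("cat", "car", 2, 0.5)` considers only the two matches (0, 0) and (1, 1); the single common subsequence "ca" spans two positions in each string, so the result is 0.5 ** 4.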