Abstract of Relative information completeness

Wenfei Fan, Floris Geerts

  • Foreword to TODS invited papers issue. Z. Meral Özsoyoglu. Article No. 23. doi>10.1145/1862919.1862920

  • An architecture for recycling intermediates in a column-store. Milena G. Ivanova, Martin L. Kersten, Niels J. Nes, Romulo A.P. Gonçalves. Article No. 24. doi>10.1145/1862919.1862921

    Automatic recycling of intermediate results to improve both query response time and throughput is a grand challenge for state-of-the-art databases. Tuples are loaded and streamed through a tuple-at-a-time processing pipeline, avoiding materialization of intermediates as much as possible. This limits the opportunities for reuse of overlapping computations to DBA-defined materialized views and function/result cache tuning. In contrast, the operator-at-a-time execution paradigm produces fully materialized results in each step of the query plan. To avoid resource contention, these intermediates are evicted as soon as possible.

    In this article we study an architecture that harvests the byproducts of the operator-at-a-time paradigm in a column-store system using a lightweight mechanism, the recycler. The key challenge then becomes the selection of policies to admit intermediates to the resource pool, to determine their retention period, and to devise an eviction strategy when facing resource limitations. The proposed recycling architecture has been implemented in an open-source system. An experimental analysis against the TPC-H ad-hoc decision support benchmark and a complex, real-world application (SkyServer) demonstrates its effectiveness in terms of self-organizing behavior and significant performance gains. The results indicate the potential of recycling intermediates and chart a route for further development of database kernels.
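
    A minimal sketch of what such a recycler component might look like, assuming a hypothetical admit/lookup interface and made-up admission and eviction heuristics; the article's actual policies live inside an operator-at-a-time engine and differ in detail:

    class Recycler:
        """Sketch of an intermediate-result cache (hypothetical API, not the
        article's recycler itself). Materialized intermediates are keyed by a
        fingerprint of the operator subtree that produced them; eviction frees
        the least valuable entries under a fixed memory budget."""

        def __init__(self, budget_bytes):
            self.budget = budget_bytes
            self.used = 0
            self.pool = {}  # fingerprint -> {"result", "size", "hits", "cost"}

        def lookup(self, fingerprint):
            entry = self.pool.get(fingerprint)
            if entry is None:
                return None
            entry["hits"] += 1  # reuse of an overlapping computation
            return entry["result"]

        def admit(self, fingerprint, result, size, cost):
            # Admission policy (made up for illustration): only keep
            # intermediates that fit and cost more to compute than to hold.
            if size > self.budget or cost < size:
                return
            while self.used + size > self.budget:
                self._evict_one()
            self.pool[fingerprint] = {"result": result, "size": size,
                                      "hits": 0, "cost": cost}
            self.used += size

        def _evict_one(self):
            # Eviction policy: drop the entry with the lowest benefit density,
            # so cheap, rarely reused results leave the pool first.
            def density(k):
                e = self.pool[k]
                return (e["hits"] + 1) * e["cost"] / e["size"]
            victim = min(self.pool, key=density)
            self.used -= self.pool[victim]["size"]
            del self.pool[victim]

    A query engine would call lookup with a subtree's fingerprint before executing it, and admit afterwards with the measured size and cost.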

  • I/O efficient algorithms for serial and parallel suffix tree construction. Amol Ghoting, Konstantin Makarychev. Article No. 25. doi>10.1145/1862919.1862922

    Over the past three decades, the suffix tree has served as a fundamental data structure in string processing. However, its widespread applicability has been hindered due to the fact that suffix tree construction does not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, it is imperative that we have a scalable suffix tree construction algorithm.

    The past few years have seen the emergence of several disk-based suffix tree construction algorithms. However, construction times continue to be daunting: for example, indexing the entire human genome still takes over 30 hours on a system with 2 gigabytes of physical memory. In this article, we will empirically demonstrate and argue that all existing suffix tree construction algorithms have a severe limitation: to glean reasonable disk I/O efficiency, the input string being indexed must fit in main memory. This limitation is attributed to the poor locality exhibited by existing suffix tree construction algorithms and inhibits both sequential and parallel scalability. To deal with this limitation, we will show that through careful algorithm design, one of the simplest suffix tree construction algorithms can be rearchitected to build a suffix tree in a tiled manner, allowing the execution to operate within a fixed main memory budget when indexing strings of any size. We will also present a parallel extension of our algorithm that is designed for massively parallel systems like the IBM Blue Gene. An experimental evaluation will show that the proposed approach affords an improvement of several orders of magnitude in serial performance when indexing large strings. Furthermore, the proposed parallel extension is shown to be scalable: it is now possible to index the entire human genome on a 1024 processor IBM Blue Gene system in under 15 minutes.
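
    A toy Python sketch of the tiling idea, assuming fixed-length-prefix partitioning and a naive per-tile trie; the article's algorithm is considerably more refined, so this only conveys why tiles bound the working set and parallelize naturally:

    from collections import defaultdict

    def build_tiled(text, prefix_len=2):
        # Group suffixes by their first prefix_len characters; each group
        # ("tile") is built independently, so only one tile's working set
        # must be memory-resident at a time, whatever the string's size.
        tiles = defaultdict(list)
        for i in range(len(text)):
            tiles[text[i:i + prefix_len]].append(i)
        # Tiles are independent, so on a parallel machine each processor
        # can be assigned a disjoint subset of tiles.
        return {p: naive_trie(text, starts)
                for p, starts in sorted(tiles.items())}

    def naive_trie(text, suffix_starts):
        # Placeholder for per-tile construction: insert each suffix into an
        # uncompressed trie (a real implementation would build a compressed
        # suffix tree per tile and write it out before starting the next).
        root = {}
        for s in suffix_starts:
            node = root
            for ch in text[s:]:
                node = node.setdefault(ch, {})
        return root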

  • Space-optimal heavy hitters with strong error bounds. Radu Berinde, Piotr Indyk, Graham Cormode, Martin J. Strauss. Article No. 26. doi>10.1145/1862919.1862923

    The problem of finding heavy hitters and approximating the frequencies of items is at the heart of many problems in data stream analysis. It has been observed that several proposed solutions to this problem can outperform their worst-case guarantees on real data. This leads to the question of whether some stronger bounds can be guaranteed. We answer this in the positive by showing that a class of counter-based algorithms (including the popular and very space-efficient Frequent and SpaceSaving algorithms) provides much stronger approximation guarantees than previously known. Specifically, we show that errors in the approximation of individual elements do not depend on the frequencies of the most frequent elements, but only on the frequency of the remaining tail. This shows that counter-based methods are the most space-efficient (in fact, space-optimal) algorithms having this strong error bound.
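
    For concreteness, a minimal Python rendering of SpaceSaving, one of the counter-based algorithms the result covers; the tail bound described above says the per-item error is governed by the frequency mass outside the heavy hitters, not by the heavy hitters themselves:

    def space_saving(stream, k):
        # SpaceSaving: maintain at most k counters. When a new item arrives
        # and all counters are taken, the item with the minimum count is
        # replaced and the newcomer inherits that count plus one, so every
        # estimate overshoots the true frequency by at most the minimum count.
        counters = {}
        for item in stream:
            if item in counters:
                counters[item] += 1
            elif len(counters) < k:
                counters[item] = 1
            else:
                victim = min(counters, key=counters.get)
                counters[item] = counters.pop(victim) + 1
        return counters  # item -> estimated frequency (an upper bound)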

    This tail guarantee allows these algorithms to solve the sparse recovery problem. Here, the goal is to recover a faithful representation of the vector of frequencies, f. We prove that using space O(k), the algorithms construct an approximation f* to the frequency vector f so that the L1 error ∥f - f*∥1 is close to the best possible error min_{f'} ∥f' - f∥1, where f' ranges over all vectors with at most k non-zero entries. This improves the previously best known space bound of about O(k log n) for streams without element deletions (where n is the size of the domain from which stream elements are drawn). Other consequences of the tail guarantees are results for skewed (Zipfian) data, and guarantees for the accuracy of merging multiple summarized streams.
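
    One consequence mentioned above, merging multiple summarized streams, can be sketched as combining two counter sets and re-pruning; this simplified sum-and-prune merge is an assumption for illustration, not the article's exact procedure (reading the surviving counters off as a k-sparse vector f* is likewise the sparse-recovery view of the summary):

    def merge_summaries(c1, c2, k):
        # Sum counts item-wise, then keep only the k largest counters. The
        # tail guarantee is what lets the merged summary retain its accuracy.
        merged = dict(c1)
        for item, count in c2.items():
            merged[item] = merged.get(item, 0) + count
        return dict(sorted(merged.items(), key=lambda kv: -kv[1])[:k])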

  • Relative information completeness. Wenfei Fan, Floris Geerts. Article No. 27. doi>10.1145/1862919.1862924

    This article investigates the question of whether a partially closed database has complete information to answer a query. In practice an enterprise often maintains master data Dm, a closed-world database. We say that a database D is partially closed if it satisfies a set V of containment constraints of the form q(D) ⊆ p(Dm), where q is a query in a language LC and p is a projection query. The part of D not constrained by (Dm, V) is open, from which some tuples may be missing. The database D is said to be complete for a query Q relative to (Dm, V) if for all partially closed extensions D' of D, Q(D') = Q(D), i.e., adding tuples to D either violates some constraints in V or does not change the answer to Q.
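
    A toy Python illustration of a single containment constraint q(D) ⊆ p(Dm), with hypothetical relation encodings as sets of tuples; the article of course works with whole constraint sets V and query languages LC rather than materialized query results:

    # Master data Dm is closed-world: the (employee, dept) pairs below are
    # assumed complete. The open database D may miss tuples, but any tuple
    # it does contain must project into Dm.
    Dm_p = {("alice", "sales"), ("bob", "hr"), ("carol", "it")}

    def satisfies_containment(q_of_D, p_of_Dm):
        # The constraint q(D) ⊆ p(Dm) holds iff every answer to q over the
        # open database appears in the projection over the master data.
        return set(q_of_D) <= set(p_of_Dm)

    D_q = {("alice", "sales"), ("bob", "hr")}
    assert satisfies_containment(D_q, Dm_p)                   # partially closed
    assert not satisfies_containment(D_q | {("dave", "ops")}, Dm_p)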

    We first show that the proposed model can also capture the consistency of data, in addition to its relative completeness. Indeed, integrity constraints studied for data consistency can be expressed as containment constraints. We then study two problems. One is to decide, given Dm, V, a query Q in a language LQ, and a partially closed database D, whether D is complete for Q relative to (Dm, V). The other is to determine, given Dm, V and Q, whether there exists a partially closed database that is complete for Q relative to (Dm, V). We establish matching lower and upper bounds on these problems for a variety of languages LQ and LC. We also provide characterizations for a database to be relatively complete, and for a query to allow a relatively complete database, when LQ and LC are conjunctive queries.

