While it benefits performance, every optimization technique further complicates the design of data management systems. The fundamental problem is that the design space grows combinatorially with the number of techniques and technologies, making the current development process unsustainable. In addition to this growing complexity, performance engineering is complicated by tools and paradigms that obfuscate the design space and obstruct plan transformations. Recognizing this, we developed Voodoo: a vector-oriented abstraction layer between a data management system frontend (relational, graph-oriented, or other) and the hardware. It supports the implementation of different data models while simplifying the application of many tuning techniques and generating highly efficient executable code. It does so by implementing a novel concept called Controlled Folding. The resulting programming model structures the design space such that conceptually close techniques (e.g., different flavors of parallelism) are encoded in similar plans. This makes equivalent plan transformations simple and non-equivalent transformations hard, and it emphasizes design choices that are often obscured in languages like C++.
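To give an intuition, the following sketch (plain C++ with hypothetical names, not Voodoo's actual operators) illustrates the idea behind controlled folding: a fold is driven by a separately computed control vector that assigns each input position to a group, so a sequential sum and a partitioned, parallelizable sum are the same plan shape with different control vectors.

    // Illustrative sketch only: operator and function names are hypothetical,
    // not Voodoo's actual API. A fold is always guided by a control vector
    // that names the group each input position belongs to.
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Fold the input into one partial sum per group named by the control vector.
    std::vector<int64_t> foldSum(const std::vector<int64_t>& input,
                                 const std::vector<size_t>& control,
                                 size_t groups) {
        std::vector<int64_t> out(groups, 0);
        for (size_t i = 0; i < input.size(); ++i)
            out[control[i]] += input[i];  // each group could be folded by one core/lane
        return out;
    }

    int main() {
        std::vector<int64_t> data = {3, 1, 4, 1, 5, 9, 2, 6};

        // Sequential plan: every position folds into the same group.
        std::vector<size_t> sequential(data.size(), 0);

        // Parallel plan: positions are assigned round-robin to 4 partitions;
        // only this control vector differs from the sequential plan.
        std::vector<size_t> partitioned(data.size());
        for (size_t i = 0; i < data.size(); ++i) partitioned[i] = i % 4;

        auto total    = foldSum(data, sequential, 1);
        auto partials = foldSum(data, partitioned, 4);

        // A second fold over the partials merges them into the same total.
        int64_t merged = foldSum(partials, std::vector<size_t>(4, 0), 1)[0];
        std::cout << total[0] << " == " << merged << "\n";
    }

Because the two variants differ only in a data value (the control vector) rather than in program structure, rewriting one into the other is a simple, local plan transformation.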
Co-processors such as GPUs have been recognized as beneficial for data-intensive applications because they offer orders of magnitude higher bandwidth and processing capacity than CPUs. However, the von Neumann architecture, combined with the limited capacity of the GPU's memory, exposes the PCI bus as the major bottleneck holding back the adoption of GPUs for analytical data processing.
Based on the insight that GPUs are always connected to a host system with processing resources of its own, I developed the Bitwise Decomposed Processing Model. In this model, the PCI bottleneck is addressed by distributing data as well as processing between the CPU and the GPU: the GPU holds an approximation of the database and produces an approximate result that is subsequently refined into an accurate answer on the CPU. This strategy yields a speedup of up to 6 times over CPU-only processing, even for datasets larger than the GPU's internal memory. This scheme was, to the best of my knowledge, the first to introduce such co-processing into the domain of data management systems.
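The following sketch (plain C++, hypothetical names; both passes run on the CPU here for readability) illustrates the idea for a selection "value < threshold": the high bits of each value, which would reside in GPU memory, are filtered approximately, and the CPU refines only the undecided positions using the low bits it retains.

    // Illustrative sketch only, not the actual system. The values are bitwise
    // decomposed: the 16 high bits stand in for the GPU-resident approximation,
    // the 16 low bits stay with the CPU for refinement.
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<uint32_t> column = {17, 120000, 42, 99999, 7, 1u << 20};
        uint32_t threshold = 100000;

        // Bitwise decomposition of the column and the predicate constant.
        std::vector<uint16_t> hi(column.size()), lo(column.size());
        for (size_t i = 0; i < column.size(); ++i) {
            hi[i] = column[i] >> 16;
            lo[i] = column[i] & 0xFFFF;
        }
        uint16_t hiT = threshold >> 16, loT = threshold & 0xFFFF;

        // "GPU" pass: approximate filter on the high bits only.
        std::vector<size_t> qualifying, candidates;
        for (size_t i = 0; i < hi.size(); ++i) {
            if (hi[i] < hiT)       qualifying.push_back(i);  // definitely qualifies
            else if (hi[i] == hiT) candidates.push_back(i);  // undecided, refine on CPU
            // hi[i] > hiT: definitely does not qualify
        }

        // CPU pass: refine only the undecided positions using the low bits.
        for (size_t i : candidates)
            if (lo[i] < loT) qualifying.push_back(i);

        for (size_t i : qualifying) std::cout << column[i] << " qualifies\n";
    }

Only the (small) approximate result has to cross the PCI bus, which is why the scheme keeps paying off even when the full-resolution dataset does not fit into GPU memory.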
Virtually all general-purpose data management systems co-locate the values of a tuple (N-ary storage). However, co-locating the values of a column (decomposed storage) improves the performance of many analytical queries by orders of magnitude. For applications that are not purely analytical, hybrid systems improve cache locality by adapting the storage model to the workload. To process tuples rather than single columns, however, such systems implement an iterator-based design that hurts CPU efficiency, yielding systems that perform worse than pure implementations of N-ary or decomposed storage. I found that hybrid storage can be combined with executable code generation to preserve CPU efficiency. To automatically select a storage strategy, I developed a cache-conscious cost model with much higher accuracy than previously proposed models. The resulting query processor outperformed decomposed storage by up to 4 times and N-ary storage by more than 10 times.
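As a simplified illustration (plain C++, hypothetical names, not the actual system), the following sketch contrasts the two layouts: a scan that touches only one attribute reads every cache line fully in decomposed storage, while N-ary storage drags untouched attributes into the cache as well; a hybrid layout would group attributes according to which ones the workload accesses together.

    // Illustrative sketch only: the same two-attribute table in N-ary (row-wise)
    // and decomposed (column-wise) layout, scanned over a single attribute.
    #include <cstdint>
    #include <numeric>
    #include <vector>

    // N-ary storage: the values of a tuple are co-located.
    struct Tuple { int32_t price; int32_t quantity; };
    using NaryTable = std::vector<Tuple>;

    // Decomposed storage: the values of a column are co-located.
    struct DecomposedTable {
        std::vector<int32_t> price;
        std::vector<int32_t> quantity;
    };

    int64_t sumPricesNary(const NaryTable& t) {
        int64_t sum = 0;
        for (const Tuple& row : t) sum += row.price;  // half of each cache line is wasted
        return sum;
    }

    int64_t sumPricesDecomposed(const DecomposedTable& t) {
        return std::accumulate(t.price.begin(), t.price.end(), int64_t{0});  // dense scan
    }

    int main() {
        NaryTable nary = {{10, 2}, {20, 1}, {30, 5}};
        DecomposedTable dec{{10, 20, 30}, {2, 1, 5}};
        return sumPricesNary(nary) == sumPricesDecomposed(dec) ? 0 : 1;
    }

Code generation helps precisely because it emits such tight, layout-specific loops for whichever layout the cost model selects, instead of hiding the layout behind a generic tuple iterator.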