

My Biography

I am a Postdoc at the Databases group at MIT CSAIL. I spent my PhD years in the Database Architectures group at CWI in Amsterdam, resulting in a PhD from the University of Amsterdam in 2015. I received my master's degree (Diplom) in computer science at Humboldt-Universität zu Berlin in 2010. My research interests lie in analytical query processing on memory-resident data. In particular, I study storage schemes and processing models for modern hardware.

Selected Publications

Voodoo - A Vector Algebra for Portable Database Performance on Modern Hardware


Pirk, H. & Moll, O. & Zaharia, M. & Madden, S., 2016

Abstract: In-memory databases require careful tuning and many engineering tricks to achieve good performance. Such database performance engineering is hard: a plethora of data and hardware-dependent optimization techniques form a design space that is difficult to navigate for a skilled engineer – even more so for a query compiler. To facilitate performance-oriented design exploration and query plan compilation, we present Voodoo, a declarative intermediate algebra that abstracts the detailed architectural properties of the hardware, such as multi- or many-core architectures, caches and SIMD registers, without losing the ability to generate highly tuned code. Because it consists of a collection of declarative, vector-oriented operations, Voodoo is easier to reason about and tune than low-level C and related hardware-focused extensions (Intrinsics, OpenCL, CUDA, etc.). This enables our Voodoo compiler to produce (OpenCL) code that rivals and even outperforms the fastest state-of-the-art in-memory databases for both GPUs and CPUs. In addition, Voodoo makes it possible to express techniques as diverse as cache-conscious processing, predication and vectorization (again on both GPUs and CPUs) with just a few lines of code. Central to our approach is a novel idea we termed control vectors, which allows a code-generating frontend to expose parallelism to the Voodoo compiler in an abstract manner, enabling portable performance across hardware platforms. We used Voodoo to build an alternative backend for MonetDB, a popular open-source in-memory database. Our backend allows MonetDB to perform at the same level as highly tuned in-memory databases, including HyPer and Ocelot. We also demonstrate Voodoo’s usefulness when investigating hardware-conscious tuning techniques, assessing their performance on different queries, devices and data.

Non-Invasive Progressive Optimization for In-Memory Databases


Zeuch, S. & Pirk, H. & Freytag, J.-C., 2016

Abstract: Progressive optimization introduces robustness for database workloads against wrong estimates, skewed data, correlated attributes, or outdated statistics. Previous work focuses on cardinality estimates and relies on expensive counting methods as well as complex learning algorithms. In this paper, we utilize performance counters to drive progressive optimization during query execution. The main advantages are that performance counters introduce virtually no costs on modern CPUs and their usage enables non-invasive monitoring. We present fine-grained cost models to detect differences between estimates and actual costs, which enables us to kick-start reoptimization. Based on our cost models, we implement an optimization approach that estimates the individual selectivities of a multi-selection query efficiently. Furthermore, we are able to learn properties like sortedness, skew, or correlation during run-time. In our evaluation we show that the overhead of our approach is negligible, while the performance improvements are convincing. Using progressive optimization, we improve runtime up to a factor of three compared to average run-times and up to a factor of 4.5 compared to worst-case run-times. As a result, we avoid costly operator execution orders and thus make query execution highly robust.
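
The core idea of the paper can be illustrated in a few lines. Hardware performance counters are not accessible from portable code, so this sketch stands in for them with selectivities measured directly on an execution prefix; the function name and sampling scheme are illustrative, not from the paper:

```python
def progressive_multi_select(rows, predicates, sample_size=1000):
    """Evaluate a conjunction of selections, re-ordering the predicates by
    selectivities measured while executing a prefix of the input."""
    sample, rest = rows[:sample_size], rows[sample_size:]
    # Phase 1: run all predicates on the prefix and record their hit rates
    # (in the paper, monitoring is driven by performance counters instead).
    selectivity = []
    for i, pred in enumerate(predicates):
        hits = sum(1 for r in sample if pred(r))
        selectivity.append((hits / max(len(sample), 1), i))
    result = [r for r in sample if all(p(r) for p in predicates)]
    # Phase 2: apply the most selective predicate first on the remainder,
    # so the later predicates are evaluated on as few rows as possible.
    order = [i for _, i in sorted(selectivity)]
    result += [r for r in rest if all(predicates[i](r) for i in order)]
    return result
```

The reoptimization is "non-invasive" in the sense that the monitored phase produces exactly the same result as the reordered phase; only the evaluation order changes.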

What Makes a Good Physical Plan? Experiencing Hardware-Conscious Query Optimization with Candomblé


Pirk, H. & Moll, O. & Madden, S., 2016

Abstract: Query optimization is hard and the current proliferation of “modern” hardware does nothing to make it any easier. In addition, the tools that are commonly used by performance engineers, such as compiler intrinsics, static analyzers or hardware performance counters, are neither integrated with data management systems nor easy to learn. This fact makes it (unnecessarily) hard to educate engineers, to prototype and to optimize database query plans for modern hardware. To address this problem, we developed a system called Candomblé that lets database performance engineers interactively examine, optimize and evaluate query plans using a touch-based interface. Candomblé puts attendants in the place of a physical query optimizer that has to rewrite a physical query plan into a better equivalent plan. Attendants experience the challenges of optimizing a physical plan ad hoc for processing devices such as GPUs and CPUs and capture the knowledge they gain in rules to be used by a rule-based optimizer.

Locality-Adaptive Parallel Hash Joins using Hardware Transactional Memory


Shanbhag, A. & Pirk, H. & Madden, S., 2016

Abstract: Previous work has claimed that the best performing implementation of in-memory hash joins is based on (radix-)partitioning of the build-side input. Indeed, despite the overhead of partitioning, the benefits from increased cache-locality and synchronization free parallelism in the build-phase outweigh the costs when the input data is randomly ordered. However, many datasets already exhibit significant spatial locality (i.e., non-randomness) due to the way data items enter the database: through periodic ETL or trickle loaded in the form of transactions. In such cases, the first benefit of partitioning — increased locality — is largely irrelevant. In this paper, we demonstrate how hardware transactional memory (HTM) can render the other benefit, freedom from synchronization, irrelevant as well. Specifically, using careful analysis and engineering, we develop an adaptive hash join implementation that outperforms parallel radix-partitioned hash joins as well as sort-merge joins on data with high spatial locality. In addition, we show how, through lightweight (less than 1% overhead) runtime monitoring of the transaction abort rate, our implementation can detect inputs with low spatial locality and dynamically fall back to radix-partitioning of the build-side input. The result is a hash join implementation that is more than 3 times faster than the state-of-the-art on high-locality data and never more than 1% slower.
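
Hardware transactional memory itself cannot be demonstrated in portable code, but the technique the adaptive join falls back to can. Below is a minimal single-threaded sketch of one radix-partitioning pass over the build-side keys (histogram, prefix sum, scatter); in the real system each resulting partition is sized to fit a cache-resident hash table:

```python
def radix_partition(keys, bits=4):
    """Partition keys by their low `bits` bits in one out-of-place pass."""
    fanout = 1 << bits
    mask = fanout - 1
    # Histogram pass: count the keys falling into each partition.
    hist = [0] * fanout
    for k in keys:
        hist[k & mask] += 1
    # Prefix sum turns counts into per-partition write offsets.
    offsets, total = [], 0
    for h in hist:
        offsets.append(total)
        total += h
    # Scatter pass: copy each key to its partition's next free slot.
    out = [0] * len(keys)
    cursors = offsets[:]
    for k in keys:
        p = k & mask
        out[cursors[p]] = k
        cursors[p] += 1
    return out, offsets
```

On inputs that already exhibit spatial locality this pass is pure overhead, which is exactly the case where the HTM-based variant in the paper wins.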

... like Commanding an Anthill: A Case for Micro-Distributed (Data) Management Systems


Pirk, H., 2015

Abstract: Computer system architecture has changed: an assembly of autonomous components has replaced the omnipotent CPU and its legion of dumb devices. Database Management System (DBMS) architecture, however, does not yet reflect this change: it is still dominated by a centralized kernel that limits the autonomy of the devices and, thus, their ability to exploit their increased “smartness”. Distributed data management research can serve as an inspiration for an architecture that addresses this problem. However, the respective algorithms were never designed with CPU efficiency in mind, implementing principles like dynamic programming and recursion. More than two decades ago, the transition to memory-resident databases spawned a plethora of research on CPU-efficient query processors. We predict that hardware heterogeneity will trigger a similar line of research on CPU-efficient distributed algorithms and architectures. In this paper, we examine benefits and challenges that come with such a micro-distributed database management system. We also discuss a number of approaches that we consider steps towards a micro-distributed system.

By their fruits shall ye know them: A Data Analyst's Perspective on Massively Parallel System Design


Pirk, H. & Madden, S. & Stonebraker, M., 2015

Abstract: Increasingly parallel systems promise a remedy for the current stagnation of single-core performance. However, the battle to find the most appropriate architecture for the resulting massively parallel systems is still ongoing. Currently, there are two active contenders: Massively Parallel Single Instruction Multiple Threads (SIMT) systems such as GPGPUs and Many Core Single Instruction Multiple Data (SIMD) systems such as Intel’s Xeon Phi. While the former is more versatile, the latter is an efficient, time-tested technology with a clear migration path. In this study, we provide a data management perspective to the debate: we study the implementation and performance of a set of common data management operations on an SIMT device (an Nvidia GTX 780) and compare it to a Many Core SIMD system (an Intel Xeon Phi). We interpret the results to pinpoint architectural decisions and tradeoffs that lead to suboptimal performance and point out potential areas for improvement in the next generation of these devices.

Waste Not, Want Not! Managing Relational Data In Asymmetric Memories


PhD Thesis at Universiteit van Amsterdam, 2015 - Pirk, H.

Abstract: Computer systems are not the monolithic machines they used to be. In the early days of computer science (until the late 70s), most computer systems included exactly one component to perform a given task: one (type of) disc for persistence, one CPU for processing and one volatile RAM to hold intermediate data. Today, the architecture has developed into a heterogeneous landscape of components: discs, SSDs, RAM, NVRAM, GPUs and CPUs with a hierarchy of caches. In this thesis, we study the management of relational data in modern, i.e., asymmetric computer systems. We explore different strategies to identify asymmetries in persistent data, map them to asymmetries in the memory landscape and, eventually, exploit them to increase query processing performance. To this end, we study memory-conscious decomposition and storage of data at different granularities: relations, vertical partitions, single attributes as well as individual bits. In the interest of conciseness, we exclude techniques that require auxiliary data structures such as indices or horizontal partitioning, which come with significant maintenance overhead. Further, we argue that, when managing memory-resident data, the problem of optimal data placement is tightly connected to the efficiency of the query processing paradigm and can, therefore, not be studied in isolation. Consequently, we also investigate the connection between storage model and processing paradigm. In the case of decomposition at partition granularity we identify Just-in-Time compilation as the only viable query processing model. In the case of distribution at the granularity of individual bits, we develop a novel processing paradigm that efficiently exploits the asymmetries in the underlying data and memory components.

The DBMS – Your Big Data Sommelier


ICDE, 2015 - Kargin, Y. & Kersten, M. L. & Manegold, S. & Pirk, H.

Abstract: When addressing the problem of "big" data volume, preparation costs are one of the key challenges: the high costs for loading, aggregating and indexing data lead to a long data-to-insight time. In addition to being a nuisance to the end-user, this latency prevents real-time analytics on "big" data. Fortunately, data often comes in semantic chunks such as files that contain data items sharing some characteristics such as acquisition time or location. A data management system that exploits this trait can significantly lower the data preparation costs and the associated data-to-insight time by only investing in the preparation of the relevant chunks. In this paper, we develop such a system as an extension of an existing relational DBMS (MonetDB). To this end, we develop a query processing paradigm and data storage model that are partial-loading aware. The result is a system that can make a 1.2 TB dataset (consisting of 4000 chunks) ready for querying in less than 3 minutes on a single server-class machine while maintaining good query processing performance.

Database Cracking: Fancy Scan, Not Poor Man’s Sort!


DaMoN@SIGMOD, 2014 - Pirk, H. & Petraki, E. & Idreos, S. & Manegold, S. & Kersten, M. L.

Abstract: Database Cracking is an appealing approach to adaptive indexing: on every range-selection query, the data is partitioned using the supplied predicates as pivots. The core of database cracking is, thus, pivoted partitioning. While pivoted partitioning, like scanning, requires a single pass through the data, it tends to have much higher costs due to lower CPU efficiency. In this paper, we conduct an in-depth study of the reasons for the low CPU efficiency of pivoted partitioning. Based on the findings, we develop an optimized version with significantly higher (single-threaded) CPU efficiency. We also develop a number of multi-threaded implementations that are effectively bound by memory bandwidth. Combining all of these optimizations we achieve an implementation that has costs close to or better than an ordinary scan on a variety of systems ranging from low-end (cheaper than 300 dollars) desktop machines to high-end (above 60,000 dollars) servers.
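
For readers unfamiliar with cracking, here is a simplified, single-threaded sketch of the idea (function names are illustrative): each range query partitions only the pieces of the column that its bounds fall into, and a cracker index remembers the resulting boundaries so later queries reuse them:

```python
def crack_in_two(arr, lo, hi, pivot):
    """Partition arr[lo:hi] in place so values < pivot precede values >= pivot
    (the pivoted-partitioning core of cracking); returns the split point."""
    i, j = lo, hi - 1
    while i <= j:
        if arr[i] < pivot:
            i += 1
        else:
            arr[i], arr[j] = arr[j], arr[i]
            j -= 1
    return i

def cracking_select(arr, index, low, high):
    """Answer low <= v < high, cracking only the pieces that contain a bound.
    `index` maps each past pivot to the position where values >= it start."""
    for pivot in (low, high):
        if pivot not in index:
            # Locate the still-uncracked piece enclosing this pivot.
            lo = max([0] + [p for v, p in index.items() if v <= pivot])
            hi = min([len(arr)] + [p for v, p in index.items() if v > pivot])
            index[pivot] = crack_in_two(arr, lo, hi, pivot)
    return arr[index[low]:index[high]]
```

After a few queries the column converges towards sorted order, which is why the paper frames cracking as a contest between a "fancy scan" and a "poor man's sort".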

Waste Not... Efficient Co-Processing Of Relational Data


ICDE, 2014 - Pirk, H. & Manegold, S. & Kersten, M. L.

Abstract: The variety of memory devices in modern computer systems holds opportunities as well as challenges for data management systems. In particular, the exploitation of Graphics Processing Units (GPUs) and their fast memory has been studied quite intensively. However, current approaches treat GPUs as systems in their own right and fail to provide a generic strategy for efficient CPU/GPU cooperation. We propose such a strategy for relational query processing: calculating an approximate result based on lossily compressed, GPU-resident data and refining the result using residuals, i.e., the lost data, on the CPU. We developed the required algorithms, implemented the strategy in an existing DBMS and found up to 8 times performance improvement, even for datasets larger than the available GPU memory.

Hardware-Oblivious Parallelism For In-Memory Column-Stores


VLDB, 2013 - Heimel, M. & Saecker, M. & Pirk, H. & Manegold, S. & Markl, V.

Abstract: The multi-core architectures of today’s computer systems make parallelism a necessity for performance critical applications. Writing such applications in a generic, hardware-oblivious manner is a challenging problem: Current database systems thus rely on labor-intensive and error-prone manual tuning to exploit the full potential of modern parallel hardware architectures like multi-core CPUs and graphics cards. We propose an alternative design for a parallel database engine, based on a single set of hardware-oblivious operators, which are compiled down to the actual hardware at runtime. This design reduces the development overhead for parallel database engines, while achieving competitive performance to hand-tuned systems. We provide a proof-of-concept for this design by integrating operators written using the parallel programming framework OpenCL into the open-source database MonetDB. Following this approach, we achieve efficient, yet highly portable parallel code without the need for optimization by hand. We evaluated our implementation against MonetDB using TPC-H derived queries and observed a performance that rivals that of MonetDB’s query execution on the CPU and surpasses it on the GPU. In addition, we show that the same set of operators runs nearly unchanged on a GPU, demonstrating the feasibility of our approach.

CPU And Cache Efficient Management Of Memory-Resident Databases


ICDE, 2013 - Pirk, H. & Funke, F. & Grund, M. & Neumann, T. & Leser, U. & Manegold, S. & Kemper, A. & Kersten, M. L.

Abstract: Memory-Resident Database Management Systems (MRDBMS) have to be optimized for two resources: CPU cycles and memory bandwidth. To optimize for bandwidth in mixed OLTP/OLAP scenarios, the hybrid or Partially Decomposed Storage Model (PDSM) has been proposed. However, in current implementations, bandwidth savings achieved by partial decomposition come at increased CPU costs. To achieve the aspired bandwidth savings without sacrificing CPU efficiency, we combine partially decomposed storage with Just-in-Time (JiT) compilation of queries, thus eliminating CPU inefficient function calls. Since existing cost based optimization components are not designed for JiT-compiled query execution, we also develop a novel approach to cost modeling and subsequent storage layout optimization. Our evaluation shows that the JiT-based processor maintains the bandwidth savings of previously presented hybrid query processors but outperforms them by two orders of magnitude due to increased CPU efficiency.

Efficient Cross-Device Query Processing


PhD Symposium@VLDB, 2012 - Pirk, H.

Abstract: The increasing diversity of hardware within a single system promises large performance gains but also poses a challenge for data management systems. Strategies for the efficient use of hardware with large performance differences are still lacking. For example, existing research on GPU supported data management largely handles the GPU in isolation from the system’s CPU — The GPU is considered the central processor and the CPU used only to mitigate the GPU’s weaknesses where necessary. To make efficient use of all available devices, we developed a processing strategy that lets unequal devices like GPU and CPU combine their strengths rather than work in isolation. To this end, we decompose relational data into individual bits and place the resulting partitions on the appropriate devices. Operations are processed in phases, each phase executed on one device. This way, we achieve significant performance gains and good load distribution among the available devices in a limited real-life use case. To grow this idea into a generic system, we identify challenges as well as potential hardware configurations and applications that can benefit from this approach.

Scalable Generation Of Synthetic GPS Traces With Real-Life Data Characteristics


TPCTC@VLDB, 2012 - Bösche, K. & Sellam, T. H. J. & Pirk, H. & Beier, R. & Mieth, P. & Manegold, S.

Abstract: Database benchmarking is most valuable if real-life data and workloads are available. However, real-life data (and workloads) are often not publicly available due to IPR constraints or privacy concerns. And even if available, they are often limited regarding scalability and variability of data characteristics. On the other hand, while easily scalable, synthetically generated data often fail to adequately reflect real-life data characteristics. While there are well-established synthetic benchmarks and data generators for, e.g., business data (TPC-C, TPC-H), there is no such up-to-date data generator, let alone benchmark, for spatiotemporal and/or moving objects data. In this work, we present a data generator for spatiotemporal data. More specifically, our data generator produces synthetic GPS traces, mimicking the GPS traces that GPS navigation devices generate. To this end, our generator is fed with real-life statistical profiles derived from the user base and uses real-world road network information. Spatial scalability is achieved by choosing statistics from different regions. The data volume can be scaled by tuning the number and length of the generated trajectories. We compare the generated data to real-life data to demonstrate how well the synthetically generated data reflects real-life data characteristics.

Instant-On Scientific Data Warehouses: Lazy ETL For Data-Intensive Research


BIRTE@VLDB, 2012 - Kargin, Y. & Pirk, H. & Ivanova, M. G. & Manegold, S. & Kersten, M. L.

Abstract: In the dawning era of data-intensive research, scientific discovery deploys data analysis techniques similar to those that drive business intelligence. Similar to classical Extract, Transform and Load (ETL) processes, data is loaded entirely from external data sources (repositories) into a scientific data warehouse before it can be analyzed. This process is both time- and resource-intensive and may not be entirely necessary if only a subset of the data is of interest to a particular user. To overcome this problem, we propose a novel technique to lower the costs for data loading: Lazy ETL. Data is extracted and loaded transparently on-the-fly only for the required data items. Extensive experiments demonstrate the significant reduction of the time from source data availability to query answer compared to state-of-the-art solutions. In addition to reducing the costs for bootstrapping a scientific data warehouse, our approach also reduces the costs for loading new incoming data.
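
The gist of Lazy ETL fits in a small sketch. The class below is invented for illustration (it is not the paper's implementation, which extends MonetDB): source chunks are registered as loader functions and only extracted when a query first touches them, after which they stay cached:

```python
class LazyWarehouse:
    """Load source files (chunks) only when a query first touches them."""
    def __init__(self, sources):
        self.sources = sources      # chunk name -> loader function
        self.loaded = {}            # chunk name -> parsed rows (cache)

    def _chunk(self, name):
        if name not in self.loaded:                 # extract + load on demand
            self.loaded[name] = self.sources[name]()
        return self.loaded[name]

    def query(self, chunk_names, predicate):
        """Scan only the chunks this query actually needs."""
        return [row for name in chunk_names
                for row in self._chunk(name) if predicate(row)]
```

The data-to-insight time for the first query is proportional to the chunks it touches, not to the size of the whole repository.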

X-Device Query Processing By Bitwise Distribution


DaMoN@SIGMOD, ACM SIGMOD Record, 2012 - Pirk, H. & Sellam, T. H. J. & Manegold, S. & Kersten, M. L.

Abstract: The diversity of hardware components within a single system calls for strategies for efficient cross-device data processing. For example, existing approaches to CPU/GPU co-processing distribute individual relational operators to the “most appropriate” device. While pleasantly simple, this strategy has a number of problems: it may leave the “inappropriate” devices idle while overloading the “appropriate” device and putting a high pressure on the PCI bus. To address these issues we distribute data among the devices by partially decomposing relations at the granularity of individual bits. Each of the resulting bit-partitions is stored and processed on one of the available devices. Using this strategy, we implemented a processor for spatial range queries that makes efficient use of all available devices. The performance gains achieved indicate that bitwise distribution makes a good cross-device processing strategy.

Accelerating Foreign-Key Joins Using Asymmetric Memory Channels


ADMS@VLDB, 2011 - Pirk, H. & Manegold, S. & Kersten, M. L.

Abstract: Indexed Foreign-Key Joins expose a very asymmetric access pattern: the Foreign-Key Index is sequentially scanned whilst the Primary-Key table is the target of many quasi-random lookups, which are the dominant cost factor. To reduce the costs of the random lookups the fact table can be (re-)partitioned at runtime to increase access locality on the dimension table, and thus limit the random memory access to inside the CPU's cache. However, this is very hard to optimize and the performance impact on recent architectures is limited because the partitioning costs consume most of the achievable join improvement. GPGPUs on the other hand have an architecture that is well suited for this operation: a relatively slow connection to the large system memory and a very fast connection to the smaller internal device memory. We show how to accelerate Foreign-Key Joins by executing the random table lookups on the GPU's VRAM while sequentially streaming the Foreign-Key Index through the PCI-E Bus. We also experimentally study the memory access costs on GPU and CPU to provide estimations of the benefit of this technique.

Cache Conscious Data Layouting For In-Memory Databases


MSc. Thesis at Humboldt-Universität zu Berlin, 2010 - Pirk, H.

Abstract: Many applications with manually implemented data management exhibit a data storage pattern in which semantically related data items are stored closer in memory than unrelated data items. The strong semantic relationship between these data items commonly induces contemporary accesses to them. This is called the principle of data locality and has been recognized by hardware vendors. It is commonly exploited to improve the performance of hardware. General Purpose Database Management Systems (DBMSs), whose main goal is to simplify optimal data storage and processing, generally fall short of this claim because the usage pattern of the stored data cannot be anticipated when designing the system. The current interest in column-oriented databases indicates that one strategy does not fit all applications. A DBMS that automatically adapts its storage strategy to the workload of the database promises a significant performance increase by maximizing the benefit of hardware optimizations that are based on the principle of data locality. This thesis gives an overview of optimizations that are based on the principle of data locality and the effect they have on the data access performance of applications. Based on the findings, a model is introduced that allows an estimation of the costs of data accesses based on the arrangement of the data in the main memory. This model is evaluated through a series of experiments and incorporated into an automatic layouting component for a DBMS. This layouting component allows the calculation of an analytically optimal storage layout. The performance benefits brought by this component are evaluated in an application benchmark.
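
A deliberately crude version of such a cost model (much simpler than the one in the thesis; all parameters here are illustrative) already shows why layout matters: a sequential scan fetches every cache line it touches once, and a row layout drags unused attributes through the cache:

```python
def scan_cache_lines(n_rows, n_attrs, attrs_used, attr_bytes=4, line_bytes=64):
    """Estimate cache lines fetched by a full sequential scan under two
    layouts: N-ary (row) storage vs. fully decomposed (column) storage."""
    row_bytes = n_rows * n_attrs * attr_bytes       # whole tuples are read
    col_bytes = n_rows * attrs_used * attr_bytes    # only the used columns
    to_lines = lambda b: -(-b // line_bytes)        # ceiling division
    return {"row_store": to_lines(row_bytes), "column_store": to_lines(col_bytes)}

def pick_layout(n_rows, n_attrs, attrs_used):
    """Choose the layout with the lower estimated memory traffic."""
    cost = scan_cache_lines(n_rows, n_attrs, attrs_used)
    return min(cost, key=cost.get)
```

The real model additionally accounts for access patterns other than full scans, partial decompositions between these two extremes, and the cache hierarchy.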

Full list of publications (CWI scientific repository)



Voodoo

While benefiting performance, any optimization technique further complicates the design of data management systems. The fundamental problem is that the design space grows combinatorially with the number of techniques and technologies, making the current development process unsustainable. In addition to this growing complexity, performance engineering is complicated by tools and paradigms that obfuscate the design space and obstruct plan transformations. Recognizing this, we developed Voodoo: a vector-oriented abstraction layer between a data management system frontend (relational, graph-oriented or others) and the hardware. It supports the implementation of different data models while simplifying the application of many tuning techniques and generating highly efficient executable code. It does so by implementing a novel concept called Controlled Folding. The resulting programming model structures the design space such that conceptually close techniques (e.g., different flavors of parallelism) are encoded in similar plans. This makes equivalent plan transformations simple, non-equivalent transformations hard and emphasizes design choices that are often obscured in languages like C++.
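
As a rough illustration of the control-vector idea (the operator names below are invented, not actual Voodoo operators): a control vector assigns every element to a fold lane, so the same plan runs as a sequential fold or as a partitioned, SIMD/multicore-style fold depending only on the control vector, not on the plan itself:

```python
def vmap(f, vec):
    """Element-wise map: one declarative, vector-oriented operation."""
    return [f(x) for x in vec]

def vfold(vec, ctrl, op, init):
    """Fold vec into one partial result per lane named by ctrl.
    The control vector, not the plan, decides how work is partitioned."""
    acc = {}
    for lane, x in zip(ctrl, vec):
        acc[lane] = op(acc.get(lane, init), x)
    return acc

values = vmap(lambda x: x, [3, 1, 4, 1, 5, 9, 2, 6])
sequential = [0] * len(values)                    # one lane: a plain fold
two_way = [i % 2 for i in range(len(values))]     # two lanes: parallel-style fold
total = sum(vfold(values, sequential, lambda a, b: a + b, 0).values())
partial = vfold(values, two_way, lambda a, b: a + b, 0)
```

Because switching between `sequential` and `two_way` changes only data, not code, conceptually close execution strategies stay close as plans, which is the property the paragraph above describes.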

Bitwise Decomposition

Co-processors such as GPUs have been recognized as beneficial for data intensive applications because they offer orders of magnitude higher bandwidth and processing capacity than CPUs. However, the von-Neumann architecture combined with the limited capacity of the GPU's memory exposes the PCI-Bus as the major bottleneck holding back the adoption of GPUs for analytical data processing.
Based on the insight that GPUs are always connected to a host system with processing resources of its own, I developed the Bitwise Decomposed Processing Model. In this model, the PCI bottleneck is addressed by distributing data as well as processing among CPU and GPU: the GPU holds an approximation of the database and produces an approximate result that is subsequently refined into an accurate answer on the CPU. This strategy yields a speedup of up to 6 times over CPU-only processing even for datasets larger than the GPU's internal memory. This scheme was, to the best of my knowledge, the first to introduce such co-processing into the domain of data management systems.
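
The decomposition can be sketched in a few lines (illustrative 8-bit values; the real system works on compressed relational columns): the high bits act as the GPU-resident approximation, the low bits as the CPU-resident residuals, and only tuples whose approximate value straddles a query bound need refinement:

```python
def decompose(values, high_bits=4, width=8):
    """Split each value into high bits (the device-resident approximation)
    and low-bit residuals (kept on the host for refinement)."""
    shift = width - high_bits
    approx = [v >> shift for v in values]
    resid = [v & ((1 << shift) - 1) for v in values]
    return approx, resid, shift

def range_query(values, lo, hi, high_bits=4, width=8):
    """Answer lo <= v < hi: decide on the approximation where possible,
    refine the boundary cases with the residuals."""
    approx, resid, shift = decompose(values, high_bits, width)
    hits, candidates = [], []
    for i, a in enumerate(approx):
        v_min, v_max = a << shift, (a << shift) | ((1 << shift) - 1)
        if lo <= v_min and v_max < hi:
            hits.append(i)            # approximation alone is decisive
        elif v_max >= lo and v_min < hi:
            candidates.append(i)      # boundary case: needs the residual
    for i in candidates:              # "CPU" refinement phase
        v = (approx[i] << shift) | resid[i]
        if lo <= v < hi:
            hits.append(i)
    return sorted(hits)
```

Only the small approximation has to fit in device memory and only the boundary candidates cross the PCI bus, which is how the scheme sidesteps the bottleneck described above.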

Cache and CPU Efficient Analytics

Virtually all general-purpose data management systems co-locate the values of a tuple (N-ary storage). However, co-locating the values of a column (decomposed storage) improves the performance of many analytical queries by orders of magnitude. For applications that are not purely analytical, hybrid systems improve cache-locality by adapting the storage model to the workload. However, to process tuples rather than single columns, such systems implement an iterator-based design that hurts CPU efficiency, yielding systems that perform worse than pure N-ary or decomposed storage implementations. I found that hybrid storage can be combined with executable code generation to preserve CPU efficiency. To automatically select a storage strategy, I developed a cache-conscious cost model with much higher accuracy than previously proposed models. The resulting query processor outperformed decomposed storage by up to 4 times and N-ary storage by more than 10 times.
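
The code-generation half of this idea can be sketched with Python's `exec` standing in for an LLVM-style compiler (the layout encoding and function names are invented for illustration): given a partially decomposed layout, a scan specialized to that exact layout is generated at run time, eliminating the per-attribute function calls of an iterator-based design:

```python
def compile_scan(layout, used_attrs):
    """Generate a scan specialized to a partially decomposed layout.
    layout: attr name -> (partition index, offset inside partition tuple)."""
    accessors = [f"p[{layout[a][0]}][i][{layout[a][1]}]" for a in used_attrs]
    src = (f"def scan(p, n):\n"
           f"    return [({', '.join(accessors)}) for i in range(n)]\n")
    env = {}
    exec(src, env)        # run-time "JiT compilation" of the specialized scan
    return env["scan"]
```

Every attribute access is inlined into the generated loop, so the hybrid layout's bandwidth savings are no longer paid for with interpretation overhead.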


Data analysis using facet attributes

US Patent, 2011 - M. Behnen, R. Cole, Q. Jin, T. Pfahl, H. Pirk

Abstract: A method, computer program product, and system for analyzing data using a data warehouse application are provided. The method, computer program product, and system provide for displaying a facet in a user interface of the data warehouse application, the facet classifying a plurality of documents, and displaying a facet attribute of the facet in the user interface of the data warehouse application, the facet attribute corresponding to a characteristic associated with each of the plurality of documents classified by the facet.

Cube faceted data analysis

US Patent, 2010 - M. Behnen, Q. Jin, T. Pfahl, H. Pirk

Abstract: Methods, systems, and computer readable medium for displaying results of a search query. In one implementation, the method includes receiving a query, obtaining documents that satisfy the query, constructing a facet hierarchy based on documents that satisfy the query, creating a cube structure based on the facet hierarchy, and displaying a multi-dimensional search interface based on the cube structure.