Refine
Document Type
- Conference proceeding (38)
- Journal article (5)
- Book chapter (5)
Has full text
- yes (48)
Is part of the Bibliography
- yes (48)
Institute
- Informatik (48)
Publisher
- Association for Computing Machinery (15)
- Springer (9)
- IEEE (7)
- OpenProceedings (4)
- Gesellschaft für Informatik e.V (3)
- Universität Konstanz (3)
- IARIA (2)
- Association of Computing Machinery (1)
- CIDR (1)
- SciTePress (1)
Near-data processing in database systems on native computational storage under HTAP workloads
(2022)
Today’s Hybrid Transactional and Analytical Processing (HTAP) systems, tackle the ever-growing data in combination with a mixture of transactional and analytical workloads. While optimizing for aspects such as data freshness and performance isolation, they build on the traditional data-to-code principle and may trigger massive cold data transfers that impair the overall performance and scalability. Firstly, in this paper we show that Near-Data Processing (NDP) naturally fits in the HTAP design space. Secondly, we propose an NDP database architecture, allowing transactionally consistent in-situ executions of analytical operations in HTAP settings. We evaluate the proposed architecture in state-of-the-art key/value-stores and multi-versioned DBMS. In contrast to traditional setups, our approach yields robust, resource- and cost-effcient performance.
Rapidly growing data volumes push today's analytical systems close to the feasible processing limit. Massive parallelism is one possible solution to reduce the computational time of analytical algorithms. However, data transfer becomes a significant bottleneck since it blocks system resources moving data-to-code. Technological advances allow to economically place compute units close to storage and perform data processing operations close to data, minimizing data transfers and increasing scalability. Hence the principle of Near Data Processing (NDP) and the shift towards code-to-data. In the present paper we claim that the development of NDP-system architectures becomes an inevitable task in the future. Analytical DBMS like HPE Vertica have multiple points of impact with major advantages which are presented within this paper.
Data analytics tasks on large datasets are computationally intensive and often demand the compute power of cluster environments. Yet, data cleansing, preparation, dataset characterization and statistics or metrics computation steps are frequent. These are mostly performed ad hoc, in an explorative manner and mandate low response times. But, such steps are I/O intensive and typically very slow due to low data locality, inadequate interfaces and abstractions along the stack. These typically result in prohibitively expensive scans of the full dataset and transformations on interface boundaries.
In this paper, we examine R as analytical tool, managing large persistent datasets in Ceph, a wide-spread cluster file-system. We propose nativeNDP – a framework for Near Data Processing that pushes down primitive R tasks and executes them in-situ, directly within the storage device of a cluster-node. Across a range of data sizes, we show that nativeNDP is more than an order of magnitude faster than other pushdown alternatives.
In the present tutorial we perform a cross-cut analysis of database storage management from the perspective of modern storage technologies. We argue that neither the design of modern DBMS, nor the architecture of modern storage technologies are aligned with each other. Moreover, the majority of the systems rely on a complex multi-layer and compatibility oriented storage stack. The result is needlessly suboptimal DBMS performance, inefficient utilization, or significant write amplification due to outdated abstractions and interfaces. In the present tutorial we focus on the concept of native storage, which is storage operated without intermediate abstraction layers over an open native storage interface and is directly controlled by the DBMS.
Modern mixed (HTAP)workloads execute fast update-transactions and long running analytical queries on the same dataset and system. In multi-version (MVCC) systems, such workloads result in many short-lived versions and long version-chains as well as in increased and frequent maintenance overhead.
Consequently, the index pressure increases significantly. Firstly, the frequent modifications cause frequent creation of new versions, yielding a surge in index maintenance overhead. Secondly and more importantly, index-scans incur extra I/O overhead to determine, which of the resulting tuple versions are visible to the executing transaction (visibility-check) as current designs only store version/timestamp information in the base table – not in the index. Such index-only visibility-check is critical for HTAP workloads on large datasets.
In this paper we propose the Multi Version Partitioned B-Tree (MV-PBT) as a version-aware index structure, supporting index-only visibility checks and flash-friendly I/O patterns. The experimental evaluation indicates a 2x improvement for analytical queries and 15% higher transactional throughput under HTAP workloads. MV-PBT offers 40% higher tx. throughput compared to WiredTiger’s LSM-Tree implementation under YCSB.
An index in a Multi-Version DBMS (MV-DBMS) has to reflect different tuple versions of a single data item. Existing approaches follow the paradigm of logically separating the tuple version data from the data item, e.g. an index is only allowed to return at most one version of a single data item (while it may return multiple data items that match a search criteria). Hence to determine the valid (and therefore visible) tuple version of a data item, the MV-DBMS first fetches all tuple versions that match the search criteria and subsequently filters visible versions using visibility checks. This involves I/O storage accesses to tuple versions that do not have to be fetched. In this vision paper we present the Multi Version Index (MV-IDX) approach that allows index-only visibility checks which significantly reduce the amount of I/O storage accesses as well as the index maintenance overhead. The MV-IDX achieves significantly lower response times and higher transactional throughput on OLTP workloads.
Characteristics of modern computing and storage technologies fundamentally differ from traditional hardware. There is a need to optimally leverage their performance, endurance and energy consumption characteristics. Therefore, existing architectures and algorithms in modern high performance database management systems have to be redesigned and advanced. Multi Version Concurrency Control (MVCC) approaches in data-base management systems maintain multiple physically independent tuple versions. Snapshot isolation approaches enable high parallelism and concurrency in workloads with almost serializable consistency level. Modern hardware technologies benefit from multi-version approaches. Indexing multi-version data on modern hardware is still an open research area. In this paper, we provide a survey of popular multi-version indexing approaches and an extended scope of high performance single-version approaches. An optimal multi-version index structure brings look-up efficiency of tuple versions, which are visible to transactions, and effort on index maintenance in balance for different workloads on modern hardware technologies.
Many future Services Oriented Architecture (SOA) systems may be pervasive SmartLife applications that provide real-time support for users in everyday tasks and situations. Development of such applications will be challenging, but in this position paper we argue that their ongoing maintenance may be even more so. Ontological modelling of the application may help to ease this burden, but maintainers need to understand a system at many levels, from a broad architectural perspective down to the internals of deployed components. Thus we will need consistent models that span the range of views, from business processes through system architecture to maintainable code. We provide an initial example of such a modelling approach and illustrate its application in a semantic browser to aid in software maintenance tasks.
Nowadays almost every major company has a monitoring system and produces log data to analyse their systems. To perform analysation on the log data and to extract experience for future decisions it is important to transform and synchronize different time series. For synchronizing multiple time series several methods are provided so that they are leading to a synchronized uniform time series. This is achieved by using discretisation and approximation methodics. Furthermore the discretisation through ticks is demonstrated, as well as the respectivly illustrated results.
We introduce IPA-IDX – an approach to handle index modifications modern storage technologies (NVM, Flash) as physical in-place appends, using simplified physiological log records. IPA-IDX provides similar performance and longevity advantages for indexes as basic IPA [5] does for tables. The selective application of IPA-IDX and basic IPA to certain regions and objects, lowers the GC overhead by over 60%, while keeping the total space overhead to 2%. The combined effect of IPA and IPA-IDX increases performance by 28%.
Introduction to the special issue on self‑managing and hardware‑optimized database systems 2022
(2023)
Data management systems have evolved in terms of functionality, performance characteristics, complexity, and variety during the last 40 years. Particularly, the relational database management systems and the big data systems (e.g., Key-Value stores, Document stores, Graph stores and Graph Computation Systems, Spark, MapReduce/Hadoop, or Data Stream Processing Systems) have evolved with novel additions and extensions. However, the systems administration and tasks have become highly complex and expensive, especially given the simultaneous and rapid hardware evolution in processors, memory, storage, or networking. These developments present new open problems and challenges to data management systems as well as new opportunities.
The SMDB (International Workshop on Self-Managing Database Systems) and HardBD&Active (Joint International Workshop on Big Data Management on Emerging Hardware and Data Management on Virtualized Active Systems) workshops organized in conjunction with the IEEE ICDE (International Conference on Data Engineering) offered two distinct platforms for examining the above system-related challenges from different perspectives. The SMDB workshop looks into developing autonomic or self-* features in database and data management systems to tackle complex administrative tasks, while the HardBD&Active workshop focuses on harnessing hardware technologies to enhance efficiency and performance of data processing and management tasks. As a result of these workshops, we are delighted to present the third special issue of DAPD titled “Self-Managing and Hardware-Optimized Database Systems 2022,” which showcases the best contributions from the SMDB 2021/2022 and HardBD&Active 2021/2022 workshops.
Database Management Systems (DBMS) need to handle large updatable datasets in on-line transaction processing (OLTP) workloads. Most modern DBMS provide snapshots of data in multi-version concurrency control (MVCC) transaction management scheme. Each transaction operates on a snapshot of the database, which is calculated from a set of tuple versions. High parallelism and resource-efficient append-only data placement on secondary storage is enabled. One major issue in indexing tuple versions on modern hardware technologies is the high write amplification for tree-indexes.
Partitioned B-Trees (PBT) [5] is based on the structure of the ubiquitous B+ Tree [8]. They achieve a near optimal write amplification and beneficial sequential writes on secondary storage. Yet they have not been implemented in a MVCC enabled DBMS to date.
In this paper we present the implementation of PBTs in PostgreSQL extended with SIAS. Compared to PostgreSQL’s B+–Trees PBTs have 50% better transaction throughput under TPC-C and a 30% improvement to standard PostgreSQL with Heap-Only Tuples.
In the present paper we demonstrate a novel approach to handling small updates on Flash called In-Place Appends (IPA). It allows the DBMS to revisit the traditional write behavior on Flash. Instead of writing whole database pages upon an update in an out-of-place manner on Flash, we transform those small updates into update deltas and append them to a reserved area on the very same physical Flash page. In doing so we utilize the commonly ignored fact that under certain conditions Flash memories can support in-place updates to Flash pages without a preceding erase operation.
The approach was implemented under Shore-MT and evaluated on real hardware. Under standard update-intensive workloads we observed 67% less page invalidations resulting in 80% lower garbage collection overhead, which yields a 45% increase in transactional throughput, while doubling Flash longevity at the same time. The IPA outperforms In-Page Logging (IPL) by more than 50%.
We showcase a Shore-MT based prototype of the above approach, operating on real Flash hardware – the OpenSSD Flash research platform. During the demonstration we allow the users to interact with the system and gain hands on experience of its performance under different demonstration scenarios. These involve various workloads such as TPC-B, TPC-C or TATP.
A transaction is a demarcated sequence of application operations, for which the following properties are guaranteed by the underlying transaction processing system (TPS): atomicity, consistency, isolation, and durability (ACID). Transactions are therefore a general abstraction, provided by TPS that simplifies application development by relieving transactional applications from the burden of concurrency and failure handling. Apart from the ACID properties, a TPS must guarantee high and robust performance (high transactional throughput and low response times), high reliability (no data loss, ability to recover last consistent state, fault tolerance), and high availability (infrequent outages, short recovery times).
The architectures and workhorse algorithms of a high-performance TPS are built around the properties of the underlying hardware. The introduction of nonvolatile memories (NVM) as novel storage technology opens an entire new problem space, with the need to revise aspects such as the virtual memory hierarchy, storage management and data placement, access paths, and indexing. NVM are also referred to as storage-class memory (SCM).
Under update intensive workloads (TPC, LinkBench) small updates dominate the write behavior, e.g. 70% of all updates change less than 10 bytes across all TPC OLTP workloads. These are typically performed as in-place updates and result in random writes in page-granularity, causing major write-overhead on Flash storage, a write amplification of several hundred times and lower device longevity.
In this paper we propose an approach that transforms those small in-place updates into small update deltas that are appended to the original page. We utilize the commonly ignored fact that modern Flash memories (SLC, MLC, 3D NAND) can handle appends to already programmed physical pages by using various low-level techniques such as ISPP to avoid expensive erases and page migrations. Furthermore, we extend the traditional NSM page-layout with a delta-record area that can absorb those small updates. We propose a scheme to control the write behavior as well as the space allocation and sizing of database pages.
The proposed approach has been implemented under Shore- MT and evaluated on real Flash hardware (OpenSSD) and a Flash emulator. Compared to In-Page Logging it performs up to 62% less reads and writes and up to 74% less erases on a range of workloads. The experimental evaluation indicates: (i) significant reduction of erase operations resulting in twice the longevity of Flash devices under update-intensive workloads; (ii) 15%-60% lower read/write I/O latencies; (iii) up to 45% higher transactional throughput; (iv) 2x to 3x reduction in overall write
amplification.
Near-Data Processing is a promising approach to overcome the limitations of slow I/O interfaces in the quest to analyze the ever-growing amount of data stored in database systems. Next to CPUs, FPGAs will play an important role for the realization of functional units operating close to data stored in non-volatile memories such as Flash.It is essential that the NDP-device understands formats and layouts of the persistent data, to perform operations in-situ. To this end, carefully optimized format parsers and layout accessors are needed. However, designing such FPGA-based Near-Data Processing accelerators requires significant effort and expertise. To make FPGA-based Near-Data Processing accessible to non-FPGA experts, we will present a framework for the automatic generation of FPGA-based accelerators capable of data filtering and transformation for key-value stores based on simple data-format specifications.The evaluation shows that our framework is able to generate accelerators that are almost identical in performance compared to the manually optimized designs of prior work, while requiring little to no FPGA-specific knowledge and additionally providing improved flexibility and more powerful functionality.
Blockchains yield to new workloads in database management systems and K/V-stores. Distributed Ledger Technology (DLT) is a technique for managing transactions in ’trustless’ distributed systems. Yet, clients of nodes in blockchain networks are backed by ’trustworthy’ K/V-Stores, like LevelDB or RocksDB in Ethereum, which are based on Log-Structured Merge Trees (LSM Trees). However, LSM-Trees do not fully match the properties of blockchains and enterprise workloads.
In this paper, we claim that Partitioned B-Trees (PBT) fit the properties of this DLT: uniformly distributed hash keys, immutability, consensus, invalid blocks, unspent and off-chain transactions, reorganization and data state / version ordering in a distributed log-structure. PBT can locate records of newly inserted key-value pairs, as well as data of unspent transactions, in separate partitions in main memory. Once several blocks acquire consensus, PBTs evict a whole partition, which becomes immutable, to secondary storage. This behavior minimizes write amplification and enables a beneficial sequential write pattern on modern hardware. Furthermore, DLT implicate some type of log-based versioning. PBTs can serve as MV-store for data storage of logical blocks and indexing in multi-version concurrency control (MVCC) transaction processing.
In this paper we build on our research in data management on native Flash storage. In particular we demonstrate the advantages of intelligent data placement strategies. To effectively manage phsical Flash space and organize the data on it, we utilize novel storage structures such as regions and groups. These are coupled to common DBMS logical structures, thus require no extra overhead for the DBA. The experimental results indicate an improvement of up to 2x, which doubles the longevity of Flash SSD. During the demonstration the audience can experience the advantages of the proposed approach on real Flash hardware.
In the present tutorial we perform a cross-cut analysis of database systems from the perspective of modern storage technology, namely Flash memory. We argue that neither the design of modern DBMS, nor the architecture of flash storage technologies are aligned with each other. The result is needlessly suboptimal DBMS performance and inefficient flash utilisation as well as low flash storage endurance and reliability. We showcase new DBMS approaches with improved algorithms and leaner architectures, designed to leverage the properties of modern storage technologies. We cover the area of transaction management and multi-versioning, putting a special emphasis on: (i) version organisation models and invalidation mechanisms in multi-versioning DBMS; (ii) Flash storage management especially on append-based storage in tuple granularity; (iii) Flash-friendly buffer management; as well as (iv) improvements in the searching and indexing models. Furthermore, we present our NoFTL approach to native Flash access that integrates parts of the flash-management functionality into the DBMS yielding significant performance increase and simplification of the I/O stack. In addition, we cover the basics of building large Flash storage for DBMS and revisit some of the RAID techniques and principles.
The use of Wireless Sensor and Actuator Networks (WSAN) as an enabling technology for Cyber-Physical Systems has increased significantly in recent past. The challenges that arise in different application areas of Cyber- Physical Systems, in general, and in WSAN in particular, are getting the attention of academia and industry both. Since reliability issues for message delivery in wireless communication are of critical importance for certain safety related applications, it is one of the areas that has received significant focus in the research community. Additionally, the diverse needs of different applications put different demands on the lower layers in the protocol stack, thus necessitating such mechanisms in place in the lower layers which enable them to dynamically adapt. Another major issue in the realization of networked wirelessly communicating cyber-physical systems, in general, and WSAN, in particular, is the lack of approaches that tackle the reliability, configurability and application awareness issues together. One could consider tackling these issues in isolation. However, the interplay between these issues create such challenges that make the application developers spend more time on meeting these challenges, and that too not in very optimal ways, than spending their time on solving the problems related to the application being developed. Starting from some fundamental concepts, general issues and problems in cyber-physical systems, this chapter discusses such issues like energy-efficiency, application and channel-awareness for networked wirelessly communicating cyber-physical systems. Additionally, the chapter describes a middleware approach called CEACH, which is an acronym for Configurable, Energy-efficient, Application- and Channel-aware Clustering based middleware service for cyber-physical systems. The state of-the art in the area of cyberphysical systems with a special focus on communication reliability, configurability, application- and channel-awareness is described in the chapter. The chapter also describes how these features have been considered in the CEACH approach. Important node level and network level characteristics and their significance vis-àvis the design of applications for cyber physical systems is also discussed. The issue of adaptively controlling the impact of these factors vis-à-vis the application demands and network conditions is also discussed. The chapter also includes a description of Fuzzy-CEACH which is an extension of CEACH middleware service and which uses fuzzy logic principles. The fuzzy descriptors used in different stages of Fuzzy-CEACH have also been described. The fuzzy inference engine used in the Fuzzy-CEACH cluster head election process is described in detail. The Rule-Bases used by fuzzy inference engine in different stages of Fuzzy-CEACH is also included to show an insightful description of the protocol. The chapter also discusses in detail the experimental results validating the authenticity of the presented concepts in the CEACH approach. The applicability of the CEACH middleware service in different application scenarios in the domain of cyberphysical systems is also discussed. The chapter concludes by shedding light on the Publish-Subscribe mechanisms in distributed event-based systems and showing how they can make use of the CEACH middleware to reliably communicate detected events to the event-consumers or the actuators if the WSAN is modeled as a distributed event-based system.