nativeNDP: processing big data analytics on native storage nodes
- Data analytics tasks on large datasets are computationally intensive and often demand the compute power of cluster environments. Yet data cleansing, preparation, dataset characterization, and the computation of statistics or metrics are frequent steps. They are mostly performed ad hoc, in an explorative manner, and mandate low response times. However, such steps are I/O intensive and typically very slow due to low data locality and inadequate interfaces and abstractions along the stack. These shortcomings typically result in prohibitively expensive scans of the full dataset and transformations on interface boundaries. In this paper, we examine R as an analytical tool, managing large persistent datasets in Ceph, a widespread cluster file system. We propose nativeNDP, a framework for Near Data Processing that pushes down primitive R tasks and executes them in situ, directly within the storage device of a cluster node. Across a range of data sizes, we show that nativeNDP is more than an order of magnitude faster than other pushdown alternatives.
Authors at HS Reutlingen: | Vinçon, Tobias; Riegger, Christian; Petrov, Ilia |
---|---|
DOI: | https://doi.org/10.1007/978-3-030-28730-6_9 |
ISBN: | 978-3-030-28730-6 |
Published in: | Advances in databases and information systems : 23rd European Conference, ADBIS 2019, Bled, Slovenia, September 8–11, 2019, proceedings. - (Lecture notes in computer science ; 11695) |
Publisher: | Springer |
Place of publication: | Cham |
Editor: | Tatjana Welzer |
Document Type: | Conference proceeding |
Language: | English |
Publication year: | 2019 |
Tag: | cluster; in-storage processing; native storage; near-data processing |
Page Number: | 12 |
First Page: | 139 |
Last Page: | 150 |
PPN: | View in the Reutlingen University catalog |
DDC classes: | 004 Computer science |
Open access?: | No |
Licence (German): | In Copyright - Urheberrechtlich geschützt |