DOI: 10.1145/3183713.3196927

Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging

Authors: Niv Dayan, Stratos Idreos
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
Pages 505 - 520
Published: 27 May 2018

Abstract

In this paper, we show that all mainstream LSM-tree based key-value stores in the literature and in industry are suboptimal with respect to how they trade off among the I/O costs of updates, point lookups, range lookups, as well as the cost of storage, measured as space-amplification. The reason is that they perform expensive merge operations in order to (1) bound the number of runs that a lookup has to probe, and to (2) remove obsolete entries to reclaim space. However, most of these merge operations reduce point lookup cost, long range lookup cost, and space-amplification by a negligible amount.
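The merge-policy trade-off described above can be illustrated with a toy simulator. This is a sketch, not any real store's compaction logic: under tiering, a level accumulates T runs and merges them once; under leveling, every arriving run is merged immediately into the level's single run. The size ratio T=4 and the unit-sized flushes are illustrative assumptions.

```python
def simulate(policy, flushes, T=4):
    """Apply `flushes` unit-sized buffer flushes under a merge policy
    and return (units_rewritten, runs_remaining). A run is modeled as
    just its size; merging sums sizes and tallies rewritten entries."""
    levels = []              # levels[i] = list of run sizes at level i
    rewritten = 0
    for _ in range(flushes):
        incoming, i = 1, 0   # each flush pushes a size-1 run into level 0
        while incoming:
            if i == len(levels):
                levels.append([])
            if policy == "tiering":
                levels[i].append(incoming)
                incoming = 0
                if len(levels[i]) == T:      # T runs piled up: merge once
                    merged = sum(levels[i])
                    rewritten += merged
                    levels[i] = []
                    incoming, i = merged, i + 1
            else:                            # leveling: merge on every arrival
                merged = incoming + sum(levels[i])
                rewritten += merged
                levels[i], incoming = [merged], 0
                if merged >= T ** (i + 1):   # level at capacity: spill down
                    levels[i] = []
                    incoming, i = merged, i + 1
    return rewritten, sum(len(l) for l in levels)
```

With T=4 and 15 flushes, tiering rewrites only 12 units but leaves 6 runs for lookups to probe, while leveling rewrites 60 units yet leaves only 2 runs. Most of leveling's extra rewriting happens at the smaller levels, which is exactly the merging the paper argues buys little.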
To address this problem, we expand the LSM-tree design space with Lazy Leveling, a new design that prohibits merge operations at all levels of LSM-tree but the largest. We show that Lazy Leveling improves the worst-case cost complexity of updates while maintaining the same bounds on point lookup cost, long range lookup cost, and space-amplification. To be able to navigate between Lazy Leveling and other designs, we make the LSM-tree design space fluid by introducing Fluid LSM-tree, a generalization of LSM-tree that can be parameterized to assume all existing LSM-tree designs. We show how to fluidly transition from Lazy Leveling to (1) designs that are more optimized for updates by merging less at the largest level, and (2) designs that are more optimized for small range lookups by merging more at all other levels.
We put everything together to design Dostoevsky, a key-value store that navigates the entire Fluid LSM-tree design space based on the application workload and hardware to maximize throughput using a novel closed-form performance model. We implemented Dostoevsky on top of RocksDB, and we show that it strictly dominates state-of-the-art LSM-tree based key-value stores in terms of performance and space-amplification.
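The Fluid LSM-tree parameterization mentioned above can be sketched as follows, using the paper's knobs: T is the size ratio, K bounds the number of runs at each of the smaller levels, and Z bounds the runs at the largest level. The classic designs fall out as corner cases, and without Bloom filters a worst-case point lookup over L levels probes at most K*(L-1)+Z runs. The helper names here are illustrative, not from the paper's code.

```python
def design(T, K, Z):
    """Name the LSM-tree variant that a (T, K, Z) setting yields:
    K bounds runs per smaller level, Z bounds runs at the largest."""
    if K == 1 and Z == 1:
        return "leveling"
    if K == T - 1 and Z == T - 1:
        return "tiering"
    if K == T - 1 and Z == 1:
        return "lazy leveling"
    return "hybrid"

def max_runs_probed(L, K, Z):
    """Worst-case runs a point lookup visits across L levels when no
    Bloom filters prune the search: K runs at each of the L-1 smaller
    levels plus Z runs at the largest level."""
    return K * (L - 1) + Z
```

Dostoevsky's tuning can then be viewed as enumerating candidate (T, K, Z) settings and picking the one whose modeled throughput under the given workload and hardware is highest; the closed-form cost model itself is developed in the paper.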



Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018, 1874 pages
ISBN: 9781450347037
DOI: 10.1145/3183713

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. bloom filters
  2. compaction
  3. lsm-tree

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '18

Acceptance Rates

SIGMOD '18: 90 of 461 submissions accepted (20%)
Overall: 785 of 4,003 submissions accepted (20%)


Cited By

  • COLE. Proceedings of the 22nd USENIX Conference on File and Storage Technologies, pages 329-346, 27 Feb 2024. DOI: 10.5555/3650697.3650717
  • FluidKV: Seamlessly Bridging the Gap between Indexing Performance and Memory-Footprint on Ultra-Fast Storage. Proceedings of the VLDB Endowment, 17(6):1377-1390, 1 Feb 2024. DOI: 10.14778/3648160.3648177
  • SIndex: An SSD-based Large-scale Indexing with Deterministic Latency for Cloud Block Storage. Proceedings of the 53rd International Conference on Parallel Processing, pages 1237-1246, 12 Aug 2024. DOI: 10.1145/3673038.3673041
  • Advocating for Key-Value Stores with Workload Pattern Aware Dynamic Compaction. Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, pages 124-130, 8 Jul 2024. DOI: 10.1145/3655038.3665955
  • Can Modern LLMs Tune and Configure LSM-based Key-Value Stores? Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, pages 116-123, 8 Jul 2024. DOI: 10.1145/3655038.3665954
  • Structural Designs Meet Optimality: Exploring Optimized LSM-tree Structures in a Colossal Configuration Space. Proceedings of the ACM on Management of Data, 2(3):1-26, 30 May 2024. DOI: 10.1145/3654978
  • Optimizing Time Series Queries with Versions. Proceedings of the ACM on Management of Data, 2(3):1-27, 30 May 2024. DOI: 10.1145/3654962
  • GRF: A Global Range Filter for LSM-Trees with Shape Encoding. Proceedings of the ACM on Management of Data, 2(3):1-27, 30 May 2024. DOI: 10.1145/3654944
  • CaaS-LSM: Compaction-as-a-Service for LSM-based Key-Value Stores in Storage Disaggregated Infrastructure. Proceedings of the ACM on Management of Data, 2(3):1-28, 30 May 2024. DOI: 10.1145/3654927
  • A Contract-aware and Cost-effective LSM Store for Cloud Storage with Low Latency Spikes. ACM Transactions on Storage, 20(2):1-27, 20 Feb 2024. DOI: 10.1145/3643851
