EFFICIENT DATA REDUCTION IN HPC AND DISTRIBUTED STORAGE SYSTEMS PDF Download


EFFICIENT DATA REDUCTION IN HPC AND DISTRIBUTED STORAGE SYSTEMS

Author: Tong Liu
Publisher:
ISBN:
Category :
Languages : en
Pages : 138

Book Description
In modern distributed storage systems, space efficiency and system reliability are two major concerns. As a result, contemporary storage systems often employ data deduplication and erasure coding to reduce storage overhead and provide fault tolerance, respectively. However, little work has been done to explore the relationship between these two techniques.

Scientific simulations on high-performance computing (HPC) systems can generate large amounts of floating-point data per run. To mitigate the data storage bottleneck and lower the data volume, floating-point compressors are commonly employed. Compared to lossless compressors, lossy compressors such as SZ and ZFP can reduce data volume more aggressively while maintaining the usefulness of the data. However, a reduction ratio of more than two orders of magnitude is almost impossible without seriously distorting the data. In deep learning, the autoencoder has shown great potential for data compression, in particular with images. Whether autoencoders can deliver similar performance on scientific data, however, is unknown.

Modern industry data centers employ erasure codes to provide reliability for large amounts of data at low cost. Although erasure codes offer optimal storage efficiency, they suffer from high repair costs compared to traditional three-way replication: when a data block is lost in a data center, erasure codes require substantial disk I/O and cross-node, cross-rack network bandwidth to repair the failed data.

This dissertation presents our research results on the three challenges above, aiming to optimize or solve these issues for HPC and distributed storage systems. Details are as follows.

To solve the data storage challenge for erasure-coded deduplication systems, we propose Reference-counter Aware Deduplication (RAD), which incorporates deduplication features into erasure coding to improve garbage collection performance when deletions occur. RAD encodes data according to the reference counter maintained by the deduplication layer, thereby reducing the encoding overhead when garbage collection is performed. Further, since the reference counter also represents the reliability level of the data chunks, we explore the trade-offs between storage overhead and reliability level among different erasure codes. The experimental results show that RAD can improve GC performance by up to 24.8%, and the reliability analysis shows that, with certain data features, RAD can provide both better reliability and better storage efficiency than the traditional Round-Robin placement.

To solve the data processing challenge for HPC systems, we conduct the first comprehensive study on the use of autoencoders to compress real-world scientific data and illustrate several key findings on using autoencoders for scientific data reduction. We implement an autoencoder-based prototype following conventional wisdom to reduce floating-point data. Our study shows that the out-of-the-box implementation needs to be further tuned in order to achieve high compression ratios and satisfactory error bounds. Our evaluation results show that, for most of the test datasets, the autoencoder outperforms SZ and ZFP by 2 to 4X in compression ratio. Our practices and lessons learned can direct future optimizations for using autoencoders to compress scientific data.
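As a rough illustration of the autoencoder approach (a minimal sketch, not the dissertation's tuned prototype), the Python/PyTorch snippet below trains a small fully connected autoencoder on 256-value blocks of a synthetic floating-point field; the block size, latent width and training settings are assumptions chosen only for demonstration.

    import numpy as np
    import torch
    import torch.nn as nn

    # Synthetic 1D floating-point field, split into blocks of 256 values each
    # (real scientific datasets would be loaded from simulation output instead).
    field = np.sin(np.linspace(0, 200 * np.pi, 256 * 1024)).astype(np.float32)
    blocks = torch.from_numpy(field.reshape(-1, 256))

    # Tiny autoencoder: 256 values -> 8 latent values (a 32x reduction before any
    # further quantization or entropy coding), then back to 256 values.
    model = nn.Sequential(
        nn.Linear(256, 64), nn.ReLU(),
        nn.Linear(64, 8),                      # encoder output (latent code)
        nn.Linear(8, 64), nn.ReLU(),
        nn.Linear(64, 256),                    # decoder output (reconstruction)
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(200):                   # full-batch training, illustrative only
        recon = model(blocks)
        loss = nn.functional.mse_loss(recon, blocks)
        opt.zero_grad()
        loss.backward()
        opt.step()

    err = (model(blocks) - blocks).abs().max().item()
    print(f"max absolute reconstruction error: {err:.4f}")

As the description notes, such an out-of-the-box setup typically needs further tuning (for example error-bounded quantization of the latent code and per-dataset normalization) before it reaches competitive compression ratios and error bounds.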
To solve the data transfer challenge for distributed storage systems, we propose RPR, a rack-aware pipeline repair scheme for erasure-coded distributed storage systems. RPR is the first to exploit rack-level structure, exploring the connection between the node level and the rack level to improve repair performance when single or multiple failures occur in a data center. The evaluation results on several common RS code configurations show that, for single-block failures, our RPR scheme reduces the total repair time by up to 81.5% compared to the traditional RS code repair method and 50.2% compared to the state-of-the-art CAR algorithm. For multi-block failures, RPR reduces the total repair time and cross-rack data transfer traffic by up to 64.5% and 50%, respectively, over the traditional repair.
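The repair-traffic problem motivating RPR can be seen even in single parity, the simplest erasure code: k data blocks are protected by one XOR parity block, and rebuilding any one lost block requires reading all k surviving blocks. The Python sketch below only illustrates that cost; it is not RPR, Reed-Solomon coding, or rack-aware placement, and the block count and size are arbitrary.

    import os

    # k data blocks plus one parity block (their bytewise XOR). Any single lost
    # block can be rebuilt, but the rebuild must read all k surviving blocks.
    k, block_size = 6, 4096
    data_blocks = [os.urandom(block_size) for _ in range(k)]

    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for b in blocks:
            for i, byte in enumerate(b):
                out[i] ^= byte
        return bytes(out)

    parity = xor_blocks(data_blocks)

    # Simulate losing block 2 and rebuilding it from the k surviving blocks.
    lost = 2
    survivors = data_blocks[:lost] + data_blocks[lost + 1:] + [parity]
    rebuilt = xor_blocks(survivors)
    assert rebuilt == data_blocks[lost]
    print(f"read {len(survivors) * block_size} bytes to rebuild one {block_size}-byte block")

When the surviving blocks sit in different racks, those reads become cross-rack traffic, which is exactly what rack-aware repair schemes try to minimize.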

Optimizations for Energy-aware, High-performance and Reliable Distributed Storage Systems

Author: Cengiz Karakoyunlu
Publisher:
ISBN:
Category :
Languages : en
Pages : 374

Book Description


Speeding Up Distributed Storage and Computing Systems Using Codes

Author: Kang Wook Lee
Publisher:
ISBN:
Category :
Languages : en
Pages : 155

Book Description
Modern data centers have been providing exponentially increasing computing and storage resources, which have been fueling core applications ranging from search engines in the early 2000s to the real-time, large-scale data analysis of today. All these breakthroughs were made possible only by the scalability in computing and storage resources offered by modern large-scale clusters, which comprise individually small and unreliable low-end devices. Given the unpredictable nature of the underlying devices in these systems, we face the constant challenge of securing predictable and high-quality performance in the face of uncertainty.

In this thesis, distributed storage and computing systems are viewed through a coding-theoretic lens. The role of codes in providing resiliency against noise has been studied for decades in many other engineering contexts, especially in communication systems, and codes are part of our everyday infrastructure such as smartphones, WiFi and cellular systems. Since the performance of distributed systems is significantly affected by anomalous system behavior and bottlenecks, which we call "system noise", there is an exciting opportunity for codes to endow distributed systems with robustness against such system noise. Our key observation, that channel noise in communication systems is equivalent to system noise in distributed systems, forms the key motivation of this thesis and raises the fundamental question: "can we use codes to guarantee robust speedups in distributed storage and computing systems?"

In this thesis, three main layers of distributed computing and storage systems, the storage layer, the computation layer and the communication layer, are robustified through coding-theoretic tools. For the storage layer, we show that coded distributed storage systems allow faster data retrieval in addition to the other known advantages such as higher data durability and lower storage overhead. For the computation layer, we inject computing redundancy into distributed algorithms to make them robust to stragglers, i.e., nodes that are substantially slower than the other nodes. For the communication layer, we propose a novel data caching and communication protocol, based on coding-theoretic principles, that can significantly reduce the network overhead of the data shuffling operation, which is necessary to achieve higher statistical efficiency when running parallel/distributed machine learning algorithms.
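To make the computation-layer idea concrete, here is a toy Python sketch of coded matrix-vector multiplication in the spirit of the thesis: one extra, coded task lets any two of three workers' results recover A @ x, so a single straggler cannot delay the job. The matrix sizes, splitting and worker names are invented for illustration and are not the thesis's actual constructions.

    import numpy as np

    # Split A into two halves and give a third worker the coded task (A1 + A2) @ x.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 3))
    x = rng.standard_normal(3)
    A1, A2 = A[:2], A[2:]

    tasks = {"w1": A1, "w2": A2, "w3": A1 + A2}    # work assigned to three workers
    results = {w: M @ x for w, M in tasks.items()}

    # Suppose worker w2 straggles: recover its share from the coded result.
    y1, y3 = results["w1"], results["w3"]
    y2 = y3 - y1                                   # (A1 + A2) @ x - A1 @ x = A2 @ x
    recovered = np.concatenate([y1, y2])
    assert np.allclose(recovered, A @ x)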

Efficient Erasure Coding in Distributed Storage Systems

Author: Jun Li
Publisher:
ISBN:
Category :
Languages : en
Pages :

Book Description
Distributed storage systems store a substantial amount of data on many commodity servers. As server failures are common, it is critical for distributed storage systems to store redundancy to tolerate such failures. Conventionally, a distributed storage system replicates data as redundancy. Recently, erasure coding has been increasingly replacing replication thanks to its lower storage overhead. However, in many scenarios, erasure coding incurs additional overhead, such as higher network traffic, or lowers the performance of data accesses. In this dissertation, we address some of these challenges in two broad areas.

Erasure coding with optimal network overhead. Traditional erasure codes incur high network overhead when data needs to be reconstructed after a server failure. We study the problem of constructing erasure codes that consume the optimal network traffic to reconstruct data from multiple failures. We start from a new construction of minimum-storage cooperative regenerating (MSCR) codes that reconstruct data from two failures with optimal network traffic. We show that an existing minimum-storage regenerating (MSR) code is also an MSCR code for two failures, and vice versa. For more general cases, we propose Beehive codes that optimize the volume of network traffic needed to reconstruct data from more than two failures, with storage overhead only slightly higher than the optimum.

I/O-efficient erasure coding and systems. Traditionally, erasure coding incurs higher I/O overhead because of its encoding and decoding operations. In this dissertation, we propose solutions to minimize the overhead of writing and reading erasure-coded data. On the input side, we design and implement Mist, a new mechanism for disseminating erasure-coded data efficiently to multiple receiving servers in data centers. On the output side, we exploit the demand skewness in distributed storage systems and propose Zebra, a framework that encodes data into multiple tiers dynamically by their demand to reduce the overall overhead of reading erasure-coded data. We also investigate the data parallelism of erasure coding, which may affect the performance of running parallel data processing jobs, such as MapReduce, on erasure-coded data, and construct Carousel codes that allow the degree of data parallelism to be expanded to an arbitrary number.
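A back-of-the-envelope comparison (illustrative numbers only, not taken from the dissertation) shows the trade-off this work targets: a typical erasure code halves the storage overhead of three-way replication but multiplies the data read during a single-block repair by roughly k.

    # 3-way replication vs. a (k=6, m=3) erasure code for a 1 GiB object; both
    # tolerate three simultaneous losses. Figures ignore metadata and placement.
    GIB = 1.0
    k, m = 6, 3

    rep_stored = 3 * GIB                 # three full copies: 3x overhead
    ec_stored = GIB * (k + m) / k        # 1.5 GiB: 1.5x overhead

    block = GIB / k
    rep_repair_read = GIB                # re-copy the object from one surviving replica
    ec_repair_read = k * block           # rebuilding one block reads k surviving blocks

    print(f"stored:      replication {rep_stored:.2f} GiB vs erasure {ec_stored:.2f} GiB")
    print(f"repair read: replication {rep_repair_read:.2f} GiB per lost replica vs "
          f"erasure {ec_repair_read:.2f} GiB per lost {block:.3f} GiB block")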

Data Intensive Distributed Computing: Challenges and Solutions for Large-scale Information Management

Author: Kosar, Tevfik
Publisher: IGI Global
ISBN: 1615209727
Category : Computers
Languages : en
Pages : 353

Book Description
"This book focuses on the challenges of distributed systems imposed by the data intensive applications, and on the different state-of-the-art solutions proposed to overcome these challenges"--Provided by publisher.

Energy-Efficient Distributed Computing Systems

Author: Albert Y. Zomaya
Publisher: John Wiley & Sons
ISBN: 1118342003
Category : Computers
Languages : en
Pages : 605

Book Description
The energy consumption issue in distributed computing systems raises various monetary, environmental and system performance concerns. Electricity consumption by data centers in the US doubled from 2000 to 2005. From a financial and environmental standpoint, reducing the consumption of electricity is important, yet these reforms must not lead to performance degradation of the computing systems. These contradicting constraints create a suite of complex problems that need to be resolved in order to lead to 'greener' distributed computing systems. This book brings together a group of outstanding researchers who investigate the different facets of green and energy-efficient distributed computing. Key features: one of the first books of its kind; features the latest research findings on emerging topics by well-known scientists; valuable research for graduate students, postdocs and researchers; research that will greatly feed into other technologies and application domains.

Large-scale Distributed Systems and Energy Efficiency

Author: Jean-Marc Pierson
Publisher: John Wiley & Sons
ISBN: 1118981111
Category : Computers
Languages : en
Pages : 335

Book Description
With concerns about global energy consumption at an all-time high, improving the energy efficiency of computer networks is becoming an increasingly important topic. Large-Scale Distributed Systems and Energy Efficiency: A Holistic View addresses innovations in technology relating to the energy efficiency of a wide variety of contemporary computer systems and networks. After an introductory overview of the energy demands of current Information and Communications Technology (ICT), individual chapters offer in-depth analyses of such topics as cloud computing, green networking (both wired and wireless), mobile computing, power modeling, the rise of green data centers and high-performance computing, resource allocation, and energy efficiency in peer-to-peer (P2P) computing networks. The book discusses the measurement and modeling of energy consumption; includes methods for reducing energy consumption in diverse computing environments; and features a variety of case studies and examples of energy reduction and assessment. Timely and important, Large-Scale Distributed Systems and Energy Efficiency is an invaluable resource for ways of increasing the energy efficiency of computing systems and networks while simultaneously reducing the carbon footprint.

Benchmarking, Consistency, Distributed Database Management Systems, Distributed Systems, Eventual Consistency

Author: Bermbach, David
Publisher: KIT Scientific Publishing
ISBN: 3731501864
Category : Computers
Languages : en
Pages : 202

Book Description
Cloud storage services and NoSQL systems typically offer only "Eventual Consistency", a rather weak guarantee covering a broad range of potential data consistency behavior. The degree of actual (in-)consistency, however, is unknown. This work presents novel solutions for determining the degree of (in-)consistency via simulation and benchmarking, as well as the necessary means to resolve inconsistencies leveraging this information.
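A toy Python simulation in the spirit of such benchmarking (all delays are invented; a real benchmark would measure an actual storage service): a writer updates a key at t = 0, each replica applies the update after a random propagation delay, and a client polling random replicas estimates how long stale reads remain possible.

    import random

    random.seed(42)

    def simulate_staleness(n_replicas=5, max_delay_ms=500, read_interval_ms=10):
        # Time at which each replica applies the write issued at t = 0.
        apply_at = [random.uniform(0, max_delay_ms) for _ in range(n_replicas)]
        t, last_stale_read = 0.0, 0.0
        while t <= max(apply_at):
            replica = random.randrange(n_replicas)
            if t < apply_at[replica]:    # this replica still returns the old value
                last_stale_read = t
            t += read_interval_ms
        return last_stale_read           # time of the last observed stale read

    trials = [simulate_staleness() for _ in range(1000)]
    print(f"average observed inconsistency window: {sum(trials) / len(trials):.0f} ms")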

Introduction and Implementation of Data Reduction Pools and Deduplication

Author: Jon Tate
Publisher: IBM Redbooks
ISBN: 0738457310
Category : Computers
Languages : en
Pages : 124

Book Description
Continuing its commitment to developing and delivering industry-leading storage technologies, IBM® introduces Data Reduction Pools (DRP) and Deduplication powered by IBM Spectrum™ Virtualize, which are innovative storage features that deliver essential storage efficiency technologies and exceptional ease of use and performance, all integrated into a proven design. This book discusses Data Reduction Pools (DRP) and Deduplication and is intended for experienced storage administrators who are fully familiar with IBM Spectrum Virtualize, SAN Volume Controller, and the Storwize family of products.

Data Deduplication for High Performance Storage System

Author: Dan Feng
Publisher: Springer Nature
ISBN: 9811901120
Category : Computers
Languages : en
Pages : 170

Book Description
This book comprehensively introduces data deduplication technologies for storage systems. It first presents an overview of data deduplication, including its theoretical basis, basic workflow, application scenarios and key technologies, and then focuses on each key deduplication technology to provide insight into the evolution of the technology over the years, including chunking algorithms, indexing schemes, fragmentation-reduction schemes, rewriting algorithms and security solutions. In particular, both the state-of-the-art solutions and the newly proposed solutions are elaborated. At the end of the book, the author discusses the fundamental trade-offs in each deduplication design choice and proposes an open-source deduplication prototype. With its fundamental theories and complete survey, the book can guide beginners, students and practitioners working on data deduplication in storage systems. It also provides a compact reference on key data deduplication technologies for researchers developing high-performance storage solutions.
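A minimal Python sketch of the basic deduplication workflow the book describes, using fixed-size chunking and SHA-256 fingerprints (real systems typically use content-defined chunking and far more elaborate indexing; the chunk size and payload here are illustrative only):

    import hashlib

    CHUNK_SIZE = 4096

    def deduplicate(data: bytes, store: dict) -> list:
        # Split data into chunks, store only unseen chunks, return the recipe.
        recipe = []                          # fingerprints needed to rebuild data
        for off in range(0, len(data), CHUNK_SIZE):
            chunk = data[off:off + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in store:              # unique chunk: write it exactly once
                store[fp] = chunk
            recipe.append(fp)
        return recipe

    store = {}
    payload = (b"A" * 8192) + (b"B" * 4096) + (b"A" * 4096)   # repeated content
    recipe = deduplicate(payload, store)
    restored = b"".join(store[fp] for fp in recipe)
    assert restored == payload
    print(f"{len(payload)} logical bytes -> {sum(len(c) for c in store.values())} stored bytes")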