Improving Performance in Data Processing Distributed Systems by Exploiting Data Placement and Partitioning


Improving Performance in Data Processing Distributed Systems by Exploiting Data Placement and Partitioning

Improving Performance in Data Processing Distributed Systems by Exploiting Data Placement and Partitioning PDF Author: Dachuan Huang
Publisher:
ISBN:
Category : Computer engineering
Languages : en
Pages :

Book Description
Our society is experiencing rapid growth in the amount of data because of widely used mobile devices, sensors, and computers. Recent estimates show that 2.5 exabytes of data are generated worldwide every day. Analyzing this amount of data could enable more intelligent business decisions, faster scientific discoveries, and more accurate public services. Traditional single-machine data processing techniques, such as relational database management systems, quickly showed their limitations when handling large amounts of data. To satisfy the ever-growing demand for large-scale data analysis, various public and commercial distributed data analysis systems have been built, such as High Performance Computing and Cloud Computing systems. These data processing distributed systems, with their excellent concurrency, scalability, and fault tolerance, are gaining attention in research institutions and industry. People are already enjoying the benefits of collecting and analyzing large amounts of data on mature, widely deployed data processing distributed systems. Unfortunately, data processing distributed systems have their own performance problems. More specifically, at the device layer, the system suffers from the long seek latency of hard disks, which reduces I/O throughput under random-access I/O patterns. At the framework layer, the system experiences the straggler problem in parallel jobs, where the slowest task alone prolongs the job execution time even though all other tasks finish much earlier. At the algorithm layer, the system has difficulty deciding the intermediate cache size, where the speed-up benefit for the following phase can be outweighed by the overhead of writing and reading a large intermediate cache file. This thesis solves these problems, and hence improves distributed system performance, by exploiting data placement and partitioning. Specifically, we propose the following solutions to the three problems above. First, we propose a hybrid storage system with hard disk drives and solid state drives in an HPC environment, where the input data's layout is re-organized to hide the long seek latency of hard disks. Second, we propose logical data partitioning strategies for the input data, so that the distributed system can benefit from fine-grained tasks' ability to mitigate stragglers without paying their prohibitive overhead. Last, when intermediate data can be saved to speed up the following job's execution, we propose an online analyzer that decides how much data to place into the cache. We have designed and implemented prototypes for each solution and evaluated them with representative workloads and datasets on the widely used distributed platforms PVFS and Hadoop. Our evaluations achieve nearly optimal results that match the theoretically expected performance improvements. At the device layer, we achieve a low-latency storage device at affordable cost. At the framework layer, we achieve minimal phase execution time in the presence of stragglers. At the algorithm layer, we achieve near-optimal job execution time for MapReduce frequent itemset mining (FIM) algorithms. Furthermore, our prototypes have low system overhead, which is necessary for wide application in practice.
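
To make the algorithm-layer trade-off concrete, here is a minimal cost-model sketch of the cache-size decision, written in Python. It is illustrative only and is not the thesis's online analyzer; the bandwidth figures, the candidate sizes, and the toy speed-up function are all assumptions.

```python
# Hedged sketch (not the thesis's actual analyzer): choose an intermediate
# cache size by comparing the estimated speed-up of the next phase against
# the cost of writing and re-reading the cached data. All parameters
# (bandwidths, candidate sizes, speed-up function) are illustrative.

def cache_benefit(cache_bytes, write_bw, read_bw, next_phase_saving):
    """Net time saved (seconds) by caching `cache_bytes` of intermediate data.

    next_phase_saving(cache_bytes) -> seconds saved in the following phase.
    """
    io_overhead = cache_bytes / write_bw + cache_bytes / read_bw
    return next_phase_saving(cache_bytes) - io_overhead

def choose_cache_size(candidate_sizes, write_bw, read_bw, next_phase_saving):
    """Pick the candidate size with the largest positive net benefit, or 0."""
    best_size, best_gain = 0, 0.0
    for size in candidate_sizes:
        gain = cache_benefit(size, write_bw, read_bw, next_phase_saving)
        if gain > best_gain:
            best_size, best_gain = size, gain
    return best_size

if __name__ == "__main__":
    # Toy saving model: diminishing returns after 1 GiB of cached data.
    saving = lambda b: 120.0 * min(b, 1 << 30) / (1 << 30)
    sizes = [256 << 20, 512 << 20, 1 << 30, 2 << 30]
    print(choose_cache_size(sizes, write_bw=200e6, read_bw=500e6,
                            next_phase_saving=saving))
```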

On Exploiting Location Flexibility in Data-intensive Distributed Systems

On Exploiting Location Flexibility in Data-intensive Distributed Systems PDF Author: Boyang Yu
Publisher:
ISBN:
Category :
Languages : en
Pages :

Book Description
With the fast growth of data-intensive distributed systems today, more novel and principled approaches are needed to improve system efficiency, ensure the service quality required to satisfy user requirements, and lower the system's running cost. This dissertation studies design issues in data-intensive distributed systems, which are differentiated from other systems by their heavy data-movement workloads and are characterized by the fact that the destination of each data flow is limited to a subset of the available locations, such as the servers holding the requested data. Moreover, even within the feasible subset, different locations may result in different performance.

The studies in this dissertation improve data-intensive systems by exploiting data storage location flexibility. Using the proposed hypergraph models for data placement, the dissertation addresses how to determine data placement from measured request patterns so as to improve a range of performance metrics, such as data access latency, system throughput, and various costs. To implement the proposal with lower overhead, a sketch-based data placement scheme is presented, which constructs a sparsified hypergraph under a distributed, streaming-based system model and achieves a good approximation of the performance improvement. As the network can become the bottleneck of distributed data-intensive systems due to frequent data movement among storage nodes, online data placement by reinforcement learning is proposed, which intelligently determines the storage location of each data item at the moment the item is written or updated, with joint awareness of network conditions and request patterns. Meanwhile, noticing that distributed memory caches are an effective means of lowering the load on backend storage systems, the auto-scaling of memory cache clusters is studied, which balances the energy cost of the service against the performance it ensures.

As the outcome of this dissertation, the designed schemes and methods help to improve the running efficiency of data-intensive distributed systems. They can therefore either improve the user-perceived service quality under the same level of system resource investment, or lower the monetary expense and energy consumption of maintaining the system under the same performance standard. From these two perspectives, both end users and system providers can benefit from the results of these studies.
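
As a rough illustration of hypergraph-style placement (not the dissertation's actual models or its sketch-based scheme), the following Python sketch treats each observed request as a hyperedge over the data items it touches and greedily co-locates frequently co-accessed items, subject to an assumed per-server capacity.

```python
# Hedged sketch of hypergraph-style placement: data items are vertices, each
# observed request is a hyperedge over the items it touches. A greedy pass
# assigns each item to the server that already holds most of its co-accessed
# items, subject to a capacity limit.
from collections import defaultdict

def greedy_placement(requests, num_servers, capacity):
    """requests: list of sets of item ids (hyperedges). Returns item -> server."""
    # Count how often each pair of items is requested together (co-access weight).
    co_access = defaultdict(lambda: defaultdict(int))
    freq = defaultdict(int)
    items = set()
    for edge in requests:
        items.update(edge)
        for a in edge:
            freq[a] += 1
            for b in edge:
                if a != b:
                    co_access[a][b] += 1

    placement, load = {}, [0] * num_servers
    # Place frequently requested items first.
    for item in sorted(items, key=lambda i: -freq[i]):
        # Affinity of each server = co-access weight with items already there.
        affinity = [0] * num_servers
        for other, weight in co_access[item].items():
            if other in placement:
                affinity[placement[other]] += weight
        candidates = [s for s in range(num_servers) if load[s] < capacity]
        best = max(candidates, key=lambda s: (affinity[s], -load[s]))
        placement[item] = best
        load[best] += 1
    return placement

if __name__ == "__main__":
    reqs = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}, {"d", "e"}]
    print(greedy_placement(reqs, num_servers=2, capacity=3))
```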

Compiling Parallel Loops for High Performance Computers

Compiling Parallel Loops for High Performance Computers PDF Author: David E. Hudak
Publisher: Springer Science & Business Media
ISBN: 1461531640
Category : Computers
Languages : en
Pages : 171

Book Description
This description reproduces part of the book's front matter (chapter headings and a list of figures), with page numbers in parentheses.

Contents (excerpt): 4.2 Code Segments (96); 4.3 Determining Communication Parameters (99); 4.4 Multicast Communication Overhead (103); 4.5 Partitioning (103); 4.6 Experimental Results (117); 4.7 Conclusion (121); 5 Collective Partitioning and Remapping for Multiple Loop Nests (125); 5.1 Introduction (125); 5.2 Program Enclosure Trees (128); 5.3 The CPR Algorithm (132); 5.4 Experimental Results (141); 5.5 Conclusion (146); Bibliography (149); Index (157).

List of Figures (excerpt): 1.1 The Butterfly Architecture (5); 1.2 Example of an iterative data-parallel loop (7); 1.3 Contiguous tiling and assignment of an iteration space (13); 2.1 Communication along a line segment (24); 2.2 Access pattern for the access offset (3,2) (25); 2.3 Decomposing an access vector along an orthogonal basis set of vectors (26); 2.4 An analysis of communication patterns (29); 2.5 Decomposing a vector along two separate basis sets of vectors (31); 2.6 Cache lines aligning with borders (33); 2.7 Cache lines not aligned with borders (34); 2.8 nh is the difference of nd and nb (42); 2.9 nh is the sum of nd and nb (42); 2.10 The ADAPT system (44); 2.11 Code segment used in experiments (46); 2.12 Execution rates for various partitions (47); 2.13 Execution time of partitions on Multimax (48); 2.14 Performance increase as processing power increases (49); 2.15 Percentage miss ratios for various aspect ratios and line sizes.

Global Data Management

Global Data Management PDF Author: Roberto Baldoni
Publisher: IOS Press
ISBN: 1586036297
Category : Business & Economics
Languages : en
Pages : 376

Book Description
Researchers have created the vision of the 'data utility' as a key enabler of ubiquitous and pervasive computing. Decentralization and replication are the approach to making it resistant to security attacks. This book presents an organic view of the research and technologies that bring us towards the realization of this vision.

Improving Data-shuffle Performance in Data-parallel Distributed Systems

Improving Data-shuffle Performance in Data-parallel Distributed Systems PDF Author: Shweelan Samson
Publisher:
ISBN:
Category :
Languages : en
Pages : 90

Book Description


Database Replication

Database Replication PDF Author: Bettina Kemme
Publisher: Springer Nature
ISBN: 3031018397
Category : Computers
Languages : en
Pages : 141

Book Description
Database replication is widely used for fault tolerance, scalability, and performance. The failure of one database replica does not stop the system from working, as the available replicas can take over the tasks of the failed replica. Scalability can be achieved by distributing the load across all replicas and adding new replicas should the load increase. Finally, database replication can provide fast local access, even for geographically distributed clients, if data copies are located close to them. Despite its advantages, replication is not a straightforward technique to apply, and there are many hurdles to overcome. At the forefront is replica control: assuring that data copies remain consistent when updates occur. There are many alternatives regarding where updates can occur, when changes are propagated to data copies, how changes are applied, where the replication tool is located, and so on. A particular challenge is to combine replica control with transaction management, as a transaction requires several operations to be treated as a single logical unit and provides atomicity, consistency, isolation, and durability across the replicated system. The book provides a categorization of replica control mechanisms, presents several replica and concurrency control mechanisms in detail, and discusses many of the issues that arise when such solutions need to be implemented within or on top of relational database systems. Furthermore, the book presents the tasks needed to build a fault-tolerant replication solution, provides an overview of load-balancing strategies that allow the load to be equally distributed across all replicas, and introduces the concept of self-provisioning, which allows the replicated system to dynamically decide on the number of replicas needed to handle the current load. As performance evaluation is a crucial aspect of developing a replication tool, the book presents an analytical model of the scalability potential of various replication solutions. For readers who are only interested in getting a good overview of the challenges of database replication and the general mechanisms for implementing replication solutions, we recommend reading Chapters 1 to 4. For readers who want a more complete picture and a discussion of advanced issues, we further recommend Chapters 5, 8, 9, and 10. Finally, Chapters 6 and 7 are of interest to those who want to get familiar with thorough algorithm design and correctness reasoning. Table of Contents: Overview / 1-Copy-Equivalence and Consistency / Basic Protocols / Replication Architecture / The Scalability of Replication / Eager Replication and 1-Copy-Serializability / 1-Copy-Snapshot Isolation / Lazy Replication / Self-Configuration and Elasticity / Other Aspects of Replication
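
As a toy illustration of one of the basic replica-control approaches the book categorizes, the sketch below models eager primary-copy replication in memory. It is not code from the book; the class names and the single-primary design are assumptions made for the example.

```python
# Hedged sketch of primary-copy (eager) replication: the primary applies each
# write and synchronously propagates it to all secondaries before
# acknowledging, so any copy can serve a consistent read afterwards.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class PrimaryCopyGroup:
    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = secondaries

    def write(self, key, value):
        # Eager replication: all copies are updated before the write returns.
        self.primary.apply(key, value)
        for replica in self.secondaries:
            replica.apply(key, value)
        return "ack"

    def read(self, key, replica_index=0):
        # Reads can be served by any copy (load balancing across replicas).
        return self.secondaries[replica_index].data.get(key)

if __name__ == "__main__":
    group = PrimaryCopyGroup(Replica("primary"), [Replica("r1"), Replica("r2")])
    group.write("x", 42)
    print(group.read("x"))  # 42, read from a secondary
```

Because the write returns only after every copy has applied it, reads can be distributed across replicas without breaking consistency, which is the property that the load-balancing strategies discussed in the book rely on.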

Big Data 2.0 Processing Systems

Big Data 2.0 Processing Systems PDF Author: Sherif Sakr
Publisher: Springer Nature
ISBN: 3030441873
Category : Computers
Languages : en
Pages : 145

Book Description
This book provides readers with the “big picture” and a comprehensive survey of the domain of big data processing systems. For the past decade, the Hadoop framework has dominated the world of big data processing, yet academia and industry have recently started to recognize its limitations in several application domains; thus, it is now gradually being replaced by a collection of engines dedicated to specific verticals (e.g., structured data, graph data, and streaming data). The book explores this new wave of systems, which it refers to as Big Data 2.0 processing systems. After Chapter 1 presents the general background of the big data phenomenon, Chapter 2 provides an overview of various general-purpose big data processing systems that allow their users to develop big data processing jobs for different application domains. In turn, Chapter 3 examines various systems that have been introduced to support the SQL flavor on top of the Hadoop infrastructure and provide competitive, scalable performance in the processing of large-scale structured data. Chapter 4 discusses several systems that have been designed to tackle the problem of large-scale graph processing, while the main focus of Chapter 5 is on several systems designed to provide scalable solutions for processing big data streams, as well as other systems introduced to support the development of data pipelines between various types of big data processing jobs and systems. Next, Chapter 6 focuses on the emerging frameworks and systems in the domain of scalable machine learning and deep learning processing. Lastly, Chapter 7 shares conclusions and an outlook on future research challenges. This new and considerably enlarged second edition not only contains the completely new Chapter 6, but also offers refreshed coverage of the state of the art in all domains of big data processing over recent years. Overall, the book offers a valuable reference guide for professionals, students, and researchers in the domain of big data processing systems. Further, its comprehensive content will hopefully encourage readers to pursue further research on the subject.

Networks of the Future

Networks of the Future PDF Author: Mahmoud Elkhodr
Publisher: CRC Press
ISBN: 1498783988
Category : Computers
Languages : en
Pages : 513

Book Description
Provides a comprehensive introduction to the latest research in networking. Explores implementation issues and research challenges. Focuses on applications and enabling technologies. Covers wireless technologies, Big Data, IoT, and other emerging research areas. Features contributions from worldwide experts.

Robust Data Partitioning for Ad-hoc Query Processing

Robust Data Partitioning for Ad-hoc Query Processing PDF Author: Qui T. Nguyen
Publisher:
ISBN:
Category :
Languages : en
Pages : 62

Book Description
Data partitioning can significantly improve query performance in distributed database systems. Most proposed data partitioning techniques choose the partitioning based on a particular expected query workload or use a simple upfront scheme, such as uniform range partitioning or hash partitioning on a key. However, these techniques do not adequately address the case where the query workload is ad-hoc and unpredictable, as in many analytic applications. The HYPER-PARTITIONING system aims to fill that gap by using a novel space-partitioning tree on the space of possible attribute values to define partitions incorporating all attributes of a dataset. The system creates a robust upfront partitioning tree, designed to benefit all possible queries, and then adapts it over time in response to the actual workload. This thesis evaluates the robustness of the upfront hyper-partitioning algorithm, describes the implementation of the overall HYPER-PARTITIONING system, and shows how hyper-partitioning improves the performance of both selection and join queries.
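
For intuition, the following Python sketch builds a k-d-tree-style partitioning by recursive median splits that cycle through all attributes. It only approximates the spirit of the upfront hyper-partitioning tree described above; the attribute names, the row-count threshold, and the median-split rule are assumptions.

```python
# Hedged sketch of multi-attribute space partitioning: rows are split
# recursively at the median of one attribute per level, cycling through all
# attributes so that every attribute can help prune ad-hoc queries.

def build_partitions(rows, attributes, max_rows, depth=0):
    """rows: list of dicts. Returns a list of row lists (the leaf partitions)."""
    if len(rows) <= max_rows:
        return [rows]
    attr = attributes[depth % len(attributes)]
    rows = sorted(rows, key=lambda r: r[attr])
    mid = len(rows) // 2
    left = build_partitions(rows[:mid], attributes, max_rows, depth + 1)
    right = build_partitions(rows[mid:], attributes, max_rows, depth + 1)
    return left + right

if __name__ == "__main__":
    data = [{"age": a, "salary": s} for a, s in
            [(25, 40), (32, 55), (41, 70), (29, 48), (37, 62), (50, 90)]]
    for i, part in enumerate(build_partitions(data, ["age", "salary"], max_rows=2)):
        print(i, part)
```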

Using Statistical Analysis to Improve Data Partitioning in Algorithms for Data Parallel Processing Implementation

Using Statistical Analysis to Improve Data Partitioning in Algorithms for Data Parallel Processing Implementation PDF Author: Manuel E. Hidalgo Murillo
Publisher:
ISBN:
Category : Multiprocessors
Languages : en
Pages : 138

Book Description
"In multiprocessor systems, data parallelism is the execution of the same task on data distributed across multiple processors. It involves splitting the data set into smaller data partitions or batches. The process to split the data among the different processors is call “Data Partitioning” and it is an important factor of efficiency for data parallel processing implementation. Data partitioning influences the workload in each processing unit and the network traffic between processes. A poor partition quality can lead to serious performance problems. This research presents a data partitioning method that can be used to improve the performance of data parallel implementations. The proposed method relies on using an initial screening experiment to run a portion of data units. Regression is then used to create a prediction model of the processing times for each data unit. Using the estimated processing time, load balancing is achieved by implementing a greedy algorithm to distribute the units in a parallel environment. Discrete event simulation is used as the application of this research. Comparisons between equal data partitioning and the methodology proposed in this research indicate that time savings and equal load balancing can be achieved."--Abstract.