Improving Performance in Data Processing Distributed Systems by Exploiting Data Placement and Partitioning

Author: Dachuan Huang
Category: Computer engineering
Language: English

Book Description
Our society is experiencing rapid growth in the amount of data because of the widespread use of mobile devices, sensors, and computers. Recent estimates indicate that 2.5 exabytes of data are generated worldwide every day. Analyzing this volume of data could enable more intelligent business decisions, faster scientific discoveries, and more accurate societal services. Traditional single-machine data processing techniques, such as relational database management systems, quickly showed their limitations when handling large amounts of data. To satisfy the ever-growing demand for large-scale data analysis, various public and commercial distributed data analysis systems have been built, such as High Performance Computing (HPC) and Cloud Computing systems. These data processing distributed systems, with their excellent concurrency, scalability, and fault tolerance, are gaining increasing attention in research institutions and industry. People already enjoy the benefits of collecting and analyzing large amounts of data on maturely deployed data processing distributed systems. Unfortunately, data processing distributed systems have their own performance problems. More specifically, in the device layer, the system suffers from long seek latency in hard disks, which reduces I/O throughput under random-access I/O patterns. In the framework layer, the system experiences the straggler problem in parallel jobs, where the slowest task alone prolongs the job execution time even though all other tasks finished much earlier. In the algorithm layer, the system has difficulty deciding the intermediate cache size, since the following phase's speed-up benefit can be outweighed by the overhead incurred by writing and reading a large intermediate cache file. This thesis aims to solve these problems, and thereby improve distributed system performance, by exploiting data placement and partitioning.
Specifically, we propose the following solutions to address the three aforementioned problems. First, we propose using a hybrid storage system with hard disk drives and solid state drives in an HPC environment, where the input data's layout is reorganized to hide the long seek latency of hard disks. Second, we propose logical data partitioning strategies for input data, so that the distributed system can benefit from the ability of fine-grained tasks to mitigate the straggler problem without paying a prohibitive overhead. Lastly, when intermediate data can be saved to speed up the following job's execution, we propose an online analyzer to decide how much data to place into the cache. We have designed and implemented prototypes for each work, and evaluated them with representative workloads and datasets on the widely used distributed system platforms PVFS and Hadoop. Our evaluation shows almost optimal results, which match the theoretical expectation of performance improvement. In the device layer, we achieve a low-latency storage device at affordable cost. In the framework layer, we achieve minimal phase execution time in the presence of stragglers. In the algorithm layer, we achieve near-optimal job execution time for MapReduce FIM algorithms. Furthermore, our prototypes have low system overhead, which is a necessity for wide application in practice.
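
The straggler effect described in the abstract can be illustrated with a toy model. The sketch below is not the thesis's partitioning strategy or its prototype; it is a minimal simulation under stated assumptions: a hypothetical four-worker cluster in which one worker runs at a quarter of the speed of the others, a fixed amount of total work, and a simple scheduler (the illustrative function `job_time`) that hands the next task to whichever worker becomes free first. It compares one coarse task per worker against many fine-grained tasks.

```python
import heapq

def job_time(task_sizes, worker_speeds, overhead=0.0):
    """Simulate pull-based scheduling and return the job completion time.

    Each task is handed to the worker that becomes free first. `overhead`
    is a hypothetical fixed per-task startup cost (0 by default).
    """
    # Min-heap of (time_when_free, worker_index).
    free_at = [(0.0, w) for w in range(len(worker_speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for size in task_sizes:
        t, w = heapq.heappop(free_at)
        done = t + overhead + size / worker_speeds[w]
        finish = max(finish, done)  # the job ends when its last task ends
        heapq.heappush(free_at, (done, w))
    return finish

# Assumed setup: 120 units of work, three normal workers and one straggler.
TOTAL_WORK = 120.0
SPEEDS = [1.0, 1.0, 1.0, 0.25]

coarse = [TOTAL_WORK / 4] * 4    # one large task per worker
fine = [TOTAL_WORK / 40] * 40    # forty small tasks pulled on demand

coarse_time = job_time(coarse, SPEEDS)
fine_time = job_time(fine, SPEEDS)
print(f"coarse-grained: {coarse_time}, fine-grained: {fine_time}")
```

In this model the coarse-grained job is held back by the single task stuck on the slow worker, while the fine-grained job lets the faster workers absorb most of the remaining work. Passing a nonzero `overhead` shows the opposite pressure: per-task costs grow with the number of tasks, which is the trade-off the abstract refers to as the overhead of fine-grained tasks.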