Abstract-Lately, the large increasing in data amount results in compound and large data-sets that caused the appearance of "Big Data" concept which gained the attention of industrial organizations as weil as academic communities. Big data APIs that need large memory can benefit from Phoenix MapReduce implementation for shared-memory machines, instead of large, distributed clusters of computers. This paper evaluates the design and the prototype of Phoenix, Phoenix performance, as weil as Phoenix limitations. This paper also suggests some new approaches to get over of some Phoenix limitation and enhance its performance on large-scale shared memory. The major contribution of this work is finding new approaches that get over the pairs limitation in phoenix framework using hash tables with B+Trees and get over the collisions problem of hash tables.
Abstract: The MapReduce programming model simplifies large-scale data processing on commodity cluster by exploiting parallel map tasks and reduces tasks. Although many efforts have been made to improve the performance of MapReduce jobs, they ignore the network traffic generated in the shuffle phase, which plays a critical role in performance enhancement. Traditionally, a hash function is used to partition intermediate data among reduce tasks, which, however, is not traffic-efficient because network topology and data size associated with each key are not taken into consideration. In this paper, we study to reduce network traffic cost for a MapReduce job by designing a novel intermediate data partition scheme. Furthermore, we jointly consider the aggregator placement problem, where each aggregator can reduce merged traffic from multiple map tasks. A decomposition-based distributed algorithm is proposed to deal with the large-scale optimization problem for big data application and an online algorithm is also designed to adjust data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost under both offline and online cases.