IEEE 2017/18 - Big Data with Hadoop Projects







IEEE 2017: Efficient Processing of Skyline Queries Using MapReduce
Abstract: The skyline operator has attracted considerable attention recently due to its broad applications. However, computing a skyline is challenging today since we have to deal with big data. For data-intensive applications, the MapReduce framework has been widely used recently. In this paper, we propose the efficient parallel algorithm SKY-MR+ for processing skyline queries using MapReduce. We first build a quadtree-based histogram for space partitioning by deciding judiciously whether to split each leaf node, based on the benefit of splitting in terms of the estimated execution time. In addition, we apply the dominance power filtering method to effectively prune non-skyline points in advance. We next partition data based on the regions divided by the quadtree and compute candidate skyline points for each partition using MapReduce. Finally, we check whether each candidate is actually a skyline point in every partition using MapReduce.
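
The pruning idea is easy to see in miniature. Below is a minimal, illustrative Java sketch of the dominance test and a per-partition skyline filter; SKY-MR+ runs this kind of local computation inside each quadtree region before the final validation job. The class and method names are ours, not the authors' code.

```java
import java.util.ArrayList;
import java.util.List;

public class SkylineFilter {
    // p dominates q if p is no worse in every dimension and strictly
    // better in at least one (smaller values assumed better here).
    static boolean dominates(double[] p, double[] q) {
        boolean strictlyBetter = false;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > q[i]) return false;
            if (p[i] < q[i]) strictlyBetter = true;
        }
        return strictlyBetter;
    }

    // Naive O(n^2) local skyline over one partition's points.
    static List<double[]> localSkyline(List<double[]> points) {
        List<double[]> skyline = new ArrayList<>();
        for (double[] p : points) {
            boolean dominated = false;
            for (double[] q : points) {
                if (q != p && dominates(q, p)) { dominated = true; break; }
            }
            if (!dominated) skyline.add(p);
        }
        return skyline;
    }
}
```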


IEEE 2017: Attribute-Based Storage Supporting Secure Deduplication of Encrypted Data in Cloud
Abstract: Attribute-based encryption (ABE) has been widely used in cloud computing where a data provider outsources his/her encrypted data to a cloud service provider, and can share the data with users possessing specific credentials (or attributes). However, the standard ABE system does not support secure deduplication, which is crucial for eliminating duplicate copies of identical data in order to save storage space and network bandwidth. In this paper, we present an attribute-based storage system with secure deduplication in a hybrid cloud setting, where a private cloud is responsible for duplicate detection and a public cloud manages the storage. Compared with the prior data deduplication systems, our system has two advantages. Firstly, it can be used to confidentially share data with users by specifying access policies rather than sharing decryption keys. Secondly, it achieves the standard notion of semantic security for data confidentiality while existing systems only achieve it by defining a weaker security notion. In addition, we put forth a methodology to modify a ciphertext over one access policy into ciphertexts of the same plaintext but under other access policies without revealing the underlying plaintext.
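
As a rough orientation to the hybrid-cloud flow (not the paper's cryptographic construction), the private cloud can be pictured as keeping a tag-to-object index and answering duplicate queries, while the public cloud stores the ciphertexts. The SHA-256 tag below is a deliberate simplification; in the actual scheme the tag is derived within the ABE construction so that it reveals nothing beyond equality.

```java
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class DedupIndex {
    private final Map<String, String> tagToObjectId = new HashMap<>();

    // Hex-encoded SHA-256 digest used as the equality tag (simplified stand-in).
    static String tag(byte[] data) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : d) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Returns the existing object id on a duplicate, or null after
    // registering a new one (so the public cloud stores nothing twice).
    public synchronized String checkOrRegister(byte[] data, String objectId) throws Exception {
        String t = tag(data);
        String existing = tagToObjectId.get(t);
        if (existing != null) return existing;
        tagToObjectId.put(t, objectId);
        return null;
    }
}
```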


IEEE 2017: FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters

Abstract: Traditional parallel algorithms for mining frequent itemsets aim to balance load by equally partitioning data among a group of computing nodes. We start this study by discovering a serious performance problem of the existing parallel Frequent Itemset Mining algorithms. Given a large dataset, data partitioning strategies in the existing solutions suffer high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this problem by developing a data partitioning approach called FiDoop-DP using the MapReduce programming model. The overarching goal of FiDoop-DP is to boost the performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP is the Voronoi diagram-based data partitioning technique, which exploits correlations among transactions. Incorporating the similarity metric and the Locality-Sensitive Hashing technique, FiDoop-DP places highly similar transactions into a data partition to improve locality without creating an excessive number of redundant transactions.
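
To make the LSH step concrete, here is a minimal MinHash signature in Java, assuming transactions are sets of non-negative integer item ids. Transactions whose signatures agree on many rows are likely similar, so banding the signature yields partition keys that co-locate similar transactions. This is our illustration of the general technique, not FiDoop-DP's actual code.

```java
import java.util.Random;
import java.util.Set;

public class MinHash {
    private final int[] a, b;
    private static final int PRIME = 2147483647;  // 2^31 - 1

    public MinHash(int numHashes, long seed) {
        Random r = new Random(seed);
        a = new int[numHashes];
        b = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + r.nextInt(PRIME - 1);
            b[i] = r.nextInt(PRIME);
        }
    }

    // Signature of one transaction; matching rows suggest similar transactions.
    public int[] signature(Set<Integer> transaction) {
        int[] sig = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            long min = Long.MAX_VALUE;
            for (int item : transaction) {
                long h = ((long) a[i] * item + b[i]) % PRIME;
                if (h < min) min = h;
            }
            sig[i] = (int) min;
        }
        return sig;
    }
}
```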



IEEE 2017: Privacy-Preserving Data Encryption Strategy for Big Data in Mobile Cloud Computing
Abstract: Privacy has become a considerable issue as applications of big data grow dramatically in cloud computing. These emerging technologies have improved or changed service models and application performance in various respects. However, the remarkably growing volume of data has also created many challenges in practice. The execution time of data encryption is one of the serious issues during data processing and transmission. Many current applications abandon data encryption in order to reach an acceptable performance level, despite the privacy concerns this raises. In this paper, we concentrate on privacy and propose a novel data encryption approach called the Dynamic Data Encryption Strategy (D2ES). Our proposed approach selectively encrypts data, using privacy classification methods under timing constraints. It is designed to maximize the scope of privacy protection within the required execution time. The performance of D2ES has been evaluated in our experiments, which provide proof of the privacy enhancement.
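
One way to picture a selective encryption strategy under a timing constraint is a greedy budgeted selection: encrypt the most privacy-sensitive items first while the predicted encryption time still fits the budget. The sketch below is our simplification; Packet, privacyWeight, and estEncryptMillis are hypothetical names, and D2ES's actual selection and privacy classification are more involved.

```java
import java.util.Comparator;
import java.util.List;

public class SelectiveEncryption {
    public static class Packet {
        public final double privacyWeight;   // higher = more sensitive (hypothetical)
        public final long estEncryptMillis;  // predicted encryption cost (hypothetical)
        public boolean encrypt;
        public Packet(double w, long t) { privacyWeight = w; estEncryptMillis = t; }
    }

    // Marks packets to encrypt, greedily maximizing privacy weight per unit cost
    // while keeping total predicted encryption time within the budget.
    public static void plan(List<Packet> packets, long budgetMillis) {
        packets.sort(Comparator.comparingDouble(
                (Packet p) -> p.privacyWeight / p.estEncryptMillis).reversed());
        long used = 0;
        for (Packet p : packets) {
            if (used + p.estEncryptMillis <= budgetMillis) {
                p.encrypt = true;
                used += p.estEncryptMillis;
            }
        }
    }
}
```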


IEEE 2017: SocialQ&A: An Online Social Network Based Question and Answer System

Abstract: Question and Answer (Q&A) systems play a vital role in our daily life for information and knowledge sharing. Users post questions and pick questions to answer in the system. Due to the rapidly growing user population and the number of questions, it is unlikely for a user to stumble upon a question by chance that (s)he can answer. Also, altruism does not encourage all users to provide answers, let alone high-quality answers with a short wait time. The primary objective of this paper is to improve the performance of Q&A systems by actively forwarding questions to users who are capable of and willing to answer them. To this end, we have designed and implemented SocialQ&A, an online social network based Q&A system. SocialQ&A leverages the social network properties of common interest and mutual trust among friends to identify, through the asker's friendships, the users who are most likely to answer the question, and to enhance user security. We also improve SocialQ&A with security and efficiency enhancements by protecting user privacy and identities, and by retrieving answers automatically for recurrent questions. We describe the architecture and algorithms, and conduct comprehensive large-scale simulations to evaluate SocialQ&A in comparison with other methods. Our results suggest that social networks can be leveraged to improve answer quality and reduce the asker's waiting time. We also implemented a real prototype of SocialQ&A and analyzed the Q&A behavior of real users and questions from a small-scale real-world deployment.
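
The common-interest matching can be illustrated with a cosine similarity between a question's topic vector and a friend's interest vector; the highest-scoring friends would be the forwarding candidates. The sparse-vector representation and scoring rule here are assumptions for illustration, not the paper's exact ranking function.

```java
import java.util.Map;

public class InterestMatch {
    // Cosine similarity of two sparse term-weight vectors (term -> weight).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double v = b.get(e.getKey());
            if (v != null) dot += e.getValue() * v;
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```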

 

IEEE 2017: Practical Privacy-Preserving MapReduce Based K-means Clustering over Large-scale Dataset
Abstract: Clustering techniques have been widely adopted in many real-world data analysis applications, such as customer behavior analysis, medical data analysis, and digital forensics. With the explosion of data in today's big data era, a major trend in handling clustering over large-scale datasets is to outsource it to HDFS platforms. This is because cloud computing offers not only reliable services with performance guarantees, but also savings on in-house IT infrastructure. However, as datasets used for clustering may contain sensitive information, e.g., patient health information, commercial data, and behavioral data, directly outsourcing them to distributed servers inevitably raises privacy concerns. In this paper, we propose a practical privacy-preserving K-means clustering scheme that can be efficiently outsourced to HDFS servers.
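
For orientation, this is what one plain (non-private) K-means iteration looks like as a Hadoop mapper: assign each point to its nearest centroid and emit (centroid id, point) for the reducer to average. The paper's contribution is performing this computation in a privacy-preserving way, which this sketch does not attempt. It assumes hadoop-client on the classpath and comma-separated numeric input, and it hardcodes centroids where a real job would load them in setup() from the distributed cache.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KMeansMapper extends Mapper<Object, Text, IntWritable, Text> {
    private double[][] centroids;

    @Override
    protected void setup(Context ctx) {
        // Placeholder: a real job reads the current centroids from the
        // distributed cache; hardcoded here to keep the sketch self-contained.
        centroids = new double[][] { {0.0, 0.0}, {10.0, 10.0} };
    }

    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        double[] p = new double[parts.length];
        for (int i = 0; i < parts.length; i++) p[i] = Double.parseDouble(parts[i]);

        // Assign the point to its nearest centroid (squared Euclidean distance).
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < p.length && i < centroids[c].length; i++) {
                double diff = p[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        // The reducer for each centroid id averages its points into a new centroid.
        ctx.write(new IntWritable(best), value);
    }
}
```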

    

IEEE 2017: Detecting and Analyzing Urban Regions with High Impact of Weather Change on Transport
Abstract: In this work, we focus on two fundamental questions that are unprecedentedly important to urban planners for understanding the functional characteristics of various urban regions throughout a city, namely: (i) how to identify a regional weather-traffic sensitivity index throughout a city, indicating the degree to which a region's traffic is impacted by weather changes; and (ii) among complex regional features, such as road structure and population density, how to dissect the most influential regional features that drive urban region traffic to be more vulnerable to weather changes. These two questions are nontrivial to answer, because urban traffic changes dynamically over time and is affected by many other factors, which may dominate the overall impact. We make the first study on these questions by developing a weather-traffic index (WTI) system. The system includes two main components: weather-traffic index establishment and key factor analysis. Using the proposed system, we conducted a comprehensive empirical study in Shanghai, and the extracted weather-traffic indices were validated to be surprisingly consistent with real-world observations. Further regional key factor analysis yields interesting results. For example, house age has a significant impact on the weather-traffic index, which sheds light on future urban planning and reconstruction.
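
A plausible building block for such a sensitivity index is the correlation between a region's weather series and its traffic series over aligned time windows; the actual WTI construction in the paper is more elaborate. A minimal Pearson correlation in Java:

```java
public class Correlation {
    // Pearson correlation of two equal-length, time-aligned series,
    // e.g. hourly rainfall vs. hourly average traffic speed for one region.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;
        double vx = sxx - sx * sx / n, vy = syy - sy * sy / n;
        return cov / Math.sqrt(vx * vy);
    }
}
```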

 

IEEE 2017: A Secure and Verifiable Access Control Scheme for Big Data Storage in Clouds
Abstract: Due to the complexity and volume, outsourcing ciphertexts to a cloud is deemed to be one of the most effective approaches for big data storage and access. Nevertheless, verifying the access legitimacy of a user and securely updating a ciphertext in the cloud based on a new access policy designated by the data owner are two critical challenges to making cloud-based big data storage practical and effective. Traditional approaches either completely ignore the issue of access policy update or delegate the update to a third-party authority; but in practice, access policy update is important for enhancing security and dealing with the dynamism caused by user join and leave activities. In this paper, we propose a secure and verifiable access control scheme based on the NTRU cryptosystem for big data storage in clouds. We first propose a new NTRU decryption algorithm to overcome the decryption failures of the original NTRU, and then detail our scheme and analyze its correctness, security strengths, and computational efficiency. Our scheme allows the cloud server to efficiently update the ciphertext when a new access policy is specified by the data owner, who is also able to validate the update to counter cheating behaviors of the cloud. It also enables (i) the data owner and eligible users to effectively verify the legitimacy of a user for accessing the data, and (ii) a user to validate the information provided by other users for correct plaintext recovery. Rigorous analysis indicates that our scheme can prevent eligible users from cheating and resist various attacks such as the collusion attack.


IEEE 2016: FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce
Abstract: Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to this problem, we design a parallel frequent itemset mining algorithm called FiDoop using the MapReduce programming model. To achieve compressed storage and avoid building conditional pattern bases, FiDoop incorporates the frequent items ultrametric tree rather than conventional FP-trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial third MapReduce job, the mappers independently decompose itemsets, the reducers perform combination operations by constructing small ultrametric trees, and the actual mining of these trees is performed separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster is sensitive to data distribution and dimensions, because itemsets of different lengths have different decomposition and construction costs. To improve FiDoop's performance, we develop a workload balance metric to measure load balance across the cluster's computing nodes. We develop FiDoop-HD, an extension of FiDoop, to speed up mining performance for high-dimensional data analysis. Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution is efficient and scalable.
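
The first of the three MapReduce jobs, counting item frequencies so that infrequent items can be pruned before tree construction, follows the standard word-count pattern. A sketch of that job's mapper and reducer, assuming whitespace-separated transactions and omitting the job wiring:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ItemCount {
    public static class ItemMapper extends Mapper<Object, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(Object key, Text transaction, Context ctx)
                throws IOException, InterruptedException {
            // Emit (item, 1) for every item in the transaction line.
            for (String item : transaction.toString().split("\\s+")) {
                if (!item.isEmpty()) ctx.write(new Text(item), ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text item, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            // Items with sum below the minimum support are pruned downstream.
            ctx.write(item, new LongWritable(sum));
        }
    }
}
```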

IEEE 2016: Sentiment Analysis of Top Colleges in India Using Twitter Data
Abstract: In today's world, the opinions and reviews accessible to us are among the most critical factors in formulating our views and influencing the success of a brand, product, or service. With the advent and growth of social media, stakeholders often take to expressing their opinions on popular social media platforms, notably Twitter. While Twitter data is extremely informative, it presents a challenge for analysis because of its humongous and disorganized nature. This paper is a thorough effort to dive into the novel domain of performing sentiment analysis of people's opinions regarding top colleges in India. Besides taking additional preprocessing measures like expanding net lingo and removing duplicate tweets, a probabilistic model based on Bayes' theorem was used for spelling correction, which is overlooked in other research studies. This paper also presents a comparison between the results obtained by the following machine learning algorithms: Naïve Bayes, Support Vector Machine, and a Multilayer Perceptron artificial neural network. Furthermore, a contrast is presented between four different kernels of SVM: RBF, linear, polynomial, and sigmoid.
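
As a reference point for the Naïve Bayes baseline, here is a minimal multinomial scorer in log space with Laplace smoothing; classification picks the class whose score is highest. The training-count structures (log prior, per-class word counts) are assumed to be built elsewhere, and this is our sketch rather than the paper's implementation.

```java
import java.util.Map;

public class NaiveBayesScorer {
    // Returns log P(class) + sum over tokens of log P(token | class).
    static double score(String[] tokens, double logPrior,
                        Map<String, Integer> wordCounts, int totalWords, int vocabSize) {
        double s = logPrior;
        for (String t : tokens) {
            int c = wordCounts.getOrDefault(t, 0);
            // Laplace smoothing so unseen words do not zero out the score.
            s += Math.log((c + 1.0) / (totalWords + vocabSize));
        }
        return s;
    }
}
```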



IEEE 2016: Phoenix: A MapReduce Implementation with New Enhancements
IEEE 2016 Data Mining

Abstract: Lately, the large increase in data volume has resulted in compound and large datasets, giving rise to the "Big Data" concept, which has gained the attention of industrial organizations as well as academic communities. Big data applications that need large memory can benefit from the Phoenix MapReduce implementation for shared-memory machines, instead of large, distributed clusters of computers. This paper evaluates the design and prototype of Phoenix, its performance, and its limitations. It also suggests some new approaches to overcome some of Phoenix's limitations and enhance its performance on large-scale shared memory. The major contribution of this work is finding new approaches that overcome the pairs limitation in the Phoenix framework using hash tables with B+Trees, and that overcome the collision problem of hash tables.
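
The proposed collision fix can be pictured as a fixed-size hash table whose buckets are ordered trees, so colliding keys remain sorted and searchable instead of degrading into long chains. In this standalone sketch, java.util.TreeMap (a red-black tree) stands in for the B+Tree purely for illustration.

```java
import java.util.TreeMap;

public class TreeBucketTable<V> {
    // Fixed bucket count; each bucket is an ordered tree of colliding keys.
    @SuppressWarnings("unchecked")
    private final TreeMap<String, V>[] buckets = new TreeMap[1024];

    private TreeMap<String, V> bucket(String key) {
        int i = (key.hashCode() & 0x7fffffff) % buckets.length;
        if (buckets[i] == null) buckets[i] = new TreeMap<>();
        return buckets[i];
    }

    public void put(String key, V value) { bucket(key).put(key, value); }

    // Lookup stays O(log k) within a bucket of k colliding keys.
    public V get(String key) { return bucket(key).get(key); }
}
```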



IEEE 2016: On Traffic-Aware Partition and Aggregation in MapReduce for Big Data Applications
IEEE 2016 Data Mining


Abstract: The MapReduce programming model simplifies large-scale data processing on commodity clusters by exploiting parallel map tasks and reduce tasks. Although many efforts have been made to improve the performance of MapReduce jobs, they ignore the network traffic generated in the shuffle phase, which plays a critical role in performance enhancement. Traditionally, a hash function is used to partition intermediate data among reduce tasks; this, however, is not traffic-efficient because network topology and the data size associated with each key are not taken into consideration. In this paper, we study how to reduce the network traffic cost of a MapReduce job by designing a novel intermediate data partition scheme. Furthermore, we jointly consider the aggregator placement problem, where each aggregator can reduce merged traffic from multiple map tasks. A decomposition-based distributed algorithm is proposed to deal with the large-scale optimization problem for big data applications, and an online algorithm is also designed to adjust data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.
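
In Hadoop terms, replacing the default hash partitioner means supplying a custom Partitioner. The sketch below routes keys according to a precomputed assignment table, which stands in for the output of the paper's optimization; the fallback path is ordinary hash partitioning. The table and its contents are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class TrafficAwarePartitioner extends Partitioner<Text, LongWritable> {
    // In the paper this mapping would come from the distributed optimization
    // over key sizes and topology; here it is a placeholder table.
    private static final Map<String, Integer> keyToPartition = new HashMap<>();

    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        Integer p = keyToPartition.get(key.toString());
        if (p != null) return p % numPartitions;
        // Fall back to default hash partitioning for unplanned keys.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```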



IEEE 2016: The SP Theory of Intelligence: Distinctive Features and Advantages
IEEE 2016 Data Mining

Abstract: This paper aims to highlight distinctive features of the SP theory of intelligence, realized in the SP computer model, and its apparent advantages compared with some AI-related alternatives. Perhaps most importantly, the theory simplifies and integrates observations and concepts in AI-related areas, and has potential to simplify and integrate structures and processes in computing systems. Unlike most other AI-related theories, the SP theory is itself a theory of computing, which can be the basis for new architectures for computers. Fundamental in the theory is information compression via the matching and unification of patterns and, more specifically, via a concept of multiple alignment. The theory promotes transparency in the representation and processing of knowledge, and unsupervised learning of natural structures via information compression. It provides an interpretation of aspects of mathematics and of phenomena in human perception and cognition. Abstract concepts in the theory may be realized in terms of neurons and their interconnections (SP-neural). These features and advantages of the SP system are discussed in relation to AI-related alternatives: the concept of minimum length encoding and related concepts; how computational and energy efficiency in computing may be achieved; deep learning in neural networks; unified theories of cognition and related research; universal search; Bayesian networks and some other models for AI; IBM's Watson; solving problems associated with big data and with the development of intelligence in autonomous robots; pattern recognition and vision; the learning and processing of natural language; exact and inexact forms of reasoning; the representation and processing of diverse forms of knowledge; and software engineering. In conclusion, the SP system can provide a firm foundation for the long-term development of AI and related areas, and at the same time it may deliver useful results on relatively short timescales.



IEEE 2016: A Parallel Patient Treatment Time Prediction Algorithm and Its Applications in Hospital Queuing-Recommendation in a Big Data Environment
IEEE 2016 Data Mining

Abstract: Effective patient queue management to minimize patient wait delays and patient overcrowding is one of the major challenges faced by hospitals. Unnecessary and annoying waits for long periods result in substantial human resource and time wastage and increase the frustration endured by patients. For each patient in the queue, the total treatment time of all the patients before him is the time that he must wait. It would be convenient and preferable if patients could receive the most efficient treatment plan and know the predicted waiting time through a mobile application that updates in real time. Therefore, we propose a Patient Treatment Time Prediction (PTTP) algorithm to predict the waiting time for each treatment task for a patient. We use realistic patient data from various hospitals to obtain a patient treatment time model for each task. Based on this large-scale, realistic dataset, the treatment time for each patient in the current queue of each task is predicted. Based on the predicted waiting time, a Hospital Queuing-Recommendation (HQR) system is developed. HQR calculates and predicts an efficient and convenient treatment plan recommended for the patient. Because of the large-scale, realistic dataset and the requirement for real-time response, the PTTP algorithm and HQR system mandate efficiency and low-latency response. We use an Apache Spark-based cloud implementation at the National Supercomputing Center in Changsha to achieve the aforementioned goals. Extensive experimentation and simulation results demonstrate the effectiveness and applicability of our proposed model to recommend an effective treatment plan for patients to minimize their wait times in hospitals.
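
The queuing arithmetic in the abstract reduces to a sum: a patient's predicted wait for a task is the total predicted treatment time of everyone ahead in that task's queue. In the sketch below, predictTreatmentMillis is a hypothetical stand-in for the learned PTTP model.

```java
import java.util.List;

public class WaitTime {
    // Stand-in for the trained per-task treatment time model.
    interface TreatmentModel {
        long predictTreatmentMillis(String patientId);
    }

    // Predicted wait = sum of predicted treatment times of patients ahead.
    static long predictedWaitMillis(List<String> queueAhead, TreatmentModel model) {
        long total = 0;
        for (String patient : queueAhead) total += model.predictTreatmentMillis(patient);
        return total;
    }
}
```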


IEEE 2016: Protection of Big Data Privacy
IEEE 2016 Big Data Applications

Abstract: In recent years, big data have become a hot research topic. The increasing amount of big data also increases the chance of breaching the privacy of individuals. Since big data require high computational power and large storage, distributed systems are used. As multiple parties are involved in these systems, the risk of privacy violation is increased. There have been a number of privacy-preserving mechanisms developed for privacy protection at different stages (e.g., data generation, data storage, and data processing) of a big data life cycle. The goal of this paper is to provide a comprehensive overview of the privacy preservation mechanisms in big data and present the challenges for existing mechanisms. In particular, in this paper, we illustrate the infrastructure of big data and the state-of-the-art privacy-preserving mechanisms in each stage of the big data life cycle. Furthermore, we discuss the challenges and future research directions related to privacy preservation in big data.



IEEE 2016: Towards a Virtual Domain Based Authentication on MapReduce
IEEE 2016 Data Mining


Abstract: This paper proposes a novel authentication solution for the MapReduce (MR) model, a distributed and parallel computing paradigm commonly deployed to process big data by major IT players such as Facebook and Yahoo. It identifies a set of security, performance, and scalability requirements, specified through a comprehensive study of a job execution process using MR and of security threats and attacks in this environment. Based on the requirements, it critically analyzes the state-of-the-art authentication solutions, finding that the authentication services currently proposed for the MR model are not adequate. The paper then presents a novel layered authentication solution for the MR model and describes the core components of this solution, including the virtual domain based authentication framework (VDAF). These ideas are significant because, first, the approach embeds the characteristics of MR-in-cloud deployments into the security solution design, which allows the MR model to be delivered as software as a service in a public cloud environment along with our proposed authentication solution; second, VDAF supports the authentication of every interaction by any MR components involved in a job execution flow, as long as the interactions are for accessing resources of the job; third, this continuous authentication service is provided in such a manner that the cost incurred in providing it is kept as low as possible.


IEEE 2016: CMiner: Opinion Extraction and Summarization for Chinese Microblogs
IEEE 2016 Data Mining

Abstract: Sentiment analysis of microblog texts has drawn much attention in both academic and industrial fields. However, most current work focuses only on polarity classification. In this paper, we present an opinion mining system for Chinese microblogs called CMiner. Instead of polarity classification, CMiner focuses on more complicated opinion mining tasks: opinion target extraction and opinion summarization. Novel algorithms are developed for the two tasks and integrated into the end-to-end system. CMiner can help to effectively understand users' opinions toward different opinion targets in a microblog topic. Specifically, we develop an unsupervised label propagation algorithm for opinion target extraction. The opinion targets of all messages in a topic are collectively extracted based on the assumption that similar messages may focus on similar opinion targets. In addition, we build an aspect-based opinion summarization framework for microblog topics. After obtaining the opinion targets of all the microblog messages in a topic, we cluster the opinion targets into several groups and extract representative targets and summaries for each group. A co-ranking algorithm is proposed to rank both the opinion targets and microblog sentences simultaneously. Experimental results on a benchmark dataset show the effectiveness of our system and algorithms.



IEEE 2015: A Hierarchical Distributed Processing Framework for Big Image Data
Abstract: This paper introduces an effective processing framework named ICP (Image Cloud Processing) to cope with the data explosion in the image processing field. While most previous research focuses on optimizing image processing algorithms to gain higher efficiency, our work is dedicated to providing a general framework for those image processing algorithms that can be implemented in parallel, so as to achieve a boost in time efficiency without compromising the quality of results as the image scale increases. The proposed ICP framework consists of two mechanisms, i.e., SICP (Static ICP) and DICP (Dynamic ICP). Specifically, SICP is aimed at processing the big image data pre-stored in the distributed system, while DICP is proposed for dynamic input. To accomplish SICP, two novel data representations named P-Image and Big-Image are designed to cooperate with MapReduce to achieve a more optimized configuration and higher efficiency. DICP is implemented through a parallel processing procedure working with the traditional processing mechanism of the distributed system. Representative results of comprehensive experiments on the challenging ImageNet dataset are selected to validate the capacity of our proposed ICP framework over traditional state-of-the-art methods, both in time efficiency and in quality of results.








IEEE 2015: Secure Sensitive Data Sharing on a Big Data Platform
IEEE 2015 Transactions on Big Data

Abstract: Users store vast amounts of sensitive data on a big data platform. Sharing sensitive data will help enterprises reduce the cost of providing users with personalized services and provide value-added data services. However, secure data sharing is problematic. This paper proposes a framework for secure sensitive data sharing on a big data platform, including secure data delivery, storage, usage, and destruction on a semi-trusted big data sharing platform. We present a proxy re-encryption algorithm based on heterogeneous ciphertext transformation and a user process protection method based on a virtual machine monitor, which provides support for the realization of system functions. The framework protects the security of users’ sensitive data effectively and shares these data safely. At the same time, data owners retain complete control of their own data in a sound environment for modern Internet information security.


IEEE 2015: Research Directions for Engineering Big Data Analytics Software
IEEE 2015 Transactions on Big Data

Abstract: Many software startups and research and development efforts are actively taking place to harness the power of big data and create software with potential to improve almost every aspect of human life. As these efforts continue to increase, full consideration needs to be given to engineering aspects of big data software. Since these systems exist to make predictions on complex and continuous massive datasets, they pose unique problems during specification, design, and verification of software that needs to be delivered on-time and within budget. But, given the nature of big data software, can this be done? Does big data software engineering really work? This article explores details of big data software, discusses the main problems encountered when engineering big data software, and proposes avenues for future research.


IEEE 2015: Massive MIMO as a Big Data System: Random Matrix Models and Testbed
IEEE 2015 Transactions on Big Data

Abstract: The paper has two parts. The first deals with how to use large random matrices as building blocks to model the massive data arising from a massive (or large-scale) MIMO system. We apply this model to distributed spectrum sensing and network monitoring. This part boils down to streaming, distributed massive data, for which a new algorithm is obtained and its performance derived using a central limit theorem recently obtained in the literature. The second part deals with a large-scale testbed using software-defined radios (particularly USRP), a 70-node network testbed that took us more than four years to develop. To demonstrate the power of software-defined radio, we quickly reconfigure our testbed into a testbed for massive MIMO. The massive data from this testbed is of central interest in this paper. This is the first time we have modeled the experimental data arising from this testbed; to the best of our knowledge, we are not aware of other similar work.


  
IEEE 2015: Big Data and the SP Theory of Intelligence
IEEE 2015 Transactions on Big Data

Abstract: This paper is about how the SP theory of intelligence and its realization in the SP machine may, with advantage, be applied to the management and analysis of big data. The SP system, introduced in this paper and fully described elsewhere, may help to overcome the problem of variety in big data; it has potential as a universal framework for the representation and processing of diverse kinds of knowledge, helping to reduce the diversity of formalisms and formats for knowledge, and the different ways in which they are processed. It has strengths in the unsupervised learning or discovery of structure in data, in pattern recognition, in the parsing and production of natural language, in several kinds of reasoning, and more. It lends itself to the analysis of streaming data, helping to overcome the problem of velocity in big data. Central in the workings of the system is lossless compression of information: making big data smaller and reducing problems of storage and management. There is potential for substantial economies in the transmission of data, for big cuts in the use of energy in computing, for faster processing, and for smaller and lighter computers. The system provides a handle on the problem of veracity in big data, with potential to assist in the management of errors and uncertainties in data. It lends itself to the visualization of knowledge structures and inferential processes. A high-parallel, open-source version of the SP machine would provide a means for researchers everywhere to explore what can be done with the system and to create new versions of it.


IEEE 2015: Cost Minimization for Big Data Processing in Geo-Distributed Data Centers
IEEE 2015 Transactions on Big Data

Abstract: The explosive growth of demand for big data processing imposes a heavy burden on computation, storage, and communication in data centers, which incurs considerable operational expenditure for data center providers. Therefore, cost minimization has become an emergent issue for the upcoming big data era. Different from conventional cloud services, one of the main features of big data services is the tight coupling between data and computation, as computation tasks can be conducted only when the corresponding data are available. As a result, three factors, i.e., task assignment, data placement, and data movement, deeply influence the operational expenditure of data centers. In this paper, we study the cost minimization problem via a joint optimization of these three factors for big data services in geo-distributed data centers. To describe the task completion time with consideration of both data transmission and computation, we propose a 2-D Markov chain and derive the average task completion time in closed form. Furthermore, we model the problem as mixed-integer nonlinear programming and propose an efficient solution to linearize it. The high efficiency of our proposal is validated by extensive simulation-based studies.

