IEEE 2018/19 - Big Data with Hadoop Projects

IEEE 2018 : MR-Mafia: Parallel Subspace Clustering Algorithm Based on MapReduce For Large Multi-dimensional Datasets

Abstract: The mission of subspace clustering is to find hidden clusters that exist in different subspaces within a dataset. In recent years, with the exponential growth of data size and dimensionality, traditional subspace clustering algorithms have become inefficient and ineffective at extracting knowledge in the big data environment, creating an urgent need for efficient parallel, distributed subspace clustering algorithms that can handle large multi-dimensional data at an acceptable computational cost. In this paper, we introduce MR-Mafia, a parallel MAFIA subspace clustering algorithm based on MapReduce. The algorithm takes advantage of MapReduce's data partitioning and task parallelism and achieves a good trade-off between disk-access cost and communication cost. The experimental results show near-linear speedups and demonstrate the high scalability and strong application prospects of the proposed algorithm.
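
The paper itself includes no code, but the first MAFIA phase (building adaptive per-dimension histograms to find dense units) maps naturally onto map and reduce steps. Below is a minimal single-process Python sketch of that phase; the bin count, density threshold, and toy data are assumptions for illustration, not values from the paper.

```python
# Minimal single-process sketch of the first MR-Mafia phase: mappers build
# per-dimension histograms over their data split, a reducer merges them and
# keeps bins whose density exceeds a threshold (candidate dense units).
# Bin count, threshold, and the toy data are illustrative assumptions.
from collections import Counter
from itertools import chain

BINS, DENSITY_THRESHOLD = 10, 0.4  # assumed parameters

def map_histogram(split, lo, hi):
    """Emit ((dim, bin), 1) pairs for one data split."""
    width = [(h - l) / BINS for l, h in zip(lo, hi)]
    for point in split:
        for dim, value in enumerate(point):
            b = min(int((value - lo[dim]) / width[dim]), BINS - 1)
            yield (dim, b), 1

def reduce_histogram(pairs, n_points):
    """Sum per-bin counts and keep dense one-dimensional units."""
    counts = Counter()
    for key, one in pairs:
        counts[key] += one
    return {k: c for k, c in counts.items() if c / n_points >= DENSITY_THRESHOLD}

if __name__ == "__main__":
    data = [(0.1, 0.9), (0.15, 0.88), (0.2, 0.91), (0.8, 0.1)]  # toy 2-D points
    lo, hi = (0.0, 0.0), (1.0, 1.0)
    splits = [data[:2], data[2:]]                  # two "mapper" inputs
    pairs = chain.from_iterable(map_histogram(s, lo, hi) for s in splits)
    print(reduce_histogram(pairs, len(data)))      # dense (dim, bin) units
```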



IEEE 2018 : Ciphertext-Policy Attribute-Based Signcryption With Verifiable Outsourced Designcryption for Sharing Personal Health Records

Abstract: The Personal Health Record (PHR) is a patient-centric model of health information exchange that greatly facilitates the storage, access, and sharing of personal health information. To share these valuable resources and reduce operational cost, PHR service providers would like to store PHR applications and health information data in the cloud. However, private health information may be exposed to unauthorized organizations or individuals, since patients lose physical control of their health information. Ciphertext-Policy Attribute-Based Signcryption (CP-ABSC) is a promising solution for designing a cloud-assisted PHR secure sharing system: it provides fine-grained access control, confidentiality, authenticity, and sender privacy for PHR data. However, the large number of pairing and modular exponentiation computations imposes heavy computational overhead during the designcryption process. To reconcile the conflict between high computational overhead and low efficiency in designcryption, an outsourcing scheme is proposed in this paper. In our scheme, the heavy computations are outsourced to the Ciphertext Transformed Server (CTS), leaving only a small computational overhead for the PHR user. At the same time, the extra communication overhead introduced by our scheme is tolerable. Furthermore, theoretical analysis is provided, and the desired security properties, including confidentiality, unforgeability, and verifiability, have been formally proven in the random oracle model. Experimental evaluation indicates that the proposed scheme is practical and feasible.


IEEE 2018 : Client Side Secure Image Deduplication Using DICE Protocol

Abstract: With the advent of cloud computing, secure data deduplication has gained a lot of popularity, and many techniques have been proposed in this ongoing research area. Among these, the Message-Locked Encryption (MLE) scheme is often mentioned: researchers have introduced MLE-based protocols that provide secure deduplication of data, where the data is generally in text form. As a result, multimedia data such as images and video, which are larger than text files, have not received much attention, even though applying secure deduplication to such files could significantly reduce the cost and space required for their storage. In this paper, we present a secure deduplication scheme for near-identical (NI) images using the Dual Integrity Convergent Encryption (DICE) protocol, a variant of the MLE scheme. In the proposed scheme, an image is decomposed into blocks and the DICE protocol is applied to each block separately rather than to the entire image; as a result, blocks that are common between two or more NI images are stored only once at the cloud. We provide detailed analyses of the theoretical, experimental, and security aspects of the proposed scheme.
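
To make the block-level idea concrete, here is a hedged Python sketch of convergent encryption applied per block, in the spirit of DICE/MLE: the key for each block is derived from the block itself, so identical blocks produce identical ciphertexts and deduplicate naturally. The XOR keystream below is a stand-in for a real cipher and is not secure; the block size and helper names are illustrative assumptions.

```python
# Block-level convergent-encryption sketch: each block is keyed by its own
# hash, so identical blocks from near-identical images encrypt to identical
# ciphertexts and are stored once. The XOR keystream stands in for a real
# cipher and is NOT secure; block size and helper names are assumptions.
import hashlib

BLOCK = 4096  # assumed block size in bytes

def keystream(key: bytes, n: int) -> bytes:
    out, ctr = bytearray(), 0
    while len(out) < n:                       # SHA-256 in counter mode
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return bytes(out[:n])

def encrypt_block(block: bytes):
    key = hashlib.sha256(block).digest()      # convergent key = H(block)
    ct = bytes(b ^ k for b, k in zip(block, keystream(key, len(block))))
    tag = hashlib.sha256(ct).hexdigest()      # dedup index over ciphertext
    return tag, ct, key

def store(image: bytes, cloud: dict):
    """Upload an image; blocks already in `cloud` are deduplicated."""
    keys = []
    for i in range(0, len(image), BLOCK):
        tag, ct, key = encrypt_block(image[i:i + BLOCK])
        cloud.setdefault(tag, ct)             # store each unique block once
        keys.append((tag, key))               # client keeps keys locally
    return keys

if __name__ == "__main__":
    cloud = {}
    img_a = bytes(range(256)) * 32            # two near-identical "images"
    img_b = img_a[:-1] + b"\x00"              # last byte differs
    store(img_a, cloud)
    store(img_b, cloud)
    print(len(cloud), "unique blocks stored") # fewer than the blocks uploaded
```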


IEEE 2018 : Capacity-aware Key Partitioning Scheme for Heterogeneous Big Data Analytic Engines

Abstract: Big data and cloud computing have been the centre of interest for the past decade. With the increase in data sizes and the variety of cloud applications, big data analytics has become very popular in both industry and academia, and research communities have continually tried to build fast, robust, and fault-tolerant analytic engines. MapReduce has become one of the most popular big data analytic engines in recent years, and Hadoop is a standard implementation of the MapReduce framework for running data-intensive applications on clusters of commodity servers. By thoroughly studying the framework, we find that the shuffle phase, the all-to-all input data fetching phase of the reduce task, significantly affects application performance. Hadoop's MapReduce system suffers from variance in both the intermediate keys' frequencies and their distribution among data nodes throughout the cluster. This variance causes network overhead and leads to unfairness in the reduce input across data nodes, so applications experience performance degradation during the shuffle phase. We develop a novel algorithm that, unlike previous systems, uses each node's capabilities as a heuristic to decide a better trade-off between locality and fairness in the system. Compared with Hadoop's default partitioning algorithm and the Leen partitioning algorithm: (a) with 2 million key-value pairs to process, our approach achieves better resource utilization by about 19% and 9%, respectively; (b) with 3 million key-value pairs, it achieves near-optimal resource utilization gains of about 15% and 7%, respectively.
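
As a rough illustration of the idea (not the paper's algorithm), the sketch below assigns intermediate keys to reducers greedily, weighting each node's load by its relative capacity, rather than using the default hash partitioner. Node capacities and key frequencies are invented for the example.

```python
# Minimal sketch of a capacity-aware key partitioner: instead of
# hash(key) % reducers, keys are assigned greedily so that each node's
# share of reduce input is proportional to its capacity.
def capacity_aware_partition(key_freq: dict, capacity: dict) -> dict:
    total_cap = sum(capacity.values())
    load = {node: 0 for node in capacity}
    assignment = {}
    # Place heaviest keys first on the node furthest below its fair share.
    for key, freq in sorted(key_freq.items(), key=lambda kv: -kv[1]):
        node = min(load, key=lambda n: (load[n] + freq) / (capacity[n] / total_cap))
        assignment[key] = node
        load[node] += freq
    return assignment

if __name__ == "__main__":
    freqs = {"k1": 900, "k2": 500, "k3": 400, "k4": 300, "k5": 100}
    caps = {"fast-node": 4, "slow-node": 1}   # relative capabilities
    plan = capacity_aware_partition(freqs, caps)
    print(plan)  # heavy keys gravitate to the higher-capacity node
```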


IEEE 2018 : Secure Identity-based Data Sharing and Profile Matching for Mobile Healthcare Social Networks in Cloud Computing

Abstract: Cloud computing and social networks are changing the way of healthcare by providing real-time data sharing in a cost-effective manner. However, data security is one of the main obstacles to the wide adoption of mobile healthcare social networks (MHSN), since health information is considered highly sensitive. In this paper, we introduce a secure data sharing and profile matching scheme for MHSN in cloud computing. Patients can outsource their encrypted health records to cloud storage with an identity-based broadcast encryption (IBBE) technique and share them with a group of doctors in a secure and efficient manner. We then present an attribute-based conditional data re-encryption construction, which permits doctors who satisfy the pre-defined conditions in the ciphertext to authorize the cloud platform to convert a ciphertext into a new ciphertext of an identity-based encryption scheme for a specialist without leaking any sensitive information. Further, we provide a profile matching mechanism in MHSN based on identity-based encryption with equality test, which helps patients find friends in a privacy-preserving way, achieves flexible authorization on encrypted health records while resisting keyword-guessing attacks, and reduces the computation cost on the patient side. The security analysis and experimental evaluation show that our scheme is practical for protecting data security and privacy in MHSN.





IEEE 2017: Efficient Processing of Skyline Queries Using MapReduce  

Abstract: The skyline operator has attracted considerable attention recently due to its broad applications. However, computing a skyline is challenging today because we have to deal with big data, and the MapReduce framework has been widely used for such data-intensive applications. In this paper, we propose SKY-MR+, an efficient parallel algorithm for processing skyline queries using MapReduce. We first build a quadtree-based histogram for space partitioning, deciding judiciously whether to split each leaf node based on the benefit of splitting in terms of estimated execution time. In addition, we apply the dominance-power filtering method to effectively prune non-skyline points in advance. We then partition data based on the regions divided by the quadtree and compute candidate skyline points for each partition using MapReduce. Finally, we check whether each skyline candidate is actually a skyline point in every partition using MapReduce.
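
The local-then-global structure of the algorithm can be illustrated compactly. The Python sketch below computes a local skyline per partition (the map side) and then merges the candidates (the reduce side); it assumes minimization on every dimension and omits the quadtree histogram and dominance-power filtering.

```python
# Sketch of the two-level skyline computation: each "mapper" computes a
# local skyline for its partition, then a final pass merges the local
# candidates into the global skyline. Data and partitioning are toy values.
def dominates(p, q):
    """p dominates q if p is <= q everywhere and < q somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def local_skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]

if __name__ == "__main__":
    partitions = [[(1, 9), (4, 4), (5, 6)], [(2, 7), (6, 3), (7, 8)]]
    candidates = [p for part in partitions for p in local_skyline(part)]  # map
    print(local_skyline(candidates))                                      # reduce
```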



IEEE 2017: Attribute-Based Storage Supporting Secure Deduplication of Encrypted Data in Cloud
Abstract: Attribute-based encryption (ABE) has been widely used in cloud computing where a data provider outsources his/her encrypted data to a cloud service provider, and can share the data with users possessing specific credentials (or attributes). However, the standard ABE system does not support secure deduplication, which is crucial for eliminating duplicate copies of identical data in order to save storage space and network bandwidth. In this paper, we present an attribute-based storage system with secure deduplication in a hybrid cloud setting, where a private cloud is responsible for duplicate detection and a public cloud manages the storage. Compared with the prior data deduplication systems, our system has two advantages. Firstly, it can be used to confidentially share data with users by specifying access policies rather than sharing decryption keys. Secondly, it achieves the standard notion of semantic security for data confidentiality while existing systems only achieve it by defining a weaker security notion. In addition, we put forth a methodology to modify a ciphertext over one access policy into ciphertexts of the same plaintext but under other access policies without revealing the underlying plaintext.


IEEE 2017: FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters

Abstract: Traditional parallel algorithms for mining frequent itemsets aim to balance load by equally partitioning data among a group of computing nodes. We start this study by discovering a serious performance problem of the existing parallel Frequent Itemset Mining algorithms. Given a large dataset, data partitioning strategies in the existing solutions suffer high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this problem by developing a data partitioning approach called FiDoop-DP using the MapReduce programming model. The overarching goal of FiDoop-DP is to boost the performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP is the Voronoi diagram-based data partitioning technique, which exploits correlations among transactions. Incorporating the similarity metric and the Locality-Sensitive Hashing technique, FiDoop-DP places highly similar transactions into a data partition to improve locality without creating an excessive number of redundant transactions.
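
A minimal sketch of the grouping idea, assuming MinHash signatures as the Locality-Sensitive Hashing ingredient: transactions with similar item sets tend to collide into the same partition. The hash count and single-band bucketing below are simplifications of real LSH.

```python
# MinHash sketch of the LSH grouping step: transactions with similar item
# sets get identical signatures with high probability and land in the same
# partition, so correlated transactions stay together.
import hashlib

def minhash_signature(items, n_hashes=8):
    sig = []
    for seed in range(n_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{it}".encode()).hexdigest(), 16)
            for it in items))
    return tuple(sig)

def partition_by_lsh(transactions, n_partitions=4):
    buckets = {}
    for tid, items in transactions.items():
        # One band over the whole signature keeps the sketch short;
        # real LSH would use several bands of a few rows each.
        bucket = hash(minhash_signature(items)) % n_partitions
        buckets.setdefault(bucket, []).append(tid)
    return buckets

if __name__ == "__main__":
    txns = {"t1": {"milk", "bread", "eggs"},
            "t2": {"milk", "bread", "eggs"},      # duplicate of t1
            "t3": {"milk", "bread", "butter"},    # similar to t1
            "t4": {"nails", "hammer", "saw"}}     # unrelated
    print(partition_by_lsh(txns))  # t1/t2 (and likely t3) share a partition
```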



IEEE 2017: Privacy-Preserving Data Encryption Strategy for Big Data in Mobile Cloud Computing
Abstract: Privacy has become a considerable issue as applications of big data grow dramatically in cloud computing. These emerging technologies have improved or changed service models and improved application performance in various respects. However, the remarkably growing volume of data has also resulted in many practical challenges. The execution time of data encryption is one of the serious issues during data processing and transmission, and many current applications abandon data encryption to reach an acceptable performance level, despite the privacy concerns. In this paper, we concentrate on privacy and propose a novel data encryption approach called the Dynamic Data Encryption Strategy (D2ES). Our approach selectively encrypts data, using privacy classification methods under timing constraints, and is designed to maximize the scope of privacy protection within the required execution time. The performance of D2ES has been evaluated in our experiments, which provide proof of the privacy enhancement.
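
The selective-encryption idea can be sketched as a greedy budget problem: encrypt the items with the highest privacy value per unit of encryption time until the timing constraint is exhausted. The weights, times, and budget below are assumed inputs; the paper's actual classification and scheduling are more involved.

```python
# Illustrative sketch of selective encryption under a timing budget: given
# a per-item privacy weight and an estimated encryption time, greedily
# encrypt the items with the best privacy-per-millisecond ratio.
def select_for_encryption(items, time_budget_ms):
    """items: list of (name, privacy_weight, enc_time_ms)."""
    chosen, spent = [], 0.0
    for name, weight, cost in sorted(items, key=lambda x: -x[1] / x[2]):
        if spent + cost <= time_budget_ms:
            chosen.append(name)
            spent += cost
    return chosen, spent

if __name__ == "__main__":
    data = [("medical_record", 10, 40.0), ("location_trace", 8, 25.0),
            ("app_logs", 2, 30.0), ("cache_files", 1, 50.0)]
    plan, used = select_for_encryption(data, time_budget_ms=70.0)
    print(plan, f"({used} ms of 70 ms budget)")
```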



IEEE 2017: SocialQ&A: An Online Social Network Based Question and Answer System

Abstract: Question and Answer (Q&A) systems play a vital role in our daily life for information and knowledge sharing. Users post questions and pick questions to answer in the system. Due to the rapidly growing user population and number of questions, it is unlikely for a user to stumble upon a question (s)he can answer by chance. Also, altruism does not encourage all users to provide answers, let alone high-quality answers with a short wait time. The primary objective of this paper is to improve the performance of Q&A systems by actively forwarding questions to users who are capable of and willing to answer them. To this end, we have designed and implemented Social Q&A, an online social network based Q&A system. Social Q&A leverages the social-network properties of common interest and mutual trust between friends to identify, through friendship, the users most likely to answer an asker's question, and to enhance user security. We further improve Social Q&A with security and efficiency enhancements that protect user privacy and identities and automatically retrieve answers for recurrent questions. We describe the architecture and algorithms, and we conducted comprehensive large-scale simulations to evaluate Social Q&A in comparison with other methods. Our results suggest that social networks can be leveraged to improve answer quality and reduce the asker's waiting time. We also implemented a real prototype of Social Q&A and analyzed the Q&A behavior of real users and questions from a small-scale real-world deployment.
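
As a toy illustration of interest-based forwarding (not the paper's exact algorithm), the sketch below scores each friend by the Jaccard overlap between the question's keywords and the friend's interest profile, weighted by a trust value, and routes the question to the top matches. Profiles, trust weights, and the scoring rule are assumptions.

```python
# Toy sketch of question forwarding over a social graph: score friends by
# keyword/interest overlap times trust, route to the best matches.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def forward_question(keywords, friends, top_k=2):
    """friends: {name: (interest_set, trust in [0, 1])}."""
    scored = [(jaccard(keywords, interests) * trust, name)
              for name, (interests, trust) in friends.items()]
    return [name for score, name in sorted(scored, reverse=True)[:top_k]
            if score > 0]

if __name__ == "__main__":
    q = {"hadoop", "shuffle", "performance"}
    friends = {"alice": ({"hadoop", "spark", "performance"}, 0.9),
               "bob": ({"cooking", "travel"}, 1.0),
               "carol": ({"hadoop", "hive"}, 0.6)}
    print(forward_question(q, friends))  # e.g. ['alice', 'carol']
```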

 

IEEE 2017: Practical Privacy-Preserving MapReduce Based K-means Clustering over Large-scale Dataset
Abstract: Clustering techniques have been widely adopted in many real-world data analysis applications, such as customer behavior analysis, medical data analysis, and digital forensics. With the explosion of data in today's big data era, a major trend for handling clustering over large-scale datasets is to outsource it to HDFS platforms, because cloud computing offers not only reliable services with performance guarantees but also savings on in-house IT infrastructure. However, since the datasets used for clustering may contain sensitive information, e.g., patient health information, commercial data, and behavioral data, directly outsourcing them to distributed servers inevitably raises privacy concerns. In this paper, we propose a practical privacy-preserving k-means clustering scheme that can be efficiently outsourced to HDFS servers.
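
Setting the privacy layer aside, the underlying MapReduce-style k-means round is easy to sketch: the map step assigns each point to its nearest centroid and the reduce step averages each cluster. The following single-process Python sketch shows one such round on toy data; the paper's encrypted computation is omitted.

```python
# Single-process sketch of one MapReduce round of k-means: map assigns
# points to nearest centroids, reduce averages each cluster.
from math import dist  # Python 3.8+

def kmeans_round(points, centroids):
    # Map: emit (nearest_centroid_index, point).
    assign = {}
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        assign.setdefault(idx, []).append(p)
    # Reduce: new centroid = mean of assigned points per index.
    return [tuple(sum(c) / len(pts) for c in zip(*pts))
            for idx, pts in sorted(assign.items())]

if __name__ == "__main__":
    pts = [(1, 1), (1.5, 2), (8, 8), (9, 9.5)]
    cents = [(0, 0), (10, 10)]
    for _ in range(3):                       # a few rounds to converge
        cents = kmeans_round(pts, cents)
    print(cents)                             # ~[(1.25, 1.5), (8.5, 8.75)]
```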

    

IEEE 2017: Detecting and Analyzing Urban Regions with High Impact of Weather Change on Transport
Abstract: In this work, we focus on two fundamental questions that are unprecedentedly important for urban planners seeking to understand the functional characteristics of various urban regions throughout a city: (i) how to identify a regional weather-traffic sensitivity index throughout a city, indicating the degree to which regional traffic is impacted by weather changes; and (ii) among complex regional features such as road structure and population density, how to dissect the most influential ones, those that drive urban region traffic to be more vulnerable to weather changes. These two questions are nontrivial to answer, because urban traffic changes dynamically over time and is affected by many other factors, which may dominate the overall impact. We make the first study of these questions by developing a weather-traffic index (WTI) system. The system includes two main components: weather-traffic index establishment and key factor analysis. Using the proposed system, we conducted a comprehensive empirical study in Shanghai, and the extracted weather-traffic indices were validated to be surprisingly consistent with real-world observations. Further regional key factor analysis yields interesting results; for example, house age has a significant impact on the weather-traffic index, which sheds light on future urban planning and reconstruction.

IEEE 2017: Big Data Analytics for User Activity Analysis and User Anomaly Detection in Online Reviews 


Abstract: Mobile wireless networks can leverage spatio-temporal information about users and network conditions to embed the system with end-to-end visibility and intelligence. Big data analytics has emerged as a promising approach for unearthing meaningful insights and building artificially intelligent models with the assistance of machine learning tools. The ubiquity of smartphones has led to the emergence of mobile crowdsourcing tasks, such as the detection of spatial events as smartphone users move around in their daily lives. However, the credibility of detected events can be negatively impacted by unreliable participants who contribute low-quality data; consequently, a major challenge in quality control is to discover true events from diverse and noisy participants' reports. This truth discovery problem is uniquely distinct from its online counterpart in that it involves uncertainty in both participants' mobility and their reliability. The proposed model is thus capable of efficiently handling various types of uncertainty and automatically discovering truth without any supervision or the need for location tracking.



IEEE 2017: Cost-Aware Big Data Processing across Geo-distributed Datacenters



Abstract: With the globalization of services, organizations continuously produce large volumes of data that need to be analysed across geo-dispersed locations. The traditional centralized approach of moving all data to a single cluster is inefficient or infeasible due to limitations such as the scarcity of wide-area bandwidth and the low-latency requirements of data processing, so processing big data across geo-distributed datacenters has gained popularity in recent years. However, managing distributed MapReduce computations across geo-distributed datacenters poses a number of technical challenges: how to allocate data among a selection of geo-distributed datacenters to reduce communication cost, how to determine a VM (virtual machine) provisioning strategy that offers high performance and low cost, and what criteria should be used to select a datacenter as the final reducer for big data analytics jobs. In this paper, these challenges are addressed by balancing bandwidth cost, storage cost, computing cost, migration cost, and latency cost between the two MapReduce phases across datacenters. We formulate this complex cost optimization problem covering data movement, resource provisioning, and reducer selection as a joint stochastic integer nonlinear optimization problem that minimizes the five cost factors simultaneously. The Lyapunov framework is integrated into our study, and an efficient online algorithm that minimizes the long-term time-averaged operation cost is designed. Theoretical analysis shows that our online algorithm provides a near-optimal solution with a provable gap and guarantees that data processing completes within pre-defined bounded delays. Experiments on the WorldCup98 website trace validate the theoretical analysis and demonstrate that our approach is close to offline-optimal performance and superior to representative alternatives.

IEEE 2017: A Secure and Verifiable Access Control Scheme for Big Data Storage in Clouds
Abstract: Due to the complexity and volume, outsourcing ciphertexts to a cloud is deemed to be one of the most effective approaches for big data storage and access. Nevertheless, verifying the access legitimacy of a user and securely updating a ciphertext in the cloud based on a new access policy designated by the data owner are two critical challenges to make cloud-based big data storage practical and effective. Traditional approaches either completely ignore the issue of access policy update or delegate the update to a third-party authority; but in practice, access policy update is important for enhancing security and dealing with the dynamism caused by user join and leave activities. In this paper, we propose a secure and verifiable access control scheme based on the NTRU cryptosystem for big data storage in clouds. We first propose a new NTRU decryption algorithm to overcome the decryption failures of the original NTRU, and then detail our scheme and analyze its correctness, security strengths, and computational efficiency. Our scheme allows the cloud server to efficiently update the ciphertext when a new access policy is specified by the data owner, who is also able to validate the update to counter cheating behaviors of the cloud. It also enables (i) the data owner and eligible users to effectively verify the legitimacy of a user for accessing the data, and (ii) a user to validate the information provided by other users for correct plaintext recovery. Rigorous analysis indicates that our scheme can prevent eligible users from cheating and resist various attacks such as the collusion attack.





IEEE 2016: FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce

Abstract: Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution, we design a parallel frequent itemset mining algorithm called FiDoop using the MapReduce programming model. To achieve compressed storage and avoid building conditional pattern bases, FiDoop incorporates the frequent-items ultrametric tree rather than conventional FP-trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial third MapReduce job, the mappers independently decompose itemsets, and the reducers perform combination operations by constructing small ultrametric trees and mining these trees separately. We implement FiDoop on our in-house Hadoop cluster and show that it is sensitive to data distribution and dimensionality, because itemsets of different lengths have different decomposition and construction costs. To improve FiDoop's performance, we develop a workload-balance metric to measure load balance across the cluster's computing nodes, and we develop FiDoop-HD, an extension of FiDoop, to speed up mining for high-dimensional data analysis. Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution is efficient and scalable.
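
FiDoop's ultrametric trees are beyond a short example, but the underlying decompose-and-count pattern can be sketched: mappers emit the 2-item subsets of each transaction and a reducer aggregates counts against a minimum support. The transactions and support threshold below are illustrative.

```python
# Highly simplified sketch of the decompose-and-count pattern that parallel
# itemset miners build on; FiDoop's ultrametric trees are not reproduced.
from collections import Counter
from itertools import combinations

def mine_pairs(transactions, min_support=2):
    counts = Counter()                 # reduce side: aggregate pair counts
    for txn in transactions:           # map side: emit each 2-item subset
        counts.update(combinations(sorted(txn), 2))
    return {pair: c for pair, c in counts.items() if c >= min_support}

if __name__ == "__main__":
    txns = [{"milk", "bread", "eggs"}, {"milk", "bread"}, {"bread", "eggs"}]
    print(mine_pairs(txns))  # {('bread', 'eggs'): 2, ('bread', 'milk'): 2}
```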

IEEE 2016: Sentiment Analysis of Top Colleges in India Using Twitter Data
Abstract: In today's world, the opinions and reviews accessible to us are among the most critical factors shaping our views and influencing the success of a brand, product, or service. With the advent and growth of social media, stakeholders often express their opinions on popular platforms, notably Twitter. While Twitter data is extremely informative, its huge and disorganized nature makes it challenging to analyze. This paper is a thorough effort to dive into the novel domain of sentiment analysis of people's opinions regarding top colleges in India. Besides taking additional preprocessing measures, such as expanding net lingo and removing duplicate tweets, a probabilistic model based on Bayes' theorem was used for spelling correction, a step overlooked in other research studies. The paper also compares the results obtained with two machine learning algorithms, Naïve Bayes and Support Vector Machine (SVM), and an artificial neural network model, the Multilayer Perceptron. Furthermore, a contrast is presented between four different SVM kernels: RBF, linear, polynomial, and sigmoid.
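
A tiny, hedged illustration of the classifier comparison (with invented placeholder tweets, not the paper's dataset), using scikit-learn's Naive Bayes and linear-kernel SVM on bag-of-words features:

```python
# Toy sentiment-classifier comparison on invented labeled tweets; real use
# would train/test on a large preprocessed Twitter corpus, not 6 samples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

tweets = ["great college amazing faculty", "worst campus ever",
          "placements are excellent", "terrible hostel food",
          "love the library", "awful administration"]
labels = [1, 0, 1, 0, 1, 0]                  # 1 = positive, 0 = negative

X = CountVectorizer().fit_transform(tweets)  # bag-of-words features
for model in (MultinomialNB(), SVC(kernel="linear")):
    model.fit(X, labels)
    print(type(model).__name__, model.predict(X))
```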



IEEE 2016 : Phoenix: A MapReduce Implementation with New Enhancements 
      IEEE 2016  Data Mining

Abstract: Lately, the large increase in data volume has produced complex, large datasets, giving rise to the "Big Data" concept, which has gained the attention of industrial organizations as well as academic communities. Big data APIs that need large memory can benefit from Phoenix, a MapReduce implementation for shared-memory machines, instead of large, distributed clusters of computers. This paper evaluates the design and prototype of Phoenix, its performance, and its limitations, and suggests new approaches to overcome some of these limitations and enhance Phoenix's performance on large-scale shared memory. The major contribution of this work is a set of new approaches that overcome the pairs limitation in the Phoenix framework using hash tables with B+Trees, and that address the collision problem of hash tables.



IEEE 2016 :  On Traffic-Aware Partition and Aggregation in MapReduce for Big Data Applications
     IEEE 2016  Data Mining


Abstract: The MapReduce programming model simplifies large-scale data processing on commodity clusters by exploiting parallel map and reduce tasks. Although many efforts have been made to improve the performance of MapReduce jobs, they ignore the network traffic generated in the shuffle phase, which plays a critical role in performance enhancement. Traditionally, a hash function is used to partition intermediate data among reduce tasks, which, however, is not traffic-efficient because network topology and the data size associated with each key are not taken into consideration. In this paper, we study how to reduce network traffic cost for a MapReduce job by designing a novel intermediate data partition scheme. Furthermore, we jointly consider the aggregator placement problem, where each aggregator can reduce merged traffic from multiple map tasks. A decomposition-based distributed algorithm is proposed to deal with the large-scale optimization problem for big data applications, and an online algorithm is also designed to adjust data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.
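
The intuition behind traffic-aware partitioning can be shown in a few lines: if we know how much intermediate data for each key sits on each node, sending the key's reduce work to the node that already holds most of that data keeps those bytes off the network. The sketch below does only this greedy step; the paper's scheme also handles aggregator placement and load balance, and the volumes are invented.

```python
# Back-of-envelope sketch of traffic-aware partitioning: route each key to
# the node already holding most of its intermediate data.
def traffic_aware_assign(key_volumes):
    """key_volumes: {key: {node: bytes_of_intermediate_data}}."""
    plan, saved = {}, 0
    for key, vols in key_volumes.items():
        best = max(vols, key=vols.get)        # node holding the most data
        plan[key] = best
        saved += vols[best]                   # bytes that never cross the net
    return plan, saved

if __name__ == "__main__":
    volumes = {"k1": {"n1": 700, "n2": 100, "n3": 50},
               "k2": {"n1": 20, "n2": 500, "n3": 480},
               "k3": {"n1": 300, "n2": 10, "n3": 310}}
    plan, saved = traffic_aware_assign(volumes)
    print(plan, f"local bytes: {saved}")      # vs. hash partitioning
```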



IEEE 2016 : The SP Theory of Intelligence: Distinctive Features and Advantages
 IEEE 2016  Data Mining

Abstract: This paper aims to highlight distinctive features of the SP theory of intelligence, realized in the SP computer model, and its apparent advantages compared with some AI-related alternatives. Perhaps most importantly, the theory simplifies and integrates observations and concepts in AI-related areas, and has potential to simplify and integrate structures and processes in computing systems. Unlike most other AI-related theories, the SP theory is itself a theory of computing, which can be the basis for new computer architectures. Fundamental to the theory is information compression via the matching and unification of patterns and, more specifically, via a concept of multiple alignment. The theory promotes transparency in the representation and processing of knowledge, and unsupervised learning of natural structures via information compression. It provides an interpretation of aspects of mathematics and of phenomena in human perception and cognition. Abstract concepts in the theory may be realized in terms of neurons and their interconnections (SP-neural). These features and advantages of the SP system are discussed in relation to AI-related alternatives: the concept of minimum-length encoding and related concepts; how computational and energy efficiency in computing may be achieved; deep learning in neural networks; unified theories of cognition and related research; universal search; Bayesian networks and some other models for AI; IBM's Watson; solving problems associated with big data and the development of intelligence in autonomous robots; pattern recognition and vision; the learning and processing of natural language; exact and inexact forms of reasoning; the representation and processing of diverse forms of knowledge; and software engineering. In conclusion, the SP system can provide a firm foundation for the long-term development of AI and related areas and, at the same time, may deliver useful results on relatively short timescales.



IEEE 2016 : A Parallel Patient Treatment Time Prediction Algorithm and Its Applications in Hospital Queuing-Recommendation in a Big Data Environment
IEEE 2016  Data Mining

Abstract: Effective patient queue management to minimize patient wait times and overcrowding is one of the major challenges faced by hospitals. Unnecessary, annoying waits for long periods result in substantial wastage of human resources and time and increase the frustration endured by patients. For each patient in the queue, the total treatment time of all the patients before him is the time that he must wait. It would be convenient and preferable if patients could receive the most efficient treatment plan and know the predicted waiting time through a mobile application that updates in real time. Therefore, we propose a Patient Treatment Time Prediction (PTTP) algorithm to predict the waiting time for each treatment task for a patient. We use realistic patient data from various hospitals to obtain a treatment time model for each task, and based on this large-scale, realistic dataset, the treatment time for each patient in the current queue of each task is predicted. On top of the predicted waiting times, a Hospital Queuing-Recommendation (HQR) system is developed that calculates and recommends an efficient and convenient treatment plan for the patient. Because of the large-scale, realistic dataset and the requirement for real-time response, the PTTP algorithm and HQR system mandate efficiency and low-latency response. We use an Apache Spark-based cloud implementation at the National Supercomputing Center in Changsha to achieve these goals. Extensive experimentation and simulation results demonstrate the effectiveness and applicability of our proposed model in recommending effective treatment plans that minimize patient wait times in hospitals.
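
At its core, the queue recommendation rests on a simple identity: a patient's predicted wait is the sum of the predicted treatment times of everyone ahead in that task's queue. The sketch below uses a per-task historical mean as a stand-in for the paper's learned model; the durations are invented.

```python
# Minimal sketch of the queue-wait idea behind PTTP/HQR: wait = sum of
# predicted treatment times of patients ahead. A per-task average stands in
# for the paper's learned model; the historical durations are invented.
from statistics import mean

history = {"blood_test": [6, 8, 7, 9],       # past durations in minutes
           "x_ray": [12, 15, 11]}

def predicted_time(task):
    return mean(history[task])               # placeholder for the ML model

def predicted_wait(queue_ahead):
    """queue_ahead: list of task names for patients ahead of you."""
    return sum(predicted_time(t) for t in queue_ahead)

if __name__ == "__main__":
    print(round(predicted_wait(["blood_test", "blood_test", "x_ray"]), 1),
          "minutes")
```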


IEEE 2016 : Protection of Big Data Privacy
IEEE 2016 Big Data Applications

Abstract: In recent years, big data have become a hot research topic. The increasing amount of big data also increases the chance of breaching the privacy of individuals. Since big data require high computational power and large storage, distributed systems are used. As multiple parties are involved in these systems, the risk of privacy violation is increased. There have been a number of privacy-preserving mechanisms developed for privacy protection at different stages (e.g., data generation, data storage, and data processing) of a big data life cycle. The goal of this paper is to provide a comprehensive overview of the privacy preservation mechanisms in big data and present the challenges for existing mechanisms. In particular, in this paper, we illustrate the infrastructure of big data and the state-of-the-art privacy-preserving mechanisms in each stage of the big data life cycle. Furthermore, we discuss the challenges and future research directions related to privacy preservation in big data.



IEEE 2016 :  Towards a Virtual Domain Based Authentication on MapReduce
IEEE 2016  Data Mining


Abstract: This paper proposes a novel authentication solution for the MapReduce (MR) model, a distributed and parallel computing paradigm commonly deployed to process big data by major IT players such as Facebook and Yahoo. It identifies a set of security, performance, and scalability requirements specified from a comprehensive study of a job execution process using MR and of security threats and attacks in this environment. Based on these requirements, it critically analyzes the state-of-the-art authentication solutions and finds that the authentication services currently proposed for the MR model are not adequate. The paper then presents a novel layered authentication solution for the MR model and describes its core components, including the virtual domain based authentication framework (VDAF). These ideas are significant because, first, the approach embeds the characteristics of MR-in-cloud deployments into the security solution design, which allows the MR model to be delivered as software as a service in a public cloud environment along with our proposed authentication solution; second, VDAF supports the authentication of every interaction by any MR component involved in a job execution flow, as long as the interactions are for accessing resources of the job; third, this continuous authentication service is provided in such a manner that the costs incurred in providing it are as low as possible.


IEEE 2016 :   CMiner: Opinion Extraction and Summarization for Chinese Microblogs
IEEE 2016  Data Mining

Abstract: Sentiment analysis of microblog texts has drawn much attention in both academic and industrial fields. However, most current work focuses only on polarity classification. In this paper, we present an opinion mining system for Chinese microblogs called CMiner. Instead of polarity classification, CMiner focuses on more complicated opinion mining tasks: opinion target extraction and opinion summarization. Novel algorithms are developed for the two tasks and integrated into an end-to-end system. CMiner helps to effectively understand users' opinions towards different opinion targets in a microblog topic. Specifically, we develop an unsupervised label propagation algorithm for opinion target extraction: the opinion targets of all messages in a topic are collectively extracted based on the assumption that similar messages focus on similar opinion targets. In addition, we build an aspect-based opinion summarization framework for microblog topics. After obtaining the opinion targets of all microblog messages in a topic, we cluster the targets into several groups and extract representative targets and summaries for each group; a co-ranking algorithm is proposed to rank the opinion targets and microblog sentences simultaneously. Experimental results on a benchmark dataset show the effectiveness of our system and algorithms.
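
A schematic sketch of unsupervised label propagation for opinion target extraction, under the stated assumption that similar messages share targets: a few seed messages carry known targets, and labels spread across word-overlap similarity until they stabilize. The similarity measure and seeds are simplifications.

```python
# Schematic label-propagation sketch: seed messages carry known opinion
# targets, and labels spread to similar messages over a word-overlap graph.
def propagate(messages, seeds, rounds=5):
    labels = dict(seeds)                      # message_id -> target label
    for _ in range(rounds):
        for mid, words in messages.items():
            if mid in seeds:
                continue                      # seed labels stay fixed
            votes = {}
            for other, label in labels.items():
                if other != mid:
                    overlap = len(words & messages[other])
                    if overlap:
                        votes[label] = votes.get(label, 0) + overlap
            if votes:
                labels[mid] = max(votes, key=votes.get)
    return labels

if __name__ == "__main__":
    msgs = {1: {"battery", "drains", "fast"}, 2: {"battery", "life", "poor"},
            3: {"screen", "cracked"}, 4: {"screen", "too", "dim"}}
    print(propagate(msgs, seeds={1: "battery", 3: "screen"}))
```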



IEEE 2015: A Hierarchical Distributed Processing Framework for Big Image Data
Abstract: This paper introduces an effective processing framework named ICP (Image Cloud Processing) to cope with the data explosion in the image processing field. While most previous research focuses on optimizing image processing algorithms for higher efficiency, our work provides a general framework within which such algorithms can be implemented in parallel, achieving a boost in time efficiency without compromising result quality as the image scale increases. The proposed ICP framework consists of two mechanisms, SICP (Static ICP) and DICP (Dynamic ICP). Specifically, SICP is aimed at processing big image data pre-stored in the distributed system, while DICP is proposed for dynamic input. To accomplish SICP, two novel data representations named P-Image and Big-Image are designed to cooperate with MapReduce for more optimized configuration and higher efficiency; DICP is implemented through a parallel processing procedure working with the traditional processing mechanism of the distributed system. Representative results of comprehensive experiments on the challenging ImageNet dataset validate the capacity of the proposed ICP framework over traditional state-of-the-art methods, in both time efficiency and quality of results.






IEEE 2015 : An Aggregatable Name-Based Routing for Energy-Efficient Data Sharing in Big Data Era
IEEE 2015  Transaction on Big Data

Abstract: The MapReduce programming model simplifies large-scale data processing on commodity clusters by exploiting parallel map and reduce tasks. Although many efforts have been made to improve the performance of MapReduce jobs, they ignore the network traffic generated in the shuffle phase, which plays a critical role in performance enhancement. Traditionally, a hash function is used to partition intermediate data among reduce tasks, which, however, is not traffic-efficient because network topology and the data size associated with each key are not taken into consideration. In this paper, we study how to reduce network traffic cost for a MapReduce job by designing a novel intermediate data partition scheme. Furthermore, we jointly consider the aggregator placement problem, where each aggregator can reduce merged traffic from multiple map tasks. A decomposition-based distributed algorithm is proposed to deal with the large-scale optimization problem for big data applications, and an online algorithm is also designed to adjust data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.


IEEE 2015 : Secure Sensitive Data Sharing on a Big Data Platform
IEEE 2015  Transaction on Big Data

Abstract: Users store vast amounts of sensitive data on a big data platform. Sharing sensitive data will help enterprises reduce the cost of providing users with personalized services and provide value-added data services. However, secure data sharing is problematic. This paper proposes a framework for secure sensitive data sharing on a big data platform, including secure data delivery, storage, usage, and destruction on a semi-trusted big data sharing platform. We present a proxy re-encryption algorithm based on heterogeneous ciphertext transformation and a user process protection method based on a virtual machine monitor, which provides support for the realization of system functions. The framework protects the security of users' sensitive data effectively and shares these data safely. At the same time, data owners retain complete control of their own data in a sound environment for modern Internet information security.


IEEE 2015 : Research Directions for Engineering Big Data Analytics Software 
IEEE 2015  Transaction on Big Data

Abstract: Many software startups and research and development efforts are actively taking place to harness the power of big data and create software with potential to improve almost every aspect of human life. As these efforts continue to increase, full consideration needs to be given to engineering aspects of big data software. Since these systems exist to make predictions on complex and continuous massive datasets, they pose unique problems during specification, design, and verification of software that needs to be delivered on-time and within budget. But, given the nature of big data software, can this be done? Does big data software engineering really work? This article explores details of big data software, discusses the main problems encountered when engineering big data software, and proposes avenues for future research.


IEEE 2015 : Massive MIMO as a Big Data System: Random Matrix Models and Testbed
IEEE 2015  Transaction on Big Data

Abstract: The paper has two parts. The first deals with how to use large random matrices as building blocks to model the massive data arising from a massive (large-scale) MIMO system; we apply this model to distributed spectrum sensing and network monitoring. This part boils down to streaming, distributed massive data, for which a new algorithm is obtained and its performance derived using a central limit theorem recently obtained in the literature. The second part deals with a large-scale testbed using software-defined radios (particularly USRP), which took us more than four years to develop into a 70-node network testbed. To demonstrate the power of software-defined radio, we quickly reconfigure our testbed into a testbed for massive MIMO. The massive data from this testbed are of central interest in this paper, and this is the first time we have modeled the experimental data arising from it. To the best of our knowledge, we are not aware of other similar work.


  
IEEE 2015 : Big Data and the SP Theory of Intelligence
IEEE 2015  Transaction on Big Data

Abstract: This paper is about how the SP theory of intelligence and its realization in the SP machine may, with advantage, be applied to the management and analysis of big data. The SP system, introduced in this paper and fully described elsewhere, may help to overcome the problem of variety in big data; it has potential as a universal framework for the representation and processing of diverse kinds of knowledge, helping to reduce the diversity of formalisms and formats for knowledge, and the different ways in which they are processed. It has strengths in the unsupervised learning or discovery of structure in data, in pattern recognition, in the parsing and production of natural language, in several kinds of reasoning, and more. It lends itself to the analysis of streaming data, helping to overcome the problem of velocity in big data. Central in the workings of the system is lossless compression of information: making big data smaller and reducing problems of storage and management. There is potential for substantial economies in the transmission of data, for big cuts in the use of energy in computing, for faster processing, and for smaller and lighter computers. The system provides a handle on the problem of veracity in big data, with potential to assist in the management of errors and uncertainties in data. It lends itself to the visualization of knowledge structures and inferential processes. A high-parallel, open-source version of the SP machine would provide a means for researchers everywhere to explore what can be done with the system and to create new versions of it.


IEEE 2015 : Cost Minimization for Big Data Processing in Geo-Distributed Data Centers
IEEE 2015  Transaction on Big Data

Abstract: The explosive growth of demand for big data processing imposes a heavy burden on computation, storage, and communication in data centers, which incurs considerable operational expenditure for data center providers. Cost minimization has therefore become an urgent issue for the upcoming big data era. Unlike conventional cloud services, one of the main features of big data services is the tight coupling between data and computation, as computation tasks can be conducted only when the corresponding data are available. As a result, three factors, i.e., task assignment, data placement, and data movement, deeply influence the operational expenditure of data centers. In this paper, we study the cost minimization problem via a joint optimization of these three factors for big data services in geo-distributed data centers. To describe the task completion time with consideration of both data transmission and computation, we propose a 2-D Markov chain and derive the average task completion time in closed form. Furthermore, we model the problem as a mixed-integer nonlinear program and propose an efficient solution to linearize it. The high efficiency of our proposal is validated by extensive simulation-based studies.


