IEEE 2018 : MR-Mafia: Parallel Subspace Clustering Algorithm Based on MapReduce For Large Multi-dimensional Datasets
Abstract : The mission of subspace clustering is to find hidden clusters that exist in different subspaces within a dataset. In recent years, with the exponential growth of data size and data dimensionality, traditional subspace clustering algorithms have become inefficient as well as ineffective at extracting knowledge in the big data environment, resulting in an urgent need for efficient parallel distributed subspace clustering algorithms that can handle large multi-dimensional data at an acceptable computational cost. In this paper, we introduce MR-Mafia, a parallel MAFIA subspace clustering algorithm based on MapReduce. The algorithm takes advantage of MapReduce's data partitioning and task parallelism and achieves a good trade-off between disk-access cost and communication cost. The experimental results show near-linear speedups and demonstrate the high scalability and strong application prospects of the proposed algorithm.
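As an illustration of the map/reduce structure such an algorithm relies on, the sketch below simulates one round of MAFIA-style dense-unit counting in plain Python; the bin widths, density threshold, and the in-process simulation of map and reduce are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' implementation): one MapReduce round of
# MAFIA-style dense-unit counting, simulated with plain Python so the
# map/reduce structure is visible. Bin boundaries and the density threshold
# are illustrative assumptions.
from collections import defaultdict

def map_point(point, bins_per_dim=5, lo=0.0, hi=1.0):
    """Emit (candidate 1-D unit, 1) pairs for one data point."""
    width = (hi - lo) / bins_per_dim
    for dim, value in enumerate(point):
        unit = min(int((value - lo) / width), bins_per_dim - 1)
        yield (dim, unit), 1

def reduce_counts(pairs, min_count):
    """Sum counts per candidate unit and keep only the dense units."""
    counts = defaultdict(int)
    for key, one in pairs:
        counts[key] += one
    return {key: c for key, c in counts.items() if c >= min_count}

if __name__ == "__main__":
    data = [(0.1, 0.9), (0.15, 0.88), (0.2, 0.91), (0.8, 0.1)]
    pairs = [kv for p in data for kv in map_point(p)]
    dense_units = reduce_counts(pairs, min_count=3)
    print(dense_units)   # dense 1-D units feed the next (higher-dimensional) pass
```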
IEEE 2018 : Ciphertext-Policy Attribute-Based Signcryption With Verifiable Outsourced Designcryption for Sharing Personal Health Records
Abstract : A Personal Health Record (PHR) is a patient-centric model of health information exchange, which greatly facilitates the storage, access, and sharing of personal health information. In order to share these valuable resources and reduce operational cost, PHR service providers would like to store the PHR applications and health information data in the cloud. The private health information may then be exposed to unauthorized organizations or individuals, since patients lose physical control of their health information. Ciphertext-Policy Attribute-Based Signcryption (CP-ABSC) is a promising solution for designing a cloud-assisted PHR secure sharing system: it provides fine-grained access control, confidentiality, authenticity, and sender privacy of PHR data. However, the large number of pairing and modular exponentiation computations brings heavy computational overhead to the designcryption process. To reconcile the conflict between high computational overhead and low efficiency in designcryption, an outsourcing scheme is proposed in this paper. In our scheme, the heavy computations are outsourced to a Ciphertext Transformed Server (CTS), leaving only a small computational overhead for the PHR user, while the extra communication overhead remains tolerable. Furthermore, the desired security properties, including confidentiality, unforgeability, and verifiability, are proved formally in the random oracle model. Experimental evaluation indicates that the proposed scheme is practical and feasible.
IEEE 2018 : Client Side Secure Image Deduplication Using DICE Protocol
Abstract : With the advent of cloud computing, secured data deduplication has gained a lot of popularity. Many techniques have been proposed in the literature of this ongoing research area. Among these techniques, the Message Locked Encryption (MLE) scheme is often mentioned. Researchers have introduced MLE based protocols which provide secured deduplication of data, where the data is generally in text form. As a result, multimedia data such as images and video, which are larger in size compared to text files, have not been given much attention. Applying secured data deduplication to such data files could significantly reduce the cost and space required for their storage. In this paper we present a secure deduplication scheme for near identical (NI) images using the Dual Integrity Convergent Encryption (DICE) protocol, which is a variant of the MLE based scheme. In the proposed scheme, an image is decomposed into blocks and the DICE protocol is applied on each block separately rather than on the entire image. As a result, the blocks that are common between two or more NI images are stored only once at the cloud. We provide detailed analyses on the theoretical, experimental and security aspects of the proposed scheme.
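The block-level deduplication idea can be illustrated with a small sketch of convergent (content-derived) keys, the notion underlying MLE and DICE; the block size, the placeholder XOR cipher, and the in-memory store below are illustrative assumptions rather than the DICE protocol itself.

```python
# Minimal sketch of block-level deduplication with content-derived keys
# (the convergent-encryption idea underlying MLE/DICE). The store layout,
# block size, and key derivation below are illustrative assumptions, not
# the paper's exact protocol.
import hashlib

BLOCK_SIZE = 4096
store = {}          # ciphertext-hash -> encrypted block (shared across users)

def derive_key(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()          # convergent key K = H(block)

def encrypt(block: bytes, key: bytes) -> bytes:
    # Placeholder deterministic "encryption" (XOR keystream) so identical
    # blocks yield identical ciphertexts; a real scheme would use AES.
    stream = (key * (len(block) // len(key) + 1))[:len(block)]
    return bytes(b ^ s for b, s in zip(block, stream))

def upload(image_bytes: bytes):
    refs = []
    for i in range(0, len(image_bytes), BLOCK_SIZE):
        block = image_bytes[i:i + BLOCK_SIZE]
        ct = encrypt(block, derive_key(block))
        tag = hashlib.sha256(ct).hexdigest()
        store.setdefault(tag, ct)                  # blocks common to NI images stored once
        refs.append(tag)
    return refs                                    # per-image block reference list
```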
IEEE 2018 : Capacity-aware Key Partitioning Scheme for Heterogeneous Big Data Analytic Engines
Abstract : Big data and cloud computing have been at the centre of interest for the past decade. With the growth of data size and the variety of cloud applications, big data analytics has become very popular in both industry and academia, and research communities continue to pursue fast, robust, and fault-tolerant analytic engines. MapReduce has become one of the most popular big data analytic engines over the past few years, and Hadoop is a standard implementation of the MapReduce framework for running data-intensive applications on clusters of commodity servers. By thoroughly studying the framework, we find that the shuffle phase, the all-to-all input data fetching phase of the reduce task, significantly affects application performance. In Hadoop's MapReduce system, there is variance both in the frequencies of intermediate keys and in their distribution among data nodes throughout the cluster. This variance causes network overhead and leads to unfairness in the reduce input among data nodes in the cluster, so applications experience performance degradation during the shuffle phase. We develop a novel algorithm that, unlike previous systems, uses each node's capabilities as heuristics to find a better trade-off between locality and fairness in the system. Compared with Hadoop's default partitioning algorithm and the Leen partitioning algorithm: (a) for 2 million key-value pairs, our approach achieves better resource utilization by about 19% and 9%, respectively; (b) for 3 million key-value pairs, our approach achieves near-optimal resource utilization, improving by about 15% and 7%, respectively.
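A minimal sketch of a capacity-aware partitioner, assuming a simple greedy heuristic (not necessarily the paper's exact algorithm): each intermediate key is assigned to the node whose capacity-scaled load would remain smallest.

```python
# Minimal sketch (assumed, not the paper's algorithm verbatim): a greedy
# capacity-aware key partitioner. Each key's total frequency is assigned to
# the node whose resulting load, scaled by the node's capacity weight, is smallest.
def partition_keys(key_freqs, node_capacities):
    """key_freqs: {key: total count}; node_capacities: {node: relative capacity}."""
    load = {node: 0.0 for node in node_capacities}
    assignment = {}
    # Place heavy keys first so they land on the most capable nodes.
    for key, freq in sorted(key_freqs.items(), key=lambda kv: -kv[1]):
        node = min(load, key=lambda n: (load[n] + freq) / node_capacities[n])
        assignment[key] = node
        load[node] += freq
    return assignment

if __name__ == "__main__":
    freqs = {"k1": 900, "k2": 500, "k3": 400, "k4": 100}
    caps = {"fast-node": 2.0, "slow-node": 1.0}
    print(partition_keys(freqs, caps))   # heavy keys gravitate to the fast node
```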
IEEE 2018 : Secure Identity-based Data Sharing and Profile Matching for Mobile Healthcare Social Networks in Cloud Computing
Abstract : Cloud computing and social networks are changing the way of healthcare by providing real-time data sharing in a cost-effective manner. However, data security is one of the main obstacles to the wide application of mobile healthcare social networks (MHSN), since health information is considered to be highly sensitive. In this paper, we introduce a secure data sharing and profile matching scheme for MHSN in cloud computing. Patients can outsource their encrypted health records to cloud storage with the identity-based broadcast encryption (IBBE) technique and share them with a group of doctors in a secure and efficient manner. We then present an attribute-based conditional data re-encryption construction, which permits doctors who satisfy the pre-defined conditions in the ciphertext to authorize the cloud platform to convert a ciphertext into a new ciphertext of an identity-based encryption scheme for a specialist without leaking any sensitive information. Further, we provide a profile matching mechanism in MHSN based on identity-based encryption with equality test, which helps patients find friends in a privacy-preserving way and achieves flexible authorization on encrypted health records while resisting the keyword guessing attack. Moreover, this mechanism reduces the computation cost on the patient side. The security analysis and experimental evaluation show that our scheme is practical for protecting data security and privacy in MHSN.
IEEE 2017: Efficient Processing of Skyline Queries Using MapReduce
Abstract: The skyline operator has attracted considerable attention recently due to its broad applications. However, computing a skyline is challenging today since we have to deal with big data. For data-intensive applications, the MapReduce framework has been widely used recently. In this paper, we propose the efficient parallel algorithm SKY-MR+ for processing skyline queries using MapReduce. We first build a quadtree-based histogram for space partitioning by deciding whether to split each leaf node judiciously based on the benefit of splitting in terms of the estimated execution time. In addition, we apply the dominance power filtering method to effectively prune non-skyline points in advance. We next partition data based on the regions divided by the quadtree and compute candidate skyline points for each partition using MapReduce. Finally, we check whether each skyline candidate point is actually a skyline point in every partition using MapReduce.
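The dominance test at the core of skyline computation, and the local-skyline step each partition would run before the global check, can be sketched as follows; this is an illustrative simplification, not SKY-MR+ itself.

```python
# Minimal sketch of the dominance test used in skyline computation: a point
# dominates another if it is no worse in every dimension and strictly better
# in at least one (here "better" means smaller). The local-skyline routine
# below is what each partition would compute before the global merge step.
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def local_skyline(points):
    skyline = []
    for p in points:
        if any(dominates(s, p) for s in skyline):
            continue                                   # p is dominated, drop it
        skyline = [s for s in skyline if not dominates(p, s)] + [p]
    return skyline

if __name__ == "__main__":
    pts = [(1, 9), (2, 8), (3, 3), (4, 4), (9, 1), (5, 5)]
    print(local_skyline(pts))   # candidates each partition ships to the final check
```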
IEEE 2017: Attribute-Based Storage
Supporting Secure Deduplication of Encrypted Data in Cloud
Abstract: Attribute-based
encryption (ABE) has been widely used in cloud computing where a data provider
outsources his/her encrypted data to a cloud service provider, and can share
the data with users possessing specific credentials (or attributes). However, the
standard ABE system does not support secure deduplication, which is crucial for
eliminating duplicate copies of identical data in order to save storage space
and network bandwidth. In this paper, we present an attribute-based storage
system with secure deduplication in a hybrid cloud setting, where a private
cloud is responsible for duplicate detection and a public cloud manages the storage.
Compared with the prior data deduplication systems, our system has two
advantages. Firstly, it can be used to confidentially share data with users by
specifying access policies rather than sharing decryption keys. Secondly, it
achieves the standard notion of semantic security for data confidentiality
while existing systems only achieve it by defining a weaker security notion. In
addition, we put forth a methodology to modify a ciphertext over one access
policy into ciphertexts of the same plaintext but under other access policies without
revealing the underlying plaintext.
IEEE 2017: FiDoop-DP: Data Partitioning in Frequent Itemset Mining on Hadoop Clusters
Abstract: Traditional parallel algorithms for mining frequent itemsets aim to balance load by equally partitioning data among a group of computing nodes. We start this study by discovering a serious performance problem of the existing parallel Frequent Itemset Mining algorithms. Given a large dataset, data partitioning strategies in the existing solutions suffer high communication and mining overhead induced by redundant transactions transmitted among computing nodes. We address this problem by developing a data partitioning approach called FiDoop-DP using the MapReduce programming model. The overarching goal of FiDoop-DP is to boost the performance of parallel Frequent Itemset Mining on Hadoop clusters. At the heart of FiDoop-DP is the Voronoi diagram-based data partitioning technique, which exploits correlations among transactions. Incorporating the similarity metric and the Locality-Sensitive Hashing technique, FiDoop-DP places highly similar transactions into a data partition to improve locality without creating an excessive number of redundant transactions.
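The intuition of similarity-aware partitioning can be sketched with MinHash, a Locality-Sensitive Hashing scheme for Jaccard similarity; the hash seeds, signature length, and the lack of banding below are illustrative simplifications of FiDoop-DP's technique.

```python
# Minimal sketch of similarity-aware partitioning with MinHash signatures
# (an LSH scheme for Jaccard similarity). Seeds, signature length, and the
# partition-by-signature rule are illustrative assumptions.
import hashlib

NUM_HASHES = 8

def minhash_signature(transaction):
    """transaction: a non-empty iterable of item identifiers (strings)."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in transaction))
    return tuple(sig)

def partition_of(transaction, num_partitions):
    # Transactions with identical signatures collide; with a short signature,
    # highly similar transactions often share all minima and hence a partition.
    return hash(minhash_signature(transaction)) % num_partitions

if __name__ == "__main__":
    t1 = {"milk", "bread", "eggs"}
    t2 = {"milk", "bread", "eggs", "butter"}
    print(partition_of(t1, 16), partition_of(t2, 16))
```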
IEEE 2017: Privacy-Preserving Data
Encryption Strategy for Big Data in Mobile Cloud Computing
Abstract: Privacy has become a considerable issue as big data applications grow dramatically in cloud computing. These emerging technologies have improved or changed service models and application performance in various respects. However, the remarkable growth in data volume has also created many practical challenges. The execution time of data encryption is one of the serious issues during data processing and transmission, and many current applications abandon encryption in order to reach an acceptable performance level, at the cost of privacy. In this paper, we concentrate on privacy and propose a novel data encryption approach called the Dynamic Data Encryption Strategy (D2ES). Our approach selectively encrypts data using privacy classification methods under timing constraints, and is designed to maximize the scope of privacy protection within the required execution time. The performance of D2ES has been evaluated in our experiments, which demonstrate the privacy enhancement.
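A minimal sketch of selective encryption under a time budget, assuming a simple greedy ranking by privacy weight (not necessarily D2ES's exact strategy):

```python
# Minimal sketch of selective encryption under a time budget (an assumed
# simplification of the D2ES idea, not the paper's exact strategy): rank data
# packages by privacy weight and encrypt greedily until the estimated
# encryption time would exceed the deadline; the rest is sent unencrypted.
def select_for_encryption(packages, time_budget):
    """packages: list of (name, privacy_weight, est_encrypt_seconds)."""
    chosen, spent = [], 0.0
    for name, weight, cost in sorted(packages, key=lambda p: -p[1]):
        if spent + cost <= time_budget:
            chosen.append(name)
            spent += cost
    return chosen

if __name__ == "__main__":
    pkgs = [("medical", 0.9, 2.0), ("billing", 0.7, 1.5),
            ("logs", 0.2, 0.5), ("telemetry", 0.1, 3.0)]
    print(select_for_encryption(pkgs, time_budget=3.0))  # ['medical', 'logs']
```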
IEEE 2017: SocialQ&A: An Online Social
Network Based Question and Answer System
Abstract:
Question and Answer (Q&A) systems play a vital role in our daily life for information and knowledge sharing. Users post questions and pick questions to answer in the system. Due to the rapidly growing user population and the number of questions, it is unlikely that a user will stumble upon a question (s)he can answer by chance. Also, altruism does not encourage all users to provide answers, not to mention high-quality answers with a short answer wait time. The primary objective of this paper is to improve the performance of Q&A systems by actively forwarding questions to users who are capable of and willing to answer them. To this end, we have designed and implemented SocialQ&A, an online social network based Q&A system. SocialQ&A leverages the social-network properties of common interest and mutual trust between friends to identify, through friendship, the users most likely to answer a question, and to enhance user security. We also improve SocialQ&A with security and efficiency enhancements by protecting user privacy and identities, and by retrieving answers automatically for recurrent questions. We describe the architecture and algorithms, and conduct comprehensive large-scale simulations to evaluate SocialQ&A in comparison with other methods. Our results suggest that social networks can be leveraged to improve answer quality and reduce the asker's waiting time. We also implement a real prototype of SocialQ&A, and analyze the Q&A behavior of real users and questions from a small-scale real-world SocialQ&A system.
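A minimal sketch of how question forwarding might score candidate answerers by combining interest similarity with mutual trust; the scoring formula and the friend attributes are assumptions for illustration, not the paper's exact model.

```python
# Minimal sketch (assumed scoring, not the paper's exact formula): rank an
# asker's friends for question forwarding by combining interest similarity
# with mutual trust.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_answerers(question_topics, friends):
    """friends: list of (name, interest_topics, trust in [0, 1])."""
    scored = [(jaccard(question_topics, topics) * trust, name)
              for name, topics, trust in friends]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

if __name__ == "__main__":
    friends = [("alice", {"ml", "python"}, 0.9),
               ("bob", {"cooking"}, 0.8),
               ("carol", {"ml", "statistics"}, 0.5)]
    print(rank_answerers({"ml", "statistics"}, friends))  # carol, then alice
```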
IEEE 2017: Practical Privacy-Preserving MapReduce Based K-means Clustering over Large-scale Dataset
Abstract: Clustering techniques have been widely adopted in many real-world data analysis applications, such as customer behavior analysis, medical data analysis, and digital forensics. With the explosion of data in today's big data era, a major trend for handling clustering over large-scale datasets is outsourcing it to HDFS platforms. This is because cloud computing offers not only reliable services with performance guarantees, but also savings on in-house IT infrastructure. However, as datasets used for clustering may contain sensitive information, e.g., patient health information, commercial data, and behavioral data, directly outsourcing them to distributed servers inevitably raises privacy concerns. In this paper, we propose a practical privacy-preserving K-means clustering scheme that can be efficiently outsourced to HDFS servers.
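For reference, one MapReduce-style k-means iteration looks like the sketch below; it omits the privacy-preserving transformations that are the paper's actual contribution.

```python
# Minimal sketch of one MapReduce-style k-means iteration (without the
# privacy-preserving transformations the paper adds): the map step assigns
# each point to its nearest centroid, the reduce step averages the points
# assigned to each centroid.
from collections import defaultdict

def nearest(point, centroids):
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def kmeans_iteration(points, centroids):
    groups = defaultdict(list)
    for p in points:                       # "map": emit (centroid_id, point)
        groups[nearest(p, centroids)].append(p)
    new_centroids = list(centroids)
    for cid, members in groups.items():    # "reduce": average per centroid
        new_centroids[cid] = tuple(sum(vals) / len(members) for vals in zip(*members))
    return new_centroids

if __name__ == "__main__":
    pts = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.5, 8.2)]
    print(kmeans_iteration(pts, [(0.0, 0.0), (10.0, 10.0)]))
```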
IEEE 2017: Detecting and Analyzing
Urban Regions with High Impact of Weather Change on Transport
Abstract: In this work, we focus on two fundamental questions that are unprecedentedly important to urban planners for understanding the functional characteristics of various urban regions throughout a city, namely: (i) how to identify a regional weather-traffic sensitivity index throughout a city, which indicates the degree to which traffic in each region is impacted by weather changes; and (ii) among complex regional features, such as road structure and population density, how to dissect the most influential regional features that drive urban region traffic to be more vulnerable to weather changes. However, these two questions are nontrivial to answer, because urban traffic changes dynamically over time and is essentially affected by many other factors, which may dominate the overall impact. We make the first study of these questions by developing a weather-traffic index (WTI) system. The system includes two main components: weather-traffic index establishment and key factor analysis. Using the proposed system, we conducted a comprehensive empirical study in Shanghai, and the weather-traffic indices extracted have been validated to be surprisingly consistent with real-world observations. Further regional key factor analysis yields interesting results. For example, house age has a significant impact on the weather-traffic index, which sheds light on future urban planning and reconstruction.
IEEE 2017: Big Data Analytics for User Activity Analysis and User Anomaly Detection in Online Reviews
Abstract: Mobile wireless networks can leverage spatio-temporal information about users and network conditions to embed the system with end-to-end visibility and intelligence. Big data analytics has emerged as a promising approach to unearth meaningful insights and to build artificially intelligent models with the assistance of machine learning tools. The ubiquity of smartphones has led to the emergence of mobile crowdsourcing tasks such as the detection of spatial events as smartphone users move around in their daily lives. However, the credibility of the detected events can be negatively impacted by unreliable participants with low-quality data. Consequently, a major challenge in quality control is to discover true events from diverse and noisy participants' reports. This truth discovery problem is uniquely distinct from its online counterpart in that it involves uncertainties in both participants' mobility and reliability. The proposed model is thus capable of efficiently handling various types of uncertainties and automatically discovering the truth without any supervision or the need for location tracking.
IEEE 2017: Cost-Aware Big Data Processing across Geo-distributed Datacenters
Abstract : With the globalization of services, organizations continuously produce large volumes of data that need to be analysed over geo-dispersed locations. The traditional centralized approach of moving all data to a single cluster is inefficient or infeasible due to limitations such as the scarcity of wide-area bandwidth and the low-latency requirement of data processing. Processing big data across geo-distributed datacenters has therefore continued to gain popularity in recent years. However, managing distributed MapReduce computations across geo-distributed datacenters poses a number of technical challenges: how to allocate data among a selection of geo-distributed datacenters to reduce the communication cost, how to determine the VM (Virtual Machine) provisioning strategy that offers high performance and low cost, and what criteria should be used to select a datacenter as the final reducer for big data analytics jobs. In this paper, these challenges are addressed by balancing bandwidth cost, storage cost, computing cost, migration cost, and latency cost between the two MapReduce phases across datacenters. We formulate this complex cost optimization problem for data movement, resource provisioning, and reducer selection as a joint stochastic integer nonlinear optimization problem that minimizes the five cost factors simultaneously. The Lyapunov framework is integrated into our study, and an efficient online algorithm that minimizes the long-term time-averaged operation cost is further designed. Theoretical analysis shows that our online algorithm can provide a near-optimal solution with a provable gap and can guarantee that the data processing is completed within pre-defined bounded delays. Experiments on the WorldCup98 website trace validate the theoretical analysis and demonstrate that our approach is close to the offline optimum and superior to some representative approaches.
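A toy sketch of the reducer-selection sub-problem, assuming a simplified cost model (per-GB bandwidth cost plus per-datacenter computing and latency costs) rather than the paper's joint stochastic optimization:

```python
# Minimal sketch (an assumed simplification of the paper's joint optimization):
# pick the reducer datacenter that minimizes the sum of migration (bandwidth),
# computing, and latency costs for a given placement of intermediate data.
def choose_reducer(datacenters, data_at, bw_cost, compute_cost, latency_cost):
    """data_at: {dc: GB of intermediate data}; costs are per-GB / per-DC dicts."""
    def total_cost(dst):
        move = sum(gb * bw_cost[src][dst] for src, gb in data_at.items() if src != dst)
        return move + compute_cost[dst] + latency_cost[dst]
    return min(datacenters, key=total_cost)

if __name__ == "__main__":
    dcs = ["us", "eu"]
    data = {"us": 100, "eu": 20}
    bw = {"us": {"us": 0, "eu": 0.05}, "eu": {"us": 0.05, "eu": 0}}
    print(choose_reducer(dcs, data, bw, {"us": 3, "eu": 2}, {"us": 1, "eu": 1}))  # "us"
```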
IEEE 2017: A Secure and Verifiable
Access Control Scheme for Big Data Storage in Clouds
Abstract: Due to the complexity
and volume, outsourcing ciphertexts to a cloud is deemed to be one of the most
effective approaches for big data storage and access. Nevertheless, verifying
the access legitimacy of a user and securely updating a ciphertext in the cloud
based on a new access policy designated by the data owner are two critical
challenges to make cloud-based big data storage practical and effective.
Traditional approaches either completely ignore the issue of access policy
update or delegate the update to a third party authority; but in practice,
access policy update is important for enhancing security and dealing with the
dynamism caused by user join and leave activities. In this paper, we propose a
secure and verifiable access control scheme based on the NTRU cryptosystem for
big data storage in clouds. We first propose a new NTRU decryption algorithm to
overcome the decryption failures of the original NTRU, and then detail our
scheme and analyze its correctness, security strengths, and computational
efficiency. Our scheme allows the cloud server to efficiently update the
ciphertext when a new access policy is specified by the data owner, who is also
able to validate the update to counter against cheating behaviors of the cloud.
It also enables (i) the data owner and eligible users to effectively verify the
legitimacy of a user for accessing the data, and (ii) a user to validate the
information provided by other users for correct plaintext recovery. Rigorous
analysis indicates that our scheme can prevent eligible users from cheating and
resist various attacks such as the collusion attack.
IEEE 2016: FiDoop: Parallel Mining of
Frequent Itemsets Using MapReduce
Abstract: Existing parallel mining algorithms for
frequent itemsets lack a mechanism that enables automatic parallelization, load
balancing, data distribution, and fault tolerance on large clusters. As a
solution to this problem, we design a parallel frequent itemsets mining
algorithm called FiDoop using the MapReduce programming model. To achieve
compressed storage and avoid building conditional pattern bases, FiDoop incorporates
the frequent items ultrametric tree, rather than conventional FP trees. In
FiDoop, three MapReduce jobs are implemented to complete the mining task. In the
crucial third MapReduce job, the mappers independently decompose itemsets, and the reducers perform combination operations by constructing small ultrametric trees and mining these trees separately. We implement FiDoop on
our in-house Hadoop cluster. We show that FiDoop on the cluster is sensitive to
data distribution and dimensions, because itemsets with different lengths have
different decomposition and construction costs. To improve FiDoop’s
performance, we develop a workload balance metric to measure load balance
across the cluster’s computing nodes. We develop FiDoop-HD, an extension of
FiDoop, to speed up the mining performance for high-dimensional data analysis.
Extensive experiments using real-world celestial spectral data demonstrate that
our proposed solution is efficient and scalable.
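A generic sketch of candidate-itemset counting in MapReduce style; it illustrates the map/reduce split only and does not reproduce FiDoop's frequent-items ultrametric trees.

```python
# Minimal sketch of candidate-itemset counting in MapReduce style (a generic
# illustration, not FiDoop's ultrametric-tree construction): the map step
# decomposes each transaction into candidate 2-itemsets, the reduce step sums
# their supports and keeps the frequent ones.
from collections import Counter
from itertools import combinations

def map_transaction(transaction):
    for pair in combinations(sorted(transaction), 2):
        yield pair, 1

def reduce_supports(pairs, min_support):
    counts = Counter()
    for itemset, one in pairs:
        counts[itemset] += one
    return {itemset: c for itemset, c in counts.items() if c >= min_support}

if __name__ == "__main__":
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    emitted = [kv for t in db for kv in map_transaction(t)]
    print(reduce_supports(emitted, min_support=3))   # all three pairs are frequent
```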
IEEE 2016: Sentiment Analysis of Top
Colleges in India Using Twitter Data
Abstract: In today’s world, opinions and reviews
accessible to us are one of the most critical factors in formulating our views and
influencing the success of a brand, product or service. With the advent and
growth of social media in the world, stakeholders often take to expressing
their opinions on popular social media, namely Twitter. While Twitter data is
extremely informative, it presents a challenge for analysis because of its
humongous and disorganized nature. This paper is a thorough effort to dive into
the novel domain of performing sentiment analysis of people’s opinions regarding
top colleges in India. Besides taking additional preprocessing measures like
the expansion of net lingo and removal of duplicate tweets, a probabilistic
model based on Bayes’ theorem was used for spelling correction, which is overlooked
in other research studies. This paper also highlights a comparison between the
results obtained by exploiting the following machine learning algorithms: Naïve
Bayes and Support Vector Machine and an Artificial Neural Network model: Multilayer
Perceptron. Furthermore, a contrast has been presented between four different
kernels of SVM: RBF, linear, polynomial and sigmoid.
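The compared classifiers can be reproduced in a few lines with scikit-learn (an assumed library choice; the paper does not prescribe an implementation): Naive Bayes, SVMs with the four kernels, and a Multilayer Perceptron over TF-IDF features.

```python
# Minimal sketch of the compared classifiers using scikit-learn (assumed
# library choice): Naive Bayes, SVMs with four kernels, and a Multilayer
# Perceptron, all over TF-IDF features of toy tweets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

tweets = ["great placements and faculty", "terrible hostel food",
          "amazing campus life", "poor administration"]
labels = ["pos", "neg", "pos", "neg"]

models = {
    "naive_bayes": MultinomialNB(),
    "mlp": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500),
    **{f"svm_{k}": SVC(kernel=k) for k in ("rbf", "linear", "poly", "sigmoid")},
}
for name, model in models.items():
    clf = make_pipeline(TfidfVectorizer(), model)
    clf.fit(tweets, labels)
    print(name, clf.predict(["the faculty is great"]))
```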
IEEE 2016 : Phoenix:
A MapReduce Implementation with New Enhancements
IEEE 2016
Data Mining
Abstract: Lately, the large increase in data volume has resulted in complex and large datasets, which led to the emergence of the "Big Data" concept that has gained the attention of industrial organizations as well as academic communities. Big data APIs that need large memory can benefit from Phoenix, a MapReduce implementation for shared-memory machines, instead of large, distributed clusters of computers. This paper evaluates the design and the prototype of Phoenix, its performance, as well as its limitations. It also suggests some new approaches to overcome some of Phoenix's limitations and to enhance its performance on large-scale shared memory. The major contribution of this work is finding new approaches that overcome the pairs limitation in the Phoenix framework using hash tables with B+-trees and that address the collision problem of hash tables.
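A shared-memory word count in the Phoenix spirit can be sketched with Python's multiprocessing pool; this is an illustrative analogue of the map/reduce-on-one-machine idea, not the Phoenix C runtime or the proposed B+-tree extension.

```python
# Minimal sketch of a Phoenix-style shared-memory MapReduce word count using
# Python's multiprocessing pool (an illustrative analogue, not the Phoenix
# C runtime): map tasks run in parallel on in-memory chunks, and a single
# combine step plays the role of the reduce phase.
from collections import Counter
from multiprocessing import Pool

def map_chunk(chunk):
    return Counter(chunk.split())

def run(text, workers=4):
    words = text.split()
    step = max(1, len(words) // workers)
    chunks = [" ".join(words[i:i + step]) for i in range(0, len(words), step)]
    with Pool(workers) as pool:
        partials = pool.map(map_chunk, chunks)
    return sum(partials, Counter())          # reduce: merge partial counts

if __name__ == "__main__":
    print(run("big data big memory shared memory big"))
```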
IEEE 2016 : On
Traffic-Aware Partition and Aggregation in MapReduce for Big Data Applications
IEEE
2016 Data Mining
Abstract: The MapReduce programming model simplifies large-scale data processing on commodity clusters by exploiting parallel map tasks and reduce tasks. Although many efforts have been made to improve the performance of MapReduce jobs, they ignore the network traffic generated in the shuffle phase, which plays a critical role in performance enhancement. Traditionally, a hash function is used to partition intermediate data among reduce tasks, which, however, is not traffic-efficient because network topology and the data size associated with each key are not taken into consideration. In this paper, we study how to reduce the network traffic cost of a MapReduce job by designing a novel intermediate data partition scheme. Furthermore, we jointly consider the aggregator placement problem, where each aggregator can reduce merged traffic from multiple map tasks. A decomposition-based distributed algorithm is proposed to deal with the large-scale optimization problem for big data applications, and an online algorithm is also designed to adjust data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce the network traffic cost in both offline and online cases.
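A minimal sketch of traffic-aware reducer placement, assuming the simplified objective of keeping each key's heaviest map-output share local (the paper's scheme jointly optimizes partitioning and aggregator placement):

```python
# Minimal sketch (an assumed simplification, not the paper's optimization):
# for each intermediate key, place its reduce task on the node that already
# holds most of that key's map output, so the bytes shipped across the
# network during the shuffle are minimized.
def place_reducers(key_bytes_per_node):
    """key_bytes_per_node: {key: {node: bytes of intermediate data}}."""
    placement, traffic = {}, 0
    for key, by_node in key_bytes_per_node.items():
        best = max(by_node, key=by_node.get)          # keep the heaviest share local
        placement[key] = best
        traffic += sum(b for node, b in by_node.items() if node != best)
    return placement, traffic

if __name__ == "__main__":
    data = {"k1": {"n1": 700, "n2": 100}, "k2": {"n1": 50, "n2": 400}}
    print(place_reducers(data))   # ({'k1': 'n1', 'k2': 'n2'}, 150)
```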
IEEE 2016
Data Mining
Abstract: This paper aims to highlight
distinctive features of the SP theory of intelligence, realized in the SP
computer model, and its apparent advantages compared with some AI-related
alternatives. Perhaps most importantly, the theory simplifies and integrates observations and concepts in AI-related areas, and has potential to simplify and integrate structures and processes in computing systems. Unlike most
other AI-related theories, the SP theory is itself a theory of computing, which
can be the basis for new architectures for computers. Fundamental in the theory
is information compression via the matching and unification of patterns and,
more specifically, via a concept of multiple alignment. The theory promotes
transparency in the representation and processing of knowledge, and
unsupervised learning of natural structures via information compression. It provides
an interpretation of aspects of mathematics and an interpretation of phenomena
in human perception and cognition. Abstract concepts in the theory may be
realized in terms of neurons and their inter-connections (SP-neural). These
features and advantages of the SP system are discussed in relation to
AI-related alternatives: the concept of minimum length encoding and related
concepts, how computational and energy efficiency in computing may be achieved,
deep learning in neural networks, unified theories of cognition and related
research, universal search, Bayesian networks and some other models for AI,
IBM's Watson, solving problems associated with big data and in the development
of intelligence in autonomous robots, pattern recognition and vision, the learning
and processing of natural language, exact and inexact forms of reasoning,
representation and processing of diverse forms of knowledge, and software
engineering. In conclusion, the SP system can provide a firm foundation for the long-term development of AI and related areas, and at the same time, it may
deliver useful results on relatively short timescales.
IEEE 2016 : A
Parallel Patient Treatment Time Prediction Algorithm and Its Applications in
Hospital Queuing-Recommendation in a Big Data Environment
IEEE 2016
Data Mining
Abstract: Effective patient queue
management to minimize patient wait delays and patient overcrowding is one of the major
challenges faced by hospitals. Unnecessary and annoying waits for long periods
result in substantial human
resource and time wastage and increase the frustration endured by patients. For
each patient in the queue, the total treatment time of all the patients before
him is the time that he must wait. It would be convenient and preferable if the
patients could receive the most efficient treatment plan and know the predicted
waiting time through a mobile application that updates in real time. Therefore,
we propose a Patient Treatment Time Prediction (PTTP) algorithm to predict the
waiting time for each treatment task for a patient. We use realistic patient
data from various hospitals to obtain a patient treatment time model for each
task. Based on this large-scale, realistic dataset, the treatment time for each
patient in the current queue of each task is predicted. Based on the predicted
waiting time, a Hospital Queuing-Recommendation (HQR) system is developed. HQR
calculates and predicts an efficient and convenient treatment plan recommended for
the patient. Because of the large-scale, realistic dataset and the requirement
for real-time response, the PTTP algorithm and HQR system mandate efficiency
and low-latency response. We use an Apache Spark-based cloud implementation at
the National Supercomputing Center in Changsha to achieve the aforementioned
goals. Extensive experimentation and simulation results demonstrate the
effectiveness and applicability of our proposed model to recommend an effective
treatment plan for patients to minimize their wait times in hospitals.
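A minimal sketch of the wait-time idea, assuming scikit-learn's random forest as the regressor (the paper uses a Spark-based implementation): predict each queued patient's treatment time and sum the predictions of the patients ahead.

```python
# Minimal sketch of the wait-time idea (an assumed simplification, not the
# paper's Spark implementation): train a regressor on historical treatment
# records, then predict each queued patient's treatment time and sum the
# predictions of everyone ahead to estimate the wait.
from sklearn.ensemble import RandomForestRegressor

# Historical records: [age, task_id, hour_of_day] -> treatment minutes (toy data).
X = [[65, 1, 9], [30, 1, 10], [50, 2, 14], [40, 2, 9], [70, 1, 16]]
y = [20, 12, 35, 30, 25]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

queue_ahead = [[55, 1, 11], [45, 2, 11]]           # patients already waiting
predicted_wait = sum(model.predict(queue_ahead))   # minutes until our turn
print(round(float(predicted_wait), 1))
```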
IEEE 2016 : Protection
of Big Data Privacy
IEEE 2016 Big
Data Applications
Abstract— In recent years, big data have become a hot
research topic. The increasing amount of big data also increases the chance of
breaching the privacy of individuals. Since big data require high computational
power and large storage, distributed systems are used. As multiple parties are
involved in these systems, the risk of privacy violation is increased. There
have been a number of privacy-preserving mechanisms developed for privacy
protection at different stages (e.g., data generation, data storage, and data
processing) of a big data life cycle. The goal of this paper is to provide a
comprehensive overview of the privacy preservation mechanisms in big data and
present the challenges for existing mechanisms. In particular, in this paper,
we illustrate the infrastructure of big data and the state-of-the-art
privacy-preserving mechanisms in each stage of the big data life cycle.
Furthermore, we discuss the challenges and future research directions related
to privacy preservation in big data.
IEEE
2016 : Towards
a Virtual Domain Based Authentication on MapReduce
IEEE 2016
Data Mining
Abstract: This paper has proposed a
novel authentication solution for the MapReduce (MR) model, a new distributed
and parallel computing paradigm commonly deployed to process BigData by major
IT players, such as Facebook and Yahoo. It identifies a set of security, performance,
and scalability requirements that are specified from a comprehensive study of a
job execution process using MR and security threats and attacks in this
environment. Based on the requirements, it critically analyzes the
state-of-the-art authentication solutions, discovering that the authentication services currently proposed for the MR model are not adequate. This paper then
presents a novel layered authentication solution for the MR model and describes
the core components of this solution, which includes the virtual domain based
authentication framework (VDAF). These novel ideas are significant, because, first,
the approach embeds the characteristics of MR-in-cloud deployments into
security solution designs, and this will allow the MR model to be delivered as software as a service in a public cloud environment along with our proposed
authentication solution; second, VDAF supports the authentication of every interaction by any MR component involved in a job execution flow, so long as
the interactions are for accessing resources of the job; third, this continuous
authentication service is provided in such a manner that the costs incurred in
providing the authentication service should be as low as possible.
IEEE
2016 : CMiner: Opinion Extraction and Summarization
for Chinese Microblogs
IEEE 2016
Data Mining
Abstract—Sentiment analysis of
microblog texts has drawn lots of attention in both the academic and industrial
fields. However, most of the current work only focuses on polarity
classification. In this paper, we present an opinion mining system for Chinese
microblogs called CMiner. Instead of polarity classification, CMiner focuses on
more complicated opinion mining tasks: opinion target extraction and opinion summarization. Novel algorithms are developed for the two tasks and integrated into the end-to-end system. CMiner can help to effectively understand users' opinions towards different opinion targets in a microblog topic. Specifically, we develop an unsupervised label propagation algorithm for opinion
target extraction. The opinion targets of all messages in a topic are
collectively extracted based on the assumption that similar messages may focus
on similar opinion targets. In addition, we build an aspect-based opinion
summarization framework for microblog topics. After getting the opinion targets
of all the microblog messages in a topic, we cluster the opinion targets into
several groups and extract representative targets and summaries for each group.
A co-ranking algorithm is proposed to rank both the opinion targets and
microblog sentences simultaneously. Experimental results on a benchmark dataset
show the effectiveness of our system and the algorithms.
IEEE 2015: A Hierarchical Distributed
Processing Framework for Big Image Data
Abstract: This paper introduces an
effective processing framework named ICP (Image Cloud Processing) to powerfully cope with the data explosion in the image processing field. While most previous research focuses on optimizing image processing algorithms to gain higher efficiency, our work is dedicated to providing a general framework for those image processing algorithms, which can be implemented in parallel so as to achieve a boost in time efficiency without compromising the quality of results as the image scale increases. The proposed ICP framework consists
of two mechanisms, i.e. SICP (Static ICP) and DICP (Dynamic ICP). Specifically,
SICP is aimed at processing the big image data pre-stored in the distributed
system, while DICP is proposed for dynamic input. To accomplish SICP, two novel
data representations named P-Image and Big-Image are designed to cooperate with
MapReduce to achieve more optimized configuration and higher efficiency. DICP
is implemented through a parallel processing procedure working with the
traditional processing mechanism of the distributed system. Representative
results of comprehensive experiments on the challenging ImageNet dataset are
selected to validate the capacity of our proposed ICP framework over the
traditional state-of-the-art methods, both in time efficiency and quality of
results.
IEEE
2015 : Research Directions for Engineering Big Data Analytics Software
IEEE 2016
Data Mining
Abstract: Many software startups and
research and development efforts are actively taking place to harness the power
of big data and create software with potential to improve almost every aspect
of human life. As these efforts continue to increase, full consideration needs
to be given to engineering aspects of big data software. Since these systems
exist to make predictions on complex and continuous massive datasets, they pose
unique problems during specification, design, and verification of software that
needs to be delivered on-time and within budget. But, given the nature of big
data software, can this be done? Does big data software engineering really
work? This article explores details of big data software, discusses the main
problems encountered when engineering big data software, and proposes avenues
for future research.
IEEE 2015 : An Aggregatable Name-Based Routing for Energy-Efficient
Data Sharing in Big Data Era
IEEE 2015 Transaction on Big Data
Abstract— The MapReduce programming model simplifies large-scale data
processing on commodity cluster by exploiting parallel map tasks and reduce
tasks. Although many efforts have been made to improve the performance of
MapReduce jobs, they ignore the network traffic generated in the shuffle phase,
which plays a critical role in performance enhancement. Traditionally, a hash
function is used to partition intermediate data among reduce tasks, which,
however, is not traffic-efficient because network topology and data size
associated with each key are not taken into consideration. In this paper, we
study how to reduce the network traffic cost of a MapReduce job by designing a novel
intermediate data partition scheme. Furthermore, we jointly consider the
aggregator placement problem, where each aggregator can reduce merged traffic
from multiple map tasks. A decomposition-based distributed algorithm is
proposed to deal with the large-scale optimization problem for big data
application and an online algorithm is also designed to adjust data partition
and aggregation in a dynamic manner. Finally, extensive simulation results
demonstrate that our proposals can significantly reduce network traffic cost
under both offline and online cases.
Abstract— Users store vast amounts of sensitive data on a big data platform.
Sharing sensitive data will help enterprises reduce the cost of providing users
with personalized services and provide value-added data services. However,
secure data sharing is problematic. This paper proposes a framework for secure
sensitive data sharing on a big data platform, including secure data delivery,
storage, usage, and destruction on a semi-trusted big data sharing platform. We
present a proxy re-encryption algorithm based on heterogeneous ciphertext
transformation and a user process protection method based on a virtual machine
monitor, which provides support for the realization of system functions. The
framework protects the security of users’ sensitive data effectively and shares
these data safely. At the same time, data owners retain complete control of
their own data in a sound environment for modern Internet information security.
IEEE 2015 Transaction on Big Data
Abstract— The paper has two parts. The first one deals with how to
use large random matrices as building blocks to model the massive data arising
from the massive (or large-scale) MIMO system. As a result, we apply this model
for distributed spectrum sensing and network monitoring. This part boils down to streaming, distributed massive data, for which a new algorithm is obtained
and its performance is derived using the central limit theorem that is recently
obtained in the literature. The second part deals with the large-scale testbed
using software-defined radios (particularly USRP) that takes us more than four
years to develop this 70-node network testbed. To demonstrate the power of the
software defined radio, we reconfigure our testbed quickly into a testbed
for massive MIMO. The massive data of this testbed is of central interest
in this paper. This is the first time we have modeled the experimental data arising from this testbed. To the best of our knowledge, we are not aware of other similar work.
IEEE 2015 Transaction on Big Data
Abstract— This paper is about how the
SP theory of intelligence and its realization in the SP machine may, with
advantage, be applied to the management and analysis of big data. The SP
system_introduced in this paper and fully described elsewhere_may help to
overcome the problem of variety in big data; it has potential as a universal
framework for the representation and processing of diverse kinds of knowledge,
helping to reduce the diversity of formalisms and formats for knowledge, and
the different ways in which they are processed. It has strengths in the
unsupervised learning or discovery of structure in data, in pattern
recognition, in the parsing and production of natural language, in several
kinds of reasoning, and more. It lends itself to the analysis of streaming
data, helping to overcome the problem of velocity in big data. Central in the
workings of the system is lossless compression of information: making big data
smaller and reducing problems of storage and management. There is potential for
substantial economies in the transmission of data, for big cuts in the use of
energy in computing, for faster processing, and for smaller and lighter
computers. The system provides a handle on the problem of veracity in big data,
with potential to assist in the management of errors and uncertainties in data.
It lends itself to the visualization of knowledge structures and inferential
processes. A high-parallel, open-source version of the SP machine would provide
a means for researchers everywhere to explore what can be done with the system
and to create new versions of it.
IEEE 2015 Transaction on Big Data
Abstract— The explosive growth of demands on big data processing imposes a
heavy burden on computation, storage, and communication in data centers, which
hence incurs considerable operational expenditure to data center providers.
Therefore, cost minimization has become an emergent issue for the upcoming
big data era. Different from
conventional cloud services, one of the main features of big data services is
the tight coupling between data and computation as computation tasks can be
conducted only when the corresponding data are available. As a result, three
factors, i.e., task assignment, data placement, and data movement, deeply influence
the operational expenditure of data centers. In this paper, we are motivated to
study the cost minimization problem via a joint optimization of these three
factors for big data services in geo-distributed data centers. To describe the
task completion time with the consideration of both data transmission and
computation, we propose a 2-D Markov chain and derive the average task
completion time in closed-form. Furthermore, we model the problem as a
mixed-integer nonlinear programming and propose an efficient solution to
linearize it. The high efficiency of our proposal is validated by extensive
simulation-based studies.