IEEE 2017: Efficient Processing of Skyline Queries Using MapReduce
Abstract:
The skyline operator has attracted
considerable attention recently due to its broad applications. However,
computing a skyline is challenging today since we have to deal with big data.
For data-intensive applications, the MapReduce framework has been widely used
recently. In this paper, we propose the efficient parallel algorithm SKY-MR+
for processing skyline queries using MapReduce. We first build a quadtree-based
histogram for space partitioning by deciding whether to split each leaf node
judiciously based on the benefit of splitting in terms of the estimated
execution time. In addition, we apply the dominance power filtering method to
effectively prune non-skyline points in advance. We next partition data based
on the regions divided by the quadtree and compute candidate skyline points for
each partition using MapReduce. Finally, we check whether each skyline
candidate point is actually a skyline point in every partition using MapReduce.
We also develop workload balancing methods that make the estimated execution
times of all available machines similar. We ran experiments to compare
SKY-MR+ with the state-of-the-art algorithms using MapReduce and confirmed the
effectiveness as well as the scalability of SKY-MR+.
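The dominance test and the two-phase structure (per-partition candidates, then a global check) described in the abstract can be illustrated with a minimal single-machine sketch. It assumes smaller values are preferred in every dimension, and the names `dominates`, `skyline`, and `two_phase_skyline` are illustrative. It is a naive quadratic scan, not SKY-MR+ itself: there is no quadtree histogram, no dominance power filtering, and no actual MapReduce job.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """Return True if p dominates q (assumption: smaller is better)."""
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Keep every point that no other point dominates (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def two_phase_skyline(partitions: List[List[Point]]) -> List[Point]:
    """Phase 1: local skyline per partition. Phase 2: filter the candidates globally."""
    candidates = [p for part in partitions for p in skyline(part)]
    return skyline(candidates)
```

The second phase is needed because a point that survives inside its own partition may still be dominated by a point from another partition.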
IEEE 2017: NetSpam: a Network-based Spam Detection Framework for Reviews in Online Social Media
Abstract: Nowadays, many people rely on available content in social media
when making decisions (e.g., reviews and feedback on a topic or product).
The possibility that anybody can leave a review provides a golden opportunity
for spammers to write spam reviews about products and services for different
interests. Identifying these spammers and the spam content is a hot topic of
research, and although a considerable number of studies have been done
recently toward this end, the methodologies put forth so far still barely
detect spam reviews, and none of them shows the importance of each extracted
feature type. In this study, we propose a novel framework, named NetSpam,
which utilizes spam features for modeling review datasets as heterogeneous
information networks to map the spam detection procedure into a
classification problem in such networks. Using the importance of spam
features helps us to obtain better results in terms of different metrics on
real-world review datasets from the Yelp and Amazon websites. The results
show that NetSpam outperforms the existing methods and that, among four
categories of features (review-behavioral, user-behavioral,
review-linguistic, and user-linguistic), the first type of features performs
better than the other categories.
IEEE 2017: Practical Privacy-Preserving MapReduce Based K-means Clustering over Large-scale Dataset
Abstract: Clustering techniques have
been widely adopted in many real-world data analysis applications, such as
customer behavior analysis, targeted marketing, and digital forensics. With
the explosion of data in today’s big data era, a major trend for handling
clustering over large-scale datasets is outsourcing it to public cloud
platforms. This is because cloud computing offers not only reliable services with
performance guarantees, but also savings on in-house IT infrastructure.
However, as datasets used for clustering may contain sensitive information,
e.g., patient health information, commercial data, and behavioral data,
directly outsourcing them to public cloud servers inevitably raises privacy
concerns. In this paper, we propose a practical privacy-preserving K-means
clustering scheme that can be efficiently outsourced to cloud servers. Our
scheme allows cloud servers to perform clustering directly over encrypted
datasets, while achieving computational complexity and accuracy comparable
to clustering over unencrypted ones. We also investigate secure
integration of MapReduce into our scheme, which makes our scheme extremely
suitable for cloud computing environments. Thorough security analysis and
numerical analysis demonstrate the performance of our scheme in terms of security
and efficiency. Experimental evaluation over a dataset of 5 million objects
further validates the practical performance of our scheme.
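For readers unfamiliar with how K-means maps onto MapReduce, the following is a minimal sketch of one iteration over plaintext data. It is only the standard, unencrypted formulation: the encryption and secure-integration parts of the scheme in the abstract are not represented here, and the function names (`map_phase`, `reduce_phase`, `kmeans_iteration`) are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: List[Point]) -> int:
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )


def map_phase(points: List[Point], centroids: List[Point]) -> List[Tuple[int, Point]]:
    """Map: emit (cluster index, point) for every input point."""
    return [(nearest(p, centroids), p) for p in points]


def reduce_phase(pairs: List[Tuple[int, Point]], centroids: List[Point]) -> List[Point]:
    """Reduce: average the points of each cluster to get its new centroid."""
    groups: Dict[int, List[Point]] = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    updated = list(centroids)  # clusters with no members keep their old centroid
    for idx, members in groups.items():
        updated[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return updated


def kmeans_iteration(points: List[Point], centroids: List[Point]) -> List[Point]:
    """One map/reduce round; repeat until the centroids stop moving."""
    return reduce_phase(map_phase(points, centroids), centroids)
```

In a real MapReduce deployment the map calls run on separate data splits and the shuffle groups the pairs by cluster index; here both steps run in one process for clarity.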
IEEE 2017: SocialQ&A: An Online Social Network Based Question and Answer System
Abstract: Question
and Answer (Q&A) systems play a vital role in our daily life for
information and knowledge sharing. Users post questions and pick questions to
answer in the system. Due to the rapidly growing user population and the number
of questions, it is unlikely for a user to stumble upon a question by chance
that (s)he can answer. Also, altruism does not encourage all users to provide
answers, not to mention high quality answers with a short answer wait time. The
primary objective of this paper is to improve the performance of Q&A
systems by actively forwarding questions to users who are capable and willing
to answer the questions. To this end, we have designed and implemented
SocialQ&A, an online social network based Q&A system. SocialQ&A
leverages the social network properties of common-interest and mutual-trust
friend relationships to identify, through an asker’s friendships, the users who are most likely
to answer the question, and to enhance user security. We also improve
SocialQ&A with security and efficiency enhancements by protecting user
privacy and identities, and by retrieving answers automatically for recurrent
questions. We describe the architecture and algorithms, and conducted a
comprehensive large-scale simulation to evaluate SocialQ&A in comparison
with other methods. Our results suggest that social networks can be leveraged
to improve the answer quality and the asker’s waiting time. We also implemented a
real prototype of SocialQ&A, and analyzed the Q&A behavior of real users
and questions from a small-scale real-world SocialQ&A system.
IEEE 2017: Authorship Attribution for Social Media Forensics
Abstract: The veil of anonymity
provided by smartphones with pre-paid SIM cards, public Wi-Fi hotspots, and
distributed networks like Tor has drastically complicated the task of
identifying users of social media during forensic investigations. In some
cases, the text of a single posted message will be the only clue to an author’s
identity. How can we accurately predict who that author might be when the
message may never exceed 140 characters on a service like Twitter? For the past
50 years, linguists, computer scientists and scholars of the humanities have
been jointly developing automated methods to identify authors based on the
style of their writing. All authors possess peculiarities of habit that
influence the form and content of their written works. These characteristics can
often be quantified and measured using machine learning algorithms. In this
article, we provide a comprehensive review of the methods of authorship
attribution that can be applied to the problem of social media forensics.
Further, we examine emerging supervised learning-based methods that are
effective for small sample sizes, and provide step-by-step explanations for
several scalable approaches as instructional case studies for newcomers to the
field. We argue that there is a significant need in forensics for new
authorship attribution algorithms that can exploit context, can process
multimodal data, and are tolerant to incomplete knowledge of the space of all
possible authors at training time.
IEEE 2017: Detecting and Analyzing Urban Regions with High Impact of Weather Change on Transport
Abstract: In this work, we focus on
two fundamental questions that are unprecedentedly important to urban planners
to understand the functional characteristics of various urban regions throughout
a city, namely, (i) how to identify a regional weather-traffic sensitivity index
throughout a city, which indicates the degree to which regional traffic in a
city is impacted by weather changes; (ii) among complex regional features, such
as road structure and population density, how to dissect the most influential
regional features that drive the urban region traffic to be more vulnerable to
weather changes. However, these two questions are nontrivial to answer, because
urban traffic changes dynamically over time and is essentially affected by many
other factors, which may dominate the overall impact. We make the first study
on these questions, by developing a weather-traffic index (WTI) system. The
system includes two main components: weather-traffic index establishment and
key factor analysis. Using the proposed system, we conducted a comprehensive
empirical study in Shanghai, and the weather-traffic indices extracted have
been validated to be surprisingly consistent with real world observations.
Further regional key factor analysis yields interesting
results. For example, house age has significant impact on the weather-traffic
index, which sheds light on future urban planning and reconstruction.
IEEE 2017: SociRank: Identifying and Ranking Prevalent News Topics Using Social Media Factors
Abstract: Mass media sources,
specifically the news media, have traditionally informed us of daily events. In
modern times, social media services such as Twitter provide an enormous amount
of user-generated data, which have great potential to contain informative
news-related content. For these resources to be useful, we must find a way to
filter noise and only capture the content that, based on its similarity to the
news media, is considered valuable. However, even after noise is removed,
information overload may still exist in the remaining data—hence, it is
convenient to prioritize it for consumption. To achieve prioritization,
information must be ranked in order of estimated importance considering three
factors. First, the temporal prevalence of a particular topic in the news media
is a factor of importance, and can be considered the media focus (MF) of a
topic. Second, the temporal prevalence of the topic in social media indicates
its user attention (UA). Last, the interaction between the social media users
who mention this topic indicates the strength of the community discussing it,
and can be regarded as the user interaction (UI) toward the topic. We propose
an unsupervised framework—SociRank—which identifies news topics prevalent in
both social media and the news media, and then ranks them by relevance using
their degrees of MF, UA, and UI. Our experiments show that SociRank improves
the quality and variety of automatically identified news topics.
IEEE 2017: RAPARE: A Generic Strategy for Cold-Start Rating Prediction Problem
Abstract: In recent years,
recommender systems have become indispensable components of many e-commerce
websites. One of the major challenges that largely remains open is the
cold-start problem, which can be viewed as a barrier that keeps the cold-start
users/items away from the existing ones. In this paper, we aim to break through
this barrier for cold-start users/items with the assistance of existing ones. In
particular, inspired by the classic Elo Rating System, which has been widely
adopted in chess tournaments, we propose a novel rating comparison strategy
(RAPARE) to learn the latent profiles of cold-start users/items. The centerpiece
of RAPARE is to provide a fine-grained calibration on the latent profiles
of cold-start users/items by exploring the differences between cold-start and
existing users/items. As a generic strategy, our proposed strategy can be
instantiated into existing methods in recommender systems. To reveal the
capability of the RAPARE strategy, we instantiate our strategy on two prevalent
methods in recommender systems, i.e., matrix factorization based and
neighborhood based collaborative filtering. Experimental evaluations on five
real data sets validate the superiority of our approach over the existing
methods in the cold-start scenario.
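As background for the Elo Rating System mentioned in the abstract, the classic chess update can be sketched as follows. This is the standard Elo formula only, not the RAPARE update rule, which the abstract does not specify. The K-factor of 32 and the function names are assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the classic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> float:
    """New rating for A after one game.

    score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    k (the K-factor) is an assumed value; federations use different ones.
    """
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))
```

The rating moves in proportion to the gap between the observed and the expected outcome, so an unexpected result shifts a rating more than an expected one.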