IEEE 2018 : Deep Air Learning: Interpolation, Prediction, and
Feature Analysis of Fine-grained Air Quality
Abstract : The interpolation, prediction, and feature analysis of fine-grained
air quality are three important topics in the area of urban air computing. The
solutions to these topics can provide extremely useful information to support
air pollution control, and consequently generate great societal and technical
impacts. Most of the existing work solves the three problems separately by
different models. In this paper, we propose a general and effective approach to
solve the three problems in one model called the Deep Air Learning (DAL). The
main idea of DAL lies in embedding feature selection and semi-supervised
learning in different layers of the deep learning network. The proposed
approach utilizes the information pertaining to the unlabeled spatio-temporal
data to improve the performance of the interpolation and the prediction, and
performs feature selection and association analysis to reveal the features most relevant to the variation of the air quality. We evaluate our
approach with extensive experiments based on real data sources obtained in
Beijing, China. Experiments show that DAL is superior to the peer models from
the recent literature when solving the topics of interpolation, prediction, and
feature analysis of fine-grained air quality.
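As a rough illustration of embedding feature selection in a network layer (a sketch under assumptions, not the authors' actual DAL architecture; the semi-supervised part is omitted), one can gate each input feature and penalize the gates with an L1 term so that irrelevant features are driven toward zero:

    import torch
    import torch.nn as nn

    class FeatureSelectNet(nn.Module):
        # One learnable gate per input feature; an L1 penalty on the gates
        # pushes the weights of irrelevant features toward zero.
        def __init__(self, n_features, n_hidden):
            super().__init__()
            self.gate = nn.Parameter(torch.ones(n_features))
            self.body = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU(),
                                      nn.Linear(n_hidden, 1))

        def forward(self, x):
            return self.body(x * self.gate).squeeze(-1)

    model = FeatureSelectNet(n_features=10, n_hidden=16)
    x, y = torch.randn(32, 10), torch.randn(32)          # toy data
    loss = nn.functional.mse_loss(model(x), y) + 1e-3 * model.gate.abs().sum()
    loss.backward()
    # after training, a small |gate[i]| marks feature i as irrelevant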
IEEE 2018 : Heterogeneous Information Network Embedding for
Recommendation
Abstract : Due to the flexibility in modelling data heterogeneity,
heterogeneous information network (HIN) has been adopted to characterize
complex and heterogeneous auxiliary data in recommender systems, called HIN
based recommendation. It is challenging to develop effective methods for HIN
based recommendation in both extraction and exploitation of the information
from HINs. Most HIN-based recommendation methods rely on path-based similarity, which cannot fully mine the latent structural features of users and items.
In this paper, we propose a novel heterogeneous network embedding based
approach for HIN based recommendation, called HERec. To embed HINs, we design a
meta-path based random walk strategy to generate meaningful node sequences for
network embedding. The learned node embeddings are first transformed by a set
of fusion functions, and subsequently integrated into an extended matrix
factorization (MF) model. The extended MF model, together with the fusion functions, is jointly optimized for the rating prediction task. Extensive experiments on three
real-world datasets demonstrate the effectiveness of the HERec model. Moreover,
we show the capability of the HERec model for the cold-start problem, and
reveal that the transformed embedding information from HINs can improve the
recommendation performance.
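A minimal sketch of a meta-path based random walk (the toy graph, type labels, and meta-path below are assumptions; HERec then feeds such walks to a skip-gram model, e.g. gensim's Word2Vec, and fuses the resulting embeddings into MF):

    import random

    def metapath_walk(graph, ntype, start, metapath, length):
        # graph: node -> list of neighbours; ntype: node -> type label.
        # metapath, e.g. ['user', 'item', 'user'], starts and ends with the
        # same type; the walk only follows neighbours of the required type.
        assert metapath[0] == metapath[-1] == ntype[start]
        cycle = metapath[1:]
        walk, i = [start], 0
        while len(walk) < length:
            want = cycle[i % len(cycle)]
            cands = [v for v in graph[walk[-1]] if ntype[v] == want]
            if not cands:
                break
            walk.append(random.choice(cands))
            i += 1
        return walk

    graph = {'u1': ['i1'], 'u2': ['i1'], 'i1': ['u1', 'u2']}
    ntype = {'u1': 'user', 'u2': 'user', 'i1': 'item'}
    print(metapath_walk(graph, ntype, 'u1', ['user', 'item', 'user'], 5))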
IEEE 2018 : Correlated Matrix Factorization for Recommendation
with Implicit Feedback
Abstract : As a typical latent factor model, Matrix Factorization (MF) has
demonstrated its great effectiveness in recommender systems. Users and items
are represented in a shared low-dimensional space so that the user preference
can be modeled by linearly combining the item factor vector V using the
user-specific coefficients U. From a generative model perspective, U and V are
drawn from two independent Gaussian distributions, which is not so faithful to
the reality. Items are produced to maximally meet users’ requirements, which
makes U and V strongly correlated. Meanwhile, the linear combination between U and
V forces a bijection (one-to-one mapping), which thereby neglects the mutual
correlation between the latent factors. In this paper, we address the above drawbacks,
and propose a new model, named Correlated Matrix Factorization (CMF).
Technically, we apply Canonical Correlation Analysis (CCA) to map U and V into
a new semantic space. Besides achieving the optimal fitting on the rating
matrix, one component in each vector (U or V ) is also tightly correlated with
every single component in the other. We derive efficient inference and learning
algorithms based on variational EM methods. The effectiveness of our proposed
model is comprehensively verified on four public datasets. Experimental results
show that our approach achieves competitive performance on both prediction
accuracy and efficiency compared with the current state of the art.
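The CCA intuition behind CMF can be sketched as follows (the stand-in factors and observed pairs are synthetic; the actual CMF model is fit by variational EM, not by scikit-learn's CCA):

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    U = rng.normal(size=(100, 8))                 # user factors (stand-ins)
    V = rng.normal(size=(50, 8))                  # item factors (stand-ins)
    pairs = [(u, i) for u in range(100)
             for i in rng.choice(50, size=5, replace=False)]

    X = np.array([U[u] for u, _ in pairs])        # user factor per observed rating
    Y = np.array([V[i] for _, i in pairs])        # item factor per observed rating

    cca = CCA(n_components=8)
    Uc, Vc = cca.fit_transform(X, Y)              # maximally correlated coordinates
    pred = (Uc * Vc).sum(axis=1)                  # inner product in the shared space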
IEEE 2018 : Classification Of A Bank Data Set On Various Data Mining Platforms
Abstract : The process of extracting meaningful rules from big and complex
data is called data mining. Data mining has an increasing popularity in every
field today. Data units are established in customer-oriented industries such as
marketing, finance and telecommunication to work on the customer churn and
acquisition, in particular. Among the data mining methods, classification
algorithms are used in studies conducted for customer acquisition to predict
the potential customers of the company in question in the related industry. In
this study, the bank marketing data set from the UCI Machine Learning Repository was used, creating models with the same classification algorithms in different data mining programs. Accuracy, precision and F-measure criteria were used to test
performances of the classification models. When creating the classification
models, the test and training data sets were randomly divided by the holdout method
to evaluate the performance of the data set. The data set was divided into
training and test data sets with the 60-40%, 75-25% and 80-20% separation
ratios. The data mining programs used for these processes are R, KNIME, RapidMiner, and WEKA, and the classification algorithms common to these platforms are k-nearest neighbor (k-NN), Naive Bayes, and the C4.5 decision tree.
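The evaluation protocol is straightforward to reproduce in outline (synthetic stand-in data below; the study uses the UCI bank marketing set, and scikit-learn's DecisionTreeClassifier is a CART stand-in for C4.5):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, precision_score, f1_score

    X, y = make_classification(n_samples=1000, n_features=16, random_state=0)
    for test_size in (0.40, 0.25, 0.20):          # 60-40%, 75-25%, 80-20% splits
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                                  random_state=1)
        for clf in (KNeighborsClassifier(), GaussianNB(), DecisionTreeClassifier()):
            y_hat = clf.fit(X_tr, y_tr).predict(X_te)
            print(type(clf).__name__, test_size, accuracy_score(y_te, y_hat),
                  precision_score(y_te, y_hat), f1_score(y_te, y_hat))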
IEEE 2018 : Harnessing Multi-source Data about Public Sentiments
and Activities for Informed Design
Abstract : The intelligence of Smart Cities (SC) is represented by their ability
in collecting, managing, integrating, analyzing and mining multi-source data
for valuable insights. In order to harness multi-source data for an informed
place design, this paper presents “Public Sentiments and Activities in Places”
multi-source data analysis flow (PSAP) in an Informed Design Platform (IDP). In
terms of key contributions, PSAP implements 1) an Interconnected Data Model
(IDM) to manage multi-source data independently and integrally, 2) an efficient
and effective data mining mechanism based on multi-dimension and multi-measure
queries (MMQs), and 3) concurrent data processing cascades with Sentiments in
Places Analysis Mechanism (SPAM) and Activities in Places Analysis Mechanism (APAM),
to fuse social network data with other data on public sentiment and activity
comprehensively. As shown by a holistic evaluation, both SPAM and APAM outperform the compared methods. Specifically, SPAM improves its classification accuracy
gradually and significantly from 72.37% to about 85% within 9 crowd-calibration
cycles, and APAM with an ensemble classifier achieves the highest precision of
92.13%, which is approximately 13% higher than the second best method. Finally,
by applying MMQs on “Sentiment&Activity Linked Data”, various place design
insights of our testbed are mined to improve its livability.
IEEE 2018 : A Data Mining based Model for Detection of Fraudulent
Behaviour in Water Consumption
Abstract : Fraudulent behavior in drinking water consumption is a significant
problem facing water supplying companies and agencies. This behavior results in
a massive loss of income and forms the highest percentage of non-technical
loss. Finding efficient measures for detecting fraudulent activities has
been an active research area in recent years. Intelligent data mining techniques
can help water supplying companies to detect these fraudulent activities to
reduce such losses. This research explores the use of two classification
techniques (SVM and KNN) to detect suspicious fraud water customers. The main
motivation of this research is to assist Yarmouk Water Company (YWC) in Irbid
city of Jordan to overcome its profit loss. The SVM based approach uses
customer load profile attributes to expose abnormal behavior that is known to
be correlated with non-technical loss activities. The data has been collected
from the historical data of the company billing system. The generated model reached an accuracy of over 74%, which is better than the current manual prediction procedures used by the YWC. To deploy the model, a decision tool has been
built using the generated model. The system will help the company to predict
suspicious water customers to be inspected on site.
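In outline, the SVM part of such a detector could look like the sketch below (the load-profile features and class balance are assumptions; the paper's exact attributes come from the YWC billing system):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # stand-in for customer load profiles: one row of monthly-consumption
    # features per customer, with few confirmed fraud cases (imbalanced labels)
    X, y = make_classification(n_samples=500, n_features=12, weights=[0.9],
                               random_state=0)
    model = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='balanced'))
    print(cross_val_score(model, X, y, cv=5, scoring='accuracy').mean())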
IEEE 2018: Collaborative Filtering Algorithm Based on Rating
Difference and User Interest
Abstract: Collaborative filtering is one of the most widely used approaches in daily life, so how to improve the quality and efficiency of collaborative filtering algorithms is an essential problem. Traditional algorithms usually focus on user ratings alone and do not take rating differences between users and user interest into account.
However, users who have little rating difference or have a similar interest may
be highly similar. In this paper, a collaborative filtering algorithm based on rating difference and user interest is proposed. Firstly, a rating difference factor is added to the traditional collaborative filtering algorithm, where the most appropriate factor can be obtained by experiments. Secondly, the user's interest is calculated by combining the attributes of the items, and the similarity of personal interest between users is then computed. Finally, the rating-difference similarity and the interest similarity are weighted to produce the final item recommendation and score forecast. Experimental results on a benchmark data set show that the proposed algorithm decreases both Mean Absolute Error and Root Mean Squared Error, improving recommendation accuracy.
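A minimal sketch of the two similarity ingredients and their weighted combination (the functional forms and weights below are assumptions; the paper tunes its difference factor experimentally):

    import numpy as np

    def rating_diff_sim(ru, rv, alpha=0.5):
        # similarity from the mean absolute rating difference on co-rated items
        common = ~np.isnan(ru) & ~np.isnan(rv)
        if not common.any():
            return 0.0
        return 1.0 / (1.0 + alpha * np.abs(ru[common] - rv[common]).mean())

    def combined_sim(ru, rv, iu, iv, w=0.6):
        # iu, iv: user interest vectors built from item attributes
        interest = iu @ iv / (np.linalg.norm(iu) * np.linalg.norm(iv))
        return w * rating_diff_sim(ru, rv) + (1 - w) * interest

    ru = np.array([4.0, np.nan, 3.0]); rv = np.array([5.0, 2.0, 3.0])
    iu = np.array([1.0, 0.0]);         iv = np.array([0.8, 0.2])
    print(combined_sim(ru, rv, iu, iv))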
IEEE 2018: Collaborative filtering model for enhancing fingerprint
image
Abstract: Fingerprint enhancement plays a very important role in automatic fingerprint identification systems. In order to ensure reliable fingerprint identification and improve the fingerprint
ridge structure, a novel method based on the collaborative filtering model for
fingerprint enhancement is proposed. The proposed method consists of two
stages. First, the original fingerprint is pre-enhanced by using Gabor filter
and linear contrast stretching. Next, the pre-enhanced fingerprint is
partitioned into patches in spatial domain, and then the patches are enhanced
based on spectra diffusion by using the two-dimensional (2D) angular-pass filter
and the 2D Butterworth band-pass filter. The proposed method takes full
advantage of the ridge information and spectra diffusion with higher quality to
recover the lost ridge information. To evaluate the proposed method, the FVC2004 databases are employed, and the comparison experiments are carried out using
various methods. Comparative experimental results show that the proposed
algorithm outperforms the existing state-of-the-art methods on fingerprint
enhancement.
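Stage one (contrast stretching plus Gabor filtering) can be sketched with OpenCV as below; the kernel parameters and the single fixed orientation are assumptions (in practice the orientation is estimated per block), and the spectral band-pass stage is omitted:

    import cv2
    import numpy as np

    rng = np.random.default_rng(0)
    img = (rng.random((128, 128)) * 255).astype(np.float32)  # stand-in image

    # linear contrast stretching to the full 0-255 range
    stretched = 255 * (img - img.min()) / (img.max() - img.min() + 1e-9)

    # one Gabor kernel tuned to an assumed ridge frequency and orientation
    kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=np.pi / 4,
                                lambd=10.0, gamma=0.5, psi=0)
    pre_enhanced = cv2.filter2D(stretched, ddepth=cv2.CV_32F, kernel=kernel)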
IEEE 2018: A Novel Mechanism for Fast Detection of Transformed Data
Leakage
Abstract: Data leakage is a
growing insider threat in information security among organizations and individuals.
A series of methods have been developed to address the problem of data leakage
prevention (DLP). However, large amounts of unstructured data need to be tested
in the Big Data era. As the volume of data grows dramatically and the forms of data become much more complicated, dealing with large amounts of transformed data is a new challenge for DLP. We propose an Adaptive weighted Graph Walk model
(AGW) to solve this problem by mapping it to the dimension of weighted graphs.
Our approach solves this problem in three steps. First, the adaptive weighted
graphs are built to quantify the sensitivity of tested data based on its
context. Then, the improved label propagation is used to enhance the
scalability for fresh data. Finally, a low-complexity score walk algorithm is
proposed to determine the ultimate sensitivity. Experimental results show that
the proposed method can detect leaks of transformed or fresh data quickly and efficiently.
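The label-propagation step can be illustrated with scikit-learn's off-the-shelf LabelPropagation (a stand-in for the paper's improved variant over adaptive weighted graphs; the data and parameters are toy assumptions):

    from sklearn.datasets import make_moons
    from sklearn.semi_supervised import LabelPropagation

    X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
    y[50:] = -1                         # -1 marks unlabeled "fresh" data
    lp = LabelPropagation(kernel='knn', n_neighbors=7).fit(X, y)
    sensitivity = lp.transduction_      # inferred labels for the fresh items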
IEEE 2018: Machine Learning Methods for Disease Prediction with
Claims Data
Abstract: One of the primary
challenges of healthcare delivery is aggregating disparate, asynchronous data
sources into meaningful indicators of individual health. We combine natural language
word embedding and network modeling techniques to learn meaningful
representations of medical concepts by using the weighted network adjacency
matrix in the GloVe algorithm, which we call Code2Vec. We demonstrate that using our learned embeddings improves neural network performance for disease prediction.
However, we also demonstrate that popular deep learning models for disease
prediction are not meaningfully better than simpler, more interpretable
classifiers such as XGBoost. Additionally, our work adds to the current
literature by providing a comprehensive survey of various machine learning
algorithms on disease prediction tasks.
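A compact, full-batch version of the GloVe objective applied to a weighted adjacency matrix might look like the sketch below (the learning rate, dimensions, and synthetic matrix are assumptions; the paper's Code2Vec setup differs in its details):

    import numpy as np

    def glove_embed(A, dim=16, epochs=200, lr=0.05, x_max=100, alpha=0.75):
        # minimize sum f(A_ij) * (w_i . c_j + b_i + b_j - log A_ij)^2
        n = A.shape[0]
        rng = np.random.default_rng(0)
        W = rng.normal(scale=0.1, size=(n, dim))
        C = rng.normal(scale=0.1, size=(n, dim))
        b, c = np.zeros(n), np.zeros(n)
        rows, cols = np.nonzero(A)
        logX = np.log(A[rows, cols])
        f = np.minimum(1.0, (A[rows, cols] / x_max) ** alpha)  # GloVe weighting
        for _ in range(epochs):
            err = (W[rows] * C[cols]).sum(1) + b[rows] + c[cols] - logX
            g = f * err
            dW = g[:, None] * C[cols]
            dC = g[:, None] * W[rows]
            np.add.at(W, rows, -lr * dW)
            np.add.at(C, cols, -lr * dC)
            np.add.at(b, rows, -lr * g)
            np.add.at(c, cols, -lr * g)
        return W + C    # summed vectors, as is common for GloVe

    rng = np.random.default_rng(1)
    A = rng.integers(0, 5, size=(30, 30)).astype(float)   # toy code co-occurrences
    A = np.triu(A, 1); A = A + A.T                        # symmetric adjacency
    emb = glove_embed(A)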
IEEE 2018:
A Framework for Real-Time Spam Detection in Twitter
Abstract: With the increased
popularity of online social networks, spammers find these platforms easily
accessible to trap users in malicious activities by posting spam messages. In this work, we take the Twitter platform and perform spam tweet detection. To stop spammers, Google SafeBrowsing and Twitter's BotMaker tools detect and block spam tweets. These tools can block malicious links; however, they cannot protect the user in real time. Thus, industry and researchers have applied different approaches to make social network platforms spam-free. Some of them are based only on user-based features, while others are based only on tweet-based features. However, there is no comprehensive solution that can consolidate a tweet's text information along with the user-based features. To
solve this issue, we propose a framework which takes the user and tweet based
features along with the tweet text feature to classify the tweets. The benefit
of using tweet text feature is that we can identify the spam tweets even if the
spammer creates a new account, which was not possible with the user-based and tweet-based features alone. We have evaluated our solution with four different machine learning algorithms, namely Support Vector Machine, Neural Network, Random Forest, and Gradient Boosting. With the Neural Network, we achieve an accuracy of 91.65%, surpassing the existing solution [1] by approximately 18%.
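The feature-consolidation idea reduces to concatenating the text vector with the user/tweet features before classification (the toy texts and the two metadata columns below are assumptions):

    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neural_network import MLPClassifier

    texts = ["win a free phone now", "lunch with friends today"] * 50
    meta  = [[10, 5], [300, 0]] * 50      # e.g. followers count, hashtag count
    y     = [1, 0] * 50                   # 1 = spam

    text_feats = TfidfVectorizer().fit_transform(texts)
    X = hstack([text_feats, csr_matrix(meta)])    # text + user/tweet features
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X, y)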
IEEE 2017: NetSpam: a Network-based Spam Detection Framework for Reviews in Online Social Media
IEEE 2017 Data Mining
Abstract: Nowadays, a large part of people rely on available content in social media for their decisions (e.g. reviews and feedback on a topic or product). The possibility that anybody can leave a review provides a golden opportunity for spammers to write spam reviews about products and services for different interests. Identifying these spammers and the spam content is a hot topic of research, and although a considerable number of studies have recently been done toward this end, the methodologies put forth still barely detect spam reviews, and none of them show the importance of each extracted feature type. In this study, we propose a novel framework, named NetSpam, which utilizes spam features to model review datasets as heterogeneous information networks and to map the spam detection procedure into a classification problem in such networks. Using the importance of spam features helps us obtain better results in terms of different metrics on real-world review datasets from the Yelp and Amazon websites.
IEEE 2017: Point-of-interest Recommendation for Location
Promotion in Location-based Social Networks
IEEE 2017 Data Mining
Abstract:
With the wide application of
location-based social networks (LBSNs), point-of-interest (POI) recommendation
has become one of the major services in LBSNs. The behaviors of users in LBSNs mainly consist of checking in at POIs, and these check-in behaviors are influenced by the user's habits and his/her friends. In social networks, social influence is often used to help businesses attract more users. Each target user has a different influence on different POIs in social networks. This paper selects the list of POIs with the greatest influence to recommend to users. Our goals are to satisfy the target user's service need and, simultaneously, to promote businesses' locations (POIs). This paper defines a POI recommendation
problem for location promotion. Additionally, we use submodular properties to
solve the optimization problem. Finally, this paper conducts a comprehensive performance evaluation of our method using two real LBSN datasets. Experimental results show that our proposed method achieves significantly superior POI recommendations compared with other state-of-the-art recommendation approaches in terms of location promotion.
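Because influence-maximization objectives of this kind are monotone submodular, a greedy selector gives the classic (1 - 1/e) approximation; a sketch with a toy coverage-style influence function (the real influence model is the paper's and is not reproduced here):

    def greedy_pois(candidates, influence, k):
        # pick k POIs greedily by marginal influence gain
        chosen = set()
        for _ in range(k):
            best = max((p for p in candidates if p not in chosen),
                       key=lambda p: influence(chosen | {p}) - influence(chosen))
            chosen.add(best)
        return chosen

    users_reached = {'p1': {1, 2}, 'p2': {2, 3}, 'p3': {4}}
    influence = lambda S: len(set().union(*[users_reached[p] for p in S]))
    print(greedy_pois(users_reached, influence, 2))   # {'p1', 'p2'}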
IEEE 2017: SocialQ&A: An Online Social Network Based
Question and Answer System
IEEE 2017 Data Mining
Abstract: Question and Answer (Q&A) systems play a vital role in our daily
life for information and knowledge sharing. Users post questions and pick
questions to answer in the system. Due to the rapidly growing user population
and the number of questions, it is unlikely for a user to stumble by chance upon a question that (s)he can answer. Also, altruism does not encourage
all users to provide answers, not to mention high quality answers with a short
answer wait time. The primary objective of this paper is to improve the
performance of Q&A systems by actively forwarding questions to users who
are capable and willing to answer the questions. To this end, we have designed
and implemented SocialQ&A, an online social network based Q&A system.
SocialQ&A leverages the social network properties of common-interest and mutual-trust friend relationships to identify, through friendship, the users who are most likely to answer a given question, and to enhance user security. We also improve SocialQ&A with security and efficiency enhancements by protecting user privacy and identities, and by retrieving answers automatically for recurrent questions. We describe the architecture and algorithms, and conducted comprehensive large-scale simulations to evaluate SocialQ&A in comparison
with other methods. Our results suggest that social networks can be leveraged to improve answer quality and reduce askers' waiting time. We also implemented a
real prototype of SocialQ&A, and analyze the Q&A behavior of real users
and questions from a small-scale real-world Social Q&A system.
IEEE 2017: Modeling
Urban Behavior by Mining Geotagged Social Data
IEEE 2017 Data Mining
Abstract: Data generated on location-based social networks provide
rich information on the whereabouts of urban dwellers. Specifically, such data
reveal who spends time where, when, and on what type of activity (e.g.,
shopping at a mall, or dining at a restaurant). That information can, in turn,
be used to describe city regions in terms of activity that takes place therein.
For example, the data might reveal that citizens visit one region mainly for
shopping in the morning, while another for dining in the evening. Furthermore,
once such a description is available, one can ask more elaborate questions. For
example, one might ask what features distinguish one region from another – some
regions might be different in terms of the type of venues they host and others
in terms of the visitors they attract. As another example, one might ask which
regions are similar across cities. In this paper, we present a method to answer
such questions using publicly shared Foursquare data. Our analysis makes use of
a probabilistic model, the features of which include the exact location of
activity, the users who participate in the activity, as well as the time of the
day and day of week the activity takes place. Compared to previous approaches
to similar tasks, our probabilistic modeling approach allows us to make minimal
assumptions about the data which relieves us from having to set arbitrary parameters
in our analysis (e.g., regarding the granularity of discovered regions or the
importance of different features). We demonstrate how the model learned with our method can be used to identify the most likely and distinctive features of a geographical area, quantify the importance of the features used in the model, and
discover similar regions across different cities. Finally, we perform an
empirical comparison with previous work and discuss insights obtained through
our findings.
IEEE 2017: SociRank: Identifying and Ranking Prevalent
NewsTopics Using Social Media Factors
IEEE 2017 Data Mining
Abstract: Mass media sources,
specifically the news media, have traditionally informed us of daily events. In
modern times, social media services such as Twitter provide an enormous amount
of user-generated data, which have great potential to contain informative
news-related content. For these resources to be useful, we must find a way to
filter noise and only capture the content that, based on its similarity to the
news media, is considered valuable. However, even after noise is removed,
information overload may still exist in the remaining data—hence, it is
convenient to prioritize it for consumption. To achieve prioritization,
information must be ranked in order of estimated importance considering three
factors. First, the temporal prevalence of a particular topic in the news media
is a factor of importance, and can be considered the media focus (MF) of a
topic. Second, the temporal prevalence of the topic in social media indicates
its user attention (UA). Last, the interaction between the social media users
who mention this topic indicates the strength of the community discussing it,
and can be regarded as the user interaction (UI) toward the topic. We propose
an unsupervised framework—SociRank—which identifies news topics prevalent in
both social media and the news media, and then ranks them by relevance using
their degrees of MF, UA, and UI. Our experiments show that SociRank improves
the quality and variety of automatically identified news topics.
IEEE 2016: SmartCrawler:
A Two-stage Crawler for Efficiently Harvesting Deep-Web Interfaces
IEEE 2016 Data Mining
Abstract—As the deep web grows at a very fast pace, there has been increased interest in techniques that help efficiently locate deep-web interfaces. However, due to the large volume of web resources and the dynamic nature of the deep web, achieving wide coverage and high efficiency is a challenging issue. We propose a two-stage framework, namely SmartCrawler, for efficiently harvesting deep-web interfaces. In the first stage,
SmartCrawler performs site-based searching for center pages with the help of
search engines, avoiding visiting a large number of pages. To achieve more
accurate results for a focused crawl, SmartCrawler ranks websites to prioritize
highly relevant ones for a given topic. In the second stage, SmartCrawler
achieves fast in-site searching by excavating the most relevant links with adaptive link ranking. To eliminate bias on visiting some highly relevant links
in hidden web directories, we design a link tree data structure to achieve
wider coverage for a website. Our experimental results on a set of
representative domains show the agility and accuracy of our proposed crawler
framework, which efficiently retrieves deep-web interfaces from large-scale
sites and achieves higher harvest rates than other crawlers.
IEEE 2016 : Machine
Learning Approach to Forecasting Urban Pollution
IEEE 2016 Data Mining
Abstract—This work addresses the question of how to predict fine
particulate matter given a combination of weather conditions. A compilation of
several years of meteorological data in the city of Quito, Ecuador, are used to
build models using a machine learning approach. The study presents a decision tree algorithm that learns to classify the concentrations of fine aerosols into two categories (>15 µg/m3 vs. ≤15 µg/m3) from a limited number of parameters such as the level of precipitation and the wind speed and direction. Requiring few
rules, the resulting models are able to infer the concentration outcome with
significant accuracy. This fundamental research intends to be a preliminary
step in the development of a web-based platform and smartphone app to alert the
inhabitants of Ecuador’s capital about the risk to human health, with potential
future application in other urban areas.
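A minimal version of such a classifier (synthetic stand-in data; the assumed feature columns are precipitation, wind speed, and wind direction, and the toy labeling rule is invented):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(0)
    X = rng.random((200, 3))                  # precip, wind_speed, wind_dir
    y = (X[:, 0] < 0.3).astype(int)           # toy rule: little rain -> high PM2.5

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)   # few rules, as in the paper
    print(export_text(tree, feature_names=['precip', 'wind_speed', 'wind_dir']))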
IEEE 2016 : FiDoop: Parallel Mining
of Frequent Itemsets Using MapReduce
IEEE 2016 Data Mining
Abstract—Existing parallel mining algorithms for frequent itemsets
lack a mechanism that enables automatic parallelization, load balancing, data
distribution, and fault tolerance on large clusters. As a solution to this
problem, we design a parallel frequent itemsets mining algorithm called FiDoop
using the MapReduce programming model. To achieve compressed storage and avoid
building conditional pattern bases, FiDoop incorporates the frequent items
ultrametric tree, rather than conventional FP trees. In FiDoop, three MapReduce
jobs are implemented to complete the mining task. In the crucial third MapReduce job, the mappers independently decompose itemsets, the reducers perform combination operations by constructing small ultrametric trees, and the actual mining of these trees is performed separately. We implement FiDoop on our in-house
Hadoop cluster. We show that FiDoop on the cluster is sensitive to data
distribution and dimensions, because itemsets with different lengths have
different decomposition and construction costs. To improve FiDoop’s
performance, we develop a workload balance metric to measure load balance
across the cluster’s computing nodes. We develop FiDoop-HD, an extension of
FiDoop, to speed up the mining performance for high-dimensional data analysis.
Extensive experiments using real-world celestial spectral data demonstrate that
our proposed solution is efficient and scalable.
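The decompose-then-combine flow of the third job can be mimicked in a single process (a toy sketch: Hadoop, the ultrametric trees, and FiDoop's actual data structures are not reproduced here):

    from collections import Counter
    from itertools import combinations

    def mapper(transaction, k):
        # decompose a transaction into its k-itemsets (no conditional
        # pattern bases are built, mirroring FiDoop's decomposition idea)
        return [frozenset(c) for c in combinations(sorted(transaction), k)]

    def frequent_itemsets(transactions, k, min_support):
        counts = Counter()                    # "reduce": sum per-itemset counts
        for t in transactions:                # "map": emit k-itemsets
            counts.update(mapper(t, k))
        return {s: n for s, n in counts.items() if n >= min_support}

    txns = [{'a', 'b', 'c'}, {'a', 'c'}, {'b', 'c'}, {'a', 'b', 'c'}]
    print(frequent_itemsets(txns, 2, 2))
    # {frozenset({'a','b'}): 2, frozenset({'a','c'}): 3, frozenset({'b','c'}): 3}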
IEEE 2016 : Inverted
Linear Quadtree Efficient Top K Spatial Keyword Search
IEEE 2016 Data Mining
Abstract: With advances in geo-positioning technologies and geo-location services, there is a rapidly growing amount of spatio-textual objects collected in many applications such as location-based services and social networks, in which an object is described by its spatial location and a set of keywords (terms). Consequently,
the study of spatial keyword search which explores both location and textual
description of the objects has attracted great attention from the commercial
organizations and research communities. In this paper, we study two fundamental
problems in the spatial keyword queries: top k spatial keyword search
(TOPK-SK), and batch top k spatial keyword search (BTOPK-SK). Given a set of
spatio-textual objects, a query location and a set of query keywords, the
TOPK-SK retrieves the closest k objects each of which contains all keywords in
the query. BTOPK-SK is the batch processing of sets of TOPK-SK queries. Based
on the inverted index and the linear quadtree, we propose a novel index structure, called the inverted linear quadtree (IL-Quadtree), which is carefully
designed to exploit both spatial and keyword based pruning techniques to
effectively reduce the search space. An efficient algorithm is then developed
to tackle top k spatial keyword search. To further enhance the filtering capability of the signature of the linear quadtree, we propose a partition-based method. In addition, to deal with BTOPK-SK, we design a new computing paradigm which partitions the queries into groups based on both spatial proximity and the textual relevance between queries. We show that the IL-Quadtree technique can
also efficiently support BTOPK-SK. Comprehensive experiments on real and
synthetic data clearly demonstrate the efficiency of our methods.
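The "linear" in linear quadtree refers to encoding each quadtree cell as a Z-order (Morton) key, so each keyword's posting list can be kept sorted by key; a sketch of the key computation:

    def morton_key(x, y, bits=16):
        # interleave the bits of cell coordinates (x, y) into a Z-order key
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
        return key

    print(morton_key(3, 5))   # cells close in space tend to get close keys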
IEEE 2016 : SPORE :A
Sequential Personalized Spatial Item Recommender System
IEEE 2016 Data Mining
Abstract: With
the rapid development of location-based social networks (LBSNs), spatial item
recommendation has become an important way of helping users discover
interesting locations to increase their engagement with location-based
services. Although human movement exhibits sequential patterns in LBSNs, most
current studies on spatial item recommendations do not consider the sequential
influence of locations. Leveraging sequential patterns in spatial item
recommendation is, however, very challenging, considering 1) users’ check-in
data in LBSNs has a low sampling rate in both space and time, which renders
existing prediction techniques on GPS trajectories ineffective; 2) the
prediction space is extremely large, with millions of distinct locations as the
next prediction target, which impedes the application of classical Markov chain
models; and 3) there is no existing framework that unifies users’ personal
interests and the sequential influence in a principled manner. In light of the
above challenges, we propose a sequential personalized spatial item
recommendation framework (SPORE) which introduces a novel latent variable
topic-region to model and fuse sequential influence with personal interests in
the latent and exponential space. The advantages of modeling the sequential
effect at the topic-region level include a significantly reduced prediction
space, an effective alleviation of data sparsity and a direct expression of the
semantic meaning of users’ spatial activities. Furthermore, we design an
asymmetric Locality Sensitive Hashing (ALSH) technique to speed up the online
top-k recommendation process by extending the traditional LSH. We evaluate the
performance of SPORE on two real datasets and one large-scale synthetic dataset.
The results demonstrate a significant improvement in SPORE’s ability to
recommend spatial items, in terms of both effectiveness and efficiency,
compared with the state-of-the-art methods.
IEEE 2016: Truth
Discovery in Crowdsourced Detection of Spatial Events
IEEE 2016 Data Mining
Abstract: The
ubiquity of smartphones has led to the emergence of mobile crowdsourcing tasks
such as the detection of spatial events when smartphone users move around in
their daily lives. However, the credibility of those detected events can be
negatively impacted by unreliable participants with low-quality data.
Consequently, a major challenge in quality control is to discover true events
from diverse and noisy participants’ reports. This truth discovery problem is
uniquely distinct from its online counterpart in that it involves uncertainties
in both participants’ mobility and reliability. Decoupling these two types of
uncertainties through location tracking will raise severe privacy and energy
issues, whereas simply ignoring missing reports or treating them as negative
reports will significantly degrade the accuracy of the discovered truth. In
this paper, we propose a new method to tackle this truth discovery problem
through principled probabilistic modeling. In particular, we integrate the
modeling of location popularity, location visit indicators, truth of events and
three-way participant reliability in a unified framework. The proposed model is
thus capable of efficiently handling various types of uncertainties and
automatically discovering truth without any supervision or the need of location
tracking. Experimental results demonstrate that our proposed method outperforms
existing state-of-the-art truth discovery approaches in the mobile
crowdsourcing environment.
IEEE 2016 : Sentiment
Analysis of Top Colleges in India Using Twitter Data
IEEE 2016 Data Mining
Abstract: In
today’s world, opinions and reviews accessible to us are one of the most
critical factors in formulating our views and influencing the success of a
brand, product or service. With the advent and growth of social media in the
world, stakeholders often take to expressing their opinions on popular social
media, namely Twitter. While Twitter data is extremely informative, it presents
a challenge for analysis because of its humongous and disorganized nature. This
paper is a thorough effort to dive into the novel domain of performing
sentiment analysis of people’s opinions regarding top colleges in India.
Besides taking additional preprocessing measures like the expansion of net
lingo and removal of duplicate tweets, a probabilistic model based on Bayes’
theorem was used for spelling correction, which is overlooked in other research
studies. This paper also highlights a comparison between the results obtained
by exploiting the following machine learning algorithms: Naïve Bayes, Support Vector Machine, and an artificial neural network model (Multilayer Perceptron). Furthermore, a contrast has been presented between four different kernels of SVM: RBF, linear, polynomial, and sigmoid.
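The four-kernel contrast reduces to a small loop (the toy tweets below stand in for the preprocessed corpus):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    tweets = ["great campus and faculty", "worst hostel food ever"] * 50
    y = [1, 0] * 50                       # positive / negative
    for kernel in ('rbf', 'linear', 'poly', 'sigmoid'):
        model = make_pipeline(TfidfVectorizer(), SVC(kernel=kernel))
        print(kernel, cross_val_score(model, tweets, y, cv=5).mean())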
IEEE 2016 : FRAppE:
Detecting Malicious Facebook Applications
IEEE 2016 Data Mining
Abstract: With
20 million installs a day [1], third-party apps are a major reason for the
popularity and addictiveness of Facebook. Unfortunately, hackers have realized
the potential of using apps for spreading malware and spam. The problem is
already significant, as we find that at least 13% of apps in our dataset are
malicious. So far, the research community has focused on detecting malicious
posts and campaigns. In this paper, we ask the question: given a Facebook
application, can we determine if it is malicious? Our key contribution is in
developing FRAppE—Facebook's Rigorous Application Evaluator—arguably the first
tool focused on detecting malicious apps on Facebook. To develop FRAppE, we use
information gathered by observing the posting behavior of 111K Facebook apps
seen across 2.2 million users on Facebook. First, we identify a set of features
that help us distinguish malicious apps from benign ones. For example, we
find that malicious apps often share names with other apps, and they typically
request fewer permissions than benign apps. Second, leveraging these
distinguishing features, we show that FRAppE can detect malicious apps with
99.5% accuracy, with no false positives and a low false negative rate (4.1%).
Finally, we explore the ecosystem of malicious Facebook apps and identify
mechanisms that these apps use to propagate. Interestingly, we find that many
apps collude and support each other; in our dataset, we find 1,584 apps
enabling the viral propagation of 3,723 other apps through their posts. Long-term,
we see FRAppE as a step towards creating an independent watchdog for app
assessment and ranking, so as to warn Facebook users before installing apps.
IEEE 2016: Practical
Approximate k Nearest Neighbor Queries with Location and Query Privacy
IEEE 2016 Data Mining
Abstract: In
mobile communication, spatial queries pose a serious threat to user location
privacy because the location of a query may reveal sensitive information about
the mobile user. In this paper, we study approximate k nearest neighbor (kNN)
queries where the mobile user queries the location-based service (LBS) provider
about approximate k nearest points of interest (POIs) on the basis of his
current location. We propose a basic solution and a generic solution for the
mobile user to preserve his location and query privacy in approximate kNN
queries. The proposed solutions are mainly built on the Paillier public-key
cryptosystem and can provide both location and query privacy. To preserve query
privacy, our basic solution allows the mobile user to retrieve one type of
POIs, for example, approximate k nearest car parks, without revealing to the
LBS provider what type of points is retrieved. Our generic solution can be
applied to multiple discrete type attributes of private location-based queries.
Compared with existing solutions for kNN queries with location privacy, our
solution is more efficient. Experiments have shown that our solution is
practical for kNN queries.
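The enabling primitive here is Paillier's additive homomorphism; with the python-paillier library it looks as follows (only the primitive is shown, not the full kNN protocol built on top of it):

    from phe import paillier    # python-paillier

    pub, priv = paillier.generate_paillier_keypair(n_length=2048)
    a, b = pub.encrypt(3), pub.encrypt(4)
    assert priv.decrypt(a + b) == 7       # addition on ciphertexts
    assert priv.decrypt(a * 5) == 15      # multiplication by a plaintext scalar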
IEEE 2016: A Novel
Pipeline Approach for Efficient Big Data Broadcasting
IEEE 2016 Data Mining
Abstract: Big-data
computing is a new critical challenge for the ICT industry. Engineers and
researchers are dealing with data sets of petabyte scale in the cloud computing
paradigm. Thus, the demand for building a service stack to distribute, manage,
and process massive data sets has risen drastically. In this paper, we
investigate the Big Data Broadcasting problem for a single source node to
broadcast a big chunk of data to a set of nodes with the objective of
minimizing the maximum completion time. These nodes may locate in the same
data center or across geo-distributed data centers. This problem is one of the
fundamental problems in distributed computing and is known to be NP-hard in
heterogeneous environments. We model the Big-data broadcasting problem into a
LockStep Broadcast Tree (LSBT) problem. The main idea of the LSBT model is to define a basic unit of upload bandwidth, r, such that a node with capacity c broadcasts data to a set of ⌊c/r⌋ children at the rate r. Note that r is a parameter to be optimized as part of the LSBT problem. We further divide the broadcast data into m chunks. These data chunks can then be broadcast down the LSBT in a pipelined manner. In a homogeneous network environment in which each node has the same upload capacity c, we show that the optimal uplink rate r* of LSBT is either c/2 or c/3, whichever gives the smaller maximum completion time. For heterogeneous environments, we present an O(n log^2 n) algorithm to select an optimal uplink rate r* and to construct an optimal LSBT. Numerical results show that our approach performs well, with a smaller maximum completion time and lower computational complexity than other efficient solutions in the literature.
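A simplified completion-time model makes the c/2 vs. c/3 trade-off concrete (the formula below is an idealized pipelined-broadcast estimate under assumptions, not the paper's exact analysis):

    import math

    def lsbt_time(n, c, r, D, m):
        # each node serves floor(c/r) children at rate r; m chunks of size
        # D/m flow down a tree of depth d in a pipelined fashion
        fanout = int(c / r + 1e-9)            # floor(c/r), guarding round-off
        d = math.ceil(math.log(n, fanout))    # tree depth (fanout >= 2 assumed)
        return (m + d - 1) * (D / m) / r

    n, c, D, m = 1000, 100.0, 1e9, 50
    for r in (c / 2, c / 3):                  # the two candidate optimal rates
        print(r, lsbt_time(n, c, r, D, m))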
IEEE 2016 : Nearest
Keyword Set Search in Multi-dimensional Datasets
IEEE 2016 Data Mining
Abstract: Keyword-based
search in text-rich multi-dimensional datasets facilitates many novel
applications and tools. In this paper, we consider objects that are tagged with
keywords and are embedded in a vector space. For these datasets, we study
queries that ask for the tightest groups of points satisfying a given set of
keywords. We propose a novel method called ProMiSH (Projection and Multi Scale
Hashing) that uses random projection and hash-based index structures, and
achieves high scalability and speedup. We present an exact and an approximate
version of the algorithm. Our experimental results on real and synthetic
datasets show that ProMiSH achieves up to 60 times speedup over state-of-the-art
tree-based techniques.
IEEE 2016: Mining
User-Aware Rare Sequential Topic Patterns in Document Streams
IEEE 2016 Data Mining
Abstract: Textual
documents created and distributed on the Internet are ever changing in various
forms. Most existing works are devoted to topic modeling and the evolution
of individual topics, while sequential relations of topics in successive
documents published by a specific user are ignored. In this paper, in order to
characterize and detect personalized and abnormal behaviors of Internet users, we
propose Sequential Topic Patterns (STPs) and formulate the problem of mining
User-aware Rare Sequential Topic Patterns (URSTPs) in document streams on the
Internet. Such patterns are rare on the whole but relatively frequent for specific users, so they can be applied in many real-life scenarios, such as real-time
monitoring on abnormal user behaviors. We present a group of algorithms to
solve this innovative mining problem through three phases: preprocessing to
extract probabilistic topics and identify sessions for different users,
generating all the STP candidates with (expected) support values for each user
by pattern-growth, and selecting URSTPs by making user-aware rarity analysis on
derived STPs. Experiments on both real (Twitter) and synthetic datasets show
that our approach can indeed discover special users and interpretable URSTPs
effectively and efficiently, which significantly reflect users’
characteristics.
IEEE 2016: Cross-Platform
Identification of Anonymous Identical Users in Multiple Social Media Networks
IEEE 2016 Data Mining
Abstract: The last few years have witnessed the emergence and evolution of a vibrant research stream on a large variety of online Social Media Network (SMN) platforms. Recognizing anonymous, yet identical, users among multiple SMNs is still an intractable problem. Clearly, cross-platform exploration may help solve many problems in social computing, in both theory and applications. Since public profiles can be duplicated and easily impersonated by users with different purposes, most current user identification resolutions, which mainly focus on text mining of users' public profiles, are fragile. Some studies have attempted to match users based on the location and timing of user content as well as writing style. However, locations are sparse in the majority of SMNs, and writing style is difficult to discern from the short sentences of leading SMNs such as Sina Microblog and Twitter. Moreover, since online SMNs are quite symmetric, existing user identification schemes based on network structure are not effective. The real-world friend cycle is highly individual, and virtually no two users share a congruent friend cycle. Therefore, it is more accurate to use a friendship structure to analyze cross-platform SMNs. Since identical users tend to set up partially similar friendship structures in different SMNs, we propose the Friend Relationship-Based User Identification (FRUI) algorithm. FRUI calculates a match degree for all candidate User Matched Pairs (UMPs), and only UMPs with top ranks are considered identical users. We also developed two propositions to improve the efficiency of the algorithm. Results of extensive experiments demonstrate that FRUI performs much better than current network structure-based algorithms.
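The core of FRUI's match degree is easy to state: count a candidate pair's friends that have already been matched to each other (toy data below; the iterative confirm-and-update loop and the two speed-up propositions are omitted):

    def match_degree(u, v, friends_a, friends_b, identified):
        # identified: already-matched nodes of SMN A -> nodes of SMN B
        return sum(1 for f in friends_a[u]
                   if f in identified and identified[f] in friends_b[v])

    friends_a = {'u1': ['u2', 'u3']}
    friends_b = {'v1': ['v2', 'v9']}
    identified = {'u2': 'v2', 'u3': 'v7'}
    print(match_degree('u1', 'v1', friends_a, friends_b, identified))   # 1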
IEEE 2015 : Entity Linking with a Knowledge Base: Issues, Techniques, and
Solutions
IEEE 2015 Transaction on Data Mining
ABSTRACT : The large number of potential applications of bridging Web data with knowledge bases has led to an increase in entity linking research. Entity linking is the task of linking entity mentions in text with their corresponding entities in a knowledge base. Potential applications include
information extraction, information retrieval, and knowledge base population.
However, this
task is challenging due to name
variations and entity ambiguity. In this survey, we present a thorough overview
and analysis
of the main approaches to entity
linking, and discuss various applications, the evaluation of entity linking
systems, and future directions.
IEEE 2015 : Rule-Based Method for Entity Resolution
IEEE 2015 Transaction on Data Mining
ABSTRACT :The objective of entity resolution (ER) is to identify
records referring to the same real-world entity. Traditional ER approaches identify records based on pairwise similarity
comparisons, which assumes that records referring to the same entity are more similar to each other than records referring to different entities. However, this assumption does not always hold in practice, and similarity comparisons do not work well when it breaks. We propose a new class of
rules which could describe the complex matching conditions between records and entities. Based on this class of rules, we
present the rule-based entity resolution problem and develop an on-line approach for ER. In this framework, by applying rules to each
record, we identify which entity the record refers to. Additionally, we propose an effective and efficient rule discovery algorithm. We
experimentally evaluated our rule-based ER algorithm on real data sets. The experimental results show that both our rule discovery
algorithm and rule-based ER algorithm can achieve high performance.
IEEE 2015 : Secure Distributed Deduplication Systems with Improved Reliability
IEEE 2015 Transaction on Data Mining
ABSTRACT :Data
deduplication is a technique for eliminating duplicate copies of data, and has
been widely used in cloud storage to reduce
storage space and upload bandwidth. However, there is only one copy of each file stored in the cloud even if such a file is owned by a huge number of users. As a result, deduplication systems improve storage utilization while reducing reliability. Furthermore, the challenge of privacy for sensitive data also arises when it is outsourced by users to the cloud. Aiming to address
the above security challenges, this paper makes
the first attempt to formalize the notion of a distributed reliable deduplication
system. We propose new distributed
deduplication systems with higher reliability in which the data chunks are
distributed across multiple cloud servers. The
security requirements of data confidentiality and tag consistency are also
achieved by introducing a deterministic secret sharing scheme in distributed storage systems,
instead of using convergent encryption as in previous deduplication systems.
Security analysis demonstrates that
our deduplication systems are secure in terms of the definitions specified in
the proposed security model. As a proof of
concept, we implement the proposed systems and demonstrate that the incurred
overhead is very limited in realistic environments.
IEEE 2015 : The Internet of Things for Health Care: A Comprehensive Survey
IEEE 2015 Transaction on Data Mining
ABSTRACT : The Internet of Things (IoT)
makes smart objects the ultimate building blocks in the development
of cyber-physical smart pervasive frameworks. The IoT has a variety of
application domains, including
health care. The IoT revolution is redesigning modern health care with
promising technological, economic,
and social prospects. This paper surveys advances in IoT-based health care
technologies and reviews the
state-of-the-art network architectures/platforms, applications, and industrial
trends in IoT-based health care solutions. In
addition, this paper analyzes distinct IoT security and privacy features,
including security requirements, threat models, and attack taxonomies from the
health care perspective. Further, this paper proposes an intelligent collaborative security model to minimize
security risk; discusses how different innovations such as big data, ambient intelligence, and wearables can be leveraged
in a health care context; addresses various IoT and
eHealth policies and regulations across the world to determine how they can facilitate
economies and societies in terms of sustainable development; and provides some
avenues for future research on
IoT-based health care based on a set of open issues and challenges.
IEEE 2015 : Friendbook: A Semantic-based Friend Recommendation System for
Social Networks
IEEE 2015 Transaction on Data Mining
Abstract—Existing social networking
services recommend friends to users based on their social graphs, which may not
be the most appropriate to reflect a user’s preferences on friend selection in real
life. In this paper, we present Friendbook, a novel semantic-based friend recommendation system for social
networks, which recommends friends to users based on their life styles instead
of social graphs. By taking
advantage of sensor-rich smartphones, Friendbook discovers life styles of users
from user-centric sensor data, measures
the similarity of life styles between users, and recommends friends to users if
their life styles have high similarity. Inspired by text mining, we model a user’s daily life as life documents, from
which his/her life styles are extracted by using the Latent Dirichlet Allocation algorithm. We further propose a
similarity metric to measure the similarity of life styles between users, and
calculate users’ impact in terms
of life styles with a friend-matching graph. Upon receiving a request,
Friendbook returns a list of people with highest recommendation scores to the query user. Finally, Friendbook integrates a
feedback mechanism to further improve the recommendation accuracy. We have implemented Friendbook on Android-based smartphones, and evaluated its performance in both small-scale experiments and large-scale simulations. The
results show that the recommendations accurately reflect the preferences of
users in
choosing friends.
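The life-document/LDA step can be sketched with gensim (the activity "words" below are invented stand-ins for discretized sensor readings):

    from gensim import corpora, models

    life_docs = [['walking', 'music', 'office'], ['gym', 'music', 'coffee'],
                 ['office', 'meeting', 'coffee']] * 10   # one doc per user-day
    dictionary = corpora.Dictionary(life_docs)
    corpus = [dictionary.doc2bow(doc) for doc in life_docs]
    lda = models.LdaModel(corpus, num_topics=5, id2word=dictionary)
    styles = [lda.get_document_topics(bow, minimum_probability=0.0)
              for bow in corpus]    # life-style mixtures, comparable by cosine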
IEEE 2015 : Privacy-Preserving Detection of Sensitive Data Exposure
IEEE 2015 Transaction on Data Mining
Abstract—Statistics from security firms, research institutions and government organizations show that the number of data-leak instances has grown rapidly in recent years. Among various data-leak cases, human mistakes are one of the main causes of data loss. There exist solutions that detect inadvertent sensitive data leaks caused by human mistakes and provide alerts for organizations. A common approach is to screen content in storage and transmission for exposed
sensitive information. Such an
approach usually requires the detection operation to be conducted in secrecy. However, this
secrecy requirement is challenging
to satisfy in practice, as detection servers may be compromised or outsourced. In this paper, we
present a privacy-preserving data-leak detection (DLD) solution to solve the issue, in which a special set of sensitive data digests is used in detection.
The advantage of our method is that it
enables the data owner to safely delegate the detection
operation to a semi-honest provider without
revealing the sensitive data to the provider. We describe how Internet service providers can offer
their customers DLD as an add-on
service with strong privacy guarantees. The evaluation results show that our method can support
accurate detection with a very small number of false alarms under various data-leak scenarios.
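The digest idea can be conveyed with a toy shingle-and-hash sketch (this is not the paper's fuzzy-fingerprint construction, only the intuition that the provider matches released digests rather than plaintext):

    import hashlib

    def digests(data: bytes, n=8, keep=4):
        # hash every n-byte window and release only a small, fixed sample
        hs = {hashlib.sha256(data[i:i + n]).digest()[:8]
              for i in range(len(data) - n + 1)}
        return set(sorted(hs)[:keep])

    def leak_score(traffic: bytes, digest, n=8):
        seen = {hashlib.sha256(traffic[i:i + n]).digest()[:8]
                for i in range(len(traffic) - n + 1)}
        return len(digest & seen) / len(digest)

    d = digests(b"card number 1234-5678-9012")
    print(leak_score(b"log: sent 1234-5678-9012 to host", d))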
IEEE 2015 : Constructing
a Global Social Service Network for
Better Quality of Web Service Discovery
IEEE 2015 Transaction on Data Mining
Abstract—Web services have had a
tremendous impact on the Web for supporting a distributed service-based economy
on a
global scale. However, despite the outstanding progress, their uptake on a Web
scale has been significantly less than initially anticipated. The isolation of services and the lack of social
relationships among related services have been identified as reasons for the poor uptake. In this paper,
we propose connecting the isolated service islands into a global social service
network to enhance the services’
sociability on a global scale. First, we propose linked social service-specific
principles based on linked data
principles for publishing services on the open Web as linked social services.
Then, we suggest a new framework for
constructing the global social service network following linked social
service-specific principles based on complex network theories.
Next, an approach is proposed to enable the exploitation of the global social
service network, providing Linked Social Services as a Service. Finally, experimental results show that our approach can address the quality problem of service discovery, improving both the service discovery time and the success rate by exploring service-to-service relationships in the global social service network.
IEEE 2015 : PAGE: A Partition Aware Engine for Parallel Graph
Computation
IEEE 2015 Transaction on Data Mining
Abstract—Graph partition quality affects
the overall performance of parallel graph computation systems. The quality of a
graph partition
is measured by the balance factor and edge cut ratio. A balanced graph
partition with small edge cut ratio is generally preferred since it reduces the
expensive network communication cost. However, according to an empirical study on Giraph, the performance over a well-partitioned graph might be even two times worse than over simple random partitions. This is because these systems only optimize for simple partition strategies and cannot efficiently handle the increasing
workload of local message processing when a high quality graph
partition is used. In this paper, we propose a novel partition aware graph
computation engine named PAGE, which is equipped with a new message processor and a dynamic concurrency control model.
The new message processor
concurrently processes local and remote messages in a unified way. The dynamic
model adaptively adjusts the concurrency
of the processor based on the online statistics. The experimental evaluation
demonstrates the superiority of PAGE over the graph partitions with various qualities.
IEEE 2015 Identity-based Encryption with Outsourced
Revocation in Cloud Computing
IEEE 2015 Transaction on Data Mining
Abstract—Identity-Based Encryption (IBE), which simplifies public key and certificate management at a Public Key Infrastructure (PKI), is an important alternative to public key encryption. However, one of the main
efficiency drawbacks of IBE is the overhead computation at Private Key
Generator (PKG) during user revocation. Efficient revocation has been
well studied in traditional PKI setting, but the cumbersome
management of certificates is precisely the burden that IBE strives to
alleviate. In this paper, aiming at tackling the critical issue
of identity revocation, we introduce outsourcing computation
into IBE for the first time and propose a
revocable IBE scheme in the server-aided setting. Our scheme
offloads most of the key generation related operations during
key-issuing and key-update processes to a Key Update Cloud
Service Provider, leaving only a constant number of simple
operations for PKG and users to perform locally. This goal is achieved by
utilizing a novel collusion-resistant technique: we employ a hybrid
private key for each user, in which an AND gate is involved to connect and bind the identity component and the time component.
Furthermore, we propose another construction which is provably secure under the recently formalized Refereed Delegation of Computation
model. Finally, we provide extensive experimental results to
demonstrate the efficiency of our proposed construction.
IEEE 2015 : Co-Extracting Opinion Targets and Opinion Words from Online Reviews Based on the Word Alignment Model
IEEE 2015 Transaction on Data Mining
Abstract—Mining opinion targets and
opinion words from online reviews are important tasks for fine-grained opinion
mining, the key component of which involves detecting opinion
relations among words. To this end, this paper proposes a novel approach based on the
partially-supervised alignment model, which regards identifying opinion
relations as an alignment process. Then, a graph-based
co-ranking algorithm is exploited to estimate the confidence of each candidate.
Finally, candidates with higher confidence are extracted as
opinion targets or opinion words. Compared to previous methods based on the
nearest-neighbor rules, our model captures opinion
relations more precisely, especially for long-span relations. Compared to
syntax-based methods, our word alignment model
effectively alleviates the negative effects of parsing errors when dealing with
informal online texts. In particular, compared to the
traditional unsupervised alignment model, the proposed model obtains better
precision because of the usage of partial
supervision. In addition, when estimating candidate confidence, we penalize
higher-degree vertices in our graph-based co-ranking
algorithm to decrease the probability of error generation. Our experimental
results on three corpora with different sizes and languages
show that our approach effectively outperforms state-of-the-art methods.
IEEE 2015 : Research Directions for
Engineering Big Data Analytics Software
IEEE 2015 Transaction on Data Mining
Abstract: Many software startups and research and development efforts are actively working to harness the power of big data and create software with the potential to improve almost every aspect of human life. As
these efforts continue to increase, full consideration needs to be given to engineering aspects
of big data software. Since these systems exist to make predictions on complex and
continuous massive datasets, they pose unique problems during specification, design, and
verification of software that needs to be delivered on-time and within budget.
But, given the nature of
big data software, can this be done? Does big data software engineering really work? This
article explores details of big data software, discusses the main problems encountered when
engineering big data software, and proposes avenues for future research.
IEEE 2015 : Massive
MIMO as a Big Data System: Random Matrix Models and Testbed
IEEE 2015 Transaction on Data Mining
ABSTRACT : The paper has two parts. The first deals with
how to use large random matrices as building blocks to model the massive data arising from massive (or large-scale) MIMO systems. We then apply this model
to distributed spectrum sensing and network monitoring. This part boils down to handling streaming, distributed
massive data, for which a new algorithm is obtained and its performance is derived using a central limit theorem recently
established in the literature. The second part deals with a large-scale testbed built from software-defined radios (particularly USRPs), which
took us more than four years to develop into a 70-node network testbed. To demonstrate the power of software-defined radio,
we quickly reconfigured our testbed into a testbed for massive MIMO. The massive data from this testbed is of central interest in
this paper, and this is the first time we have modeled the experimental data arising from it. To the best of our
knowledge, there is no similar work.
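As one concrete instance of the random-matrix viewpoint, the sketch below implements a standard eigenvalue-ratio spectrum-sensing test; the detector, threshold, and array dimensions are illustrative assumptions, not the paper's specific algorithm.

import numpy as np

def eigenvalue_ratio_detect(Y, threshold=6.0):
    # Y: antennas x snapshots matrix of received samples from the large array.
    R = Y @ Y.conj().T / Y.shape[1]        # sample covariance matrix
    eigs = np.linalg.eigvalsh(R)           # eigenvalues in ascending order
    return eigs[-1] / eigs[0] > threshold  # large spread suggests a signal

rng = np.random.default_rng(0)
noise = rng.standard_normal((64, 500))                 # noise-only snapshots
signal = np.outer(rng.standard_normal(64), rng.standard_normal(500))
print(eigenvalue_ratio_detect(noise))                  # expected: False
print(eigenvalue_ratio_detect(noise + signal))         # expected: True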
IEEE 2015 : A
Secure and Dynamic Multi-keyword Ranked Search Scheme over Encrypted Cloud
Data
IEEE 2015 Transactions on Data Mining
ABSTRACT : Due to the
increasing popularity of cloud computing, more and more data owners are
motivated to outsource their data to cloud servers for great convenience and reduced cost in
data management. However, sensitive data should be encrypted before outsourcing for privacy requirements, which obsoletes data utilization
like keyword-based document retrieval. In this paper, we present a secure multi-keyword ranked search scheme over encrypted cloud data,
which simultaneously supports dynamic update operations like deletion and insertion of documents. Specifically, the vector space
model and the widely-used TF-IDF
model are combined in the index construction and
query generation. We construct a special tree-based index structure and propose
a “Greedy Depth-first Search” algorithm to provide
efficient multi-keyword ranked search. The secure kNN algorithm is utilized to
encrypt the index and query vectors, and meanwhile ensure
accurate relevance score calculation between encrypted index and query vectors.
In order to resist statistical attacks, phantom terms are added to the index
vector to blind search results. Due to the use of our special tree-based index structure, the proposed scheme can achieve sub-linear search time
and deal with the deletion and insertion of documents flexibly. Extensive experiments are conducted to demonstrate the efficiency of the
proposed scheme.
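For the plaintext half of the pipeline, here is a minimal sketch of the vector space model with TF-IDF weighting and inner-product relevance scoring; the toy corpus is an assumption, and the actual scheme encrypts these vectors with the secure kNN technique and stores them in the tree index.

import math
from collections import Counter

docs = {"d1": "cloud data search cloud", "d2": "secure keyword search"}
vocab = sorted({w for text in docs.values() for w in text.split()})
df = {w: sum(w in text.split() for text in docs.values()) for w in vocab}

def tfidf(text):
    tf = Counter(text.split())
    # idf = log(N / df); words outside the vocabulary are simply ignored
    return [tf[w] * math.log(len(docs) / df[w]) for w in vocab]

index = {d: tfidf(text) for d, text in docs.items()}  # plaintext index vectors

def ranked_search(query):
    q = tfidf(query)
    scores = {d: sum(a * b for a, b in zip(v, q)) for d, v in index.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(ranked_search("cloud search"))  # d1 outranks d2 on the "cloud" term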
IEEE 2015 : Generating
Searchable Public-Key Ciphertexts with Hidden Structures for Fast Keyword
Search
IEEE 2015 Transactions on Data Mining
ABSTRACT : Existing semantically secure public-key searchable encryption
schemes take search time linear in the total number of ciphertexts.
This makes retrieval from large-scale databases prohibitive. To alleviate
this problem, this paper proposes Searchable Public-Key Ciphertexts
with Hidden Structures (SPCHS) for keyword search as fast as possible
without sacrificing semantic security of the encrypted keywords. In SPCHS,
all keyword-searchable ciphertexts are structured by hidden relations, and
with the search trapdoor corresponding to a keyword, the minimum
information of the relations is disclosed to a search algorithm as the
guidance to find all matching ciphertexts efficiently. We construct
an SPCHS scheme from scratch in which the ciphertexts have a hidden
star-like structure. We prove our scheme to be semantically secure in the
Random Oracle (RO) model. The search complexity of our scheme is dependent
on the actual number of the ciphertexts containing the queried keyword,
rather than the number of all ciphertexts. Finally, we present a generic
SPCHS construction from anonymous identity-based encryption and collision-free
full-identity malleable Identity-Based Key Encapsulation Mechanism (IBKEM)
with anonymity. We illustrate two collision-free full-identity malleable
IBKEM instances, which are semantically secure and anonymous,
respectively, in the RO and standard models. The latter instance
enables us to construct an SPCHS scheme with semantic security in the
standard model.
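A toy illustration of the hidden-structure idea, with hashes standing in for the scheme's public-key components (an assumption for readability): each keyword's ciphertexts form a hidden chain, and the trapdoor reveals only the chain head, so search cost tracks the number of matches rather than the database size.

import hashlib, secrets

def H(*parts):
    return hashlib.sha256("|".join(map(str, parts)).encode()).hexdigest()

store = {}   # pointer -> (payload, next_pointer); the structure stays hidden
heads = {}   # per-keyword chain state kept only by the encryptor

def encrypt(keyword, payload):
    prev = heads.get(keyword, H("head", keyword))
    nxt = H(prev, secrets.token_hex(8))      # fresh, unlinkable hidden pointer
    store[prev] = (payload, nxt)             # (payloads would themselves be
    heads[keyword] = nxt                     #  encrypted in the real scheme)

def search(trapdoor):
    ptr, out = trapdoor, []
    while ptr in store:                      # walk the chain of matches only
        payload, ptr = store[ptr]
        out.append(payload)
    return out

for kw, doc in [("urgent", "doc1"), ("misc", "doc2"), ("urgent", "doc3")]:
    encrypt(kw, doc)
print(search(H("head", "urgent")))           # -> ['doc1', 'doc3']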
IEEE 2015 : k-Nearest
Neighbor Classification over Semantically Secure Encrypted Relational Data
IEEE 2015 Transactions on Data Mining
ABSTRACT : Data
Mining has wide applications in many areas such as banking, medicine,
scientific research, and government agencies. Classification is one
of the commonly used tasks in data mining applications. For the past decade,
due to the rise of various privacy issues, many theoretical and practical
solutions to the classification problem have been proposed under different
security models. However, with the recent popularity of cloud computing, users
now have the opportunity to outsource their data, in encrypted form, as
well as the data mining tasks to the cloud. Since the data on the cloud is in
encrypted form, existing privacy-preserving classification techniques are
not applicable. In this paper, we focus on solving the classification problem over encrypted data. In particular, we propose a
secure k-NN classifier over encrypted data in the cloud. The proposed protocol protects
the confidentiality of data, privacy of user’s input query, and hides the data
access patterns. To the best of our knowledge, our work is the first to
develop a secure k-NN classifier over encrypted data under the semi-honest
model. Also, we empirically analyze the efficiency of our proposed
protocol using a real-world dataset under different parameter settings.
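The plaintext computation the protocol emulates is ordinary k-NN majority voting, sketched below; the paper's contribution is performing this jointly over encrypted data without revealing the query or access patterns, which this plaintext baseline does not attempt.

from collections import Counter

def knn_classify(points, labels, query, k=3):
    # points: numeric tuples; labels: class label per point.
    nearest = sorted(range(len(points)),
                     key=lambda i: sum((a - b) ** 2
                                       for a, b in zip(points[i], query)))
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]        # majority class among k nearest

pts = [(1, 1), (2, 1), (8, 9), (9, 8), (1, 2)]
lbl = ["low", "low", "high", "high", "low"]
print(knn_classify(pts, lbl, (2, 2)))        # -> "low"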
IEEE 2015 : Towards
Effective Bug Triage with Software Data Reduction Techniques
IEEE 2015 Transactions on Data Mining
ABSTRACT : Software
companies spend over 45 percent of their cost on dealing with software bugs. An
inevitable step of fixing bugs is bug triage, which aims to correctly
assign a developer to a new bug. To decrease the time cost in manual work,
text classification techniques are applied to conduct automatic bug
triage. In this paper, we address the problem of data reduction for bug
triage, i.e., how to reduce the scale and improve the quality of bug data. We
combine instance selection with feature selection to simultaneously reduce
data scale on the bug dimension and the word dimension. To determine the order
of applying instance selection and feature selection, we extract
attributes from historical bug data sets and build a predictive model for
a new bug data set. We empirically investigate the performance of data
reduction on a total of 600,000 bug reports of two large open source
projects, namely Eclipse and Mozilla. The results show that our data reduction
can effectively reduce the data scale and improve the accuracy of bug
triage. Our work provides an approach to leveraging techniques on data
processing to form reduced and high-quality bug data in software
development and maintenance.
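A compact sketch of the two reductions on toy data; the word-frequency heuristic and thresholds are illustrative assumptions, and the paper additionally learns from historical bug data which reduction to apply first.

from collections import Counter

reports = [("crash on startup null pointer", "alice"),
           ("ui ui ui", "bob"),
           ("null pointer when saving file", "alice")]

# Feature selection (word dimension): keep words appearing in several reports.
word_freq = Counter(w for text, _ in reports for w in set(text.split()))
keep = {w for w, c in word_freq.items() if c >= 2}

def reduce_words(text):
    return " ".join(w for w in text.split() if w in keep)

# Instance selection (bug dimension): drop reports left nearly empty.
reduced = [(reduce_words(t), dev) for t, dev in reports]
reduced = [(t, dev) for t, dev in reduced if len(t.split()) >= 2]
print(reduced)   # smaller, higher-quality training set for the triage classifier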
IEEE 2015 : Self-Organizing Neural Networks Integrating Domain Knowledge
and Reinforcement Learning
IEEE 2015 Transactions on Data Mining
ABSTRACT : The use of domain knowledge in learning systems is expected
to improve learning efficiency and reduce model complexity. However, due
to the incompatibility between the knowledge structure of the learning systems
and the real-time, exploratory nature of reinforcement learning (RL), domain
knowledge cannot be inserted directly. In this paper, we show how
self-organizing neural networks designed for online and incremental
adaptation can integrate domain knowledge and RL.
Specifically, symbol-based domain knowledge is translated into
numeric patterns before inserting into the self-organizing neural
networks. To ensure effective use of domain knowledge, we present
an analysis of how the inserted knowledge is used by the
self-organizing neural networks during RL. To this end, we propose
a vigilance adaptation and greedy exploitation strategy to
maximize exploitation of the inserted domain knowledge while retaining the plasticity
of learning and using new knowledge. Our experimental results based on the
pursuit-evasion and minefield navigation problem domains show that
such self-organizing neural networks can make effective use of domain
knowledge to improve learning efficiency and reduce model complexity.
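A toy sketch of the insertion-plus-exploitation idea in a tabular setting, standing in for the self-organizing network; the states, rules, rewards, and epsilon schedule are illustrative assumptions.

import random

# Symbolic domain knowledge translated into numeric form: pre-seeded values.
Q = {("near_mine", "retreat"): 1.0, ("target_ahead", "advance"): 1.0}
actions = ["advance", "retreat", "wait"]

def act(state, epsilon):
    # Greedy exploitation favors inserted knowledge; epsilon keeps plasticity.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def update(state, action, reward, alpha=0.1):
    key = (state, action)
    Q[key] = Q.get(key, 0.0) + alpha * (reward - Q.get(key, 0.0))

for step in range(100):                      # toy interaction loop
    s = random.choice(["near_mine", "target_ahead"])
    a = act(s, epsilon=max(0.05, 0.5 - step * 0.01))   # decaying exploration
    good = (s, a) in {("near_mine", "retreat"), ("target_ahead", "advance")}
    update(s, a, 1.0 if good else -0.1)
print(act("near_mine", epsilon=0.0))         # -> "retreat"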
IEEE 2015 : Active
Learning for Ranking through Expected Loss Optimization
IEEE 2015 Transactions on Data Mining
ABSTRACT : Learning to rank arises in many information retrieval
applications, ranging from Web search engine, online advertising to
recommendation system. In learning to rank, the performance of a ranking model
is strongly affected by the number of labeled examples in the training set;
on the other hand, obtaining labeled examples for training data is very
expensive and time-consuming. This presents a great need for the active
learning approaches to select most informative examples for ranking learning;
however, in the literature there is still very limited work to address
active learning for ranking. In this paper, we propose a general active
learning framework, Expected Loss Optimization (ELO), for ranking. The ELO
framework is applicable to a wide range of ranking functions. Under this
framework, we derive a novel algorithm, Expected DCG Loss Optimization
(ELO-DCG), to select most informative examples. Furthermore, we
investigate both query and document level active learning for ranking and
propose a two-stage ELO-DCG algorithm which incorporates both query and
document selection into active learning. Extensive experiments on real-world Web
search data sets have demonstrated the great potential and effectiveness
of the proposed framework and algorithms.
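A minimal sketch of expected-loss selection for ranking; the independent Bernoulli relevance model and the toy queries are illustrative assumptions, whereas ELO-DCG derives the expected DCG loss from the ranking function itself.

import math
from itertools import product

def dcg(rels):
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def expected_dcg_loss(probs):
    # probs: model's P(relevant) per document, listed in the model's rank order.
    loss = 0.0
    for labels in product([0, 1], repeat=len(probs)):      # possible labelings
        p = math.prod(q if l else 1 - q for q, l in zip(probs, labels))
        ideal = dcg(sorted(labels, reverse=True))          # best achievable DCG
        loss += p * (ideal - dcg(labels))                  # regret of our order
    return loss

queries = {"q1": [0.9, 0.8, 0.1], "q2": [0.5, 0.5, 0.5]}
pick = max(queries, key=lambda q: expected_dcg_loss(queries[q]))
print(pick)  # -> "q2": the uncertain query is the most informative to label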
IEEE 2015 : Discover the Expert: Context-Adaptive Expert Selection
for Medical Diagnosis
IEEE 2015 Transactions on Data Mining
ABSTRACT : In this paper,
we propose an expert selection system that learns online the best expert to
assign to each patient depending on the context of the patient. In
general, the context can include an enormous number and variety of
information related to the patient's health condition, age, gender, previous
drug doses, and so forth, but the most relevant information is embedded in
only a few contexts. If these most relevant contexts were known in
advance, learning would be relatively simple but they are not. Moreover, the
relevant contexts may be different for different health conditions. To
address these challenges, we develop a new class of algorithms aimed at
discovering the most relevant contexts and the best clinic and expert to use to
make a diagnosis given a patient's contexts. We prove that as the number
of patients grows, the proposed context-adaptive algorithm will discover
the optimal expert to select for patients with a specific context.
Moreover, the algorithm also provides confidence bounds on the diagnostic
accuracy of the expert it selects, which can be considered by the primary
care physician before making the final decision. While our algorithm is
general and can be applied in numerous medical scenarios, we illustrate its
functionality and performance by applying it to a real-world breast cancer
diagnosis data set. Finally, while the application we consider in this paper
is medical diagnosis, our proposed algorithm can be applied in other
environments where expertise needs to be discovered.
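A small sketch of the selection loop as a per-context upper-confidence-bound bandit; the contexts, experts, and simulated accuracies are illustrative assumptions, and the paper's algorithm additionally discovers which context dimensions are relevant.

import math, random

experts = ["clinicA", "clinicB"]
true_acc = {("elderly", "clinicA"): 0.9, ("elderly", "clinicB"): 0.6,
            ("young", "clinicA"): 0.55, ("young", "clinicB"): 0.85}
counts, wins = {}, {}

def choose(ctx, t):
    def ucb(e):
        n = counts.get((ctx, e), 0)
        if n == 0:
            return float("inf")            # try every expert once per context
        mean = wins.get((ctx, e), 0) / n
        return mean + math.sqrt(2 * math.log(t + 1) / n)   # confidence bound
    return max(experts, key=ucb)

random.seed(1)
for t in range(2000):
    ctx = random.choice(["elderly", "young"])
    e = choose(ctx, t)
    correct = random.random() < true_acc[(ctx, e)]   # simulated diagnosis
    counts[(ctx, e)] = counts.get((ctx, e), 0) + 1
    wins[(ctx, e)] = wins.get((ctx, e), 0) + correct
print(choose("elderly", 2000), choose("young", 2000))  # the learned best experts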
IEEE 2015 : Privacy-Preserving
Detection of Sensitive Data Exposure
IEEE 2015 Transactions on Data Mining
ABSTRACT : Statistics from security firms, research institutions and
government organizations show that the number of data-leak instances has
grown rapidly in recent years. Among various data-leak cases, human
mistakes are one of the main causes of data loss. Solutions exist that
detect inadvertent sensitive data leaks caused by human mistakes and
provide alerts for organizations. A common approach is to screen
content in storage and transmission for exposed sensitive
information. Such an approach usually requires the detection operation
to be conducted in secrecy. However, this secrecy requirement
is challenging to satisfy in practice, as detection servers may
be compromised or outsourced. In this paper, we present a
privacy-preserving data-leak detection (DLD) solution to solve the
issue where a special set of sensitive data digests is used in
detection. The advantage of our method is that it enables the data owner
to safely delegate the detection operation to a semi-honest
provider without revealing the sensitive data to the provider. We
describe how Internet service providers can offer their customers DLD
as an add-on service with strong privacy guarantees. The
evaluation results show that our method can support accurate
detection with a very small number of false alarms under various
data-leak scenarios.
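A toy sketch of digest-based detection, with SHA-256 shingles standing in for the paper's special digests (which are designed to reveal even less): the owner releases only hashed n-grams, and the semi-honest provider matches traffic against them without seeing the sensitive text. The shingle size and alert threshold are illustrative assumptions.

import hashlib

def shingles(text, n=8):
    return {hashlib.sha256(text[i:i + n].encode()).hexdigest()
            for i in range(len(text) - n + 1)}

sensitive = "SSN:123-45-6789"
digests = shingles(sensitive)                 # owner -> provider (hashes only)

def inspect(traffic, digests, threshold=3):
    hits = len(shingles(traffic) & digests)   # provider-side matching
    return hits >= threshold                  # enough overlap -> raise alert

print(inspect("payload SSN:123-45-6789 leaked", digests))  # True
print(inspect("harmless message", digests))                # False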
IEEE 2015 : The
Internet of Things for Health Care: A Comprehensive
Survey
IEEE 2015 Transactions on Data Mining
ABSTRACT : The
Internet of Things (IoT) makes smart objects the ultimate building blocks in
the development of cyber-physical smart pervasive frameworks. The IoT has
a variety of application domains, including health care. The IoT
revolution is redesigning modern health care with promising
technological, economic, and social prospects. This paper surveys advances
in IoT-based health care technologies and reviews the state-of-the-art
network architectures/platforms, applications, and industrial trends in
IoT-based health care solutions. In addition, this paper analyzes distinct
IoT security and privacy features, including security requirements, threat
models, and attack taxonomies from the health care perspective. Further,
this paper proposes an intelligent collaborative security model to
minimize security risk; discusses how different innovations such as big
data, ambient intelligence, and wearables can be leveraged in a health care
context; addresses various IoT and eHealth policies and regulations across
the world to determine how they can facilitate economies and societies in
terms of sustainable development; and provides some avenues for
future research on IoT-based health care based on a set of open issues and
challenges.
IEEE 2014 Transactions on Knowledge and Data Engineering
ABSTRACT : Top-k pairs and top-k objects queries have
received significant attention from the research community. In this paper, we
present the first approach to answer a broad class of top-k pairs and top-k
objects queries over sliding windows. Our framework handles multiple top-k
queries and each query is allowed to use a different scoring function, a
different value of k and a different size of the sliding window. Furthermore,
the framework allows the users to define arbitrarily complex scoring functions
and supports out-of-order data streams. For all the queries that use the same
scoring function, we need to maintain only one K-skyband. We present efficient
techniques for K-skyband maintenance and query answering. We conduct a
detailed complexity analysis and show that the expected cost of our approach is
reasonably close to the lower bound cost. For top-k pairs queries, we
demonstrate the efficiency of our approach by comparing it with a specially
designed supreme algorithm that assumes the existence of an oracle and meets
the lower bound cost. For top-k objects queries, our experimental results
demonstrate the superiority of our algorithm over the state-of-the-art
algorithm.
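For contrast with the paper's approach, here is a naive sliding-window top-k sketch that recomputes per window; the toy stream and scoring function are assumptions, and the point of the K-skyband maintenance is precisely to avoid this per-window recomputation and to share work across queries with the same scoring function.

import heapq
from collections import deque

def topk_stream(stream, k, window, score=lambda x: x):
    buf = deque()
    for item in stream:
        buf.append(item)
        if len(buf) > window:
            buf.popleft()                    # expire the oldest tuple
        yield heapq.nlargest(k, buf, key=score)

for top in topk_stream([5, 1, 9, 3, 7, 2], k=2, window=3):
    print(top)                               # the top-2 within each window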
IEEE 2014 Transactions on Knowledge and Data Engineering
ABSTRACT : A shortest distance query between two
nodes is a fundamental operation in large-scale networks. Most existing methods
in the literature take a landmark embedding approach, which selects a set of
graph nodes as landmarks and computes the shortest distances from each landmark
to all nodes as an embedding. To handle a shortest distance query between two
nodes, the precomputed distances from the landmarks to the query nodes are
used to compute an approximate shortest distance based on the triangle inequality.
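A minimal sketch of the landmark embedding on a toy graph: distances from each landmark are precomputed offline, and the triangle inequality then yields upper and lower bounds on any query distance. The graph and landmark choice are illustrative assumptions.

from collections import deque

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}

def bfs_dist(src):
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

landmarks = ["a", "e"]
emb = {l: bfs_dist(l) for l in landmarks}     # offline landmark embedding

def approx_dist(u, v):
    upper = min(emb[l][u] + emb[l][v] for l in landmarks)   # triangle inequality
    lower = max(abs(emb[l][u] - emb[l][v]) for l in landmarks)
    return lower, upper

print(approx_dist("b", "d"))  # true distance is 2; the bounds (2, 4) bracket it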
IEEE 2014: CoRE: A Context-Aware Relation Extraction Method for Relation Completion
IEEE 2014 Transactions on Knowledge and Data Engineering
ABSTRACT : We identify Relation Completion (RC) as one recurring problem that
is central to the success of novel big data applications such as Entity
Reconstruction and Data Enrichment. Given a semantic relation R, RC attempts to
link entity pairs between two entity lists under the relation R. To
accomplish the RC goals, we propose to formulate search queries for each query
entity based on some auxiliary information, so as to detect its target
entity from the set of retrieved documents. For instance, a Pattern-based
method (PaRE) uses extracted patterns as the auxiliary information in
formulating search queries. However, high-quality patterns may decrease the
probability of finding suitable target entities. As an alternative, we propose
the CoRE method, which uses context terms learned around the expression of a
relation as the auxiliary information in formulating queries. The experimental
results based on several real-world web data collections demonstrate that CoRE
reaches a much higher accuracy than PaRE for the purpose of RC.
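A toy sketch of the context-term idea: terms are learned from sentences containing seed pairs and then appended to the query entity when formulating search queries. The corpus, seeds, and scoring are illustrative assumptions.

from collections import Counter

corpus = ["Paris is the capital of France", "Tokyo is the capital of Japan"]
seeds = [("Paris", "France"), ("Tokyo", "Japan")]

ctx = Counter()
for sent in corpus:
    for e, t in seeds:
        if e in sent and t in sent:          # words surrounding a seed pair
            ctx.update(w for w in sent.split() if w not in (e, t))

context_terms = [w for w, _ in ctx.most_common(3)]

def formulate_query(entity):
    # CoRE-style query: the query entity plus learned context terms.
    return " ".join([entity] + context_terms)

print(formulate_query("Berlin"))  # -> "Berlin is the capital"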
IEEE 2014: Efficient Ranking on Entity Graphs with Personalized Relationships
IEEE 2014 Transactions on Knowledge and Data Engineering
ABSTRACT : Authority flow techniques like PageRank and
ObjectRank can provide personalized ranking of typed entity-relationship
graphs. There are two main ways to personalize authority flow ranking:
node-based personalization, where authority originates from a set of
user-specific nodes; and edge-based personalization, where the importance of
different edge types is user-specific. We propose the first approach to achieve
efficient edge-based personalization using a combination of precomputation and
runtime algorithms. In particular, we apply our method to ObjectRank, where a
personalized weight assignment vector (WAV) assigns different weights to each
edge type or relationship type. Our approach includes a repository of rankings
for various WAVs. We consider the following two classes of approximation: (a)
Schema Approx is formulated as a distance minimization problem at the schema
level; (b) Data Approx is a distance minimization problem at the data graph
level. Schema Approx is not robust since it does not distinguish between
important and trivial edge types based on the edge distribution in the data
graph. In contrast, Data Approx has a provable error bound. Both Schema Approx
and Data Approx are expensive so we develop efficient heuristic
implementations, ScaleRank and PickOne respectively. Extensive experiments on
the DBLP data graph show that ScaleRank provides a fast and accurate
personalized authority flow ranking.
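A compact sketch of authority flow with a personalized weight assignment vector on a toy typed graph; the edge types, weights, and damping factor are illustrative assumptions.

# Typed edges: (source, target, edge_type)
edges = [("paper1", "alice", "authored_by"), ("paper2", "alice", "authored_by"),
         ("paper1", "paper2", "cites")]
wav = {"authored_by": 0.7, "cites": 0.3}     # personalized weight assignment

nodes = {n for s, t, _ in edges for n in (s, t)}
rank = {n: 1.0 / len(nodes) for n in nodes}
d = 0.85
for _ in range(30):                          # power iteration
    new = {n: (1 - d) / len(nodes) for n in nodes}
    for s, t, etype in edges:
        out_w = sum(wav[et] for s2, _, et in edges if s2 == s)  # normalize
        new[t] += d * rank[s] * wav[etype] / out_w
    rank = new                               # (dangling nodes simply leak
print(sorted(rank.items(), key=lambda kv: -kv[1]))  # authority in this toy)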
IEEE 2014:Secure
Mining of Association Rules in Horizontally Distributed Databases
IEEE 2014 Transactions on Knowledge and Data Engineering
ABSTRACT : We
propose a protocol for secure mining of association rules in horizontally
distributed databases. The current leading protocol is that of Kantarcioglu and
Clifton [18]. Our protocol, like theirs, is based on the Fast Distributed
Mining (FDM) algorithm of Cheung et al. [8], which is an unsecured distributed
version of the Apriori algorithm. The main ingredients in our protocol are two
novel secure multi-party algorithms — one that computes the union of private
subsets that each of the interacting players holds, and another that tests the
inclusion of an element held by one player in a subset held by another. Our
protocol offers enhanced privacy with respect to the protocol in [18]. In
addition, it is simpler and is significantly more efficient in terms of
communication rounds, communication cost and computational cost.
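For intuition, here is a toy sketch of the commutative-encryption style often used for such secure unions; the parameters are far too small to be secure and the construction is an illustrative assumption, not the paper's exact algorithms.

import hashlib

P = 2**127 - 1  # toy Mersenne prime; real protocols use much larger groups

def h(item):
    # Hash a candidate itemset into the group (illustrative encoding).
    return int.from_bytes(hashlib.sha256(item.encode()).digest(), "big") % P

def enc(x, key):
    return pow(x, key, P)  # commutative: enc(enc(x, a), b) == enc(enc(x, b), a)

k1, k2 = 0x1234567, 0x7654321          # each player's private exponent
site1, site2 = {"ruleA", "ruleB"}, {"ruleB", "ruleC"}

# Each site encrypts its own candidates, the sites swap, and each adds its
# layer to the other's values; equal plaintexts then collide as ciphertexts.
double1 = {enc(enc(h(x), k1), k2) for x in site1}
double2 = {enc(enc(h(x), k2), k1) for x in site2}
print(len(double1 | double2))  # -> 3: the union size, without revealing owners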
IEEE 2014 Transactions on Knowledge and Data Engineering
ABSTRACT : In the past decade, extending the
keyword search paradigm to relational data has been an active area of research
within the database and information retrieval (IR) community. A large number of
approaches have been proposed and implemented, but despite numerous
publications, there remains a severe lack of standardization for system
evaluations. This lack of standardization has resulted in contradictory results
from different evaluations, and the numerous discrepancies muddle what
advantages are proffered by different approaches. In this paper, we present a
thorough empirical performance evaluation of relational keyword search systems.
Our results indicate that many existing search techniques do not provide
acceptable performance for realistic retrieval tasks. In particular, memory
consumption precludes many search techniques from scaling beyond small
datasets with tens of thousands of vertices. We also explore the relationship
between execution time and factors varied in previous evaluations; our analysis
indicates that these factors have relatively little impact on performance. In
summary, our work confirms previous claims regarding the unacceptable
performance of these systems and underscores the need for standardization—as
exemplified by the IR community—when evaluating these retrieval systems.
IEEE 2013: SUSIE: Search Using Services and Information Extraction
IEEE 2013 Transactions on Knowledge and Data Engineering
ABSTRACT : The
API of a Web service restricts the types of queries that the service can
answer. For example, a Web service might provide a method that returns the
songs of a given singer, but it might not provide a method that returns the
singers of a given song. If the user asks for the singer of some specific song,
then the Web service cannot be called – even though the underlying database
might have the desired piece of information. This asymmetry is particularly
problematic if the service is used in a Web service orchestration system. In
this paper, we propose to use on-the-fly information extraction to collect
values that can be used as parameter bindings for the Web service. We show how
this idea can be integrated into a Web service orchestration system. Our
approach is fully implemented in a prototype called SUSIE. We present experiments
with real-life data and services to demonstrate the practical viability and
good performance of our approach.
IEEE 2013: A Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data
IEEE 2013 Transactions on Knowledge and Data Engineering
ABSTRACT : Feature
selection involves identifying a subset of the most useful features that
produces results comparable to those of the original entire set of features. A feature
selection algorithm may be evaluated from both the efficiency and effectiveness
points of view. While the efficiency concerns the time required to find a
subset of features, the effectiveness is related to the quality of the subset
of features. Based on these criteria, a fast clustering-based feature selection
algorithm, FAST, is proposed and experimentally evaluated in this paper. The
FAST algorithm works in two steps. In the first step, features are divided into
clusters by using graph-theoretic clustering methods. In the second step, the
most representative feature that is strongly related to target classes is
selected from each cluster to form a subset of features. Features in different
clusters are relatively independent; thus, the clustering-based strategy of FAST has
a high probability of producing a subset of useful and independent features. To
ensure the efficiency of FAST, we adopt the efficient minimum-spanning tree
clustering method. The efficiency and effectiveness of the FAST algorithm are
evaluated through an empirical study. Extensive experiments are carried out to
compare FAST and several representative feature selection algorithms, namely,
FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known
classifiers, namely, the probability-based Naive Bayes, the tree-based C4.5,
the instance-based IB1, and the rule-based RIPPER before and after feature
selection. The results, on 35 publicly available real-world high dimensional
image, microarray, and text data, demonstrate that FAST not only produces
smaller subsets of features but also improves the performances of the four
types of classifiers.
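A condensed sketch of the FAST pipeline on a tiny discrete dataset: symmetric uncertainty (SU) as the correlation measure, a Kruskal-style minimum spanning tree over 1 - SU edge weights, the feature-feature versus feature-class cut rule, and one representative per resulting cluster. The dataset and the in-loop application of the cut rule are illustrative simplifications of the paper's procedure.

import math
from collections import Counter
from itertools import combinations

def H(col):
    n = len(col)
    return -sum(c / n * math.log2(c / n) for c in Counter(col).values())

def SU(x, y):
    hx, hy = H(x), H(y)
    hxy = H(list(zip(x, y)))
    return 2 * (hx + hy - hxy) / (hx + hy) if hx + hy else 0.0

# Rows of discrete feature values plus a class label (illustrative data).
X = [[0, 0, 1], [0, 0, 0], [1, 1, 1], [1, 1, 0]]   # f1 duplicates f0; f2 is noise
y = [0, 0, 1, 1]
feats = list(zip(*X))                               # column view

rel = [SU(f, y) for f in feats]                     # feature-class relevance
edges = sorted(((1 - SU(feats[i], feats[j]), i, j)  # high SU -> light MST edge
                for i, j in combinations(range(len(feats)), 2)))

parent = list(range(len(feats)))
def find(a):
    while parent[a] != a:
        a = parent[a]
    return a

for w, i, j in edges:                               # Kruskal MST with the FAST
    su_ij = 1 - w                                   # cut rule: keep an edge only
    if find(i) != find(j) and su_ij >= min(rel[i], rel[j]):  # if it dominates
        parent[find(i)] = find(j)

clusters = {}
for i in range(len(feats)):
    clusters.setdefault(find(i), []).append(i)
selected = [max(c, key=lambda i: rel[i]) for c in clusters.values()]
print(selected)   # -> [0]: one representative feature per cluster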
IEEE 2013: Facilitating Document Annotation using Content and Querying Value
IEEE 2013 Transactions on Knowledge and Data Engineering
ABSTRACT : A large
number of organizations today generate and share textual descriptions of their
products, services, and actions. Such collections of textual data contain a
significant amount of structured information, which remains buried in the
unstructured text. While information extraction algorithms facilitate the
extraction of structured relations, they are often expensive and inaccurate,
especially when operating on top of text that does not contain any instances of
the targeted structured information. We present a novel alternative approach
that facilitates the generation of the structured metadata by identifying
documents that are likely to contain information of interest, information that
will subsequently be useful for querying the database. Our
approach relies on the idea that humans are more likely to add the necessary
metadata during creation time, if prompted by the interface; or that it is much
easier for humans (and/or algorithms) to identify the metadata when such
information actually exists in the document, instead of naively prompting users
to fill in forms with information that is not available in the document. As a
major contribution of this paper, we present algorithms that identify
structured attributes that are likely to appear within the document, by jointly
utilizing the content of the text and the query workload. Our experimental
evaluation shows that our approach generates superior results compared to
approaches that rely only on the textual content or only on the query workload,
to identify attributes of interest.
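A toy sketch of the joint content-and-workload scoring idea: an attribute is suggested only when the text shows evidence for it and the query workload actually asks for it. The matching heuristics and data are illustrative assumptions.

doc = "Lightweight aluminium frame, 21-speed gear system, 26 inch wheels"
workload = ["frame material", "gear", "color", "wheel size", "gear system"]

attributes = {"frame": ["frame", "aluminium"], "gear": ["gear", "speed"],
              "color": ["color", "red", "blue"], "wheel": ["wheel", "inch"]}

def score(attr):
    terms = attributes[attr]
    content = sum(t in doc.lower() for t in terms)           # evidence in text
    demand = sum(any(t in q for t in terms + [attr]) for q in workload)
    return content * demand          # prompt only for attributes that are both
                                     # present and actually queried
ranked = sorted(attributes, key=score, reverse=True)
print(ranked)  # the annotation interface prompts for the top attributes first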
IEEE 2013 : A Web Usage Mining Approach Based on a New Technique in Web Path Recommendation Systems
IEEE 2013 Transactions on Engineering Research & Technology
ABSTRACT : The Internet is one of the fastest growing areas of intelligence gathering.
The ranking of web pages for Web search engines is one of the
significant problems at present, and it has drawn considerable
attention from the research community. Web prefetching is used to reduce the
access latency of the Internet. However, if most prefetched Web pages are not
visited by the users in their subsequent accesses, the limited network
bandwidth and server resources will not be used efficiently and may worsen the
access delay problem. Therefore, it is critical that we have an accurate
prediction method during prefetching. To provide prediction efficiently,
we advance an architecture for prediction in Web Usage Mining
systems and propose a novel approach for classifying user navigation patterns
for predicting users’ requests based on clustering users’ browsing behavior
knowledge. The experimental results show that the approach can improve the
accuracy, precision, recall and F-measure of classification in the
architecture.
IEEE 2013:PMSE: A Personalized Mobile Search Engine
IEEE 2013 Transactions on Knowledge and Data Engineering
ABSTRACT : We propose a
personalized mobile search engine (PMSE) that captures the users’ preferences
in the form of concepts by mining their clickthrough data. Due to the
importance of location information in mobile search, PMSE classifies these
concepts into content concepts and location concepts. In addition, users’
locations (positioned by GPS) are used to supplement the location concepts in
PMSE. The user preferences are organized in an ontology-based, multifacet user
profile, which is used to adapt a personalized ranking function for rank
adaptation of future search results. To characterize the diversity of the
concepts associated with a query and their relevance to the user’s need, four
entropies are introduced to balance the weights between the content and
location facets. Based on the client-server model, we also present a detailed
architecture and design for implementation of PMSE. In our design, the client
collects and stores the clickthrough data locally to protect privacy, whereas
heavy tasks such as concept extraction, training, and reranking are performed
at the PMSE server. Moreover, we address the privacy issue by restricting the
information in the user profile exposed to the PMSE server with two privacy
parameters. We prototype PMSE on the Google Android platform. Experimental
results show that PMSE significantly improves the precision compared to the
baseline.
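A minimal sketch of entropy-based facet weighting; the click distributions and the two-way weighting below are illustrative assumptions (PMSE defines four such entropies over the clickthrough data).

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Clicks per concept for one query (illustrative clickthrough data).
content_clicks = [12, 1]          # content concepts, e.g. {coffee, wifi}
location_clicks = [5, 4, 6]       # location concepts, e.g. {downtown, airport, campus}

e_content = entropy(content_clicks)
e_location = entropy(location_clicks)
w_location = e_location / (e_content + e_location)   # balance the two facets
print(round(w_location, 2))       # high -> location matters for this query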
IEEE 2013: Generation of Personalized Ontology Based on
Consumer Emotion and Behavior Analysis
IEEE 2013 Transactions on Affective Computing
ABSTRACT : The relationships between consumer emotions and
their buying behaviors have been well documented. Technology-savvy consumers
often use the web to find information on products and services before they
commit to buying. We propose a semantic web usage mining approach for
discovering periodic web access patterns from annotated web usage logs which
incorporates information on consumer emotions and behaviors through
self-reporting and behavioral tracking. We use fuzzy logic to represent
real-life temporal concepts (e.g., morning) and requested resource attributes
(ontological domain concepts for the requested URLs) of periodic pattern-based
web access activities. These fuzzy temporal and resource representations, which
contain both behavioral and emotional cues, are incorporated into a Personal
Web Usage Lattice that models the user’s web access activities. From this, we
generate a Personal Web Usage Ontology written in OWL, which enables semantic
web applications such as personalized web resources recommendation. Finally, we
demonstrate the effectiveness of our approach by presenting experimental
results in the context of personalized web resources recommendation with
varying degrees of emotional influence. Emotional influence has been found to
contribute positively to adaptation in personalized recommendation.
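A tiny sketch of the fuzzy temporal representation mentioned above, using a trapezoidal membership function for "morning"; the breakpoints are illustrative assumptions.

def morning_membership(hour):
    # Trapezoid: rises 5-7h, full 7-10h, falls 10-12h (illustrative breakpoints).
    if 7 <= hour <= 10:
        return 1.0
    if 5 <= hour < 7:
        return (hour - 5) / 2
    if 10 < hour <= 12:
        return (12 - hour) / 2
    return 0.0

for h in (6, 8, 11, 14):
    print(h, morning_membership(h))   # graded degrees instead of a hard cutoff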
IEEE 2013: Identity-Based Secure Distributed Data Storage Schemes
IEEE 2013 Transactions on Computers
ABSTRACT : Secure distributed
data storage can shift the burden of maintaining a large number of files from
the owner to proxy servers. Proxy servers can convert encrypted files for the
owner to encrypted files for the receiver without the necessity of knowing the
content of the original files. In practice, the original files will be removed
by the owner for the sake of space efficiency. Hence, the issues on
confidentiality and integrity of the outsourced data must be addressed
carefully. In this paper, we propose two identity-based secure distributed data
storage (IBSDDS) schemes. Our schemes can capture the following properties: The
file owner can decide the access permission independently without the help of
the private key generator (PKG); For one query, a receiver can only
access one file, instead of all files of the owner; Our schemes are secure
against the collusion attacks, namely even if the receiver can compromise the
proxy servers, he cannot obtain the owner’s secret key. Although the first
scheme is only secure against chosen-plaintext attacks (CPA), the second
scheme is secure against chosen-ciphertext attacks (CCA). To the best of
our knowledge, these are the first IBSDDS schemes where access permission is
granted by the owner for an exact file and collusion attacks can be resisted in
the standard model.
IEEE 2013: Ginix: Generalized Inverted Index for Keyword Search
IEEE 2013 Transactions on Knowledge and Data Mining
ABSTRACT : Keyword search
has become a ubiquitous method for users to access text data in the face of
information explosion. Inverted lists are usually used to index underlying
documents to retrieve documents according to a set of keywords efficiently.
Since inverted lists are usually large, many compression techniques have been
proposed to reduce the storage space and disk I/O time. However, these
techniques usually perform decompression operations on the fly, which increases
the CPU time. This paper presents a more efficient index structure, the
Generalized INverted IndeX (Ginix), which merges consecutive IDs in inverted
lists into intervals to save storage space. With this index structure, more
efficient algorithms can be devised to perform basic keyword search operations,
i.e., the union and the intersection operations, by taking advantage of
intervals. Specifically, these algorithms do not require conversions from
interval lists back to ID lists. As a result, keyword search using Ginix can be
more efficient than those using traditional inverted indices. The performance
of Ginix is also improved by reordering the documents in data sets using two
scalable algorithms. Experiments on the performance and scalability of
Ginix on real data sets show that Ginix not only requires less storage space,
but also improves the keyword search performance, compared with traditional
inverted indexes.
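A minimal sketch of the interval representation and an intersection that never converts interval lists back to ID lists; the toy posting lists are assumptions, and Ginix adds further optimizations plus document reordering on top.

def to_intervals(ids):
    # Merge consecutive sorted doc IDs into [lo, hi] intervals.
    out = []
    for i in ids:
        if out and i == out[-1][1] + 1:
            out[-1][1] = i
        else:
            out.append([i, i])
    return out

def intersect(a, b):
    # Intersect two interval lists directly, without expanding to IDs.
    res, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo <= hi:
            res.append([lo, hi])
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return res

l1 = to_intervals([1, 2, 3, 7, 8, 9])     # -> [[1, 3], [7, 9]]
l2 = to_intervals([2, 3, 4, 8])           # -> [[2, 4], [8, 8]]
print(intersect(l1, l2))                  # -> [[2, 3], [8, 8]]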
IEEE 2013: An Ontology-based Framework for Context-aware Adaptive
E-learning System
IEEE 2013 Transactions on Computer Communication and Informatics
ABSTRACT : In web-based
e-learning environment every learner has a distinct background, learning style
and a specific goal when searching for learning material on the web. The goal
of personalization is to tailor search results to a particular user based on
that user’s contextual information. The effectiveness of accessing learning
material involves two important challenges: identifying the user context and
modeling the user context as ontological profiles. This work describes the
ontology-based framework for context-aware adaptive learning system, with
detailed discussions on the categorization of contextual information and its modeling,
along with the use of ontology to explicitly specify learner context in an
e-learning environment. Finally, we conclude by showing the applicability of the
proposed ontology with appropriate architectural overview of e-learning system.
IEEE 2013: ELCA Evaluation for Keyword Search on
Probabilistic XML Data
IEEE 2013 Transactions on Knowledge and Data Engineering
ABSTRACT : As probabilistic data
management is becoming one of the main research focuses and keyword search is
turning into a more popular query means, it is natural to consider how to support
keyword queries on probabilistic XML data. With regard to keyword queries on
deterministic XML documents, ELCA (Exclusive Lowest Common Ancestor) semantics
allows more relevant fragments rooted at the ELCAs to appear as results and is
more popular compared with other keyword query result semantics (such as
SLCAs). In this paper, we investigate how to evaluate ELCA results for keyword
queries on probabilistic XML documents. After defining probabilistic ELCA
semantics in terms of possible world semantics, we propose an approach to
compute ELCA probabilities without generating possible worlds. Then we develop
an efficient stack-based algorithm that can find all probabilistic ELCA results
and their ELCA probabilities for a given keyword query on a probabilistic XML
document. Finally, we experimentally evaluate the proposed ELCA algorithm and
compare it with its SLCA counterpart in terms of result effectiveness, time
and space efficiency, and scalability.
IEEE 2013: Crowdsourcing Predictors of Behavioral Outcomes
IEEE 2013 Transactions on Knowledge and Data Engineering
ABSTRACT : Generating models from
large data sets—and determining which subsets of data to mine—is becoming
increasingly automated. However, choosing what data to collect in the first
place requires human intuition or experience, usually supplied by a domain
expert. This paper describes a new approach to machine science which
demonstrates for the first time that non-domain experts can collectively
formulate features, and provide values for those features such that they are
predictive of some behavioral outcome of interest. This was accomplished by
building a web platform in which human groups interact to both respond to
questions likely to help predict a behavioral outcome and pose new questions to
their peers. This results in a dynamically-growing online survey, but the
result of this cooperative behavior also leads to models that can predict
users’ outcomes based on their responses to the user-generated survey
questions. Here we describe two web-based experiments that instantiate this
approach: the first site led to models that can predict users’ monthly electric
energy consumption; the other led to models that can predict users’ body mass
index. As exponential increases in content are often observed in successful
online collaborative communities, the proposed methodology may, in the future,
lead to similar exponential rises in discovery and insight into the causal
factors of behavioral outcomes.