IEEE 2018/19 Data Mining Projects

IEEE 2018 : Deep Air Learning: Interpolation, Prediction, and Feature Analysis of Fine-grained Air Quality
Abstract: The interpolation, prediction, and feature analysis of fine-grained air quality are three important topics in the area of urban air computing. Solutions to these topics can provide extremely useful information to support air pollution control, and consequently generate great societal and technical impact. Most existing work solves the three problems separately with different models. In this paper, we propose a general and effective approach that solves all three problems in one model, called Deep Air Learning (DAL). The main idea of DAL lies in embedding feature selection and semi-supervised learning in different layers of the deep learning network. The proposed approach utilizes information pertaining to unlabeled spatio-temporal data to improve the performance of interpolation and prediction, and performs feature selection and association analysis to reveal the features most relevant to the variation of air quality. We evaluate our approach with extensive experiments based on real data sources obtained in Beijing, China. Experiments show that DAL is superior to peer models from the recent literature when solving the topics of interpolation, prediction, and feature analysis of fine-grained air quality.


IEEE 2018 : Heterogeneous Information Network Embedding for Recommendation
Abstract: Due to its flexibility in modeling data heterogeneity, the heterogeneous information network (HIN) has been adopted to characterize complex and heterogeneous auxiliary data in recommender systems, an approach called HIN-based recommendation. It is challenging to develop effective methods for HIN-based recommendation in both the extraction and the exploitation of information from HINs. Most HIN-based recommendation methods rely on path-based similarity, which cannot fully mine latent structure features of users and items. In this paper, we propose a novel heterogeneous network embedding based approach for HIN-based recommendation, called HERec. To embed HINs, we design a meta-path based random walk strategy to generate meaningful node sequences for network embedding. The learned node embeddings are first transformed by a set of fusion functions, and subsequently integrated into an extended matrix factorization (MF) model. The extended MF model and the fusion functions are jointly optimized for the rating prediction task. Extensive experiments on three real-world datasets demonstrate the effectiveness of the HERec model. Moreover, we show the capability of the HERec model on the cold-start problem, and reveal that the transformed embedding information from HINs can improve recommendation performance.


IEEE 2018 : Correlated Matrix Factorization for Recommendation with Implicit Feedback
Abstract: As a typical latent factor model, Matrix Factorization (MF) has demonstrated great effectiveness in recommender systems. Users and items are represented in a shared low-dimensional space so that user preference can be modeled by linearly combining the item factor vector V using the user-specific coefficients U. From a generative model perspective, U and V are drawn from two independent Gaussian distributions, which is not faithful to reality. Items are produced to maximally meet users' requirements, which makes U and V strongly correlated. Meanwhile, the linear combination between U and V forces a bijection (one-to-one mapping), thereby neglecting the mutual correlation between the latent factors. In this paper, we address the above drawbacks and propose a new model, named Correlated Matrix Factorization (CMF). Technically, we apply Canonical Correlation Analysis (CCA) to map U and V into a new semantic space. Besides achieving the optimal fit on the rating matrix, each component in one vector (U or V) is also tightly correlated with every single component in the other. We derive efficient inference and learning algorithms based on variational EM methods. The effectiveness of our proposed model is comprehensively verified on four public datasets. Experimental results show that our approach achieves competitive performance in both prediction accuracy and efficiency compared with the current state of the art.


IEEE 2018 : Classification of a Bank Data Set on Various Data Mining Platforms
Abstract: The process of extracting meaningful rules from big and complex data is called data mining. Data mining enjoys increasing popularity in every field today. Data units are established in customer-oriented industries such as marketing, finance, and telecommunications to work on customer churn and acquisition in particular. Among data mining methods, classification algorithms are used in customer-acquisition studies to predict the potential customers of the company in question in the related industry. In this study, the bank marketing data set from the UCI Machine Learning Repository was used to build models with the same classification algorithms in different data mining programs. Accuracy, precision, and F-measure were used to test the performance of the classification models. When creating the classification models, the test and training data sets were randomly divided by the holdout method to evaluate performance. The data set was divided into training and test sets with 60-40%, 75-25%, and 80-20% separation ratios. The data mining programs used for these processes were R, KNIME, RapidMiner, and WEKA, and the classification algorithms commonly used across these platforms were the k-nearest neighbor (k-NN), Naive Bayes, and C4.5 decision tree.
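As a rough illustration of the study's protocol (not its exact setup), the sketch below runs the three classifiers over the three holdout splits with scikit-learn; the file name bank.csv and its semicolon separator follow the UCI distribution but are assumptions here.

```python
# Minimal holdout-evaluation sketch with the three classifiers named above.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, f1_score

df = pd.read_csv("bank.csv", sep=";")          # assumed local copy of the UCI data
X = pd.get_dummies(df.drop(columns=["y"]))     # one-hot encode categorical fields
y = (df["y"] == "yes").astype(int)

for test_size in (0.40, 0.25, 0.20):           # the 60-40, 75-25, 80-20 splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=1)
    for name, clf in [("k-NN", KNeighborsClassifier(n_neighbors=5)),
                      ("Naive Bayes", GaussianNB()),
                      # sklearn's tree is CART; the entropy criterion makes it C4.5-like
                      ("decision tree", DecisionTreeClassifier(criterion="entropy"))]:
        y_hat = clf.fit(X_tr, y_tr).predict(X_te)
        print(f"{1-test_size:.0%}/{test_size:.0%} {name}: "
              f"acc={accuracy_score(y_te, y_hat):.3f} "
              f"prec={precision_score(y_te, y_hat):.3f} "
              f"f1={f1_score(y_te, y_hat):.3f}")
```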


IEEE 2018 : Harnessing Multi-source Data about Public Sentiments and Activities for Informed Design
Abstract: The intelligence of Smart Cities (SC) is represented by their ability to collect, manage, integrate, analyze, and mine multi-source data for valuable insights. In order to harness multi-source data for informed place design, this paper presents the "Public Sentiments and Activities in Places" multi-source data analysis flow (PSAP) in an Informed Design Platform (IDP). In terms of key contributions, PSAP implements 1) an Interconnected Data Model (IDM) to manage multi-source data independently and integrally, 2) an efficient and effective data mining mechanism based on multi-dimension and multi-measure queries (MMQs), and 3) concurrent data processing cascades with a Sentiments in Places Analysis Mechanism (SPAM) and an Activities in Places Analysis Mechanism (APAM), to fuse social network data with other data on public sentiment and activity comprehensively. As shown by a holistic evaluation, both SPAM and APAM outperform the compared methods. Specifically, SPAM improves its classification accuracy gradually and significantly from 72.37% to about 85% within 9 crowd-calibration cycles, and APAM with an ensemble classifier achieves the highest precision of 92.13%, approximately 13% higher than the second-best method. Finally, by applying MMQs to "Sentiment&Activity Linked Data", various place design insights for our testbed are mined to improve its livability.


IEEE 2018 : A Data Mining based Model for Detection of Fraudulent Behaviour in Water Consumption

Abstract: Fraudulent behavior in drinking water consumption is a significant problem facing water supply companies and agencies. This behavior results in a massive loss of income and forms the highest percentage of non-technical loss. Finding efficient measures for detecting fraudulent activities has been an active research area in recent years. Intelligent data mining techniques can help water supply companies detect these fraudulent activities and reduce such losses. This research explores the use of two classification techniques (SVM and KNN) to detect suspicious fraud water customers. The main motivation of this research is to assist Yarmouk Water Company (YWC) in Irbid, Jordan, in overcoming its profit loss. The SVM-based approach uses customer load profile attributes to expose abnormal behavior that is known to be correlated with non-technical loss activities. The data were collected from the historical records of the company's billing system. The accuracy of the generated model reached over 74%, which is better than the current manual prediction procedures used by YWC. To deploy the model, a decision tool was built around it. The system will help the company predict suspicious water customers to be inspected on site.



IEEE 2018: Collaborative Filtering Algorithm Based on Rating Difference and User Interest
Abstract: The collaborative filtering algorithm is one of the most widely used approaches in daily life, so improving its quality and efficiency is an essential problem. Traditional algorithms usually focus on user ratings alone and do not take rating differences and user interest into account. However, users who have little rating difference or who share a similar interest may be highly similar. In this paper, a collaborative filtering algorithm based on rating difference and user interest is proposed. Firstly, a rating difference factor is added to the traditional collaborative filtering algorithm, where the most appropriate factor can be obtained by experiment. Secondly, user interest is calculated by combining the attributes of the items, and the similarity of personal interest between users is then derived. Finally, the rating-difference similarity and the interest similarity are weighted to obtain the final item recommendation and score forecast. Experimental results show that the proposed algorithm decreases both Mean Absolute Error and Root Mean Squared Error, improving recommendation accuracy.
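A minimal sketch of the weighting idea: one similarity term from rating differences on co-rated items, one from interest vectors over item attributes, mixed by a weight alpha tuned by experiment. The formulas below are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def rating_diff_sim(ru, rv, eps=1.0):
    """Similarity from the mean absolute rating difference on co-rated items."""
    mask = (ru > 0) & (rv > 0)            # items rated by both users
    if not mask.any():
        return 0.0
    return 1.0 / (1.0 + np.abs(ru[mask] - rv[mask]).mean() / eps)

def interest_sim(iu, iv):
    """Cosine similarity between item-attribute interest vectors."""
    denom = np.linalg.norm(iu) * np.linalg.norm(iv)
    return float(iu @ iv / denom) if denom else 0.0

def combined_sim(ru, rv, iu, iv, alpha=0.6):
    # alpha plays the role of the experimentally tuned weighting factor
    return alpha * rating_diff_sim(ru, rv) + (1 - alpha) * interest_sim(iu, iv)

# Toy example: two users, five items, three item-attribute interests.
ru, rv = np.array([5, 3, 0, 4, 0]), np.array([4, 3, 2, 5, 0])
iu, iv = np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])
print(combined_sim(ru, rv, iu, iv))
```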



IEEE 2018: Collaborative filtering model for enhancing fingerprint image
Abstract: Fingerprint enhancement plays a very important role in automatic fingerprint identification systems. In order to ensure reliable fingerprint identification and improve the fingerprint ridge structure, a novel method based on a collaborative filtering model for fingerprint enhancement is proposed. The proposed method consists of two stages. First, the original fingerprint is pre-enhanced using a Gabor filter and linear contrast stretching. Next, the pre-enhanced fingerprint is partitioned into patches in the spatial domain, and the patches are then enhanced based on spectral diffusion using a two-dimensional (2D) angular-pass filter and a 2D Butterworth band-pass filter. The proposed method takes full advantage of ridge information and spectral diffusion to recover lost ridge structure with higher quality. To evaluate the proposed method, the FVC2004 databases are employed, and comparison experiments are carried out against various methods. Comparative experimental results show that the proposed algorithm outperforms existing state-of-the-art methods for fingerprint enhancement.
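The first stage (Gabor pre-enhancement plus linear contrast stretching) can be sketched with OpenCV as below; the kernel parameters and file names are assumptions, and the second-stage angular-pass/Butterworth spectral diffusion is omitted.

```python
import cv2
import numpy as np

# Stage-one sketch only: Gabor filtering followed by linear contrast stretching.
img = cv2.imread("fingerprint.png", cv2.IMREAD_GRAYSCALE)  # assumed input file

# Filter with a small bank of Gabor kernels at several ridge orientations
# and keep the strongest response per pixel.
responses = []
for theta in np.arange(0, np.pi, np.pi / 8):
    kern = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                              lambd=10.0, gamma=0.5, psi=0)
    responses.append(cv2.filter2D(img, cv2.CV_32F, kern))
gabor = np.max(np.stack(responses), axis=0)

# Linear contrast stretch to the full 8-bit range.
stretched = cv2.normalize(gabor, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
cv2.imwrite("pre_enhanced.png", stretched)
```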


IEEE 2018: A Novel Mechanism for Fast Detection of Transformed Data Leakage
Abstract: Data leakage is a growing insider threat in information security among organizations and individuals. A series of methods have been developed to address the problem of data leakage prevention (DLP). However, large amounts of unstructured data need to be tested in the Big Data era. As the volume of data grows dramatically and the forms of data become much more complicated, dealing with large amounts of transformed data is a new challenge for DLP. We propose an Adaptive weighted Graph Walk model (AGW) to solve this problem by mapping it to the domain of weighted graphs. Our approach solves the problem in three steps. First, adaptive weighted graphs are built to quantify the sensitivity of the tested data based on its context. Then, improved label propagation is used to enhance scalability for fresh data. Finally, a low-complexity score walk algorithm is proposed to determine the ultimate sensitivity. Experimental results show that the proposed method can detect leaks of transformed or fresh data quickly and efficiently.



IEEE 2018: Machine Learning Methods for Disease Prediction with Claims Data
Abstract: One of the primary challenges of healthcare delivery is aggregating disparate, asynchronous data sources into meaningful indicators of individual health. We combine natural language word embedding and network modeling techniques to learn meaningful representations of medical concepts by using the weighted network adjacency matrix in the GloVe algorithm, which we call Code2Vec. We demonstrate that using our learned embeddings improves neural network performance for disease prediction. However, we also demonstrate that popular deep learning models for disease prediction are not meaningfully better than simpler, more interpretable classifiers such as XGBoost. Additionally, our work adds to the current literature by providing a comprehensive survey of various machine learning algorithms on disease prediction tasks.


IEEE 2018: A Framework for Real-Time Spam Detection in Twitter
Abstract: With the increased popularity of online social networks, spammers find these platforms easily accessible to trap users in malicious activities by posting spam messages. In this work, we take the Twitter platform and perform spam tweet detection. To stop spammers, tools such as Google SafeBrowsing and Twitter's BotMaker detect and block spam tweets. These tools can block malicious links, but they cannot protect the user in real time. Thus, industry and researchers have applied different approaches to make social network platforms spam-free. Some are based only on user-based features, while others rely only on tweet-based features. However, there is no comprehensive solution that consolidates a tweet's text with user-based features. To address this, we propose a framework that combines user-based and tweet-based features with the tweet text itself to classify tweets. The benefit of using the tweet text feature is that we can identify spam tweets even when a spammer creates a new account, which was not possible with user- and tweet-based features alone. We evaluated our solution with four machine learning algorithms: Support Vector Machine, Neural Network, Random Forest, and Gradient Boosting. With the Neural Network, we achieve an accuracy of 91.65%, surpassing the existing solution [1] by approximately 18%.
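A minimal sketch of the fusion idea: a scikit-learn ColumnTransformer joins TF-IDF text features with user- and tweet-based numeric features in one classifier. The feature names and toy rows are assumptions, not the paper's dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

# Toy data: tweet text plus one user-based and one tweet-based feature.
data = pd.DataFrame({
    "text": ["win a free iphone now!!", "great talk at the conference",
             "click this link for cash", "enjoying coffee this morning"],
    "followers": [10, 500, 3, 250],        # user-based feature (assumed)
    "urls_in_tweet": [1, 0, 2, 0],         # tweet-based feature (assumed)
    "is_spam": [1, 0, 1, 0],
})

model = Pipeline([
    ("features", ColumnTransformer([
        ("tfidf", TfidfVectorizer(), "text"),              # tweet text feature
        ("numeric", "passthrough", ["followers", "urls_in_tweet"]),
    ])),
    ("clf", GradientBoostingClassifier()),  # any of the four tested learners fits here
])
model.fit(data[["text", "followers", "urls_in_tweet"]], data["is_spam"])
print(model.predict(data[["text", "followers", "urls_in_tweet"]]))
```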





IEEE 2017: NetSpam: a Network-based Spam Detection Framework for Reviews in Online Social Media
IEEE 2017 Data Mining


Abstract: Nowadays, many people rely on available content in social media when making decisions (e.g., reviews and feedback on a topic or product). The possibility that anybody can leave a review provides a golden opportunity for spammers to write spam reviews about products and services for different interests. Identifying these spammers and the spam content is a hot topic of research, and although a considerable number of studies have recently been conducted toward this end, the methodologies put forth so far still barely detect spam reviews, and none of them show the importance of each extracted feature type. In this study, we propose a novel framework, named NetSpam, which utilizes spam features to model review datasets as heterogeneous information networks and to map the spam detection procedure into a classification problem in such networks. Using the importance of spam features helps us obtain better results in terms of different metrics on real-world review datasets from the Yelp and Amazon websites.





IEEE 2017: Point-of-interest Recommendation for Location Promotion in Location-based Social Networks


IEEE 2017 Data Mining



Abstract: With the wide application of location-based social networks (LBSNs), point-of-interest (POI) recommendation has become one of their major services. User behavior in LBSNs mainly consists of checking in at POIs, and these check-in behaviors are influenced by the user's habits and his/her friends. In social networks, social influence is often used to help businesses attract more users. Each target user has a different influence on different POIs in social networks. This paper selects the list of POIs with the greatest influence to recommend to users. Our goals are to satisfy the target user's service need and simultaneously to promote businesses' locations (POIs). This paper defines a POI recommendation problem for location promotion. Additionally, we use submodular properties to solve the optimization problem. Finally, we conduct a comprehensive performance evaluation of our method using two real LBSN datasets. Experimental results show that our proposed method achieves significantly superior POI recommendations compared with other state-of-the-art recommendation approaches in terms of location promotion.








IEEE 2017: SocialQ&A: An Online Social Network Based Question and Answer System


IEEE 2017 Data Mining


Abstract: Question and Answer (Q&A) systems play a vital role in our daily life for information and knowledge sharing. Users post questions and pick questions to answer in the system. Due to the rapidly growing user population and the number of questions, it is unlikely for a user to stumble upon a question by chance that (s)he can answer. Also, altruism does not encourage all users to provide answers, not to mention high-quality answers with a short answer wait time. The primary objective of this paper is to improve the performance of Q&A systems by actively forwarding questions to users who are capable of and willing to answer them. To this end, we have designed and implemented SocialQ&A, an online social network based Q&A system. SocialQ&A leverages the social network properties of common interest and mutual trust among friends to identify, through friendship, the users most likely to answer a given question, and to enhance user security. We further improve SocialQ&A with security and efficiency enhancements by protecting user privacy and identities, and by retrieving answers automatically for recurrent questions. We describe the architecture and algorithms, and conduct comprehensive large-scale simulations to evaluate SocialQ&A in comparison with other methods. Our results suggest that social networks can be leveraged to improve answer quality and reduce askers' waiting time. We also implemented a real prototype of SocialQ&A, and analyzed the Q&A behavior of real users and questions from a small-scale real-world Q&A system.








IEEE 2017: Modeling Urban Behavior by Mining Geotagged Social Data


IEEE 2017 Data Mining


Abstract: Data generated on location-based social networks provide rich information on the whereabouts of urban dwellers. Specifically, such data reveal who spends time where, when, and on what type of activity (e.g., shopping at a mall, or dining at a restaurant). That information can, in turn, be used to describe city regions in terms of activity that takes place therein. For example, the data might reveal that citizens visit one region mainly for shopping in the morning, while another for dining in the evening. Furthermore, once such a description is available, one can ask more elaborate questions. For example, one might ask what features distinguish one region from another – some regions might be different in terms of the type of venues they host and others in terms of the visitors they attract. As another example, one might ask which regions are similar across cities. In this paper, we present a method to answer such questions using publicly shared Foursquare data. Our analysis makes use of a probabilistic model, the features of which include the exact location of activity, the users who participate in the activity, as well as the time of the day and day of week the activity takes place. Compared to previous approaches to similar tasks, our probabilistic modeling approach allows us to make minimal assumptions about the data, which relieves us from having to set arbitrary parameters in our analysis (e.g., regarding the granularity of discovered regions or the importance of different features). We demonstrate how the model learned with our method can be used to identify the most likely and distinctive features of a geographical area, quantify the importance of the features used in the model, and discover similar regions across different cities. Finally, we perform an empirical comparison with previous work and discuss insights obtained through our findings.







IEEE 2017: SociRank: Identifying and Ranking Prevalent News Topics Using Social Media Factors

IEEE 2017 Data Mining


Abstract: Mass media sources, specifically the news media, have traditionally informed us of daily events. In modern times, social media services such as Twitter provide an enormous amount of user-generated data, which have great potential to contain informative news-related content. For these resources to be useful, we must find a way to filter noise and only capture the content that, based on its similarity to the news media, is considered valuable. However, even after noise is removed, information overload may still exist in the remaining data—hence, it is convenient to prioritize it for consumption. To achieve prioritization, information must be ranked in order of estimated importance considering three factors. First, the temporal prevalence of a particular topic in the news media is a factor of importance, and can be considered the media focus (MF) of a topic. Second, the temporal prevalence of the topic in social media indicates its user attention (UA). Last, the interaction between the social media users who mention this topic indicates the strength of the community discussing it, and can be regarded as the user interaction (UI) toward the topic. We propose an unsupervised framework—SociRank—which identifies news topics prevalent in both social media and the news media, and then ranks them by relevance using their degrees of MF, UA, and UI. Our experiments show that SociRank improves the quality and variety of automatically identified news topics.


IEEE 2016: SmartCrawler: A Two-stage Crawler for Efficiently Harvesting Deep-Web Interfaces
IEEE 2016 Data Mining


Abstract: As the deep web grows at a very fast pace, there has been increased interest in techniques that help efficiently locate deep-web interfaces. However, due to the large volume of web resources and the dynamic nature of the deep web, achieving wide coverage and high efficiency is a challenging issue. We propose a two-stage framework, namely SmartCrawler, for efficiently harvesting deep-web interfaces. In the first stage, SmartCrawler performs site-based searching for center pages with the help of search engines, avoiding visits to a large number of pages. To achieve more accurate results for a focused crawl, SmartCrawler ranks websites to prioritize highly relevant ones for a given topic. In the second stage, SmartCrawler achieves fast in-site searching by excavating the most relevant links with adaptive link-ranking. To eliminate bias toward visiting some highly relevant links in hidden web directories, we design a link tree data structure to achieve wider coverage of a website. Our experimental results on a set of representative domains show the agility and accuracy of our proposed crawler framework, which efficiently retrieves deep-web interfaces from large-scale sites and achieves higher harvest rates than other crawlers.


IEEE 2016 : Machine Learning Approach to Forecasting Urban Pollution
IEEE 2016 Data Mining


Abstract: This work addresses the question of how to predict fine particulate matter given a combination of weather conditions. A compilation of several years of meteorological data in the city of Quito, Ecuador, is used to build models using a machine learning approach. The study presents a decision tree algorithm that learns to classify the concentrations of fine aerosols into two categories (above vs. below 15 µg/m3) from a limited number of parameters, such as the level of precipitation and the wind speed and direction. Requiring few rules, the resulting models are able to infer the concentration outcome with significant accuracy. This fundamental research is intended as a preliminary step in the development of a web-based platform and smartphone app to alert the inhabitants of Ecuador's capital about risks to human health, with potential future application in other urban areas.
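A minimal sketch of the approach: a shallow decision tree classifying particulate concentration as above or below 15 µg/m3 from weather inputs. The synthetic data stands in for the Quito records and is purely illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the meteorological records (illustrative only).
rng = np.random.default_rng(0)
n = 1000
precipitation = rng.exponential(2.0, n)   # mm
wind_speed = rng.uniform(0, 10, n)        # m/s
wind_dir = rng.uniform(0, 360, n)         # degrees
# Toy generating rule: calm, dry conditions trap aerosols.
pm_high = ((precipitation < 1.0) & (wind_speed < 3.0)).astype(int)

X = np.column_stack([precipitation, wind_speed, wind_dir])
tree = DecisionTreeClassifier(max_depth=3).fit(X, pm_high)  # few rules, as in the paper
print(export_text(tree, feature_names=["precip", "wind_speed", "wind_dir"]))
```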


IEEE 2016 : FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce

IEEE 2016 Data Mining


Abstract: Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to this problem, we design a parallel frequent itemset mining algorithm called FiDoop using the MapReduce programming model. To achieve compressed storage and avoid building conditional pattern bases, FiDoop incorporates the frequent-items ultrametric tree rather than conventional FP-trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial third MapReduce job, the mappers independently decompose itemsets, the reducers perform combination operations by constructing small ultrametric trees, and these trees are then mined separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster is sensitive to data distribution and dimensions, because itemsets of different lengths have different decomposition and construction costs. To improve FiDoop's performance, we develop a workload balance metric to measure load balance across the cluster's computing nodes. We also develop FiDoop-HD, an extension of FiDoop, to speed up mining for high-dimensional data analysis. Extensive experiments using real-world celestial spectral data demonstrate that the proposed solution is efficient and scalable.
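The MapReduce style FiDoop builds on can be illustrated with plain Python functions: mappers emit (itemset, 1) pairs, a shuffle groups them by key, and reducers sum counts. This shows the programming model only; the frequent-items ultrametric tree and the three-job pipeline are not reproduced.

```python
from collections import defaultdict
from itertools import combinations

def mapper(transaction, k=2):
    # Emit (k-itemset, 1) pairs for each transaction.
    for itemset in combinations(sorted(transaction), k):
        yield itemset, 1

def shuffle(mapped):
    # Group emitted values by key, as the MapReduce framework would.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reducer(key, values):
    return key, sum(values)

transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
mapped = (pair for t in transactions for pair in mapper(t))
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
min_support = 2
print({k: c for k, c in counts.items() if c >= min_support})
```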



IEEE 2016 : Inverted Linear Quadtree: Efficient Top-k Spatial Keyword Search
IEEE 2016 Data Mining

Abstract: With advances in geo-positioning technologies and geo-location services, there is a rapidly growing amount of spatio-textual objects collected in many applications such as location-based services and social networks, in which an object is described by its spatial location and a set of keywords (terms). Consequently, the study of spatial keyword search, which explores both the location and the textual description of objects, has attracted great attention from commercial organizations and research communities. In this paper, we study two fundamental problems in spatial keyword queries: top-k spatial keyword search (TOPK-SK) and batch top-k spatial keyword search (BTOPK-SK). Given a set of spatio-textual objects, a query location, and a set of query keywords, TOPK-SK retrieves the closest k objects each of which contains all keywords in the query. BTOPK-SK is the batch processing of sets of TOPK-SK queries. Based on the inverted index and the linear quadtree, we propose a novel index structure, called the inverted linear quadtree (IL-Quadtree), which is carefully designed to exploit both spatial and keyword-based pruning techniques to effectively reduce the search space. An efficient algorithm is then developed to tackle top-k spatial keyword search. To further enhance the filtering capability of the signature of the linear quadtree, we propose a partition-based method. In addition, to deal with BTOPK-SK, we design a new computing paradigm which partitions the queries into groups based on both spatial proximity and textual relevance between queries. We show that the IL-Quadtree technique can also efficiently support BTOPK-SK. Comprehensive experiments on real and synthetic data clearly demonstrate the efficiency of our methods.
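Two ingredients the IL-Quadtree combines can be sketched directly: a linear quadtree keyed by Morton (Z-order) codes, and an inverted index from keyword to cell codes. The pruning machinery and the actual top-k search are omitted; the data is illustrative.

```python
from collections import defaultdict

def morton_code(x, y, bits=16):
    # Interleave the bits of the grid coordinates: the linear-quadtree cell key.
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return code

index = defaultdict(set)   # keyword -> set of quadtree cell codes
objects = [((3, 5), {"coffee", "wifi"}), ((8, 2), {"coffee"}), ((3, 5), {"pizza"})]
for (x, y), keywords in objects:
    cell = morton_code(x, y)
    for kw in keywords:
        index[kw].add(cell)

# Candidate cells for a query must appear in every keyword's posting set.
query = {"coffee", "wifi"}
candidates = set.intersection(*(index[kw] for kw in query))
print(sorted(candidates))
```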


IEEE 2016 : SPORE: A Sequential Personalized Spatial Item Recommender System
IEEE 2016 Data Mining


Abstract: With the rapid development of location-based social networks (LBSNs), spatial item recommendation has become an important way of helping users discover interesting locations to increase their engagement with location-based services. Although human movement exhibits sequential patterns in LBSNs, most current studies on spatial item recommendations do not consider the sequential influence of locations. Leveraging sequential patterns in spatial item recommendation is, however, very challenging, considering 1) users’ check-in data in LBSNs has a low sampling rate in both space and time, which renders existing prediction techniques on GPS trajectories ineffective; 2) the prediction space is extremely large, with millions of distinct locations as the next prediction target, which impedes the application of classical Markov chain models; and 3) there is no existing framework that unifies users’ personal interests and the sequential influence in a principled manner. In light of the above challenges, we propose a sequential personalized spatial item recommendation framework (SPORE) which introduces a novel latent variable topic-region to model and fuse sequential influence with personal interests in the latent and exponential space. The advantages of modeling the sequential effect at the topic-region level include a significantly reduced prediction space, an effective alleviation of data sparsity and a direct expression of the semantic meaning of users’ spatial activities. Furthermore, we design an asymmetric Locality Sensitive Hashing (ALSH) technique to speed up the online top-k recommendation process by extending the traditional LSH. We evaluate the performance of SPORE on two real datasets and one large-scale synthetic dataset. The results demonstrate a significant improvement in SPORE’s ability to recommend spatial items, in terms of both effectiveness and efficiency, compared with the state-of-the-art methods.



IEEE 2016: Truth Discovery in Crowdsourced Detection of Spatial Events
IEEE 2016 Data Mining

Abstract: The ubiquity of smartphones has led to the emergence of mobile crowdsourcing tasks such as the detection of spatial events when smartphone users move around in their daily lives. However, the credibility of those detected events can be negatively impacted by unreliable participants with low-quality data. Consequently, a major challenge in quality control is to discover true events from diverse and noisy participants’ reports. This truth discovery problem is uniquely distinct from its online counterpart in that it involves uncertainties in both participants’ mobility and reliability. Decoupling these two types of uncertainties through location tracking will raise severe privacy and energy issues, whereas simply ignoring missing reports or treating them as negative reports will significantly degrade the accuracy of the discovered truth. In this paper, we propose a new method to tackle this truth discovery problem through principled probabilistic modeling. In particular, we integrate the modeling of location popularity, location visit indicators, truth of events and three-way participant reliability in a unified framework. The proposed model is thus capable of efficiently handling various types of uncertainties and automatically discovering truth without any supervision or the need of location tracking. Experimental results demonstrate that our proposed method outperforms existing state-of-the-art truth discovery approaches in the mobile crowdsourcing environment.




IEEE 2016 : Sentiment Analysis of Top Colleges in India Using Twitter Data
IEEE 2016 Data Mining

Abstract: In today's world, the opinions and reviews accessible to us are among the most critical factors in formulating our views and influencing the success of a brand, product or service. With the advent and growth of social media, stakeholders often take to expressing their opinions on popular social media platforms, namely Twitter. While Twitter data is extremely informative, it presents a challenge for analysis because of its humongous and disorganized nature. This paper is a thorough effort to dive into the novel domain of performing sentiment analysis of people's opinions regarding top colleges in India. Besides taking additional preprocessing measures, such as the expansion of net lingo and removal of duplicate tweets, a probabilistic model based on Bayes' theorem was used for spelling correction, which is overlooked in other research studies. This paper also presents a comparison of the results obtained with the following machine learning algorithms: Naïve Bayes and Support Vector Machine, and an Artificial Neural Network model, the Multilayer Perceptron. Furthermore, a contrast is presented between four different kernels of SVM: RBF, linear, polynomial and sigmoid.
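The Bayes-theorem spelling-correction step is plausibly a Norvig-style corrector: pick the candidate maximizing the corpus prior among small-edit-distance candidates. The sketch below assumes a local corpus.txt; it is an illustration, not the paper's code.

```python
import re
from collections import Counter

# Word frequencies estimate the prior P(c); corpus.txt is an assumed local file.
WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))

def edits1(word):
    # All strings one edit away: deletes, transposes, replaces, inserts.
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    return {w for w in words if w in WORDS}

def correct(word):
    # Prefer the word itself, then one-edit candidates; break ties by P(c).
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: WORDS[w])

print(correct("collge"))   # -> "college", given a suitable corpus
```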


IEEE 2016 : FRAppE: Detecting Malicious Facebook Applications
IEEE 2016 Data Mining

Abstract: With 20 million installs a day [1], third-party apps are a major reason for the popularity and addictiveness of Facebook. Unfortunately, hackers have realized the potential of using apps for spreading malware and spam. The problem is already significant, as we find that at least 13% of apps in our dataset are malicious. So far, the research community has focused on detecting malicious posts and campaigns. In this paper, we ask the question: given a Facebook application, can we determine if it is malicious? Our key contribution is in developing FRAppE—Facebook’s Rigorous Application Evaluator—arguably the first tool focused on detecting malicious apps on Facebook. To develop FRAppE, we use information gathered by observing the posting behavior of 111K Facebook apps seen across 2.2 million users on Facebook. First, we identify a set of features that help us distinguish malicious apps from benign ones. For example, we find that malicious apps often share names with other apps, and they typically request fewer permissions than benign apps. Second, leveraging these distinguishing features, we show that FRAppE can detect malicious apps with 99.5% accuracy, with no false positives and a low false negative rate (4.1%). Finally, we explore the ecosystem of malicious Facebook apps and identify mechanisms that these apps use to propagate. Interestingly, we find that many apps collude and support each other; in our dataset, we find 1,584 apps enabling the viral propagation of 3,723 other apps through their posts. Long-term, we see FRAppE as a step towards creating an independent watchdog for app assessment and ranking, so as to warn Facebook users before installing apps.




IEEE 2016: Practical Approximate k Nearest Neighbor Queries with Location and Query Privacy
IEEE 2016 Data Mining

Abstract: In mobile communication, spatial queries pose a serious threat to user location privacy because the location of a query may reveal sensitive information about the mobile user. In this paper, we study approximate k nearest neighbor (kNN) queries where the mobile user queries the location-based service (LBS) provider about approximate k nearest points of interest (POIs) on the basis of his current location. We propose a basic solution and a generic solution for the mobile user to preserve his location and query privacy in approximate kNN queries. The proposed solutions are mainly built on the Paillier public-key cryptosystem and can provide both location and query privacy. To preserve query privacy, our basic solution allows the mobile user to retrieve one type of POIs, for example, approximate k nearest car parks, without revealing to the LBS provider what type of points is retrieved. Our generic solution can be applied to multiple discrete type attributes of private location-based queries. Compared with existing solutions for kNN queries with location privacy, our solution is more efficient. Experiments have shown that our solution is practical for kNN queries.
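The scheme builds on the Paillier cryptosystem's additive homomorphism, which the third-party python-paillier package (assumed installed as `phe`) demonstrates in a few lines; this shows the primitive only, not the paper's full kNN protocol.

```python
from phe import paillier  # third-party "python-paillier" package (assumed installed)

# Paillier ciphertexts can be added, and multiplied by plaintext constants,
# without decryption, so a server can evaluate distance-related expressions
# on encrypted coordinates.
public_key, private_key = paillier.generate_paillier_keypair()

x = public_key.encrypt(37)      # e.g., an encrypted coordinate
y = public_key.encrypt(5)

enc_sum = x + y                 # homomorphic addition of ciphertexts
enc_scaled = x * 3              # multiplication by a plaintext scalar

print(private_key.decrypt(enc_sum))     # 42
print(private_key.decrypt(enc_scaled))  # 111
```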


IEEE 2016: A Novel Pipeline Approach for Efficient Big Data Broadcasting
IEEE 2016 Data Mining

Abstract: Big-data computing is a new critical challenge for the ICT industry. Engineers and researchers are dealing with data sets of petabyte scale in the cloud computing paradigm. Thus, the demand for building a service stack to distribute, manage, and process massive data sets has risen drastically. In this paper, we investigate the Big Data Broadcasting problem for a single source node to broadcast a big chunk of data to a set of nodes with the objective of minimizing the maximum completion time. These nodes may be located in the same data center or across geo-distributed data centers. This problem is one of the fundamental problems in distributed computing and is known to be NP-hard in heterogeneous environments. We model the Big-data broadcasting problem as a LockStep Broadcast Tree (LSBT) problem. The main idea of the LSBT model is to define a basic unit of upload bandwidth, r, such that a node with capacity c broadcasts data to a set of floor(c/r) children at the rate r. Note that r is a parameter to be optimized as part of the LSBT problem. We further divide the broadcast data into m chunks. These data chunks can then be broadcast down the LSBT in a pipelined manner. In a homogeneous network environment in which each node has the same upload capacity c, we show that the optimal uplink rate r* of LSBT is either c/2 or c/3, whichever gives the smaller maximum completion time. For heterogeneous environments, we present an O(n log^2 n) algorithm to select an optimal uplink rate r* and to construct an optimal LSBT. Numerical results show that our approach performs well, with less maximum completion time and lower computational complexity than other efficient solutions in the literature.
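Under a standard pipelined-broadcast time model, an assumption here rather than the paper's exact analysis, one can compare the two candidate rates numerically: sending m chunks of a D-unit file down a tree of depth d at rate r takes roughly (d + m - 1) * D / (m * r).

```python
import math

def lsbt_completion_time(n, c, r, D, m):
    """Pipelined-broadcast time estimate for n nodes, capacity c, rate r,
    data size D, and m chunks (a standard model, not the paper's formula)."""
    k = int(c // r)                 # fan-out: floor(c / r) children per node
    if k < 1:
        return math.inf
    # Depth of a complete k-ary tree holding n nodes.
    d, covered = 0, 1
    while covered < n:
        d += 1
        covered += k ** d
    return (d + m - 1) * D / (m * r)

n, c, D, m = 1000, 100.0, 1000.0, 64    # nodes, capacity (MB/s), size (MB), chunks
for divisor in (2, 3):                  # the two optimal candidates c/2 and c/3
    r = c / divisor
    print(f"r = c/{divisor}: T = {lsbt_completion_time(n, c, r, D, m):.1f} s")
```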


IEEE 2016 : Nearest Keyword Set Search in Multi-dimensional Datasets
IEEE 2016 Data Mining

Abstract: Keyword-based search in text-rich multi-dimensional datasets facilitates many novel applications and tools. In this paper, we consider objects that are tagged with keywords and are embedded in a vector space. For these datasets, we study queries that ask for the tightest groups of points satisfying a given set of keywords. We propose a novel method called ProMiSH (Projection and Multi Scale Hashing) that uses random projection and hash-based index structures, and achieves high scalability and speedup. We present an exact and an approximate version of the algorithm. Our experimental results on real and synthetic datasets show that ProMiSH has up to 60 times of speedup over state-of-the-art tree-based techniques.
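A minimal sketch of the hashing idea behind ProMiSH: project points onto random unit vectors and bucket them by quantized projection values so nearby points tend to collide. The bucket width and projection count are assumptions; the multi-scale index and keyword filtering are omitted.

```python
import numpy as np

rng = np.random.default_rng(7)
points = rng.normal(size=(1000, 32))      # 1000 objects in a 32-d vector space

# Random unit directions; each point hashes to one quantized value per direction.
n_proj, width = 4, 2.0
directions = rng.normal(size=(32, n_proj))
directions /= np.linalg.norm(directions, axis=0)

keys = np.floor(points @ directions / width).astype(int)

buckets = {}
for idx, key in enumerate(map(tuple, keys)):
    buckets.setdefault(key, []).append(idx)

# Candidates for a query are the points sharing its bucket key.
query = points[0]
qkey = tuple(np.floor(query @ directions / width).astype(int))
print("candidates sharing the query's bucket:", buckets.get(qkey, []))
```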


IEEE 2016: Mining User-Aware Rare Sequential Topic Patterns in Document Streams
IEEE 2016 Data Mining

Abstract: Textual documents created and distributed on the Internet are ever changing in various forms. Most existing works are devoted to topic modeling and the evolution of individual topics, while the sequential relations of topics in successive documents published by a specific user are ignored. In this paper, in order to characterize and detect personalized and abnormal behaviors of Internet users, we propose Sequential Topic Patterns (STPs) and formulate the problem of mining User-aware Rare Sequential Topic Patterns (URSTPs) in document streams on the Internet. They are rare on the whole but relatively frequent for specific users, so they can be applied in many real-life scenarios, such as real-time monitoring of abnormal user behaviors. We present a group of algorithms to solve this innovative mining problem through three phases: preprocessing to extract probabilistic topics and identify sessions for different users, generating all STP candidates with (expected) support values for each user by pattern-growth, and selecting URSTPs by performing user-aware rarity analysis on the derived STPs. Experiments on both real (Twitter) and synthetic datasets show that our approach can discover special users and interpretable URSTPs effectively and efficiently, which significantly reflect users' characteristics.


IEEE 2016: Cross-Platform Identification of Anonymous Identical Users in Multiple Social Media Networks
IEEE 2016 Data Mining

Abstract: The last few years have witnessed the emergence and evolution of a vibrant research stream on a large variety of online Social Media Network (SMN) platforms. Recognizing anonymous, yet identical users among multiple SMNs is still an intractable problem. Clearly, cross-platform exploration may help solve many problems in social computing in both theory and applications. Since public profiles can be duplicated and easily impersonated by users with different purposes, most current user identification resolutions, which mainly focus on text mining of users' public profiles, are fragile. Some studies have attempted to match users based on the location and timing of user content as well as writing style. However, locations are sparse in the majority of SMNs, and writing style is difficult to discern from the short sentences of leading SMNs such as Sina Microblog and Twitter. Moreover, since online SMNs are quite symmetric, existing user identification schemes based on network structure are not effective. The real-world friend cycle is highly individual, and virtually no two users share a congruent friend cycle. Therefore, it is more accurate to use a friendship structure to analyze cross-platform SMNs. Since identical users tend to set up partially similar friendship structures in different SMNs, we propose the Friend Relationship-Based User Identification (FRUI) algorithm. FRUI calculates a match degree for all candidate User Matched Pairs (UMPs), and only UMPs with top ranks are considered identical users. We also developed two propositions to improve the efficiency of the algorithm. Results of extensive experiments demonstrate that FRUI performs much better than current network-structure-based algorithms.
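FRUI's central quantity, the match degree of a candidate User Matched Pair, can be sketched as the number of already-identified friend pairs shared across the two networks; the toy seed mapping below is illustrative, and the iterative top-rank selection loop is collapsed to one pass.

```python
# Toy friend lists for two networks A and B (illustrative data only).
friends_a = {"a1": {"a2", "a3", "a4"}, "a5": {"a2", "a3"}}
friends_b = {"b1": {"b2", "b3", "b9"}, "b5": {"b2", "b3"}}

identified = {"a2": "b2", "a3": "b3", "a4": "b8"}   # seed identical users

def match_degree(u_a, u_b):
    # Count friends of u_a whose identified counterpart is a friend of u_b.
    return sum(1 for f in friends_a[u_a]
               if identified.get(f) in friends_b[u_b])

candidates = [("a1", "b1"), ("a1", "b5"), ("a5", "b1"), ("a5", "b5")]
ranked = sorted(candidates, key=lambda p: match_degree(*p), reverse=True)
print(ranked[0], "with degree", match_degree(*ranked[0]))
```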

IEEE 2015 : Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions
IEEE 2015 Transaction on Data Mining
Abstract: The large number of potential applications from bridging Web data with knowledge bases has led to an increase in entity linking research. Entity linking is the task of linking entity mentions in text with their corresponding entities in a knowledge base. Potential applications include information extraction, information retrieval, and knowledge base population. However, this task is challenging due to name variations and entity ambiguity. In this survey, we present a thorough overview and analysis of the main approaches to entity linking, and discuss various applications, the evaluation of entity linking systems, and future directions.

IEEE 2015 : Rule-Based Method for Entity Resolution
IEEE 2015 Transaction on Data Mining

Abstract: The objective of entity resolution (ER) is to identify records referring to the same real-world entity. Traditional ER approaches identify records based on pairwise similarity comparisons, which assume that records referring to the same entity are more similar to each other than otherwise. However, this assumption does not always hold in practice, and similarity comparisons do not work well when it breaks. We propose a new class of rules which can describe complex matching conditions between records and entities. Based on this class of rules, we present the rule-based entity resolution problem and develop an online approach for ER. In this framework, by applying rules to each record, we identify which entity the record refers to. Additionally, we propose an effective and efficient rule discovery algorithm. We experimentally evaluated our rule-based ER algorithm on real data sets. The experimental results show that both our rule discovery algorithm and our rule-based ER algorithm achieve high performance.
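A minimal sketch of the rule-based idea: each rule is a predicate that maps a record directly to an entity when its condition fires, instead of relying on pairwise similarity alone. The rules and records below are illustrative assumptions.

```python
# Toy records and rules (illustrative only; real rules come from rule discovery).
records = [
    {"name": "J. Smith", "email": "jsmith@acme.com", "phone": None},
    {"name": "John Smith", "email": None, "phone": "555-0199"},
    {"name": "Jane Doe", "email": "jdoe@example.org", "phone": None},
]

rules = [
    # (condition on the record, entity it resolves to)
    (lambda r: r["email"] == "jsmith@acme.com", "ENT-001"),
    (lambda r: r["phone"] == "555-0199", "ENT-001"),
    (lambda r: r["email"] == "jdoe@example.org", "ENT-002"),
]

def resolve(record):
    # Apply rules in order; the first firing rule decides the entity.
    for condition, entity in rules:
        if condition(record):
            return entity
    return None   # unmatched records would feed back into rule discovery

for r in records:
    print(r["name"], "->", resolve(r))
```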

IEEE 2015 : Secure Distributed Deduplication Systems with Improved Reliability
IEEE 2015 Transaction on Data Mining

Abstract: Data deduplication is a technique for eliminating duplicate copies of data, and has been widely used in cloud storage to reduce storage space and upload bandwidth. However, only one copy of each file is stored in the cloud even if such a file is owned by a huge number of users. As a result, a deduplication system improves storage utilization while reducing reliability. Furthermore, the challenge of privacy for sensitive data also arises when it is outsourced by users to the cloud. Aiming to address the above security challenges, this paper makes the first attempt to formalize the notion of a distributed reliable deduplication system. We propose new distributed deduplication systems with higher reliability in which the data chunks are distributed across multiple cloud servers. The security requirements of data confidentiality and tag consistency are also achieved by introducing a deterministic secret sharing scheme in distributed storage systems, instead of using convergent encryption as in previous deduplication systems. Security analysis demonstrates that our deduplication systems are secure in terms of the definitions specified in the proposed security model. As a proof of concept, we implement the proposed systems and demonstrate that the incurred overhead is very limited in realistic environments.



IEEE 2015 : The Internet of Things for Health Care: A Comprehensive Survey
IEEE 2015 Transaction on Data Mining

Abstract: The Internet of Things (IoT) makes smart objects the ultimate building blocks in the development of cyber-physical smart pervasive frameworks. The IoT has a variety of application domains, including health care. The IoT revolution is redesigning modern health care with promising technological, economic, and social prospects. This paper surveys advances in IoT-based health care technologies and reviews the state-of-the-art network architectures/platforms, applications, and industrial trends in IoT-based health care solutions. In addition, this paper analyzes distinct IoT security and privacy features, including security requirements, threat models, and attack taxonomies from the health care perspective. Further, this paper proposes an intelligent collaborative security model to minimize security risk; discusses how different innovations such as big data, ambient intelligence, and wearables can be leveraged in a health care context; addresses various IoT and eHealth policies and regulations across the world to determine how they can facilitate economies and societies in terms of sustainable development; and provides some avenues for future research on IoT-based health care based on a set of open issues and challenges.




IEEE 2015 : Friendbook: A Semantic-based Friend Recommendation System for Social Networks
IEEE 2015 Transaction on Data Mining

Abstract—Existing social networking services recommend friends to users based on their social graphs, which may not be the most appropriate way to reflect a user's preferences on friend selection in real life. In this paper, we present Friendbook, a novel semantic-based friend recommendation system for social networks, which recommends friends to users based on their life styles instead of social graphs. By taking advantage of sensor-rich smartphones, Friendbook discovers life styles of users from user-centric sensor data, measures the similarity of life styles between users, and recommends friends to users if their life styles have high similarity. Inspired by text mining, we model a user's daily life as life documents, from which his/her life styles are extracted using the Latent Dirichlet Allocation algorithm. We further propose a similarity metric to measure the similarity of life styles between users, and calculate users' impact in terms of life styles with a friend-matching graph. Upon receiving a request, Friendbook returns a list of people with the highest recommendation scores to the query user. Finally, Friendbook integrates a feedback mechanism to further improve recommendation accuracy. We have implemented Friendbook on Android-based smartphones, and evaluated its performance in both small-scale experiments and large-scale simulations. The results show that the recommendations accurately reflect the preferences of users in choosing friends.
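The life-document pipeline can be sketched with scikit-learn: LDA turns each user's activity stream into a topic (life-style) distribution, and users are compared by the similarity of those distributions. The documents below are illustrative stand-ins for sensor-derived activities.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# One "life document" per user: a bag of recognized activities (toy data).
life_documents = [
    "running gym running cycling salad gym",
    "gym running protein cycling running",
    "movies gaming pizza gaming movies late_night",
]

counts = CountVectorizer().fit_transform(life_documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
styles = lda.fit_transform(counts)      # per-user life-style distributions

sim = cosine_similarity(styles)
print(sim.round(2))   # users 0 and 1 should score higher than either with user 2
```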


IEEE 2015 : Privacy-Preserving Detection of Sensitive Data Exposure
IEEE 2015 Transaction on Data Mining

Abstract—Statistics from security firms, research institutions and government organizations show that the number of data-leak instances has grown rapidly in recent years. Among various data-leak cases, human mistakes are one of the main causes of data loss. Solutions exist that detect inadvertent sensitive data leaks caused by human mistakes and provide alerts for organizations. A common approach is to screen content in storage and transmission for exposed sensitive information. Such an approach usually requires the detection operation to be conducted in secrecy. However, this secrecy requirement is challenging to satisfy in practice, as detection servers may be compromised or outsourced. In this paper, we present a privacy-preserving data-leak detection (DLD) solution to address this issue, in which a special set of sensitive data digests is used in detection. The advantage of our method is that it enables the data owner to safely delegate the detection operation to a semi-honest provider without revealing the sensitive data to the provider. We describe how Internet service providers can offer their customers DLD as an add-on service with strong privacy guarantees. The evaluation results show that our method can support accurate detection with a very small number of false alarms under various data-leak scenarios.
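A minimal sketch of digest-based screening under stated assumptions (shingle length, sample strings): the owner releases only one-way digests of sensitive n-grams, and the provider reports overlap with digests of observed traffic without seeing plaintext. This conveys the flavor of the method, not the paper's exact digest construction.

```python
import hashlib

def shingle_digests(text, n=8):
    # One-way digests of all n-character shingles of the text.
    return {hashlib.sha256(text[i:i + n].encode()).hexdigest()
            for i in range(len(text) - n + 1)}

sensitive = "customer SSN 123-45-6789 record"
traffic = "outbound mail mentions SSN 123-45-6789 today"

owner_digests = shingle_digests(sensitive)        # shared with the provider
overlap = owner_digests & shingle_digests(traffic)
print(f"{len(overlap)} matching shingle digests -> possible leak"
      if overlap else "no match")
```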



IEEE 2015 : Constructing a Global Social Service Network for Better Quality of Web Service Discovery
IEEE 2015 Transaction on Data Mining

Abstract—Web services have had a tremendous impact on the Web by supporting a distributed service-based economy on a global scale. However, despite this outstanding progress, their uptake on a Web scale has been significantly less than initially anticipated. The isolation of services and the lack of social relationships among related services have been identified as reasons for the poor uptake. In this paper, we propose connecting the isolated service islands into a global social service network to enhance the services' sociability on a global scale. First, we propose linked social service-specific principles based on linked data principles for publishing services on the open Web as linked social services. Then, we suggest a new framework for constructing the global social service network following these principles, based on complex network theories. Next, an approach is proposed to enable exploitation of the global social service network, providing Linked Social Services as a Service. Finally, experimental results show that our approach can address the quality of service discovery problem, improving both service discovery time and success rate by exploring service-to-service relationships in the global social service network.




IEEE 2015 : PAGE: A Partition Aware Engine for Parallel Graph Computation
IEEE 2015 Transaction on Data Mining

Abstract—Graph partition quality affects the overall performance of parallel graph computation systems. The quality of a graph partition is measured by the balance factor and edge cut ratio. A balanced graph partition with a small edge cut ratio is generally preferred since it reduces expensive network communication costs. However, according to an empirical study on Giraph, performance over a well-partitioned graph might be even two times worse than over simple random partitions. This is because these systems only optimize for simple partition strategies and cannot efficiently handle the increased workload of local message processing when a high-quality graph partition is used. In this paper, we propose a novel partition-aware graph computation engine named PAGE, which equips a new message processor and a dynamic concurrency control model. The new message processor concurrently processes local and remote messages in a unified way. The dynamic model adaptively adjusts the concurrency of the processor based on online statistics. The experimental evaluation demonstrates the superiority of PAGE over graph partitions of various qualities.


IEEE 2015 : Identity-based Encryption with Outsourced Revocation in Cloud Computing
IEEE 2015 Transaction on Data Mining

Abstract—Identity-Based Encryption (IBE), which simplifies the public key and certificate management of Public Key Infrastructure (PKI), is an important alternative to public key encryption. However, one of the main efficiency drawbacks of IBE is the overhead computation at the Private Key Generator (PKG) during user revocation. Efficient revocation has been well studied in the traditional PKI setting, but the cumbersome management of certificates is precisely the burden that IBE strives to alleviate. In this paper, aiming at tackling the critical issue of identity revocation, we introduce outsourcing computation into IBE for the first time and propose a revocable IBE scheme in the server-aided setting. Our scheme offloads most of the key generation related operations during key-issuing and key-update processes to a Key Update Cloud Service Provider, leaving only a constant number of simple operations for the PKG and users to perform locally. This goal is achieved by utilizing a novel collusion-resistant technique: we employ a hybrid private key for each user, in which an AND gate is involved to connect and bound the identity component and the time component. Furthermore, we propose another construction which is provably secure under the recently formalized Refereed Delegation of Computation model. Finally, we provide extensive experimental results to demonstrate the efficiency of our proposed construction.




IEEE 2015 : Co-Extracting Opinion Targets and Opinion Words from Online Reviews Based on the Word Alignment Model
IEEE 2015 Transaction on Data Mining

Abstract—Mining opinion targets and opinion words from online reviews are important tasks for fine-grained opinion mining, the key component of which involves detecting opinion relations among words. To this end, this paper proposes a novel approach based on the partially-supervised alignment model, which regards identifying opinion relations as an alignment process. Then, a graph-based co-ranking algorithm is exploited to estimate the confidence of each candidate. Finally, candidates with higher confidence are extracted as opinion targets or opinion words. Compared to previous methods based on the nearest-neighbor rules, our model captures opinion relations more precisely, especially for long-span relations. Compared to syntax-based methods, our word alignment model effectively alleviates the negative effects of parsing errors when dealing with informal online texts. In particular, compared to the traditional unsupervised alignment model, the proposed model obtains better precision because of the usage of partial supervision. In addition, when estimating candidate confidence, we penalize higher-degree vertices in our graph-based co-ranking algorithm to decrease the probability of error generation. Our experimental results on three corpora with different sizes and languages show that our approach effectively outperforms state-of-the-art methods.             



IEEE 2015 : Research Directions for Engineering Big Data Analytics Software
IEEE 2015 Transaction on Data Mining

Abstract: Many software startups and research and development efforts are actively taking place to harness the power of big data and create software with potential to improve almost every aspect of human life. As these efforts continue to increase, full consideration needs to be given to engineering aspects of big data software. Since these systems exist to make predictions on complex and continuous massive datasets, they pose unique problems during specification, design, and verification of software that needs to be delivered on-time and within budget. But, given the nature of big data software, can this be done? Does big data software engineering really work? This article explores details of big data software, discusses the main problems encountered when engineering big data software, and proposes avenues for future research.


IEEE 2015 : Massive MIMO as a Big Data System: Random Matrix Models and Testbed
IEEE 2015 Transaction on Data Mining

ABSTRACT : The paper has two parts. The first deals with how to use large random matrices as building blocks to model the massive data arising from massive (large-scale) MIMO systems; we apply this model to distributed spectrum sensing and network monitoring. This part boils down to streaming, distributed massive data, for which a new algorithm is obtained and its performance is derived using a central limit theorem recently obtained in the literature. The second part deals with a large-scale testbed built from software-defined radios (particularly USRP); developing this 70-node network testbed took us more than four years. To demonstrate the power of software-defined radio, we quickly reconfigure our testbed into a testbed for massive MIMO. The massive data of this testbed is of central interest in this paper, and this is the first time we model the experimental data arising from it. To the best of our knowledge, we are not aware of other similar work.



IEEE 2015 : A Secure and Dynamic Multi-keyword Ranked Search Scheme over Encrypted Cloud Data
IEEE 2015 Transaction on Data Mining

ABSTRACT : Due to the increasing popularity of cloud computing, more and more data owners are motivated to outsource their data to cloud servers for great convenience and reduced cost in data management. However, sensitive data should be encrypted before outsourcing to meet privacy requirements, which makes traditional data utilization, such as keyword-based document retrieval, difficult. In this paper, we present a secure multi-keyword ranked search scheme over encrypted cloud data which simultaneously supports dynamic update operations such as deletion and insertion of documents. Specifically, the vector space model and the widely used TF-IDF model are combined in index construction and query generation. We construct a special tree-based index structure and propose a “Greedy Depth-first Search” algorithm to provide efficient multi-keyword ranked search. The secure kNN algorithm is utilized to encrypt the index and query vectors, and meanwhile ensure accurate relevance score calculation between the encrypted index and query vectors. In order to resist statistical attacks, phantom terms are added to the index vector to blind the search results. Due to the use of our special tree-based index structure, the proposed scheme achieves sub-linear search time and handles the deletion and insertion of documents flexibly. Extensive experiments are conducted to demonstrate the efficiency of the proposed scheme.
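To make the tree-plus-GDFS idea concrete, here is a plaintext Python sketch; the actual scheme additionally encrypts the index and query vectors with the secure kNN technique. Each internal node stores the coordinate-wise maximum of its children's TF-IDF vectors, which upper-bounds every descendant's relevance score and lets whole subtrees be pruned. The bottom-up pairing used to build the tree is an arbitrary illustrative choice.

import heapq
import numpy as np

class Node:
    def __init__(self, vec, left=None, right=None, doc_id=None):
        self.vec, self.left, self.right, self.doc_id = vec, left, right, doc_id

def build(vectors):
    """Pair nodes bottom-up; each parent holds the coordinate-wise max of its children."""
    nodes = [Node(v, doc_id=i) for i, v in enumerate(vectors)]
    while len(nodes) > 1:
        a, b = nodes.pop(), nodes.pop()
        nodes.insert(0, Node(np.maximum(a.vec, b.vec), a, b))
    return nodes[0]

def gdfs(node, query, k, heap):
    """Greedy depth-first search keeping a min-heap of the k best (score, doc_id)."""
    bound = float(node.vec @ query)            # upper bound on any leaf in this subtree
    if len(heap) == k and bound <= heap[0][0]:
        return                                 # prune: subtree cannot beat current top-k
    if node.doc_id is not None:                # leaf: bound is the exact relevance score
        heapq.heappush(heap, (bound, node.doc_id))
        if len(heap) > k:
            heapq.heappop(heap)
        return
    gdfs(node.left, query, k, heap)
    gdfs(node.right, query, k, heap)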

IEEE 2015 : Generating Searchable Public-Key Ciphertexts with Hidden Structures for Fast Keyword Search
IEEE 2015 Transaction on Data Mining

ABSTRACT : Existing semantically secure public-key searchable encryption schemes require search time linear in the total number of ciphertexts. This makes retrieval from large-scale databases prohibitive. To alleviate this problem, this paper proposes Searchable Public-Key Ciphertexts with Hidden Structures (SPCHS), enabling keyword search as fast as possible without sacrificing semantic security of the encrypted keywords. In SPCHS, all keyword-searchable ciphertexts are structured by hidden relations, and with the search trapdoor corresponding to a keyword, the minimum information about the relations is disclosed to a search algorithm as guidance to find all matching ciphertexts efficiently. We construct an SPCHS scheme from scratch in which the ciphertexts have a hidden star-like structure. We prove our scheme to be semantically secure in the Random Oracle (RO) model. The search complexity of our scheme depends on the actual number of ciphertexts containing the queried keyword, rather than on the number of all ciphertexts. Finally, we present a generic SPCHS construction from anonymous identity-based encryption and collision-free full-identity malleable Identity-Based Key Encapsulation Mechanism (IBKEM) with anonymity. We illustrate two collision-free full-identity malleable IBKEM instances, which are semantically secure and anonymous, respectively, in the RO and standard models. The latter instance enables us to construct an SPCHS scheme with semantic security in the standard model.

IEEE 2015 : k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data
IEEE 2015 Transaction on Data Mining
ABSTRACT : Data mining has wide applications in many areas such as banking, medicine, scientific research, and government agencies. Classification is one of the commonly used tasks in data mining applications. For the past decade, due to the rise of various privacy issues, many theoretical and practical solutions to the classification problem have been proposed under different security models. However, with the recent popularity of cloud computing, users now have the opportunity to outsource their data, in encrypted form, as well as the data mining tasks, to the cloud. Since the data on the cloud is in encrypted form, existing privacy-preserving classification techniques are not applicable. In this paper, we focus on solving the classification problem over encrypted data. In particular, we propose a secure k-NN classifier over encrypted data in the cloud. The proposed protocol protects the confidentiality of the data and the privacy of the user's input query, and hides the data access patterns. To the best of our knowledge, our work is the first to develop a secure k-NN classifier over encrypted data under the semi-honest model. Also, we empirically analyze the efficiency of our proposed protocol using a real-world dataset under different parameter settings.
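The end result the protocol computes is an ordinary k-NN majority vote; the plaintext sketch below shows that target functionality. The paper's contribution is computing it over encrypted data without revealing inputs, queries, or access patterns, which this sketch does not attempt.

from collections import Counter
import numpy as np

def knn_classify(X, y, query, k=5):
    """Plaintext k-NN: X is an (n, d) array of records, y their class labels."""
    dists = np.linalg.norm(X - query, axis=1)   # Euclidean distance to every record
    nearest = np.argsort(dists)[:k]             # indices of the k closest records
    return Counter(y[i] for i in nearest).most_common(1)[0][0]  # majority vote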


IEEE 2015 : Towards Effective Bug Triage with Software Data Reduction Techniques
IEEE 2015 Transaction on Data Mining

ABSTRACT : Software companies spend over 45 percent of their cost on dealing with software bugs. An inevitable step of fixing bugs is bug triage, which aims to correctly assign a developer to a new bug. To decrease the time cost of manual work, text classification techniques are applied to conduct automatic bug triage. In this paper, we address the problem of data reduction for bug triage, i.e., how to reduce the scale and improve the quality of bug data. We combine instance selection with feature selection to simultaneously reduce data scale on the bug dimension and the word dimension. To determine the order of applying instance selection and feature selection, we extract attributes from historical bug data sets and build a predictive model for a new bug data set. We empirically investigate the performance of data reduction on a total of 600,000 bug reports from two large open source projects, namely Eclipse and Mozilla. The results show that our data reduction can effectively reduce the data scale and improve the accuracy of bug triage. Our work provides an approach to leveraging data processing techniques to form reduced and high-quality bug data in software development and maintenance.
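A hedged sketch of the "instance selection then feature selection" direction of the data-reduction pipeline, with generic stand-ins (a crude near-duplicate filter for instance selection, and scikit-learn's chi-squared scoring for feature selection) in place of the paper's specific IS and FS algorithms:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

def reduce_bug_data(reports, developers, n_features=1000):
    """reports: list of bug report texts; developers: the assigned developer per report."""
    X = CountVectorizer(stop_words="english").fit_transform(reports)
    seen, keep = set(), []
    for i, text in enumerate(reports):      # crude instance selection:
        key = text[:200]                    # drop near-duplicate reports by prefix
        if key not in seen:
            seen.add(key)
            keep.append(i)
    X, y = X[keep], [developers[i] for i in keep]
    # feature selection: keep the words most associated with developer labels
    X = SelectKBest(chi2, k=min(n_features, X.shape[1])).fit_transform(X, y)
    return X, y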


IEEE 2015 : Self-Organizing Neural Networks Integrating Domain Knowledge and Reinforcement Learning
IEEE 2015 Transaction on Data Mining
ABSTRACT : The use of domain knowledge in learning systems is expected to improve learning efficiency and reduce model complexity. However, due to incompatibility with the knowledge structure of the learning systems and the real-time exploratory nature of reinforcement learning (RL), domain knowledge cannot be inserted directly. In this paper, we show how self-organizing neural networks designed for online and incremental adaptation can integrate domain knowledge and RL. Specifically, symbol-based domain knowledge is translated into numeric patterns before being inserted into the self-organizing neural networks. To ensure effective use of domain knowledge, we present an analysis of how the inserted knowledge is used by the self-organizing neural networks during RL. To this end, we propose a vigilance adaptation and greedy exploitation strategy to maximize exploitation of the inserted domain knowledge while retaining the plasticity of learning and using new knowledge. Our experimental results based on the pursuit-evasion and minefield navigation problem domains show that such self-organizing neural networks can make effective use of domain knowledge to improve learning efficiency and reduce model complexity.



IEEE 2015 : Active Learning for Ranking through Expected Loss Optimization
IEEE 2015 Transaction on Data Mining

ABSTRACT : Learning to rank arises in many information retrieval applications, ranging from Web search engines and online advertising to recommendation systems. In learning to rank, the performance of a ranking model is strongly affected by the number of labeled examples in the training set; on the other hand, obtaining labeled examples for training data is very expensive and time-consuming. This presents a great need for active learning approaches that select the most informative examples for ranking; however, in the literature there is still very limited work addressing active learning for ranking. In this paper, we propose a general active learning framework, Expected Loss Optimization (ELO), for ranking. The ELO framework is applicable to a wide range of ranking functions. Under this framework, we derive a novel algorithm, Expected DCG Loss Optimization (ELO-DCG), to select the most informative examples. Furthermore, we investigate both query-level and document-level active learning for ranking and propose a two-stage ELO-DCG algorithm which incorporates both query and document selection into active learning. Extensive experiments on real-world Web search data sets demonstrate the great potential and effectiveness of the proposed framework and algorithms.
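The following Monte-Carlo sketch conveys the expected-DCG-loss criterion: examples whose predicted relevance is most uncertain induce the largest expected DCG loss and are selected for labeling. The Gaussian score-noise model is an assumption made for illustration, not the paper's exact estimator.

import numpy as np

def dcg(rels):
    """Discounted cumulative gain of a relevance sequence in ranked order."""
    return sum(r / np.log2(i + 2) for i, r in enumerate(rels))

def expected_dcg_loss(mean_scores, std_scores, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    model_order = np.argsort(-mean_scores)          # ranking the current model outputs
    loss = 0.0
    for _ in range(n_samples):
        rel = rng.normal(mean_scores, std_scores)   # one draw of plausible true relevance
        ideal = dcg(sorted(rel, reverse=True))      # best DCG achievable on this draw
        actual = dcg(rel[model_order])              # DCG of the model's fixed ranking
        loss += ideal - actual
    return loss / n_samples                         # label the queries/docs maximizing this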



IEEE 2015 : Discover the Expert: Context-Adaptive Expert Selection for Medical Diagnosis
IEEE 2015 Transaction on Data Mining

ABSTRACT : In this paper, we propose an expert selection system that learns online the best expert to assign to each patient depending on the context of the patient. In general, the context can include an enormous number and variety of information related to the patient's health condition, age, gender, previous drug doses, and so forth, but the most relevant information is embedded in only a few contexts. If these most relevant contexts were known in advance, learning would be relatively simple, but they are not. Moreover, the relevant contexts may be different for different health conditions. To address these challenges, we develop a new class of algorithms aimed at discovering the most relevant contexts and the best clinic and expert to use to make a diagnosis given a patient's contexts. We prove that as the number of patients grows, the proposed context-adaptive algorithm will discover the optimal expert to select for patients with a specific context. Moreover, the algorithm also provides confidence bounds on the diagnostic accuracy of the expert it selects, which can be considered by the primary care physician before making the final decision. While our algorithm is general and can be applied in numerous medical scenarios, we illustrate its functionality and performance by applying it to a real-world breast cancer diagnosis data set. Finally, while the application we consider in this paper is medical diagnosis, our proposed algorithm can be applied in other environments where expertise needs to be discovered.
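One way to picture the setting is as a contextual bandit: discretize the context space into cells and run an upper-confidence-bound (UCB) rule over experts within each cell. The sketch below fixes the cell key in advance, whereas the paper's algorithm additionally learns which context dimensions are relevant; the class and its methods are illustrative assumptions, not the paper's algorithm.

import math
from collections import defaultdict

class ContextualExpertSelector:
    def __init__(self, experts):
        self.experts = experts
        self.n = defaultdict(int)      # (cell, expert) -> times selected
        self.s = defaultdict(float)    # (cell, expert) -> summed diagnostic accuracy
        self.t = defaultdict(int)      # cell -> total patients seen

    def select(self, cell):
        self.t[cell] += 1
        best, best_ucb = None, -1.0
        for e in self.experts:
            key = (cell, e)
            if self.n[key] == 0:
                return e               # try every expert at least once per cell
            ucb = (self.s[key] / self.n[key]
                   + math.sqrt(2 * math.log(self.t[cell]) / self.n[key]))
            if ucb > best_ucb:
                best, best_ucb = e, ucb
        return best

    def update(self, cell, expert, correct):
        """Record whether the chosen expert's diagnosis turned out correct."""
        self.n[(cell, expert)] += 1
        self.s[(cell, expert)] += float(correct)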






IEEE 2015 : Privacy-Preserving Detection of Sensitive Data Exposure
IEEE 2015 Transaction on Data Mining
ABSTRACT : Statistics from security firms, research institutions, and government organizations show that the number of data-leak instances has grown rapidly in recent years. Among the various data-leak cases, human mistakes are one of the main causes of data loss. There exist solutions that detect inadvertent sensitive data leaks caused by human mistakes and provide alerts for organizations. A common approach is to screen content in storage and transmission for exposed sensitive information. Such an approach usually requires the detection operation to be conducted in secrecy. However, this secrecy requirement is challenging to satisfy in practice, as detection servers may be compromised or outsourced. In this paper, we present a privacy-preserving data-leak detection (DLD) solution to solve this issue, in which a special set of sensitive data digests is used in detection. The advantage of our method is that it enables the data owner to safely delegate the detection operation to a semi-honest provider without revealing the sensitive data to the provider. We describe how Internet service providers can offer their customers DLD as an add-on service with strong privacy guarantees. The evaluation results show that our method can support accurate detection with a very small number of false alarms under various data-leak scenarios.
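A hedged sketch of digest-based leak detection: the data owner releases only short fingerprints of overlapping n-grams of the sensitive data, and the semi-honest detector fingerprints traffic the same way and reports collisions. The real DLD solution uses specially constructed digests; plain truncated SHA-256 here is for illustration only.

import hashlib

def shingles(data: bytes, n=8):
    """All overlapping n-byte windows of the input."""
    return (data[i:i + n] for i in range(len(data) - n + 1))

def digest_set(data: bytes, n=8, prefix_bytes=4):
    """Short fingerprints the owner releases instead of the sensitive data itself."""
    return {hashlib.sha256(g).digest()[:prefix_bytes] for g in shingles(data, n)}

def leak_score(sensitive_digests, traffic: bytes, n=8, prefix_bytes=4):
    hits = sum(1 for g in shingles(traffic, n)
               if hashlib.sha256(g).digest()[:prefix_bytes] in sensitive_digests)
    total = max(len(traffic) - n + 1, 1)
    return hits / total    # fraction of traffic shingles colliding with sensitive data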



IEEE 2015 : The Internet of Things for Health Care: A Comprehensive Survey
IEEE 2015 Transaction on Data Mining

ABSTRACT : The Internet of Things (IoT) makes smart objects the ultimate building blocks in the development of cyber-physical smart pervasive frameworks. The IoT has a variety of application domains, including health care. The IoT revolution is redesigning modern health care with promising technological, economic, and social prospects. This paper surveys advances in IoT-based health care technologies and reviews the state-of-the-art network architectures/platforms, applications, and industrial trends in IoT-based health care solutions. In addition, this paper analyzes distinct IoT security and privacy features, including security requirements, threat models, and attack taxonomies from the health care perspective. Further, this paper proposes an intelligent collaborative security model to minimize security risk; discusses how different innovations such as big data, ambient intelligence, and wearables can be leveraged in a health care context; addresses various IoT and eHealth policies and regulations across the world to determine how they can facilitate economies and societies in terms of sustainable development; and provides some avenues for future research on IoT-based health care based on a set of open issues and challenges.





IEEE 2014: A Generic Framework for Top-k Pairs and Top-k Objects Queries over Sliding Windows
IEEE 2014 Transactions on Knowledge and Data Engineering

ABSTRACT : Top-k pairs and top-k objects queries have received significant attention from the research community. In this paper, we present the first approach to answer a broad class of top-k pairs and top-k objects queries over sliding windows. Our framework handles multiple top-k queries, and each query is allowed to use a different scoring function, a different value of k, and a different size of the sliding window. Furthermore, the framework allows users to define arbitrarily complex scoring functions and supports out-of-order data streams. For all the queries that use the same scoring function, we need to maintain only one K-skyband. We present efficient techniques for K-skyband maintenance and query answering. We conduct a detailed complexity analysis and show that the expected cost of our approach is reasonably close to the lower bound cost. For top-k pairs queries, we demonstrate the efficiency of our approach by comparing it with a specially designed supreme algorithm that assumes the existence of an oracle and meets the lower bound cost. For top-k objects queries, our experimental results demonstrate the superiority of our algorithm over the state-of-the-art algorithm.
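The retention rule behind a K-skyband can be sketched in a few lines: once k newer objects outscore an object, it can never re-enter any future top-k result, so only objects whose dominance count is below k need to be kept. The sketch below handles arrivals only; in a full implementation, expired objects would additionally be dropped as the window slides.

from collections import deque

def kskyband_insert(band, item_score, k):
    """band: deque of (score, dominance_count) pairs, newest arriving last."""
    kept = deque()
    for score, dom in band:
        if score < item_score:
            dom += 1                 # one more newer, higher-scoring object dominates it
        if dom < k:                  # still a potential member of some future top-k
            kept.append((score, dom))
    kept.append((item_score, 0))     # the new arrival is dominated by nothing yet
    return kept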



IEEE 2014: Approximate Shortest Distance Computing: A Query-Dependent Local  Landmark Scheme
IEEE 2014 Transactions on Knowledge and Data Engineering


ABSTRACT : A shortest distance query between two nodes is a fundamental operation in large-scale networks. Most existing methods in the literature take a landmark embedding approach, which selects a set of graph nodes as landmarks and computes the shortest distances from each landmark to all nodes as an embedding. To handle a shortest distance query between two nodes, the precomputed distances from the landmarks to the query nodes are used to compute an approximate shortest distance based on the triangle inequality.
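A minimal sketch of the global landmark scheme described above (the paper's contribution, the query-dependent local landmark refinement, is not shown): precompute BFS distances from each landmark, then bound d(u, v) from above by the minimum over landmarks of d(u, l) + d(l, v). Unweighted graphs and a dict-of-lists adjacency are assumed.

from collections import deque

def bfs_distances(adj, src):
    """Hop distances from src to every reachable node in an unweighted graph."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def approx_distance(embeddings, u, v):
    """embeddings: one distance dict per landmark, as returned by bfs_distances."""
    return min(d[u] + d[v] for d in embeddings if u in d and v in d)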
                                                                                                                                         

IEEE 2014: CoRE: A Context-Aware Relation Extraction Method for Relation Completion
IEEE 2014 Transactions on Knowledge and Data Engineering

ABSTRACT : We identify Relation Completion (RC) as one recurring problem that is central to the success of novel big data applications such as Entity Reconstruction and Data Enrichment. Given a semantic relation R, RC attempts to link entity pairs between two entity lists under the relation R. To accomplish the RC goals, we propose to formulate search queries for each query entity based on some auxiliary information, so as to detect its target entity from the set of retrieved documents. For instance, a pattern-based method (PaRE) uses extracted patterns as the auxiliary information in formulating search queries. However, high-quality patterns may decrease the probability of finding suitable target entities. As an alternative, we propose the CoRE method, which uses context terms learned from the surroundings of a relation's expression as the auxiliary information in formulating queries. The experimental results based on several real-world web data collections demonstrate that CoRE reaches a much higher accuracy than PaRE for the purpose of RC.


IEEE 2014: Efficient Ranking on Entity Graphs with Personalized Relationships
IEEE 2014 Transactions on Knowledge and Data Engineering

ABSTRACT : Authority flow techniques like PageRank and ObjectRank can provide personalized ranking of typed entity-relationship graphs. There are two main ways to personalize authority flow ranking: node-based personalization, where authority originates from a set of user-specific nodes, and edge-based personalization, where the importance of different edge types is user-specific. We propose the first approach to achieve efficient edge-based personalization using a combination of precomputation and runtime algorithms. In particular, we apply our method to ObjectRank, where a personalized weight assignment vector (WAV) assigns different weights to each edge type or relationship type. Our approach includes a repository of rankings for various WAVs. We consider the following two classes of approximation: (a) SchemaApprox is formulated as a distance minimization problem at the schema level; (b) DataApprox is a distance minimization problem at the data graph level. SchemaApprox is not robust since it does not distinguish between important and trivial edge types based on the edge distribution in the data graph. In contrast, DataApprox has a provable error bound. Both SchemaApprox and DataApprox are expensive, so we develop efficient heuristic implementations, ScaleRank and PickOne respectively. Extensive experiments on the DBLP data graph show that ScaleRank provides a fast and accurate personalized authority flow ranking.
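For intuition, a hedged sketch of edge-based personalization: an ObjectRank-style authority-flow iteration where each edge type receives a user-specific weight from the WAV. This is the naive from-scratch computation that the repository plus the ScaleRank/PickOne heuristics are designed to avoid repeating for every user.

import numpy as np

def personalized_rank(edges, n, wav, base_set, d=0.85, iters=100):
    """edges: list of (src, dst, edge_type); wav: {edge_type: user-specific weight}."""
    M = np.zeros((n, n))
    for src, dst, etype in edges:
        M[dst, src] += wav.get(etype, 0.0)   # authority flows src -> dst, scaled by WAV
    col = M.sum(axis=0)
    M[:, col > 0] /= col[col > 0]            # normalize total outgoing flow per node
    p = np.zeros(n)
    p[list(base_set)] = 1.0 / len(base_set)  # authority originates from the base set
    r = p.copy()
    for _ in range(iters):
        r = (1 - d) * p + d * (M @ r)        # damped authority-flow iteration
    return r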


IEEE 2014:Secure Mining of Association Rules in Horizontally Distributed Databases
IEEE 2014 Transactions on Knowledge and Data Engineering
ABSTRACT : We propose a protocol for secure mining of association rules in horizontally distributed databases. The current leading protocol is that of Kantarcioglu and Clifton [18]. Our protocol, like theirs, is based on the Fast Distributed Mining (FDM) algorithm of Cheung et al. [8], which is an unsecured distributed version of the Apriori algorithm. The main ingredients in our protocol are two novel secure multi-party algorithms: one that computes the union of private subsets that each of the interacting players holds, and another that tests the inclusion of an element held by one player in a subset held by another. Our protocol offers enhanced privacy with respect to the protocol in [18]. In addition, it is simpler and significantly more efficient in terms of communication rounds, communication cost, and computational cost.
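A plaintext sketch of the FDM skeleton underlying the protocol: each site mines locally frequent itemsets, the candidate sets are unioned, and each candidate is verified against global support. The paper's contribution, performing the union and the inclusion tests with secure multi-party subprotocols, is elided here; min_sup is a fractional support threshold.

from itertools import combinations

def frequent_itemsets(transactions, min_sup, max_len=3):
    """Itemsets (as sorted tuples) meeting the support threshold at one site."""
    freq = set()
    for size in range(1, max_len + 1):
        counts = {}
        for t in transactions:
            for c in combinations(sorted(t), size):
                counts[c] = counts.get(c, 0) + 1
        freq |= {c for c, n in counts.items() if n >= min_sup * len(transactions)}
    return freq

def fdm_round(site_dbs, min_sup):
    # union of locally frequent candidates (done securely in the actual protocol)
    candidates = set().union(*(frequent_itemsets(db, min_sup) for db in site_dbs))
    total = sum(len(db) for db in site_dbs)
    def global_count(c):
        return sum(sum(1 for t in db if set(c) <= set(t)) for db in site_dbs)
    return {c for c in candidates if global_count(c) >= min_sup * total}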



IEEE 2014:Facilitating Document Annotation using Content and Querying Value
IEEE 2014 Transactions on  Knowledge and Data Engineering

ABSTRACT : A large number of organizations today generate and share textual descriptions of their products, services, and actions. Such collections of textual data contain a significant amount of structured information, which remains buried in the unstructured text. While information extraction algorithms facilitate the extraction of structured relations, they are often expensive and inaccurate, especially when operating on top of text that does not contain any instances of the targeted structured information. We present a novel alternative approach that facilitates the generation of the structured metadata by identifying documents that are likely to contain information of interest, information that will subsequently be useful for querying the database. Our approach relies on the idea that humans are more likely to add the necessary metadata during creation time, if prompted by the interface; or that it is much easier for humans (and/or algorithms) to identify the metadata when such information actually exists in the document, instead of naively prompting users to fill in forms with information that is not available in the document. As a major contribution of this paper, we present algorithms that identify structured attributes that are likely to appear within the document, by jointly utilizing the content of the text and the query workload. Our experimental evaluation shows that our approach generates superior results compared to approaches that rely only on the textual content or only on the query workload, to identify attributes of interest.
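The core scoring intuition can be sketched as follows: rank candidate attributes for a document by combining how likely the attribute's values are to occur in the text (content value) with how often the attribute appears in the query workload (querying value). The simple product combination below is an illustrative assumption, not the paper's exact model.

def rank_attributes(doc_tokens, attribute_values, query_freq):
    """attribute_values: {attr: known values}; query_freq: {attr: workload frequency}."""
    doc = set(doc_tokens)
    scores = {}
    for attr, values in attribute_values.items():
        content = sum(v in doc for v in values) / max(len(values), 1)  # content value
        querying = query_freq.get(attr, 0.0)                           # querying value
        scores[attr] = content * querying
    return sorted(scores, key=scores.get, reverse=True)  # attributes worth prompting for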




IEEE 2014:An Empirical Performance Evaluation of Relational Keyword Search Systems
IEEE 2014 Transactions on  Knowledge and Data Engineering

ABSTRACT : In the past decade, extending the keyword search paradigm to relational data has been an active area of research within the database and information retrieval (IR) community. A large number of approaches have been proposed and implemented, but despite numerous publications, there remains a severe lack of standardization for system evaluations. This lack of standardization has resulted in contradictory results from different evaluations, and the numerous discrepancies muddle what advantages are proffered by different approaches. In this paper, we present a thorough empirical performance evaluation of relational keyword search systems. Our results indicate that many existing search techniques do not provide acceptable performance for realistic retrieval tasks. In particular, memory consumption precludes many search techniques from scaling beyond small datasets with tens of thousands of vertices. We also explore the relationship between execution time and factors varied in previous evaluations; our analysis indicates that these factors have relatively little impact on performance. In summary, our work confirms previous claims regarding the unacceptable performance of these systems and underscores the need for standardization, as exemplified by the IR community, when evaluating these retrieval systems.

 
IEEE 2013: SUSIE: Search Using Services and Information Extraction
IEEE 2013 Transactions on Knowledge and Data Engineering 

ABSTRACT : The API of a Web service restricts the types of queries that the service can answer. For example, a Web service might provide a method that returns the songs of a given singer, but it might not provide a method that returns the singers of a given song. If the user asks for the singer of some specific song, then the Web service cannot be called – even though the underlying database might have the desired piece of information. This asymmetry is particularly problematic if the service is used in a Web service orchestration system. In this paper, we propose to use on-the-fly information extraction to collect values that can be used as parameter bindings for the Web service. We show how this idea can be integrated into a Web service orchestration system. Our approach is fully implemented in a prototype called SUSIE. We present experiments with real-life data and services to demonstrate the practical viability and good performance of our approach.

IEEE 2013 : A Fast Clustering-Based Feature Subset Selection Algorithm for High Dimensional Data
IEEE  2013  Transactions on Knowledge and Data Engineering 

ABSTRACT : Feature selection involves identifying a subset of the most useful features that produces results comparable to the original entire set of features. A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view. While efficiency concerns the time required to find a subset of features, effectiveness relates to the quality of the subset of features. Based on these criteria, a fast clustering-based feature selection algorithm, FAST, is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to the target classes is selected from each cluster to form a subset of features. Since features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum spanning tree clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study. Extensive experiments are carried out to compare FAST and several representative feature selection algorithms, namely FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER, before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text datasets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
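A hedged sketch of FAST's two steps, with symmetric uncertainty replaced by absolute Pearson correlation for brevity: build a minimum spanning tree over the feature graph, cut weak edges to form clusters, then keep from each cluster the feature most relevant to the class. The quantile-based edge cut is an illustrative stand-in for FAST's SU-based criterion; X is assumed numeric with non-constant columns, and y numeric class labels.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def fast_like_selection(X, y, cut_quantile=0.5):
    n_feat = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))          # feature-feature relevance
    rel = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_feat)])
    mst = minimum_spanning_tree(1.0 - corr).toarray()    # small weight = high correlation
    thresh = np.quantile(mst[mst > 0], cut_quantile)
    mst[mst > thresh] = 0                                # cut weak (low-correlation) edges
    _, labels = connected_components(mst, directed=False)
    # from each cluster, keep the feature most relevant to the class
    return [int(np.argmax(np.where(labels == c, rel, -1)))
            for c in np.unique(labels)]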





IEEE 2013:A Web Usage Mining Approach Based On New Technique In Web Path Recommendation Systems
IEEE 2013 Transactions on Engineering Research & Technology  
ABSTRACT : The Internet is one of the fastest growing areas of intelligence gathering. Ranking web pages for a Web search engine is one of the significant problems at present, and it has attracted considerable attention from the research community. Web prefetching is used to reduce the access latency of the Internet. However, if most prefetched Web pages are not visited by users in their subsequent accesses, the limited network bandwidth and server resources will not be used efficiently and may worsen the access delay problem. Therefore, an accurate prediction method is critical during prefetching. To provide predictions efficiently, we advance an architecture for prediction in a Web Usage Mining system and propose a novel approach for classifying user navigation patterns to predict users' requests, based on clustering users' browsing-behavior knowledge. The experimental results show that the approach can improve the accuracy, precision, recall, and F-measure of classification in the architecture.



IEEE 2013:PMSE: A Personalized Mobile Search Engine
IEEE 2013 Transactions on Knowledge and Data Engineering  
ABSTRACT : We propose a personalized mobile search engine (PMSE) that captures users' preferences in the form of concepts by mining their clickthrough data. Due to the importance of location information in mobile search, PMSE classifies these concepts into content concepts and location concepts. In addition, users' locations (positioned by GPS) are used to supplement the location concepts in PMSE. The user preferences are organized in an ontology-based, multifacet user profile, which is used to adapt a personalized ranking function for rank adaptation of future search results. To characterize the diversity of the concepts associated with a query and their relevance to the user's need, four entropies are introduced to balance the weights between the content and location facets. Based on the client-server model, we also present a detailed architecture and design for the implementation of PMSE. In our design, the client collects and stores the clickthrough data locally to protect privacy, whereas heavy tasks such as concept extraction, training, and re-ranking are performed at the PMSE server. Moreover, we address the privacy issue by restricting the information in the user profile exposed to the PMSE server with two privacy parameters. We prototype PMSE on the Google Android platform. Experimental results show that PMSE significantly improves precision compared to the baseline.
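For intuition, a small sketch of entropy-based facet weighting: if the clicks for a query spread across many location concepts (high location entropy), the location facet deserves more weight. The paper introduces four entropies; the two-entropy normalization below is a simplified illustration, not PMSE's exact formula.

import math

def entropy(counts):
    """Shannon entropy of a click-count distribution over concepts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def facet_weights(content_clicks, location_clicks):
    hc, hl = entropy(content_clicks), entropy(location_clicks)
    z = (hc + hl) or 1.0
    return hc / z, hl / z    # (content facet weight, location facet weight)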



IEEE 2013: Generation of Personalized Ontology Based on Consumer Emotion and Behavior Analysis
IEEE 2013 Transactions on Affective Computing

ABSTRACT : The relationships between consumer emotions and their buying behaviors have been well documented. Technology-savvy consumers often use the web to find information on products and services before they commit to buying. We propose a semantic web usage mining approach for discovering periodic web access patterns from annotated web usage logs, which incorporates information on consumer emotions and behaviors through self-reporting and behavioral tracking. We use fuzzy logic to represent real-life temporal concepts (e.g., morning) and requested resource attributes (ontological domain concepts for the requested URLs) of periodic pattern-based web access activities. These fuzzy temporal and resource representations, which contain both behavioral and emotional cues, are incorporated into a Personal Web Usage Lattice that models the user's web access activities. From this, we generate a Personal Web Usage Ontology written in OWL, which enables semantic web applications such as personalized web resources recommendation. Finally, we demonstrate the effectiveness of our approach by presenting experimental results in the context of personalized web resources recommendation with varying degrees of emotional influence. Emotional influence has been found to contribute positively to adaptation in personalized recommendation.



IEEE 2013: Identity-Based Secure Distributed Data Storage Schemes
IEEE 2013 Transactions on Computers 

ABSTRACT : Secure distributed data storage can shift the burden of maintaining a large number of files from the owner to proxy servers. Proxy servers can convert encrypted files for the owner to encrypted files for the receiver without the necessity of knowing the content of the original files. In practice, the original files will be removed by the owner for the sake of space efficiency. Hence, the issues of confidentiality and integrity of the outsourced data must be addressed carefully. In this paper, we propose two identity-based secure distributed data storage (IBSDDS) schemes. Our schemes capture the following properties: (1) the file owner can decide the access permission independently, without the help of the private key generator (PKG); (2) for one query, a receiver can only access one file, instead of all files of the owner; (3) our schemes are secure against collusion attacks, namely, even if the receiver can compromise the proxy servers, he cannot obtain the owner's secret key. Although the first scheme is only secure against chosen-plaintext attacks (CPA), the second scheme is secure against chosen-ciphertext attacks (CCA). To the best of our knowledge, these are the first IBSDDS schemes where access permission is decided by the owner for an exact file and collusion attacks can be resisted in the standard model.


IEEE 2013: Ginix: Generalized Inverted Index for Keyword Search
IEEE 2013 Transactions on Knowledge and Data Mining

ABSTRACT : Keyword search has become a ubiquitous method for users to access text data in the face of information explosion. Inverted lists are usually used to index the underlying documents so that documents can be retrieved efficiently according to a set of keywords. Since inverted lists are usually large, many compression techniques have been proposed to reduce the storage space and disk I/O time. However, these techniques usually perform decompression operations on the fly, which increases the CPU time. This paper presents a more efficient index structure, the Generalized INverted IndeX (Ginix), which merges consecutive IDs in inverted lists into intervals to save storage space. With this index structure, more efficient algorithms can be devised to perform basic keyword search operations, i.e., the union and intersection operations, by taking advantage of intervals. Specifically, these algorithms do not require conversions from interval lists back to ID lists. As a result, keyword search using Ginix can be more efficient than that using traditional inverted indices. The performance of Ginix is further improved by reordering the documents in the data sets using two scalable algorithms. Experiments on the performance and scalability of Ginix on real data sets show that Ginix not only requires less storage space but also improves keyword search performance compared with traditional inverted indexes.
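Ginix's core encoding is easy to sketch: merge consecutive document IDs in an inverted list into intervals, and perform set operations directly on interval lists without expanding them back to ID lists. The interval intersection below is one such operation; a non-empty sorted ID list is assumed.

def to_intervals(ids):
    """Merge consecutive document IDs into (lo, hi) intervals."""
    ids = sorted(ids)
    out = [[ids[0], ids[0]]]
    for x in ids[1:]:
        if x == out[-1][1] + 1:
            out[-1][1] = x          # extend the current interval
        else:
            out.append([x, x])      # start a new interval
    return [tuple(i) for i in out]

def intersect(a, b):
    """Intersect two sorted interval lists without expanding to ID lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

print(to_intervals([1, 2, 3, 7, 8, 10]))        # [(1, 3), (7, 8), (10, 10)]
print(intersect([(1, 5)], [(3, 8), (9, 12)]))   # [(3, 5)]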



IEEE 2013: An Ontology-based Framework for Context-aware Adaptive E-learning System
IEEE 2013 Transactions on Computer Communication and Informatics
ABSTRACT : In a web-based e-learning environment, every learner has a distinct background, learning style, and specific goal when searching for learning material on the web. The goal of personalization is to tailor search results to a particular user based on that user's contextual information. The effectiveness of accessing learning material involves two important challenges: identifying the user context and modeling the user context as ontological profiles. This work describes an ontology-based framework for a context-aware adaptive learning system, with detailed discussions on the categorization and modeling of contextual information, along with the use of ontology to explicitly specify learner context in an e-learning environment. Finally, we conclude by showing the applicability of the proposed ontology with an appropriate architectural overview of the e-learning system.



IEEE 2013: ELCA Evaluation for Keyword Search on Probabilistic XML Data
IEEE 2013 Transactions on Knowledge and Data Engineering 

ABSTRACT : As probabilistic data management is becoming one of the main research focuses and keyword search is turning into a more popular query means, it is natural to think how to support keyword queries on probabilistic XML data. With regard to keyword queries on deterministic XML documents, ELCA (Exclusive Lowest Common Ancestor) semantics allows more relevant fragments rooted at the ELCAs to appear as results and is more popular compared with other keyword query result semantics (such as SLCAs). In this paper, we investigate how to evaluate ELCA results for keyword queries on probabilistic XML documents. After defining probabilistic ELCA semantics in terms of possible world semantics, we propose an approach to compute ELCA probabilities without generating possible worlds. Then we develop an efficient stack-based algorithm that can find all probabilistic ELCA results and their ELCA probabilities for a given keyword query on a probabilistic XML document. Finally, we experimentally evaluate the proposed ELCA algorithm and compare it with its SLCA counterpart in aspects of result effectiveness, time and space efficiency, and scalability.


IEEE 2013: Crowdsourcing Predictors of Behavioral Outcomes
IEEE 2013 Transactions on Knowledge and Data Engineering
ABSTRACT : Generating models from large data sets, and determining which subsets of data to mine, is becoming increasingly automated. However, choosing what data to collect in the first place requires human intuition or experience, usually supplied by a domain expert. This paper describes a new approach to machine science which demonstrates for the first time that non-domain experts can collectively formulate features and provide values for those features such that they are predictive of some behavioral outcome of interest. This was accomplished by building a web platform in which human groups interact to both respond to questions likely to help predict a behavioral outcome and pose new questions to their peers. This results in a dynamically growing online survey, but this cooperative behavior also leads to models that can predict users' outcomes based on their responses to the user-generated survey questions. Here we describe two web-based experiments that instantiate this approach: the first site led to models that can predict users' monthly electric energy consumption; the other led to models that can predict users' body mass index. As exponential increases in content are often observed in successful online collaborative communities, the proposed methodology may, in the future, lead to similar exponential rises in discovery and insight into the causal factors of behavioral outcomes.













