TECHNOLOGY: Dot net
DOMAIN: Data Mining
S. No. | IEEE TITLE | ABSTRACT | IEEE YEAR |
1 | Enabling Kernel-Based Attribute-Aware Matrix Factorization for Rating Prediction | In recommender systems, one key task is to predict the personalized rating of a user for a new item and then return the new items having the top predicted ratings to the user. Recommender systems usually apply collaborative filtering techniques (e.g., matrix factorization) over a sparse user-item rating matrix to make rating predictions. However, collaborative filtering techniques are severely affected by the data sparsity of the underlying user-item rating matrix and often confront cold-start problems for new items and users. Since the attributes of items and the social links between users have become increasingly accessible on the Internet, this paper exploits the rich attributes of items and the social links of users to alleviate the rating sparsity effect and tackle the cold-start problems. Specifically, we first propose a Kernel-based Attribute-aware Matrix Factorization model called KAMF to integrate the attribute information of items into matrix factorization. KAMF can discover the nonlinear interactions among attributes, users, and items, which mitigates the rating sparsity effect and naturally handles the cold-start problem for new items. Further, we extend KAMF to address the cold-start problem for new users by utilizing the social links between users. Finally, we conduct a comprehensive performance evaluation of KAMF using two large-scale real-world data sets recently released by Yelp and MovieLens. Experimental results show that KAMF achieves significantly superior performance compared with other state-of-the-art rating prediction techniques. | 2017 |
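The core idea behind attribute-aware factorization (minus KAMF's kernel) can be sketched briefly: an item's latent vector is its own factor plus the factors of its attributes, so a brand-new item with known attributes still yields a usable prediction. The sketch below is a minimal Python illustration on toy data (the listing targets .NET; Python is used here only for brevity), not the paper's actual model; all hyperparameters are illustrative.

```python
import random

def train_mf_with_attrs(ratings, item_attrs, n_users, n_items, n_attrs,
                        k=4, lr=0.01, reg=0.05, epochs=800, seed=0):
    """Attribute-aware MF: an item's vector is its own latent factor
    plus the sum of its attributes' latent factors, trained by SGD."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    A = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_attrs)]

    def item_vec(i):
        v = list(Q[i])
        for a in item_attrs[i]:
            for f in range(k):
                v[f] += A[a][f]
        return v

    for _ in range(epochs):
        for u, i, r in ratings:
            qi = item_vec(i)
            err = r - sum(P[u][f] * qi[f] for f in range(k))
            for f in range(k):
                pu = P[u][f]
                P[u][f] += lr * (err * qi[f] - reg * pu)
                grad = err * pu
                Q[i][f] += lr * (grad - reg * Q[i][f])
                for a in item_attrs[i]:
                    A[a][f] += lr * (grad - reg * A[a][f])

    return lambda u, i: sum(P[u][f] * item_vec(i)[f] for f in range(k))
```

Because attribute factors are shared across items, an unrated (cold-start) item that shares an attribute with rated items inherits a meaningful part of their learned signal.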
2 | Query Expansion with Enriched User Profiles for Personalized Search Utilizing Folksonomy Data | Query expansion has been widely adopted in Web search as a way of tackling the ambiguity of queries. Personalized search utilizing folksonomy data has demonstrated an extreme vocabulary mismatch problem that requires even more effective query expansion methods. Co-occurrence statistics, tag-tag relationships, and semantic matching approaches are among those favored by previous research. However, user profiles which only contain a user’s past annotation information may not be enough to support the selection of expansion terms, especially for users with limited previous activity with the system. We propose a novel model to construct enriched user profiles with the help of an external corpus for personalized query expansion. Our model integrates the current state-of-the-art text representation learning framework, known as word embeddings, with topic models in two groups of pseudo-aligned documents. Based on user profiles, we build two novel query expansion techniques. These two techniques are based on topical weights-enhanced word embeddings, and the topical relevance between the query and the terms inside a user profile, respectively. The results of an in-depth experimental evaluation, performed on two real-world datasets using different external corpora, show that our approach outperforms traditional techniques, including existing non-personalized and personalized query expansion methods. | 2017 |
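The embedding-based selection step described above can be sketched simply: score each term in the user's profile by its average embedding similarity to the query terms and append the best ones. This toy Python sketch uses hand-built 2-d vectors in place of trained word embeddings and omits the paper's topical weighting; names are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_query(query_terms, user_profile, embeddings, n=2):
    """Rank a user's profile terms by average embedding similarity to
    the query terms and append the top-n as expansion terms."""
    scored = []
    for term in user_profile:
        if term in query_terms or term not in embeddings:
            continue
        sims = [cosine(embeddings[term], embeddings[q])
                for q in query_terms if q in embeddings]
        if sims:
            scored.append((sum(sims) / len(sims), term))
    scored.sort(reverse=True)
    return list(query_terms) + [t for _, t in scored[:n]]
```

With realistic embeddings, a query like "java" would pull personalized expansion terms such as "programming" from a developer's profile rather than "coffee".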
3 | Collaboratively Training Sentiment Classifiers for Multiple Domains | We propose a collaborative multi-domain sentiment classification approach to train sentiment classifiers for multiple domains simultaneously. In our approach, the sentiment information in different domains is shared to train more accurate and robust sentiment classifiers for each domain when labeled data is scarce. Specifically, we decompose the sentiment classifier of each domain into two components, a global one and a domain-specific one. The global model captures general sentiment knowledge and is shared by all domains. The domain-specific model captures the specific sentiment expressions of each domain. In addition, we extract domain-specific sentiment knowledge from both labeled and unlabeled samples in each domain and use it to enhance the learning of the domain-specific sentiment classifiers. Furthermore, we incorporate the similarities between domains into our approach as regularization over the domain-specific sentiment classifiers to encourage the sharing of sentiment information between similar domains. Two kinds of domain similarity measures are explored, one based on textual content and the other based on sentiment expressions. Moreover, we introduce two efficient algorithms to train our model. Experimental results on benchmark datasets show that our approach can effectively improve the performance of multi-domain sentiment classification and significantly outperforms baseline methods. | 2017 |
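The global-plus-domain-specific decomposition described above can be sketched with a simple linear model: each domain's classifier is the sum of a shared weight vector and a per-domain one, and both receive updates, so sentiment signal learned in one domain transfers to others. This minimal Python sketch uses perceptron-style updates on toy feature vectors and omits the paper's domain-similarity regularization.

```python
def train_multidomain(data_by_domain, dim, epochs=20, lr=0.1):
    """Each domain's classifier is w_global + w_domain; both are updated
    on every mistake, so sentiment knowledge is shared across domains."""
    w_g = [0.0] * dim
    w_d = {d: [0.0] * dim for d in data_by_domain}
    for _ in range(epochs):
        for d, samples in data_by_domain.items():
            for x, y in samples:            # y in {-1, +1}
                score = sum((w_g[i] + w_d[d][i]) * x[i] for i in range(dim))
                if y * score <= 0:          # perceptron-style update
                    for i in range(dim):
                        w_g[i] += lr * y * x[i]
                        w_d[d][i] += lr * y * x[i]
    return w_g, w_d

def predict(w_g, w_d, d, x):
    score = sum((w_g[i] + w_d[d][i]) * x[i] for i in range(len(x)))
    return 1 if score > 0 else -1
```

A word like "terrible", seen only in one domain's labeled data, still pushes predictions negative in every other domain through the shared component.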
4 | Energy-Efficient Query Processing in Web Search Engines | Web search engines are composed of thousands of query processing nodes, i.e., servers dedicated to processing user queries. So many servers consume a significant amount of energy, mostly attributable to their CPUs, but they are necessary to ensure low latencies, since users expect sub-second response times (e.g., 500 ms). However, users can hardly notice response times that are faster than their expectations. Hence, we propose the Predictive Energy Saving Online Scheduling Algorithm (PESOS) to select the most appropriate CPU frequency to process a query on a per-core basis. PESOS aims to process queries by their deadlines, and leverages high-level scheduling information to reduce the CPU energy consumption of a query processing node. PESOS bases its decisions on query efficiency predictors, which estimate the processing volume and processing time of a query. We experimentally evaluate PESOS on the TREC ClueWeb09B collection and the MSN2006 query log. Results show that PESOS can reduce the CPU energy consumption of a query processing node by up to ~48 percent compared with a system running at maximum CPU core frequency. PESOS also outperforms the best state-of-the-art competitor with a ~20 percent energy saving, while the competitor requires fine parameter tuning and may incur uncontrollable latency violations. | 2017 |
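The decision at the heart of this scheme — pick the lowest per-core frequency that still meets a query's deadline — can be sketched roughly as below. This toy Python sketch assumes processing time scales inversely with frequency and takes the efficiency predictor's estimate as a given input; the actual PESOS algorithm and its predictors are considerably more involved.

```python
def select_frequency(predicted_time_at_max, deadline_ms, frequencies, f_max):
    """Pick the lowest CPU frequency (in GHz) that still completes the
    query by its deadline, assuming runtime scales inversely with
    frequency; fall back to the maximum frequency otherwise."""
    for f in sorted(frequencies):
        est = predicted_time_at_max * (f_max / f)   # predicted runtime at f
        if est <= deadline_ms:
            return f
    return f_max   # no frequency meets the deadline; run flat out
```

Lighter queries are thus processed at low frequency (saving CPU energy) while heavy queries near their deadline keep the core at full speed.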
5 | A Scalable Data Chunk Similarity Based Compression Approach for Efficient Big Sensing Data Processing on Cloud | Big sensing data is prevalent in both industry and scientific research applications where data is generated with high volume and velocity. Cloud computing provides a promising platform for big sensing data processing and storage, as it offers a flexible stack of massive computing, storage, and software services in a scalable manner. Current big sensing data processing on the Cloud has adopted some data compression techniques. However, due to the high volume and velocity of big sensing data, traditional data compression techniques lack sufficient efficiency and scalability for data processing. Based on specific on-Cloud data compression requirements, we propose a novel scalable data compression approach based on calculating the similarity among partitioned data chunks. Instead of compressing basic data units, compression is conducted over partitioned data chunks. To restore the original data sets, restoration functions and predictions are designed. MapReduce is used for the algorithm implementation to achieve extra scalability on the Cloud. Through experiments with real-world meteorological big sensing data on the U-Cloud platform, we demonstrate that the proposed scalable compression approach based on data chunk similarity can significantly improve data compression efficiency with an affordable loss of data accuracy. | 2017 |
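The chunk-similarity idea can be sketched in a few lines: partition the sensor stream into chunks and, whenever a chunk is close enough to the last stored reference chunk, store only a pointer instead of the data; decompression replays the reference, trading bounded accuracy loss for space. This is a lossy single-node Python sketch, not the paper's MapReduce pipeline; the distance measure and threshold are illustrative.

```python
import math

def compress(values, chunk_size, threshold):
    """Store a chunk fully only when it differs from the last stored
    reference chunk by more than `threshold` (Euclidean distance)."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    out, ref = [], None
    for c in chunks:
        if ref is not None and len(c) == len(ref):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(c, ref)))
            if dist <= threshold:
                out.append(("ref",))       # similar chunk: pointer only
                continue
        out.append(("raw", c))
        ref = c
    return out

def decompress(compressed):
    """Approximate restoration: a 'ref' entry replays the last raw chunk."""
    values, ref = [], None
    for entry in compressed:
        if entry[0] == "raw":
            ref = entry[1]
        values.extend(ref)
    return values
```

The threshold directly controls the accuracy/compression trade-off: a larger threshold turns more chunks into pointers at the cost of larger restoration error.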
6 | User-Centric Similarity Search | User preferences play a significant role in market analysis. In the database literature, there has been extensive work on query primitives, such as the well-known top-k query, which can be used to rank products based on the preferences customers have expressed. Still, the fundamental operation that evaluates the similarity between products typically ignores these preferences. Instead, products are depicted in a feature space based on their attributes, and similarity is computed via traditional distance metrics on that space. In this work, we utilize the rankings of products based on the opinions of their customers in order to map the products into a user-centric space where similarity calculations are performed. We identify important properties of this mapping that result in upper and lower similarity bounds, which in turn permit us to utilize conventional multidimensional indexes on the original product space in order to perform these user-centric similarity computations. We show how interesting similarity calculations that are motivated by the commonly used range and nearest neighbor queries can be performed efficiently, while pruning significant parts of the data set based on the bounds we derive on the user-centric similarity of products. | 2017 |
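The user-centric mapping can be illustrated very simply: represent each product by the vector of rank positions users assign it, and measure distance in that space, so two products that customers consistently rank close together are similar regardless of their attribute values. This Python sketch only shows the mapping idea; the paper's similarity bounds and index-based pruning are not reproduced here.

```python
import math

def user_centric_vectors(rankings):
    """rankings: {user: [product ids, best to worst]}. Each product
    becomes a vector of its rank position per user."""
    products = {p for ranked in rankings.values() for p in ranked}
    vecs = {p: [] for p in products}
    for ranked in rankings.values():
        pos = {p: i for i, p in enumerate(ranked)}
        for p in products:
            vecs[p].append(pos[p])
    return vecs

def dist(u, v):
    """Euclidean distance in the user-centric rank space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Products "a" and "b" below occupy adjacent rank positions for every user, so they come out closer to each other than either is to the consistently last-ranked "c".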
7 | Computing Semantic Similarity of Concepts in Knowledge Graphs | This paper presents a method for measuring the semantic similarity between concepts in Knowledge Graphs (KGs) such as WordNet and DBpedia. Previous work on semantic similarity methods has focused either on the structure of the semantic network between concepts (e.g., path length and depth) or only on the Information Content (IC) of concepts. We propose a semantic similarity method, namely wpath, that combines these two approaches, using IC to weight the shortest path length between concepts. Conventional corpus-based IC is computed from the distribution of concepts over a textual corpus, which requires preparing a domain corpus containing annotated concepts and has a high computational cost. Since instances are already extracted from textual corpora and annotated with concepts in KGs, graph-based IC is proposed to compute IC based on the distribution of concepts over instances. Through experiments performed on well-known word similarity datasets, we show that the wpath semantic similarity method produces a statistically significant improvement over other semantic similarity methods. Moreover, in a real category classification evaluation, the wpath method shows the best performance in terms of accuracy and F-score. | 2017 |
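One common way to combine path length with IC, consistent with the description above, is sim(c1, c2) = 1 / (1 + path(c1, c2) · k^IC(lcs)), where lcs is the least common subsumer and k ∈ (0, 1] a tuning parameter. The Python sketch below illustrates this on a toy taxonomy with graph-based IC from instance counts; for simplicity the LCS is passed in explicitly rather than computed from the taxonomy, and the exact formulation here is an assumption rather than a transcript of the paper.

```python
import math
from collections import deque

def shortest_path_len(edges, a, b):
    """BFS hop count between two concepts in an undirected taxonomy."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, q = {a}, deque([(a, 0)])
    while q:
        node, d = q.popleft()
        if node == b:
            return d
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                q.append((nxt, d + 1))
    return math.inf

def graph_ic(instances, concept, total):
    """Graph-based IC: negative log of the concept's share of instances."""
    return -math.log(instances[concept] / total)

def wpath(edges, instances, total, c1, c2, lcs, k=0.8):
    path = shortest_path_len(edges, c1, c2)
    return 1.0 / (1.0 + path * k ** graph_ic(instances, lcs, total))
```

A generic LCS (high instance count, IC near zero) leaves the path weight at full strength, while a specific LCS shrinks the effective path length and raises similarity.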
8 | Efficient Pattern-Based Aggregation on Sequence Data | A Sequence OLAP (S-OLAP) system provides a platform on which pattern-based aggregate (PBA) queries on a sequence database are evaluated. In its simplest form, a PBA query consists of a pattern template T and an aggregate function F. A pattern template is a sequence of variables, each defined over a domain. Each variable is instantiated with all possible values in its corresponding domain to derive all possible patterns of the template. Sequences are grouped based on the patterns they possess. The answer to a PBA query is a sequence cuboid (s-cuboid), which is a multidimensional array of cells. Each cell is associated with a pattern instantiated from the query’s pattern template. The value of each s-cuboid cell is obtained by applying the aggregate function F to the set of data sequences that belong to that cell. Since a pattern template can involve many variables and can be arbitrarily long, the induced s-cuboid for a PBA query can be huge. For most analytical tasks, however, only iceberg cells with very large aggregate values are of interest. This paper proposes an efficient approach to identifying and evaluating iceberg cells of s-cuboids. Experimental results show that our algorithms are orders of magnitude faster than existing approaches. | 2017 |
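A brute-force version of a PBA query with a COUNT aggregate and an iceberg threshold can be sketched as follows: group sequences by every instantiated pattern they contain and keep only the cells whose count clears the threshold. This naive Python sketch restricts patterns to consecutive-event subsequences of a fixed length and enumerates all cells, which is exactly the cost the paper's algorithms avoid.

```python
from collections import defaultdict

def iceberg_pba(sequences, pattern_len, threshold):
    """COUNT-aggregate sequences by every length-`pattern_len` run of
    consecutive events they contain; keep only iceberg cells whose
    count is at least `threshold`."""
    cells = defaultdict(set)
    for sid, seq in enumerate(sequences):
        for i in range(len(seq) - pattern_len + 1):
            cells[tuple(seq[i:i + pattern_len])].add(sid)
    return {p: len(s) for p, s in cells.items() if len(s) >= threshold}
```

Each key of the returned dict corresponds to one s-cuboid cell; the iceberg filter discards the long tail of sparsely populated cells.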
9 | Efficient Distance-Aware Influence Maximization in Geo-Social Networks | Given a social network G and a positive integer k, the influence maximization problem aims to identify a set of k nodes in G that maximizes the influence spread under a certain propagation model. With the proliferation of geo-social networks, location-aware promotion is becoming more necessary in real applications. In this paper, we study the distance-aware influence maximization (DAIM) problem, which advocates the importance of the distance between users and the promoted location. Unlike the traditional influence maximization problem, DAIM treats users differently based on their distances from the promoted location. In this situation, the k nodes selected differ as the promoted location varies. In order to handle a large number of queries and meet the online requirement, we develop two novel index-based approaches, MIA-DA and RIS-DA, by utilizing information over pre-sampled query locations. MIA-DA is a heuristic method which adopts the maximum influence arborescence (MIA) model to approximate the influence calculation. In addition, different pruning strategies as well as a priority-based algorithm are proposed to significantly reduce the search space. To improve effectiveness, in RIS-DA, we extend the reverse influence sampling (RIS) model and derive an unbiased estimator for the DAIM problem. By carefully analyzing the sample size needed for indexing, RIS-DA is able to return a 1 − 1/e − ε approximate solution with probability at least 1 − δ for any given query. Finally, we demonstrate the efficiency and effectiveness of the proposed methods on real geo-social networks. | 2017 |
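The distance-aware twist can be illustrated with a deliberately simplified greedy sketch: each user carries a weight derived from their distance to the promoted location, and seeds are chosen to maximize the total weight of users they can reach. Real influence propagation is stochastic (e.g., the independent cascade model, estimated via RIS); the deterministic reachability below is only a Python illustration of how distance weighting changes seed selection, not the paper's MIA-DA or RIS-DA.

```python
def distance_aware_greedy(adj, weight, k):
    """Greedy seed selection where each reachable user u contributes
    weight[u], a decay based on u's distance to the promoted location."""
    def reach(seeds):
        seen, stack = set(seeds), list(seeds)
        while stack:
            for v in adj.get(stack.pop(), ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return sum(weight[u] for u in seen)

    seeds = []
    for _ in range(k):
        best = max((n for n in adj if n not in seeds),
                   key=lambda n: reach(seeds + [n]))
        seeds.append(best)
    return seeds
```

Because the weights depend on the promoted location, the same graph yields different seed sets for different query locations, which is precisely why the paper pre-builds indexes over sampled locations.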
10 | A Systematic Approach to Clustering Whole Trajectories of Mobile Objects in Road Networks | Most mobile object trajectory clustering to date has focused on clustering the location points or sub-trajectories extracted from trajectory data. This paper presents TRACEMOB, a systematic approach to clustering whole trajectories of mobile objects traveling in road networks. TRACEMOB as a whole-trajectory clustering framework has three unique features. First, we design a quality measure for the distance between two whole trajectories. By quality, we mean that the distance measure can capture the complex characteristics of trajectories as a whole, including their varying lengths and their constrained movement in the road network space. Second, we develop an algorithm that transforms whole trajectories in a road network space into multidimensional data points in a Euclidean space while preserving their relative distances in the transformed metric space. This transformation enables us to effectively shift the clustering task from whole mobile object trajectories in the complex road network space to the traditional clustering task for multidimensional data in a Euclidean space. Third, we develop a cluster validation method for evaluating the clustering quality in both the transformed metric space and the road network space. Extensive experimental evaluation with trajectories generated on real road network maps of different cities shows that TRACEMOB produces higher-quality clustering results and outperforms existing approaches by an order of magnitude. | 2017 |
11 | Bag-of-Discriminative-Words (BoDW) Representation via Topic Modeling | Many of the words in a given document either deliver facts (objective) or express opinions (subjective), depending on the topics they are involved in. For example, given a collection of documents, the word “bug” assigned to the topic “order Hemiptera” clearly denotes an object (i.e., a kind of insect), while the same word assigned to the topic “software” probably conveys a negative opinion. Motivated by the intuitive assumption that different words have varying degrees of discriminative power in delivering the objective sense or the subjective sense with respect to their assigned topics, a model named discriminatively objective-subjective LDA (dosLDA) is proposed in this paper. The essential idea underlying the proposed dosLDA is that a pair of objective and subjective selection variables are explicitly employed to encode the interplay between topics and discriminative power for the words in documents in a supervised manner. As a result, each document is appropriately represented as a “bag-of-discriminative-words” (BoDW). The experiments reported on documents and images demonstrate that dosLDA not only performs competitively against traditional approaches in terms of topic modeling and document classification, but also has the ability to discern the discriminative power of each word in terms of its objective or subjective sense with respect to its assigned topic. | 2017 |
12 | Search Rank Fraud and Malware Detection in Google Play | Fraudulent behaviors in Google Play, the most popular Android app market, fuel search rank abuse and malware proliferation. To identify malware, previous work has focused on app executable and permission analysis. In this paper, we introduce FairPlay, a novel system that discovers and leverages traces left behind by fraudsters to detect both malware and apps subjected to search rank fraud. FairPlay correlates review activities and uniquely combines detected review relations with linguistic and behavioral signals gleaned from Google Play app data (87 K apps, 2.9 M reviews, and 2.4 M reviewers, collected over half a year) in order to identify suspicious apps. FairPlay achieves over 95 percent accuracy in classifying gold-standard datasets of malware, fraudulent, and legitimate apps. We show that 75 percent of the identified malware apps engage in search rank fraud. FairPlay discovers hundreds of fraudulent apps that currently evade Google Bouncer’s detection technology. FairPlay also aided the discovery of more than 1,000 reviews, reported for 193 apps, that reveal a new type of “coercive” review campaign: users are harassed into writing positive reviews and into installing and reviewing other apps. | 2017 |
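One of the signals described above — review relations between apps — can be illustrated with a toy sketch: apps whose reviewer sets overlap unusually heavily (high Jaccard similarity) are candidates for coordinated rank fraud. This Python sketch covers only that single co-review signal; FairPlay itself combines it with linguistic and behavioral features, and the threshold below is illustrative.

```python
from itertools import combinations

def suspicious_app_pairs(reviews, min_jaccard=0.5):
    """reviews: {app: set of reviewer ids}. Flag app pairs whose
    reviewer sets overlap heavily, a possible sign of a fraud crew
    reviewing the same group of apps."""
    flagged = []
    for a, b in combinations(sorted(reviews), 2):
        inter = len(reviews[a] & reviews[b])
        union = len(reviews[a] | reviews[b])
        if union and inter / union >= min_jaccard:
            flagged.append((a, b))
    return flagged
```

Organic apps with large, diverse audiences rarely share most of their reviewers, so a high overlap ratio between small reviewer sets is a cheap first-pass filter before deeper analysis.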