When forecasting sales figures, not only the sales history but also the future price of a product will influence the sales quantity. At first sight, multivariate time series seem to be the appropriate model for this task. Nonetheless, in real life history is not always repeatable, i.e., in the case of sales history there is only one price for a product at a given time. This complicates the design of a multivariate time series. However, for some seasonal or perishable products the price is a function of the expiration date rather than of the sales history. This additional information can help to design a more accurate and causal time series model. The proposed solution uses a univariate time series model but takes the price of a product as a parameter that systematically influences the prediction. The price influence is computed from historical sales data using correlation analysis and adjustable price ranges to identify products with comparable history. Compared to other techniques, this novel approach is easy to compute and allows the price parameter to be preset for predictions and simulations. Tests with data from the Data Mining Cup 2012 demonstrate better results than established sophisticated time series methods.
A sequence of transactions represents a complex and multidimensional type of data. Feature construction can be used to reduce the data's dimensionality and to find behavioural patterns within such sequences. The patterns can be expressed using the blueprints of the constructed relevant features. These blueprints can then be used for real-time classification of other sequences.
Data collected from internet applications are mainly stored in the form of transactions. All transactions of one user form a sequence, which shows the user's behaviour on the site. Nowadays, it is important to be able to classify this behaviour in real time for various reasons, e.g., to increase the conversion rate of customers while they are in the store or to prevent fraudulent transactions before they are placed. However, this is difficult due to the complex structure of the data sequences (i.e., a mix of categorical and continuous data types, constant data updates) and the large amounts of data that are stored. Therefore, this thesis studies the classification of complex data sequences. It surveys the fields of time series analysis (temporal data mining), sequence data mining, and standard classification algorithms. It turns out that these algorithms are either difficult to apply to data sequences or do not deliver a classification: time series methods need a predefined model and are not able to handle complex data types; sequence classification algorithms such as the apriori algorithm family are not able to utilize the time aspect of the data. The strengths and weaknesses of the candidate algorithms are identified and used to build a new approach to the classification of complex data sequences. The problem is solved by a two-step process. First, feature construction is used to create and discover suitable features in a training phase. Then, the blueprints of the discovered features are used in a formula during the classification phase to perform the real-time classification. The features are constructed by combining and aggregating the original data over the span of the sequence, including the elapsed time, by using a calculated time axis. Additionally, feature combination and feature selection are used to simplify complex data types. This makes it possible to capture behavioural patterns that emerge over the course of time.
This newly proposed approach combines techniques from several research fields. Part of the algorithm originates from the field of feature construction and is used to reveal behaviour over time and express this behaviour in the form of features. A combination of the features is used to highlight relations between them. The blueprints of these features can then be used to achieve classification in real time on an incoming data stream. An automated framework is presented that allows the features to adapt iteratively to a change in the underlying patterns in the data stream. This core feature of the presented work is achieved by separating the feature application step from the computationally costly feature construction step and by iteratively restarting the feature construction step on the new incoming data. The algorithm and the corresponding models are described in detail and applied to three case studies (customer churn prediction, bot detection in computer games, credit card fraud detection). The case studies show that the proposed algorithm is able to find distinctive information in data sequences and use it effectively for classification tasks. The promising results indicate that the suggested approach can be applied to a wide range of other application areas that incorporate data sequences.
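The separation of a costly construction step from a cheap application step can be illustrated with a minimal sketch. It is not the framework from the thesis: the candidate blueprints, the exhaustive threshold search, and the names `CANDIDATES`, `construct`, and `classify` are assumptions made for illustration, and each sequence is assumed to be a non-empty, time-ordered list of (timestamp, amount) pairs.

```python
# Hypothetical candidate blueprints: name -> function that maps a
# time-ordered sequence of (timestamp, amount) pairs to one number.
CANDIDATES = {
    "count": lambda seq: len(seq),
    "total_amount": lambda seq: sum(amount for _, amount in seq),
    # Events per elapsed time unit; the span is clamped to 1 to avoid
    # division by zero for single-event sequences.
    "events_per_time": lambda seq: len(seq) / max(seq[-1][0] - seq[0][0], 1),
}


def construct(training):
    """Costly step: search all blueprints and thresholds on labelled data.

    `training` is a list of (sequence, label) pairs with boolean labels.
    Returns the (blueprint name, threshold) pair with the highest
    training accuracy; earlier candidates win ties.
    """
    best = None
    for name, fn in CANDIDATES.items():
        values = [(fn(seq), label) for seq, label in training]
        for threshold in sorted({value for value, _ in values}):
            correct = sum((value >= threshold) == label for value, label in values)
            if best is None or correct > best[0]:
                best = (correct, name, threshold)
    _, name, threshold = best
    return name, threshold


def classify(blueprint, seq):
    """Cheap step: apply a stored blueprint to one incoming sequence."""
    name, threshold = blueprint
    return CANDIDATES[name](seq) >= threshold


training = [
    ([(0, 5.0), (1, 5.0), (2, 5.0), (3, 5.0)], True),
    ([(0, 5.0), (50, 5.0)], False),
    ([(0, 20.0), (100, 20.0), (200, 20.0), (300, 20.0)], False),
    ([(0, 1.0), (2, 1.0), (4, 1.0)], True),
]
blueprint = construct(training)
```

In this sketch, adapting to a changed pattern amounts to calling `construct` again on newer labelled sequences and swapping the stored blueprint, while `classify` keeps running unchanged on the incoming stream.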
Online credit card fraud presents a significant challenge in the field of eCommerce. In 2012 alone, the total loss due to credit card fraud in the US amounted to $54 billion. Online games merchants in particular have difficulty applying standard fraud detection algorithms to achieve timely and accurate detection. This paper describes the special constraints of this domain and highlights the reasons why conventional algorithms are not very effective in dealing with this problem. Our suggested solution combines the field of feature construction with that of temporal sequence data mining. We present feature construction techniques that are able to create discriminative features based on a sequence of transactions and to incorporate time into the classification process. In addition, a framework is presented that allows for an automated and adaptive change of features in case the underlying pattern changes.
Recent years, and especially the Internet, have changed the ways in which data is stored. It is now common to store data in the form of transactions, together with its creation time-stamp. These transactions can often be attributed to logical units, e.g., all transactions that belong to one customer. These groups, which we refer to as data sequences, have a more complex structure than tuple-based data. This makes it more difficult to find discriminatory patterns for classification purposes. However, the complex structure potentially enables us to track behaviour and its change over the course of time. This is particularly interesting in the e-commerce area, in which classification of a sequence of customer actions is still a challenging task for data miners. However, before standard algorithms such as Decision Trees, Neural Nets, Naive Bayes or Bayesian Belief Networks can be applied to sequential data, preparations are required in order to capture the information stored within the sequences. Therefore, this work presents a systematic approach on how to reveal sequence patterns among data and how to construct powerful features out of the primitive sequence attributes. This is achieved by sequence aggregation and the incorporation of the time dimension into the feature construction step. The proposed algorithm is described in detail and applied to a real-life data set, which demonstrates the ability of the proposed algorithm to boost the classification performance of well-known data mining algorithms for binary classification tasks.
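As a rough illustration of sequence aggregation with a time dimension, the following sketch collapses each data sequence into one flat feature record that a standard classifier could consume. The record layout (sequence id, timestamp, amount), the chosen aggregates, and the function name `build_sequence_features` are assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict


def build_sequence_features(transactions):
    """Aggregate (sequence_id, timestamp, amount) records per sequence.

    Returns a dict that maps each sequence id to a dict of features
    combining plain aggregates with time-based ones.
    """
    sequences = defaultdict(list)
    for seq_id, timestamp, amount in transactions:
        sequences[seq_id].append((timestamp, amount))

    features = {}
    for seq_id, events in sequences.items():
        events.sort()  # order each sequence by timestamp
        times = [t for t, _ in events]
        amounts = [a for _, a in events]
        span = times[-1] - times[0]  # elapsed time covered by the sequence
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        features[seq_id] = {
            "count": len(events),
            "total_amount": sum(amounts),
            "mean_amount": sum(amounts) / len(amounts),
            "span": span,
            "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
            "amount_per_time": sum(amounts) / span if span > 0 else 0.0,
        }
    return features


transactions = [
    ("a", 0, 10.0),
    ("a", 10, 30.0),
    ("a", 30, 20.0),
    ("b", 5, 7.0),
]
features = build_sequence_features(transactions)
```

Features such as `mean_gap` and `amount_per_time` only exist because the time-stamps are kept, which is the point of bringing the time dimension into the construction step.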
When forecasting sales figures, not only the sales history but also the future price of a product will influence the sales quantity. At first sight, multivariate time series seem to be the appropriate model for this task. Nonetheless, in real life history is not always repeatable, i.e., in the case of sales history there is only one price for a product at a given time. This complicates the design of a multivariate time series. However, for some seasonal or perishable products the price is a function of the expiration date rather than of the sales history. This additional information can help to design a more accurate and causal time series model. The proposed solution uses a univariate time series model but takes the price of a product as a parameter that systematically influences the prediction based on a calculated periodicity. The price influence is computed from historical sales data using correlation analysis and adjustable price ranges to identify products with comparable history. The periodicity is calculated using a novel approach based on data folding and Pearson correlation. Compared to other techniques, this approach is easy to compute and allows the price parameter to be preset for predictions and simulations. Tests with data from the Data Mining Cup 2012 as well as artificial data demonstrate better results than established sophisticated time series methods.
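The abstract does not spell out how folding and Pearson correlation are combined, so the following is only one plausible reading, not the authors' algorithm. It assumes that "folding" means overlaying the series with a copy of itself shifted by a candidate period and scoring the overlap with the Pearson coefficient. The function names, the tolerance `tol`, and the rule of preferring the shortest near-best period are all illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long sequences.

    Returns 0.0 if either sequence is constant (undefined correlation).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) * (a - mean_x) for a in x)
    var_y = sum((b - mean_y) * (b - mean_y) for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def fold_score(series, period):
    """Correlate the series with itself folded over by `period` steps."""
    return pearson(series[:-period], series[period:])


def estimate_period(series, min_period=2, max_period=None, tol=1e-9):
    """Return the shortest candidate period whose score is within `tol` of the best."""
    if max_period is None:
        max_period = len(series) // 2
    scores = {p: fold_score(series, p) for p in range(min_period, max_period + 1)}
    best = max(scores.values())
    return min(p for p, score in scores.items() if score >= best - tol)


# Artificial sales data with a weekly pattern repeated six times.
weekly_pattern = [10, 12, 15, 14, 20, 30, 25]
sales = weekly_pattern * 6
period = estimate_period(sales)
```

Multiples of the true period score equally well on perfectly periodic data, which is why the sketch picks the shortest period among the near-best candidates.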
Recent years, and especially the Internet, have changed the way data is stored. We now often store data together with its creation time-stamp. These data sequences potentially enable us to track the change of data over time. This is particularly interesting in the e-commerce area, in which classification of a sequence of customer actions is still a challenging task for data miners. However, before standard algorithms such as Decision Trees, Neural Nets, Naive Bayes or Bayesian Belief Networks can be applied to sequential data, preparations are required in order to capture the information stored within the sequences. Therefore, this work presents a systematic approach on how to reveal sequence patterns among data and how to construct powerful features out of the primitive sequence attributes. This is achieved by sequence aggregation and the incorporation of the time dimension into the feature construction step. The proposed algorithm is described in detail and applied to a real-life data set, which demonstrates the ability of the proposed algorithm to boost the classification performance of well-known data mining algorithms for classification tasks.