TY  - THES
U1  - Dissertation / Habilitation
A1  - Schaidnagel, Michael
T1  - Automated feature construction for classification of complex, temporal data sequences
N2  - Data collected from internet applications are mainly stored in the form of transactions. All transactions of one user form a sequence, which shows the user's behaviour on the site. Nowadays, it is important to be able to classify this behaviour in real time for various reasons, e.g. to increase the conversion rate of customers while they are in the store or to prevent fraudulent transactions before they are placed. However, this is difficult due to the complex structure of the data sequences (i.e. a mix of categorical and continuous data types, constant data updates) and the large amounts of data that are stored. Therefore, this thesis studies the classification of complex data sequences. It surveys the fields of time series analysis (temporal data mining), sequence data mining and standard classification algorithms. It turns out that these algorithms are either difficult to apply to data sequences or do not deliver a classification: time series methods need a predefined model and are not able to handle complex data types; sequence classification algorithms such as the Apriori algorithm family are not able to utilize the time aspect of the data. The strengths and weaknesses of the candidate algorithms are identified and used to build a new approach to the classification of complex data sequences. The problem is solved by a two-step process. First, feature construction is used to create and discover suitable features in a training phase. Then, the blueprints of the discovered features are used in a formula during the classification phase to perform the real-time classification. The features are constructed by combining and aggregating the original data over the span of the sequence, including the elapsed time, by using a calculated time axis. Additionally, a combination of features and feature selection are used to simplify complex data types. This allows catching behavioural patterns that occur in the course of time.
This newly proposed approach combines techniques from several research fields. Part of the algorithm originates from the field of feature construction and is used to reveal behaviour over time and express this behaviour in the form of features. A combination of the features is used to highlight relations between them. The blueprints of these features can then be used to achieve classification in real time on an incoming data stream. An automated framework is presented that allows the features to adapt iteratively to a change in underlying patterns in the data stream. This core feature of the presented work is achieved by separating the feature application step from the computationally costly feature construction step and by iteratively restarting the feature construction step on the new incoming data. The algorithm and the corresponding models are described in detail and applied to three case studies (customer churn prediction, bot detection in computer games, credit card fraud detection). The case studies show that the proposed algorithm is able to find distinctive information in data sequences and use it effectively for classification tasks. The promising results indicate that the suggested approach can be applied to a wide range of other application areas that incorporate data sequences.
Y2  - 2016
SP  - 138
S1  - 138
CY  - Paisley
ER  -