Students’ Research Works – Spring 2017: Data Analysis, Mining and Machine Learning (PeWe.Data)

User behavior similarity identification for the task of predicting the leaving of the session

Tomáš Bako
bachelor study, supervised by Ondrej Kaššák

Abstract. Clustering is one way to analyze large amounts of data. However, clustering on its own is not a very effective way to do this. A new way of clustering was therefore proposed: clustering over a data stream. Clustering over a data stream analyzes continuous data, and every data item is used only once. Afterwards the item is discarded and never used again.

In this project we search for a way to cluster data from the ALEF system using an algorithm that works over a data stream. The algorithm we use is CluStream, one of the common algorithms for clustering over data streams. It consists of two parts: an online microclustering part and an offline macroclustering part. Microclustering is used to process the stream fast enough, while macroclustering produces the final clusters, which can be created at any time during the clustering. The main goal of this project is to find a proper configuration of the pre-processed dataset, the algorithm input settings and the algorithm itself that yields clusters of sufficiently high quality, representing similar users of the ALEF system.
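The online microclustering step can be illustrated with a minimal sketch. It assumes the usual additive cluster summary (point count, per-dimension linear sum and squared sum) and deliberately omits the timestamps, snapshot storage and merge/delete rules of the full CluStream algorithm. The names `MicroCluster` and `process` and the parameters `max_clusters` and `boundary` are hypothetical and are not taken from the project.

```python
import math


class MicroCluster:
    """Additive summary of absorbed points: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)                # per-dimension linear sum
        self.ss = [x * x for x in point]     # per-dimension squared sum

    def absorb(self, point):
        # The summary is updated in place; the point itself is not stored.
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # Root of the summed per-dimension variance around the centroid.
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def process(stream, max_clusters=10, boundary=1.0):
    """Single pass over the stream; every point is seen exactly once.

    Assumes max_clusters >= 1.
    """
    clusters = []
    for point in stream:
        nearest = min(clusters,
                      key=lambda c: distance(c.centroid(), point),
                      default=None)
        if nearest is not None and distance(nearest.centroid(), point) <= boundary:
            nearest.absorb(point)
        elif len(clusters) < max_clusters:
            clusters.append(MicroCluster(point))
        else:
            # Simplification (assumption): when the budget is exhausted, the
            # point is absorbed by the nearest micro-cluster instead of
            # deleting or merging old micro-clusters as CluStream does.
            nearest.absorb(point)
    return clusters
```

An offline macroclustering step (for example k-means over the micro-cluster centroids) could then be run on the returned summaries at any time, without revisiting the original stream.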

Predicting User Retention in online environment

Patrik Berger
master study, supervised by Michal Kompan

Abstract. The recent growth of the market and advances in technology have increased the number of competitors providing online services to their users. In these circumstances, acquiring a new user is several times more expensive than keeping an existing one. That makes user retention one of the key metrics of success for an online service (e-shops, bank services, insurance companies, etc.). Successfully predicting the churn of a specific user provides an opportunity to change the user’s decision, for instance by making a special offer. The prospect of preventing churn and identifying its reasons creates a strong motivation to explore this area.

In our work we focus on identifying a set of features from which to create a user model for churn prediction. In the first stage of our work we plan to build a user model in a selected domain and explore the possibilities of automatic feature extraction from the data. As a next step we want to select classifiers and build the structure of a learning ensemble. Finally, we plan to test our model on a nontrivial dataset from the selected domain.

Detection of Anti-social Behavior in Online Communities

Martin Borák
master study, supervised by Ivan Srba

Abstract. Lately, online communities have been gaining importance and popularity on the Web, mainly in places such as social networks, CQA systems, online games and news or entertainment portals. Immediate communication with an unlimited number of people on an enormous number of topics has become a part of everyday life for hundreds of millions of people in the world.

Considering the huge number of members of these communities, the content of such communication is often rather diverse. Often there are users who try to disrupt it, whether by posting pointless messages, sharing links to irrelevant sites, uncalled-for sarcasm, or actual aggressive behavior and rude verbal attacks. One of the most common types of these users are haters, who spread hate via rude, vulgar and often hurtful content aimed at people or things they dislike. Such behavior degrades the quality of discussion, discouraging other users from reading and contributing to it and, ultimately, from visiting the portal. It can also give rise to legal issues.

In our work we focus on the analysis of antisocial behavior on the Web and on the automatic detection of haters’ comments on YouTube, a portal that mediates multimedia content and is known for a high concentration of haters in the discussion sections of videos. We created our own dataset and developed a machine learning based method to detect hateful comments, using a combination of different features extracted from text, user history, community reaction and hierarchical data. We use co-training, which is a form of semi-supervised learning, to achieve the highest possible performance on our dataset, where a subset of items has been labelled using crowdsourcing.

Stream analysis of incoming events using different data analysis methods

Matúš Cimerman
master study, supervised by Jakub Ševcech

Abstract. Data analysis is a non-trivial task and gets even harder when it comes to streaming data. Several constraints need to be met when analysing data streams, e.g. limited time and memory or real-time latency. In this work, we focus on an ensemble of tools that aims to ease the analysis of streaming data for domain experts. We assume that the domain expert does not have detailed knowledge of data mining methods and algorithms.

We propose a real-time visualization of the results and of the resulting model that emphasises the drifts that have occurred. Such a visualization helps the domain expert understand how the model works and how it was affected by drifts in the data stream. Using Hoeffding trees and a classification task, we evaluate both the classification method (quantitatively) and the visualization (qualitatively).
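As background on Hoeffding trees, the split decision rests on the Hoeffding bound: after n observations of a quantity with range R, the observed mean is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability 1 - delta. The sketch below shows this test only; the function names and the example numbers are illustrative assumptions, not part of the evaluated tool.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the observed mean of n samples is within epsilon
    of the true mean with probability 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """A leaf is split once the best attribute beats the runner-up by more
    than the bound, i.e. the ranking is unlikely to change with more data."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

With a gain difference of 0.2, a range of 1.0 and delta = 1e-7, the test fails after 50 examples and passes after 1000, which is why a Hoeffding tree grows only as evidence accumulates in the stream.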

Web Site Users’ Behavioral Trend Analysis

Natália Čuláková
bachelor study, supervised by Ondrej Kaššák

Abstract. User behaviour is very individual. However, if we consider a sufficiently large number of users, their behaviour will be subject to certain trends. In this work we deal with the identification of such trends among users of a web site. We identify these trends by searching for frequent patterns. Since we work with online data, which arrive in fast and voluminous streams, they are difficult to process. Therefore, the work focuses on one-pass algorithms for processing data streams. Such data analysis can help us get to know the users of the web site better, and with this information we can better customise the web site according to the users’ trends.

We analyse current approaches to frequent pattern mining in data streams. Based on existing algorithms, we try to design our own one-pass algorithm aimed at finding frequent itemsets in the data from a selected web site. In this algorithm we focus on changes over a period of time. We test this algorithm in the domain of an e-shop with discount coupons through a variety of tasks. We also try to modify this algorithm to be domain independent.
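To make the idea of one-pass frequent itemset mining with attention to change over time concrete, here is a minimal sketch. It is not the algorithm designed in this work: it assumes exponentially decayed counts over itemsets of at most `max_size` items, and the class name `DecayedItemsetCounter` and its parameters are hypothetical. It also never prunes rare itemsets, which a real stream algorithm would need to do to bound memory.

```python
from itertools import combinations


class DecayedItemsetCounter:
    """One-pass itemset counter in which older transactions weigh less."""

    def __init__(self, decay=0.99, max_size=2):
        self.decay = decay          # weight multiplier per elapsed transaction
        self.max_size = max_size    # largest itemset size that is tracked
        self.counts = {}            # itemset -> (weight, time of last update)
        self.t = 0                  # number of transactions seen so far

    def add(self, transaction):
        """Process one transaction; it is never revisited afterwards."""
        self.t += 1
        items = sorted(set(transaction))
        for k in range(1, self.max_size + 1):
            for itemset in combinations(items, k):
                weight, last = self.counts.get(itemset, (0.0, self.t))
                # Lazy decay: age the stored weight only when it is touched.
                aged = weight * self.decay ** (self.t - last)
                self.counts[itemset] = (aged + 1.0, self.t)

    def frequent(self, min_weight):
        """Itemsets whose decayed weight is currently at least min_weight."""
        result = {}
        for itemset, (weight, last) in self.counts.items():
            current = weight * self.decay ** (self.t - last)
            if current >= min_weight:
                result[itemset] = current
        return result
```

A decay below 1.0 lets itemsets that stop appearing fade out of the frequent set, which is one simple way to follow trends that change over a period of time.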

Recognition of Similarities in User Behavior in Data Stream

Juraj Flamík
master study, supervised by Ondrej Kaššák

Abstract. It would seem that the behavior of a web site user is highly unique and different from the behavior of other users. It is based on the user’s current intention and previous experience with the web site. However, the web site itself offers only a finite number of ways in which users can behave. Thanks to this fact, we can find users who behave similarly. We can then use this information in tasks like personalization, user modeling, recommendation or prediction.

In our work, we analyze the possibilities of user behavior clustering. Because we work with a lot of data from web sites with dynamically changing content, we focus on clustering in data streams. We solve subtasks like feature engineering, distance measurement and cluster quality measurement. We then want to use the obtained clusters (behavior similarities) to improve a chosen prediction task. Finally, we want to test our method on a nontrivial real dataset and show that clustering can help to get better results for the chosen prediction task.

Web User Behavioral Patterns Recognition in Online Time for Personalized Recommendation

Tomáš Chovaňák
master study, supervised by Ondrej Kaššák

Abstract. Understanding the behavior of a website user is a crucial prerequisite for identifying the user’s preferences and recommending interesting content. Typical and repeating features of user behavior during visits to the website can be expressed through behavioral patterns. We represent these patterns as frequent itemsets of actions performed by users in their sessions.

We respond to the current trend of Web personalization, which focuses on the needs of individual users and on the need to process data in online time. The reason is that user behavior in many domains (e.g., news, social networks) changes often, so it is necessary to react dynamically to the most recent behavior. Traditional methods identify patterns offline, in a way that is accurate but expensive in time and computation. Our novel method for behavioral pattern recognition and its application to recommendation in online time combines global patterns with patterns specific to groups of similar users and uses them to recommend next actions based on the user’s current behavior.

We performed several experiments over data from the e-learning and news domains. Our results clearly show that the combination of common global patterns and specific group patterns reaches higher recommendation precision than either component used individually. The inclusion of group patterns also adds only a constant computational load, which makes the method easier to maintain in production use.

Learning Video Representations for Generating Descriptions

Patrik Gajdošík
master study, supervised by Márius Šajgalík

Abstract. Eye-tracking is a great way to enhance the user experience. It can do so directly, as a new way for users to control applications, or indirectly, when interface designers and application creators use it in usability testing to increase the usability and efficiency of their applications. The problem with eye-tracking is that it requires specialized devices, eye-trackers, that capture the eye gaze. Eye-trackers are not widespread among ordinary users and are usually accessible only in specialized environments. However, web cameras are present in almost every mobile device.

In our work, we propose a solution that would use web cameras to perform eye-tracking. For that we decided to use neural networks, which cope well with data that is noisy or of low quality. We want to design a neural network architecture that takes the video captured by a web camera and generates the coordinates of the user’s gaze. We also want to enhance our model so that, in addition to the gaze, it recognizes some of the simple patterns that can appear in gaze recordings.

Using Machine Learning for Predicting User Behaviour

Martin Jakubík
bachelor study, supervised by Michal Kompan

Abstract. The number of active users of social networks is a critical measure of their popularity. Social network companies use various strategies to keep users active. It is becoming increasingly important to be able to predict the future activity of users. Users that are about to become inactive can be given incentives and reactivated while they are still active.

In this work, we present a simple prediction model that can be used to predict the activity level of Twitter users based on the historical data generated by the users. The goal of this model is to identify the users that are about to become inactive so that they can be targeted and reactivated.

Executable documents on data analysis

Jakub Janeček
bachelor study, supervised by Jakub Ševcech

Abstract. Data analysis has come to prominence in the last two decades. It is important for students to become familiar with data analysis, as it is a very useful field of study in the modern era, when we are overflowing with data. In this work we concentrate on basic knowledge of data analysis, especially classification with ensemble learning. We cover the theory of the differences between ensemble learning methods and the comparison of basic classifiers and ensembles. We also shed some light on more concrete parts of classification, like feature engineering, training of classifiers and overfitting.

This will also be covered by executable documents using Jupyter notebooks, which will serve as tutorials for future students and data analysis enthusiasts.

User Modelling for Session End Intent Prediction

Ondrej Kaššák
doctoral study, supervised by Mária Bieliková

Abstract. User behaviour on a web site can be modelled from two basic points of view. The first one is short-term behaviour, which reflects the user’s current intent, preferences, goals, etc. It captures the user’s most recent behaviour and actions, but it is typically very noisy because of the influence of the user’s current context, mood and other unpredictable conditions.

The second point of view, long-term behaviour, is characterized by identifying more stable preferences and capturing the user’s typical habits. On the other hand, this kind of behaviour is not so adaptable to changes; it learns trends and hot topics in the user’s behaviour only after a longer period of time.

To be able to model user preferences and predict future behaviour, it is suitable to combine both data sources and consider them when estimating the user’s next actions. In our research, we focus on the task of predicting the user’s intent to end the session. This task requires the ability to recognize subtle changes in user behaviour in comparison to previous behaviour in different time periods, as well as the characteristics of the current user session.

Ensuring Robustness against Changes in Web Sites during Data Extraction

Michal Kren
master study, supervised by Ivan Srba

Abstract. The amount of data on the web increases exponentially, so we need an efficient solution for extracting this data, because manual approaches are no longer viable. Intensive research is being conducted in the field of automatic data extraction from the web with the purpose of eliminating, or at least decreasing, the human input. In this work, we focus on web wrappers, more specifically on their maintenance. The structure of a web site changes over time, and even the smallest changes can cause a wrapper to fail. Our goal is to analyze and improve existing solutions for automatic wrapper maintenance to ensure robustness against structural changes of the page over time.

Modelling Music Structure using Artificial Neural Networks

Lukáš Marták
master study, supervised by Márius Šajgalík

Abstract. With the era of digital technologies, we can see a dramatic evolution of the music industry together with a radical growth of music content. Libraries are crowded with music, ready to stream compressed but still high-quality audio tracks on demand. As the richness of music content grows, it is crucial to have new methods to describe this content, designed for various purposes. Music Information Retrieval is an interdisciplinary science of retrieving information from music. Various tasks have been identified within the field, which aim to solve different real-world problems.

In this work, we approach the task of Automatic Music Transcription, which is the process of retrieving musical notation from an audio piece containing a music recording. The main subproblem to be solved here is called Multiple Fundamental-Frequency Estimation. In the past, it has been approached mostly by signal processing domain experts, using handcrafted features to extract information from the signal. We approach this problem within the context of the emerging field of machine learning, focusing on deep learning methods. To be able to effectively model the structure of musical content within an audio signal, we need to build a deep neural network architecture and optimize it to gain this modelling capacity.

Text reading analysis

Jakub Mrocek
master study, supervised by Róbert Móro

Abstract. Our project aims to detect parts of a text which are not comprehensible to the human reader, using an eye-tracking device to monitor the reader’s gaze. The amount of text we read every day on a screen is on the rise because of changes in our lifestyle. The main goal of a writer is to encode information in a text representation in a way that the reader can easily decode. Even though the text may seem to be written in a clear and comprehensible way, the reader may still have problems understanding it correctly. This phenomenon may be caused by differences in the mental or intellectual development of the writer and the reader. It is also obvious that comprehension problems may vary among different users.

However, it is not trivial to verify whether the reader was able to understand the text. Thankfully, modern technologies provide us with increasingly sophisticated ways of recording human-computer interaction. Our method analyzes data recorded while users were reading text and tries to determine the areas which might have been difficult to read or even incomprehensible to the reader.

Conflict detection and visualization in software models

Martin Olejár
master study, supervised by Karol Rástočný

Abstract. The development of software systems includes the creation of different model versions of the developed system, which continuously undergo significant changes. For effective progress in development, it is necessary to detect and identify changes in the models. Furthermore, many changes to models can be conflicting. Conflicting changes arise mainly during the parallel work of a developer team and must be resolved before model versions are synchronized. The problem is not only the detection of these conflicting changes, but also the detection of all model parts that are influenced by them. Correct detection and visualization of all conflicts in the models would provide a very good precondition for their resolution and for successful synchronization of model versions.

In our work, we would like to detect and visualize differences between two UML model versions. We plan to use these differences to propose a method for the detection, visualization and resolution of model conflicts.

Generative Adversarial Networks

Adam Rafajdus
master study, supervised by Márius Šajgalík

Abstract. Machine learning, and more specifically deep learning, has received a massive boost in popularity and usability in many domains in recent years, thanks not only to increased computing power and large amounts of data, but also to new architectures of neural networks.

Generative adversarial networks are a new architecture of neural networks in which two models are trained simultaneously, and their adversarial relationship (playing a min-max game against each other) helps produce better results on the given tasks.
Although this framework is still fairly new, it has shown its potential on tasks like generating high-quality images or learning features from images.
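
For reference, the min-max game is usually written as the following value function from the original GAN formulation. The symbols are the customary ones and are assumptions here, since the abstract does not define any: G is the generator, D the discriminator, p_data the data distribution and p_z the noise prior.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

The discriminator D is trained to maximize V by telling real samples from generated ones, while the generator G is trained to minimize it by producing samples that D classifies as real.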
