Students’ Research Works – Spring 2016: Data Analysis, Mining and Machine Learning (PeWe.Data)

Martin Borák: Detection of anti-social behavior in online communities
Matúš Cimerman: Stream analysis of incoming events using different data analysis methods
Martin Číž: Predicting Interest in Information Sources on the Internet using Machine Learning
Tomáš Chovaňák: Recognition of Web user’s behavioral patterns
Ondrej Kaššák: User Model Specialized for Session Exit Intent Prediction Task
Michal Kren: Assignment of Educational Badges in CQA System Askalot
Martin Měkota: Source Code Search Acknowledging Reputation of Developers
Ľudovít Labaj: Machine learning – a system for automatic creation and testing of derived features
Martin Olejár: Software Modelling Support for Small Teams
Jakub Ondik: Software Modelling Support for Small Teams
Marek Roštár: Similarities in Source Codes
Jakub Ševcech: Stream Data Processing
Peter Truchan: Prediction of User Behavior in a Web Application of the Bank
Peter Uherek: Application of Machine Learning for Sequential Data
Michal Randák: Universal Tool to Assign Badges in Online Communities
Ľubomír Vnenk: Analysis of human work activity on the Web

Detection of anti-social behavior in online communities

Martin Borák
master study, supervised by Ivan Srba

Abstract. Online communities have recently been gaining importance and popularity on the Web, mainly in places such as social networks, CQA systems, online games, and news or entertainment portals. Immediate communication with an unlimited number of people on an enormous number of topics has become a part of everyday life for hundreds of millions of people around the world.

Given the huge number of members in these communities, the content of such communication is often rather diverse. There are often users who try to disrupt these communications, whether by posting pointless messages, sharing links to irrelevant sites, making uncalled-for sarcastic remarks, or by actual aggressive behavior and rude verbal attacks. The most notable type of these users are so-called trolls, who at first pretend to be regular members of a community but then try to disrupt it by annoying people and starting arguments. Such behavior degrades the quality of discussion, discouraging other users from reading and contributing to it and, eventually, from visiting the portal. It can also give rise to legal issues.

In our work we will focus on the analysis of antisocial behavior on the Web and on the automatic detection of trolls and their posts on portals that are home to online communities. One possible candidate is YouTube, which mediates multimedia content and is known for a high concentration of trolls in the discussion sections of videos.

In Proc. of Spring 2016 PeWe Workshop, pp. 3-4

Text documents clustering

Peter Belai
bachelor study, supervised by Michal Barla

Abstract. Every day, people encounter great amounts of text, be it in newspapers, in books or on the Internet, and can easily be overwhelmed by it. An ideal solution to this problem is to group these texts, or documents, according to a keyword we select, and subsequently choose only those that interest us. However, the Slovak language, like many others, is rich in words that share the same spelling but mean different things. These words are called homonyms. There are a number of approaches to this problem, but almost none of them has been applied to the Slovak language.
The main goal of this study is to use word sense induction, more precisely context-based clustering, to distinguish the different senses of an ambiguous word.

We experiment with different settings, ranging from different sizes of the context window to different degrees of matrix dimensionality reduction using SVD. In this way we try to find optimal settings, so that the user can distinguish the different senses of the ambiguous word more easily. We assume that the more senses a homonym has, the less we need to reduce the dimensionality of the matrix compared with less ambiguous terms.
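
As an illustration only, the following minimal Python sketch shows how context-window vectors for occurrences of an ambiguous word could be built and compared with cosine similarity. It is not the author's implementation: the toy English sentences, the assumption that stop words were already removed, and the function names `context_vector` and `cosine` are all hypothetical, and the SVD reduction and the clustering step mentioned above are left out.

```python
from collections import Counter
from math import sqrt

def context_vector(tokens, index, window):
    """Count the words within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return Counter(left + right)

def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy occurrences of the ambiguous word "bank"
# (stop words are assumed to have been removed beforehand).
sentences = [
    "customer deposited money bank account interest".split(),
    "manager money bank account loan".split(),
    "fisherman river bank water boat".split(),
]
vectors = [context_vector(s, s.index("bank"), window=2) for s in sentences]

financial_pair = cosine(vectors[0], vectors[1])  # both financial contexts
mixed_pair = cosine(vectors[0], vectors[2])      # financial vs. river context
```

Occurrences whose context vectors are similar would end up in the same cluster, that is, the same induced sense. A larger `window` captures more context at the cost of more noise.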


Source code similarity

Juraj Brilla
bachelor study, supervised by Michal Kompan

Abstract. Nowadays, people have many possibilities to share information and acquire knowledge from the Internet. Because information is so readily available on the Web, it is easier to plagiarize a document or source code. This is the reason for creating systems that detect plagiarism and draw attention to it.

Several methods currently exist for detecting plagiarism in source code. In this bachelor thesis we analyze possible changes in code, categorized by the skill of the plagiarist. We split plagiarism detection in source code into two levels. The first level consists of an abstract syntax tree, which separates the source code into nodes that together form a syntax tree. The second level is represented by the n-gram method. It comprises three different methods, which detect similarity more precisely and are based on comparing small parts of the code.

This method includes another algorithm, which compares the processed data. The program will work with source code written in the C, Java and C# programming languages and will be implemented in Java.
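
To make the n-gram idea concrete, here is a minimal Python sketch that compares two code fragments by the overlap of their token n-grams. It is an illustration under stated assumptions, not the thesis implementation (which targets Java). The tokenizer, the tiny `KEYWORDS` set, the replacement of identifiers with `ID`, and the use of Jaccard similarity are all assumptions chosen for brevity.

```python
import re

# Hypothetical, deliberately tiny keyword list for the illustration.
KEYWORDS = {"int", "return", "while", "for", "if", "else"}

def tokenize(code):
    """Split source code into identifiers, numbers and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def normalize(tokens):
    """Replace every non-keyword identifier with the placeholder ID."""
    return ["ID" if (t[0].isalpha() or t[0] == "_") and t not in KEYWORDS else t
            for t in tokens]

def ngrams(tokens, n):
    """Return the set of all n-grams of consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def similarity(code_a, code_b, n=3):
    """Similarity of two code fragments based on normalized token n-grams."""
    return jaccard(ngrams(normalize(tokenize(code_a)), n),
                   ngrams(normalize(tokenize(code_b)), n))

original = "int sum = a + b; return sum;"
renamed = "int total = x + y; return total;"   # identifiers renamed only
unrelated = "while (i < 10) i = i * 2;"

renamed_score = similarity(original, renamed)
unrelated_score = similarity(original, unrelated)
```

Because identifiers are normalized before the n-grams are formed, simple renaming does not lower the score, while structurally different code shares few n-grams.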


Stream analysis of incoming events using different data analysis methods

Matúš Cimerman
master study, supervised by Jakub Ševcech

Abstract. Nowadays we can see an emerging need to analyze data as they occur. Processing and analysis of data streams is a complex task; in particular, we need to provide a low-latency and fault-tolerant solution.

In our work we focus on proposing a set of tools that will help a domain expert in the process of data analysis. The domain expert does not need to have detailed knowledge of analytical models. A similar approach is popular for analysing static collections, e.g. funnel analysis. We study the possibilities of using well-known methods for static data analysis in the domain of data stream analysis. Our goal is to apply a data analysis method in the domain of data streams. This approach is focused on simplicity of use of the selected method and interpretability of its results. Meeting these requirements is essential for domain experts, because they will then not need detailed knowledge of domains such as machine learning or statistics. We evaluate our solution using a software component implementing the chosen method.

In Proc. of Spring 2016 PeWe Workshop, pp. 5-6

Predicting Interest in Information Sources on the Internet using Machine Learning

Martin Číž
master study, supervised by Michal Barla

Abstract. The most important goal of each Internet source provider is to capture the reader’s interest, so that the reader becomes a returning customer. Although it is useful to evaluate previously published articles, there is an opportunity to estimate an article’s potential to be popular before it is even published. Many attributes may decide whether an article has such potential. These attributes include title, content, author, source, topic, freshness and credibility.

To extract the main topics of articles we use a topic model. We look at how topics evolved over time by using a dynamic topic model. By watching the progress of a topic’s words and the popularity of the topic’s corresponding articles, we can train our model to predict the popularity of an article that belongs to the topic.

Our method consists of learning how belonging to a certain topic can influence an article’s popularity. We use the progress of a topic’s popularity to see whether it was popular in the past. For example, if we want to know how popular an article may be this week, we look at how popular its topic was the previous week. For this purpose we use SVM for regression analysis. Our dataset contains a collection of newspaper articles published on the Web along with visits collected over a period of 4 months.


Recognition of Web user’s behavioral patterns

Tomáš Chovaňák
master study, supervised by Ondrej Kaššák

Abstract. Users’ behavioural patterns represent typical repeating behaviour of website users. Identified behavioural patterns may be used to reveal bottlenecks of a website, to predict the behaviour of many users or to reveal their intentions. Existing approaches are mainly focused on finding global behaviour patterns for large groups of users.

Web logs can be transformed into transactional datasets where each session represents a transaction. Many methods for finding frequent sequence patterns and frequent itemsets have been proposed. In our method we focus on the task of finding navigation patterns that are better suited to individual users. We transform the web session log dataset into an undirected graph with nodes representing individual pages of the website and the weights of inter-node links being specified by a number of different attributes.

Our work conforms to the current trend of Web personalization and focus on the needs of individual users. We search not only for behaviour patterns common to the wide community of website users, but also for behaviour patterns common to smaller groups of users with similar interests, and for individual behaviour patterns of users that differ from the global ones. In the proposed method we examine the influence of the combined usage of these behaviour patterns on the prediction of a user’s next actions.
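
The transformation of session logs into an undirected weighted graph can be sketched in a few lines of Python. This is only an illustration: the abstract does not state which attributes determine the link weights, so the sketch assumes the simplest one, the number of direct transitions between two pages in either direction. The function name `build_graph` and the sample sessions are hypothetical.

```python
from collections import defaultdict

def build_graph(sessions):
    """Map each unordered page pair to the number of direct transitions.

    Assumption: the edge weight is a plain transition count; other
    attributes (e.g. time spent) could be combined into the weight.
    """
    weights = defaultdict(int)
    for pages in sessions:
        for current_page, next_page in zip(pages, pages[1:]):
            if current_page != next_page:          # ignore page reloads
                weights[frozenset((current_page, next_page))] += 1
    return dict(weights)

# Hypothetical sessions, each an ordered list of visited pages.
sessions = [
    ["home", "news", "article"],
    ["news", "home", "contact"],
    ["home", "news"],
]
graph = build_graph(sessions)
```

Using `frozenset` as the edge key makes the graph undirected: a transition from `home` to `news` and one from `news` to `home` increase the same weight.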

In Proc. of Spring 2016 PeWe Workshop, pp. 7-8

User Model Specialized for Session Exit Intent Prediction Task

Ondrej Kaššák
doctoral study, supervised by Mária Bieliková

Abstract. User behaviour on a web site can be modelled from two basic points of view. The first one is short-term behaviour, which reflects the user’s current intent, preferences, goals etc. It captures the user’s most recent behaviour and actions, but it is typically very noisy because of the influence of the user’s current context, mood and other unpredictable conditions.

The second point of view, long-term behaviour, is characterized by identifying more stable preferences and capturing the user’s typical habits. On the other hand, this kind of behaviour is not as adaptable to changes; it learns trends and hot topics of user behaviour only after a longer time period.

To be able to model user preferences and predict future behaviour, it is suitable to combine both data sources and consider them when estimating the user’s next actions. In our research, we focus on the task of user session exit intent prediction. This task requires the ability to recognize subtle changes in user behaviour in comparison with previous behaviour in different time periods, as well as characteristics of the current user session.

In Proc. of Spring 2016 PeWe Workshop, pp. 9-10

Assignment of Educational Badges in CQA System Askalot

Michal Kren
bachelor study, supervised by Ivan Srba

Abstract. CQA systems are a widely used platform for sharing knowledge and information. The property that best defines a question-answering platform is its user base and, more importantly, its activity. Intensive research is being conducted to find out how to increase user engagement, productivity and motivation. One such approach that has become extremely popular in recent years is gamification. In our work, we focus on badges and their application in CQA systems. More specifically, we introduce new, more complex types of badges designed specifically for the educational domain, such as badges awarded on a weekly basis, badges awarded only to a limited number of students etc. Our goal is to motivate students to actively use our faculty CQA system Askalot in their studies. We believe this will push the quality of education at our faculty a step further. In order to evaluate the success of our badges, we implemented a tool to assign these badges in Askalot and subsequently we conduct a live uncontrolled experiment in one selected bachelor course.

In Proc. of Spring 2016 PeWe Workshop, pp. 11-12

Machine learning – a system for automatic creation and testing of derived features

Ľudovít Labaj
bachelor study, supervised by Marek Ciglan

Abstract. Creating programs using machine learning is becoming more and more attractive. Such programs can be described as programs created (learned) from data. This method is especially useful in areas where there is too much data for manual processing or where it is too difficult for humans to formulate precise rules according to which the program operates.

The process of creating such programs is not simple at all and is accompanied by a number of problems, such as overfitting, high variance, a high number of dimensions, and others. A solution to these problems exists in the form of feature engineering: the search for and removal of irrelevant parameters and the extraction of new parameters from existing ones.

Feature engineering is specific to each area, which makes it more complex and time-consuming. For these reasons, efforts to automate feature engineering are under way. The aim of this project is to create a prototype for the automatic filtering, derivation and testing of parameters and thus to improve the accuracy of predictions.

In Proc. of Spring 2016 PeWe Workshop, pp. 17-18

Source Code Search Acknowledging Reputation of Developers

Martin Měkota
bachelor study, supervised by Eduard Kuric

Abstract. Newcomers in big software development teams can be assigned to work on difficult tasks right from the start. From a new member’s perspective, finding the right person to get advice from may prove to be both time-consuming and challenging, since they might not be acquainted with other team members.

In our work we attempt to solve this problem by gathering and analyzing information from version control and issue tracking systems and presenting reputable experts. The end result of the thesis will recommend experts on certain parts of the source code, so that new members will spend less time finding them and more time discussing the problem. The reputation of the experts will be based on their activity in the issue tracking system and in the version control system where code reviews take place.

In Proc. of Spring 2016 PeWe Workshop, pp. 13-14

Computer controlled by voice

Martin Mokrý
bachelor study, supervised by Jakub Ševcech

Abstract. In general, people control the computer in the way that is the most effective, but also the most natural for them. For a long time the number one choice has been the keyboard and mouse. As a result of the continual growth of computing power, it is now possible to control the computer in a more comfortable way: by human voice.

There are many applications that can process whole words as commands. We introduce a method that accepts commands in the form of short sounds generated by the vocal tract or hands. It may not look like a more comfortable way, but it is surely a more effective one. Our goal is to implement a module for the transformation of sound into a set of features and for the classification of commands in real time using classifiers such as Naïve Bayes or k-NN. We believe that even a small number of features (10-15) can hold enough information for accurate command determination. A first series of experiments carried out on the ESC-50 Dataset for Environmental Sound Classification showed promising results.
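
The classification step can be illustrated with a minimal k-NN sketch in Python. It assumes that each sound has already been transformed into a feature vector; the two-dimensional vectors and the labels `click` and `whistle` below are hypothetical placeholders (the abstract mentions 10-15 features but does not name them), and the function `knn_classify` is not the author's module.

```python
from collections import Counter
from math import dist

def knn_classify(sample, training, k=3):
    """Return the majority label among the k nearest training vectors."""
    nearest = sorted(training, key=lambda item: dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (feature vector, command label) training pairs.
training = [
    ((0.10, 0.20), "click"),
    ((0.20, 0.10), "click"),
    ((0.15, 0.15), "click"),
    ((0.90, 0.80), "whistle"),
    ((0.80, 0.90), "whistle"),
    ((0.85, 0.85), "whistle"),
]

first_command = knn_classify((0.12, 0.18), training)
second_command = knn_classify((0.88, 0.82), training)
```

With so few features per sample, a distance computation over a small training set is cheap, which matters for the real-time requirement.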

In Proc. of Spring 2016 PeWe Workshop, pp. 15-16

Software Modelling Support for Small Teams

Martin Olejár
bachelor study, supervised by Karol Rástočný

Abstract. The software modelling process is one of the crucial parts of software development. The creation of a high-quality software model containing as few defects as possible is a prerequisite for a successful project. Besides large teams, small teams also participate in software modelling and have to face many problems during model creation. Small teams need specific support to solve possible problems.

In our thesis, we analyse the work of small teams learning the basics of software modelling. For detailed identification of their problems, we have analysed the work small teams produced in the course Principles of Software Engineering. We focus our attention primarily on model synchronization and secondarily on the detection, identification and correction of defects in models, and we offer an overview of existing algorithms and solutions. Facilitating parallel collaboration during model creation, together with high-quality model verification and validation, can considerably raise the work efficiency of small teams and prevent defects in source code created on the basis of the model.

The main goal of this thesis is to develop an optimal method for model synchronization and implement this method as an add-in for the well-known tool Enterprise Architect. We are currently finishing its implementation and would like to evaluate it by simulating the work of a small team or by deploying the implemented add-in to students of the course Principles of Software Engineering.

In Proc. of Spring 2016 PeWe Workshop, pp. 19-20

Software Modelling Support for Small Teams

Jakub Ondik
bachelor study, supervised by Karol Rástočný

Abstract. Small teams face various problems during the software modelling process. These problems include model synchronisation, authorship assessment of model parts and fast defect identification and correction. Solving these problems can raise the efficiency of small teams and the overall quality of their projects.

Our thesis analyses defects in software models created by small teams whose members have little to no experience with software modelling. We focus on the learning process of small team members, and we track and visualise their progress and the model defects they make. We propose a method for real-time defect identification and correction. This method is implemented as an extension for the modelling tool Enterprise Architect. We also propose a web service suited for processing data collected by tracking the progress and defects of team members, and a module that allows tutors to attach notes either to an element in a diagram or to a diagram itself.

In Proc. of Spring 2016 PeWe Workshop, pp. 21-22

Universal Tool to Assign Badges in Online Communities

Michal Randák
bachelor study, supervised by Ivan Srba

Abstract. Nowadays, it is common to use game elements and mechanics in many different software systems. Most of all, they are used in online communities like Stack Overflow or Khan Academy, but their use is much wider. Badges, reputation, and other elements help motivate users to use the system and thus increase their activity. The goal of our bachelor thesis is to create a universal tool that can effectively evaluate which badges should be granted to users based on their activity in the system and on predefined rules. The communication between the system and our tool will take place through a simple REST API. The tool will be implemented as a web service in Java. The effectiveness of this tool will be evaluated using a dataset from an existing system (e.g. Askalot) or randomly generated events.
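
A rule-based badge evaluation of this kind can be sketched as follows. The sketch is in Python for brevity (the tool itself is to be implemented in Java) and rests on assumptions: the rule format (badge name mapped to an event type and a minimum count), the badge names, the event types and the function `badges_for` are all hypothetical, and the REST layer is omitted.

```python
from collections import Counter

# Hypothetical predefined rules: badge name -> (event type, minimum count).
RULES = {
    "Curious": ("question_asked", 3),
    "Helper": ("answer_posted", 2),
}

def badges_for(events, rules=RULES):
    """Return {user: sorted badge names} for a list of (user, event_type) pairs."""
    counts = Counter(events)                 # occurrences of each (user, type)
    users = {user for user, _ in events}
    return {
        user: sorted(
            badge
            for badge, (event_type, minimum) in rules.items()
            if counts[(user, event_type)] >= minimum
        )
        for user in users
    }

# Hypothetical activity log received from the host system.
events = (
    [("anna", "question_asked")] * 3
    + [("anna", "answer_posted")]
    + [("boris", "answer_posted")] * 2
)
granted = badges_for(events)
```

Keeping the rules as data rather than code is one way to make such a tool reusable across systems: a different community only needs a different rule set.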

In Proc. of Spring 2016 PeWe Workshop, pp. 23-24

Similarities in source code

Marek Roštár
bachelor study, supervised by Michal Kompan

Abstract. With the increasing popularity of different programming languages, the problem of finding similar parts of source code across different programming languages is growing. Finding such parts of code can be useful for improving source code quality or identifying potential plagiarism. Today there are multiple ways of identifying similarities in source code or text documents. The best known are text/token-based methods, which can be strengthened with more thorough preprocessing of the given source code. In this work we focus mainly on identifying similarities using abstract syntax trees. We also explore the possibilities of applying different levels of preprocessing to source code and their benefits from the performance point of view.
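
As a small illustration of AST-based comparison, the sketch below uses Python's standard `ast` module to reduce two fragments to their sequences of node types and compares the sequences with `difflib.SequenceMatcher`. It is an assumption-laden simplification, not the method of the thesis: it handles only Python code, whereas the work targets similarities across languages, and the helper names `node_types` and `ast_similarity` are hypothetical.

```python
import ast
from difflib import SequenceMatcher

def node_types(source):
    """Return the AST node type names of `source` in breadth-first order."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

def ast_similarity(source_a, source_b):
    """Ratio in [0, 1] describing how similar the two node-type sequences are."""
    return SequenceMatcher(None, node_types(source_a), node_types(source_b)).ratio()

original = "def add(x, y):\n    return x + y\n"
renamed = "def plus(first, second):\n    return first + second\n"
different = "for i in range(3):\n    print(i)\n"

renamed_score = ast_similarity(original, renamed)
different_score = ast_similarity(original, different)
```

Since identifier names are not part of the node types, renaming functions and variables leaves the sequence unchanged, which is one reason tree-based methods resist simple obfuscation better than plain text comparison.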

In Proc. of Spring 2016 PeWe Workshop, pp. 25-26

Stream Data Processing

Jakub Ševcech
doctoral study, supervised by Name Surname

Abstract. In the past years, interest in the domain of stream data processing has been building up. Methods for stream data processing are used in every domain where results have to be provided in real time or under tight time constraints.

In our work, we focus on the processing of repeating data streams, where recurring sequences can be used to reduce the stream size and to enable the application of various text-processing methods by transforming real-valued data streams into streams of symbols. We study the possibility of transforming metrics running on data streams into sequences of symbols and we explore methods for their analysis. The applications we focus on are stream state classification, anomaly detection and forecasting in domains such as electrical energy consumption or other production/consumption processes. Our paramount goal is to facilitate parallel analysis of multiple data streams and multiple metrics running over them.
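
One simple way to turn a real-valued stream into symbols is sketched below in Python. It is an illustration only and not necessarily the transformation studied in this work: it averages fixed-size, non-overlapping windows and maps each mean to a letter using fixed breakpoints. The function `symbolize`, the window size, the breakpoints and the sample readings are all assumptions.

```python
def symbolize(values, window, breakpoints, alphabet="abc"):
    """Map each non-overlapping window of `values` to one symbol.

    The symbol index is the number of breakpoints lying below the window
    mean, so len(alphabet) must be at least len(breakpoints) + 1.
    """
    symbols = []
    for start in range(0, len(values) - window + 1, window):
        chunk = values[start:start + window]
        mean = sum(chunk) / window
        level = sum(mean > breakpoint for breakpoint in breakpoints)
        symbols.append(alphabet[level])
    return "".join(symbols)

# Hypothetical consumption readings with a repeating low/medium/high shape.
readings = [1, 1, 5, 5, 9, 9, 5, 5, 1, 1, 5, 5, 9, 9]
word = symbolize(readings, window=2, breakpoints=[3, 7])
```

Once the stream is a string, recurring shapes become repeated substrings (here `abc` occurs twice in `word`), so text-processing techniques can be applied to it.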

In Proc. of Spring 2016 PeWe Workshop, pp. 27-28

Prediction of User Behavior in a Web Application of the Bank

Peter Truchan
master study, supervised by Mária Bieliková

Abstract. We propose possibilities for measuring and predicting user behavior and for evaluating measurable metrics that characterize users. We chose machine learning algorithms that suited our needs and, with the help of these algorithms, designed a model for the prediction of user behavior. The input data contain actions and activities performed by users in a web application of the bank.

We measured more than 130 000 users over a period of three years. We enriched the measured data with data from an internal database, which contains information about the sex and age of registered users. We used singular value decomposition for dimensionality reduction. Then we used clustering and sequence algorithms to build the prediction model. The main contribution of this paper is the proposal of a method for building customer segments that can work with data from different sources in different formats. The only requirement is to build a matrix with users’ characteristics, visited pages, and the sequence of their actions.

In Proc. of Spring 2016 PeWe Workshop, pp. 29-30

Application of Machine Learning for Sequential Data

Peter Uherek
master study, supervised by Michal Barla

Abstract. The thesis deals with the problem of using sequential data in machine learning methods, especially in recurrent neural networks. We work with user data that originate from the paywall of a foreign news portal. The aim of this thesis is to propose and design a method that would verify several hypotheses.

We want to research the possibilities of making predictions according to the user’s web browsing history. We focus on different approaches to using sequential data for predictions. We have two main objects of interest. The first is the prediction of users’ payments for articles and the second is the prediction of the popularity of articles.

Our goal is to find possibilities of using recurrent neural networks with user data from a news portal and to compare various types of neural network architecture. We designed several models of recurrent neural networks using the Long Short-Term Memory architecture. The Long Short-Term Memory architecture consists of several memory cells, which can help when there are very long time lags of unknown size between important events. We also made use of a simpler variant of the Long Short-Term Memory architecture, the Gated Recurrent Unit architecture. We have carried out several experiments to evaluate our solution.

In Proc. of Spring 2016 PeWe Workshop, pp. 31-32

Analysis of human work activity on the Web

Ľubomír Vnenk
master study, supervised by Name Surname

Abstract. It is hard to focus on a single task when using a computer if there are so many opportunities to do something more fun or, even worse, if there are many applications with notification systems. It is even more important to stay focused and work productively at work, because a company may lose money. Employees admit to browsing personal pages for 2 hours each day.

In our research we focus on user activity analysis in order to identify moments when the user is rapidly losing productivity. We also analyze users’ activity in order to classify it into one of twelve identified classes representing the purpose of the activity (communication, social networks, etc.).

In our experiment we achieved a 63% F-score in the task of classifying activity by purpose. Combining a Naïve Bayes classifier with a k-nearest neighbor classifier significantly improved accuracy. We achieved a 73% F-score in the task of classifying activity into productive and non-productive classes with a Naïve Bayes classifier. Combining classifiers did not help in this case.
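
For readers unfamiliar with the metric, the sketch below shows how an F-score can be computed for one class in Python. It assumes the balanced F1 variant (the harmonic mean of precision and recall), which the abstract does not specify. The labels and predictions are hypothetical and unrelated to the reported 63% and 73% results.

```python
def f1_score(true_labels, predicted_labels, positive):
    """Balanced F-score (F1) of the class `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    false_pos = sum(t != positive and p == positive for t, p in pairs)
    false_neg = sum(t == positive and p != positive for t, p in pairs)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground truth and classifier output for five activities.
truth = ["productive", "productive", "other", "other", "productive"]
predicted = ["productive", "other", "other", "productive", "productive"]
score = f1_score(truth, predicted, positive="productive")
```

In this toy data there are two true positives, one false positive and one false negative, so precision and recall are both 2/3 and so is the F-score.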
