- Martin Borák: Detection of anti-social behavior in online communities
- Matúš Cimerman: Stream analysis of incoming events using different data analysis methods
- Martin Číž: Predicting Interest in Information Sources on the Internet using Machine Learning
- Tomáš Chovaňák: Recognition of Web user’s behavioral patterns
- Ondrej Kaššák: User Model Specialized for Session Exit Intent Prediction Task
- Michal Kren: Assignment of Educational Badges in CQA System Askalot
- Martin Měkota: Source Code Search Acknowledging Reputation of Developers
- Ľudovít Labaj: Machine learning – a system for automatic creation and testing of derived features
- Martin Olejár: Software Modelling Support for Small Teams
- Jakub Ondik: Software Modelling Support for Small Teams
- Marek Roštár: Similarities in Source Codes
- Jakub Ševcech: Stream Data Processing
- Peter Truchan: Prediction of User Behavior in a Web Application of the Bank
- Peter Uherek: Application of Machine Learning for Sequential Data
- Michal Randák: Universal Tool to Assign Badges in Online Communities
- Ľubomír Vnenk: Analysis of human work activity on the Web
Detection of anti-social behavior in online communities
master study, supervised by Ivan Srba
Abstract. Online communities have recently been gaining importance and popularity on the Web, mainly in places such as social networks, CQA systems, online games, and news or entertainment portals. Immediate communication with an unlimited number of people on an enormous number of topics has become a part of everyday life for hundreds of millions of people around the world.
Considering the huge number of members of these communities, the content of such communication is often rather diverse. There are often users who try to disrupt these communications, whether by posting pointless messages, sharing links to irrelevant sites, uncalled-for sarcasm, or by outright aggressive behavior and rude verbal attacks. The most notable type of these users are the so-called trolls, who at first pretend to be regular members of these communities, but then try to disrupt them by annoying people and starting arguments. Such behavior degrades the quality of discussion, discouraging other users from reading and contributing to it, and ultimately from visiting the portal. It can also give rise to legal issues.
In our work we focus on the analysis of antisocial behavior on the Web and on the automatic detection of trolls and their posts on portals which are home to online communities. One possible candidate is YouTube, which mediates multimedia content and is known for a high concentration of trolls in the discussion sections of videos.
Text documents clustering
bachelor study, supervised by Michal Barla
Abstract. People encounter great amounts of text every day, be it in newspapers, books or on the Internet, and we can easily be overwhelmed by it. An ideal solution to this problem is to group these texts, or documents, according to a keyword we select, and subsequently choose only those interesting to us. But the Slovak language, like many others, is rich in words which share the same spelling but mean different things. These words are called homonyms. There are a number of approaches to this problem, but almost none of them have been applied to the Slovak language.
The main goal of this study is to use word sense induction, more precisely context-based clustering, to distinguish the different senses of an ambiguous word.
We experiment with different settings, ranging from different sizes of the context window to different matrix dimensionality reductions using SVD. In this way we try to find optimal settings, so that the user can distinguish the different senses of an ambiguous word more easily. We assume that the more distinct senses a homonym has, the less we need to reduce the dimensionality of the matrix, compared to less ambiguous terms.
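As an illustration only (a toy sketch with made-up Slovak context words, not the thesis's implementation), context-based clustering with SVD reduction might look like this: each occurrence of the ambiguous word is represented by a context vector, the vectors are reduced via SVD, and the reduced vectors are clustered into senses.

```python
import numpy as np

# Toy contexts for a hypothetical ambiguous word "koruna"
# (tree crown vs. currency); illustrative data only.
contexts = [
    "strom les vetva koruna list",
    "koruna strom list les",
    "mena peniaze koruna banka",
    "banka koruna peniaze mena",
]

vocab = sorted({w for c in contexts for w in c.split() if w != "koruna"})
index = {w: i for i, w in enumerate(vocab)}

# One context vector per occurrence of the ambiguous word.
M = np.zeros((len(contexts), len(vocab)))
for row, c in enumerate(contexts):
    for w in c.split():
        if w in index:
            M[row, index[w]] += 1

# Reduce the matrix dimensionality with a truncated SVD.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
reduced = U[:, :k] * s[:k]

# A tiny k-means induces the word senses from the reduced vectors.
def kmeans(X, n_clusters, iters=20):
    centers = X[np.linspace(0, len(X) - 1, n_clusters).astype(int)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(n_clusters):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

senses = kmeans(reduced, 2)
print(senses)  # contexts of the same sense share a label
```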
Source code similarity
bachelor study, supervised by Michal Kompan
Abstract. Nowadays, people have a lot of possibilities to share information and acquire knowledge from the Internet. Because of the availability of information on the Web, it is easier to plagiarize a document or source code. This is the reason for creating systems which detect plagiarism and draw attention to it.
Several methods to detect plagiarism in source code currently exist. In this bachelor thesis we analyze the possible changes in code, which differ according to the skill of the plagiarist. We split the detection of plagiarism in source code into two levels. The first level is based on the abstract syntax tree, which separates source code into nodes that together form a syntax tree. The second level is represented by an n-gram method, comprising three different methods which detect similarity more precisely by comparing small parts of the code.
This method includes a further algorithm which compares the processed data. Our program will work with source code written in the C, Java and C# programming languages and will be implemented in Java.
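For illustration, the n-gram level could be approximated by a Jaccard overlap of token n-grams, which survives simple identifier renaming (a hypothetical sketch, not the exact methods of the thesis):

```python
# Hypothetical sketch: token n-gram overlap as a plagiarism signal.
def ngrams(tokens, n=3):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def similarity(code_a, code_b, n=3):
    a, b = ngrams(code_a.split(), n), ngrams(code_b.split(), n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)  # Jaccard index

original = "int sum = 0 ; for ( int i = 0 ; i < n ; i ++ ) sum += a [ i ] ;"
renamed = "int total = 0 ; for ( int j = 0 ; j < n ; j ++ ) total += a [ j ] ;"

print(similarity(original, original))  # identical code scores 1.0
print(similarity(original, renamed))   # renamed copy keeps a nonzero overlap
```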
Stream analysis of incoming events using different data analysis methods
master study, supervised by Jakub Ševcech
Abstract. Nowadays we can see an emerging need to analyze data as they occur. Processing and analysis of data streams is a complex task; in particular, we need to provide a low-latency and fault-tolerant solution.
In our work we focus on proposing a set of tools which will help a domain expert in the process of data analysis. The domain expert does not need detailed knowledge of analytical models. A similar approach is popular when analyzing static collections, e.g. funnel analysis. We study the possibilities of using well-known methods for static data analysis in the domain of data stream analysis. Our goal is to apply a method for data analysis in the domain of data streams, with a focus on simplicity of use of the selected method and interpretability of its results. Meeting these requirements is essential, so that domain experts will not need detailed knowledge of fields such as machine learning or statistics. We evaluate our solution using a software component implementing the chosen method.
Predicting Interest in Information Sources on the Internet using Machine Learning
master study, supervised by Michal Barla
Abstract. The most important goal of each Internet content provider is to capture the reader’s interest, so that the reader becomes a returning customer. Although it is useful to evaluate previously published articles, there is an opportunity to estimate an article’s potential to be popular before it is even published. There are many attributes that may decide whether an article has potential, including its title, content, author, source, topic, freshness and credibility.
To extract the main topics of articles we use a topic model. We look at how topics have evolved in time using a dynamic topic model. By watching the progress of a topic’s words and the popularity of the topic’s corresponding articles, we can train our model to predict the popularity of an article which belongs to that topic.
Our method consists of learning how belonging to a certain topic can influence an article’s popularity. We use the progress of the topic’s popularity to see whether it was popular in the past. For example, if we want to know how popular an article may be this week, we look at how popular its topic was the previous week. For this purpose we use SVM for regression analysis. Our dataset contains a collection of newspaper articles published on the Web, along with visits collected over a period of four months.
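As an illustrative sketch (hypothetical toy data and parameters, not the thesis's trained model), support vector regression over a topic's past popularity might look like this:

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical features: a topic's popularity in the two previous weeks;
# target: article visits in the following week (toy numbers).
X = np.array([[10, 12], [50, 55], [90, 85], [20, 25], [70, 75]])
y = np.array([14, 60, 80, 30, 80])

model = SVR(kernel="rbf", C=100)
model.fit(X, y)

# Predict popularity for an article whose topic scored 60 and 65
# in the two preceding weeks.
print(model.predict([[60, 65]]))
```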
Recognition of Web user’s behavioral patterns
master study, supervised by Ondrej Kaššák
Abstract. User behavioural patterns represent the typical repeating behaviour of website users. Identified behavioural patterns may be used to reveal bottlenecks of a website, to predict the behaviour of many users, or to reveal their intentions. Existing approaches mainly focus on finding global behaviour patterns for large groups of users.
Web logs can be transformed into transactional datasets, where each session represents a transaction. Many methods for finding frequent sequence patterns and frequent itemsets have been proposed. In our method we focus on the task of finding navigation patterns which are better suited to individual users. We transform the web session log dataset into an undirected graph, with nodes representing the individual pages of the website and the weights of inter-node links specified by a number of different attributes.
Our work conforms to the current trend of Web personalization and focus on the needs of individual users. We search not only for behaviour patterns common to the whole community of website users, but also for behaviour patterns common to smaller groups of users with similar interests, and for individual behaviour patterns of users that differ from the global ones. In the proposed method we examine the influence of the combined usage of these behaviour patterns on the prediction of the user’s next actions.
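The session-to-graph transformation could be sketched, for instance, as follows (toy sessions, with the edge weight standing in for just one of the possible attributes, the transition count):

```python
from collections import Counter

# Hypothetical session logs: each session is a sequence of visited pages.
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "blog", "products", "cart"],
    ["home", "products", "products", "cart"],
]

# Undirected graph: an edge's weight counts transitions between two pages.
edges = Counter()
for session in sessions:
    for a, b in zip(session, session[1:]):
        if a != b:  # ignore page reloads
            edges[frozenset((a, b))] += 1

for edge, weight in edges.most_common():
    print(sorted(edge), weight)
```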
User Model Specialized for Session Exit Intent Prediction Task
doctoral study, supervised by Mária Bieliková
Abstract. User behaviour on a web site can be modelled from two basic points of view. The first one is short-term behaviour, which reflects the user’s current intent, preferences, goals, etc. It captures the user’s most recent behaviour and actions, but it is typically very noisy due to the influence of the user’s current context, mood and other unpredictable conditions.
The second point of view, long-term behaviour, is characterized by more stable identification of preferences and the capturing of the user’s typical habits. On the other hand, this kind of model is not as adaptable to changes; it learns trends and hot topics of user behaviour only after a longer time period.
To be able to model user preferences and predict future behaviour, it is suitable to combine both data sources and consider them when estimating the user’s next actions. In our research, we focus on the task of user session exit intent prediction. This task requires the ability to recognize subtle changes in user behaviour in comparison with previous behaviour in different time periods, as well as the characteristics of the current user session.
Assignment of Educational Badges in CQA System Askalot
bachelor study, supervised by Ivan Srba
Abstract. CQA systems are a widely used platform for sharing knowledge and information. The property that best defines a question answering platform is its user base and, more importantly, their activity. Intensive research is being conducted to find out how to increase user engagement, productivity and motivation. One approach that has become extremely popular in recent years is gamification. In our work, we focus on badges and their application in CQA systems. More specifically, we introduce new, more complex types of badges designed specifically for the educational domain, such as badges awarded on a weekly basis, badges awarded only to a limited number of students, etc. Our goal is to motivate students to actively use our faculty CQA system Askalot in their studies. We believe this will push the quality of education at our faculty a step further. In order to evaluate the success of our badges, we implemented a tool to assign these badges in Askalot, and consequently we are conducting a live uncontrolled experiment in one selected bachelor course.
Machine learning – a system for automatic creation and testing of derived features
bachelor study, supervised by Marek Ciglan
Abstract. Creating programs using machine learning is becoming more and more attractive nowadays. It can be described as creating (learning) programs from data. This method is especially useful in areas where there is too much data for manual processing, or where it is too difficult for humans to formulate the precise rules by which the program should be governed.
The process of creating such programs is not simple at all and is accompanied by a number of problems, such as overfitting, high variance, a high number of dimensions, and others. A solution to these problems exists in the form of feature engineering: the search for and removal of irrelevant parameters and the extraction of new parameters from existing ones.
Feature engineering is specific to each area, making it more complex and time consuming. For these reasons, work on the automation of feature engineering is in progress. The aim of this project is to create a prototype for the automatic filtering, derivation and testing of parameters, and thus improve the accuracy of predictions.
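A minimal sketch of what such automatic derivation and filtering could mean (entirely hypothetical feature names and thresholds, not the project's prototype): derive candidate features as pairwise products of base features and keep only those sufficiently correlated with the target.

```python
import numpy as np

# Toy data: the target depends on the product of two base features.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
y = a * b + 0.1 * rng.normal(size=1000)

base = {"a": a, "b": b}
derived = {f"{n1}*{n2}": v1 * v2
           for n1, v1 in base.items() for n2, v2 in base.items() if n1 < n2}

# Filter: keep features whose absolute correlation with the target
# exceeds a (hypothetical) threshold.
def keep(features, target, threshold=0.3):
    return sorted(n for n, v in features.items()
                  if abs(np.corrcoef(v, target)[0, 1]) > threshold)

kept = keep({**base, **derived}, y)
print(kept)  # the derived product should survive, the raw features not
```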
Source Code Search Acknowledging Reputation of Developers
bachelor study, supervised by Eduard Kuric
Abstract. Newcomers to big software development teams can be assigned to work on difficult tasks right from the start. From a new member’s perspective, finding the right person to get advice from may prove to be both time consuming and challenging, since they might not be acquainted with the other team members.
In our work we attempt to solve this problem by gathering and analyzing information from version control and issue tracking systems and presenting reputable experts. The end result of the thesis will recommend these experts for certain parts of the source code, so that new members spend less time finding them and more time discussing the problem. The reputation of the experts will be based on their activity in the issue tracking system and in the version control system, where code reviews take place.
Computer controlled by voice
bachelor study, supervised by Jakub Ševcech
Abstract. In general, people control the computer in a way that is the most effective, but also the most natural, for them. For a long time the number one choice has been the keyboard and mouse. As a result of the continual growth of computing power, it is now possible to control a computer in a more comfortable way: with the human voice.
There are many applications which can process whole words as commands. We introduce a method which accepts commands in the form of short sounds generated by the vocal tract or hands. It may not look like a more comfortable way, but it is surely a more effective one. Our goal is to implement a module for the transformation of sound into a set of features and for the classification of commands in real time, using classifiers such as Naïve Bayes or k-NN. We believe that even a small number of features (10-15) can hold enough information for accurate command determination. The first series of experiments, realized on the ESC-50 dataset for environmental sound classification, showed promising results.
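A toy sketch of the classification step (hypothetical feature vectors standing in for real spectral features extracted from audio; not the module itself):

```python
import math

# Hypothetical training data: short sound commands described by a few
# features (toy numbers, e.g. in place of spectral features).
train = [
    ([0.9, 0.1, 0.2], "snap"),
    ([0.8, 0.2, 0.1], "snap"),
    ([0.1, 0.9, 0.8], "whistle"),
    ([0.2, 0.8, 0.9], "whistle"),
]

def classify(features, k=3):
    # k-NN: majority vote among the k closest training examples.
    nearest = sorted(train, key=lambda item: math.dist(item[0], features))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

print(classify([0.85, 0.15, 0.15]))  # → snap
```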
Software Modelling Support for Small Teams
bachelor study, supervised by Karol Rástočný
Abstract. The software modelling process is one of the crucial parts of software development. The creation of a high-quality software model containing as few defects as possible is a prerequisite for a successful project. Besides large teams, small teams also participate in software modelling and have to face many problems during model creation. Small teams need specific support to solve these problems.
In our thesis, we analyse the work of small teams learning the basics of software modelling. For a detailed identification of their problems, we have analysed the works of small teams produced in the course Principles of Software Engineering. We focus our attention primarily on model synchronization and secondarily on the detection, identification and correction of defects in models, and we offer an overview of existing algorithms and solutions. Facilitating parallel collaboration during model creation, together with high-quality model verification and validation, can considerably raise the work efficiency of small teams and prevent defects in source code created on the basis of the model.
The main goal of this thesis is to develop an optimal method for model synchronization and to implement this method as an add-in for the well-known tool Enterprise Architect. We are currently finishing its implementation, and we would like to evaluate it by simulating small team work or by deploying the implemented add-in to students of the course Principles of Software Engineering.
Software Modelling Support for Small Teams
bachelor study, supervised by Karol Rástočný
Abstract. Small teams face various problems during the software modelling process. These problems include model synchronization, authorship assessment of model parts, and fast defect identification and correction. Solving these problems can raise the efficiency of small teams and the overall quality of their projects.
Our thesis analyses the software model defects of small teams whose members have little to no experience with software modelling. We focus on the learning process of small team members, and we track and visualise their progress and the model defects they make. We propose a method for real-time defect identification and correction. This method is implemented as an extension for the modelling tool Enterprise Architect. We also propose a web service suited for processing the data collected by tracking the progress and defects of team members, and a module which allows tutors to attach notes either to an element in a diagram or to the diagram itself.
Universal Tool to Assign Badges in Online Communities
bachelor study, supervised by Ivan Srba
Abstract. Nowadays, it is common to use game elements and mechanics in many different software systems. Most notably, they are used in online communities like Stack Overflow or Khan Academy, but their use is much wider. Badges, reputation and other elements help motivate users to use the system, thus increasing their activity. The goal of our bachelor thesis is to create a universal tool that can effectively evaluate which badges should be granted to users, based on their activity in the system and predefined rules. The communication between the system and our tool will work through a simple REST API. The tool will be implemented as a web service in Java. The effectiveness of this tool will be evaluated using a dataset from an existing system (e.g. Askalot), or using randomly generated events.
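Such rule-based badge assignment might be sketched as follows (the badge names and thresholds are made up for illustration; the actual tool evaluates predefined rules over system events):

```python
# Hypothetical sketch: each badge is a predicate over a user's
# activity counters; a user earns every badge whose rule holds.
badge_rules = {
    "First Answer": lambda u: u.get("answers", 0) >= 1,
    "Curious": lambda u: u.get("questions", 0) >= 10,
    "Popular": lambda u: u.get("upvotes_received", 0) >= 25,
}

def badges_for(user_activity):
    return [name for name, rule in badge_rules.items() if rule(user_activity)]

print(badges_for({"answers": 3, "questions": 12, "upvotes_received": 5}))
# → ['First Answer', 'Curious']
```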
Similarities in source code
bachelor study, supervised by Michal Kompan
Abstract. With the increasing popularity of different programming languages, the problem of finding similar parts of source code across different programming languages is growing. Finding such parts of code can be useful for improving source code quality or identifying potential plagiarism. In the current day and age there are multiple ways of identifying similarities in source code or text documents. The best known are text/token-based methods, which can be strengthened with stronger preprocessing of the given source code. In this work we focus mainly on identifying similarities using the abstract syntax tree. We also explore the possibilities of applying different levels of preprocessing to source code, and its benefits from the performance point of view.
Stream Data Processing
doctoral study, supervised by Name Surname
Abstract. In the past years, interest in the domain of stream data processing has been building up. Methods for stream data processing are used in every domain where results have to be provided in real time or under tight time constraints.
In our work, we focus on the processing of repeating data streams, where reoccurring sequences can be used to compress the stream and to enable the application of various methods from text processing, by transforming real-valued data streams into streams of symbols. We study the possibility of transforming metrics running on data streams into sequences of symbols, and we explore methods for their analysis. The applications we focus on are stream state classification, anomaly detection and forecasting, in domains such as electrical energy consumption or other production/consumption processes. Our paramount goal is to facilitate the parallel analysis of multiple data streams and of multiple metrics running over them.
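The transformation into symbol streams can be illustrated by simple equal-width binning, in the spirit of symbolic representations such as SAX (a toy sketch, not the thesis's method):

```python
# Hypothetical sketch: turn a real-valued stream into a string of
# symbols by binning values into a small alphabet.
def symbolize(stream, alphabet="abcd", lo=0.0, hi=1.0):
    n = len(alphabet)
    width = (hi - lo) / n
    out = []
    for x in stream:
        bin_i = min(int((x - lo) / width), n - 1)
        out.append(alphabet[max(bin_i, 0)])
    return "".join(out)

# Toy normalized consumption readings, e.g. one per hour.
consumption = [0.05, 0.10, 0.60, 0.95, 0.90, 0.55, 0.10]
print(symbolize(consumption))  # → aacddca
```

Once streams are symbol strings, text-processing techniques (e.g. searching for repeated substrings) become applicable.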
Prediction of User Behavior in a Web Application of the Bank
master study, supervised by Mária Bieliková
Abstract. We propose possibilities for measuring and predicting user behavior and for evaluating measurable user characteristic metrics. We chose machine learning algorithms that suited our needs and, with their help, designed a model for the prediction of user behavior. The input data contain the actions and activities made by users in a web application of the bank.
We measured more than 130,000 users over a period of three years. We enriched the measured data with data from an internal database, which contains information about the sex and age of the registered users. We used singular value decomposition for dimensionality reduction. Then we used clustering and sequence algorithms to build the prediction model. The main contribution of this work is a proposed method for building customer segments that can work with data from different sources in different formats. The only requirement is to build a matrix with users’ characteristics, visited pages, and sequences of their actions.
Application of Machine Learning for Sequential Data
master study, supervised by Michal Barla
Abstract. The thesis deals with the problem of using sequential data in machine learning methods, especially in recurrent neural networks. We work with user data originating from the paywall of a foreign news portal. The aim of this thesis is to propose and design a method which would verify several hypotheses.
We want to research the possibilities of prediction based on the user’s web browsing history. We focus on different approaches to using sequential data for predictions. We have two main objects of interest: the first is the prediction of users’ payments for articles, and the second is the prediction of the popularity of articles.
Our goal is to find possibilities for using recurrent neural networks with user data from the news portal, and to compare various types of neural network architectures. We designed several models of recurrent neural networks using the Long Short-Term Memory architecture. The Long Short-Term Memory architecture consists of several memory cells, which can help when there are very long time lags of unknown size between important events. We also made use of a simpler variant of the Long Short-Term Memory architecture, the Gated Recurrent Unit. We have carried out several experiments to evaluate our solution.
Analysis of human work activity on the Web
master study, supervised by Name Surname
Abstract. It is hard to focus on a single task when using a computer, when there are so many opportunities to do something more entertaining or, even worse, when many applications come with a notification system. It is even more important to stay focused and work productively at work, because otherwise the company may lose money. Employees admit to browsing personal pages for two hours each day.
In our research we focus on user activity analysis in order to identify moments when the user is rapidly losing productivity. We also analyze users’ activity in order to classify it into one of twelve identified classes representing the purpose of the activity (communication, social networks, etc.).
In our experiment we achieved a 63% F-score in the task of classification by activity purpose. The cooperation of a Naïve Bayes classifier and a k-nearest neighbors classifier significantly improved accuracy. We achieved a 73% F-score in the task of classifying activity into productive and non-productive classes with the Naïve Bayes classifier. Classifier cooperation did not help in this case.
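Classifier cooperation of the kind described could, for example, take the form of probability-averaged voting between Naïve Bayes and k-NN (toy features and labels, not the experimental setup of the thesis):

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features of activity windows (e.g. minutes of focused work,
# number of distracting page visits), labelled productive (1) or not (0).
X = np.array([[30, 1], [45, 0], [40, 2], [5, 9], [3, 12], [8, 10]])
y = np.array([1, 1, 1, 0, 0, 0])

# Soft voting averages the two classifiers' predicted probabilities.
vote = VotingClassifier(
    estimators=[("nb", GaussianNB()),
                ("knn", KNeighborsClassifier(n_neighbors=3))],
    voting="soft",
)
vote.fit(X, y)
print(vote.predict([[35, 1], [4, 11]]))
```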