Students’ Research Works – Spring 2017: Text Processing and Search (PeWe.Text)

Ondrej Čičkán: Comment classification in community question answering
Jakub Gedera: Automatic Context-Based Text Reconstruction for Slovak
Michal Hucko: Clustering and Classification of Student’s Answers to Questions
Samuel Pecár: Automatic Taxonomy Extraction
Branislav Pecher, Michal Kováčik, Jozef Mláka, Pavol Ondrejka: Development of Innovative Application in International Competition
Matúš Pikuliak: Transfer Learning between Languages for Sentiment Analysis
Andrej Vítek: Online support solution for educational exercises
Rastislav Krchňavý: Aspect Based Sentiment Analysis

Comment classification in community question answering

Ondrej Čičkán
master study, supervised by Marián Šimko

Abstract. Community question answering (cQA) portals, such as StackOverflow, have been gaining popularity in recent years. These portals are often unmoderated, and it is difficult to find the answers relevant to your problem in a long question thread.

In our work we propose and implement a system for automatic classification of relevant answers to a given question. First, we look at recent work on methods for measuring text similarity. Then we analyse methods applied directly to the cQA problem. It has been shown that the cQA environment gives us more information than text similarity alone: we can use heuristic rules or analyse the relationships between answers that emerge during the discussion.
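
As an illustration of the text-similarity baseline mentioned above, the following minimal sketch ranks the comments of a thread by bag-of-words cosine similarity to the question. It is not the system described in the abstract; the function names and the example thread are assumptions made for illustration, and a real classifier would add the heuristic and thread-level features discussed above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_comments(question, comments):
    """Return (score, comment) pairs sorted from most to least similar."""
    q_vec = Counter(tokenize(question))
    scored = [(cosine_similarity(q_vec, Counter(tokenize(c))), c) for c in comments]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical thread, invented for illustration only.
question = "How do I reverse a list in Python?"
comments = [
    "You can reverse a list with the reversed function or list slicing.",
    "Thanks, this helped me a lot!",
    "In Python, a list has a reverse method that works in place.",
]
ranked = rank_comments(question, comments)
for score, comment in ranked:
    print(f"{score:.2f}  {comment}")
```

In this toy thread the thank-you comment shares almost no vocabulary with the question and therefore lands at the bottom of the ranking.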

The dataset for training and testing the models is provided by the International Workshop on Semantic Evaluation (SemEval). This shared dataset gives us the opportunity to evaluate our solution and compare it with other solutions published at SemEval.

Automatic Context-Based Text Reconstruction for Slovak

Jakub Gedera
master study, supervised by Marián Šimko

Abstract. Many journals use sentiment analysis to detect misconduct in discussions. The problem is that informal language dominates on the Internet, which complicates the task of sentiment analysis. A typical feature of posts is that users include emoticons, which have a strong impact on sentiment analysis. Many negative posts include funny emoticons, and vice versa, which affects the accuracy of the result. People fairly confidently assume that they can correctly identify emotions in text messages. Experiments from Chatham University found that this assumption is misleading.

The aim of our work is to reconstruct and normalize the input text. We intend to determine the emotion expressed by the text and replace emoticons whose meaning differs from that emotion. Detecting emotion from text is a relatively new classification task. To solve this problem, we use an emotion detection model. We consider Ekman’s six emotion classes (joy, sadness, anger, disgust, fear, surprise).

Finally, we plan to compare the performance of a sentiment analyzer on posts before reconstruction and after applying our method, which replaces emoticons in a post based on the emotion of the post.
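
The replacement step can be sketched as follows. This is only an illustration under stated assumptions: the emoticon tables are invented, they cover just three of the six classes, and `detect_emotion` is a keyword stand-in for the trained emotion detection model, which the abstract does not describe.

```python
import re

# Hypothetical emoticon-to-emotion table (assumption, not from the source).
EMOTICON_EMOTION = {":)": "joy", ":D": "joy", ":(": "sadness", ">:(": "anger"}
# Hypothetical canonical emoticon for each covered emotion.
EMOTION_EMOTICON = {"joy": ":)", "sadness": ":(", "anger": ">:("}
# Stand-in keyword list replacing a real emotion detection model.
KEYWORD_EMOTION = {"hate": "anger", "angry": "anger", "sad": "sadness",
                   "happy": "joy", "love": "joy"}

# Longest emoticons first, so that ">:(" is matched before ":(".
EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_EMOTION, key=len, reverse=True))
)


def detect_emotion(text):
    """Return the emotion of the first known keyword, or None if there is none."""
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in KEYWORD_EMOTION:
            return KEYWORD_EMOTION[word]
    return None


def reconstruct(text):
    """Replace emoticons that contradict the emotion detected in the words."""
    emotion = detect_emotion(EMOTICON_RE.sub(" ", text))
    if emotion is None:
        return text  # nothing detected, leave the post untouched

    def fix(match):
        emoticon = match.group(0)
        if EMOTICON_EMOTION[emoticon] == emotion:
            return emoticon
        # Fall back to the original emoticon if the emotion has no entry.
        return EMOTION_EMOTICON.get(emotion, emoticon)

    return EMOTICON_RE.sub(fix, text)


print(reconstruct("I hate this article :D"))
```

Posts whose emoticons already agree with the detected emotion, or in which no emotion is detected, pass through unchanged.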

Clustering and Classification of Student’s Answers to Questions

Michal Hucko
bachelor study, supervised by Mária Bieliková

Abstract. We analyse students’ answers to questions for which it is impossible to identify a finite number of solutions. We use text clustering and classification, and we concentrate on answers written in Slovak that are just a few words long. Our research question is: How can answer clustering help? In our work we apply different methods and algorithms used in text analysis. Using these methods with real-time presentation services can lead to improved lectures.

Evaluating students’ answers is an essential part of a teacher’s work. Checking can be time-consuming when there are hundreds of records or more, and automatic methods are largely lacking. It can even be impossible to notice similar parts in answers while going through them one after another. In this case document clustering would be helpful for monitoring similarity among the test answers. Text classification can support the teacher in situations where there is enough labeled data from previous tests. We are working with documents (answers) that are just a few words long.

Our main goal is to provide the person checking the results with additional information about the common mistakes of the class. Gaining this information automatically could be very helpful for summarizing the test. Teachers would get a structured view of the answers, clustered into several groups.
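
To make the idea of grouping very short answers concrete, here is a minimal sketch that clusters answers greedily by Jaccard similarity of their word sets. The abstract does not name the algorithms actually used, so the method, the threshold and the English example answers are all assumptions chosen for illustration.

```python
import re


def tokens(answer):
    """Set of lowercase word tokens; \\w also matches Slovak diacritics."""
    return set(re.findall(r"\w+", answer.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_answers(answers, threshold=0.5):
    """Greedy single-pass clustering.

    Each answer joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new one.
    """
    clusters = []
    for answer in answers:
        for cluster in clusters:
            if jaccard(tokens(answer), tokens(cluster[0])) >= threshold:
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


# Hypothetical student answers, invented for illustration.
answers = [
    "binary search tree",
    "a binary search tree",
    "hash table",
    "the hash table",
    "linked list",
]
clusters = cluster_answers(answers)
for group in clusters:
    print(group)
```

The result depends on the order of the answers and on the threshold, which is acceptable for a sketch but is one reason to prefer an established clustering algorithm in practice.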

Aspect Based Sentiment Analysis

Rastislav Krchňavý
master study, supervised by Marián Šimko

Abstract. The amount of text data produced on social networks like Facebook is continuously increasing. Reading and evaluating these posts manually is very time-consuming, so our research is oriented towards analyzing the sentiment of these texts. Our goal is to determine the sentiment of the aspects (topics) discussed in the comment sections of social networks.

This will be useful in data and marketing analysis, for identifying positive and negative aspects of a product or finding the strong and weak points of a company. We will create a method that identifies aspects, measures their sentiment and provides results suitable for further research.

Automatic Taxonomy Extraction

Samuel Pecár
master study, supervised by Marián Šimko

Abstract. Automatic taxonomy extraction is one of the tasks of an extensive process called ontology learning. It follows tasks such as term extraction and concept discovery. This project was inspired by a task of the International Workshop on Semantic Evaluation (SemEval) called Taxonomy Extraction Evaluation. The task is concerned with the automatic extraction of hierarchical relations, and its goal is to obtain a taxonomic hierarchy of the highest possible quality.

We designed a method to extract taxonomic relations using morpho-syntactic, pattern-based and graph-based approaches. In some steps we use a vector space and other semantic knowledge such as synsets. Our goal was to obtain results comparable to those of the projects participating in the SemEval task.

We evaluated our method on the same datasets provided by the organizers of this task, using standard metrics such as precision and recall.
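
For reference, precision and recall over taxonomic relations can be computed by treating each relation as a (hyponym, hypernym) pair and comparing the extracted set with the gold standard. The sketch below assumes exactly that edge-level comparison; the example pairs are invented and the official SemEval scorer may differ in its details.

```python
def precision_recall(predicted, gold):
    """Edge-level precision and recall for sets of (hyponym, hypernym) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold taxonomy and system output, invented for illustration.
gold_edges = {("dog", "animal"), ("cat", "animal"), ("oak", "tree"), ("tree", "plant")}
predicted_edges = {("dog", "animal"), ("oak", "tree"), ("oak", "plant")}

precision, recall = precision_recall(predicted_edges, gold_edges)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three predicted edges are correct, and two of the four gold edges are found.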

Development of Innovative Application in International Competition

Branislav Pecher, Michal Kováčik, Jozef Mláka, Pavol Ondrejka
bachelor study, supervised by Jakub Šimko

Abstract. We are participating in the international Imagine Cup challenge. We decided to create an application that makes expense tracking much less time-consuming. To accomplish this, our application needs to be able to “read” bills using OCR and then categorize the individual items.

It consists of four parts. The first part is image capture and preprocessing using computer vision. This part detects the bill in the image, crops it, rotates it and fixes distortions present in the image (noise, bad lighting). In addition, it performs adaptive thresholding.
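
Adaptive thresholding compares each pixel with the brightness of its own neighbourhood rather than with one global value, which is why it copes with uneven lighting. The following pure-Python sketch shows a mean-based variant on a tiny grayscale image. The function name, the parameters `block_size` and `c`, and the pixel values are assumptions for illustration; the abstract does not say which variant or library the team uses.

```python
def adaptive_threshold(image, block_size=3, c=10):
    """Mean-based adaptive thresholding of a grayscale image (list of rows).

    A pixel becomes white (255) if it is brighter than the mean of its
    block_size x block_size neighbourhood minus c, otherwise black (0).
    Neighbourhoods are clipped at the image borders.
    """
    height, width = len(image), len(image[0])
    radius = block_size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(255 if image[y][x] > local_mean - c else 0)
        result.append(row)
    return result


# Invented example: the background gets brighter from left to right.
# The ink pixel on the right (105) is brighter than the darkest background
# pixel (100), so no single global threshold can separate ink from paper.
image = [
    [100, 110, 120, 130, 140, 150],
    [100,  50, 120, 130, 105, 150],
    [100, 110, 120, 130, 140, 150],
]
binary = adaptive_threshold(image)
for row in binary:
    print(row)
```

Only the two ink pixels in the middle row come out black; every background pixel is mapped to white despite the lighting gradient.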

The second part is Optical Character Recognition (OCR) using machine learning. It receives the image from the first part, segments it, detects individual characters and words, and then outputs the text to a text file.

The third part is semantic processing of the recognized text. Here we extract the basic information, such as the DKP, the items and the total sum. Every item will also be assigned to a category. The information will be extracted by finding and using the different patterns present on receipts.
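
A pattern-based extraction step of this kind can be sketched with regular expressions. The receipt layout below is entirely hypothetical: the `DKP:` prefix, the `SPOLU` total line, the two-space column gap and all values are assumptions made for illustration, since real receipts and OCR output vary widely and the abstract does not list the patterns used.

```python
import re
from decimal import Decimal

# Hypothetical OCR output; layout and values are invented.
RECEIPT_TEXT = """\
DKP: 1234567890123456
Mlieko 1l        0.89
Chlieb           1.20
Jablka 1kg       1.49
SPOLU            3.58
"""

DKP_RE = re.compile(r"^DKP:\s*(\d+)$")
TOTAL_RE = re.compile(r"^SPOLU\s+(\d+\.\d{2})$")
# Item name, at least two spaces, then a price at the end of the line.
ITEM_RE = re.compile(r"^(.+?)\s{2,}(\d+\.\d{2})$")


def parse_receipt(text):
    """Extract the DKP, the items and the total sum from recognized text."""
    receipt = {"dkp": None, "items": [], "total": None}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        dkp = DKP_RE.match(line)
        total = TOTAL_RE.match(line)
        item = ITEM_RE.match(line)
        # The total line also looks like an item, so it is checked first.
        if dkp:
            receipt["dkp"] = dkp.group(1)
        elif total:
            receipt["total"] = Decimal(total.group(1))
        elif item:
            receipt["items"].append((item.group(1), Decimal(item.group(2))))
    return receipt


parsed = parse_receipt(RECEIPT_TEXT)
items_sum = sum(price for _, price in parsed["items"])
print(parsed["dkp"], parsed["items"], parsed["total"], items_sum == parsed["total"])
```

Comparing the sum of the item prices with the extracted total, as in the last line, is a cheap consistency check on the OCR result.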

The fourth part is the system architecture. Here we provide an API for individual applications that require our processing pipeline. In addition, a web portal for viewing expenses will be implemented in this part.

Transfer Learning between Languages for Sentiment Analysis

Matúš Pikuliak
doctoral study, supervised by Mária Bieliková

Abstract. Machine learning approaches to Natural Language Processing tasks are notoriously demanding of hard-to-obtain annotated data. This is also typical of the sentiment analysis task, which tries to automatically predict the emotional sentiment contained in text. Only several major languages have sufficient datasets to work with current state-of-the-art techniques. Smaller languages cannot use these techniques because they simply lack the necessary data.

In our work we tackle this problem using transfer learning to transfer knowledge about sentiment from models trained in a resource-rich language to other languages. We use distributional word representations called sentiment embeddings to model sentiment for English. Then we try to transfer this knowledge to German words using multilingual word embeddings, which are also a distributional model but contain words from multiple languages. We can then use these predictions for German words in a standard sentiment analysis setting to predict the sentiment of German sentences. Experiments with this method showed that we are able to achieve results comparable with state-of-the-art German sentiment lexicons.
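
The transfer idea can be illustrated with a deliberately simplified sketch: German words receive a sentiment score from their nearest English neighbours in a shared embedding space, and a sentence is scored by averaging its known words. This is not the method from the abstract, which relies on learned sentiment embeddings; the two-dimensional vectors, the scores and the function names are all invented for illustration.

```python
import math

# Toy shared (multilingual) embedding space; all vectors are invented.
# English entries carry a known sentiment score, German entries do not.
ENGLISH = {
    "good": ((0.9, 0.1), 1.0),
    "great": ((0.8, 0.3), 1.0),
    "bad": ((-0.9, 0.2), -1.0),
    "terrible": ((-0.8, -0.1), -1.0),
}
GERMAN = {"gut": (0.85, 0.15), "schlecht": (-0.85, 0.1)}


def cosine(u, v):
    """Cosine similarity of two vectors given as tuples."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def transfer_sentiment(vector, source_lexicon, k=2):
    """Similarity-weighted mean sentiment of the k nearest source words."""
    neighbours = sorted(
        ((cosine(vector, vec), score) for vec, score in source_lexicon.values()),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbours if sim > 0)
    if total == 0:
        return 0.0
    return sum(sim * score for sim, score in neighbours if sim > 0) / total


def sentence_sentiment(sentence, lexicon):
    """Average sentiment of the words found in the lexicon (0.0 if none)."""
    scores = [lexicon[w] for w in sentence.lower().split() if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


german_lexicon = {w: transfer_sentiment(v, ENGLISH) for w, v in GERMAN.items()}
print(german_lexicon)
print(sentence_sentiment("Der Film war gut", german_lexicon))
```

The resulting `german_lexicon` plays the role of a transferred sentiment lexicon, which is the kind of resource the abstract compares against existing German lexicons.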

Online support solution for educational exercises

Andrej Vítek
bachelor study, supervised by Jozef Tvarožek

Abstract. Asking questions about the subject matter is an efficient way of measuring a student’s knowledge. However, creating questions from text is a time-consuming and difficult task that often requires domain knowledge.

In my work I am looking for ways to automate this process. My task is to create an automatic question generation system for the Slovak language, with a focus on reading comprehension. For that reason I chose to focus on the domain of Slovak epic literature.
