Students’ Research Works – Autumn 2016: Text Processing and Search (PeWe.Text)

Ondrej Čičkán: Comment classification in Community Question Answering
Jakub Gedera: Reconstruction and Normalization of Slovak Texts
Štefan Grivalský: Natural language processing using neural networks
Michal Hucko: Looking for interesting places in user’s records
Rastislav Krchňavý: Aspect-Based Sentiment Analysis
Lukáš Manduch: Active fight against spam
Róbert Móro: Navigation Leads for Exploratory Search and Navigation in Digital Libraries
Samuel Pecár: Ontology learning from text
Branislav Pecher, Michal Kováčik, Jozef Mláka, Pavol Ondrejka: Development of Innovative Application in International Competition
Matúš Pikuliak: Neural Language Models
Michal Puškáš: Advanced search and visualization
Márius Šajgalík: Modeling Text Semantics
Andrej Švec: Modelling the appropriateness of text posts
Andrej Vítek: Online support solution for educational exercises
Filip Vozár: Sentiment analysis from text about given object

Comment classification in Community Question Answering

Ondrej Čičkán
master study, supervised by Marián Šimko

Abstract. Community Question Answering (CQA) forums have become very popular in recent years. They are open to the public, so anyone can contribute to solving other people’s problems, and together they provide a large repository of knowledge. Finding the best answer to a new question in an existing repository of questions and answers would be useful not only for CQA services, which could reduce question duplication, but also for automatic question answering.

In our work, we focus on ranking the comments in a question thread in CQA forums. We will use annotated data published for SemEval 2017 (the International Workshop on Semantic Evaluation).

Reconstruction and Normalization of Slovak Texts

Jakub Gedera
master study, supervised by Marián Šimko

Abstract. Many journals use sentiment analysis to detect misconduct in discussions. The problem is that informal language dominates on the Internet, which complicates sentiment analysis. A typical feature of posts is the use of emoticons, which have a strong impact on sentiment analysis. Many negative posts include funny emoticons, and vice versa, which affects the accuracy of the result. People fairly confidently assume that they can correctly identify emotions in text messages. Experiments from Chatham University found that this confidence is misleading.

The aim of our work is to reconstruct and normalize the input text. We intend to determine the emotion expressed in the text and to replace emoticons whose meaning differs from that emotion. Detecting emotion from text is a relatively new classification task. To solve it, we use an emotion detection model. We consider Ekman’s six emotion classes (joy, sadness, anger, disgust, fear, surprise).

Finally, we plan to compare the performance of a sentiment analyzer on posts before reconstruction and after applying our method, which replaces emoticons in a post based on the emotion of the post.
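
The replacement step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the emoticon lexicon, the canonical emoticon per emotion and the function name are hypothetical, and the emotion of the post is assumed to be supplied by a separate emotion detection model that is not shown here.

```python
import re

# Hypothetical emoticon lexicon: emoticon -> Ekman emotion it usually signals.
EMOTICON_EMOTION = {
    ":)": "joy",
    ":D": "joy",
    ":(": "sadness",
    ":'(": "sadness",
    ">:(": "anger",
}

# Hypothetical canonical emoticon for each emotion this sketch can express.
CANONICAL = {"joy": ":)", "sadness": ":(", "anger": ">:("}

# Longest emoticons first, so that ">:(" is not matched as ":(".
_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_EMOTION, key=len, reverse=True))
)


def normalize_emoticons(text, text_emotion):
    """Replace emoticons whose emotion disagrees with the emotion of the text.

    `text_emotion` is assumed to come from an emotion detection model.
    If no canonical emoticon is known for it, the text is returned unchanged.
    """
    replacement = CANONICAL.get(text_emotion)
    if replacement is None:
        return text

    def fix(match):
        emoticon = match.group(0)
        if EMOTICON_EMOTION[emoticon] == text_emotion:
            return emoticon  # emoticon already agrees with the text
        return replacement

    return _PATTERN.sub(fix, text)


print(normalize_emoticons("The service was terrible :D", "anger"))
```

A sentiment analyzer could then be run on both the original and the normalized post to compare the results.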

Natural language processing using neural networks

Štefan Grivalský
bachelor study, supervised by Márius Šajgalík

Abstract. Natural language processing is a field at the intersection of computer science, artificial intelligence, and computational linguistics, which is focused on the analysis and comprehension of human (natural) language. Many research efforts in this field aim to gather knowledge about how humans comprehend and use language. This knowledge is later used to develop tools and techniques that let computer systems manipulate and use natural language to fulfil concrete tasks.

In our work, we focus on text categorization, more specifically on the language identification task using neural networks. We can divide this task into two basic parts: alphabet identification and the analysis of linguistic features. To solve this problem, we investigate the suitability of multiple currently popular neural network architectures.
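
To make the two parts more concrete, here is a minimal sketch. It is not the neural approach studied in the work. Alphabet identification is approximated with a simple rule based on Unicode character names, and the linguistic features are approximated with character trigram counts that a classifier could take as input. Both function names are assumptions made for this illustration.

```python
import unicodedata
from collections import Counter


def identify_alphabet(text):
    """Return the most frequent script among the letters of `text`, or None.

    Unicode character names begin with the script, for example
    "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER A".
    """
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts[name.split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None


def char_ngrams(text, n=3):
    """Count character n-grams, a common input feature for language identification."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


print(identify_alphabet("Dobrý deň"))
print(identify_alphabet("Привет"))
print(char_ngrams("deň").most_common(2))
```

The alphabet narrows down the candidate languages, and the n-gram counts are one possible feature set for distinguishing languages that share an alphabet.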

Looking for interesting places in user’s records

Michal Hucko
bachelor study, supervised by Mária Bieliková

Abstract. Nowadays, the Internet is full of opportunities to collect data from different sources. We can collect information about a user’s behavior from different angles: text input, mouse events and even eye movements. But we cannot analyze all of them.

In my bachelor’s thesis I analyse user records from web services, especially records collected while users were answering questions. I concentrate on longer free-text answers where there is more than one possible correct answer. Sometimes it is very hard to evaluate each of these documents manually, and it becomes impossible when we are dealing with thousands of records or more. Automating this process would be helpful.

My main goal is to help with checking these answers. I am trying to answer the question: How can answer clustering help with evaluating the content of an answer? In my work I apply different metrics and methods used mostly in text classification. My dataset consists of students’ answers collected at our university over the last years.
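
As an illustration of how clustering could reduce the checking effort, the following sketch groups answers by word overlap so that an evaluator only needs to read one representative per group. The Jaccard metric, the greedy grouping and the threshold value are assumptions chosen for brevity, not the metrics and methods evaluated in the thesis.

```python
import re


def tokens(text):
    """Lowercased set of word tokens in an answer."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_answers(answers, threshold=0.5):
    """Greedily group answers; returns lists of indices into `answers`.

    Each answer joins the first cluster whose representative (its first
    member) is at least `threshold` similar, otherwise it starts a new one.
    """
    clusters = []
    for i, answer in enumerate(answers):
        answer_tokens = tokens(answer)
        for cluster in clusters:
            representative = tokens(answers[cluster[0]])
            if jaccard(answer_tokens, representative) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


answers = [
    "A stack is a LIFO structure",
    "a stack is a lifo data structure",
    "A queue works in FIFO order",
]
print(cluster_answers(answers))
```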

Aspect-Based Sentiment Analysis

Rastislav Krchňavý
master study, supervised by Marián Šimko

Abstract. The volume of text data produced on social networks like Facebook is continuously increasing. Reading and evaluating these posts manually is very time-consuming, so our research is oriented towards analyzing the sentiment of these texts. We are working on a tool that can find aspects in texts and compute a sentiment value for each of them. This will be useful in data and marketing analysis.

Active fight against spam

Lukáš Manduch
bachelor study, supervised by Jakub Ševcech

Abstract. In today’s society full of personal computers, spammers often abuse the trust of ordinary users by sending fraudulent emails. The worst kind of such emails try to get money from trusting users.

In my work I am going to analyze common patterns in such emails and then create an automated system that will try to keep up a useless conversation with spammers for as long as possible and make their job a little bit harder.

Navigation Leads for Exploratory Search and Navigation in Digital Libraries

Róbert Móro
doctoral study, supervised by Mária Bieliková

Abstract. Although the prevalent search paradigm on the web is keyword search, it is not very suitable for a wide range of search tasks that have exploration, learning and investigation as their goals. These tasks are the focus of exploratory search, which is characterized by ill-defined information needs of the users, is often open-ended and requires the use of different search strategies.

In our work, we focus on the domain of digital libraries of research articles, namely on the scenario of a novice researcher whose task is to explore a new domain. We propose an approach to exploratory search and navigation using navigation leads, with which we augment the search results, and which serve as navigation starting points allowing users to follow a specific path by filtering only documents pertinent to the selected lead. Our main contribution is in examining the different aspects of selecting suitable navigation leads as well as the means of highlighting their information scent (their informational and navigational value) by considering various kinds of user feedback and properties of the information space of digital libraries of research articles.

Ontology learning from text

Samuel Pecár
master study, supervised by Marián Šimko

Abstract. Ontology learning from text is the extensive process of creating ontologies from text corpora. This process consists of several major subtasks, such as term extraction, concept discovery and relation learning. Currently, we focus on analyzing the state of the art and identifying the various types of approaches to taxonomy learning. Taxonomy learning is a very important part of ontology learning and can be divided into several subtasks, such as relation discovery, taxonomy construction and taxonomy cleaning.

Taxonomies are very useful tools and provide valuable input for many complex tasks like question answering and textual entailment. A task from the International Workshop on Semantic Evaluation (SemEval) is concerned with automatically extracting hierarchical relations from text corpora and subsequent taxonomy construction. The main goal of this task is the extraction of hypernym-hyponym relations; the task is not concerned with any other relation indicating subordination between terms. Our aim is to propose a method for taxonomy extraction and construction using state-of-the-art approaches from other ontology learning tasks.

Development of Innovative Application in International Competition

Branislav Pecher, Michal Kováčik, Jozef Mláka, Pavol Ondrejka
bachelor study, supervised by Jakub Šimko

Abstract. We are participating in the international Imagine Cup challenge. We decided to create an application that makes expense sharing much easier. To accomplish this, our application needs to be able to “read” bills using OCR.

The application consists of four parts. The first part is image capture and preprocessing using computer vision. This part detects the bill in the image, rotates it, crops it and fixes distortions (noise, bad lighting).

The second part is Optical Character Recognition (OCR) using machine learning. It takes the image from the first part, segments it, detects individual characters and words, and then outputs them to a text file.

The third part is semantic processing of the recognized text. Here we extract the basic information, such as DKP, items and total sum. We will also perform auto-correction and natural language processing.

The fourth part is the system architecture. Here we provide an API for individual applications that require our processing pipeline. This part also includes a demo application for “Bločková lotéria” and a clone of “Splitwise” extended with OCR, so users do not need to enter bills manually.
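
The semantic processing step (the third part) could look roughly like the sketch below, which pulls items and the total out of text already produced by OCR. The bill layout (one "name price" pair per line), the total labels and the function name are assumptions made for this example; real bills and the team's actual extraction rules may differ.

```python
import re
from decimal import Decimal

# Assumed layout: an item name followed by a price with two decimal places.
LINE = re.compile(r"^(?P<name>.+?)\s+(?P<price>\d+[.,]\d{2})$")

# Hypothetical labels that mark the line holding the total sum.
TOTAL_LABELS = {"total", "spolu", "suma"}


def parse_bill(ocr_text):
    """Return (items, total): items is a list of (name, price) pairs."""
    items, total = [], None
    for raw in ocr_text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # skip lines that do not look like "name price"
        name = match.group("name").strip()
        price = Decimal(match.group("price").replace(",", "."))
        if name.lower() in TOTAL_LABELS:
            total = price
        else:
            items.append((name, price))
    return items, total


bill_text = "Bread 1,20\nMilk 0.89\nSPOLU 2,09"
items, total = parse_bill(bill_text)
# Comparing the sum of the items with the total is one way to flag OCR errors.
print(items, total, sum(price for _, price in items) == total)
```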

Neural Language Models

Matúš Pikuliak
doctoral study, supervised by Mária Bieliková

Abstract. While solving different tasks related to natural language processing, researchers often create models of languages. Usually these models are based on statistical processing of vast text corpora. They try to capture the essence of a language by finding patterns in how its words are used. Currently the most interesting and talked-about approach is to use neural networks. These models are then called neural language models (NLMs).

Recent developments in this field make these models viable for a plethora of tasks, and they often even surpass the more traditional methods of natural language processing. NLMs are used even in the most advanced tasks such as machine translation, language generation, image captioning, sentiment analysis etc. The basic idea of NLMs is to project all the words of a language into a latent vector space using a neural network. This vector space is trained to capture semantic information from the corpus, and the captured information can later be extracted and used.

In our work, we focus on using NLMs for different tasks. We are also interested in researching these models themselves and finding out how they work and how to understand them, as they are currently incomprehensible to us because of the nature of the latent vector spaces that neural networks create.

Advanced search and visualization

Michal Puškáš
bachelor study, supervised by Michal Kompan

Abstract. Mendeley, CiteSeerX, Google Scholar and many other freely accessible web search engines are basically digital libraries providing access to academic papers. They are places where people, mainly researchers, can find useful papers without much effort. However, users often encounter difficulties caused by bad interfaces and are forced to perform routine actions manually. Also, the visual presentation of the returned content often degrades the user experience and the actual value of the information.

My goal is to analyze the information available in the library system of our university. First, we are going to explore current trends in well-known academic search engines. Then we need to identify the main pros and cons of different search techniques. The next step is to design an advanced search engine and filter that will help people find relevant information quickly and easily. The final step is to design and create a visually attractive presentation of our search results.

Modeling Text Semantics

Márius Šajgalík
doctoral study, supervised by Mária Bieliková

Abstract. In natural language processing, words are often treated as atomic units. However, we need to recognize similarities between words to enhance machine understanding of a text. We can model relations between words either directly, using an ontology, or indirectly, by modeling word features. While the first approach is strictly explicit and easily presentable, the second approach can represent word relations in a distributed manner across multiple features, which makes it more memory efficient. At the same time, we can use latent features as coordinates of a word vector, which enables us to use vector operations on words, such as addition or measuring vector similarity. Thus, we can easily compose the joint meaning of short phrases and retrieve the most similar words for any feature vector or dictionary word.

In our work, we focus on modeling discriminative representations of text documents. We explore two approaches. In the first one, we extract key concepts by combining an ontology with PageRank. In the second approach, we extract discriminative keywords by combining word vectors with discriminative frequency statistics. It turns out that a discriminative representation is helpful in multiple domains and can even be used for modeling discriminative representations of users based on their personal document collections.
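
The vector operations mentioned above can be illustrated with a small sketch. The three-dimensional vectors below are invented toy values, not learned embeddings, and the helper names are assumptions. Addition composes a phrase meaning, and cosine similarity retrieves the nearest dictionary word.

```python
import math

# Toy word vectors with made-up values (real models learn hundreds of features).
VECTORS = {
    "ice": [1.0, 0.0, 0.0],
    "sugar": [0.0, 1.0, 0.0],
    "sorbet": [0.7, 0.7, 0.0],
    "snow": [0.9, 0.1, 0.0],
    "car": [0.0, 0.0, 1.0],
}


def add(u, v):
    """Compose two word vectors by element-wise addition."""
    return [a + b for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vector, exclude=()):
    """Return the dictionary word whose vector is closest to `vector`."""
    candidates = (word for word in VECTORS if word not in exclude)
    return max(candidates, key=lambda word: cosine(vector, VECTORS[word]))


phrase = add(VECTORS["ice"], VECTORS["sugar"])
print(most_similar(phrase, exclude={"ice", "sugar"}))
print(most_similar(VECTORS["ice"], exclude={"ice"}))
```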

Modelling the appropriateness of text posts

Andrej Švec
bachelor study, supervised by Márius Šajgalík

Abstract. The discussions on news and discussion websites are full of negativity, trolls and hateful speech. These may cause legal problems for the website owners, who therefore want to do something about it. However, the volume of comments posted on the websites is huge, so they need an automated solution.

We want to model the appropriateness of comments in online discussions and focus on comments that are inappropriate because of their content. To find these comments we want to use deep learning with artificial neural networks, especially generative adversarial networks.

We want to test our model on real data from a discussion website or social network.

Online support solution for educational exercises

Andrej Vítek
bachelor study, supervised by Jozef Tvarožek

Abstract. Asking questions about subject matter is an efficient way of measuring a student’s knowledge. However, creating questions from text is a time-consuming and difficult task that often requires domain knowledge.

In my work I am looking for ways to automate this process. My task is to create an automatic question generation system for the Slovak language, and I focus on the domain of Slovak epic literature. This task consists of three major subtasks. First I need to process natural-language text using available tools, then choose which parts of the text are suitable. The last subtask is to generate questions from those parts.

Sentiment analysis from text about given object

Filip Vozár
bachelor study, supervised by Marián Šimko

Abstract. Online media monitoring is important for companies that wish to maintain their brand reputation. Sentiment analysis of online content is a popular way of gathering public opinions. With the increasing amount of user-generated online content, there is a need for automated methods of sentiment analysis about a monitored object, as existing methods provide sub-optimal results.

We focus on building a model that, given a text and a monitored object, can evaluate the way in which the monitored object is mentioned in the text. We evaluate our model on texts mostly from official news reports and the press.
