Students’ Research Works – Spring 2016: Text Processing and Search (PeWe.Text)


Students’ Projects

moderateIT

Jakub Adam, Monika Filipčiková, Andrej Švec, Filip Vozár
bachelor study, supervised by Jakub Šimko

Abstract. As the Internet grows, people have more opportunities to stay connected through a wide range of devices: computers, smartphones and tablets. The freedom of communication knows fewer and fewer borders. People enjoy a sense of anonymity and thus become more open. Sometimes they are so open that they behave impolitely and rudely towards others. Communication and user content on the Web have become largely unregulated. Comments and open discussions easily turn into a place of hate and offensive behaviour. Discussions contain vulgarisms, ad hominem attacks, offensive words and many other forms of malicious behaviour. All this leads either to discussions being closed, to companies paying extra money to manage communication on their web sites, or to social degradation of the participants. We believe that communication has to be managed, not abandoned.

Our goal is to eliminate poisonous and impolite comments in online discussions. Dealing with this issue costs companies, such as news portals, considerable sums of money. Our automated solution helps human moderators to quickly detect problematic comments and to keep the discussion focused on a given topic. It also helps prevent unnecessary conflicts between discussion participants.

Since there are many aspects that can describe a single comment, we use multiple detectors, each rating a comment with respect to one particular aspect, and then combine their outputs. The result is a single number describing the likelihood that a comment is inappropriate. Examples of such detectors include: detection of the correlation between a comment and the article using RAKE (Rapid Automatic Keyword Extraction), TF-IDF and Elasticsearch; detection of swear words; and estimation of the likelihood of an inappropriate comment based on the author's past behaviour. We also use statistics and machine learning to tune the thresholds and parameters of the individual detectors.
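To illustrate the combination step, here is a minimal Python sketch of how per-aspect detector scores could be merged into one inappropriateness likelihood. The detector functions, lexicon and weights are invented for illustration and are not the actual moderateIT components:

```python
# Illustrative sketch only: combining per-aspect detector scores into one
# inappropriateness score. Detector names, lexicon and weights are hypothetical.

def swear_word_score(comment: str) -> float:
    """Fraction of tokens found in a (placeholder) swear-word lexicon."""
    lexicon = {"idiot", "stupid"}  # placeholder lexicon
    tokens = comment.lower().split()
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)

def off_topic_score(comment: str, article_keywords: set) -> float:
    """1.0 when the comment shares no keywords with the article."""
    tokens = set(comment.lower().split())
    return 0.0 if tokens & article_keywords else 1.0

def combined_score(comment: str, article_keywords: set,
                   weights=(0.7, 0.3)) -> float:
    """Weighted combination of detector outputs, kept in [0, 1]."""
    scores = (swear_word_score(comment),
              off_topic_score(comment, article_keywords))
    return sum(w * s for w, s in zip(weights, scores))

print(combined_score("you are stupid", {"election", "budget"}))
```

In practice the weights and thresholds would be tuned with the statistics and machine learning mentioned above rather than fixed by hand.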


Automatic Text-Checking for Slovak Language

Ondrej Čičkán
bachelor study, supervised by Marián Šimko

We encounter text-checking in almost every word processor, web browser and many other applications. Late detection of spelling mistakes in a curriculum vitae, book or diploma thesis can be unpleasant for the author. The purpose of a text-checking tool is to detect these errors automatically and propose corrections. It may also be useful in other programs that require correctly written input text.

Our goal is to offer a tool that automatically checks text in the Slovak language and detects the largest possible percentage of errors. We decided to use a statistical method based on a language model and an error model. These models help us choose the correct word from a list of suggested corrections for a misspelled word. This method also allows us to correct real-word errors, i.e., misspellings that happen to form valid words.
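The combination of a language model and an error model corresponds to the noisy-channel idea: among the candidate corrections, pick the one maximizing the product of the language-model probability and the error-model probability. A toy Python sketch with made-up probabilities (not our actual models):

```python
import math

# Toy sketch of the noisy-channel idea behind language/error models:
# choose the candidate c maximizing P(c) * P(w | c).
# All probabilities below are invented for illustration.

language_model = {"more": 0.02, "mare": 0.0001, "mire": 0.0002}   # P(c)
error_model = {("mroe", "more"): 0.7,                             # P(w | c)
               ("mroe", "mare"): 0.1,
               ("mroe", "mire"): 0.1}

def correct(word: str, candidates):
    def score(c):
        p_c = language_model.get(c, 1e-9)
        p_w_given_c = error_model.get((word, c), 1e-9)
        return math.log(p_c) + math.log(p_w_given_c)
    return max(candidates, key=score)

print(correct("mroe", ["more", "mare", "mire"]))  # -> "more"
```

For real-word errors the language model would condition on the surrounding context (e.g., word n-grams) rather than on single-word frequencies as in this toy example.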

Our solution is based on Korektor, an existing text-checking tool developed at Charles University in Prague by Michal Richter. We will gather Slovak text, process it, and create a language model and an error model for Slovak. We will then configure Korektor to work with these models. We are also planning to expose Korektor through our own web services.

The accuracy of our models will be evaluated on our own data and compared to the results achieved by existing tools for automatic text-checking of Slovak, such as Hunspell and the built-in spellchecker in Microsoft Word.


Sentiment Analysis in Slovak Text

Rastislav Krchňavý
bachelor study, supervised by Marián Šimko

Abstract. Social networks have been widely used in the last few years. Users do not only communicate with other users, but also discuss various topics. Our task is to determine whether a user's status (comment, tweet, review, …) is positive or negative, and to what extent.

Our solution works with the Slovak language. English and Slovak differ in many respects, for example in word inflection, double negation and diacritics. Besides these, our analyzer deals with emoticons, stop words, unnecessary punctuation and more.

We implement a sentiment analysis tool for Slovak based on the Naive Bayes algorithm. Naive Bayes calculates the probability that a given text belongs to a particular category. Our solution uses five categories (strongly positive, positive, neutral, negative, and strongly negative) and our goal is to reach accuracy similar to existing solutions for other languages.
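As a rough illustration of such a classifier (not our final implementation), the following Python sketch trains a multinomial Naive Bayes model with the five classes above on a few invented Slovak examples, using scikit-learn:

```python
# Hedged sketch: a five-class multinomial Naive Bayes sentiment classifier.
# The tiny Slovak training examples are invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["super, velmi sa mi to paci",   # strongly positive
               "je to dobre",                  # positive
               "je to v poriadku",             # neutral
               "nepaci sa mi to",              # negative
               "hrozne, uplne nanic"]          # strongly negative
train_labels = [2, 1, 0, -1, -2]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["je to celkom dobre"]))
```

A real tool would add the preprocessing steps mentioned above (emoticons, stop words, punctuation, inflection) before the bag-of-words step.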


Navigation Leads for Exploratory Search and Navigation in Digital Libraries

Róbert Móro
doctoral study, supervised by Mária Bieliková

Abstract. Although the prevalent search paradigm on the Web is keyword search, it is not very suitable for a wide range of search tasks that have exploration, learning and investigation as their goals. These tasks are the focus of exploratory search, which is characterized by ill-defined information needs of the users, is often open-ended and requires the use of different search strategies.

In our work, we focus on the domain of digital libraries of research articles, namely on the scenario of a novice researcher whose task is to explore a new domain. We propose an approach to exploratory search and navigation employing the concept of navigation leads, with which we augment the search results. Conceptually, navigation leads are important words automatically extracted from the documents in the information space. We distinguish two types of navigation leads: view navigation leads, which provide a global overview of the domain, and document navigation leads, which highlight terms (keywords) relevant in the context of a single document (search result) as well as the terms with the highest navigational value.
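As a simplified illustration of how lead candidates can be identified (the actual approach relies on domain-specific improvements described below), the following Python sketch scores candidate words with TF-IDF over a small invented document collection:

```python
# Illustrative sketch only: scoring navigation-lead candidates with TF-IDF.
# The documents are invented; the real work uses digital-library metadata.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["exploratory search in digital libraries of research articles",
        "keyword extraction and navigation in large document collections",
        "user modelling based on browsing history and interests"]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# top-3 candidate leads for the first document
row = tfidf[0].toarray().ravel()
top = row.argsort()[::-1][:3]
print([terms[i] for i in top])
```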

Our contribution is twofold. Firstly, considering the specifics of the digital library domain, we utilize data characteristic of this domain to improve keyword extraction during the identification of potential navigation lead candidates. Secondly, the proposed approach to exploratory search and navigation supports query formulation and refinement that takes into account users' previous information needs and feedback. It also supports navigation through the information space in a series of navigational steps, with the goal of improving domain sense-making and increasing the coverage and understanding of important concepts.


Keyword Extraction in Slovak

Adam Rafajdus
bachelor study, supervised by Name Surname

Natural language processing is a fast-evolving and important subfield of artificial intelligence that aims to advance the connection between computers and human language. In recent years, thanks in particular to new methods of learning distributed representations of words, such as Word2Vec, and to progress in the area of neural networks, we can improve existing processes and also create new, effective methods for natural language processing.

In this work, we propose a novel neural network architecture whose main purpose is to extract keywords from Slovak texts, exploiting useful properties of word vectors and the modern recurrent neural network architecture LSTM (Long Short-Term Memory). These keywords, as a word-level representation of the text, should be useful in subsequent stages of text processing, such as text categorisation.
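A hedged Python sketch of the general idea, an LSTM over word embeddings that tags each token as keyword or non-keyword; the dimensions and layers are illustrative and do not describe the proposed architecture exactly:

```python
# Illustrative sketch: bidirectional LSTM tagging tokens as keyword / non-keyword.
# Sizes are arbitrary; the embedding layer could be initialised with Word2Vec vectors.
import torch
import torch.nn as nn

class KeywordTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, 2)  # keyword vs. non-keyword

    def forward(self, token_ids):
        x = self.embed(token_ids)     # (batch, seq, embed_dim)
        h, _ = self.lstm(x)           # (batch, seq, 2 * hidden_dim)
        return self.out(h)            # per-token class logits

model = KeywordTagger(vocab_size=10_000)
tokens = torch.randint(0, 10_000, (1, 12))   # one sentence of 12 token ids
print(model(tokens).shape)                   # torch.Size([1, 12, 2])
```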


Modelling User Interests in Latent Feature Vector Space based on Topical Discriminativeness

Márius Šajgalík
doctoral study, supervised by Mária Bieliková

Abstract. User modelling includes modelling various characteristics such as user goals, interests, knowledge, background and much more. However, evaluating each of these characteristics can be very difficult, since every user is unique and objective evaluation of each modelled feature often requires a huge amount of training data. That requirement cannot be easily satisfied in a public research environment, where personal information is too confidential to be publicly accessible. In a common research setting, we are thus confronted with training the model on only a small sample of data, which mostly requires humans to evaluate the model manually; this is often very subjective and time-consuming.

We examine a novel approach to evaluating user interests by formulating an objective function on the quality of the model. We focus on modelling user interests in the form of keywords aggregated over web pages from the user's browsing history. By treating users as categories, we can formulate an objective function that extracts user interests represented as discriminative words, i.e., words that distinguish the user within a given community; this effectively avoids extracting words that are too generic.
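As a simplified illustration of the users-as-categories idea (not our actual objective function), the following Python sketch treats each user's browsing-history text as one document and uses TF-IDF to surface words that discriminate a user within the community, so that words common to everyone score low; the data is invented:

```python
# Illustrative sketch: users as "documents", TF-IDF as a simple stand-in for
# a discriminativeness measure. Browsing-history texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer

user_histories = {
    "user_a": "machine learning neural networks python data",
    "user_b": "cycling marathon training nutrition data",
    "user_c": "machine learning gardening tomatoes compost",
}

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(user_histories.values())
terms = vectorizer.get_feature_names_out()

for (user, _), row in zip(user_histories.items(), tfidf.toarray()):
    top = row.argsort()[::-1][:2]           # two most discriminative words
    print(user, [terms[i] for i in top])
```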
