After getting the opportunity to study in the USA next year, I had to finish my bachelor thesis one month earlier than expected. Still, I’m quite happy with the result.

Its title: Implementation of Modiﬁed Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction

The abstract:

Next word prediction is the task of suggesting the most probable word a user will type next. Current approaches are based on the empirical analysis of corpora (large text files) resulting in probability distributions over the different sequences that occur in the corpus. The resulting language models are then used for predicting the most likely next word. State-of-the-art language models are based on *n*-grams and use smoothing algorithms like modified Kneser-Ney smoothing in order to reduce the data sparsity by adjusting the probability distribution of unseen sequences. Previous research has shown that building word pairs with different distances by inserting wildcard words into the sequences can result in better predictions by further reducing data sparsity. The aim of this thesis is to formalize this novel approach and implement it by also including modified Kneser-Ney smoothing.

But to be clear: This novel approach, called Generalized Language Models, was not my idea. After working together on the Typology project, René presented this concept as a topic for my bachelor thesis and I was happy to implement it.

You can find my thesis here.

Today, I had my colloquium which focused more on the implementation of smoothed Generalized Language Models for large datasets (6GB+). You can find the slides here. Also, the slides of my Oberseminar talk which focused on the theoretical parts are here.

The next step is the evaluation of smoothed Generalized Language Models. I’m looking forward to find out if the results really are better than smoothed *n*-gram language models!

## 6 Responses to “Final Version of My Bachelor Thesis and Colloquium”

## Massimo Morelli

Hallo Martin.

I am reading your Thesis as I found that is the simplest explanation around of KN smoothing. I need to implement KN in R for a project. Could you contact me (as I do not find your email here)? I would ask you some little clarification on the procedure in Appendix C.1. Thank you in advance and congratulation for your clarity of expression.

## Martin Körner

Ciao Massimo,

thank you for the compliments! :-)

I will contact you via email. Thanks for pointing out that there was no email address on my website. I fixed that.

## Lyan Verwimp

Hi Martin,

First of all, congrats with you bachelor thesis! I’m currently trying to work with the generalized language modeling toolkit, and trying to understand how it works. Could you maybe explain to me how the continuation counts are constructed, and more specifically, what the four columns of numbers after each n-gram stand for? E.g. if I have

‘s middag is 4 2 2 0 (it’s a Dutch corpus)

in the _111_ file, what exactly do 4, 2, 2 and 0 stand for?

If possible, can you contact me by mail? Thank you in advance!

## Martin Körner

Hi Lyan,

thank you! I’ll write you an Email but here’s a short answer to your question:

The first of the four columns is the N_{1+} count needed for calculating the discount value of Kneser-Ney smoothing. The following three columns stand for the different counts that are needed for the calculation of the discount value of modified Kneser-Ney, namely N_{1}, N_{2}, and N_{3+}. I was asked this question before and it is true that this part is nearly not documented at all. One of the few hints can be found here:

https://github.com/renepickhardt/generalized-language-modeling-toolkit/blob/master/src/de/typology/splitter/Aggregator.java#L139-L142

Cheers,

Martin

## Lyan Verwimp

Hi Martin,

Do you mean that the counts are the continuation probabilities, the discount factor itself or the scaling factor? I’m a bit confused. ;-)

Lyan

## Martin Körner

Hi Lyan,

sorry for the late reply. So using your example “‘s middag is 4 2 2 0″ in _111_, we have the following situation:

_111_ means that we take fivegrams and insert wildcard words at the first and fifth position. Do you understand this part? Then, the 4 is N_{1+}(* ‘s middag is *) as it is defined in my bachelor thesis in equation 42b. The first 2 is N_{1}(* ‘s middag is *), the second 2 is N_{2}(* ‘s middag is *), and the 0 is N_{3+}(* ‘s middag is *). Hope this helps.

Martin