So finally I found some time to write again. The main reason for the silence on this blog was (and still is) my bachelor thesis. Normally, when I get asked “What is your bachelor thesis about?”, I answer something like “It’s about next word prediction”. If you are interested in some details, please continue reading :-)
Today I had a first talk about my bachelor thesis. You can find the slides here.
In case you don’t have time to look at the slides, I’ll try to summarize them with a few sentences:
Next word prediction works by analyzing large text files (called corpora). Normally, one would split the text into sequences of length n and count how often each sequence occurs in the corpus. A sequence together with its count is then called an n-gram. In order to actually predict the next word, conditional probabilities are built from those n-grams. We can then pick the word with the highest probability and use it as a suggestion.
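To make this concrete, here is a tiny sketch of the counting-and-predicting idea in Python (a toy illustration, not the actual thesis code; the corpus and function names are made up):

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count all length-n sequences (n-grams) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def predict_next(tokens, context, n=3):
    """Suggest the word w maximizing P(w | context) from n-gram counts."""
    ngrams = count_ngrams(tokens, n)
    history = tuple(context[-(n - 1):])
    # P(w | history) = count(history + w) / count(history); the denominator
    # is the same for every candidate w, so comparing counts is enough.
    candidates = {g[-1]: c for g, c in ngrams.items() if g[:-1] == history}
    return max(candidates, key=candidates.get) if candidates else None

corpus = "the cat sat on the mat and the cat sat on the chair".split()
print(predict_next(corpus, ["cat", "sat"]))  # prints "on"
```

Real systems of course work on corpora with millions of words and store the counts in a database rather than in memory, but the principle is the same.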
Our new idea was to insert wildcard words into those n-grams. We call the resulting language model a “Generalized Language Model”. By adding wildcard words, the data sparsity of the language model is reduced: sequences that differ only at a wildcard position share one count, so we get usable statistics even for word combinations that are rare in the corpus.
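A rough sketch of the wildcard idea (again a simplified illustration, not the exact definition from the thesis — I'm assuming here that only inner positions get replaced by a wildcard):

```python
from collections import Counter
from itertools import combinations

WILDCARD = "*"

def generalized_ngrams(tokens, n, skips=1):
    """Count n-grams where up to `skips` inner positions are replaced by a
    wildcard; n-grams differing only at wildcard slots share one count."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        counts[tuple(gram)] += 1
        # additionally count every variant with inner positions wildcarded
        for k in range(1, skips + 1):
            for positions in combinations(range(1, n - 1), k):
                g = list(gram)
                for p in positions:
                    g[p] = WILDCARD
                counts[tuple(g)] += 1
    return counts

tokens = "the red cat and the black cat".split()
counts = generalized_ngrams(tokens, 3)
# "the red cat" and "the black cat" each occur once, but they
# collapse into the single pattern ("the", "*", "cat"):
print(counts[("the", "*", "cat")])  # prints 2
```

The plain trigram counts for “the red cat” and “the black cat” are each 1, while the generalized pattern has count 2 — that is the sparsity reduction in miniature.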
The next step was to compare the results of n-gram language models with our new approach. Those results seemed quite promising. But in order to get a solid answer to the question “Are Generalized Language Models better than n-gram language models?” we have to do some more work. We need to implement state-of-the-art smoothing techniques for language models. Smoothing methods try to estimate the probability of unseen sequences; a commonly used one is called Modified Kneser-Ney Smoothing.
This is where my bachelor thesis gets relevant. One aim of the thesis is to implement Modified Kneser-Ney Smoothing for n-gram language models as well as for Generalized Language Models. Then I want to compare the results of different types of Generalized Language Models with n-gram language models. The programming part is nearly finished but the To Do list is still quite long…
And when I’m talking about “we”, I mean René Pickhardt and me. He is the advisor of my bachelor thesis and I have been working for him as a student assistant for a year now. The idea of Generalized Language Models originates from Typology, which was originally implemented by Paul Wagner and Till Speicher. You can find more info on Typology on René’s blog.
So now you know more about the context of my bachelor thesis. Thanks for reading and feel free to ask any questions about the thesis!