Check out the whole project on GitHub: https://github.com/MayADevBe/Swedish-FrequencyList-8Sidor
This is the second of two articles about a weekend project of creating a Swedish frequency word list. This part is about the Natural Language Processing: how I cleaned the data, extracted the tokens and counted the words. Make sure to read the first article, where I explain the Data Mining part of this project.
A little bit of Theory
Natural Language Processing (NLP) is a way for the computer to process and analyze human language. The goal is to make communication between humans and machines possible. This includes, for example, communication through typed text.
NLP is not an easy task for computers. It is part of the area of Artificial Intelligence.
NLP consists of many different tasks, which are performed sequentially in a pipeline:
- Sentence Boundary Detection: Splits the text into individual sentences.
- Tokenization: Splits the sentences into individual ’tokens’ (word, punctuation).
- Part-of-Speech Tagging (POS tagging): Identifies the syntactic role of each token.
- Chunking (shallow parsing): Looks for phrases with syntactic meaning within sentences.
- Stop Words Removal: Stop words are the most frequent terms in a language. They carry little useful information for the NLP process and are therefore removed.
- Lemmatization: Finds the dictionary form (lemma) of a term. Similar to stemming, which finds the stem of a word. Used to deal with, for example, different verb tenses.
Named Entity Recognition (NER) identifies entities - which are words or phrases with a special meaning. It labels them with a concept. The most common concepts are Organizations, Locations, Persons and Time.
Additionally, entities can be mapped to task-specific vocabulary (concepts) with the help of ‘Gazetteer Lists’. A Gazetteer List is like a large dictionary that can contain tokens or chunks that are well-known entities grouped under different concepts.
Before analyzing the data, it has to be cleaned. This step is very dependent on the dataset and what the task is.
Since my goal is to make a frequency list of the most common Swedish words, I have to remove all the non-Swedish terms. This includes mainly names, numbers and also some English words.
Regarding the dataset, I have already mentioned in the previous article that the article content is obtained by extracting the text of the whole div element. However, some of the sub-elements contained text like ‘8 SIDOR/TT’, which also had to be removed.
Natural Language Processing can be very language-dependent. A few things I tried did not work very well with the Swedish corpus.
To get started, I first needed to load the dataset. I created my own dataset and saved the article data (link, headline, content) in JSON files (Read more in the first article).
To load the data, I first opened the file and then loaded the JSON objects.
For this part of the project, I used two libraries.
- The collections module implements specialized container datatypes. I used it to order the frequency dictionaries.
- The nltk library is a very popular Natural Language Processing library. It comes with models trained on large corpora and implements the tasks that I described in the ‘NLP Pipeline’ paragraph. To use it, I first needed to import it and download the data required by the used Python packages. This only needs to be done once (the lines in the comments).
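The setup looks roughly like this (a sketch; the download lines stay commented out after the first run):

```python
from collections import OrderedDict  # for ordering the frequency dictionaries
import nltk

# Download the required nltk data once, then leave these lines commented out
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")
```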
For the natural language processing, I tried to do the previously mentioned steps. However, the result was not very good. One reason might be that it is not an English dataset. Another reason, however, is a problem that appeared during the extraction step. If you look at one of the extracted articles, you can see that the extracted text from the website sometimes misses the space after the end of a sentence (example: ‘… varmt.Då …’). I found that this can lead to two words being output as a single token.
Since I only needed the tokens and no further information - like what type of token it is - I looked for another way. And I found nltk’s RegexpTokenizer, which works a lot better for my needs and dataset. It is not tripped up by the missing spaces and also gets rid of all punctuation.
The code for extracting the tokens from one article looks like the following:
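A minimal sketch, assuming the JSON fields from the first article (the exact code is in the GitHub repository):

```python
from nltk.tokenize import RegexpTokenizer

# Match runs of word characters; punctuation is dropped entirely
tokenizer = RegexpTokenizer(r"\w+")

def extract_tokens(article):
    """Return the word tokens of one article's content."""
    return tokenizer.tokenize(article["content"])

# The missing space after a sentence no longer merges two words:
print(tokenizer.tokenize("Det var varmt.Då gick vi ut."))
# ['Det', 'var', 'varmt', 'Då', 'gick', 'vi', 'ut']
```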
Now that I had the tokens, the next step was to clean and normalize them. This step is necessary so that the counting works better.
After looking at the data/tokens, to see what needed to be cleaned and normalized, I did the following steps:
- First, I loaded the necessary data.
- Next, I needed to get rid of names, since there is no reason for names to be in a vocabulary frequency list. My first try was to use NER and filter out tokens with the ‘Person’ tag. This did not work very well. Therefore, I created a list by hand with names of people and places, which you can also find in the GitHub project. Tokens on this list were filtered out.
- I made all tokens lower case, since in Swedish, like in English, most words are written in lower case, so no important information was lost.
- While looking through the data, I also noticed numbers, which I removed as well.
- I also created a small list by hand with tokens that were not caught by the other steps, like the occasional English word and some tokens specific to the data source/website. I treated this list similarly to how you would treat a stop word list.
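Put together, the cleaning steps can be sketched like this (the name and ignore lists stand in for the hand-made lists in the repository):

```python
def clean_tokens(tokens, names, ignore_list):
    """Clean and normalize tokens: drop names, lowercase, drop numbers and ignore-list entries."""
    cleaned = []
    for token in tokens:
        if token in names:        # hand-made list of people and places
            continue
        token = token.lower()     # normalize case
        if token.isdigit():       # remove numbers
            continue
        if token in ignore_list:  # hand-made stop-word-like list
            continue
        cleaned.append(token)
    return cleaned
```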
After cleaning the data, I could now count the tokens. I did this by creating a dictionary, with the token as key and the count as value. Since I was interested in the most frequent words, I sorted the dictionary before saving it to a file.
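A sketch of the counting and sorting, using OrderedDict to preserve the sorted order when writing to JSON (file name illustrative):

```python
import json
from collections import OrderedDict

def count_tokens(tokens):
    """Count each token and sort by frequency, most frequent first."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return OrderedDict(sorted(counts.items(), key=lambda item: item[1], reverse=True))

freq = count_tokens(["och", "att", "och", "i", "och"])

# Save the sorted frequencies to a JSON file
with open("freq.json", "w", encoding="utf-8") as f:
    json.dump(freq, f, ensure_ascii=False, indent=4)
```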
I calculated the frequencies in stages: first per batch of articles, then per category, and finally as one combined list.
The results of this part of the project were partial frequency lists, which can be found in the ‘articles_freq’ and ‘kategory_freq’ folders. The full final frequency list - not normalized and normalized - can be found in the root folder of the project. For ‘result.json’, the result of this project, I removed all tokens with fewer than six occurrences from the final frequency list, leaving 2875 words. This could, of course, be shortened further by only looking at the top 1000 words.
Here is an example of how the file looks:
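(Illustrative only - the words below are typical Swedish high-frequency function words, and the counts are placeholders, not the actual numbers from the project.)

```json
{
    "och": 12345,
    "att": 11000,
    "i": 9800,
    "det": 9500
}
```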
Looking at the beginning of the result list, I can see that the most common words are similar to those in other languages. They are mostly articles, prepositions, conjunctions and adverbs. This is to be expected, since these small words are used much more often than most verbs and nouns.
Conclusion and Potential Additions
This was a very fun weekend project that let me combine two of my interests as well as freshen up my knowledge of Data Mining and Natural Language Processing. The resulting frequency list is a useful starting point for learning Swedish vocabulary.
I do have some ideas that could be added to the project. I decided, however, to not do anything more for this now since I only planned it as a weekend project. Still, I wanted to give a quick overview of potential additions:
- Further normalization could be done.
- To improve the resulting list, lemmatization or stemming could be applied. This way, for example, conjugated verbs or plural nouns could be counted as one word.
- More data could be extracted from the internet to have an even bigger corpus. This could be done on the same website or, to vary the domain, other sites could be included.
- More/deeper analysis on the result and articles could be done.
- The dataset could be used for other analyses.
I hope my write-up of this project was helpful for you. If you have any questions, feel free to reach out. And check out the whole project on my GitHub: https://github.com/MayADevBe/Swedish-FrequencyList-8Sidor for the complete code and results.