A flagship application of natural language processing is the voice assistant. The ability to communicate with our devices through speech (or text) is made possible by sophisticated processing and data-analysis techniques and machine learning (in this case, highly sophisticated neural networks).
Although machine learning and NLP are not the same thing, the two are very often combined, enabling high-tech solutions.
Among the tasks of interest to the field of NLP, we can distinguish a number of text-processing procedures, such as tokenization, lemmatization or stemming, which enable further analysis of text data. We can use such data to perform syntactic analysis (based on word structure and word count) or semantic analysis (contextual, based on meaning). Such techniques lead to tasks like sentiment analysis or text classification.
The first step will be to create a project to show how the above-mentioned techniques work.
- Let's create a new directory with the project in the directory of our choice by running the commands:
mkdir nlpprocessing
cd nlpprocessing
- Then let's prepare a new project by following the commands:
npm init
touch index.js
- Let's also install the Natural library with the command:
npm install natural
Thus prepared, we can use the project to demonstrate the functionality of the Natural library.
The basic task in natural language processing is to divide the text into suitably small units of meaning, which, depending on the case, can be a word, a sentence or another unit (for example, words can be divided into even smaller parts, such as syllables, suffixes or prefixes).
Each such unit, or so-called token, has its own meaning. Tokenization is thus a process that divides a text into smaller, more meaningful pieces. Tokenization is of great importance in further language processing, such as in machine learning algorithms: properly performed tokenization affects the performance and results of the algorithm.
Text can be divided in many ways. Among the simplest is splitting the text on space characters; the result of such a process is a set of words. During such a process, additional tasks are often carried out, such as lowercasing, removing extra spacing and punctuation marks, or normalizing formatting.
A procedure that is often advantageous from the perspective of further analysis is the removal of words on a so-called stop list, i.e. the most common words of a given language (for example, the words the and this in English), which are not relevant to further processing.
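Before turning to the library, the space-based approach combined with stop-word removal can be sketched in plain JavaScript. The stop list below is a small hand-picked sample for illustration, not a complete one:

```javascript
// A minimal sketch of whitespace tokenization with stop-word removal.
// The stop list here is a small illustrative sample.
const stopWords = new Set(["the", "this", "a", "is", "to", "in"]);

function simpleTokenize(text) {
  return text
    .toLowerCase()                       // normalize letter case
    .replace(/[.,!?;:]/g, "")            // strip common punctuation marks
    .split(/\s+/)                        // split on runs of whitespace
    .filter((token) => token.length > 0);
}

function removeStopWords(tokens) {
  return tokens.filter((token) => !stopWords.has(token));
}

const tokens = simpleTokenize("The quick fox is in the garden.");
console.log(removeStopWords(tokens)); // → [ 'quick', 'fox', 'garden' ]
```

Note how lowercasing, punctuation removal and stop-word filtering all happen in one pass over the text; real tokenizers apply the same ideas with more robust rules.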
Let’s look at how to perform such a process using the Natural library. For this we will use the tokenize method of the WordTokenizer object.
- Suppose we want to split the following text into tokens:
const data = "John Doe is sometimes used to refer to a typical male in other contexts as well, in a similar manner to John Q."
- First, let's import the Natural library, adding the code:
const natural = require("natural");
- And let's create a new object of the WordTokenizer class:
const tokenizer = new natural.WordTokenizer();
- To perform the tokenization process, let's execute the tokenize method:
const tokenizedData = tokenizer.tokenize(data);
console.log(tokenizedData);
- The result is an array of tokens:
[ 'John', 'Doe', 'is', 'sometimes', 'used', 'to', 'refer', 'to', 'a', 'typical', 'male', 'in', 'other', 'contexts', 'as', 'well', 'in', 'a', 'similar', 'manner', 'to', 'John', 'Q' ]
As you can see above, punctuation marks have been removed from the text, and the text has been divided into words, based on space characters. Capital letters have been preserved.
The text prepared in this way can be successfully used in further steps.
A popular technique used to normalize data is stemming. The purpose of this process is to eliminate certain differences between words, and thus significantly reduce the set of words occurring in the text.
Eliminating these differences involves removing plural forms, inflectional endings, suffixes, verb endings, etc. The result of such a process is a set of so-called stems, which are not necessarily correct words in the context of the language. Note that this process should not be used when further analysis depends on language forms or varieties, because only the basic meaning of the word is preserved.
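To illustrate the idea, here is a toy suffix-stripping stemmer in plain JavaScript. This is a deliberately simplified sketch with a hard-coded suffix list, not the Porter algorithm that the library uses:

```javascript
// A toy suffix-stripping stemmer, for illustration only.
// Real stemmers such as the Porter algorithm apply many ordered rules;
// this sketch just removes one of a few hard-coded suffixes.
const suffixes = ["ing", "ed", "es", "s"];

function toyStem(word) {
  const lower = word.toLowerCase();
  for (const suffix of suffixes) {
    // only strip when enough of the word remains to keep a usable stem
    if (lower.endsWith(suffix) && lower.length > suffix.length + 2) {
      return lower.slice(0, -suffix.length);
    }
  }
  return lower;
}

console.log(["dogs", "walking", "typed"].map(toyStem)); // → [ 'dog', 'walk', 'typ' ]
```

Notice that 'typ' is not a correct English word, which matches the point above: stems preserve the basic meaning, not the language form.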
- So we will carry out an example of the stemming process on a previously tokenized sentence:
const stems = tokenizedData.map(natural.PorterStemmer.stem);
console.log(stems);
- The result of this process will be an array:
[ 'john', 'doe', 'is', 'sometim', 'us', 'to', 'refer', 'to', 'a', 'typic', 'male', 'in', 'other', 'context', 'as', 'well', 'in', 'a', 'similar', 'manner', 'to', 'john', 'Q' ]
Levenshtein distance is a measure used to assess the number of operations required to transform one word into another. The allowed operations are insertion, deletion and substitution of a single character. With this measure, we can quantify the difference between two selected words. Such a measure is useful for assessing the correctness of a word’s spelling and for automatic error correction.
- Let's look at how to calculate the values of such a measure. Let's assume that we want to calculate it for two words: typical and types:
const words = ["typical", "types"];
console.log(natural.LevenshteinDistance(words[0], words[1]));
The number of operations required is 4 (substitute i -> e and c -> s, and delete the letters a and l).
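For reference, this measure can be computed with the classic dynamic-programming algorithm. Here is a plain-JavaScript sketch of what the library computes internally:

```javascript
// Classic dynamic-programming implementation of Levenshtein distance.
// dist[i][j] holds the distance between the first i characters of a
// and the first j characters of b.
function levenshtein(a, b) {
  const dist = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dist[i][j] = Math.min(
        dist[i - 1][j] + 1,         // deletion
        dist[i][j - 1] + 1,         // insertion
        dist[i - 1][j - 1] + cost   // substitution (or match)
      );
    }
  }
  return dist[a.length][b.length];
}

console.log(levenshtein("typical", "types")); // → 4
```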
- Having a corpus with correct words, we can make spelling corrections and check correctness:
const corpus = ['something'];
const spellcheck = new natural.Spellcheck(corpus);
- Let's examine the correctness of such a word:
const word = 'smthing';
console.log(spellcheck.isCorrect(word));
The result, of course, will be false.
- We can also ask for spelling corrections (the second argument of getCorrections is the maximum edit distance):
console.log(spellcheck.getCorrections(word, 2));
The result will be [ 'something' ], which is the correct word.
With a large enough corpus, we can do this in a more natural way, taking into account more real cases.
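The idea behind such corpus-based correction can be sketched in plain JavaScript: suggest the corpus words that lie within a chosen edit distance of the misspelled word. The corpus below and the distance cutoff of 2 are illustrative assumptions:

```javascript
// A minimal sketch of corpus-based spelling correction: suggest the
// corpus words closest to the misspelled word by edit distance.
// Uses a memory-efficient single-row variant of the Levenshtein DP.
function editDistance(a, b) {
  const prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = prev[0]; // dist[i-1][j-1] from the previous row
    prev[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = prev[j];
      prev[j] = Math.min(
        prev[j] + 1,                            // deletion
        prev[j - 1] + 1,                        // insertion
        diag + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution (or match)
      );
      diag = tmp;
    }
  }
  return prev[b.length];
}

function suggest(word, corpus, maxDistance = 2) {
  return corpus.filter((candidate) => editDistance(word, candidate) <= maxDistance);
}

const corpus = ["something", "sometimes", "somewhere"];
console.log(suggest("smthing", corpus)); // → [ 'something' ]
```

With a real corpus of thousands of words, ranking candidates by distance (and by word frequency) gives noticeably better suggestions than a fixed cutoff alone.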
Among the tasks in natural language processing, we can distinguish the study of the semantics of the text, that is, its meaning. The Natural library allows us to examine the sentiment (positive or negative) of a selected text. We can perform this process using the SentimentAnalyzer class.
- Let's first declare our analysis class:
const Analyzer = natural.SentimentAnalyzer;
const stemmer = natural.PorterStemmer;
const analyzer = new Analyzer("English", stemmer, "afinn");
The parameters of the Analyzer constructor are the language, the stemmer object and the vocabulary (from a predefined list in the library).
- To perform the analysis, let's run the following command:
console.log(analyzer.getSentiment(["I", "love", "dogs"]));
The result is 0.66. This is a positive score, meaning that the sentence “I love dogs” has positive overtones.
Note that the score is not a probability: it is the average valence of the words according to the chosen vocabulary, so positive values indicate positive sentiment, negative values indicate negative sentiment, and values around zero are neutral.
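A minimal sketch of how such lexicon-based scoring works: each word carries a valence, and the score is the average over all tokens. The tiny lexicon below is an illustrative assumption, not the real AFINN word list:

```javascript
// A minimal sketch of AFINN-style sentiment scoring: sum the valence of
// each token (0 for unknown words) and divide by the number of tokens.
// The lexicon here is a tiny illustrative sample.
const lexicon = { love: 3, great: 3, hate: -3, awful: -3 };

function sentimentScore(tokens) {
  const total = tokens.reduce((sum, token) => sum + (lexicon[token] || 0), 0);
  return total / tokens.length; // average valence per token
}

console.log(sentimentScore(["i", "love", "dogs"])); // → 1
console.log(sentimentScore(["i", "hate", "rain"])); // → -1
```

Because unknown words score 0, longer sentences with few sentiment-bearing words drift toward a neutral score, which is exactly the behavior seen with the library.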
The task of binary classification is to divide the selected set into two classes, labeling each document accordingly (giving it a label). So let’s assume that we have a dataset in which part of the documents (in our case, sentences) is information about cats, and part about dogs. Our task is to create a classifier that examines what type of animal is mentioned in the sentence.
- The Natural library provides two classifiers, naive Bayes and logistic regression; let's use the naive Bayes classifier in our example:
const classifier = new natural.BayesClassifier();
- Let's add a collection of data. Let's also give each sentence an appropriate label:
classifier.addDocument('i love dogs', 'dog');
classifier.addDocument('dogs are the best friends', 'dog');
classifier.addDocument('i have a cat', 'cat');
classifier.addDocument('cats are amazing', 'cat');
- Next, we need to train our classifier:
classifier.train();
- Let's test the action by introducing a sentence:
console.log(classifier.classify('my dog is fast'));
The result of the classification process will be a “dog” label.
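To see what happens under the hood, here is a minimal sketch of a naive Bayes text classifier in plain JavaScript: count word frequencies per label, then pick the label with the highest log-probability for a new sentence. This is illustrative only; the library's BayesClassifier is more elaborate:

```javascript
// Crude normalization: lowercase and strip a plural "s"
// (a real classifier would use a proper stemmer here).
function normalize(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .map((w) => (w.length > 3 && w.endsWith("s") ? w.slice(0, -1) : w));
}

function trainNaiveBayes(examples) {
  const counts = {}; // counts[label][word] = word frequency under that label
  const totals = {}; // totals[label] = total word count under that label
  const docs = {};   // docs[label]   = number of training documents
  const vocab = new Set();
  for (const [text, label] of examples) {
    counts[label] = counts[label] || {};
    totals[label] = totals[label] || 0;
    docs[label] = (docs[label] || 0) + 1;
    for (const word of normalize(text)) {
      counts[label][word] = (counts[label][word] || 0) + 1;
      totals[label] += 1;
      vocab.add(word);
    }
  }
  return function classify(text) {
    let best = null;
    let bestScore = -Infinity;
    for (const label of Object.keys(counts)) {
      let score = Math.log(docs[label] / examples.length); // class prior
      for (const word of normalize(text)) {
        const freq = (counts[label][word] || 0) + 1; // Laplace smoothing
        score += Math.log(freq / (totals[label] + vocab.size));
      }
      if (score > bestScore) {
        bestScore = score;
        best = label;
      }
    }
    return best;
  };
}

const classify = trainNaiveBayes([
  ["i love dogs", "dog"],
  ["dogs are the best friends", "dog"],
  ["i have a cat", "cat"],
  ["cats are amazing", "cat"],
]);
console.log(classify("my dog is fast")); // → dog
```

The Laplace smoothing (+1) keeps unseen words from zeroing out a class, and the normalization step is why "dog" in the test sentence still matches "dogs" from the training data.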
Want to create a project based on machine learning and artificial intelligence algorithms? Looking for an experienced team of specialists?
Check out what we have to offer!