Machine Learning in NLP (Natural Language Processing) in practice – part I

I’d like to present a solution that applies Machine Learning in NLP (Natural Language Processing) to opinion mining, from a practical point of view.

Opinion mining (sometimes known as sentiment analysis or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. Generally speaking, sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event. The attitude may be a judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author or speaker), or the intended emotional communication (that is to say, the emotional effect intended by the author or interlocutor) [1].

The main goal of our system is to recognize positive or negative opinions. For instance, we can detect entries that contain an opinion, or entries that promote hatred and intolerance or contain other inappropriate content. YouTube has used ML in image analysis to detect Al-Qaeda's extremist videos; in a similar way, we can use ML to tackle the same kind of problem in text analysis. Large Internet portals can use this to protect their communities from opinions or entries that should not appear there.

Our goals:
1. Predictions have to work in real time: the system should recognise patterns within milliseconds or less.
2. The system has to support adaptive learning of new patterns without stopping the main classification process.
3. The system will evaluate content in real time as positive or negative and display the probability estimated by the trained model.



The main system sends text to a trained model (a neural network based on the logistic regression architecture), which classifies each pattern (text, opinion, document) in real time. The trained model is not subject to learning in this part of the system. Learning is carried out by a completely separate module, outside the classification process. When new patterns arrive, the adaptive system does not need to retrain the neural network on all patterns, only on the new ones. This is a big advantage of adaptive learning. When learning is finished, the latest model is saved to a file and replaces the old one.
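
The hand-over between the learning module and the classification module can be sketched like this. The file format (pickle), the function and class names, and the reload-on-change check are all assumptions made for illustration; the original system may exchange models differently.

```python
import os
import pickle
import tempfile


def save_model_atomically(model, path):
    """Learning side: write to a temp file, then swap it in with one rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(model, handle)
    # os.replace swaps the file in one step on the same filesystem, so the
    # classifier sees either the old model or the new one, never half a file.
    os.replace(tmp_path, path)


class ModelHolder:
    """Classification side: keep the model in memory, reload when the file changes."""

    def __init__(self, path):
        self.path = path
        self._signature = None
        self._model = None

    def get(self):
        stat = os.stat(self.path)
        signature = (stat.st_ino, stat.st_mtime_ns, stat.st_size)
        if signature != self._signature:
            with open(self.path, "rb") as handle:
                self._model = pickle.load(handle)
            self._signature = signature
        return self._model
```

With this layout the classification process never stops for training: it only calls `get()` before a prediction and picks up a new model once the learning module has finished writing it.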

Mathematical model 

The presented solution is based on the bag-of-words model. The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision. The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier. [2]
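
As a small illustration of the definition above, a bag of words can be built with Python's `Counter`. The two sample sentences and the helper names are made up for this sketch.

```python
from collections import Counter


def bag_of_words(text):
    """Return the multiset of lowercase words in `text`."""
    return Counter(text.lower().split())


def to_count_vector(bag, vocabulary):
    """Project a bag onto a fixed vocabulary order, giving a count vector."""
    return [bag[word] for word in vocabulary]


docs = ["the movie was good good fun", "the plot was bad"]
bags = [bag_of_words(doc) for doc in docs]

# The vocabulary is the sorted set of all words seen in the collection.
vocabulary = sorted(set().union(*bags))
vectors = [to_count_vector(bag, vocabulary) for bag in bags]

print(vocabulary)  # ['bad', 'fun', 'good', 'movie', 'plot', 'the', 'was']
print(vectors[0])  # [0, 1, 2, 1, 0, 1, 1]
print(vectors[1])  # [1, 0, 0, 0, 1, 1, 1]
```

Word order is lost ("good movie" and "movie good" give the same bag), but the count of 2 for "good" is kept.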

The bag-of-words model is a way of extracting features from text for use in machine learning algorithms. Text is converted into vectors that can be used by the machine learning algorithm. The process of converting text into numbers is called vectorization. To evaluate the importance of a word in a document or in a collection, we will apply the statistical measure TF-IDF, which stands for term frequency-inverse document frequency. The TF-IDF weight is a statistical measure used to evaluate how important a word is to a document in a collection. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the collection. More details regarding the mathematical formulas can be found on–idf
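
To make the weighting concrete, here is a minimal sketch of one common TF-IDF variant. It assumes tf = (count of the term in the document) / (document length) and idf = ln(N / df), where N is the number of documents and df the number of documents containing the term. Libraries often use smoothed variants, so their numbers may differ; the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["the", "good", "film"], ["the", "bad", "film"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this toy collection "the" and "film" occur in every document, so their weight is 0, while "good" and "bad" each get (1/3) * ln(2).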

Learning process

For the learning process we will use a data set of 50 000 records from the IMDb repository. Thanks to the out-of-core learning technique, our solution can be applied to massive data without any problems, even without access to supercomputers. With this technique, learning is based on reading small subsets of the massive data set from the hard disk and passing them through memory to our ML algorithm.
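
The chunked reading described above can be sketched as follows. This is only an illustration: the CSV layout (a header row, then one `review,sentiment` pair per row with 0/1 labels) and the helper names are assumptions, not part of the original system.

```python
import csv
from itertools import islice


def stream_reviews(path):
    """Yield (text, label) pairs one at a time; the file is never fully loaded."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        for text, label in reader:
            yield text, int(label)


def minibatches(stream, size):
    """Group any iterable into lists of at most `size` items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

A training loop would then iterate over `minibatches(stream_reviews(path), 1000)` and feed each batch to the learning algorithm, so memory use depends on the batch size rather than on the size of the data set.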

I. Preprocessing the text of every document (tokenize function):
1) clearing the text of unnecessary characters (e.g. emoticons, extra spaces and other whitespace characters)
2) splitting the text into an array of words
3) word stemming using one of the algorithms: Porter, Snowball or Lancaster (optional step; not applicable to all languages)
4) stop-word removal: eliminating words that have no impact on the recognition process
II. Vectorization: converting text into vectors
III. Applying the Logistic Regression model (neural network), trained with stochastic gradient descent (SGD). This model is well suited to massive data because it relies on trained weights and does not need to keep all patterns in memory, unlike the kNN model, where all patterns have to be stored in memory to calculate the distance between them and the input data. The second important feature of our model is its capacity for adaptive learning. When new training data arrives, we don’t have to reuse all the training data used at the beginning, only the incremental training data. The trained model is saved to disk and then sent to the classification module, which loads it into memory; thanks to this we avoid re-learning.
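
Steps I to III can be sketched in plain Python as follows. Several things here are assumptions made for illustration: the tiny stop-word list, the hashing trick used for vectorization (raw counts rather than TF-IDF weights, so that no vocabulary has to be kept in memory), the learning rate, and all names. Stemming is left out to keep the example short.

```python
import math
import re
import zlib

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "is", "it", "of", "the", "this", "to", "was"}


def tokenize(text):
    """Step I: clean the text, split it into words, drop stop words."""
    text = re.sub(r"<[^>]*>", " ", text.lower())  # remove HTML-like tags
    words = re.findall(r"[a-z]+", text)           # drops emoticons, digits, punctuation
    return [word for word in words if word not in STOP_WORDS]


def vectorize(tokens, n_features):
    """Step II: hash each token to an index; returns a sparse {index: count}."""
    vector = {}
    for token in tokens:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] = vector.get(index, 0) + 1
    return vector


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLogisticRegression:
    """Step III: logistic regression updated one example at a time (SGD)."""

    def __init__(self, n_features=2 ** 18, learning_rate=0.1):
        self.n_features = n_features
        self.learning_rate = learning_rate
        self.weights = [0.0] * n_features
        self.bias = 0.0

    def _score(self, vector):
        return self.bias + sum(self.weights[i] * v for i, v in vector.items())

    def predict_proba(self, text):
        """Estimated probability that `text` is a positive opinion."""
        vector = vectorize(tokenize(text), self.n_features)
        return sigmoid(self._score(vector))

    def partial_fit(self, batch):
        """Update the weights from a batch of (text, label) pairs, label in {0, 1}."""
        for text, label in batch:
            vector = vectorize(tokenize(text), self.n_features)
            error = sigmoid(self._score(vector)) - label  # gradient of the log loss
            for i, v in vector.items():
                self.weights[i] -= self.learning_rate * error * v
            self.bias -= self.learning_rate * error
```

Only the weight vector lives in memory, never the training patterns. Calling `partial_fit` again later with just the new batch updates the existing weights, which is the incremental training described above.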

The out-of-core learning technique is great for coping with the limited memory of an ordinary computer relative to a massive training set, and also for learning time: in this case (the IMDb set), learning on a regular computer took less than one minute!

To be continued...