Machine Learning 8 : Text Learning
TEXT LEARNING
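Raw text cannot be used directly as a feature, because documents vary in length while most learning algorithms expect a fixed number of inputs.
So we need algorithms that turn text into fixed-length numeric vectors.
Before applying them, we should preprocess the text: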
- Many very common words (such as a, and, the) carry almost no information about the content of a text. They are called stop words and should be removed before analysis.
- Many words come in multiple “flavors” (like response, unresponsive, respond), so we look for the common root (stem) of each word and use it as a substitute. The stemming rules are prepared by linguistic experts.
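The two preprocessing steps above can be sketched in plain Python. This is only a toy illustration: the stop word set and the suffix list are made up for the example, and the suffix stripping is a crude stand-in for a real stemmer such as NLTK's SnowballStemmer.

```python
# Toy preprocessing: drop stop words, then strip a few common suffixes.
# STOP_WORDS and SUFFIXES are illustrative assumptions, not a real linguistic resource.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of"}
SUFFIXES = ("iveness", "ive", "es", "ing", "ed", "s", "e")  # checked in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of root."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, split on whitespace, remove stop words, stem what is left."""
    tokens = text.lower().split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The responsiveness of a responsive agent and the responses")
print(tokens)
```

Here the three variants of "response" collapse to the same root while "agent" is left alone. A real stemmer applies a much larger, carefully ordered rule set.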
Algorithms:
Then we can use one of these two algorithms:
- Bag of Words (BoW) represents a document by counting how many times each word appears in it. The text is treated as a bag (multiset) of its words, disregarding grammar and word order but keeping multiplicity.
- Term frequency-inverse document frequency (TF-IDF) is another way to judge the topic of an article by the words it contains. It counts how many times a word appears in a given document (term frequency). But because words such as “and” or “the” appear frequently in all documents, they are systematically discounted: that is the inverse document frequency part. The more documents a word appears in, the less valuable it is as a signal. The intent is to leave only the words that are both frequent and distinctive as markers.
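To make the two representations concrete, here is a small plain-Python sketch on a made-up three-document corpus. It assumes the common textbook weighting, raw count times log(N / df). Libraries such as sklearn use a smoothed and normalized variant, so their numbers will differ.

```python
import math
from collections import Counter

# Made-up toy corpus (an assumption for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]

# Bag of Words: one Counter (multiset) per document; word order is lost.
bags = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))
n_docs = len(docs)


def tf_idf(word, bag):
    """Textbook TF-IDF for a word that occurs somewhere in the corpus.

    Raw count times log(N / df). This is not sklearn's exact formula.
    """
    return bag[word] * math.log(n_docs / df[word])


print(bags[0]["the"])          # raw count of "the" in the first document
print(tf_idf("the", bags[0]))  # "the" is in every document, so log(3/3) zeroes it out
print(tf_idf("mat", bags[0]))  # "mat" is in one document only, so it scores higher
```

Although "the" is the most frequent word in the first document, its TF-IDF weight drops to zero because it appears in all three documents, while the rarer "mat" keeps a positive weight.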
NLTK and sklearn:
import nltk
nltk.download("stopwords")  # only the first time
# stop words for the English language
from nltk.corpus import stopwords
sw = stopwords.words("english")
# extract the common root (stem) of a word
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer.stem("responsiveness")
# count word occurrences (BoW); email_list is a list of strings
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(email_list)
bag_of_words = vectorizer.transform(email_list)
print(vectorizer.vocabulary_.get("great"))  # column index of "great", not its count
# use TF-IDF; fit on the training data only, then reuse for the test data
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
feature_train_transformed = vectorizer.fit_transform(feature_train)
feature_test_transformed = vectorizer.transform(feature_test)
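As a self-contained check of the TF-IDF calls above, the following sketch runs TfidfVectorizer on a tiny invented corpus. The variable names and sentences are placeholders, and stop_words="english" is an optional setting that uses sklearn's built-in English stop word list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy data standing in for feature_train / feature_test.
train_docs = [
    "the meeting is great",
    "please respond to the great offer",
    "the report is late",
]
test_docs = ["a great report"]

# stop_words="english" drops common words using sklearn's built-in list.
vectorizer = TfidfVectorizer(stop_words="english")

# Learn the vocabulary and IDF weights on the training data only...
train_matrix = vectorizer.fit_transform(train_docs)
# ...then reuse them for the test data so both share the same columns.
test_matrix = vectorizer.transform(test_docs)

print(train_matrix.shape, test_matrix.shape)
print(sorted(vectorizer.vocabulary_))
```

Calling transform (not fit_transform) on the test documents matters: words unseen during training are ignored, and the test matrix keeps the same column layout as the training matrix.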