Machine Learning 8 : Text Learning

by mcmaur - 14:00

TEXT LEARNING

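Raw text cannot be used directly as a feature, because documents vary in length while most learning algorithms expect a fixed number of numeric features per sample.

So we need techniques that turn text into fixed-length numeric vectors.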

Before that, we should preprocess the text:

Many very common words (such as a, and, the, etc.) carry almost no information about the content of a text. They are called stop words and should be removed before analysis.
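As a minimal sketch of stop-word removal, the snippet below filters a sentence against a tiny hand-written stop list. The names `STOP_WORDS` and `remove_stop_words` are made up for illustration; a real stop list, such as the one shipped with NLTK, is much longer.

```python
# Tiny hand-written stop list, for illustration only (an assumption,
# not the NLTK list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}


def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


tokens = remove_stop_words("The cat and the dog sat in a garden")
print(tokens)
```

Only the words that say something about the content (cat, dog, sat, garden) survive the filter.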

Many words come in multiple “flavors” (like response, unresponsive, respond, etc.), so we look for the common root (stem) of each word and use it as a substitute. The rules that map words to their roots are prepared by linguistic experts.
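To make the idea of a common root concrete, here is a deliberately naive suffix stripper. It is a toy, not the Snowball algorithm: the `SUFFIXES` list and the `naive_stem` helper are assumptions chosen only so that a few related words collapse to the same string.

```python
# Suffixes to strip, longest first. A toy list chosen for this example,
# not the rule set of a real stemmer such as Snowball.
SUFFIXES = ("iveness", "ively", "ive", "ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["responsive", "responsiveness", "responsively", "responses"]:
    print(w, "->", naive_stem(w))
```

All four words end up with the same stem, so they would count as one feature instead of four. A real stemmer handles far more cases (prefixes, irregular forms) than this sketch does.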

Algorithms:

Then we can use one of these two algorithms:

Bag of Words (BoW) counts how many times each word appears in a document. The text is represented as a bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
Term frequency-inverse document frequency (TF-IDF) is another way to judge the topic of an article by the words it contains. It measures how many times each word appears in a given document (term frequency). But because words such as “and” or “the” appear frequently in all documents, they are systematically discounted. That’s the inverse document frequency part: the more documents a word appears in, the less valuable that word is. This is intended to leave only the frequent AND distinctive words as markers.
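Both ideas can be sketched in a few lines of plain Python. The corpus and the `tf_idf` helper below are made up for this example, and the formula is the basic unsmoothed variant (term frequency times the log of the inverse document frequency). Library implementations such as sklearn's add smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, made up for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Bag of words: one word-count multiset per document; word order is lost.
bags = [Counter(doc.split()) for doc in docs]


def tf_idf(word, bag, all_bags):
    """Unsmoothed TF-IDF; assumes the word occurs in at least one document."""
    tf = bag[word] / sum(bag.values())
    docs_with_word = sum(1 for b in all_bags if word in b)
    idf = math.log(len(all_bags) / docs_with_word)
    return tf * idf


print(bags[0]["the"])                # raw count of "the" in the first document
print(tf_idf("the", bags[0], bags))  # "the" occurs in every document
print(tf_idf("mat", bags[0], bags))  # "mat" occurs only in the first document
```

The word "the" has the highest raw count but a TF-IDF score of zero, because it appears in every document, while the rarer "mat" gets a positive score.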

NLTK and sklearn:

import nltk

nltk.download()  # only the first time, to fetch the corpora

# stop words for the English language

from nltk.corpus import stopwords

sw = stopwords.words("english")

# extract the common root of a word (apply it to every word in the text)

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

stemmer.stem("responsiveness")

# count word occurrences (BoW); CountVectorizer lives in sklearn, not nltk

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

vectorizer.fit(email_list)  # learn the vocabulary

bag_of_words = vectorizer.transform(email_list)  # document-term count matrix

print(vectorizer.vocabulary_.get("great"))  # column index of the word "great"

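# use TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# fit on the training data only, then reuse the same vocabulary for the test data

feature_train_trasformed = vectorizer.fit_transform(feature_train)

feature_test_trasformed = vectorizer.transform(feature_test)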
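The snippets above assume that `email_list`, `feature_train` and `feature_test` already exist. As a self-contained sketch, the example below runs both vectorizers on a two-sentence toy corpus. The sentences and the names `count_vectorizer`, `tfidf_vectorizer` and `great_column` are made up for illustration, and it assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy stand-in for email_list (made up for this example).
email_list = [
    "hi Katie the self driving car will be late",
    "hi Sebastian the machine learning class will be great great great",
]

# Bag of words: rows are documents, columns are vocabulary words.
count_vectorizer = CountVectorizer()
bag_of_words = count_vectorizer.fit_transform(email_list)
great_column = count_vectorizer.vocabulary_.get("great")
print(bag_of_words.shape)
print(bag_of_words[1, great_column])  # how often "great" occurs in email 2

# TF-IDF with English stop words removed before weighting.
tfidf_vectorizer = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vectorizer.fit_transform(email_list)
print("the" in tfidf_vectorizer.vocabulary_)
```

Passing `stop_words="english"` lets the vectorizer drop stop words itself, using sklearn's built-in list rather than the NLTK one.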

Tags : dev, development, job, learn, learning, machine, machine learning, scikit, sklearn, software, study, uda

MauroCerbai
