

EVALUATION METRICS
The simplest and most immediate metric is accuracy:
accuracy = items labeled correctly / all items


However, accuracy depends heavily on the composition of the input data, so accuracies computed on different data sets are not comparable.



To overcome this we use the confusion matrix.


Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa).


Analyzing the confusion matrix we can extract two metrics:





  • recall = out of all the items that really belong to class x, how many did we identify correctly? (related to accuracy)
true positive / ( true positive + false negative )


  • precision = once we predict x, what is the probability that it really is x?

true positive / ( true positive + false positive )
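
Sklearn (a minimal sketch, with toy labels made up here for illustration):
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # actual labels (toy data)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # predicted labels (toy data)

print(confusion_matrix(y_true, y_pred))   # counts of actual vs predicted classes
print(accuracy_score(y_true, y_pred))     # labeled correctly / all data
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(precision_score(y_true, y_pred))    # TP / (TP + FP)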

CROSS VALIDATION
One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set).
Conventional validation partitions the data set into two sets, for example 70% for training and 30% for testing.


Sklearn:
from sklearn import datasets
from sklearn import cross_validation  # old API; newer sklearn versions use sklearn.model_selection
iris = datasets.load_iris()
feature_train, feature_test, label_train, label_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)


[train]
pca.fit(feature_train)
feature_train_pca = pca.transform(feature_train)
svc.fit(feature_train_pca, label_train)


[test]
NO FIT (you want to reuse the exact transformation learned on the training set)
feature_test_pca = pca.transform(feature_test)
pred = svc.predict(feature_test_pca)


K-Fold:
In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data.



So, we can also explain it like this:
  • repeat k times
    • pick 1 block of data as the test set
    • train against the other k-1 blocks
    • test on the testing set
  • average the final results


Sklearn:
from sklearn.cross_validation import KFold  # old API; newer sklearn versions use sklearn.model_selection
kf = KFold(len(authors), 2)  # split the indices 0..len(authors)-1 into 2 folds
for train_indices, test_indices in kf:
    feature_train = [word_data[ii] for ii in train_indices]
    feature_test = [word_data[ii] for ii in test_indices]
    authors_train = [authors[ii] for ii in train_indices]
    authors_test = [authors[ii] for ii in test_indices]



GridSearchCV:
Parameter tuning is the process of selecting the values for a model's parameters that maximize the accuracy of the model.
Scikit-learn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score.
By default, the GridSearchCV's cross validation uses 3-fold KFold or StratifiedKFold depending on the situation.

Sklearn:
from sklearn import svm, grid_search  # old API; newer sklearn versions use sklearn.model_selection.GridSearchCV
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svr = svm.SVC()
clf = grid_search.GridSearchCV(svr, parameters)
clf.fit(iris.data, iris.target)
print(clf.best_params_)



PRINCIPAL COMPONENT ANALYSIS


PCA finds a new coordinate system that is obtained from the old one by translation and rotation only, and it centers the data in that system. The goal is to build a composite feature that more directly probes the underlying phenomenon (square footage + number of rooms → size).


How to determine the principal component:
The principal component of a dataset is the direction that has the largest variance (variance = spread of the data distribution), because it retains the maximum amount of original information. This is true because projecting the original data onto the longest axis of the new coordinate system keeps the data values as spread out as possible and loses the minimum possible amount of information.
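
As a minimal numpy sketch of this idea (the 2-D data below is made up for illustration): the first principal component is the top eigenvector of the covariance matrix, i.e. the direction of maximum variance.
import numpy as np

# made-up 2-D data, strongly spread along one diagonal direction
rng = np.random.RandomState(0)
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=200)])

centered = data - data.mean(axis=0)                   # translate: center the data
cov = np.cov(centered, rowvar=False)                  # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)
first_pc = eigenvectors[:, np.argmax(eigenvalues)]    # direction of largest variance
print(first_pc)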


When to use:
  • access latent features
  • dimensionality reduction
    • visualize high-dimensional data
    • reduce noise
    • use as preprocessing (reducing the input for a later algorithm, e.g. eigenfaces; see the sketch after the code below)


Sklearn:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(feature_train)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
first_pc = pca.components_[0]
second_pc = pca.components_[1]
feature_train_pca = pca.transform(feature_train)
feature_test_pca = pca.transform(feature_test)  # reuse the fit from the training data
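
And a sketch of the preprocessing use case (eigenfaces-style); the variable names and the SVC classifier here are illustrative assumptions, not part of the original example:
from sklearn.decomposition import PCA
from sklearn.svm import SVC

pca = PCA(n_components=10)                        # keep only the 10 strongest components
feature_train_pca = pca.fit_transform(feature_train)
feature_test_pca = pca.transform(feature_test)    # never re-fit on the test set

clf = SVC(kernel='rbf')
clf.fit(feature_train_pca, label_train)
print(clf.score(feature_test_pca, label_test))    # accuracy on the reduced features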

Every time I open Android Studio, this comes to mind:

IMHO


Observer Pattern? MVC? MVP? MVVM? I don't know
Do Androids (Programmer) Dream of Code Order?

Take a look at this post I found useful. Bye
FEATURE SELECTION
Feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for four reasons:
  • simplification of models to make them easier to interpret by researchers/users
  • shorter training times
  • to avoid the curse of dimensionality
  • enhanced generalization by reducing overfitting


Add a new feature (feature extraction):
Feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretations.
  • Use human intuition
  • Code the new feature
  • Visualize
  • Repeat


Getting rid of a feature (feature selection):
There is an optimal number of features that balances bias and variance; the process of finding this point is called regularization.
There are two big univariate feature selection tools in sklearn: SelectPercentile and SelectKBest. The difference is pretty apparent by the names: SelectPercentile selects the X% of features that are most powerful (where X is a parameter) and SelectKBest selects the K features that are most powerful (where K is a parameter).
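Sklearn (a minimal sketch; the variable names are assumptions):
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

selector = SelectKBest(f_classif, k=10)            # keep the 10 most powerful features
feature_train_new = selector.fit_transform(feature_train, label_train)
feature_test_new = selector.transform(feature_test)

# SelectPercentile works the same way but keeps the top X% of features
selector = SelectPercentile(f_classif, percentile=10)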


Lasso Regression:
One of these methods is Lasso regression, which adds a penalty term on the size of the coefficients, like this:


minimize: SSE + λ Σ|βi|
So the formula finds a balance between the minimum sum of squared errors and the size of the coefficients, which limits the number of features that are effectively used.
It also identifies the best features: each feature has its own coefficient, so ordering the features by the absolute value of their coefficients gives a ranking of the most important ones (Lasso drives the coefficients of unimportant features to exactly zero).


Sklearn:
from sklearn.linear_model import Lasso
regression = Lasso()
regression.fit(features, labels)
regression.predict([[2, 4]])  # predict expects a 2-D array of samples

print(regression.coef_)  # features whose coefficient is 0 have been discarded

TEXT LEARNING
Using raw text directly as a feature is not practical because every document can have a different length.
So we can use different algorithms to deal with it.


Before that, we should preprocess the text:
  • There is a very large number of words that do not provide any information about the text (such as "a", "and", "the", etc.); they are called stop words and should be removed before the analysis.
  • Many words come in multiple "flavors" (like response, unresponsive, respond, etc.), so we search for the common root of the word and use it as a substitute. These word lists and stemmers are prepared by linguistic experts.


Algorithms:
Then we can use one of these two algorithms:
  • Bag of Words (BoW) is an algorithm that counts how many times a word appears in a document. The text is represented as a bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
  • Term-frequency-inverse document frequency (TF-IDF) is another way to judge the topic of an article by the words it contains. It measures the number of times that words appear in a given document (term frequency). But because words, such as “and” or “the”, appear frequently in all documents, those are systematically discounted. That’s the inverse-document frequency part. The more documents a word appears in, the less valuable that word is. That’s intended to leave only the frequent AND distinctive words as markers.


Sklearn:
import nltk
nltk.download()  [only needed the first time, to download the corpora]


[stop words for the English language]
from nltk.corpus import stopwords
sw = stopwords.words("english")


[extract the common root of every word in the text]
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer.stem("responsiveness")


[count word occurrences, BoW]
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(email_list)
bag_of_words = vectorizer.transform(email_list)
print(vectorizer.vocabulary_.get("great"))


[use TF-IDF]
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words="english")
feature_train_transformed = vectorizer.fit_transform(feature_train)
feature_test_transformed = vectorizer.transform(feature_test)

FEATURE SCALING
Feature scaling is a method used to standardize the range of independent variables or features of data (data normalization). Why? Because the majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
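
A small numeric sketch of why this matters (the height/weight values are made up for illustration):
import numpy as np

# two people described by [height in m, weight in kg]; weight has a much broader range
a = np.array([1.8, 60.0])
b = np.array([1.6, 90.0])
print(np.linalg.norm(a - b))   # ~30.0: the distance is dominated by the weight feature

# after rescaling both features to [0, 1] they contribute comparably
a_scaled = np.array([1.0, 0.0])    # assuming 1.6-1.8 m and 60-90 kg are the observed ranges
b_scaled = np.array([0.0, 1.0])
print(np.linalg.norm(a_scaled - b_scaled))   # ~1.41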


Rescaling:
Rescaling is the method of scaling the range of each feature so that it falls within [0, 1]:
X' = (X - Xmin) / (Xmax - Xmin)
Xmin = minimum value of the feature in the data
Xmax = maximum value of the feature in the data
Sklearn:
import numpy
from sklearn.preprocessing import MinMaxScaler
weights = numpy.array([[115.0], [140.0], [175.0]])  # one feature, three samples (2-D array)
scaler = MinMaxScaler()
rescaled_weight = scaler.fit_transform(weights)
