

EVALUATION METRICS
The simplest and most immediate metric is accuracy:
accuracy = items labeled correctly / all items


However, accuracy depends heavily on the composition of the input data, so accuracies computed on different data sets are not comparable.



To overcome this we use the confusion matrix.


Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa).


Analyzing the confusion matrix we can extract two metrics:





  • recall = out of all the items that really belong to class x, how many did we identify correctly? (related to accuracy)
true positive / ( true positive + false negative )


  • precision = once we predict x, what is the probability that it really is x?

true positive / ( true positive + false positive )
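
Sklearn (a minimal sketch, with toy labels made up here for illustration):
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # actual labels (toy data)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # predicted labels (toy data)

print(confusion_matrix(y_true, y_pred))   # counts of actual vs predicted classes
print(accuracy_score(y_true, y_pred))     # labeled correctly / all data
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(precision_score(y_true, y_pred))    # TP / (TP + FP)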

CROSS VALIDATION
One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set).
Conventional validation partitions the data set into two sets, for example 70% for training and 30% for testing.


Sklearn:
from sklearn import datasets
from sklearn import cross_validation  # old API; newer sklearn versions use sklearn.model_selection
iris = datasets.load_iris()
feature_train, feature_test, label_train, label_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)


[train]
pca.fit(feature_train)
feature_train_pca = pca.transform(feature_train)
svc.fit(feature_train_pca, label_train)


[test]
NO FIT (you want to reuse the exact transformation learned on the training set)
feature_test_pca = pca.transform(feature_test)
pred = svc.predict(feature_test_pca)


K-Fold:
In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data.



So, we can also explain it like this:
  • repeat k times
    • pick 1 block of data as the test set
    • train against the other k-1 blocks
    • test on the testing set
  • average the final results


Sklearn:
from sklearn.cross_validation import KFold  # old API; newer sklearn versions use sklearn.model_selection
kf = KFold(len(authors), 2)  # split the indices 0..len(authors)-1 into 2 folds
for train_indices, test_indices in kf:
    feature_train = [word_data[ii] for ii in train_indices]
    feature_test = [word_data[ii] for ii in test_indices]
    authors_train = [authors[ii] for ii in train_indices]
    authors_test = [authors[ii] for ii in test_indices]



GridSearchCV:
Parameter tuning is the process of selecting the values for a model's parameters that maximize the accuracy of the model.
Scikit-learn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score.
By default, the GridSearchCV's cross validation uses 3-fold KFold or StratifiedKFold depending on the situation.

Sklearn:
from sklearn import svm, grid_search  # old API; newer sklearn versions use sklearn.model_selection.GridSearchCV
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svr = svm.SVC()
clf = grid_search.GridSearchCV(svr, parameters)
clf.fit(iris.data, iris.target)
print(clf.best_params_)



PRINCIPAL COMPONENT ANALYSIS


PCA finds a new coordinate system that is obtained from the old one by translation and rotation only, and it centers the data in that system. The goal is to build a composite feature that more directly probes the underlying phenomenon (square footage + number of rooms → size).


How to determine the principal component:
The principal component of a dataset is the direction that has the largest variance (variance = spread of the data distribution), because it retains the maximum amount of original information. This is true because projecting the original data onto the longest axis of the new coordinate system keeps the data values as spread out as possible and loses the minimum possible amount of information.
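
As a minimal numpy sketch of this idea (the 2-D data below is made up for illustration): the first principal component is the top eigenvector of the covariance matrix, i.e. the direction of maximum variance.
import numpy as np

# made-up 2-D data, strongly spread along one diagonal direction
rng = np.random.RandomState(0)
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=200)])

centered = data - data.mean(axis=0)                   # translate: center the data
cov = np.cov(centered, rowvar=False)                  # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)
first_pc = eigenvectors[:, np.argmax(eigenvalues)]    # direction of largest variance
print(first_pc)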


When to use:
  • access latent features
  • dimensionality reduction
    • visualize high-dimensional data
    • reduce noise
    • use as preprocessing (reducing the input for a later algorithm, e.g. eigenfaces; see the sketch after the code below)


Sklearn:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(feature_train)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
first_pc = pca.components_[0]
second_pc = pca.components_[1]
feature_train_pca = pca.transform(feature_train)
feature_test_pca = pca.transform(feature_test)  # reuse the fit from the training data
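
And a sketch of the preprocessing use case (eigenfaces-style); the variable names and the SVC classifier here are illustrative assumptions, not part of the original example:
from sklearn.decomposition import PCA
from sklearn.svm import SVC

pca = PCA(n_components=10)                        # keep only the 10 strongest components
feature_train_pca = pca.fit_transform(feature_train)
feature_test_pca = pca.transform(feature_test)    # never re-fit on the test set

clf = SVC(kernel='rbf')
clf.fit(feature_train_pca, label_train)
print(clf.score(feature_test_pca, label_test))    # accuracy on the reduced features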

Every time I open Android Studio, this comes to mind:

IMHO


Observer Pattern? MVC? MVP? MVVM? I don't know
Do Androids (Programmer) Dream of Code Order?

Take a look at this post I found useful. Bye
FEATURE SELECTION
Feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for four reasons:
  • simplification of models to make them easier to interpret by researchers/users
  • shorter training times
  • to avoid the curse of dimensionality
  • enhanced generalization by reducing overfitting


Add a new feature (feature extraction):
Feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better human interpretations.
  • Use human intuition
  • Code the new feature
  • Visualize
  • Repeat


Getting rid of a feature (feature selection):
There is an optimal number of features that balances bias and variance; the process of finding this point is called regularization.
There are two big univariate feature selection tools in sklearn: SelectPercentile and SelectKBest. The difference is pretty apparent by the names: SelectPercentile selects the X% of features that are most powerful (where X is a parameter) and SelectKBest selects the K features that are most powerful (where K is a parameter).
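Sklearn (a minimal sketch; the variable names are assumptions):
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

selector = SelectKBest(f_classif, k=10)            # keep the 10 most powerful features
feature_train_new = selector.fit_transform(feature_train, label_train)
feature_test_new = selector.transform(feature_test)

# SelectPercentile works the same way but keeps the top X% of features
selector = SelectPercentile(f_classif, percentile=10)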


Lasso Regression:
One of these methods is Lasso regression, which adds a penalty term on the size of the coefficients, like this:


minimize: SSE + λ Σ|βi|
So the formula finds a balance between the minimum sum of squared errors and the size of the coefficients, which limits the number of features that are effectively used.
It also identifies the best features: each feature has its own coefficient, so ordering the features by the absolute value of their coefficients gives a ranking of the most important ones (Lasso drives the coefficients of unimportant features to exactly zero).


Sklearn:
from sklearn.linear_model import Lasso
regression = Lasso()
regression.fit(features, labels)
regression.predict([[2, 4]])  # predict expects a 2-D array of samples

print(regression.coef_)  # features whose coefficient is 0 have been discarded

TEXT LEARNING
Using raw text directly as a feature is not practical because every document can have a different length.
So we can use different algorithms to deal with it.


Before that, we should preprocess the text:
  • There is a very large number of words that do not provide any information about the text (such as "a", "and", "the", etc.); they are called stop words and should be removed before the analysis.
  • Many words come in multiple "flavors" (like response, unresponsive, respond, etc.), so we search for the common root of the word and use it as a substitute. These word lists and stemmers are prepared by linguistic experts.


Algorithms:
Then we can use one of these two algorithms:
  • Bag of Words (BoW) is an algorithm that counts how many times a word appears in a document. The text is represented as a bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
  • Term-frequency-inverse document frequency (TF-IDF) is another way to judge the topic of an article by the words it contains. It measures the number of times that words appear in a given document (term frequency). But because words, such as “and” or “the”, appear frequently in all documents, those are systematically discounted. That’s the inverse-document frequency part. The more documents a word appears in, the less valuable that word is. That’s intended to leave only the frequent AND distinctive words as markers.


Sklearn:
import nltk
nltk.download()  [only needed the first time, to download the corpora]


[stop words for the English language]
from nltk.corpus import stopwords
sw = stopwords.words("english")


[extract the common root of every word in the text]
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer.stem("responsiveness")


[count word occurrences, BoW]
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(email_list)
bag_of_words = vectorizer.transform(email_list)
print(vectorizer.vocabulary_.get("great"))


[use TF-IDF]
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words="english")
feature_train_transformed = vectorizer.fit_transform(feature_train)
feature_test_transformed = vectorizer.transform(feature_test)

FEATURE SCALING
Feature scaling is a method used to standardize the range of independent variables or features of data (data normalization). Why? Because the majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
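
A small numeric sketch of why this matters (the height/weight values are made up for illustration):
import numpy as np

# two people described by [height in m, weight in kg]; weight has a much broader range
a = np.array([1.8, 60.0])
b = np.array([1.6, 90.0])
print(np.linalg.norm(a - b))   # ~30.0: the distance is dominated by the weight feature

# after rescaling both features to [0, 1] they contribute comparably
a_scaled = np.array([1.0, 0.0])    # assuming 1.6-1.8 m and 60-90 kg are the observed ranges
b_scaled = np.array([0.0, 1.0])
print(np.linalg.norm(a_scaled - b_scaled))   # ~1.41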


Rescaling:
Rescaling is the method of scaling the range of each feature so that it falls within [0, 1]:
X' = (X - Xmin) / (Xmax - Xmin)
Xmin = minimum value of the feature in the data
Xmax = maximum value of the feature in the data
Sklearn:
import numpy
from sklearn.preprocessing import MinMaxScaler
weights = numpy.array([[115.0], [140.0], [175.0]])  # one feature, three samples (2-D array)
scaler = MinMaxScaler()
rescaled_weight = scaler.fit_transform(weights)
