
MauroCerbai

Software Engineer

DECISION TREE
A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label. The paths from root to leaf represent classification rules.





Parameters:
  • min_samples_split : controls whether a node still has enough samples to be split further.
  • criterion : the algorithm used at each step to choose the variable that best splits the set. Different algorithms use different metrics for measuring "best".
common ones : gini, information gain (explained below); see the sketch right after this list for how these map to sklearn.
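
A minimal sketch of how these parameters map to scikit-learn (the values here are arbitrary examples, not tuned recommendations):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    min_samples_split=10,   # a node needs at least 10 samples to be split further
    criterion="entropy",    # measure split quality with information gain instead of the default "gini"
)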


Information gain:
It is a value indicating how useful a feature is when the data is split on it. The algorithm tries to maximize information gain.
INFORMATION GAIN = ENTROPY(parent) - [weighted average] ENTROPY(children)


The entropy is a measure of impurity in the data.
entropy = - Σi Pi * log2(Pi)
0 ≤ entropy ≤ 1
Pi is the fraction of examples in class i
the sum runs over all classes i
If all data is from the same class → entropy = 0
If the data is evenly split between two classes → entropy = 1
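
For example, with two classes split evenly (Pi = 0.5 each):
entropy = -(0.5*log2(0.5) + 0.5*log2(0.5)) = 1
while with a 3-to-1 split (Pi = 0.75 and 0.25):
entropy = -(0.75*log2(0.75) + 0.25*log2(0.25)) ≈ 0.81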


EXAMPLE:
              FEATURE                    LABEL
GRADE         SPEED LIMIT                SPEED
steep         yes                        slow
steep         yes                        slow
flat          no                         fast
steep         no                         fast



In this case the information gain from splitting on the feature "SPEED LIMIT" is higher than from splitting on "GRADE", because the speed limit alone perfectly separates slow from fast.
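
The same calculation can be reproduced in plain Python (the entropy and information_gain helpers below are written just for this toy table):

import math

def entropy(labels):
    # impurity of a list of class labels
    counts = [labels.count(c) for c in set(labels)]
    return -sum((n / len(labels)) * math.log2(n / len(labels)) for n in counts)

def information_gain(feature, labels):
    # entropy(parent) - weighted average entropy(children)
    parent = entropy(labels)
    children = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        children += len(subset) / len(labels) * entropy(subset)
    return parent - children

speed = ["slow", "slow", "fast", "fast"]      # label
grade = ["steep", "steep", "flat", "steep"]   # feature
speed_limit = ["yes", "yes", "no", "no"]      # feature

print(information_gain(grade, speed))         # ~0.31
print(information_gain(speed_limit, speed))   # 1.0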


Sklearn:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(feature_training, label_training)
prediction = clf.predict(feature_test)

accuracy = clf.score(feature_test, label_test)

SUPPORT VECTOR MACHINE
Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

A good separation is achieved by the hyperplane that has the largest distance to the nearest training-data point of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.

In addition to performing linear classification, SVMs can efficiently perform a nonlinear classification using what is called the “kernel trick”, implicitly mapping their inputs into high-dimensional feature spaces.

Parameters:
  • C : controls the trade-off between a smooth decision boundary and classifying training points correctly.
small C → smoother decision boundary (more training points may be misclassified)
large C → more training points classified correctly (less smooth boundary, higher risk of overfitting)
  • gamma (γ) : defines how far the influence of a single training example reaches.
low γ → far reach (smoother, straighter decision boundary)
high γ → close reach (wigglier decision boundary that follows individual points)
  • kernel : a function that takes an input that is not linearly separable and, by adding more dimensions, produces a space where it is linearly separable; the solution found there is still valid in the original input space.
common ones : linear, rbf, poly, sigmoid, precomputed (see the sketch right after this list for how these are passed to sklearn)
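
A minimal sketch of how these parameters are passed to scikit-learn's SVC (the values are arbitrary examples, not tuned for any dataset):

from sklearn.svm import SVC

clf = SVC(
    kernel="rbf",   # radial basis function kernel
    C=1.0,          # trade-off between a smooth boundary and correct training points
    gamma=0.1,      # how far the influence of a single training example reaches
)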

Overfitting:
Pay attention to overfitting: a model can score a high accuracy on the training points without generalizing well, so on new data it may score a much lower accuracy (see the sketch below).
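
One simple check, sketched here with the same placeholder variable names (feature_training, label_training, ...) used in the snippets of this post: compare the accuracy on the training data against the accuracy on held-out test data.

from sklearn.svm import SVC

clf = SVC(kernel="rbf", C=10000.0)   # a very large C makes overfitting more likely
clf.fit(feature_training, label_training)

train_accuracy = clf.score(feature_training, label_training)
test_accuracy = clf.score(feature_test, label_test)
# a large gap (e.g. 0.99 on training vs. 0.70 on test) suggests the model memorized the training set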

Sklearn:
from sklearn.svm import SVC
clf = SVC(kernel="linear")
clf.fit(feature_training, label_training)
prediction = clf.predict(feature_test)

accuracy = clf.score(feature_test, label_test)


NAIVE BAYES


Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong independence assumptions between the features.

Theorem:
P(A|B) = P(B|A) P(A) / P(B)

In a hospital the probability of a liver disease is 10%, the probability of a patient being alcoholic is 5%, and among those with a liver disease 7% are alcoholic. What is the probability of a liver disease if the patient is alcoholic?
P(L) = 0.1
P(A) = 0.05
P(A|L) = 0.07
P(L|A) = P(A|L)*P(L) / P(A) = 0.07*0.1 / 0.05 = 0.14 => 14%

It’s a popular method for text categorization that uses word frequencies as the features (but not their order); it assumes that the value of a particular feature is independent of the value of any other feature. Basically it counts the occurrences of each word in a text sample and assigns a probability to it; when you need to attribute a particular “email” to someone, it compares, word by word, the probability of that text being written by each person.

SENDER : CHRIS - Love 0.1 - Deal 0.8 - Life 0.1
SENDER : SARA - Love 0.5 - Deal 0.2 - Life 0.3

P(CHRIS) = 0.5 = P(SARA)
TEXT : Love deal
P(CHRIS | "love deal") = 0.1*0.8*0.5 = 0.04 -> 0.04/0.09 ≈ 44%
P(SARA | "love deal") = 0.5*0.2*0.5 = 0.05 -> 0.05/0.09 ≈ 56%
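
The same arithmetic can be sketched in plain Python (the word-probability dictionaries below just encode the assumed toy values above):

import math

p_word = {"CHRIS": {"love": 0.1, "deal": 0.8, "life": 0.1},
          "SARA":  {"love": 0.5, "deal": 0.2, "life": 0.3}}
prior = {"CHRIS": 0.5, "SARA": 0.5}

text = ["love", "deal"]

# naive assumption: word probabilities are independent, so they can be multiplied
score = {s: prior[s] * math.prod(p_word[s][w] for w in text) for s in p_word}
total = sum(score.values())
for sender in score:
    print(sender, round(score[sender] / total, 2))   # CHRIS 0.44, SARA 0.56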

Sklearn:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(feature_training, label_training)
prediction = clf.predict(feature_test)
accuracy = clf.score(feature_test, label_test)


