Machine Learning 3: Decision Tree


DECISION TREE
A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label. The paths from root to leaf represent classification rules.


Parameters:
  • min_samples_split : the minimum number of samples a node must contain to be split further.
  • criterion : the metric used at each step to choose the variable that best splits the set. Different algorithms use different metrics for measuring "best".
Common ones: gini, information gain (explained below). Both parameters appear in the sketch after this list.
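
For instance, both parameters can be passed directly to sklearn's DecisionTreeClassifier (a minimal sketch; the values shown are arbitrary choices, not the defaults):

from sklearn.tree import DecisionTreeClassifier

# criterion="entropy" selects splits by information gain; "gini" is the default
# min_samples_split=10: a node with fewer than 10 samples is not split further
clf = DecisionTreeClassifier(criterion="entropy", min_samples_split=10)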


Information gain:
A value indicating how useful a feature is as a split. The algorithm tries to maximize information gain:
INFORMATION GAIN = ENTROPY(parent) − [weighted average] ENTROPY(children)


Entropy is a measure of impurity in the data:
ENTROPY = − Σᵢ pᵢ log₂(pᵢ)
0 ≤ entropy ≤ 1
pᵢ is the fraction of examples in class i
the sum Σᵢ runs over all classes
If all data is from the same class → entropy = 0
If the data is evenly split between two classes → entropy = 1 (the maximum for a binary problem)
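
As a sketch, the formula translates directly into a few lines of Python (the entropy helper below is illustrative, not a library function):

from collections import Counter
from math import log2

def entropy(labels):
    # -sum of p_i * log2(p_i) over every class i present in the labels
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

print(entropy(["slow", "slow", "fast", "fast"]))  # evenly split -> 1.0
print(entropy(["fast", "fast", "fast", "fast"]))  # single class -> 0.0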


EXAMPLE:

GRADE (feature)   SPEED LIMIT (feature)   SPEED (label)
steep             yes                     slow
steep             yes                     slow
flat              no                      fast
steep             no                      fast



In this case the information gain from splitting on "SPEED LIMIT" is higher than from splitting on "GRADE": SPEED LIMIT separates the labels perfectly (both children are pure, entropy 0), while the steep branch of GRADE still mixes slow and fast examples.
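
Reusing the entropy sketch above, the two gains can be checked by hand (the information_gain helper is illustrative, not a library function):

def information_gain(parent, children):
    # parent entropy minus the weighted average entropy of the children
    total = len(parent)
    return entropy(parent) - sum(len(c) / total * entropy(c) for c in children)

speed = ["slow", "slow", "fast", "fast"]

# splitting on SPEED LIMIT: yes -> rows 1-2, no -> rows 3-4 (both children pure)
print(information_gain(speed, [["slow", "slow"], ["fast", "fast"]]))  # 1.0

# splitting on GRADE: steep -> rows 1, 2, 4 (mixed), flat -> row 3
print(information_gain(speed, [["slow", "slow", "fast"], ["fast"]]))  # ~0.31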


Sklearn:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(feature_training, label_training)        # train the tree on the training set
prediction = clf.predict(feature_test)           # predict labels for unseen features

accuracy = clf.score(feature_test, label_test)   # fraction of correctly predicted labels
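
Put together on the toy table above, the whole pipeline fits in a few lines (the 0/1 encoding of GRADE and SPEED LIMIT is an arbitrary choice, since sklearn expects numeric features):

from sklearn.tree import DecisionTreeClassifier

# columns: GRADE (steep=1, flat=0), SPEED LIMIT (yes=1, no=0)
features = [[1, 1], [1, 1], [0, 0], [1, 0]]
labels = ["slow", "slow", "fast", "fast"]

clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(features, labels)
print(clf.predict([[0, 1]]))  # flat grade with a speed limit -> ['slow']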
