DECISION TREE
A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label. The paths from root to leaf represent classification rules.
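As a minimal hand-written sketch (not produced by any library), a tiny tree over two hypothetical attributes can be expressed as nested tests in Python; each if is an internal node's test and each return is a leaf's class label, so every path down to a return is one classification rule:
def classify(grade, speed_limit):
    # root node: test on the "grade" attribute
    if grade == "flat":
        return "fast"              # leaf node → class label
    # internal node on the "steep" branch: test on the "speed_limit" attribute
    if speed_limit == "yes":
        return "slow"              # leaf node → class label
    return "fast"                  # leaf node → class label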
Parameters:
- min_samples_split : the minimum number of samples a node must contain before it can be split further.
- criterion : the algorithm for choosing, at each step, the variable that best splits the set. Different algorithms use different metrics for measuring "best".
common ones : gini, information gain / entropy (explained below; a usage sketch follows this list)
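A minimal usage sketch of these parameters with scikit-learn's DecisionTreeClassifier (the values 10 and "entropy" are arbitrary choices for illustration; "entropy" is the criterion that corresponds to information gain):
from sklearn.tree import DecisionTreeClassifier

# require at least 10 samples in a node before it may be split further;
# choose splits by information gain ("entropy") instead of the default "gini"
clf = DecisionTreeClassifier(min_samples_split=10, criterion="entropy")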
Information gain:
It is a value indicating how useful a feature is for splitting the data. The algorithm tries to maximize information gain.
INFORMATION GAIN = ENTROPY (parent) - [weighted average] ENTROPY (children)
Entropy is a measure of impurity in the data:
ENTROPY = - Σ_i P_i log2(P_i)
where P_i is the fraction of examples in class i and the sum runs over all classes i.
0 ≤ entropy ≤ 1 (with two classes)
If all data is from the same class → entropy = 0
If the data is evenly split between the two classes → entropy = 1
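A small Python sketch of the entropy formula (the function name and the toy label lists are made up for illustration):
from collections import Counter
from math import log2

def entropy(labels):
    # sum over the classes of -P_i * log2(P_i)
    total = len(labels)
    return sum(-(count / total) * log2(count / total)
               for count in Counter(labels).values())

print(entropy(["slow", "slow", "slow", "slow"]))   # 0.0 → all data in one class
print(entropy(["slow", "slow", "fast", "fast"]))   # 1.0 → evenly split between two classes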
EXAMPLE:
GRADE (feature) | SPEED LIMIT (feature) | SPEED (label)
steep           | yes                   | slow
steep           | yes                   | slow
flat            | no                    | fast
steep           | no                    | fast
In this case the information gain provided by splitting on the feature "SPEED LIMIT" is higher than on "GRADE": splitting on SPEED LIMIT separates slow from fast perfectly (children entropy = 0, so the gain is 1), while the "steep" branch of GRADE still mixes slow and fast examples.
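Continuing the sketch above (this reuses the entropy() function defined earlier; the split groupings are read directly off the four rows of the table):
def information_gain(parent_labels, children_labels):
    # ENTROPY(parent) - weighted average of ENTROPY(children)
    total = len(parent_labels)
    weighted = sum(len(child) / total * entropy(child) for child in children_labels)
    return entropy(parent_labels) - weighted

speed = ["slow", "slow", "fast", "fast"]                                # the label column

# split on GRADE:        steep → rows 1, 2, 4   flat → row 3
print(information_gain(speed, [["slow", "slow", "fast"], ["fast"]]))    # ≈ 0.31

# split on SPEED LIMIT:  yes → rows 1, 2        no → rows 3, 4
print(information_gain(speed, [["slow", "slow"], ["fast", "fast"]]))    # 1.0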
Sklearn:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()                     # default criterion is "gini"
clf.fit(feature_training, label_training)          # learn the tree from the training set
prediction = clf.predict(feature_test)             # predicted labels for the test features
accuracy = clf.score(feature_test, label_test)     # mean accuracy on the test set
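To inspect the classification rules the fitted tree has learned (the root-to-leaf paths mentioned at the top), scikit-learn's export_text can print it; the feature names below are taken from the example and are only an assumption about how the training columns were ordered:
from sklearn.tree import export_text

# prints the learned tree as nested if/else rules, one line per node
print(export_text(clf, feature_names=["grade", "speed_limit"]))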