Machine Learning 11: Cross Validation
CROSS VALIDATION
One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set).
Conventional validation partitions the data set into two complementary sets, for example 70% for training and 30% for testing.
Sklearn:
from sklearn.model_selection import train_test_split   # sklearn.cross_validation in older releases
feature_train, feature_test, label_train, label_test = train_test_split(
    iris_data, iris_target, test_size=0.4, random_state=0)
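As a quick sanity check (a sketch, assuming the iris arrays above), test_size=0.4 splits the 150 iris samples into 90 training and 60 test rows:

print(len(feature_train), len(feature_test))   # 90 60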
[train]
pca.fit(feature_train)                        # learn the PCA components from the training set only
feature_train_pca = pca.transform(feature_train)
svc.fit(feature_train_pca, label_train)       # fit the classifier on the transformed training data
[test]
NO FIT (reuse the exact transformation learned on the training set)
feature_test_pca = pca.transform(feature_test)
pred = svc.predict(feature_test_pca)          # predict on the test set, never re-fit
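The same discipline can be enforced automatically with a Pipeline; a minimal runnable sketch, assuming the iris split from above (the n_components=2 and linear kernel settings are just illustrative):

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# The pipeline fits PCA and SVC on the training data in one call,
# then applies the already-fitted PCA when scoring the test set.
clf = Pipeline([('pca', PCA(n_components=2)), ('svc', SVC(kernel='linear'))])
clf.fit(feature_train, label_train)
print(clf.score(feature_test, label_test))    # accuracy on the held-out 40%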
K-Fold:
K-fold cross-validation partitions the data into k equal blocks and rotates which block is held out. In words:
- repeat k times:
- pick one block of data as the test set
- train on the other k-1 blocks
- evaluate on the held-out block
- average the k results (see the cross_val_score sketch after the code below)
Sklearn:
from sklearn.model_selection import KFold   # sklearn.cross_validation in older releases

kf = KFold(n_splits=2)
for train_indices, test_indices in kf.split(word_data):
    feature_train = [word_data[ii] for ii in train_indices]
    feature_test = [word_data[ii] for ii in test_indices]
    authors_train = [authors[ii] for ii in train_indices]
    authors_test = [authors[ii] for ii in test_indices]
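The "average final result" step can be done in a single call with cross_val_score; a minimal sketch, assuming an SVC on the iris data from above:

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Runs k=5 train/test rounds and returns one score per fold.
scores = cross_val_score(SVC(kernel='linear'), iris_data, iris_target, cv=5)
print(scores.mean())   # the averaged cross-validation accuracy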
GridSearchCV:
Parameter tuning is the process of selecting the values of a model's hyperparameters that maximize its accuracy.
Scikit-learn provides GridSearchCV, an object that, given data, fits the estimator once per combination in a parameter grid and keeps the combination that maximizes the cross-validation score.
By default, GridSearchCV's cross-validation uses KFold, or StratifiedKFold when the estimator is a classifier (3 folds historically, 5 folds in newer scikit-learn releases).
Sklearn:
from sklearn import svm
from sklearn.model_selection import GridSearchCV   # sklearn.grid_search in older releases

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svr = svm.SVC()
clf = GridSearchCV(svr, parameters)
clf.fit(iris_data, iris_target)
print(clf.best_params_)
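After the fit, the winning model and its score are also available through standard GridSearchCV attributes; a short usage sketch continuing the example above:

print(clf.best_score_)               # mean cross-validated score of the best combination
best_model = clf.best_estimator_     # the SVC refitted on all the data with the best parameters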