I'm very happy to announce that I've been selected for this scholarship involving the famous Google product and the amazing learning platform Udacity. Thank you!
CLUSTERING
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
It’s an unsupervised learning task because the answer is not provided: there is no label associated with the data indicating the correct answer.
K-Means algorithm:
It can be achieved by various algorithms. The most basic and most used is k-means clustering.
Given an initial set of k cluster centers chosen at random, the algorithm proceeds by alternating between two steps:
- Assignment step: assign each data point to the "nearest" cluster center (the one with the least squared Euclidean distance)
- Update step: move each cluster center to the position that minimizes the total squared distance between itself and its assigned data points (i.e., the mean of the cluster)
The algorithm has converged when the assignments no longer change. There is no guarantee that the global optimum is found; it does, however, find a local optimum, so it is commonly run multiple times with different random initializations, keeping the best result.
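The two alternating steps above can be sketched in plain NumPy. This is a minimal illustration, not a production implementation: it initializes the centers with the first k points for determinism (real k-means uses random initializations repeated several times, as noted above) and does not handle empty clusters.

```python
import numpy as np

def kmeans(data, k, max_iter=100):
    """Minimal sketch of the two alternating k-means steps."""
    # initial centers: first k points (real k-means picks them at random,
    # repeating the whole run with different initializations)
    centers = data[:k].copy()
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        # assignment step: nearest center by squared Euclidean distance
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center moves to the mean of its assigned points
        new_centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # centers stable -> assignments no longer change
        centers = new_centers
    return centers, labels
```

On two well-separated blobs this recovers the obvious grouping; on harder data it can get stuck in a local optimum, which is exactly why multiple restarts are used.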
Sklearn:
from sklearn.cluster import KMeans
clf = KMeans(n_clusters=8,  # number of clusters
             n_init=10,     # times the algorithm is run with different initial centers
             max_iter=300)  # maximum number of iterations per run
clf.fit(data)
clf.predict(test)
clf.cluster_centers_
OUTLIERS
An outlier is an observation point that is distant from other observations.
Detection:
You simply follow this flow:
- Train the algorithm
- Remove ~10% of data with the largest residual error
- Train again and evaluate the test accuracy
- Repeat
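One pass of the flow above can be sketched as follows (a sketch under assumptions: a linear regression as the trained algorithm, and a hypothetical `frac=0.1` parameter for the ~10% of points to drop):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def remove_outliers(X, y, frac=0.1):
    """Train, then drop the frac of points with the largest residual errors."""
    model = LinearRegression().fit(X, y)
    residuals = np.abs(y - model.predict(X))
    # keep the points with the smallest residuals
    keep = residuals.argsort()[: int(len(y) * (1 - frac))]
    return X[keep], y[keep]
```

Repeating the call (retrain, drop, re-evaluate) implements the loop described above; you stop when the test accuracy no longer improves.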
LINEAR REGRESSION
In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Unlike classification, the regression output is not discrete but continuous, modeled by a function like:
y = ax + b
y is target
a is slope
b is intercept
Error metrics:
The error is calculated
error = actual data - predicted data
The best linear regression is the one that minimizes the sum of squared errors over all points:
SSE = Σ (actual - predicted)²
algorithms: ordinary least squares (OLS), gradient descent
But SSE is not a perfect metric: it grows with the number of data points (high with many points, lower with fewer), so it does not compare well across datasets of different sizes.
Instead, use R².
R squared measures the fraction of the variance of the dependent variable explained by the regression. In simple linear regression it is simply the square of the correlation coefficient. It is independent of the number of data points.
0 (not good) ≤ R² ≤ 1 (good)
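The definition above can be written as R² = 1 - SS_res / SS_tot and computed directly (a minimal sketch; the function name is assumed):

```python
import numpy as np

def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot: fraction of variance explained."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = ((actual - predicted) ** 2).sum()       # residual sum of squares
    ss_tot = ((actual - actual.mean()) ** 2).sum()   # total variance around the mean
    return 1 - ss_res / ss_tot
```

A perfect prediction gives R² = 1, while always predicting the mean of the data gives R² = 0, matching the scale above.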
Sklearn:
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(feature_training, label_training)
prediction = clf.predict(feature_test)
accuracy = clf.score(feature_test, label_test)  # -> R2 error metric
slope = clf.coef_
intercept = clf.intercept_
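Putting the snippet above together on toy data (the data values are assumed for illustration): fitting a perfect line y = 3x + 2 should recover slope 3, intercept 2, and R² = 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data (assumed): points lying exactly on y = 3x + 2
feature_training = np.array([[0.0], [1.0], [2.0], [3.0]])
label_training = 3 * feature_training.ravel() + 2

clf = LinearRegression()
clf.fit(feature_training, label_training)

slope = clf.coef_[0]        # ~3.0
intercept = clf.intercept_  # ~2.0
r2 = clf.score(feature_training, label_training)  # 1.0 on a perfect line
```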