I'm very happy to announce that I've been selected for this scholarship involving the famous Google product and the amazing learning platform Udacity. Thank you!
CLUSTERING
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
It’s an unsupervised learning task because the answer is not provided: there is no label associated with the data indicating the correct answer.
K-Means algorithm:
It can be achieved by various algorithms. The most basic and most used is k-means clustering.
Given an initial set of k cluster centers chosen at random, the algorithm proceeds by alternating between two steps:
- Assignment step: assign each data point to the "nearest" cluster center (the one with the least squared Euclidean distance)
- Update step: move each cluster center to the position that minimizes the total squared distance between itself and its assigned data points (i.e., the mean of the cluster)
The algorithm has converged when the assignments no longer change. There is no guarantee that the global optimum is found; it does, however, find a local optimum, so it is commonly run multiple times with different random initializations, keeping the best result.
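The two alternating steps above can be sketched in plain NumPy. This is a minimal illustration, not a production implementation: it initializes the centers with the first k points for determinism (real k-means uses random initializations repeated several times, as noted above) and does not handle empty clusters.

```python
import numpy as np

def kmeans(data, k, max_iter=100):
    """Minimal sketch of the two alternating k-means steps."""
    # initial centers: first k points (real k-means picks them at random,
    # repeating the whole run with different initializations)
    centers = data[:k].copy()
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        # assignment step: nearest center by squared Euclidean distance
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center moves to the mean of its assigned points
        new_centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # centers stable -> assignments no longer change
        centers = new_centers
    return centers, labels
```

On two well-separated blobs this recovers the obvious grouping; on harder data it can get stuck in a local optimum, which is exactly why multiple restarts are used.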
Sklearn:
from sklearn.cluster import KMeans
clf = KMeans(n_clusters=8,  # number of clusters
             n_init=10,     # times the algorithm is run with different initial centers
             max_iter=300)  # maximum number of iterations per run
clf.fit(data)
clf.predict(test)
clf.cluster_centers_
OUTLIERS
An outlier is an observation point that is distant from other observations.
Detection:
You simply follow this flow:
- Train the algorithm
- Remove ~10% of data with the largest residual error
- Train again and evaluate the test accuracy
- Repeat
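One pass of the flow above can be sketched as follows (a sketch under assumptions: a linear regression as the trained algorithm, and a hypothetical `frac=0.1` parameter for the ~10% of points to drop):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def remove_outliers(X, y, frac=0.1):
    """Train, then drop the frac of points with the largest residual errors."""
    model = LinearRegression().fit(X, y)
    residuals = np.abs(y - model.predict(X))
    # keep the points with the smallest residuals
    keep = residuals.argsort()[: int(len(y) * (1 - frac))]
    return X[keep], y[keep]
```

Repeating the call (retrain, drop, re-evaluate) implements the loop described above; you stop when the test accuracy no longer improves.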
LINEAR REGRESSION
In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Unlike classification, the regression output is not discrete but continuous, modeled by a function like:
y = ax + b
y is target
a is slope
b is intercept
Error metrics:
The error is calculated
error = actual data - predicted data
The best linear regression is the one that minimizes the sum of squared errors over all points:
SSE = Σ (actual - predicted)²
algorithms: ordinary least squares (OLS), gradient descent
But SSE is not a perfect metric: it grows with the number of data points (high with many points, lower with fewer), so it does not compare well across datasets of different sizes.
Instead, use R².
R squared measures the fraction of the variance of the dependent variable explained by the regression. In simple linear regression it is simply the square of the correlation coefficient. It is independent of the number of data points.
0 (not good) ≤ R² ≤ 1 (good)
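The definition above can be written as R² = 1 - SS_res / SS_tot and computed directly (a minimal sketch; the function name is assumed):

```python
import numpy as np

def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot: fraction of variance explained."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = ((actual - predicted) ** 2).sum()       # residual sum of squares
    ss_tot = ((actual - actual.mean()) ** 2).sum()   # total variance around the mean
    return 1 - ss_res / ss_tot
```

A perfect prediction gives R² = 1, while always predicting the mean of the data gives R² = 0, matching the scale above.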
Sklearn:
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(feature_training, label_training)
prediction = clf.predict(feature_test)
accuracy = clf.score(feature_test, label_test)  # -> R2 error metric
slope = clf.coef_
intercept = clf.intercept_
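Putting the snippet above together on toy data (the data values are assumed for illustration): fitting a perfect line y = 3x + 2 should recover slope 3, intercept 2, and R² = 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy data (assumed): points lying exactly on y = 3x + 2
feature_training = np.array([[0.0], [1.0], [2.0], [3.0]])
label_training = 3 * feature_training.ravel() + 2

clf = LinearRegression()
clf.fit(feature_training, label_training)

slope = clf.coef_[0]        # ~3.0
intercept = clf.intercept_  # ~2.0
r2 = clf.score(feature_training, label_training)  # 1.0 on a perfect line
```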