HandsonML 9. Unsupervised Learning
1. Basic
Why unsupervised:
Many datasets have no labels.
Exploit unlabeled data without human labeling
Types:
dimensionality reduction
clustering = identify similar instances and group them together
anomaly detection
density estimation
Examples of clustering applications: semi-supervised learning, customer segmentation, data analysis, anomaly detection, search engines.
2. K-means
Find K centroids and group each instance with its nearest centroid:
Place K centroids randomly.
Label each instance with its nearest centroid.
Update each centroid to the mean position of the instances assigned to it.
Repeat the labeling and update steps until the centroids stop moving.
Good: Guaranteed to converge
Good: Fast
Bad: Can converge to a sub-optimal solution (a local optimum), depending on the initialization
Bad: K must be chosen in advance
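The loop described in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name `kmeans`, the `n_iters` cap, the choice to initialise from randomly picked instances, and the toy blobs are all assumptions made for the example.

```python
import numpy as np


def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means sketch; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct instances picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Labeling step: each instance gets the index of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its instances
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two toy blobs (made-up data for illustration only).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2)),
])
centroids, labels = kmeans(X, k=2)
```

Changing `seed` changes the starting centroids, which is why different runs can end in different solutions.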
2.1. Improve sub-optimal solutions
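Run the algorithm multiple times with different random initializations and keep the best solution (the one with the lowest inertia, i.e., the smallest sum of squared distances between instances and their nearest centroid).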
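A small sketch of the restart idea with scikit-learn, assuming inertia (`inertia_`) is the measure of "best". The toy blobs, the number of restarts, and the seeds are arbitrary choices for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: five blobs (made-up for illustration only).
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.8, random_state=7)

# Run k-means once per seed, each with a single random initialisation.
runs = [
    KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X)
    for seed in range(10)
]
inertias = [km.inertia_ for km in runs]

# Keep the run with the lowest inertia.
best = min(runs, key=lambda km: km.inertia_)

# n_init=10 asks scikit-learn to perform the same kind of restarts internally
# and keep the lowest-inertia model.
auto = KMeans(n_clusters=5, init="random", n_init=10, random_state=0).fit(X)
```

In practice the `n_init` parameter is the shorter route; the explicit loop is only there to show what it does.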
2.2. K-means as data preprocessing
[Figure placeholder: image from page 251, still to be copied]
2.3. K-means in semi-supervised learning
Baseline: supervised learning on the first 50 labeled samples (effectively a random pick) = 83.3% accuracy
K-means with 50 clusters, then train on the 50 representative images (the image closest to each centroid) = 92.2% accuracy
Label propagation = 94.0% accuracy
Full labeled training set = 96.9% accuracy
Label Propagation
Assign each representative's label to the other instances in the same k-means cluster.
Works well because the propagated labels are about 99% accurate.
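The comparison above can be sketched roughly as follows. This assumes scikit-learn's small `load_digits` dataset and a `LogisticRegression` classifier; the helper `fit_and_score` is a hypothetical name, and the true labels in `y_train` stand in for manual labeling of the 50 representatives. The exact accuracies depend on the library version and seeds, so they will not necessarily match the figures in these notes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
# Distance from every training instance to every centroid, shape (n, k).
X_dist = kmeans.fit_transform(X_train)

# Representative of each cluster = training instance closest to its centroid.
rep_idx = X_dist.argmin(axis=0)
X_rep = X_train[rep_idx]
# Stand-in for a human labeling 50 images: look the labels up in y_train.
y_rep = y_train[rep_idx]

# Label propagation: every instance inherits its cluster representative's label.
y_propagated = y_rep[kmeans.labels_]


def fit_and_score(X_fit, y_fit):
    """Train a classifier on the given labels and score it on the test set."""
    clf = LogisticRegression(max_iter=5000)
    clf.fit(X_fit, y_fit)
    return clf.score(X_test, y_test)


acc_first_50 = fit_and_score(X_train[:k], y_train[:k])   # baseline
acc_representatives = fit_and_score(X_rep, y_rep)        # 50 representatives
acc_propagated = fit_and_score(X_train, y_propagated)    # propagated labels
propagated_label_accuracy = (y_propagated == y_train).mean()
```

Propagating only to the instances closest to each centroid, rather than to the whole cluster, is a common refinement because instances near cluster boundaries are the ones most likely to receive a wrong label.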
Active learning (working together with a human expert)
Train a model on the labeled instances gathered so far.
This model makes predictions on all unlabeled instances.
The instances the model is most uncertain about (i.e., those with the lowest estimated probability for the predicted class) are given to the expert to be labeled.
Iterate until the performance improvement stops.
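A rough sketch of this uncertainty-sampling loop, under several assumptions: the `load_digits` data, a `LogisticRegression` model, an initial pool of 50 labeled instances, 20 queries per round, and a fixed five rounds instead of a real stopping test. The true labels in `y_pool` play the role of the human expert.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
# Indices of the instances labeled so far (assumption: start with 50 at random).
labeled = list(rng.choice(len(X_pool), size=50, replace=False))
scores = []

for _ in range(5):  # fixed number of rounds to keep the sketch short
    # 1. Train on the labeled instances gathered so far.
    clf = LogisticRegression(max_iter=5000)
    clf.fit(X_pool[labeled], y_pool[labeled])
    scores.append(clf.score(X_test, y_test))

    # 2. Predict on all unlabeled instances.
    unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
    # Confidence = probability of the predicted class; low means uncertain.
    confidence = clf.predict_proba(X_pool[unlabeled]).max(axis=1)

    # 3. Send the 20 least confident instances to the "expert".
    query = unlabeled[np.argsort(confidence)[:20]]
    labeled.extend(query.tolist())  # their labels are then read from y_pool
```

A real loop would watch `scores` (ideally on a separate validation set) and stop once the improvement levels off.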
3. DBSCAN
Defines clusters as continuous regions of high density
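For each instance, count how many instances lie within a small distance ε of it (its ε-neighborhood).
An instance is a core instance when its ε-neighborhood contains at least min_samples instances (itself included).
All instances in a core instance's neighborhood belong to the same cluster, and core instances that are neighbors of each other merge into one cluster. Any instance that is neither a core instance nor in a core instance's neighborhood is treated as an anomaly.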
Two hyper-params: ε and min_samples
Robust to outliers
Roughly O(m log m) time in the number of instances m, which is close to linear
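A short scikit-learn sketch of the two hyperparameters in action. The moons dataset and the values `eps=0.2` and `min_samples=5` are assumptions chosen for illustration; other data will need other settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Toy data: two interleaving half-circles with a little noise.
X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = dbscan.labels_                   # cluster index per instance, -1 = anomaly
core_idx = dbscan.core_sample_indices_    # indices of the core instances
n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
n_anomalies = int(np.sum(labels == -1))
```

Shrinking `eps` tends to split the data into more clusters and flag more anomalies, while enlarging it merges neighboring clusters.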
[Figure placeholder: Figure 9-14, page 257, still to be added]
4. GMM (Gaussian mixture model)
A probabilistic model
GMM assumes that the instances were generated from a mixture of several Gaussian distributions whose parameters are unknown.
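As a hedged illustration, scikit-learn's `GaussianMixture` can estimate those unknown parameters from data. The two toy Gaussians, `n_components=2`, and `n_init=10` below are assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data drawn from two Gaussians (made-up for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(300, 2)),
    rng.normal(loc=(4.0, 4.0), scale=1.0, size=(200, 2)),
])

gm = GaussianMixture(n_components=2, n_init=10, random_state=0).fit(X)

weights = gm.weights_          # estimated share of each Gaussian in the mixture
means = gm.means_              # estimated mean of each Gaussian
covariances = gm.covariances_  # estimated covariance matrix of each Gaussian

soft = gm.predict_proba(X)         # probability of each component per instance
hard = gm.predict(X)               # most likely component per instance
log_density = gm.score_samples(X)  # log of the estimated density at each instance
```

Because the model is probabilistic, `score_samples` doubles as a density estimate: instances with very low values sit in low-density regions and are candidates for anomalies.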