Data-driven approaches
image classification: to assign a label to an image
challenges: illumination, background clutter, occlusion, deformation, intraclass variation, etc.
process:
- collect a dataset of images and labels
- use an ML algorithm to train a classifier (takes in images & labels, returns a model); for nearest neighbor, the training function simply memorizes all data & labels
- evaluate the classifier on new images (takes in the model & test images, returns outputs); the predict function predicts the label of each test image
original idea: output the label of the “nearest” training image to the test image.
- L1 distance (Manhattan distance): $\displaystyle d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|$
- training time: $O(1)$ (no training)
- prediction time: $O(\text{samples} \cdot \text{dim})$ per test image.
- we want to accelerate prediction.
- L2 distance (Euclidean distance): $d_2(I_1, I_2) = \sqrt{\sum_p (I_1^p - I_2^p)^2}$
- L1 depends on the individual feature values (and on the coordinate system); L2 does not.
- if each feature value has a specific meaning we want to preserve $\rightarrow$ L1 distance; if features are arbitrary $\to$ L2 distance
K-Nearest Neighbor: take a majority vote among the nearest K neighbors.

Ties in the vote create “white regions” with no clear winner. We should collect more samples for these areas.
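The process above can be sketched as a minimal k-NN classifier in NumPy (class and method names are illustrative):

```python
import numpy as np

class NearestNeighbor:
    """Minimal k-NN classifier: 'training' just memorizes the data."""

    def fit(self, X, y):
        self.X, self.y = X, y                     # O(1) training: store everything

    def predict(self, X_test, k=1, metric="l2"):
        labels = []
        for x in X_test:
            if metric == "l1":
                d = np.abs(self.X - x).sum(axis=1)            # Manhattan distance
            else:
                d = np.sqrt(((self.X - x) ** 2).sum(axis=1))  # Euclidean distance
            nearest = self.y[np.argsort(d)[:k]]   # labels of the k closest samples
            values, counts = np.unique(nearest, return_counts=True)
            labels.append(values[np.argmax(counts)])          # majority vote
        return np.array(labels)
```

Note the asymmetry mentioned above: `fit` is trivial, while each call to `predict` scans every stored sample, i.e. $O(\text{samples} \cdot \text{dim})$ per test image.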
Hyperparameters: choices about the algorithm itself (e.g. K, the distance metric).
- often dataset-dependent or problem-dependent
- choosing them is referred to as hyperparameter tuning
- idea #1: choose the hyperparameters that work best on the training data. (for k-NN, K = 1 always fits the training data perfectly)
- idea #2: choose the hyperparameters that work best on the test data. (kind of cheating: no estimate of performance on new data remains)
- idea #3: hold out part of the training data as a validation set and evaluate on it. (the held-out part may not be a good representative)
- idea #4: cross-validation: split the data into folds, use each fold in turn as the validation set, and average the scores. (less practiced on large-scale datasets)
- use intuition to set initial hyperparameters
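Idea #4 can be sketched as a generic k-fold loop; the `train_fn`/`score_fn` callables are placeholders for any classifier and metric:

```python
import numpy as np

def cross_validate(train_fn, score_fn, X, y, num_folds=5):
    """Split the data into num_folds folds; each fold serves once as the
    evaluation set while the remaining folds train the model; return the
    average score across folds."""
    folds = np.array_split(np.arange(len(X)), num_folds)
    scores = []
    for i in range(num_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(num_folds) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        scores.append(score_fn(model, X[val_idx], y[val_idx]))
    return float(np.mean(scores))
```

Hyperparameter tuning then amounts to calling this once per candidate setting (e.g. per value of K) and keeping the setting with the best average score.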
Linear Classification
parametric approach:
- $f(x, W)$ (takes in image and parameters, give labels)
- linear classifier $f(x,W) = Wx + b$, where $W$ is $\text{labels} \times \text{pixels}$, $x$ is $\text{pixels} \times 1$, and the bias $b$ is $\text{labels} \times 1$
- the linear classifier is the most basic building block of neural networks.
- visual viewpoint: each row of $W$ acts as a template for its class
- geometric viewpoint: find the hyperplane that separates each class from the others.
- problems for linear classifier:
- cannot assign one label to several disconnected regions of the input space
- cannot create curved decision boundaries.
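As a concrete sketch of the parametric approach above (sizes are assumed for illustration: 10 labels and 32×32×3 images flattened to 3072 pixels):

```python
import numpy as np

num_labels, num_pixels = 10, 32 * 32 * 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(num_labels, num_pixels))  # labels x pixels
b = np.zeros(num_labels)                                   # labels x 1

def f(x, W, b):
    """Linear classifier: one raw score per label; predict the argmax."""
    return W @ x + b

x = rng.random(num_pixels)          # a flattened image (placeholder data)
scores = f(x, W, b)                 # shape: (num_labels,)
predicted_label = int(np.argmax(scores))
```

Unlike k-NN, all training effort goes into finding good values of $W$ and $b$; prediction is a single matrix-vector product, independent of the training-set size.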
process:
- define a loss function that quantifies the errors of the labels on the training data
- come up with a way of efficiently finding the parameters that minimize the loss function.
softmax classifier:
- interpret raw classifier scores as probability distribution (all are positive and the sum is $1$)
- if scores is $s = f(x_i;W)$, then $\displaystyle P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_je^{s_j}}$
- the same framework as logistic regression.
- it’s not the only loss function used in classification.
SVM / hinge loss function: $\displaystyle L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + 1)$
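Both losses operate directly on the raw scores $s$; a minimal NumPy sketch (the example scores in the test are made up):

```python
import numpy as np

def softmax_loss(s, y):
    """Cross-entropy loss -log P(Y = y | x), with a shifted softmax:
    subtracting max(s) leaves the probabilities unchanged but avoids
    overflow in exp."""
    s = s - s.max()
    p = np.exp(s) / np.exp(s).sum()     # all positive, sums to 1
    return -np.log(p[y])

def hinge_loss(s, y, margin=1.0):
    """SVM loss: sum over j != y of max(0, s_j - s_y + margin)."""
    margins = np.maximum(0.0, s - s[y] + margin)
    margins[y] = 0.0                    # skip the j == y_i term
    return margins.sum()
```

A design difference worth noting: the hinge loss is exactly zero once every wrong class trails the correct one by the margin, while the softmax loss keeps pushing the correct-class probability toward 1.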
objective function:
- to maximize the probability of the correct label.
- loss function 1: negative log-likelihood $L_i = -\log P(Y = y_i \mid X = x_i)$, whose range is $[0, +\infty)$
- loss function 2: KL divergence $D_{KL}(P||Q) = \sum_y P(y) \log\frac{P(y)}{Q(y)}$
- loss function 3: cross entropy $H(P,Q) = H(P) + D_{KL}(P\|Q)$
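The identity $H(P,Q) = H(P) + D_{KL}(P\|Q)$ can be checked numerically; with a one-hot $P$, $H(P) = 0$ and the cross entropy reduces to loss function 1, $-\log Q[y]$. A sketch with made-up distributions:

```python
import numpy as np

P = np.array([1.0, 0.0, 0.0])   # one-hot "true" label distribution
Q = np.array([0.7, 0.2, 0.1])   # model's softmax output (made up)

def entropy(P):
    nz = P > 0                  # convention: 0 * log 0 = 0
    return -(P[nz] * np.log(P[nz])).sum()

def kl(P, Q):
    nz = P > 0
    return (P[nz] * np.log(P[nz] / Q[nz])).sum()

def cross_entropy(P, Q):
    nz = P > 0
    return -(P[nz] * np.log(Q[nz])).sum()
```

Since $H(P)$ does not depend on the model, minimizing the cross entropy and minimizing the KL divergence to the true distribution are equivalent objectives.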
Glossary
- illumination
- background clutter
- occlusion
- raccoon
- deformation
- intraclass variation
- context
- paradigm
- agnostic
- finalize
- algebraic
- negate