July 1 2015 / Day 8

We continued our discussion of optimization. One way to approach it is parameter tuning: start from an objective function, then pick an optimization algorithm to minimize it. A loss function measures how unhappy we are if we use the weight vector w to make a prediction on an input x when the correct output was y. Examples of loss functions for binary classification include the zero-one loss, perceptron loss, hinge loss, and logistic loss. Examples of loss functions for regression include the squared loss and the absolute deviation loss.
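
To keep the definitions straight for myself, here is a quick sketch of these losses for a linear predictor (the score = w·φ(x) convention is from lecture; the code itself is just my own illustration):

```python
import numpy as np

# score is the raw prediction w . phi(x); y is the label (+1/-1 for
# classification, a real number for regression)

def zero_one_loss(score, y):
    return float(np.sign(score) != y)          # 1 if we classify wrong, else 0

def perceptron_loss(score, y):
    return max(0.0, -score * y)                # penalizes negative margins

def hinge_loss(score, y):
    return max(0.0, 1.0 - score * y)           # penalizes margins below 1

def logistic_loss(score, y):
    return np.log(1.0 + np.exp(-score * y))    # smooth, never exactly zero

def squared_loss(score, y):
    return (score - y) ** 2                    # regression: squared residual

def absolute_deviation_loss(score, y):
    return abs(score - y)                      # regression: absolute residual
```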

The loss minimization framework says to minimize the training loss (also known as the training error or empirical risk), which is the average loss over all the training examples. The choice of loss matters: minimizing the squared loss pulls the prediction toward the mean of the targets, which tries to accommodate every example in the training set, while minimizing the absolute deviation pulls it toward the median, which is more robust to outliers.
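
A toy experiment of my own (not from class) that shows this: if the prediction is just a single constant c, minimizing the average squared loss over the data picks the mean, while minimizing the average absolute deviation picks the median.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one big outlier
cs = np.linspace(0, 100, 10001)                # candidate constants c

sq = [(np.mean((data - c) ** 2), c) for c in cs]
ab = [(np.mean(np.abs(data - c)), c) for c in cs]

print(min(sq)[1])  # ~22, the mean -- dragged toward the outlier
print(min(ab)[1])  # ~3, the median -- barely affected by the outlier
```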

A general approach to optimization is iterative optimization, which starts at some initial point w and repeatedly tweaks w so that the objective function value decreases. To do this, we rely on the gradient of the function, which gives the direction of steepest increase; moving in the opposite direction decreases the objective the most. This is called gradient descent. Gradient descent has two parameters: the step size η (which specifies how aggressively we want to pursue a direction) and the number of iterations T. However, each iteration requires going over all the training examples, which is expensive when you have a lot of data points.
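
Here is a bare-bones sketch of the update rule as I understand it, on a made-up one-dimensional objective, just to make the roles of η and T concrete:

```python
def gradient_descent(gradient, w_init, eta=0.1, T=100):
    """Repeatedly step against the gradient to decrease the objective."""
    w = w_init
    for t in range(T):
        w = w - eta * gradient(w)   # step of size eta in the downhill direction
    return w

# toy objective: f(w) = (w - 3)^2, whose gradient is 2(w - 3)
w_star = gradient_descent(lambda w: 2 * (w - 3), w_init=0.0)
print(w_star)  # close to 3, the minimizer
```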

All that’s left to do before we can use gradient descent is to compute the gradient of the objective function TrainLoss. For squared loss, the gradient is (up to a constant factor) the residual (prediction – target) times the feature vector φ(x).
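
Putting this together for linear regression with squared loss (my own toy data and code, written to check my understanding; the constant factor from differentiating the square is folded into the step size):

```python
import numpy as np

# made-up toy data: each row of Phi is a feature vector phi(x), y the targets
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])

def train_loss_gradient(w):
    residuals = Phi @ w - y          # prediction w . phi(x) minus target
    return Phi.T @ residuals / len(y)  # average of residual * phi(x)

w = np.zeros(2)
eta, T = 0.1, 500
for t in range(T):
    w = w - eta * train_loss_gradient(w)
print(w)  # should approach [1, 1], since y = 1 + x in this toy data
```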

We moved our discussion to clustering and segmentation. Image segmentation is the process of identifying groups of pixels that go together and separating an image into coherent objects. Clustering is the process of grouping similar points together and representing them with a single token. Clustering is important because it helps us summarize data, count, segment, and predict. There are two broad types of clustering: bottom-up and top-down. One aspect I found particularly interesting was Gestalt theory for perceptual grouping, the theory that grouping is key to visual perception and that elements in a collection can have properties that emerge from their relationships. This makes intuitive sense, but it is actually very difficult to translate into algorithms.

Agglomerative clustering starts with every point as its own cluster, finds the most similar pair of clusters (e.g. by Euclidean distance), and merges them into a parent cluster, repeating until everything is merged; this produces a dendrogram. Another clustering method is k-means, an iterative algorithm that places k centroids at random initial locations, assigns each point to its nearest centroid, and then, for each cluster, recomputes the centroid as the mean of the points assigned to it. The algorithm stops when none of the cluster assignments change. K-means++ avoids arbitrarily bad local minima by choosing each initial centroid with probability proportional to its squared distance from the centroids already chosen. To evaluate the clusters, we test their generative quality and discriminative quality.
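
Here is a minimal k-means sketch I wrote to make the assign/recompute loop concrete (plain random initialization rather than k-means++, and the toy data is made up):

```python
import numpy as np

def kmeans(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # place the k centroids at randomly chosen data points
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    assignments = None
    while True:
        # assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # stop when none of the cluster assignments change
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # recompute each centroid as the mean of the points assigned to it
        for j in range(k):
            if np.any(assignments == j):
                centroids[j] = points[assignments == j].mean(axis=0)
    return centroids, assignments

# toy usage: two well-separated blobs of 2-D points
pts = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
centers, labels = kmeans(pts, k=2)
```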
