Object Recognition

Object recognition is very similar to object classification, and perhaps even to detection. It is quite sensible to ask: what is the difference between object classification and object recognition? Even I would say that I have trouble drawing the line. One way I like to think about it is that classification is more about differentiating instances that are clearly different (fork vs spoon vs knife, or cat vs dog vs face), whereas recognition is more about differentiating objects within a class (dinner fork vs salad fork vs seafood fork, or different celebrities' faces). We more often hear “face recognition” than “face classification.” Likewise, we more often hear animal classification than animal recognition (as in cat vs dog vs human vs elephant).

So, here we study methods that deal with differentiating objects within a given class of objects (or so I believe …). It may involve differentiation across different types of objects too.

See also: an object detector with boosting:
http://people.csail.mit.edu/torralba/shortCourseRLOC/boosting/boosting.html

Module #1: Bag-of-Words

Before people got into the auto-learning of feature spaces, it was common to hand craft a feature space, or to come up with a mechanism for creating feature spaces that generalized. One such mechanism is called Bag-of-Words. It comes from the text classification field. Imagine trying to classify a book as horror, romance, comedy, or tragedy. You would imagine that the language in each would be slightly different, and that the words inspiring the related emotions would vary. By comparing the frequency of special words in the books, one could identify the type of stories the books contained. Computer vision researchers made an effort to translate that same concept to imagery. But how do images have words? Well, let's see …

Week #1: Clustering to Define Words

  1. Study the k-means clustering algorithm and its algorithmic steps.
  2. Download (or clone) the clustering skeleton code here.
  3. Implement the k-means clustering algorithm in RGB space by following those steps. You are welcome to implement it from scratch without the skeleton code.
  4. Test your algorithm by segmenting the image segmentation.jpg with k=3.
  5. Try different random initializations and show the corresponding results.
  6. Comment on your different segmentation results.

Matlab Notes: Matlab has several functions that can assist with the calculations so that you do not have to process the data in for loops. For example, there is pdist2, the mean function can process data row-wise or column-wise if specified properly, and there are ways to perform sub-assignments using either a binary array or the list of indices themselves. You should take advantage of these opportunities to keep your code compact; a minimal sketch follows.
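Below is a minimal k-means sketch in RGB space illustrating the vectorized style described above. It assumes the Statistics and Machine Learning Toolbox (for pdist2); segmentation.jpg is the image named in the task, and the variable names are only illustrative.

  % Minimal k-means in RGB space (a sketch, not a reference implementation).
  img    = double(imread('segmentation.jpg')) / 255;   % H x W x 3, values in [0,1]
  pixels = reshape(img, [], 3);                        % N x 3 matrix of RGB values
  k      = 3;
  mu     = pixels(randperm(size(pixels,1), k), :);     % random initialization

  for iter = 1:50
      D = pdist2(pixels, mu);                 % N x k distances to the current centers
      [~, labels] = min(D, [], 2);            % assign each pixel to its nearest center
      for j = 1:k
          % row-wise mean; a robust version should guard against empty clusters
          mu(j,:) = mean(pixels(labels == j, :), 1);
      end
  end

  segmented = reshape(mu(labels, :), size(img));       % recolor each pixel by its center
  image(segmented); axis image;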

Week #2: Object Recognition

  1. Study the bag-of-words approach for classification/recognition tasks.
  2. We begin by implementing a simple but powerful recognition system to classify faces and cars.
  3. Check here for the skeleton code. First, follow the README to set up the dataset and the vlfeat library.
  4. In our implementation, you will find the vlfeat library very useful. One may use vl_sift, vl_kmeans, and vl_kdtreebuild.
  5. Now, use the first 40 images in both categories for training.
  6. Extract SIFT features from each image.
  7. Derive k codewords with the k-means clustering from Week #1.
  8. Compute the histogram of codewords using the kd-tree algorithm in vlfeat.
  9. Use the remaining 50 images in both categories to test your implementation.
  10. Report the accuracy and computation time for different k. (A sketch of the overall pipeline follows this list.)
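The overall pipeline might look like the sketch below. It assumes vlfeat is on the path (run vl_setup first) and that imgTrain is a hypothetical cell array holding the training images; adapt the names to the skeleton code.

  k = 50;                                    % number of codewords (try several values)

  % 1. Extract SIFT descriptors from every training image.
  allDesc = [];
  for i = 1:numel(imgTrain)
      I = single(rgb2gray(imgTrain{i}));     % vl_sift expects a single-precision grayscale image
      [~, d] = vl_sift(I);                   % d is 128 x (number of features)
      allDesc = [allDesc, single(d)];        % pool the descriptors (slow but simple)
  end

  % 2. Cluster the pooled descriptors into k codewords.
  centers = vl_kmeans(allDesc, k);

  % 3. Build a kd-tree over the codewords for fast nearest-codeword lookup.
  kdtree = vl_kdtreebuild(centers);

  % 4. Codeword histogram for one image (repeat for every training and test image).
  [~, d] = vl_sift(single(rgb2gray(imgTrain{1})));
  idx = vl_kdtreequery(kdtree, centers, single(d));   % nearest codeword per descriptor
  h   = histcounts(double(idx), 0.5:1:k+0.5);         % k-bin histogram of codewords
  h   = h / sum(h);                                   % normalize so image size does not matter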

Week #3: Spatial Pyramid Matching (SPM)
Objects usually have different properties across spatial scales, even though they may appear similar at any one given scale. Consequently, differentiation of objects is often improved by concatenating feature vectors or agglomerating feature descriptors computed at different scales. This set of activities will explore how well that can work.

  1. Study Spatial Pyramid Matching, which can improve the BoW approach by concatenating histogram vectors.
  2. We will implement a simplified version of SPM on top of your Week #2 system.
  3. First, for each training image, divide it equally into a 2 × 2 grid of spatial bins.
  4. Second, for each of the 4 bins, extract the SIFT features and compute the histogram of codewords as in Week #2.
  5. Third, concatenate the 4 histogram vectors in a fixed order. (Hint: the resulting vector has 4k dimensions.)
  6. Fourth, concatenate the vector you computed in Week #2 with this vector (weight both by 0.5 before concatenating).
  7. Finally, use this 5k-dimensional representation and re-run the training and testing.
  8. Compare the results from Week #3 and Week #2, and explain what you observe. (A sketch of the concatenation step follows this list.)
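A sketch of the simplified SPM feature is below. It reuses the Week #2 machinery; bowHistogram is a hypothetical helper that returns the 1 x k codeword histogram of an image (or image region).

  [h, w, ~] = size(I);                         % I is one training (or test) image
  rows = {1:floor(h/2), floor(h/2)+1:h};
  cols = {1:floor(w/2), floor(w/2)+1:w};

  % Histograms of the four spatial bins, concatenated in a fixed order (4k dims).
  hFine = [];
  for r = 1:2
      for c = 1:2
          patch = I(rows{r}, cols{c}, :);
          hFine = [hFine, bowHistogram(patch)];
      end
  end

  % Whole-image histogram from Week #2 (k dims), then the weighted concatenation (5k dims).
  hCoarse = bowHistogram(I);
  spm = [0.5 * hCoarse, 0.5 * hFine];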

Week #4: Base Feature Exploration
At this point, you should have a good object recognition system in hand, capable of differentiating input objects through their words. To arrive at the system, we employed the SIFT descriptor, which is but one means to generate an identifier vector (one could even say a hash vector) from an image region. Many more feature descriptors have been invented since SIFT. Let's see if we can explore other ones.

  1. This week, try to replace SIFT with another feature descriptor. For example, you may want to see how SURF or HOG works. (One possible swap is sketched after this list.)
  2. Select one feature descriptor, rerun the training and testing, and show your comparison results and observations.
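If you have MATLAB's Computer Vision Toolbox, one possible swap is SURF via detectSURFFeatures / extractFeatures. The sketch below only replaces the descriptor step; everything downstream (vl_kmeans, the kd-tree, the histograms) stays the same.

  gray   = rgb2gray(img);                % img is one input image, as before
  pts    = detectSURFFeatures(gray);
  [d, ~] = extractFeatures(gray, pts);   % one 64-dimensional SURF descriptor per row
  d      = single(d');                   % transpose to 64 x M to match the vl_sift layout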

Module #2: Alternative Classifiers

The vanilla Bag-of-Words algorithm uses the k-nearest-neighbors (kNN) algorithm to output the final decision. As the size of the training set increases, so does the cost of running kNN. Certainly, we humans don't have that problem: we don't take longer to classify things even though our corpus of known objects grows from age 1 to age 10 and beyond. kNN doesn't try to distill any inherent structure in the data to simplify the decision process.

Here we will explore support vector machines (SVM) as a means to perform the final decision.

Week #1: Understanding SVM

  1. Go to libSVM and scroll down to Graphic Interface.
  2. See the square in the middle? Select a color by clicking on the “change” button and draw some dots within the square. Then change to another color and draw some dots again. Finally, click on the “run” button, and you will see that the SVM found a boundary separating the two groups!
  3. Here we would like to use an SVM classifier to train a model that predicts whether an incoming image is a car or a face by drawing a hyper-plane in feature space, so we no longer have to compare each new image to all the training images (the k-nearest-neighbor approach in Module #1).
  4. Go to the “Download LIBSVM” section and download the library (select the appropriate one based on your OS).
  5. Extract the compressed file, open your Matlab, and browse into the folder, say C:\libsvm
  6. Now type command » mex -setup
  7. After you select your compiler for MEX-files, get into /matlab folder » cd('C:\libsvm\matlab')
  8. Compile it » make
  9. Okay, now you should have the SVM libraries on your computer! You can add the path with » addpath('C:\libsvm\matlab')
  10. Now, we'd like to run a toy example from here
  11. » labels = double(rand(10,1)>0.5);
  12. » data = rand(10,5);
  13. » model = svmtrain(labels, data, '-s 0 -t 2 -c 1 -g 0.1')
  14. Now you should understand the basic usage. Please carefully read the README file in the folder and run the example there. Output the results of accuracy_l and accuracy_p to demonstrate you've run the example yourself. It is very important that you have correctly installed libSVM; we will use libSVM for the next task! (A short continuation of the toy example is sketched after this list.)
  15. (Optional) For anyone interested in SVM, I highly recommend Prof. Winston's online video course.
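As a sanity check of the installation, you can close the loop on the toy example above with svmpredict (this mirrors the usage shown in the libSVM README; predicting on the training data here is only meant to confirm the library works).

  labels = double(rand(10,1) > 0.5);
  data   = rand(10,5);
  model  = svmtrain(labels, data, '-s 0 -t 2 -c 1 -g 0.1');   % C-SVC with an RBF kernel
  [predicted, accuracy, ~] = svmpredict(labels, data, model); % accuracy(1) is percent correct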

Week #2: Apply SVM to the Car and Face Dataset

  1. If you did the toy example in libSVM and understood the commands correctly, you are now ready to apply this powerful library to our previous dataset: car and face!
  2. In the toy example, data is a matrix of feature vectors and labels holds their true labels. Now we would like to use our bag-of-words features here.
  3. Generate the bag-of-words features for car and face as you did in your previous tasks (collect SIFT features, run k-means, and compute the histogram for each image; the histogram is the feature vector for each image).
  4. Mimicking the toy example, assign one label to every car image (and a different label to every face image); each image's histogram is its feature vector, just as in the toy example.
  5. You have to generate two files, one for training and one for testing.
  6. Use the commands you learned last week and report your SVM accuracy. (A sketch of this step follows the list.)
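A sketch of that step, assuming trainHists / testHists are hypothetical N x k matrices of bag-of-words histograms (one row per image) and trainLabels / testLabels hold the corresponding class labels (e.g., 1 for car, -1 for face):

  model = svmtrain(trainLabels, double(trainHists), '-s 0 -t 0 -c 1');   % linear kernel
  [pred, acc, ~] = svmpredict(testLabels, double(testHists), model);
  fprintf('SVM accuracy: %.2f%%\n', acc(1));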

Week #3: Kernel Trick

  1. So far we've applied SVM to our dataset to distinguish cars from faces. Now we are going to learn more about how SVM works.
  2. If you played with the nice SVM GUI here by generating some points in different colors, you may have observed that the separating boundary is not always a straight line. Why is that?
  3. Actually, one of the nice properties of SVM is that you can apply different “kernels” to implicitly project your data into a higher-dimensional space where the classes become separable.
  4. Please read carefully about the kernel trick. Figure 1 is an example where a straight line can separate the two groups; Figure 3 is an example where a straight line cannot separate them. Can you plot another example where you need a kernel to help you?
  5. Now, figure out how you can choose different kernels in libSVM. Write down your answer.
  6. Here we provide an easier dataset, "mnist_4_9_3000". It contains the hand-written digits “4” and “9” for you to do binary classification, as you did on car and face. Use the first 2000 images for training and 1000 for testing.
  7. You can use the following command to visualize your data
  8. » load mnist_49_3000;
  9. » [d,n] = size(x);
  10. » i = 1; % index of image to be visualized
  11. » imagesc(reshape(x(:,i),[sqrt(d),sqrt(d)])') % notice the transpose
  12. Train an SVM as you did before on this mnist_49_3000 dataset with a linear kernel. Compare the results with different kernels.
  13. Train an SVM on your car and face dataset with a linear kernel. Compare the results with different kernels. Which dataset improves more with different kernels? Why? (A sketch of the kernel comparison follows this list.)
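A sketch of the kernel comparison using libSVM's -t option (0 = linear, 1 = polynomial, 2 = RBF, 3 = sigmoid). It assumes xtrain/ytrain and xtest/ytest are hypothetical variables holding your training and test features (one row per example) and labels.

  for t = [0 1 2 3]
      opts  = sprintf('-s 0 -t %d -c 1', t);
      model = svmtrain(ytrain, xtrain, opts);
      [~, acc, ~] = svmpredict(ytest, xtest, model);
      fprintf('kernel -t %d: accuracy %.2f%%\n', t, acc(1));
  end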

Week #4: Cross Validation

  1. Besides the kernel, another important part of applying SVM correctly is selecting good hyper-parameters.
  2. If you checked the SVM GUI here, you might have noticed that different values of C lead to different performance.
  3. Please quickly review the SVM material again and explain what the C parameter is here.
  4. One way to select good hyper-parameters is to apply “cross validation”.
  5. Please read carefully about cross validation. The idea is mainly to leave some of the training data untouched and use it to test your selected parameters, then repeat this step using a different untouched portion of the training data.
  6. Now, apply cross validation, using 10% of your data for validation each time. Try it on mnist_49_3000. Test C = 10^-2, 10^-1, 1, 10^1, 10^2; which one gives you the best performance? (A sketch using libSVM's built-in cross validation follows this list.)
  7. What is the difference between K-fold cross validation and LOO (leave-one-out)?
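libSVM has built-in n-fold cross validation via the -v option; 10-fold corresponds to holding out 10% of the training data in each fold. The sketch below assumes the same hypothetical xtrain/ytrain variables as before; with -v, svmtrain returns the cross-validation accuracy instead of a model.

  Cs    = 10.^(-2:2);                              % C = 10^-2 ... 10^2
  cvAcc = zeros(size(Cs));
  for i = 1:numel(Cs)
      opts     = sprintf('-s 0 -t 2 -c %g -v 10', Cs(i));
      cvAcc(i) = svmtrain(ytrain, xtrain, opts);   % returns CV accuracy, not a model
  end
  [bestAcc, bestIdx] = max(cvAcc);
  fprintf('best C = %g (CV accuracy %.2f%%)\n', Cs(bestIdx), bestAcc);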

ECE4580 Learning Modules
