CSCC11: Classification

Instructions

Download the starter file from the course webpage. The file requires the same login and password as the video lectures. Once downloaded and unpacked, a directory called Exam will exist. Please do not change the directory structure or the headers of functions in the Python files. All modified Python files, along with any other files you generate (e.g., plots), should be left in the Exam directory (which you'll submit). Remember to read all instructions and comments in the Python files very carefully.

Written work

For written answers you can either write on paper or use an editor that supports equations (e.g., Word or LaTeX). If writing on paper, please scan or photograph your work and save it as a PNG or PDF file (and make sure it is easily legible). If formatting your answers in Word or LaTeX, please save them as PDF. All such PNG or PDF files should be left in the Exam directory for submission.

Environment

You'll use the same environment you've been using for the assignments. You will need Python 3, NumPy, Matplotlib, and Pickle; you may not use any other Python packages. Please install these in advance if working on your own computer. If you'd rather work on a UTSC machine, remote login to one of these lab machines:

IC lab (ranging from 01 to 50): i406-[01-50]
BV lab (ranging from 01 to 40): b473-[01-40]

Submission

Submit your exam on the Mathlab server (as with the assignments). Do not modify the Exam directory tree or the method/function headers therein.

Q1. Classification with Gaussian Class Conditionals
In Chapter 8.4 of the online course lecture notes, we introduced the Gaussian class-conditional (GCC) model for classification. In the two-class case, the generative model for a data point x combines a class prior with a Gaussian class-conditional likelihood, where c is the class random variable; it takes the value 0 for one class and 1 for the other. The decision function for the GCC model is the log posterior ratio.
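
For reference, one standard way to write the two-class GCC model and its decision function is shown below; the symbols (a prior probability π for class 1, and mean μ_j and covariance Σ_j for the class-j Gaussian) are illustrative and may differ slightly from the notation in the lecture notes.

    P(c = 1) = \pi, \qquad P(c = 0) = 1 - \pi
    p(\mathbf{x} \mid c = j) = \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j), \qquad j \in \{0, 1\}
    a(\mathbf{x}) = \ln \frac{P(c = 1 \mid \mathbf{x})}{P(c = 0 \mid \mathbf{x})}
                  = \ln p(\mathbf{x} \mid c = 1) - \ln p(\mathbf{x} \mid c = 0) + \ln \frac{\pi}{1 - \pi}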

To learn the model from N data points, we estimate the class priors. For GCC models the likelihoods are Gaussian, so we also estimate the class means and covariances.
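
Concretely, the maximum-likelihood estimates take the usual form (again with illustrative notation), where N_j is the number of training points belonging to class j:

    \hat{P}(c = j) = \frac{N_j}{N}, \qquad
    \hat{\boldsymbol{\mu}}_j = \frac{1}{N_j} \sum_{i : c_i = j} \mathbf{x}_i, \qquad
    \hat{\boldsymbol{\Sigma}}_j = \frac{1}{N_j} \sum_{i : c_i = j} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}_j)(\mathbf{x}_i - \hat{\boldsymbol{\mu}}_j)^\top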

For the problem below you are asked to implement the GCC model. You will then apply your algorithm to three datasets, each having a training set and a test set. You will train this model, visualize the Gaussians, and identify the properties of the GCC model. Starter code is in the directory gcc.

Implementing GCC: You should implement the GCC model by completing the methods train and predict in gcc.py:

train fits the GCC model. The means and covariances for the likelihoods should be stored in self.means and self.covariances. The learned priors are stored in self.priors.
predict takes inputs and computes the probability of each input being assigned to each class.
You can test your code by running debug_gcc.py. It provides simple test cases and feedback to help debug your code. Once you are confident with your implementation, run train_gcc.py on the three datasets mentioned above by modifying the dataset variable in the main block. This code trains a GCC model on the specified dataset, computes the test accuracy, and displays the Gaussian likelihoods with the training and test data.
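
For intuition, a minimal sketch of the kind of computation predict performs is given below. The function name, argument layout, and return format here are assumptions made for illustration; follow the actual class structure and method headers in gcc.py.

import numpy as np

def gcc_class_posteriors(X, means, covariances, priors):
    """Illustrative sketch only (not the required gcc.py interface): compute
    P(c = k | x) for every input and every class under a trained GCC model.

    X:           (N, D) array of inputs
    means:       length-K list of (D,) mean vectors
    covariances: length-K list of (D, D) covariance matrices
    priors:      (K,) array of class priors
    """
    N, D = X.shape
    K = len(priors)
    log_joint = np.zeros((N, K))
    for k in range(K):
        diff = X - means[k]                                # (N, D)
        cov_inv = np.linalg.inv(covariances[k])
        _, logdet = np.linalg.slogdet(covariances[k])
        mahal = np.sum((diff @ cov_inv) * diff, axis=1)    # (N,)
        log_lik = -0.5 * (D * np.log(2.0 * np.pi) + logdet + mahal)
        log_joint[:, k] = log_lik + np.log(priors[k])      # log p(x, c = k)
    # Normalize in the log domain for numerical stability to get P(c = k | x).
    log_joint -= log_joint.max(axis=1, keepdims=True)
    posteriors = np.exp(log_joint)
    posteriors /= posteriors.sum(axis=1, keepdims=True)
    return posteriors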

Provide a brief answer (no more than 3 or 4 sentences) for each of the following questions in Questions.txt:

Does GCC perform well on generic_1? Why or why not?
Does GCC perform well on generic_2? Why or why not?
Does GCC perform well on generic_3? Why or why not?
If input feature vectors have length D, how many parameters do we need to learn for each Gaussian?
Suppose you decide to use Naive Bayes with GCC. In words, what would you have to change in your code? How many parameters would you need to learn for each Gaussian in this case?
Suppose certain elements of the input feature vectors are highly correlated. Why is this problematic for Naive Bayes?
How could you transform the data to mitigate this problem with Naive Bayes?
Q2. Multi-Line Regression with Expectation-Maximization
In Assignment 4 we fit a Gaussian mixture model (GMM) to data using the EM algorithm. The mixture model provides a way to assign points to clusters in a probabilistic sense; that is, we compute the posterior probability that point i is owned by (i.e., assigned to) the jth of the K components of the mixture. Given the ownership probabilities, we then compute weighted means and covariances, the parameters of the Gaussian components. This works well for density estimation and clustering, but EM is actually much more general. To that end, here we're going to consider the use of EM for parametric model fitting, i.e., a mixture of regressors.
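
As a reminder, the ownership (responsibility) of the jth component for point x_i in the GMM has the following form, where π_j, μ_j, and Σ_j are the mixing probability, mean, and covariance of component j (symbols illustrative; your A4 notation may differ):

    \gamma_{ij} = P(z_i = j \mid \mathbf{x}_i)
                = \frac{\pi_j \, \mathcal{N}(\mathbf{x}_i;\, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_i;\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}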

Suppose you have a bunch of points in the plane, like those shown in the plot below. Normally you might not know how many lines are in any given dataset, but here it seems clear enough that you should be able to fit two lines. But to do this you need to figure out a way to determine which points belong to which line. And this is where the EM algorithm comes in, since we can use the posterior probability that a given point comes from a given line. This will allow us to simultaneously regress to multiple lines (not just two).

To find the model parameters we minimize the negative log likelihood, subject to the constraint that the mixing probabilities sum to 1. To enforce this constraint we add a Lagrange multiplier.
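
As one way to write this objective, suppose line j is parameterized by slope a_j and intercept b_j, with mixing probability π_j and a shared noise variance σ² (this parameterization is assumed here for illustration; use whatever form the notes and starter code define). Then, for data {(x_i, y_i)}:

    E = -\sum_{i=1}^{N} \ln \sum_{j=1}^{K} \pi_j \, \mathcal{N}\big(y_i;\, a_j x_i + b_j,\, \sigma^2\big)
        + \lambda \Big( \sum_{j=1}^{K} \pi_j - 1 \Big)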

We'll treat the noise variance σ² as a hyper-parameter, and set it manually in the code.

The EM algorithm alternates between inferring the posterior over the assignment of points to lines in the E-step, and estimating the model parameters in the M-step. For the E-step, as with the GMM in A4, we'll write the posterior ownership probability.
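
With the illustrative parameterization above, this ownership probability has the same form as the GMM responsibility, with each Gaussian now centred on the corresponding line's prediction:

    \gamma_{ij} = \frac{\pi_j \, \mathcal{N}\big(y_i;\, a_j x_i + b_j,\, \sigma^2\big)}{\sum_{k=1}^{K} \pi_k \, \mathcal{N}\big(y_i;\, a_k x_i + b_k,\, \sigma^2\big)}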

Written Part
The questions below ask you to derive the equations for the E-step and the M-step. Save your answers as written.pdf in the directory em_line_fitting. For example, you could take pictures of written work on paper, or format it electronically, e.g., with LaTeX.

Programming Part
Implement multi-line fitting using the EM algorithm described above. The starter code is in directory em_line_fitting. The associated methods are:

_fit: fits K linear models to training data. Each iteration contains the E-step and the M-step. The iteration terminates when there is negligible change in the ownership probabilities.
_e_step: computes the ownership probabilities for each data point.
_m_step: computes the mixing probabilities and the parameters of each of the linear models.
_visualize: visualizes training data with fitted lines and ownership probabilities for each point.
You will need to complete the methods _e_step and _m_step. The equations for these steps are given above. The starter code includes a way to specify initial conditions for the optimization; this includes the initial line parameters and the mixing probabilities.
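
To make the ownership computation concrete, here is a minimal sketch of the E-step written as a standalone function. The parameterization (separate slope and intercept arrays, a shared stddev) and all names are assumptions for illustration; adapt the idea to the actual _e_step header and line parameterization in the starter code.

import numpy as np

def line_ownership_probabilities(x, y, slopes, intercepts, mixing, stddev):
    """Illustrative E-step sketch (not the required _e_step interface):
    posterior probability that each point is owned by each line.

    x, y:       (N,) arrays of point coordinates
    slopes:     (K,) array of line slopes
    intercepts: (K,) array of line intercepts
    mixing:     (K,) array of mixing probabilities (sums to 1)
    stddev:     scalar noise standard deviation (the hyper-parameter)
    """
    # Residual of every point under every line: shape (N, K).
    residuals = y[:, None] - (x[:, None] * slopes[None, :] + intercepts[None, :])
    # Unnormalized log posterior: log pi_j - residual^2 / (2 sigma^2).
    # The constant -0.5 * log(2 * pi * sigma^2) is shared by all lines,
    # so it cancels when we normalize below.
    log_unnorm = np.log(mixing)[None, :] - 0.5 * (residuals / stddev) ** 2
    # Normalize each row in the log domain for numerical stability.
    log_unnorm -= log_unnorm.max(axis=1, keepdims=True)
    ownership = np.exp(log_unnorm)
    ownership /= ownership.sum(axis=1, keepdims=True)
    return ownership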

You can run train_em_line_fitting.py to debug your code with pre-defined test cases (test_case = i, where i = 0, 1, 2, 3). This script checks whether your learned model parameters, mixing probabilities, and ownership probabilities are correct. If you set visualize = True, the script will display at each iteration the current estimates of the fitted lines, along with a plot that shows, for each model, the ownership probability for each data point. Once you're confident in the code, we'll analyze the results.

The EM algorithm will find one of several local optima. Some are good, and some are not. The chance of getting "stuck" in a poor local optimum often depends on the initial guess and the value of the model standard deviation. To explore this, we've pre-specified several test cases with different numbers of lines and values of σ. Run train_em_line_fitting.py for each test case, setting test_case to be 0 through 4. If you set test_case = False, you can play with the algorithm by manually changing the seed, stddev, and num_lines variables.

When your code works correctly, the first four test cases should produce the following plots. In these plots, each component (line) and its associated ownership probabilities (the bar graph) have a unique colour.

Now, reflect on these estimates and how they change with each iteration as the model fit evolves from the initial guess to the final estimate, and provide brief answers (no more than 5 sentences) to the following questions in Questions.txt:

In test case 0, the two lines fit well. The ownership probabilities are either 0 or 1 near the right and left sides of the plot, but near 0.5 closer to the middle. Explain why this occurs and why it’s useful.
In test case 1, the lines do not fit the data well. Explain what might be the source of the problem in this case.
Test cases 2 and 3 fit two and three lines, but in both cases not all lines appear to fit the data well. Select one of these two test cases and explain what might be the source of the problem, making reference to the model variance and the ownership probabilities.