ASTR/PHYS 7730 Lab Exercise

Lab Exercises (lab11)

Recommended reading from Statistics, Data Mining, and Machine Learning in Astronomy ("SDMML") and Numerical Recipes ("NR"):

SDMML: Chapters 6, 7

Exercise 1.

Classification and machine learning, continued from last time: The digits files in the directory /u/inscc/bromley/courses/ap7730/data/, train-images-idx3-ubyte [images] and train-labels.idx1-ubyte [labels], represent 60,000 handwritten digits (0-9). Following the code in ~bromley/courses/ap7730/examples/digits.py, run PCA on the data, displaying the first 40x40 images.
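
The file names suggest the IDX binary format used by the MNIST digit set: a short big-endian header followed by unsigned bytes. digits.py already reads the data, so the sketch below is only for checking the files independently. The function name read_idx is hypothetical (it is not taken from digits.py), and the sketch assumes the files are uncompressed and store unsigned-byte data.

```python
import struct
import numpy as np

def read_idx(path):
    """Read an uncompressed IDX file of unsigned bytes into a NumPy array."""
    with open(path, "rb") as f:
        # Magic number: two zero bytes, a data-type code, and the number of dimensions.
        zero, type_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or type_code != 0x08:
            raise ValueError("not an unsigned-byte IDX file: %s" % path)
        # One big-endian 32-bit size per dimension.
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)
```

If these are the standard MNIST training files, reading the image file should give an array of shape (60000, 28, 28), which can be flattened to (60000, 784) before PCA, and reading the label file should give 60000 integers.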

Modify the code to produce these same images, reconstructed from the top 10 principal components.
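
A minimal sketch of the reconstruction step, assuming the images have been flattened to one row per image. It uses a plain NumPy SVD rather than whatever PCA routine digits.py calls, and the random array is only a stand-in for the digit data. The name pca_reconstruct is hypothetical.

```python
import numpy as np

def pca_reconstruct(X, n_components):
    """Approximate each row of X using only the leading principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components]
    coeffs = Xc @ V.T          # PCA coefficients, shape (n_images, n_components)
    return mean + coeffs @ V   # back to pixel space

# Synthetic stand-in for flattened images: 200 "images" of 64 pixels each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 64))
X += 0.01 * rng.normal(size=X.shape)

for k in (1, 5, 10):
    rms = np.sqrt(np.mean((X - pca_reconstruct(X, k)) ** 2))
    print(k, "components: RMS pixel error", rms)
```

Each reconstructed row can be reshaped back to the image dimensions and displayed with the same plotting code used for the original images.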

Plot the coefficients of the first two principal components in a scatter plot. Color your points according to their true label (e.g., all the 0's are red, 1's are green, etc.).
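
One possible way to color the scatter plot by label, assuming coeffs is an array with the first two PCA coefficients in its first two columns and labels holds the true digits. The color list and the function name scatter_by_label are illustrative choices, not part of digits.py; only the first two colors follow the example in the text.

```python
import numpy as np

# One named color per digit; red for 0 and green for 1 as in the example above.
DIGIT_COLORS = ["red", "green", "blue", "orange", "purple",
                "brown", "pink", "gray", "olive", "cyan"]

def scatter_by_label(coeffs, labels, filename="pc_scatter.png"):
    """Scatter PC 1 against PC 2, drawing each digit class in its own color."""
    import matplotlib.pyplot as plt  # imported here so the color table loads without it
    fig, ax = plt.subplots()
    for digit in np.unique(labels):
        sel = labels == digit
        ax.scatter(coeffs[sel, 0], coeffs[sel, 1], s=4,
                   color=DIGIT_COLORS[int(digit)], label=str(digit))
    ax.set_xlabel("PC 1 coefficient")
    ax.set_ylabel("PC 2 coefficient")
    ax.legend(markerscale=3, title="true label")
    fig.savefig(filename)
    plt.close(fig)
```

Plotting one class at a time gives a legend entry per digit, which makes it easier to see where the zeros sit before choosing a cut.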

Make a cut according to the values of the first two PCA coefficients that isolates the zeros. Plot samples of your data--images that are classified as a zero according to your cut.
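
A sketch of a rectangular cut and two simple checks on it, using synthetic points in place of the real PCA coefficients. The thresholds below are hypothetical and fit only the fake data; the real ones have to be read off your own scatter plot. The function names are also illustrative.

```python
import numpy as np

def zero_cut(coeffs, pc1_min, pc2_lo, pc2_hi):
    """Boolean mask: True where a point falls inside the rectangular cut."""
    pc1, pc2 = coeffs[:, 0], coeffs[:, 1]
    return (pc1 > pc1_min) & (pc2 > pc2_lo) & (pc2 < pc2_hi)

def purity_and_completeness(selected, is_zero):
    """Fraction of selected points that are zeros, and fraction of zeros selected."""
    n_good = np.sum(selected & is_zero)
    return n_good / selected.sum(), n_good / is_zero.sum()

# Synthetic stand-in: 100 "zeros" offset along PC 1, 900 other points at the origin.
rng = np.random.default_rng(0)
is_zero = np.arange(1000) < 100
coeffs = rng.normal(size=(1000, 2))
coeffs[is_zero, 0] += 5.0

selected = zero_cut(coeffs, pc1_min=2.5, pc2_lo=-3.0, pc2_hi=3.0)
purity, completeness = purity_and_completeness(selected, is_zero)
print("purity", purity, "completeness", completeness)
```

With the real data, images[selected] (assuming an array named images in the same order as coeffs) gives the images that pass the cut, from which a sample can be plotted.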

Exercise 2.

Modify your code to work on a subset of the data, those images labelled "0" and "1".

Run a k-means classifier to derive an image label. Plot the first 2 PC's and color them according to their cluster label. How do the k-means labels compare to the written digits?

Try "training" the classifier on the first half of the images to see how well cluster labels are predicted on the remaining half. How many mistakes are made, if the goal is to classify the true (handwriting) labels?
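
A sketch of the train-then-predict test, written with a small NumPy version of Lloyd's algorithm as a stand-in for a library k-means, and with two synthetic blobs in place of the PCA coefficients of the "0" and "1" images. Because k-means numbers its clusters arbitrarily, each cluster is first mapped to the most common true label among its training members. All function names here are hypothetical.

```python
import numpy as np

def kmeans_predict(X, centers):
    """Index of the nearest center for each row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans_fit(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm: return k cluster centers for the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        assign = kmeans_predict(X, centers)
        for j in range(k):
            if np.any(assign == j):  # leave the center of an empty cluster unchanged
                centers[j] = X[assign == j].mean(axis=0)
    return centers

def majority_labels(cluster_ids, true_labels, k):
    """Map each cluster index to the most common true label among its members.

    Assumes every cluster contains at least one point."""
    return np.array([np.bincount(true_labels[cluster_ids == j]).argmax()
                     for j in range(k)])

# Synthetic stand-in: two well-separated blobs playing the roles of "0" and "1".
rng = np.random.default_rng(0)
true = rng.integers(0, 2, size=400)
X = rng.normal(size=(400, 2)) + np.where(true[:, None] == 1, [3.0, 0.0], [-3.0, 0.0])

half = len(X) // 2
centers = kmeans_fit(X[:half], k=2)                      # "train" on the first half
mapping = majority_labels(kmeans_predict(X[:half], centers), true[:half], k=2)
predicted = mapping[kmeans_predict(X[half:], centers)]   # label the second half
mistakes = int(np.sum(predicted != true[half:]))
print("mistakes on second half:", mistakes, "of", len(X) - half)
```

The same counting applies unchanged to labels 4 and 9; only the subset of images fed in changes.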

Repeat, using labels 4 and 9.

Exercise 3.

In the full data set of handwritten digits, classify the images as "0" or "not zero". In this instance try both k-means and support vector machine classification. Vary the number of PCA eigenvectors to find how many you must use to get a robust classification.
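
A sketch of the scan over the number of PCA components for the support vector machine part only. The classifier is a simple linear soft-margin SVM trained by subgradient descent in NumPy, a stand-in for a library SVM such as scikit-learn's sklearn.svm.SVC, and the data are synthetic, built so that the class information is not in the highest-variance direction. Function names, step size, and regularization strength are all assumptions.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the mean and the leading principal directions (as rows)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def train_linear_svm(Z, y, lam=0.01, lr=0.05, n_iter=1000):
    """Linear soft-margin SVM by batch subgradient descent; y must be +1 or -1."""
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(n_iter):
        active = y * (Z @ w + b) < 1   # points inside the margin or misclassified
        grad_w = lam * w - (y[active, None] * Z[active]).sum(axis=0) / len(y)
        grad_b = -y[active].sum() / len(y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def accuracy_vs_components(X_train, y_train, X_test, y_test, ks):
    """Test-set accuracy of the linear SVM for each number of components in ks."""
    mean, V = fit_pca(X_train, max(ks))
    Z_train, Z_test = (X_train - mean) @ V.T, (X_test - mean) @ V.T
    results = {}
    for k in ks:
        w, b = train_linear_svm(Z_train[:, :k], y_train)
        pred = np.where(Z_test[:, :k] @ w + b >= 0, 1, -1)
        results[k] = float(np.mean(pred == y_test))
    return results

# Synthetic stand-in: +1 plays the role of "0", -1 the role of "not zero".
rng = np.random.default_rng(0)
y = np.tile([1, -1], 300)
X = 0.3 * rng.normal(size=(600, 10))
X[:, 0] = 5.0 * rng.normal(size=600)            # high variance, no class information
X[:, 1] = 2.0 * y + 0.5 * rng.normal(size=600)  # lower variance, carries the class
results = accuracy_vs_components(X[:400], y[:400], X[400:], y[400:], ks=(1, 2, 5))
print(results)
```

In this fake data set one component is not enough because the first principal direction carries no class information, which is the kind of behavior the scan is meant to reveal. In the real data the zeros are a minority of the images, so plain accuracy can look good even for a poor classifier; counting missed zeros and false zeros separately is a safer check.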