Recommended reading from Statistics, Data Mining, and Machine Learning
in Astronomy ("SDMML") and Numerical Recipes ("NR"):
SDMML: Chapters 6, 7
Classification and machine learning, continued from last time: The digits files in the directory /u/inscc/bromley/courses/ap7730/data/, train-images-idx3-ubyte [images] and train-labels.idx1-ubyte [labels], represent 60,000 handwritten digits (0-9). Following the code in ~bromley/courses/ap7730/examples/digits.py, run PCA on the data, displaying the first 40x40 images.
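The example script is not reproduced here, so the following is only a sketch of how such files could be read with NumPy. It assumes the standard IDX layout (a 4-byte header giving the data type and number of dimensions, then big-endian 4-byte dimension sizes, then unsigned-byte data); the function name read_idx is hypothetical and is not taken from digits.py.

```python
import struct
import numpy as np

def read_idx(path):
    """Read an unsigned-byte IDX file into a NumPy array (hypothetical helper)."""
    with open(path, "rb") as f:
        # Header: two zero bytes, a data-type code, and the number of dimensions.
        zero, dtype_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or dtype_code != 0x08:
            raise ValueError("expected an unsigned-byte IDX file")
        # Each dimension size is stored as a 4-byte big-endian integer.
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)
```

For an image file the result would have shape (n_images, n_rows, n_cols); calling .reshape(len(images), -1) on it gives one flattened row per image, which is the usual input layout for PCA.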
Modify the code to produce these same images, reconstructed from the top 10 elements.
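One possible reading of this step, assuming "top 10 elements" means the ten leading principal components, is sketched below with a plain SVD. The function name pca_reconstruct and the synthetic array X are placeholders; the real input would be the flattened digit images.

```python
import numpy as np

def pca_reconstruct(X, n_keep):
    """Project the rows of X onto the top n_keep principal components and map back."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:n_keep]
    coeffs = Xc @ top.T          # PCA coefficients, shape (n_samples, n_keep)
    return coeffs @ top + mean   # approximation back in pixel space

# Synthetic stand-in for flattened images (assumption: 200 samples, 64 pixels each).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
X10 = pca_reconstruct(X, 10)
```

Each row of X10 can be reshaped to the original image size and displayed with the same plotting code as the unreconstructed images. Keeping more components should lower the reconstruction error, which is a quick sanity check on the result.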
Plot the coefficients of the first two principal components in a scatter plot. Color your points according to their true label (e.g., all the 0's are red, 1's are green, etc.).
Make a cut according to the values of the first two PCA coefficients that isolates the zeros. Plot samples of your data--images that are classified as a zero according to your cut.
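A cut of this kind can be written as a boolean mask, and its quality summarized by purity (the fraction of selected images that really are zeros) and completeness (the fraction of all zeros that were selected). The sketch below uses made-up coefficients and hypothetical thresholds purely for illustration; the actual thresholds have to be read off your own scatter plot.

```python
import numpy as np

def cut_quality(selected, labels, target=0):
    """Return (purity, completeness) of a boolean selection for one target label."""
    is_target = labels == target
    n_hit = np.count_nonzero(selected & is_target)
    purity = n_hit / max(np.count_nonzero(selected), 1)
    completeness = n_hit / max(np.count_nonzero(is_target), 1)
    return purity, completeness

# Made-up PCA coefficients and labels (not real digit data).
pc1 = np.array([5.0, 6.0, 4.5, 0.2, -1.0, 0.5, 4.8])
pc2 = np.array([0.1, -0.3, 0.2, 1.0, -2.0, 0.3, 0.0])
labels = np.array([0, 0, 0, 1, 7, 3, 6])

# Hypothetical rectangular cut in the (pc1, pc2) plane.
selected = (pc1 > 4.0) & (np.abs(pc2) < 1.0)
purity, completeness = cut_quality(selected, labels)
```

Indexing the image array with the same mask, as in images[selected], gives the sample of images that the cut classifies as zeros.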
Run a k-means classifier to derive an image label. Plot the first 2 PCs and color them according to their cluster label. How do the k-means labels compare to the written digits?
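For reference, the k-means procedure itself (Lloyd iterations: assign each point to its nearest center, then move each center to the mean of its members) can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the course's implementation; a library routine would normally be preferable, and the names kmeans and nearest_center are placeholders.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest center for every row of X (squared Euclidean distance)."""
    # Builds an (n_samples, k, n_features) array, so keep n_features small,
    # e.g. a handful of PCA coefficients rather than raw pixels.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns (centers, cluster index per sample)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        assign = nearest_center(X, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = X[assign == j]
            if len(members):  # an empty cluster keeps its previous center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, nearest_center(X, centers)

# Synthetic stand-in for PCA coefficients: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])
centers, assign = kmeans(X, k=2)
```

Note that the cluster indices are arbitrary: cluster 3 has no reason to correspond to the digit 3, so any comparison with the written digits needs a mapping from cluster index to digit.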
Try "training" the classifier on the first half of the images to see how well cluster labels are predicted on the remaining half. How many mistakes are made, if the goal is to classify the true (handwriting) labels?
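One way to count mistakes, sketched here under the assumption that cluster centers have already been fitted on the training half, is to give each cluster the most common true label among its training members and then label each test image by its nearest center. The centers and data below are toy values, and the helper names are hypothetical.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest center for every row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def majority_labels(assign, labels, k):
    """Map each cluster index to the most common true label among its members."""
    mapping = np.full(k, -1)  # -1 marks a cluster with no training members
    for j in range(k):
        members = labels[assign == j]
        if len(members):
            mapping[j] = np.bincount(members).argmax()
    return mapping

# Toy values: pretend these centers came from k-means on the training half.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
X_train = np.array([[0.1, -0.2], [0.3, 0.1], [9.8, 10.1], [10.2, 9.9]])
y_train = np.array([4, 4, 9, 9])
X_test = np.array([[0.0, 0.4], [9.5, 10.0], [4.0, 4.0]])
y_test = np.array([4, 9, 9])

mapping = majority_labels(nearest_center(X_train, centers), y_train, k=2)
y_pred = mapping[nearest_center(X_test, centers)]
n_mistakes = np.count_nonzero(y_pred != y_test)
```

With real data, X_train and X_test would be the PCA coefficients of the first and second halves of the images, and n_mistakes divided by the size of the test half gives the error rate.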
Repeat, using labels 4 and 9.