Assignment 4 (a04) Learning/Classification/Clustering


In the following two exercises consider the handwritten digits data set (as in previous lab assignments), in directory /u/inscc/bromley/courses/ap7730/data:
data/train-images-idx3-ubyte data/train-labels.idx1-ubyte data/t10k-images-idx3-ubyte data/t10k-labels-idx1-ubyte
(source: http://yann.lecun.com/exdb/mnist/). The first two files are a "training set": binary data with 60,000 28x28 pixel images of handwritten digits 0-9, and the corresponding labels. Code in the class examples directory (digit*.py) provides examples of unpacking these files. The last two files are the 10k test images and their labels.
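
For reference, the IDX layout described at the source page is a big-endian header followed by unsigned bytes: a 4-byte magic number (2051 for images, 2049 for labels), a 4-byte item count and, for images, 4-byte row and column counts. Here is a minimal sketch of a reader using only the standard library. The function names read_idx_labels and read_idx_images are hypothetical; they are not the ones in the class digit*.py examples, which you should still consult.

```python
import struct


def read_idx_labels(path):
    """Return a list of integer labels from an IDX1 (unsigned byte) file."""
    with open(path, "rb") as f:
        # Header: two big-endian unsigned 32-bit integers.
        magic, count = struct.unpack(">II", f.read(8))
        if magic != 2049:
            raise ValueError(f"unexpected magic number {magic} in {path}")
        data = f.read(count)
    if len(data) != count:
        raise ValueError(f"{path} is truncated")
    return list(data)  # iterating over bytes yields ints 0-255


def read_idx_images(path):
    """Return (images, rows, cols); each image is a bytes object of rows*cols pixels."""
    with open(path, "rb") as f:
        # Header: magic number, image count, rows, columns.
        magic, count, rows, cols = struct.unpack(">IIII", f.read(16))
        if magic != 2051:
            raise ValueError(f"unexpected magic number {magic} in {path}")
        size = rows * cols
        data = f.read(count * size)
    if len(data) != count * size:
        raise ValueError(f"{path} is truncated")
    # Slice the flat pixel buffer into one bytes object per image (row-major).
    return [data[i * size:(i + 1) * size] for i in range(count)], rows, cols
```

In practice you would probably convert the pixel data to a NumPy array for speed; the plain-bytes version above is just meant to show the file layout.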

The code you submit for credit in these exercises should use an absolute path to open these files (i.e., have your code read directly from these files, not a local copy).

Exercise 1.

Write a code, crazy8s.py, that identifies all occurrences of "8" in the test images of handwritten digits. Use the training images and their corresponding labels to construct the classification scheme. Have your code print out the number of 8's you identified in the test set, the fraction of missed 8's, and the number of false positives. (Use the test labels *only* in this way, to check how well your classification algorithm performs!)
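
The three numbers requested above can be computed from the classifier's yes/no answers and the test labels. The sketch below is one way to do the bookkeeping; the function name score_eights is hypothetical, and it assumes "fraction of missed 8's" means the number of true 8's not flagged divided by the total number of true 8's.

```python
def score_eights(predicted_is_8, true_labels):
    """Summarize a binary '8' detector against the true test labels.

    predicted_is_8: sequence of bools, one per test image.
    true_labels:    sequence of ints 0-9, used only for scoring.
    Returns (number identified as 8, fraction of true 8's missed, false positives).
    """
    pairs = list(zip(predicted_is_8, true_labels))
    n_identified = sum(1 for p, _ in pairs if p)
    n_true = sum(1 for _, t in pairs if t == 8)
    # Missed: a true 8 that the classifier did not flag.
    n_missed = sum(1 for p, t in pairs if t == 8 and not p)
    # False positive: flagged as 8 but the label says otherwise.
    n_false_pos = sum(1 for p, t in pairs if p and t != 8)
    frac_missed = n_missed / n_true if n_true else 0.0
    return n_identified, frac_missed, n_false_pos


if __name__ == "__main__":
    # Tiny made-up example, not real data.
    print(score_eights([True, False, True, False], [8, 8, 3, 1]))
```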

Exercise 2.

Write a data classification code, digitsclass.py, for all 10 digits using the training data for handwritten digits described above. Apply the classification to the 10k test images. Report the fraction of test images that were correctly identified.
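
As a point of comparison, here is a sketch of about the simplest classifier that could work: a nearest-centroid scheme, which averages the training images of each digit and assigns a test image to the digit whose average image is closest. This is a baseline I am assuming for illustration, not necessarily the method you should submit, and the function names are hypothetical. Images are taken to be flat sequences of pixel values.

```python
def class_centroids(images, labels, n_classes=10):
    """Mean image (a list of floats) for each class label 0..n_classes-1."""
    n_pix = len(images[0])
    sums = [[0.0] * n_pix for _ in range(n_classes)]
    counts = [0] * n_classes
    for img, lab in zip(images, labels):
        counts[lab] += 1
        row = sums[lab]
        for k, v in enumerate(img):
            row[k] += v
    # A class with no training examples keeps an all-zero centroid.
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]


def classify(img, centroids):
    """Index of the centroid closest to img in squared Euclidean distance."""
    dists = [sum((v - m) ** 2 for v, m in zip(img, cen)) for cen in centroids]
    return dists.index(min(dists))


def fraction_correct(images, labels, centroids):
    """Fraction of images whose predicted class matches the true label."""
    hits = sum(1 for img, lab in zip(images, labels)
               if classify(img, centroids) == lab)
    return hits / len(labels)
```

Pure Python loops like these are slow on 60,000 images of 784 pixels each; the same logic vectorizes naturally with NumPy arrays.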

Exercise 3.

The Sloan Digital Sky Survey (SDSS) is mapping hundreds of millions of stars and galaxies--Prof. Gail Zasowski is a leader in this effort, currently serving as the spokesperson for this large-scale, international project.

The file with absolute path

/u/inscc/bromley/courses/ap7730/data/sdssgax_stripe82_z0.1.csv
contains the "redshift" and angular sky positions of over 20,000 galaxies in SDSS's "Stripe 82", a well-studied region of the sky. Because the universe is expanding, redshift (the relative wavelength shift of atomic spectra seen in extragalactic sources) is proportional to distance from us; multiplying the redshift (unitless) by c (the speed of light) gives a radial velocity, and dividing that velocity by the "expansion rate of the universe"--the Hubble constant H (about 70 km/s/Mpc, where 1 Mpc = 3.08e22 meters)--gives a physical distance. The sky position angles "ra" (right ascension) and "dec" (declination) are equivalent to longitude and latitude. Stripe 82 observes galaxies seen in a 3 degree swath containing the Earth's equatorial plane, spanning 120 degrees.
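
The redshift-to-distance conversion described above is just v = c z followed by d = v / H. A minimal sketch, assuming H = 70 km/s/Mpc as quoted and using hypothetical function names; the linear relation is a low-redshift approximation, which is fine for this sample (z below 0.1).

```python
C_KM_S = 299792.458  # speed of light in km/s
H0 = 70.0            # Hubble constant in km/s/Mpc (approximate value quoted above)


def radial_velocity(z):
    """Recession velocity in km/s from a (small) redshift z."""
    return C_KM_S * z


def distance_mpc(z, h0=H0):
    """Distance in Mpc from redshift z, using the linear Hubble relation."""
    return radial_velocity(z) / h0


if __name__ == "__main__":
    # The outer edge of the sample, z = 0.1, is roughly 430 Mpc away.
    print(radial_velocity(0.1), distance_mpc(0.1))
```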

The plots below show maps of galaxies in Stripe 82 within a redshift of 0.1. They were constructed with example code stripe82plot.py in the examples directory. The left plot is the whole sample, while the right plot is a zoom-in. For reference, we are at the origin.

Notice the finger-like patterns in the zoom-in. They exist because the observed galaxy redshifts are affected not only by cosmological expansion but also by Doppler shifts from "peculiar motion". In dense galaxy clusters, peculiar motion can be significant (a few hundred km/s) and can distort the redshift "distance" map.
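
To get a feel for the size of this distortion, the sketch below converts a line-of-sight peculiar velocity into the redshift offset it produces and into the resulting error in the inferred distance. It assumes the non-relativistic Doppler formula (offset = v / c) and H = 70 km/s/Mpc; the function names are hypothetical.

```python
C_KM_S = 299792.458  # speed of light in km/s
H0 = 70.0            # Hubble constant in km/s/Mpc (assumed value)


def redshift_offset(v_pec_km_s):
    """Redshift shift caused by a line-of-sight peculiar velocity (v << c)."""
    return v_pec_km_s / C_KM_S


def apparent_distance_error_mpc(v_pec_km_s, h0=H0):
    """Error in the redshift-inferred distance, in Mpc, from that velocity."""
    return v_pec_km_s / h0


if __name__ == "__main__":
    # A 300 km/s peculiar velocity shifts the redshift by about 0.001 and
    # moves the galaxy by about 4 Mpc along the line of sight in the map.
    print(redshift_offset(300.0), apparent_distance_error_mpc(300.0))
```

A few Mpc along the line of sight is much larger than the physical size of a cluster, which is why clusters appear stretched into fingers pointing at the origin.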

For this exercise, write a code that finds the finger-like structures associated with galaxy clusters and identifies the mean redshift of each one, as well as its velocity dispersion (which is a measure of the cluster mass). Submit a script, stripe82.py, that does this at least for galaxies in the zoom-in region, listing the mean location in ra, dec, and redshift, and the velocity dispersion, of each cluster with 5 or more members. Also submit a plot (like the ones above) that shows the clusters you identified. You might also want to plot the non-cluster galaxies using a faint, light color behind the cluster galaxy points.
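
Once the members of a cluster are known, the per-cluster numbers are simple statistics. The sketch below is one way to compute them, assuming the velocity dispersion is taken as c times the sample standard deviation of the member redshifts (a low-redshift approximation; some authors also divide by 1 + mean redshift). The function name cluster_summary is hypothetical.

```python
import statistics

C_KM_S = 299792.458  # speed of light in km/s


def cluster_summary(ra, dec, z):
    """Mean position and line-of-sight velocity dispersion of one cluster.

    ra, dec: member sky positions in degrees; z: member redshifts.
    Assumes ra does not wrap across 0/360 degrees within the cluster;
    if it does, shift ra to a continuous range before averaging.
    """
    return {
        "ra": statistics.fmean(ra),
        "dec": statistics.fmean(dec),
        "z": statistics.fmean(z),
        # Sample standard deviation needs at least 2 members.
        "sigma_v_km_s": C_KM_S * statistics.stdev(z),
    }


if __name__ == "__main__":
    # Made-up member list for illustration only.
    print(cluster_summary([10.0, 11.0, 12.0], [0.1, 0.2, 0.3],
                          [0.050, 0.051, 0.049]))
```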

I recommend a friends-of-friends algorithm (Huchra & Geller 1982), using the fof2d.py + cluster2d example codes. The example code is 2-D, but this problem is 3-D. I recommend first extending the 2-D code to 3-D using a single linking parameter (radius), and then extending this to the case here, where the linking parameter is different in the radial and perpendicular directions. As you can see in the figure, a decent linking parameter in the radial direction is a Delta_rs of about 0.002, while 4e-5 is better in the perpendicular direction (plane of sky), since this translates to O(100) kiloparsecs, typical of the spacing between galaxies in groups.
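
To make the two-parameter linking idea concrete, here is a minimal friends-of-friends sketch. It is not the fof2d.py example code; it is a brute-force version written under my own assumptions: two galaxies are "friends" if their redshift difference is within link_radial AND their perpendicular separation is within link_perp, where the perpendicular separation is measured in redshift units as the mean redshift of the pair times their angular separation in radians. Groups are merged with a simple union-find. All names here are hypothetical.

```python
import math


def friends_of_friends(ra_deg, dec_deg, z,
                       link_radial=0.002, link_perp=4e-5, min_members=5):
    """Return lists of member indices for groups with >= min_members galaxies."""
    n = len(z)
    # Unit vectors on the sky, used to get angular separations.
    unit = []
    for a, d in zip(ra_deg, dec_deg):
        a, d = math.radians(a), math.radians(d)
        unit.append((math.cos(d) * math.cos(a),
                     math.cos(d) * math.sin(a),
                     math.sin(d)))

    parent = list(range(n))  # union-find forest: each galaxy starts alone

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Brute-force O(N^2) pair loop; fine for the zoom-in region.
    for i in range(n):
        for j in range(i + 1, n):
            if abs(z[i] - z[j]) > link_radial:
                continue
            # Angular separation from the chord between the unit vectors.
            chord = math.dist(unit[i], unit[j])
            theta = 2.0 * math.asin(min(1.0, 0.5 * chord))
            perp = 0.5 * (z[i] + z[j]) * theta  # in redshift units
            if perp <= link_perp:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) >= min_members]
```

The "AND" of two separate cuts links galaxies inside a cylinder elongated along the line of sight; an ellipsoidal criterion is another reasonable choice. For the full 20,000-galaxy sample the pair loop gets slow, so sorting by redshift or binning on a grid first would help.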

For the record: I used SQL Skyserver with this query to download the csv file for this exercise.

bcb 03/2019