Instructor: Daniel Rockmore
Course on canvas.dartmouth.edu.
Syllabus
Introduction & Overview:
- Wednesday: Linear algebra review & Simple clustering: Vectors, inner products, projections, bases, coordinates, measurement basis vs. analysis basis, change of basis
- Friday: Linkage analysis, k-means clustering, DBSCAN; Sources: Complex Systems – A Data-Driven Approach (Chapter 4); Networks from feature vectors
Readings: The Human Costs of AI, by Sue Halpern, for Friday, 1/5.
Week 2:
1/8,10,12
Feature vectors and Networks
- Monday: Clustering and beginnings of PCA
- Wednesday: PCA; Networks from feature vectors
- Friday: Networks from feature vectors; Eigenvectors as min/max
Readings: To page 136 in "The Circle" (Dave Eggers)
Week 3:
1/17,18,19
Dimension reduction
- Monday: MLK Day
- Wednesday: "Quiz" on background linear algebra; Eigenvalues as optimization (Rayleigh-Ritz) Beginnings of spectral clustering
- Thursday: Discussion of "The Circle"
- Friday: Spectral clustering (Ratio Cut into two clusters); HW1 Due.
Readings: To page 280 in "The Circle"; A Coder Considers the Waning Days of the Craft, by James Somers; Why the Godfather of AI Fears What He Built, by Joshua Rothman
Week 4:
1/22,24,25,26
Spectral Clustering, Partition Decoupling, Project Proposals
- Monday: Finish spectral clustering - k clusters, how to choose k, first look at null models
- Wednesday: Partition Decoupling Method
- Thursday: Project proposals
- Friday: PDM (wrap-up); The Rand Index
Cluster Comparison, Geometry of Data Clouds
- Monday: Rand Index, Adjusted Rand Index, Mutual Information, Adjusted Mutual Information
- Wednesday: More ARI and AMI; Modularity (the message of comparing to a "random configuration" while preserving some structure)
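For concreteness, here is a sketch of one standard (Newman-style) modularity score consistent with that message: compare the within-cluster edge weight to its expectation under a degree-preserving random configuration. The adjacency matrix A and the label array are assumed to be given as NumPy arrays; the function name is illustrative.

```python
import numpy as np

def modularity(A, labels):
    """Compare observed within-cluster edge weight to its expectation under a
    degree-preserving random ("configuration") model."""
    labels = np.asarray(labels)
    k = A.sum(axis=1)                                  # (weighted) degrees
    two_m = k.sum()                                    # total edge weight, times two
    same = labels[:, None] == labels[None, :]          # same-cluster indicator
    return ((A - np.outer(k, k) / two_m) * same).sum() / two_m
```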
- Friday: ARI and AMI for labeled data and cluster validation; MI as a measure of similarity; dist(X,Y) = H(X,Y) - MI(X,Y) as a distance between RVs; MDS for visualizing distance data; Belkin-Niyogi Laplacian Eigenmaps (and the relation to the Random Walk Laplacian). Note: on page 20 of von Luxburg's paper, the argument is made that the B-N algorithm (i.e., the steps that we went through in class, summarized on pages 2 and 3 of the B-N paper) is often the way to go for spectral clustering. It uses the eigenvectors corresponding to the several smallest eigenvalues of the random walk Laplacian (D^{-1}L).
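Below is a minimal sketch of that embedding step, assuming a symmetrized k-NN connectivity graph built with scikit-learn; the function name, the choice of k, and the use of k-means on the embedded coordinates are illustrative choices, not taken from the B-N paper.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmap(X, n_components=2, n_neighbors=10):
    # Symmetrized k-NN connectivity graph (a heat-kernel weighting is another option).
    W = kneighbors_graph(X, n_neighbors=n_neighbors, mode="connectivity").toarray()
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))                 # degree matrix
    L = D - W                                  # unnormalized graph Laplacian
    # Generalized problem L u = lambda D u, i.e., eigenvectors of D^{-1} L.
    vals, vecs = eigh(L, D)
    # Drop the trivial constant eigenvector (eigenvalue ~ 0); keep the next few.
    return vecs[:, 1:n_components + 1]

# Spectral clustering in the embedded coordinates:
X = np.random.rand(200, 5)                     # placeholder data
labels = KMeans(n_clusters=3, n_init=10).fit_predict(laplacian_eigenmap(X))
```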
Diffusion geometry, local curvature, Initial exploration of your project data
- Monday: Diffusion geometry as geometry uncovered by a random walk/Markov chain/diffusion process. We worked from this nice paper; a summary of the algorithm for implementing it is on page 5. A couple of important annotations (a code sketch follows this list):
- (1) You might want to choose a neighborhood threshold for the random walk (i.e., k-NN for some good choice of k, or include all points within an epsilon-ball of each point).
- (2) The Markov matrix P in the paper has transition probabilities P(i,j) recording the probability of a transition from i to j, so that row i of P gives the transitions out of i, and the rows of P^m for any m > 0 give the transition probabilities after m steps;
- (3) After you compute the Markov matrix in the paper, every time step t gives a new metric that depends on t; that is the formula D_t under Eqn. 8 in the paper. On the one hand it is the diffusion distance on the underlying manifold (and you can compute this!), which is the same as a simple Euclidean distance if we choose the right coordinates. Those new coordinates depend on an eigenvalue calculation (next step):
- (4) The matrix P has n real eigenvalues lambda_1 >= ... >= lambda_n. Since we are working with rows of the Markov matrix, we now want the row eigenvectors: v_i^T P = lambda_i v_i^T. Once you find those, the nonlinear embedding of the data is (for any dimension d and time step t) the map that takes data point X_i -> ((lambda_1)^t v_1(i), ..., (lambda_d)^t v_d(i)). For visualizing, take d = 2 and plot the picture for a range of t. If you are handy with visualizations, see if you can make a movie by letting the points move in space as t grows.
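Here is a rough sketch of steps (1)-(4) above, assuming a Gaussian weight on a k-NN neighborhood graph; the kernel width sigma, the choice of k, and the function names are illustrative and not prescribed by the paper.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def diffusion_embedding(X, d=2, t=1, k=10, sigma=1.0):
    # (1) k-NN neighborhood, with a Gaussian weight on each retained edge.
    dist = kneighbors_graph(X, n_neighbors=k, mode="distance").toarray()
    W = np.where(dist > 0, np.exp(-dist**2 / (2 * sigma**2)), 0.0)
    W = np.maximum(W, W.T)                      # symmetrize the neighborhood graph
    # (2) Row-normalize to get the Markov (transition) matrix P.
    P = W / W.sum(axis=1, keepdims=True)
    # (4) Row eigenvectors of P (v^T P = lambda v^T) are eigenvectors of P^T;
    #     the eigenvalues are real because P comes from a symmetric W.
    vals, vecs = np.linalg.eig(P.T)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # (3) The embedding: coordinate j of point i is lambda_j^t * v_j(i).
    return vecs[:, :d] * (vals[:d] ** t)

# Visualize with d = 2 for a range of t (the points drift as t grows).
X = np.random.rand(200, 5)                      # placeholder data
frames = {t: diffusion_embedding(X, d=2, t=t) for t in (1, 2, 4, 8)}
```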
- Curvature: We mentioned curvature at the end. The Curvenet code and related paper can be found on this GitHub page.
- Wednesday: Some recapitulation of Monday and some details:
- The embedding of data points in terms of their Markov transition matrix (for any time t) is a nonlinear embedding of the data into a Euclidean space of higher dimension where the Euclidean metric produces diffusion distance.
- This turns out to be the same as imposing a non-Euclidean inner product on the space where the embedding sends each point to a canonical basis vector, x_i -> e_i (in R^N, where N = |X|), namely <v,w> = v^T D^{-1} w, with D the (diagonal) degree matrix.
- Friday: Diffusion curvature; Intrinsic Dimension computation; Regression and a first look at Logistic Regression:
- Diffusion curvature: using the diffusion distance (for some fixed number of steps T), you can define, for any r > 0, the (diffusion) ball of radius r centered at x. Then, given r, define the diffusion curvature at x to be the average probability that after T steps you stay in the ball B(x,r); this is the same as the average over all y in the ball of the probabilities of transition from x to y in T steps. Note that this gives a value between 0 and 1 (no points of negative or zero curvature).
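A minimal sketch of that computation, assuming you already have a Markov matrix P and a matrix D_t of diffusion distances (e.g., from the diffusion-map material above); the argument names and the choices of r and T are illustrative.

```python
import numpy as np

def diffusion_curvature(P, D_t, x, r, T):
    """Diffusion curvature at point x: the average T-step transition probability
    from x to the points of the diffusion ball B(x, r).
    P : (n, n) Markov matrix; D_t : (n, n) diffusion distances;
    x : index of the base point; r : ball radius; T : number of steps."""
    P_T = np.linalg.matrix_power(P, T)   # T-step transition probabilities
    ball = np.where(D_t[x] <= r)[0]      # indices of the points in B(x, r)
    return P_T[x, ball].mean()           # a value between 0 and 1
```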
- Intrinsic Dimension: a simple method due to Fan et al., accomplished by (1) finding a minimal "open cover" (the fewest epsilon-balls needed to cover the whole data set) and then (2) computing PCA for each set in the resulting partition of the data. This gives you a set of local PCs and associated variances (eigenvalues). (3) With this, order the eigenvalues of each set in the partition, define the total total variance (Ttv) as the sum of all these eigenvalues, and define the kth total eigenvalue as the sum of all the kth eigenvalues. Add up the total eigenvalues from biggest to smallest until you accumulate some predetermined fraction of the Ttv; the index where you stop is the Intrinsic Dimension. Here is the scikit-learn documentation.
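A rough sketch of steps (1)-(3), using a greedy approximation of the epsilon-ball cover; epsilon, the variance fraction, and the function name are illustrative choices, not from Fan et al.

```python
import numpy as np
from sklearn.decomposition import PCA

def intrinsic_dimension(X, eps=0.5, frac=0.95):
    # (1) Greedy approximate cover: pick an uncovered point, claim everything
    #     within eps of it as one set of the resulting partition, repeat.
    remaining = list(range(len(X)))
    parts = []
    while remaining:
        center = X[remaining[0]]
        dists = np.linalg.norm(X[remaining] - center, axis=1)
        members = [remaining[i] for i in np.where(dists <= eps)[0]]
        parts.append(members)
        remaining = [i for i in remaining if i not in members]
    # (2) Local PCA on each set; keep the sorted variances (eigenvalues),
    #     zero-padded to the ambient dimension p.
    p = X.shape[1]
    spectra = []
    for members in parts:
        if len(members) < 2:
            continue                     # PCA needs at least two points
        ev = PCA().fit(X[members]).explained_variance_
        padded = np.zeros(p)
        padded[:len(ev)] = ev
        spectra.append(padded)
    spectra = np.array(spectra)
    # (3) kth total eigenvalue = sum of the kth local eigenvalues; accumulate,
    #     largest first, until a fraction `frac` of the total total variance (Ttv).
    total_k = spectra.sum(axis=0)
    ttv = total_k.sum()
    return int(np.searchsorted(np.cumsum(total_k), frac * ttv) + 1)
```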
- Linear Regression: Given data X (a set of feature vectors whose entries are the values of independent variables) and a vector of responses (dependent variables), solve for the "best" linear combination of some predetermined functions of the independent variables, i.e., the parameters Theta that give the best linear approximation to the response vector. The answer is a solution to the corresponding normal equations, A^T A Theta = A^T y, where the rows of the design matrix A are F(x_i), the embeddings of the data into the new vector space whose dimension equals the number of terms in the linear model.
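A small sketch of that setup; the model terms (here 1, x, x^2 for one-dimensional data) and the function names are illustrative.

```python
import numpy as np

def fit_linear_model(X, y, funcs):
    """Solve the normal equations A^T A Theta = A^T y, where the design matrix A
    has one column per term of the linear model (one function in `funcs`)."""
    A = np.column_stack([f(X) for f in funcs])
    return np.linalg.solve(A.T @ A, A.T @ y)   # (np.linalg.lstsq is the more stable route)

# Example: fit y ~ theta_0 + theta_1 x + theta_2 x^2 from noisy 1-D data.
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * np.random.randn(50)
theta = fit_linear_model(x, y, [np.ones_like, lambda x: x, np.square])
```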
- Logistic Regression: Again, a data set X, but now the response (dependent variable) is categorical. In the simplest case of two categories, we are finding a set of weights w and a bias b such that for each x in X, sigma(w^T x + b), with the logistic function sigma(z) = 1/(1 + exp(-z)), does the best job of separating the points: w^T x + b falls on either side of 0 (equivalently, sigma lands on either side of 1/2) according to the class (0 or 1). Chapter 5 of the online textbook Speech and Language Processing (3rd ed.), by Dan Jurafsky and James H. Martin, is the best explanation of LR that I have ever read.
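A bare-bones sketch of that picture, fitting w and b by gradient descent on the average cross-entropy loss; the learning rate, iteration count, and function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=2000):
    """Learn weights w and bias b so that sigmoid(w^T x + b) models P(y = 1 | x),
    via gradient descent on the average cross-entropy loss."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iters):
        p_hat = sigmoid(X @ w + b)        # predicted probability of class 1
        grad_w = X.T @ (p_hat - y) / n    # gradient of the loss in w
        grad_b = np.mean(p_hat - y)       # gradient of the loss in b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Classify x as class 1 when w^T x + b > 0, i.e., sigmoid(w^T x + b) > 1/2.
```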