Machine Learning Clustering & Retrieval

In this case study on finding similar documents, you will examine similarity-based algorithms for retrieval. In the Machine Learning Clustering & Retrieval training course, you will also examine structured representations for describing the documents in the corpus, including clustering and mixed membership models such as latent Dirichlet allocation (LDA). You will implement expectation maximization (EM) to learn document clusterings and see how to scale the methods using MapReduce.

By attending the Machine Learning Clustering & Retrieval workshop, participants will:

  • Create a document retrieval system using k-nearest neighbors (a minimal sketch follows this list).
  • Identify various similarity metrics for text data.
  • Reduce computations in k-nearest neighbor search by using KD-trees.
  • Produce approximate nearest neighbors using locality sensitive hashing.
  • Compare and contrast supervised and unsupervised learning tasks.
  • Cluster documents by topic using k-means.
  • Describe how to parallelize k-means using MapReduce.
  • Examine probabilistic clustering approaches using mixture models.
  • Fit a mixture of Gaussians model using expectation maximization (EM).
  • Perform mixed membership modeling using latent Dirichlet allocation (LDA).
  • Describe the steps of a Gibbs sampler and how to use its output to draw inferences.
  • Compare and contrast initialization techniques for non-convex optimization objectives.
  • Implement these techniques in Python.
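
As a taste of the first two objectives, the sketch below builds a tiny document retrieval system with k-nearest neighbors and a cosine similarity metric in Python using scikit-learn. The toy corpus, query, and parameter choices are illustrative assumptions, not course material.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus; the course works with a much larger document collection.
corpus = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "stock markets fell sharply on monday",
    "investors worried about rising interest rates",
]

# TF-IDF document representation.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Brute-force k-NN under cosine distance (1 - cosine similarity).
knn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute").fit(X)

query = vectorizer.transform(["my cat chased a dog"])
distances, indices = knn.kneighbors(query)
for dist, idx in zip(distances[0], indices[0]):
    print(f"{dist:.3f}  {corpus[idx]}")
```

Brute-force search is fine at this scale; the KD-tree and locality sensitive hashing topics in the agenda below address what to do when it is not.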

COURSE AGENDA

  • Retrieval as k-nearest neighbor search
  • 1-NN algorithm
  • k-NN algorithm
  • Document representation
  • Distance metrics: Euclidean and scaled Euclidean
  • Writing (scaled) Euclidean distance using (weighted) inner products
  • Distance metrics: Cosine similarity
  • To normalize or not and other distance considerations
  • Complexity of brute force search
  • KD-tree representation
  • NN search with KD-trees
  • Complexity of NN search with KD-trees
  • Visualizing scaling behavior of KD-trees
  • Approximate k-NN search using KD-trees
  • Limitations of KD-trees
  • LSH as an alternative to KD-trees (see the LSH sketch after this list)
  • Using random lines to partition points
  • Defining more bins
  • Searching neighboring bins
  • LSH in higher dimensions
  • (OPTIONAL) Improving efficiency through multiple tables
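
The last few items above cover locality sensitive hashing with random hyperplanes ("random lines" in two dimensions). The following sketch is a simplified, assumed illustration of the idea: hash every point into a bin given by the signs of random projections, then search only the bin the query falls into.

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_hash(X, planes):
    """Bin index for each row of X: one bit per random hyperplane."""
    bits = (X @ planes) >= 0
    powers = 1 << np.arange(planes.shape[1])
    return bits @ powers

# Toy data: 1,000 points in 20 dimensions, hashed with 8 random hyperplanes.
X = rng.standard_normal((1000, 20))
planes = rng.standard_normal((20, 8))
bins = lsh_hash(X, planes)

# Query: hash it, then brute-force search only the points sharing its bin.
q = rng.standard_normal((1, 20))
q_bin = lsh_hash(q, planes)[0]
candidates = np.where(bins == q_bin)[0]
if candidates.size:
    dists = np.linalg.norm(X[candidates] - q, axis=1)
    print("approximate NN:", candidates[np.argmin(dists)],
          "searched", candidates.size, "of", len(X), "points")
```

A practical implementation would also search neighboring bins (bin ids differing in a bit or two) and use multiple hash tables, as the optional efficiency item notes.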


  • The goal of clustering
  • An unsupervised task
  • Hope for unsupervised learning, and some challenge cases
  • The k-means algorithm (see the k-means sketch after the agenda)
  • k-means as coordinate descent
  • Smart initialization via k-means++
  • Assessing the quality and choosing the number of clusters
  • Motivating MapReduce
  • The general MapReduce abstraction
  • MapReduce execution overview and combiners
  • MapReduce for k-means
  • Other applications of clustering
  • Motivating probabilistic clustering models
  • Aggregating over unknown classes in an image dataset
  • Univariate Gaussian distributions
  • Bivariate and multivariate Gaussians
  • Mixture of Gaussians
  • Interpreting the mixture of Gaussian terms
  • Scaling mixtures of Gaussians for document clustering
  • Computing soft assignments from known cluster parameters
  • (OPTIONAL) Responsibilities as Bayes' rule
  • Estimating cluster parameters from known cluster assignments
  • Estimating cluster parameters from soft assignments
  • EM iterates in equations and pictures (see the EM sketch after the agenda)
  • Convergence, initialization, and overfitting of EM
  • Relationship to k-means
  • Mixed membership models for documents
  • An alternative document clustering model
  • Components of latent Dirichlet allocation model
  • Goal of LDA inference
  • The need for Bayesian inference
  • Gibbs sampling from 10,000 feet
  • A standard Gibbs sampler for LDA
  • What is collapsed Gibbs sampling? (see the Gibbs sketch after the agenda)
  • A worked example for LDA: Initial setup
  • A worked example for LDA: Deriving the resampling distribution
  • Using the output of collapsed Gibbs sampling
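
For the k-means items in the agenda, here is a minimal sketch of clustering a toy corpus by topic with scikit-learn; the corpus and the choice of two clusters are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

corpus = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "stock markets fell sharply on monday",
    "investors worried about rising interest rates",
]

X = TfidfVectorizer().fit_transform(corpus)

# init="k-means++" is the smart initialization discussed in the agenda.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.labels_)    # hard cluster assignment per document
print(km.inertia_)   # within-cluster sum of squares, one way to assess quality
```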
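
The mixture-of-Gaussians items (soft assignments from known parameters, parameter estimates from soft assignments, and the EM iterates) reduce to a short NumPy loop. This is a didactic sketch on synthetic one-dimensional data with made-up initial parameters, not the course's own implementation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Synthetic data: two well-separated univariate Gaussian clusters.
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

# Initial guesses for mixture weights, means, and standard deviations.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: soft assignments (responsibilities) via Bayes' rule.
    likelihoods = pi * norm.pdf(x[:, None], mu, sigma)      # shape (n, 2)
    resp = likelihoods / likelihoods.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters from the soft assignments.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print("weights:", pi.round(2), "means:", mu.round(2), "stdevs:", sigma.round(2))
```

With responsibilities pinned to 0 or 1 and fixed equal variances, the same loop degenerates to k-means, which is the relationship the agenda points out.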
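
Last, the collapsed Gibbs sampler for LDA discussed in the final agenda items. The sketch below runs on a tiny hand-made corpus of word ids with assumed hyperparameters; it shows the count tables, the collapsed resampling distribution, and how the final counts yield document-topic proportions.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 4, 1]]   # word ids for each document
V, K, alpha, beta = 5, 2, 0.1, 0.01                  # vocab size, topics, priors

# Random initial topic assignments plus the count tables the sampler maintains.
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
ndk = np.zeros((len(docs), K))   # topic counts per document
nkw = np.zeros((K, V))           # word counts per topic
nk = np.zeros(K)                 # total tokens per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(200):             # Gibbs sweeps over every token
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove the token's current assignment from the counts ...
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # ... form the collapsed resampling distribution over topics ...
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))
            # ... and add the token back under its newly sampled topic.
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

# Point estimate of document-topic proportions from the final sample.
print((ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True))
```

In practice one would average over many samples after burn-in rather than read off a single sweep, which is the "using the output of collapsed Gibbs sampling" step above.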