2020-Spring-DSC190-Introduction to Data Mining

Undergraduate Class, HDSI, UCSD, 2020

Class Time: Tuesdays and Thursdays, 9:30AM to 10:50AM. Room: https://ucsd.zoom.us/j/632116245. Piazza: piazza.com/ucsd/spring2020/dsc190a00

Online Lecturing

Due to COVID-19, this course will be delivered over Zoom: https://ucsd.zoom.us/j/632116245

Overview

This course introduces current methods and models for analyzing and mining real-world data. It will cover frequent pattern mining, regression & classification, clustering, and representation learning. No previous background in machine learning is required, but all participants should be comfortable with programming and with basic optimization and linear algebra.

There is no textbook required, but here are some recommended readings:

Prerequisites

Math, Stats, and Coding: (CSE 12 or DSC 40B) and (CSE 15L or DSC 80) and (CSE 103 or ECE 109 or MATH 181A or ECON 120A or MATH 183)

TAs and Tutors

  • Teaching Assistants: Ria Aggarwal (r2aggarw AT eng.ucsd.edu) and Dheeraj Mekala (dmekala AT eng.ucsd.edu)
  • Tutors: Zhenyu Bi (z1bi AT ucsd.edu) and Yang Li (yang AT ucsd.edu)

Office Hours

Note: all times are in Pacific Time.

Grading

  • Homework: 8% each. The lowest of your four homework grades is dropped (equivalently, one homework can be skipped).
  • Midterm: 26%.
  • Data Mining Challenge: 25%.
  • Project: 25%.
  • You should complete all work individually, except for the Project.
  • Late submissions are NOT accepted.
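
The weights above can be combined as in the following minimal sketch. The function name `course_grade`, the 0 to 100 score scale, and treating a skipped homework as a score of 0 are assumptions made for illustration; how a bonus interacts with a 100% ceiling is not stated here, so the sketch does not cap the total.

```python
def course_grade(homeworks, midterm, challenge, project, bonus=0.0):
    """Combine component scores (each on a 0-100 scale) into a course percentage.

    Assumptions (not from the syllabus): a skipped homework is entered as 0,
    and `bonus` is given directly in percentage points of the course grade.
    """
    if len(homeworks) != 4:
        raise ValueError("expected exactly four homework scores")

    # Drop the lowest of the four homework scores; each kept one is worth 8%.
    kept = sorted(homeworks)[1:]
    homework_part = sum(0.08 * score for score in kept)

    # The syllabus mentions a project bonus of up to 5% of the course grade.
    bonus = min(bonus, 5.0)

    return (homework_part
            + 0.26 * midterm
            + 0.25 * challenge
            + 0.25 * project
            + bonus)


if __name__ == "__main__":
    # One skipped homework (entered as 0) is the score that gets dropped.
    print(round(course_grade([90, 80, 70, 0], midterm=85, challenge=90, project=95), 2))
```

With three counted homeworks at 8% each, the components sum to 24% + 26% + 25% + 25% = 100%.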

Lecture Schedule

Recording Note: Please download each recording to watch it in full. The Dropbox web player only shows the first hour.

HW Note: All homework is due before lecture, at 9:30 AM PT.

Week | Date | Topic & Slides | Events
1 | 03/31 (Tue) | Introduction: Data Types, Tasks, and Evaluations [slides] [recording] | HW1 out
1 | 04/02 (Thu) | Supervised - Least-Squares Regression and Logistic Regression [slides] [recording] |
2 | 04/07 (Tue) | Supervised - Overfitting and Regularization [slides] [annotated slides] [recording] | HW1 Due, HW2 out
2 | 04/09 (Thu) | Supervised - Support Vector Machine [slides] [annotated slides] [recording] |
3 | 04/14 (Tue) | Supervised - Naive Bayes and Decision Tree [slides] [annotated slides] [recording] |
3 | 04/16 (Thu) | Supervised - Ensemble Learning: Bagging and Boosting [slides] [annotated slides] [recording] |
4 | 04/21 (Tue) | Cluster Analysis - K-Means Clustering & its Variants [slides] [annotated slides] [recording] | HW2 Due, HW3 out
4 | 04/23 (Thu) | Cluster Analysis - “Soft” Clustering: Gaussian Mixture [slides] [annotated slides] [recording] |
5 | 04/28 (Tue) | Cluster Analysis - Density-based Clustering: DBSCAN [slides] [annotated slides] [recording] |
5 | 04/30 (Thu) | Cluster Analysis - Principal Component Analysis [slides] [annotated slides] [recording] | DM Challenge out
6 | 05/05 (Tue) | Pattern Analysis - Frequent Pattern and Association Rules [slides] [annotated slides] [recording] |
6 | 05/07 (Thu) | Midterm (24 hours on this date) |
7 | 05/12 (Tue) | Recommender System - Collaborative Filtering [slides] [annotated slides] [recording] | HW3 Due, HW4 out
7 | 05/14 (Thu) | Recommender System - Latent Factor Models [slides] [annotated slides] [recording] |
8 | 05/19 (Tue) | Text Mining - Zipf’s Law, Bags-of-words, and TF-IDF [slides] [annotated slides] [recording] |
8 | 05/21 (Thu) | Text Mining - Advanced Text Representations [slides] [annotated slides] [recording] |
9 | 05/26 (Tue) | Network Mining - Small-Worlds & Random Graph Models [slides] [annotated slides] [recording] |
9 | 05/28 (Thu) | Network Mining - HITS, PageRank, Personalized PageRank and Node Embedding [slides] [annotated slides] [recording] |
10 | 06/02 (Tue) | Sequence Mining - Sliding Windows and Autoregression [slides] [annotated slides] [recording] |
10 | 06/04 (Thu) | Text Data as Sequence - Named Entity Recognition [slides] [annotated slides] [recording] | HW4 Due (extended to 06/07 9:30 AM PT)

Homework (24%)

The lowest of your four homework grades is dropped (equivalently, one homework can be skipped). Gradescope code: M665VN.

Midterm (26%)

It is an open-book, take-home exam that covers all lectures given before the Midterm. Most of the questions will be open-ended, and some may be slightly more difficult than the homework. You will have 24 hours to complete the midterm, although it is expected to take about 2 hours.

Data Mining Challenge (25%)

It is an individual data mining competition with quantitative evaluation. The challenge runs from April 30 0:00:01 AM to May 17 4:59:59 PM PT. Note that the time displayed on Kaggle is in UTC, not PT.
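
To compare these times with what Kaggle displays, the conversion can be sketched as below. It assumes the PT times fall under Pacific Daylight Time (UTC-7), which holds for April and May 2020; the names `PDT`, `pt_to_utc`, `challenge_start`, and `challenge_end` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: during the challenge window, Pacific Time is PDT (UTC-7).
PDT = timezone(timedelta(hours=-7), "PDT")


def pt_to_utc(dt_pt):
    """Convert a timezone-aware Pacific-time datetime to UTC."""
    return dt_pt.astimezone(timezone.utc)


challenge_start = datetime(2020, 4, 30, 0, 0, 1, tzinfo=PDT)
challenge_end = datetime(2020, 5, 17, 16, 59, 59, tzinfo=PDT)

start_utc = pt_to_utc(challenge_start)
end_utc = pt_to_utc(challenge_end)

print("Start (UTC):", start_utc.isoformat())
print("End   (UTC):", end_utc.isoformat())
```

Under this assumption, the 4:59:59 PM PT deadline corresponds to 23:59:59 UTC on the same date.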

Project (25%)

Instructions for both choices are available here. The project due date has been extended to Jun 9.

Here is a quick overview:

  • Choice 1: Team-Based Open-Ended Project
    • 1 to 4 members per team. Larger teams are held to higher expectations.
    • Define your own research problem and justify its importance
    • Come up with your hypothesis and find some datasets for verification
    • Design your own models or try a large variety of existing models
    • Write a 4- to 8-page report in the style of a research paper
    • Submit your code
    • Up to 5% bonus towards the total course grade for working demos/apps.
  • Choice 2: Individual-Based Deep Dive into Data Mining Methods
    • Implement a few models learned from this course from scratch.
    • Skeleton code will be provided. Your work is more like “filling in blanks”.
    • Each model has a point value associated with it. 6 points are required.
    • Write a report (page count based on points) describing your interesting findings.
    • Up to 5% bonus towards the total course grade. Roughly 1 point, 1%.