2026-Spring-DSC148-Introduction to Data Mining

Undergraduate Class, HDSI, UCSD, 2026

Class Time: Tuesdays and Thursdays, 12:30 PM to 1:50 PM. Room: PODEM 1A20 (1st week over Zoom). Piazza: piazza.com/ucsd/spring2026/dsc148

Online Lecturing

To give waitlisted students a chance to learn more about this course, the first week's lectures are delivered over Zoom: https://ucsd.zoom.us/j/98143217527. These lectures will be recorded.

Overview

This course mainly focuses on introducing current methods and models that are useful in analyzing and mining real-world data. It will cover frequent pattern mining, regression & classification, clustering, and representation learning. No previous background in machine learning is required, but all participants should be comfortable with programming, and with basic optimization and linear algebra.

There is no textbook required, but here are some recommended readings:

Prerequisites

Math, Stats, and Coding: (CSE 12 or DSC 40B) and (CSE 15L or DSC 80) and (CSE 103 or ECE 109 or MATH 181A or ECON 120A or MATH 183)

TAs

  • Teaching Assistants: Benjamin TenWolde (betenwolde AT ucsd.edu)

Office Hours

Note: all times are in Pacific Time.

Grading

  • Homework: 8% each. The lowest of your four homework grades is dropped (equivalently, you can skip one homework).
  • Midterm: 26%.
  • Data Mining Challenge: 25%.
  • Project: 25%.
  • You should complete all work individually, except for the Project.
  • Late submissions are NOT accepted.
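
As a rough illustration of how the weights above combine, here is a minimal Python sketch. The function name `course_grade`, the 0 to 100 score scale, and the treatment of the demo bonus (from the Project section) as percentage points capped at 5 are assumptions made for this example, not an official grading script.

```python
def course_grade(homework, midterm, challenge, project, demo_bonus=0.0):
    """Combine component scores (each on a 0-100 scale) into a course percentage.

    homework: list of four scores; a skipped homework counts as 0.
    demo_bonus: assumed to be percentage points, capped at 5 (an assumption).
    """
    if len(homework) != 4:
        raise ValueError("expected exactly four homework scores")

    # Drop the lowest homework; the remaining three count 8% each (24% total).
    kept = sorted(homework)[1:]

    # Weights are in percentage points: 3 x 8 + 26 + 25 + 25 = 100.
    weighted = 8 * sum(kept) + 26 * midterm + 25 * challenge + 25 * project
    base = weighted / 100

    # Clamp the assumed bonus to the range [0, 5].
    bonus = min(max(demo_bonus, 0.0), 5.0)
    return base + bonus


if __name__ == "__main__":
    # One skipped homework (scored 0) does not lower the grade.
    print(course_grade([90, 80, 70, 0], midterm=85, challenge=90, project=95))
```

Because the lowest score is dropped before weighting, skipping one homework and scoring 0 on it are treated identically in this sketch.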

Lecture Schedule

Recording Note: Please download each recording to watch it in full; the Dropbox website only shows the first hour.

HW Note: All HWs are due by the end of the day on the due date, i.e., 11:59 PM PT.

| Week | Date | Topic & Slides | Events |
|---|---|---|---|
| 1 | 03/31 (Tue) | Introduction: Data Types, Tasks, and Evaluations | HW1 out |
| 1 | 04/02 (Thu) | Supervised - Least-Squares Regression and Logistic Regression | |
| 2 | 04/07 (Tue) | Supervised - Overfitting and Regularization | HW2 out |
| 2 | 04/09 (Thu) | Supervised - Support Vector Machine | HW1 Due |
| 3 | 04/14 (Tue) | Supervised - Naive Bayes and Decision Tree | |
| 3 | 04/16 (Thu) | Supervised - Ensemble Learning: Bagging and Boosting | |
| 4 | 04/21 (Tue) | Cluster Analysis - K-Means Clustering & its Variants | HW2 Due, HW3 out |
| 4 | 04/23 (Thu) | Cluster Analysis - “Soft” Clustering: Gaussian Mixture | |
| 5 | 04/28 (Tue) | Cluster Analysis - Density-based Clustering: DBSCAN | |
| 5 | 04/30 (Thu) | Cluster Analysis - Principal Component Analysis | DM Challenge out |
| 6 | 05/05 (Tue) | Pattern Analysis - Frequent Pattern and Association Rules | |
| 6 | 05/07 (Thu) | Midterm (no class, 24 hours on this date) | |
| 7 | 05/12 (Tue) | Recommender System - Collaborative Filtering | HW3 Due, HW4 out |
| 7 | 05/14 (Thu) | Recommender System - Latent Factor Models | |
| 8 | 05/19 (Tue) | Text Mining - Zipf’s Law, Bags-of-words, and TF-IDF | |
| 8 | 05/21 (Thu) | Text Mining - Advanced Text Representations | DM Challenge due |
| 9 | 05/26 (Tue) | Network Mining - Small-Worlds & Random Graph Models, HITS, PageRank | |
| 9 | 05/28 (Thu) | Network Mining - Personalized PageRank and Node Embedding | |
| 10 | 06/02 (Tue) | Sequence Mining - Sliding Windows and Autoregression | |
| 10 | 06/04 (Thu) | Text Data as Sequence - Large Language Models | HW4 Due |

Homework (24%)

The lowest of your four homework grades is dropped (equivalently, you can skip one homework).

  • HW1: Concepts and Evaluations (8%). This homework mainly focuses on the data mining concepts and how to evaluate different tasks.
  • HW2: Regression and Classification (8%). This homework mainly focuses on regression and classification tasks.
  • HW3: Cluster and Pattern Analysis (8%). This homework mainly focuses on clustering methods and frequent pattern mining methods.
  • HW4: Applications (8%). This homework mainly focuses on recommender systems, text mining, and network mining.

Midterm (26%)

The midterm is an open-book, take-home exam covering all lectures given before it. Most of the questions will be open-ended, and some might be slightly more difficult than the homework. You will have 24 hours to complete the midterm, which is expected to take about 3 to 4 hours.

  • Start: May 7, 12:30 PM PT
  • End: May 8, 12:30 PM PT
  • Midterm problems download: TBD
  • Please make your submissions on Gradescope.

Data Mining Challenge (25%)

It is an individual data mining competition with quantitative evaluation. The challenge runs from April 30 to May 21. Note that the time displayed on Kaggle is in UTC, not PT.

  • Challenge Statement, Dataset, and Details: Please see the challenge’s Rules tab.
  • Kaggle challenge link: TBD
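
Since Kaggle displays times in UTC, a small conversion helps avoid misreading a deadline. The following is a minimal Python sketch using the standard-library `zoneinfo` module; the function name `utc_to_pacific` and the sample timestamp are hypothetical and are not the actual challenge deadline. On systems without a time zone database (e.g., some Windows setups), the `tzdata` package may be needed.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def utc_to_pacific(utc_string):
    """Convert a 'YYYY-MM-DD HH:MM' timestamp given in UTC to Pacific Time."""
    # Parse the naive timestamp, then mark it explicitly as UTC.
    utc_time = datetime.strptime(utc_string, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)
    # astimezone handles daylight saving time (PDT vs. PST) automatically.
    return utc_time.astimezone(PACIFIC)


if __name__ == "__main__":
    # Hypothetical UTC time as it might be displayed on Kaggle.
    local = utc_to_pacific("2026-05-22 06:59")
    print(local.strftime("%Y-%m-%d %H:%M %Z"))
```

Note that a UTC time early in the morning falls on the previous calendar day in Pacific Time.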

Project (25%)

The project is due on Sunday, June 7, by the end of the day. Here is a quick overview:

  • Team-Based Open-Ended Project
    • 1 to 4 members per team. Larger teams are held to higher expectations.
    • Define your own research problem and justify its importance
    • Come up with your hypothesis and find some datasets for verification
    • Design your own models or try a large variety of existing models
    • Write a 4- to 8-page report (in the style of a research paper)
    • Submit your code
    • Up to 5% bonus toward the total course grade for working demos/apps.