2026-Winter-CSE261-DSC253-Advanced Data-Driven Text Mining
Graduate Class, CSE, UCSD, 2026
Class Time: Tuesdays and Thursdays, 8:00 to 9:20 AM. Room: COA 125. Piazza: piazza.com/ucsd/winter2026/cse261dsc253
Online Lectures for the First Week
To give waitlisted students a chance to learn more about this course, lectures in the first week will be delivered over Zoom: https://ucsd.zoom.us/j/98878563153. The lectures will be recorded.
Overview
This course introduces current methods and models for analyzing and mining real-world text data. It emphasizes unsupervised, weakly supervised, and distantly supervised methods for text mining problems, including information retrieval, open-domain information extraction, text summarization (both extractive and generative), and knowledge graph construction. Bootstrapping, comparative analysis, and learning from seed words and existing knowledge bases will be the key methodologies.
No textbook is required, but recommended readings for each lecture are listed at the end of the slides.
- You MUST enroll in 4 units
- We need your time commitment for projects
- Feel free to audit the course with 0 units
Prerequisites
Knowledge of machine learning and data mining; comfort coding in Python, C/C++, or Java; and math and statistics skills.
TA and Office Hours
- Jingbo Shang
  - Office Hour: Wednesdays, 9 AM to 10 AM
  - Zoom link: https://ucsd.zoom.us/my/jshang
- TAs:
  - Letian Peng (lepeng AT ucsd.edu)
    - Office Hour: TBD
    - Location: TBD
Note: all times are in Pacific Time.
Grading
- Homework: 30%. There will be two homework assignments, worth 15% each.
- Text Mining Challenge: 30%.
- Project: 40%.
- You should complete all work individually, except for the Project.
- Late submissions are NOT accepted.
Lecture Schedule
Recording Note: Please download the recording to watch it in full. The Dropbox website will only show you the first hour.
HW Note: All HWs are due by the end of the day, Pacific Time.
| Week | Date | Topic & Slides | Events |
|---|---|---|---|
| 1 | 01/06 (Tue) | Intro, Logistics, and Course Project | |
| 1 | 01/08 (Thu) | Basics: Zipf’s Law, bag-of-words, and TF-IDF | HW1 out |
| 2 | 01/13 (Tue) | Word Embedding: word2vec and GloVe | |
| 2 | 01/15 (Thu) | Language Models: from N-Gram to Neural LMs | |
| 3 | 01/20 (Tue) | Information Retrieval: from BM25 to Learning to Rank | Project Proposal Due (End of the Day) |
| 3 | 01/22 (Thu) | Sentiment Analysis and Document Classification | |
| 4 | 01/27 (Tue) | Topic Modeling: PLSA, LDA, and HMM | HW1 due, DM challenge rollout |
| 4 | 01/29 (Thu) | Phrase Mining: from Unigrams to Multi-word Phrases | HW2 out |
| 5 | 02/03 (Tue) | Entity Set Expansion: from Seed Words to Sets | |
| 5 | 02/05 (Thu) | Entity Recognition: from Supervised to Data-Driven | |
| 6 | 02/10 (Tue) | Distant Supervision for Relation Extraction | |
| 6 | 02/12 (Thu) | Text-Rich Network: a collaboration between Texts and Networks | |
| 7 | 02/17 (Tue) | Topic Taxonomy Construction | |
| 7 | 02/19 (Thu) | Weakly Supervised Text Classification | |
| 8 | 02/24 (Tue) | Learning with Noisy Data | HW2 due |
| 8 | 02/26 (Thu) | Label Bias in Weak Supervision & Few-shot NER | DM challenge due |
| 9 | 03/03 (Tue) | Large Language Models | |
| 9 | 03/05 (Thu) | Project Presentations | |
| 10 | 03/10 (Tue) | Project Presentations | |
| 10 | 03/12 (Thu) | Project Presentations | |
Homework (30%)
- HW1. Text Classification with Different Techniques.
  - Due: Jan 27
- HW2. Phrase Mining Applications and Future Work.
  - Due: Feb 24
Data Mining Challenge (30%)
This is an individual text mining competition with quantitative evaluation. The challenge runs during the quarter; exact start and end dates will be announced.
- Challenge statement, dataset, and details: TBD
- Kaggle challenge link: TBD
- Survey to map Kaggle account names to student names: TBD
Project (40%)
Overview
- Team-Based Open-Ended Project
- 1 to 4 members per team. The more members, the higher the expectations.
- 3 to 4 members are recommended, given the limited presentation slots.
Final Deliverables
- Project Proposal (5%) instruction
  - Define your own research problem and justify its importance
  - Be ambitious! We could aim for the ACL/EMNLP conferences!
- Research Paper (20%)
  - Report due: Mar 15, 2026, end of the day, Pacific Time.
  - Write a 5- to 9-page report in research-paper style, following the ACL template. The page count does not include references.
  - Come up with your hypothesis and find some datasets for verification
  - Design your own models or try a large variety of existing models
  - Submit your code and datasets; GitHub repos are welcome
  - Up to 5% bonus toward the total course grade for working demos/apps
- Presentation (20%)
  - The presentation order will be decided randomly after the teams are formed.
  - The slides must be ready 2 days before the presentation date, so that other students can access them and think about questions.
  - The presentation follows a typical conference style: 20 minutes for each team, including Q&A
- Question Asking and Handling (5%)
  - Asking questions is an important part of research. You are strongly encouraged to ask other teams questions. It will be a part of your presentation grade.
  - Handling questions is also an important skill for researchers.