2023-Spring-CSE291-DSC253-Advanced Data-Driven Text Mining

Graduate Class, CSE, UCSD, 2023

Class Time: Tuesdays and Thursdays, 12:30 to 1:50PM. Room: EBU3B (CSE) 2154. Piazza: piazza.com/ucsd/spring2023/cse291i00

Online Lecturing for First Week

To give waitlisted students a chance to learn more about this course, the first-week lectures will be delivered over Zoom: https://ucsd.zoom.us/j/98881116686. The lectures will be recorded.


This course introduces current methods and models for analyzing and mining real-world text data. It emphasizes unsupervised, weakly supervised, and distantly supervised methods for text mining problems, including information retrieval, open-domain information extraction, text summarization (both extractive and generative), and knowledge graph construction. Bootstrapping, comparative analysis, and learning from seed words and existing knowledge bases will be the key methodologies.

No textbook is required, but recommended readings for each lecture are listed at the end of the slides.

  • You MUST enroll for 4 units
    • We need your time commitment for projects
    • Feel free to audit the course with 0 units


Prerequisites: Knowledge of machine learning and data mining; comfort coding in Python, C/C++, or Java; math and statistics skills.

TA and Office Hours

  • Jingbo Shang
  • TA: Bill Hogan (whogan AT ucsd.edu)
    • Office Hour: Thursdays, 2 to 3 PM
    • CSE Basement Room: CSE B270A

Note: all times are in Pacific Time.


  • Homework: 30%. There will be two homework assignments. 15% each.
  • Text Mining Challenge: 30%.
  • Project: 40%.
  • You should complete all work individually, except for the Project.
  • Late submissions are NOT accepted.

Lecture Schedule

Recording Note: Please download the recording to watch it at full length. The Dropbox website only plays the first hour.

HW Note: All HWs are due before the lecture time, 9:30 AM PT.

Week | Date | Topic & Slides | Events
1 | 04/04 (Tue) | Intro, Logistics, and Course Project |
1 | 04/06 (Thu) | Basics: Zipf’s Law, Bags-of-words, and TF-IDF | HW1 out
2 | 04/11 (Tue) | Word Embedding: word2vec and GloVe |
2 | 04/13 (Thu) | Language Models: from N-Gram to Neural LMs |
3 | 04/18 (Tue) | Information Retrieval: from BM25 to Learning to Rank | Project Proposal Due (End of the Day)
3 | 04/20 (Thu) | Sentiment Analysis and Document Classification |
4 | 04/25 (Tue) | Topic Modeling: PLSA, LDA, and HMM | HW1 due (before lecture time), DM challenge roll-out
4 | 04/27 (Thu) | Phrase Mining: from Unigrams to Multi-word Phrases | HW2 out
5 | 05/02 (Tue) | Entity Set Expansion: from Seed Words to Sets |
5 | 05/04 (Thu) | Entity Recognition: from Supervised to Data-Driven |
6 | 05/09 (Tue) | Distant Supervision for Relation Extraction |
6 | 05/11 (Thu) | Text-Rich Network: a Collaboration between Texts and Networks |
7 | 05/16 (Tue) | Topic Taxonomy Construction |
7 | 05/18 (Thu) | Weakly Supervised Text Classification |
8 | 05/23 (Tue) | Learning with Noisy Data | HW2 due (before lecture time)
8 | 05/25 (Thu) | Label Bias in Weak Supervision & Few-shot NER | DM challenge due
9 | 05/30 (Tue) | Large Language Models |
9 | 06/01 (Thu) | Project Presentations |
10 | 06/06 (Tue) | Project Presentations |
10 | 06/08 (Thu) | Project Presentations |

Homework (30%)

  • HW1. Text Classification with Different Techniques. HW1.zip
    • Due: April 25, before lecture time
  • HW2. Phrase Mining Applications and Future Work. HW2.pdf
    • Due: May 23, before lecture time

Data Mining Challenge (30%)

It is an individual text mining competition with quantitative evaluation. The challenge runs from April 25, 2023, 0:00:01 AM to May 25, 2023, 4:59:59 PM PT. Note that the time displayed on Kaggle is in UTC, not PT.
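Because Kaggle shows UTC while the challenge window is stated in PT, it may help to convert the two timestamps explicitly. The following is a minimal sketch, not part of the official course material. It assumes Pacific Daylight Time (UTC-7) applies to both dates, which holds for late April and late May 2023; the helper name `pt_to_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: both challenge dates fall within daylight saving time,
# so Pacific Time is PDT, a fixed offset of UTC-7.
PDT = timezone(timedelta(hours=-7), name="PDT")


def pt_to_utc(local: datetime) -> datetime:
    """Attach the PDT offset to a naive Pacific timestamp and convert it to UTC."""
    return local.replace(tzinfo=PDT).astimezone(timezone.utc)


# Challenge window as stated on this page (Pacific Time).
challenge_start_utc = pt_to_utc(datetime(2023, 4, 25, 0, 0, 1))
challenge_end_utc = pt_to_utc(datetime(2023, 5, 25, 16, 59, 59))

print(challenge_start_utc.isoformat())  # 2023-04-25T07:00:01+00:00
print(challenge_end_utc.isoformat())    # 2023-05-25T23:59:59+00:00
```

Under this assumption, the 4:59:59 PM PT deadline corresponds to 23:59:59 UTC on the same calendar day.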

Project (40%)


  • Team-Based Open-Ended Project
    • 1 to 4 members per team. The more members, the higher the expectations.
    • 3 to 4 members are recommended, given the limited presentation slots.

Final Deliverables

  • Project Proposal (5%) instruction
    • Define your own research problem and justify its importance
    • Be ambitious! We could aim for a submission to the ACL/EMNLP conferences!
  • Research Paper (20%)
    • Report due on June 11, end of the day, Pacific Time.
    • Write a 5- to 9-page report in research-paper style, following the ACL template. The page count does not include references.
    • Come up with your hypothesis and find some datasets for verification
    • Design your own models or try a large variety of existing models
    • Submit your code and datasets; GitHub repos are welcome
    • Up to 5% bonus toward the total course grade for working demos/apps
  • Presentation (20%)
    • The presentation order will be decided randomly after the teams are formed.
    • The slides must be ready 2 days before the presentation date, so that other students can access them and think about questions.
    • The presentation follows a typical conference style: 20 mins for each team including Q&A
  • Question Asking and Handling (5%)
    • Asking questions is an important part of research. You are strongly encouraged to ask other teams questions. It will be a part of your presentation grade.
    • Handling questions is also an important skill for researchers.