2021-Winter-MGTA415-Working with Unstructured Data

Graduate Class, Rady Management School, UCSD, 2021

Class Time: Wednesdays, 6PM to 9PM. Room: https://ucsd.zoom.us/j/95861287987. Piazza: piazza.com/ucsd/winter2021/mgta415

Online Lecturing

Due to COVID-19, this course will be delivered over Zoom: https://ucsd.zoom.us/j/95861287987

Overview

This course introduces current methods and models for analyzing and mining real-world unstructured text data.

As a starting point, we will cover text preprocessing, text classification (e.g., sentiment analysis), topic modeling, word embedding, language models, etc. The course covers not only the theory and high-level intuition but also implementation tricks (e.g., how to use third-party libraries and how to set hyper-parameters).

For the advanced part, we will talk about unsupervised, weakly supervised, and distantly supervised methods for text mining problems, including information retrieval, open-domain information extraction, text summarization (both extractive and generative), and knowledge graph construction. Bootstrapping, comparative analysis, learning from seed words and existing knowledge bases will be the key methodologies.

We will have a take-home midterm, a few homework assignments, a Kaggle-like competition, and a final (team-based) project. These four parts will carry roughly equal weight.

There is no textbook required, but there are recommended readings for each lecture (at the end of the slides).

If you don’t have much experience in data mining, machine learning, etc., here are some recommended textbooks to review.

Prerequisites

  • Math, Stats, and Coding
  • For Coding
    • We will mainly use Python
    • Sometimes, we will need to run some tools developed in C/C++ and Java
  • It’s a bonus if you already have knowledge about machine learning and data mining

Teaching Assistant

  • Yixin Zou (yiz867 AT eng.ucsd.edu)

Office Hours

Note: all times are in Pacific Time.

Grading

  • Homework: 8% each. The lowest of your four homework grades is dropped (equivalently, one homework can be skipped).
  • Midterm: 26%.
  • Data Mining Challenge: 25%.
  • Project: 25%.
  • You should complete all work individually, except for the Project.
  • Late submissions are NOT accepted.
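The weighting above can be sketched as a small calculation. This is only an illustration under stated assumptions, not the official grading script: the function name `course_grade` is hypothetical, every score is assumed to be normalized to a fraction between 0 and 1, a skipped homework is assumed to count as 0 before the lowest grade is dropped, and the project bonus of up to 5 points is assumed to be added on top of the total without any cap.

```python
def course_grade(homework, midterm, challenge, project, bonus=0.0):
    """Return a course percentage from component scores.

    Assumptions (not stated in the syllabus): every score is a fraction
    in [0, 1], a skipped homework is entered as 0.0, and the bonus is
    given in percentage points.
    """
    if len(homework) != 4:
        raise ValueError("expected exactly four homework scores")

    # Drop the lowest of the four homework grades; each kept one is worth 8%.
    kept = sorted(homework, reverse=True)[:3]
    homework_points = 8.0 * sum(kept)

    total = (
        homework_points
        + 26.0 * midterm      # Midterm: 26%
        + 25.0 * challenge    # Data Mining Challenge: 25%
        + 25.0 * project      # Project: 25%
    )

    # The demo/app bonus is limited to 5 percentage points.
    return total + min(max(bonus, 0.0), 5.0)


# Example: HW4 was skipped, so it is the grade that gets dropped.
example = course_grade([1.0, 0.5, 0.9, 0.0], midterm=0.8, challenge=0.9, project=1.0)
print(round(example, 2))  # 87.5
```

With perfect scores on everything the components sum to 24 + 26 + 25 + 25 = 100, which matches the percentages listed above.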

Lecture Schedule

Recording Note: Please check out Canvas for recordings.

HW Note: All HWs are due before lecture time (6 PM PT) on the due date.

Week | Date | Topic & Slides | Events
1 | 01/06 (Wed) | Intro, Logistics, Course Project; Text Preprocessing [slides] [annotations] [Jupyter Notebook] | HW1 out
2 | 01/13 (Wed) | Text Classification using Bag-of-Words [slides] [annotations] [Jupyter Notebook] | DM Challenge out
3 | 01/20 (Wed) | Word Embedding & Language Models: from N-Gram to Neural LMs | HW1 due, HW2 out
4 | 01/27 (Wed) | Information Retrieval & Topic Modeling |
5 | 02/03 (Wed) | Midterm Exam |
6 | 02/10 (Wed) | Phrase Mining and its Applications | HW2 due, HW3 out
7 | 02/17 (Wed) | Open-Domain Information Extraction: Entity Recognition, Relation Extraction, and Attribute Discovery | DM Challenge due
8 | 02/24 (Wed) | Weakly Supervised Text Classification | HW3 due, HW4 out
9 | 03/03 (Wed) | Document Summarization, Aspect-based Sentiment Analysis, and Opinion Summarization |
10 | 03/10 (Wed) | Topic Taxonomy Construction | HW4 due

Homework (24%)

The lowest of your four homework grades is dropped (equivalently, one homework can be skipped).

  • HW1: Text Pre-processing and Classification (8%). This homework mainly focuses on the impact of the pre-processing on the classification results.
  • HW2: Word Embedding and Neural Language Models (8%). This homework mainly focuses on trying out the word embedding and neural language models (e.g., BERT using HuggingFace).
  • HW3: Phrase Mining (8%). This homework mainly focuses on phrase mining applications and future work.
  • HW4: Weakly Supervised Text Classification (8%). This homework mainly focuses on weakly supervised aspect classification and opinion summarization.

Midterm (26%)

It is an open-book, take-home exam that covers all lectures given before the midterm. Most of the questions will be open-ended, and some may be slightly more difficult than the homework. You will have 24 hours to complete the midterm, which is expected to take about 3 hours.

  • Start: Feb 3, 6 PM PT
  • End: Feb 4, 6 PM PT
  • Midterm problems download: TBD.

Data Mining Challenge (25%)

It is an individual data mining competition with quantitative evaluation. The challenge runs from Jan 14, 0:00:01 AM PT to Feb 18, 4:59:59 PM PT. Note that the time displayed on Kaggle is in UTC, not PT.
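Because Kaggle shows times in UTC, it can help to convert the challenge window yourself. The sketch below is only an illustration: the helper name `pt_to_utc` is hypothetical, and it assumes a fixed UTC-8 offset, which holds here because both dates fall in Pacific Standard Time (daylight saving time in 2021 does not start until March).

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offset for Pacific Standard Time (UTC-8). This is valid
# for January and February dates but not during daylight saving time.
PST = timezone(timedelta(hours=-8), name="PST")


def pt_to_utc(naive_pacific_time):
    """Interpret a naive datetime as PST and convert it to UTC."""
    return naive_pacific_time.replace(tzinfo=PST).astimezone(timezone.utc)


# Challenge window as stated in the syllabus, in Pacific Time.
start_utc = pt_to_utc(datetime(2021, 1, 14, 0, 0, 1))
end_utc = pt_to_utc(datetime(2021, 2, 18, 16, 59, 59))

print(start_utc.isoformat())  # 2021-01-14T08:00:01+00:00
print(end_utc.isoformat())    # 2021-02-19T00:59:59+00:00
```

Note that the end of the window lands on Feb 19 in UTC, so a deadline shown on Kaggle with that date still corresponds to the afternoon of Feb 18 in Pacific Time.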

Project (25%)

  • Team-Based Open-Ended Project
    • 1 to 4 members per team. More members, higher expectation.
    • Define your own research problem and justify its importance
    • Final Deliverables: Research Paper-like Report
      • Report due on Mar 12, End of the day, Pacific Time.
      • Write a 5- to 9-page report in research-paper style, following the ACL template. The page count does not include references.
      • Come up with your hypothesis and find some datasets for verification
      • Design your own models or try a large variety of existing models
      • Submit your code and datasets; GitHub repos are welcome
      • Up to 5% bonus toward the total course grade for working demos/apps