2020-Fall-DSC180A05-Capstone: Text Mining and NLP

Undergraduate Class, HDSI, UCSD, 2020

Class Time: Wednesdays, 9 to 9:50 AM Pacific Time. Room: https://ucsd.zoom.us/j/91491702947. Piazza: piazza.com/ucsd/fall2020/dsc180a05.


This capstone section mainly focuses on text mining and natural language processing. We will explore cutting-edge research papers in these areas together and try to replicate some experiments for a deeper, better understanding.

We will mostly hold discussions in a Q&A format instead of traditional lectures. Due to COVID-19, the discussions will take place online over Zoom.

Papers to Read


  1. These two papers are highly related. Please read them one by one.
  2. The Github README files also provide useful information.


Week  Date   Discussion Focus
1     10/07  General Overview (a short lecture by Jingbo Shang) [recording]
2     10/14  Introduction & Motivation [recording]
3     10/21  Datasets and Experiment Design [recording]
4     10/28  Experimental Results - Analysis [recording]
5     11/04  Experimental Results - Replication [recording]
6     11/11  No class (Veteran’s Day)
7     11/18  Case Studies [recording]
8     11/25  Application Brainstorming [recording]
9     12/02  Possible Extension [recording]
10    12/09  Elevator Pitch

Discussion Questions

Week 2: Introduction & Motivation

  1. Why do we want to study phrase mining? What’s the advantage of phrases over unigrams?
  2. What is the major obstacle to applying SegPhrase to a new corpus? Is any human effort required?
  3. What is the motivation behind AutoPhrase? Compared with SegPhrase, which parts do you believe are novel?

Week 3: Datasets and Experiment Design

  1. How many datasets are used in the papers? How many domains and languages are covered?
  2. Why do we want to use such a diverse set of datasets? How is this related to the claims in the papers?
  3. Why do we want to evaluate the results using the pooling strategy? Think about how much human effort would be required without pooling.
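
To make the human-effort question concrete, here is a minimal sketch of one common form of pooling, borrowed from information retrieval evaluation: only the union of each method's top-k phrases is sent to annotators. This is an illustration under that assumption and not necessarily the exact protocol used in the papers; the function names and the input layout (a dict mapping a method name to its ranked phrase list) are hypothetical.

```python
def build_pool(ranked_lists, k):
    """Return the union of the top-k phrases from every method's ranked list.

    ranked_lists: dict mapping a method name to a list of phrases,
    ordered from best to worst (hypothetical layout for illustration).
    """
    pool = set()
    for ranking in ranked_lists.values():
        pool.update(ranking[:k])
    return pool


def precision_at_k(ranking, labels, k):
    """Fraction of a method's top-k phrases that annotators marked as good.

    labels: dict mapping each pooled phrase to True (good) or False (bad).
    Every top-k phrase is in the pool, so every lookup succeeds.
    """
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for phrase in top if labels[phrase]) / len(top)
```

Under this scheme the pool holds at most k times the number of methods, and usually fewer because methods overlap, so the labeling effort stays small no matter how many candidate phrases each method returns.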

Week 4: Experimental Results - Analysis

  1. Please outline the claims in these two papers.
  2. How should we interpret each table and figure? What are the takeaways? One or two sentences per table/figure should be enough.
  3. For each claim, where are the experimental results supporting it?

Week 5: Experimental Results - Replication

  1. Carefully check the README file in the AutoPhrase repo. What is the relation between autophrase.sh and phrasal_segmentation.sh?
  2. Try to run AutoPhrase using the DBLP.5k.txt and DBLP.txt datasets as the input corpus. It should be runnable on your laptop. Let me know if you encounter any issues.
  3. Please eyeball the results from the two runs and try to compare them from the following aspects:
    • The number of high-quality phrases (e.g., > 0.5)
    • Unigram phrase vs. multi-word phrase
    • The top few high-quality phrases (e.g., > 0.9) vs. borderline phrases (e.g., ~0.5)
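
If eyeballing the raw output gets tedious, the comparison above can be sketched in a few lines of Python. This is a rough sketch, assuming each output line holds a quality score and a phrase separated by a tab; check your own output files before relying on it. The function names, the 0.05 margin around the threshold, and the choice of five examples per group are arbitrary, illustrative choices.

```python
import sys
from pathlib import Path


def parse_lines(lines):
    """Parse 'score<TAB>phrase' lines into (score, phrase) pairs (assumed format)."""
    phrases = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        score, phrase = line.split("\t", 1)
        phrases.append((float(score), phrase.strip()))
    return phrases


def summarize(phrases, threshold=0.5, high=0.9, margin=0.05, k=5):
    """Summarize one run along the three aspects listed above."""
    kept = [(s, p) for s, p in phrases if s > threshold]
    unigrams = sum(1 for _, p in kept if len(p.split()) == 1)
    ranked = sorted(phrases, key=lambda sp: sp[0], reverse=True)
    return {
        "num_above_threshold": len(kept),
        "unigram": unigrams,
        "multi_word": len(kept) - unigrams,
        # up to k phrases scoring above `high`, best first
        "top": [p for s, p in ranked if s > high][:k],
        # up to k phrases whose score lies within `margin` of the threshold
        "borderline": [p for s, p in ranked if abs(s - threshold) <= margin][:k],
    }


if __name__ == "__main__":
    # Pass one output file per run on the command line.
    for path in sys.argv[1:]:
        lines = Path(path).read_text(encoding="utf-8").splitlines()
        print(path, summarize(parse_lines(lines)))
```

Running the script with the two output files as arguments prints one summary per run, which makes the DBLP.5k.txt and DBLP.txt results easy to compare side by side.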

Week 7: Case Studies

  1. Why do we need case studies in addition to the quantitative results?
  2. How do case studies further support the claims in the papers?
  3. Do you have any interesting findings from either the case studies presented in the papers or the results you got from Week 5?

Week 8: Application Brainstorming

  1. What kinds of applications do you think could benefit from phrase mining? Why?
  2. Try to think broadly across more domains and languages.
  3. Based on your proposed applications, can we apply SegPhrase/AutoPhrase directly?
  4. Do you think any adaptation is necessary? If yes, how? If no, why not?

Week 9: Possible Extension

  1. What are the drawbacks of these two papers? Do you see any limitations?
  2. Can we do better in order to address these limitations?