2020-Fall-DSC180A05-Capstone: Text Mining and NLP

Undergraduate Class, HDSI, UCSD, 2020

Class Time: Wednesdays, 9 to 9:50 AM Pacific Time. Room: https://ucsd.zoom.us/j/91491702947. Piazza: piazza.com/ucsd/fall2020/dsc180a05.

Overview

This capstone section mainly focuses on text mining and natural language processing. We will explore cutting-edge research papers in these areas together and try to replicate some experiments for a deeper, better understanding.

We will mostly hold discussions in a Q&A format instead of traditional lectures. Due to COVID-19, the discussions will be held online over Zoom.

Papers to Read

Mining Quality Phrases from Massive Text Corpora
Jialu Liu*, Jingbo Shang*, Chi Wang, Xiang Ren and Jiawei Han. SIGMOD 2015. [code]
Automated Phrase Mining from Massive Text Corpora
Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss and Jiawei Han. TKDE 2018. [code]

Tips

These two papers are closely related. Please read them in order.
The GitHub README files also provide useful information.

Schedule

Week	Date	Discussion Focus
1	10/07	General Overview (a short lecture by Jingbo Shang) [recording]
2	10/14	Introduction & Motivation [recording]
3	10/21	Datasets and Experiment Design [recording]
4	10/28	Experimental Results - Analysis [recording]
5	11/04	Experimental Results - Replication [recording]
6	11/11	No class (Veterans Day)
7	11/18	Case Studies [recording]
8	11/25	Application Brainstorming [recording]
9	12/02	Possible Extension [recording]
10	12/09	Elevator Pitch

Discussion Questions

Week 2: Introduction & Motivation

Why do we want to study phrase mining? What’s the advantage of phrases over unigrams?
What’s the major problem when applying SegPhrase to a new corpus? Is any human effort required?
What’s the motivation of AutoPhrase? Compared with SegPhrase, which parts do you believe are novel?

Week 3: Datasets and Experiment Design

How many datasets are used in the papers? How many domains and languages are covered?
Why do we want to use such a diverse set of datasets? How is this related to the claims in the papers?
Why do we want to evaluate the results following the pooling strategy? Think about how much human effort is required, if we are not using pooling.
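
As a starting point for the pooling question, here is a minimal Python sketch of pooled evaluation. It is an illustration under stated assumptions, not the exact protocol from the papers: the method names, phrase lists, and labels are made up, and the helper functions build_pool and precision_at_k are hypothetical. The idea is that annotators label only the union of each method's top-k phrases rather than every candidate phrase.

```python
from typing import Dict, List, Set


def build_pool(rankings: Dict[str, List[str]], k: int) -> Set[str]:
    """Union of the top-k phrases from every method; only these get labeled."""
    pool: Set[str] = set()
    for ranked in rankings.values():
        pool.update(ranked[:k])
    return pool


def precision_at_k(ranked: List[str], labels: Dict[str, bool], k: int) -> float:
    """Fraction of a method's top-k phrases that annotators marked as good."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for phrase in top if labels[phrase]) / len(top)


# Hypothetical ranked outputs from two methods (made-up data).
rankings = {
    "method_a": ["data mining", "machine learning", "paper presents", "neural network"],
    "method_b": ["machine learning", "data mining", "support vector machine", "based on"],
}

k = 3
pool = build_pool(rankings, k)

# Hypothetical human labels, needed only for phrases in the pool.
labels = {
    "data mining": True,
    "machine learning": True,
    "paper presents": False,
    "support vector machine": True,
}

scores = {name: precision_at_k(ranked, labels, k) for name, ranked in rankings.items()}
print(len(pool), scores)
```

In this toy example the pool holds 4 phrases, although the two methods together output 6 distinct phrases. On a real corpus the gap between the pool size and the full candidate list is far larger, which is worth keeping in mind when estimating the annotation effort.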

Week 4: Experimental Results - Analysis

Please outline the claims in these two papers.
How can we understand each table and figure? What are the takeaways? One or two sentences per table/figure should be enough.
For each claim, where are the experimental results supporting it?

Week 5: Experimental Results - Replication

Carefully check the README file in the AutoPhrase repo. What is the relation between autophrase.sh and phrasal_segmentation.sh?
Try to run AutoPhrase using the DBLP.5k.txt and DBLP.txt datasets as the input corpus. It should be runnable on your laptop. Let me know if you encounter any issues.
Please eyeball the results from the two runs and try to compare them from the following aspects:
- The number of high-quality phrases (e.g., > 0.5)
- Unigram phrase vs. multi-word phrase
- The top few high-quality phrases (e.g., > 0.9) vs. borderline phrases (e.g., ~0.5)
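
If eyeballing gets tedious, the comparison above can be partly scripted. The following is a minimal Python sketch, assuming each line of the ranked phrase list has the form score, a tab, then the phrase; please check your own output files before relying on that. The helper names parse_ranked_phrases and summarize, the 0.05 borderline band, and the sample lines are all made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def parse_ranked_phrases(lines: Iterable[str]) -> List[Tuple[float, str]]:
    """Parse lines assumed to look like '<score>\\t<phrase>'."""
    phrases = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        score_text, _, phrase = line.partition("\t")
        phrases.append((float(score_text), phrase.strip()))
    return phrases


def summarize(phrases: List[Tuple[float, str]], high: float = 0.9,
              ok: float = 0.5, band: float = 0.05, top_k: int = 5) -> Dict[str, object]:
    """Collect the three aspects listed above for one run."""
    good = [(s, p) for s, p in phrases if s > ok]
    unigrams = sum(1 for _, p in good if len(p.split()) == 1)
    ranked = sorted(phrases, reverse=True)  # highest score first
    return {
        "num_above_ok": len(good),
        "num_unigram": unigrams,
        "num_multiword": len(good) - unigrams,
        "top": [p for s, p in ranked if s > high][:top_k],
        "borderline": [p for s, p in ranked if abs(s - ok) <= band][:top_k],
    }


# Made-up sample lines standing in for a real output file.
sample = [
    "0.95\tdata mining",
    "0.92\tsupport vector machine",
    "0.70\tclustering",
    "0.52\tbased approach",
    "0.30\tpaper presents",
]
report = summarize(parse_ranked_phrases(sample))
print(report)
```

To use it on a real run, pass an open file handle instead of the sample list, for example summarize(parse_ranked_phrases(open(path, encoding="utf-8"))), where path points to the ranked phrase list produced by that run. Calling it once per run gives two reports to compare side by side.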

Week 7: Case Studies

Why do we need case studies in addition to the quantitative results?
How do case studies further the claims in the papers?
Do you have any interesting findings from either the case studies presented in the papers or the results you got from Week 5?

Week 8: Application Brainstorming

What kinds of applications do you think could benefit from phrase mining? Why?
Try to think broadly for more domains/languages.
Based on your proposed applications, can we apply SegPhrase/AutoPhrase directly?
Do you think any adaptation is necessary? If yes, how? If no, why not?

Week 9: Possible Extension

What are the drawbacks of these two papers? Do you see any limitations?
Can we do better in order to address these limitations?


Jingbo Shang