2020-Fall-DSC180A05-Capstone: Text Mining and NLP
Undergraduate Class, HDSI, UCSD, 2020
Class Time: Wednesdays, 9 to 9:50 AM Pacific Time. Room: https://ucsd.zoom.us/j/91491702947. Piazza: piazza.com/ucsd/fall2020/dsc180a05.
Overview
This capstone section mainly focuses on text mining and natural language processing. We will explore cutting-edge research papers in these areas together and try to replicate some experiments for a deeper, better understanding.
We will mostly have discussions in a Q&A format instead of traditional lectures. Due to COVID-19, the discussions will be held online over Zoom.
Papers to Read
Mining Quality Phrases from Massive Text Corpora
Jialu Liu*, Jingbo Shang*, Chi Wang, Xiang Ren and Jiawei Han. SIGMOD 2015. [code]

Automated Phrase Mining from Massive Text Corpora
Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss and Jiawei Han. TKDE 2018. [code]
Tips
- These two papers are closely related. Please read them in order.
- The GitHub README files also provide useful information.
Schedule
| Week | Date | Discussion Focus |
| --- | --- | --- |
| 1 | 10/07 | General Overview (a short lecture by Jingbo Shang) [recording] |
| 2 | 10/14 | Introduction & Motivation [recording] |
| 3 | 10/21 | Datasets and Experiment Design [recording] |
| 4 | 10/28 | Experimental Results - Analysis [recording] |
| 5 | 11/04 | Experimental Results - Replication [recording] |
| 6 | 11/11 | No class (Veterans Day) |
| 7 | 11/18 | Case Studies [recording] |
| 8 | 11/25 | Application Brainstorming [recording] |
| 9 | 12/02 | Possible Extension [recording] |
| 10 | 12/09 | Elevator Pitch |
Discussion Questions
Week 2: Introduction & Motivation
- Why do we want to study phrase mining? What’s the advantage of phrases over unigrams?
- What is the major obstacle when applying SegPhrase to a new corpus? How much human effort does it require?
- What is the motivation behind AutoPhrase? Compared with SegPhrase, which parts do you believe are novel?
Week 3: Datasets and Experiment Design
- How many datasets are used in the papers? How many domains and languages are covered?
- Why do we want to use such a diverse set of datasets? How is this related to the claims in the papers?
- Why do we want to evaluate the results following the pooling strategy? Think about how much human effort would be required if we did not use pooling.
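To make the pooling idea concrete, here is a minimal sketch (not code from the papers; the function names, toy phrases, and judgments are made up for illustration): each method's top-k phrases are merged into one deduplicated, shuffled pool, annotators judge the pool blindly, and each method is then scored against those shared judgments.

```python
import random

def build_pool(method_results, k=5, seed=42):
    """Merge each method's top-k phrases into one deduplicated,
    shuffled pool so human annotators can judge them blindly."""
    pool = set()
    for phrases in method_results.values():
        pool.update(phrases[:k])
    pool = sorted(pool)            # deterministic order before shuffling
    random.Random(seed).shuffle(pool)
    return pool

def precision_at_k(phrases, judged_good, k=5):
    """Fraction of a method's top-k phrases that annotators marked good."""
    top = phrases[:k]
    return sum(p in judged_good for p in top) / len(top)

# Hypothetical top-5 outputs from the two methods.
method_results = {
    "SegPhrase": ["data mining", "support vector machine", "of the",
                  "neural network", "paper we"],
    "AutoPhrase": ["data mining", "deep learning", "machine learning",
                   "a novel", "neural network"],
}
pool = build_pool(method_results)   # 8 unique phrases to annotate, not 10
judged_good = {"data mining", "support vector machine", "neural network",
               "deep learning", "machine learning"}
print(precision_at_k(method_results["SegPhrase"], judged_good))   # 0.6
print(precision_at_k(method_results["AutoPhrase"], judged_good))  # 0.8
```

The point is annotation cost: humans label only the union of the top-k lists once, rather than each method's full ranked output.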
Week 4: Experimental Results - Analysis
- Please outline the claims in these two papers.
- How can we understand each table and figure? What are the takeaways? One or two sentences per table/figure should be enough.
- For each claim, where are the experimental results supporting it?
Week 5: Experimental Results - Replication
- Carefully check the README file in the AutoPhrase repo. What is the relation between `autophrase.sh` and `phrasal_segmentation.sh`?
- Try to run AutoPhrase using the `DBLP.5k.txt` and `DBLP.txt` datasets as the input corpus. It should be runnable on your laptop. Let me know if you encounter any issues.
- Please eyeball the results from the two runs and compare them from the following aspects:
  - The number of high-quality phrases (e.g., score > 0.5)
  - Unigram phrases vs. multi-word phrases
  - The top high-quality phrases (e.g., score > 0.9) vs. borderline phrases (e.g., score ≈ 0.5)
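As a rough aid for the comparison above, a few lines of Python can summarize one run's output. This sketch assumes each output line has the form `<score>\t<phrase>`; the actual file names and format are documented in the AutoPhrase README, so verify before relying on it:

```python
def phrase_stats(lines, hi=0.5, top=0.9):
    """Summarize phrase-mining output, assuming one '<score>\\t<phrase>'
    entry per line (check the AutoPhrase README for the real format)."""
    scored = []
    for line in lines:
        score, phrase = line.rstrip("\n").split("\t", 1)
        scored.append((float(score), phrase))
    high = [(s, p) for s, p in scored if s > hi]         # high-quality phrases
    return {
        "num_high_quality": len(high),
        "num_unigram": sum(1 for _, p in high if " " not in p),
        "num_multiword": sum(1 for _, p in high if " " in p),
        "top_phrases": [p for s, p in high if s > top],
        "borderline": [p for s, p in high if s <= 0.6],  # near the 0.5 cutoff
    }

# Toy sample standing in for one run's output file.
sample = ["0.95\tsupport vector machine", "0.91\tdata mining",
          "0.55\tpaper presents", "0.40\tof the", "0.92\tdatabase"]
stats = phrase_stats(sample)
print(stats["num_high_quality"], stats["num_unigram"], stats["num_multiword"])
```

Running it on both the `DBLP.5k.txt` and `DBLP.txt` outputs gives you the counts needed for the three comparison aspects above.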
Week 7: Case Studies
- Why do we need case studies in addition to the quantitative results?
- How do the case studies strengthen the claims in the papers?
- Do you have any interesting findings from either the case studies presented in the papers or the results you got from Week 5?
Week 8: Application Brainstorming
- What kinds of applications do you think could benefit from phrase mining? Why?
- Try to think broadly for more domains/languages.
- Based on your proposed applications, can we apply SegPhrase/AutoPhrase directly?
- Do you think some adaptation is necessary? If yes, how? If no, why not?
Week 9: Possible Extension
- What are the drawbacks of these two papers? Do you see any limitations?
- Can we do better in order to address these limitations?