This capstone section mainly focuses on text mining and natural language processing. We will explore cutting-edge research papers in these areas together and try to replicate some of their experiments for a deeper understanding.
We will mostly hold discussions in a Q&A format instead of traditional lectures. Due to the COVID-19 pandemic, the discussions will take place online over Zoom.
Papers to Read
- These two papers (SegPhrase and AutoPhrase) are highly related. Please read them in that order.
- The GitHub README files also provide useful information.
| Week | Date | Topic |
| --- | --- | --- |
| 1 | 10/07 | General Overview (a short lecture by Jingbo Shang) [recording] |
| 2 | 10/14 | Introduction & Motivation [recording] |
| 3 | 10/21 | Datasets and Experiment Design [recording] |
| 4 | 10/28 | Experimental Results - Analysis [recording] |
| 5 | 11/04 | Experimental Results - Replication [recording] |
| 6 | 11/11 | No class (Veterans Day) |
| 7 | 11/18 | Case Studies [recording] |
| 8 | 11/25 | Application Brainstorming [recording] |
| 9 | 12/02 | Possible Extension [recording] |
Week 2: Introduction & Motivation
- Why do we want to study phrase mining? What is the advantage of phrases over unigrams? (See the toy sketch after this list.)
- What is the major problem when applying SegPhrase to a new corpus? How much human effort is involved?
- What is the motivation behind AutoPhrase? Compared with SegPhrase, which parts do you believe are novel?
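To make the unigram-vs-phrase question concrete, here is a toy Python sketch (the corpus is invented, not from the papers) showing how unigram counts conflate unrelated contexts that phrase counts keep apart:

```python
# Toy illustration: unigram counts split a meaningful concept like
# "support vector machine" into ambiguous pieces, while counting the
# multi-word phrase keeps the concept intact.
from collections import Counter

corpus = [
    "we train a support vector machine on the data",
    "the support vector machine outperforms the baseline",
    "vector graphics support is provided by the machine",
]

# Unigram view: "support", "vector", and "machine" are counted separately,
# even when they come from an unrelated context (sentence 3).
unigram_counts = Counter(token for line in corpus for token in line.split())

# Phrase view: counting the multi-word unit avoids the false matches.
phrase = "support vector machine"
phrase_count = sum(line.count(phrase) for line in corpus)

print(unigram_counts["vector"])  # 3 -- inflated by the unrelated sentence
print(phrase_count)              # 2 -- only the true concept occurrences
```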
Week 3: Datasets and Experiment Design
- How many datasets are used in the papers? How many domains and languages are covered?
- Why do we want to use such a diverse set of datasets? How is this related to the claims in the papers?
- Why do we want to evaluate the results using the pooling strategy? Think about how much human effort would be required without pooling. (A small sketch of the idea follows this list.)
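Here is a minimal Python sketch of the pooling idea (the method names and phrase lists are hypothetical, and this is not the papers' exact protocol): instead of asking annotators to label every candidate from every method, we merge the top-k lists into one deduplicated, shuffled pool and annotate only that pool.

```python
# Pooling strategy sketch: merge the top-k phrases of each method into a
# single deduplicated pool, then shuffle so annotators cannot tell which
# method produced which phrase.
import random

def build_pool(ranked_lists, k=5, seed=0):
    """Merge the top-k phrases of each method into one shuffled pool."""
    pool = set()
    for ranked in ranked_lists.values():
        pool.update(ranked[:k])
    pool = sorted(pool)
    random.Random(seed).shuffle(pool)  # hide which method produced what
    return pool

# Hypothetical top-ranked outputs from two phrase-mining methods.
ranked_lists = {
    "SegPhrase":  ["data mining", "support vector machine", "new york"],
    "AutoPhrase": ["data mining", "neural network", "los angeles"],
}

pool = build_pool(ranked_lists, k=3)
print(len(pool))  # 5 unique phrases to annotate, instead of 6 separate labels
```

Note how the saving grows with the number of methods and the overlap between their lists; without pooling, the annotation cost is the sum of all list lengths.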
Week 4: Experimental Results - Analysis
- Please outline the claims in these two papers.
- How can we understand each table and figure? What are the takeaways? One or two sentences per table/figure should be enough.
- For each claim, where are the experimental results supporting it?
Week 5: Experimental Results - Replication
- Carefully check the README file in the AutoPhrase repo. What is the relation between `auto_phrase.sh` and `phrasal_segmentation.sh`?
- Try to run AutoPhrase using the `DBLP.5K.txt` and `DBLP.txt` datasets as the input corpus. It should be runnable on your laptop. Let me know if you encounter any issues.
- Please eyeball the results from the two runs and try to compare them from the following aspects (a comparison sketch follows this list):
- The number of high-quality phrases (e.g., score > 0.5)
- Unigram phrases vs. multi-word phrases
- The top few high-quality phrases (e.g., score > 0.9) vs. the borderline phrases (e.g., score ~ 0.5)
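Below is a minimal Python sketch for this comparison. It assumes each run writes its ranked list to an output file (e.g., `AutoPhrase.txt`) with one `score<TAB>phrase` pair per line; the paths `run_small/` and `run_full/` are placeholders for your own output directories, so adjust the parsing and paths to match what your version of the repo actually produces.

```python
# Sketch for comparing two AutoPhrase runs, assuming each output file
# lists one phrase per line in the form "<quality score>\t<phrase>".

def load_phrases(path):
    """Return a list of (score, phrase) pairs from an AutoPhrase output file."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            score, phrase = line.rstrip("\n").split("\t", 1)
            pairs.append((float(score), phrase))
    return pairs

def summarize(pairs, threshold=0.5):
    """Print the comparison aspects listed above for one run."""
    high = [(s, p) for s, p in pairs if s > threshold]
    uni = [p for _, p in high if " " not in p]      # unigram phrases
    multi = [p for _, p in high if " " in p]        # multi-word phrases
    top = [p for s, p in high if s > 0.9]           # near-certain phrases
    borderline = [p for s, p in high if s <= 0.6]   # just above threshold
    print(f"high-quality (> {threshold}): {len(high)}")
    print(f"  unigram: {len(uni)}, multi-word: {len(multi)}")
    print(f"  sample top (> 0.9): {top[:10]}")
    print(f"  sample borderline (~0.5): {borderline[:10]}")

# Hypothetical paths -- point these at your two runs' output files.
summarize(load_phrases("run_small/AutoPhrase.txt"))
summarize(load_phrases("run_full/AutoPhrase.txt"))
```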
Week 7: Case Studies
- Why do we need case studies in addition to the quantitative results?
- How do the case studies strengthen the claims in the papers?
- Do you have any interesting findings from either the case studies presented in the papers or the results you got from Week 5?
Week 8: Application Brainstorming
- What kinds of applications do you think could benefit from phrase mining? Why?
- Try to think broadly across more domains and languages.
- Based on your proposed applications, can we apply SegPhrase/AutoPhrase directly?
- Do you think some adaptation is necessary? If yes, how? If no, why not?
Week 9: Possible Extension
- What are the drawbacks of these two papers? Do you see any limitations?
- Can we do better in order to address these limitations?