Class Time: Wednesdays, 1 to 1:50 PM Pacific Time. Room: https://ucsd.zoom.us/j/91491702947.
This capstone section mainly focuses on text mining and natural language processing. We will explore cutting-edge research papers in these areas together and try to replicate some experiments for a deeper, better understanding.
We will mostly hold discussions in a Q&A format instead of traditional lectures. Due to COVID-19, the discussions will be held online over Zoom.
Papers to Read
Mining Quality Phrases from Massive Text Corpora
Jialu Liu*, Jingbo Shang*, Chi Wang, Xiang Ren and Jiawei Han. SIGMOD 2015. [code]
Automated Phrase Mining from Massive Text Corpora
Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss and Jiawei Han. TKDE 2018. [code]
- These three papers are highly related. Please read them one by one.
- The GitHub README files also provide useful information.
- PDFs of these papers can be found online or here
|Week|Date|Topic|
|---|---|---|
|1|09/29|General Overview (a short lecture by Jingbo Shang)|
|2|10/06|Introduction & Motivation|
|3|10/13|Datasets and Experiment Design|
|4|10/20|Experimental Results - Analysis|
|5|10/27|Experimental Results - Replication|
|9|11/24|Report Writing Discussion|
Week 2: Introduction & Motivation
- Why do we want to study phrase mining? What’s the advantage of phrases over unigrams?
- What’s the major problem when someone is going to apply SegPhrase to a new corpus? Is there any human effort?
- What’s the motivation of AutoPhrase? Compared with SegPhrase, which parts do you believe are novel?
- What’s the motivation of UCPhrase? Compared with AutoPhrase and SegPhrase, what are the major innovations in UCPhrase?
Week 3: Datasets and Experiment Design
- How many datasets are used in the papers? How many domains and languages are covered?
- Why do we want to use such a diverse set of datasets? How is this related to the claims in the papers?
- Why do we want to evaluate the results following the pooling strategy? Think about how much human effort is required, if we are not using pooling.
- Why does UCPhrase use some evaluation settings that differ from those of AutoPhrase and SegPhrase?
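To make the pooling question concrete, here is a minimal Python sketch of the general pooling idea: take the top-k phrases from each method, merge them into one pool, have annotators label only that pool, and score every method against the shared labels. The method names, phrases, and labels below are made up for illustration, and the exact protocol in the papers may differ (for example, in how the pool is sampled), so please check the papers for the details.

```python
from typing import Dict, List, Set


def build_pool(rankings: Dict[str, List[str]], k: int) -> Set[str]:
    """Union of the top-k phrases from every method; only this pool is annotated."""
    pool: Set[str] = set()
    for ranked in rankings.values():
        pool.update(ranked[:k])
    return pool


def precision_at_k(ranked: List[str], labels: Dict[str, bool], k: int) -> float:
    """Fraction of a method's top-k phrases that annotators marked as high quality."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(labels[p] for p in top) / len(top)


# Hypothetical ranked outputs from two methods (illustrative only).
rankings = {
    "method_a": ["data mining", "machine learning", "paper presents"],
    "method_b": ["machine learning", "we propose", "neural network"],
}
k = 2
pool = build_pool(rankings, k)

# Hypothetical human judgments, collected only for phrases in the pool.
labels = {"data mining": True, "machine learning": True, "we propose": False}

print(len(pool))
for name, ranked in rankings.items():
    print(name, precision_at_k(ranked, labels, k))
```

Because the two methods share a phrase in their top 2, the pool holds fewer phrases than the two lists combined, and nothing outside the pool needs a label. That gap is the annotation effort pooling saves.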
Week 4: Experimental Results - Analysis
- Please outline the claims in these three papers.
- How can we understand each table and figure? What are the takeaways? One or two sentences per table/figure should be enough.
- For each claim, where are the experimental results supporting it?
Week 5: Experimental Results - Replication
- Carefully check the README file in the AutoPhrase repo. What is the relation between
- Try to run AutoPhrase using the
DBLP.txt datasets as the input corpus. It should be runnable on your laptop. Let me know if you encounter any issues.
- Please eyeball the results from the two runs and try to compare them from the following aspects:
  - The number of high-quality phrases (e.g., > 0.5)
  - Unigram phrases vs. multi-word phrases
  - The top few high-quality phrases (e.g., > 0.9) vs. borderline phrases (e.g., ~0.5)
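If eyeballing gets tedious, a small script can help with the comparison. The sketch below assumes each output line has the form `score<TAB>phrase`; please verify this against the actual output files of your runs before relying on it. The function names, the thresholds, and the sample lines are illustrative choices, not part of AutoPhrase.

```python
from typing import Dict, Iterable, List, Tuple


def parse_scored_phrases(lines: Iterable[str]) -> List[Tuple[float, str]]:
    """Parse lines of the assumed form '<score>\\t<phrase>'; skip malformed lines."""
    parsed: List[Tuple[float, str]] = []
    for line in lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue
        try:
            score = float(parts[0])
        except ValueError:
            continue
        parsed.append((score, parts[1].strip()))
    return parsed


def summarize(phrases: List[Tuple[float, str]], high: float = 0.5,
              top: float = 0.9, band: float = 0.05) -> Dict[str, object]:
    """Count high-quality phrases and split them into unigram vs. multi-word."""
    good = [p for s, p in phrases if s > high]
    return {
        "n_high_quality": len(good),
        "n_unigram": sum(1 for p in good if len(p.split()) == 1),
        "n_multiword": sum(1 for p in good if len(p.split()) > 1),
        "top": [p for s, p in phrases if s > top],
        # "Borderline" here means within `band` of the `high` threshold.
        "borderline": [p for s, p in phrases if abs(s - high) <= band],
    }


# Made-up sample lines standing in for one run's output file.
sample = [
    "0.95\tdata mining\n",
    "0.92\tsupport vector machine\n",
    "0.70\tclustering\n",
    "0.52\tnovel approach\n",
    "0.30\tpaper presents\n",
    "not a scored line\n",
]
summary = summarize(parse_scored_phrases(sample))
print(summary)
```

To use it on a real run, pass an open file handle to `parse_scored_phrases` instead of `sample`, then compare the two summaries side by side.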
Week 6: Case Studies
- Why do we need case studies in addition to the quantitative results?
- How do case studies further support the claims in the papers?
- Do you have any interesting findings from either the case studies presented in the papers or the results you got from Week 5?
Week 7: Application Brainstorming
- What kinds of applications do you think could benefit from phrase mining? Why?
- Try to think broadly for more domains/languages.
- Based on your proposed applications, can we apply SegPhrase/AutoPhrase directly?
- Do you think any adaptation is necessary? If yes, how? If no, why?
Week 8: Possible Extension
- What are the drawbacks of these three papers? Do you see any limitations?
- Can we do better in order to address these limitations?
Week 9: Report Writing Discussion
- Do you have any questions about the final report writing?
- How to prepare informative Figures and Tables?
- How to properly cite previous work?
- How to make the proposal look more promising?
Week 10: Elevator Pitch
We will have a timed rehearsal for the elevator pitch.