Jingbo Shang

NSF C-Accel Award OIA-2040727 (Project Page)

2020-11-01T11:00:00-08:00

NSF Convergence Accelerator Track D:

Towards Intelligent Sharing and Search for AI Models and Datasets

Jingbo Shang, Rajesh Gupta, Lucila Ohno-Machado, Arun Kumar, Giorgio Quer
University of California San Diego & Scripps Research

Abstract

A major goal of AI-driven applications is to discover the underlying patterns in domain-specific datasets, which typically requires tremendous field experience and interdisciplinary knowledge to design or even select suitable AI models. For instance, AI modeling for COVID-19 patient imaging and social distancing datasets requires an understanding of not only the epidemiological processes but also the bioinformatics that informs mutation rates and their effects on models, coupled with socio-economic models that accurately capture living and working conditions. Such a model selection process is far beyond the capabilities of the search services available on existing platforms (e.g., Google Dataset Search, IEEE DataPort, and GitHub).

We envision an open-source, privacy-preserving intelligent system for searching and navigating through large-scale collections of AI models and datasets for scientific and other applications. The envisioned system would transform AI models and datasets into ‘computational resources’ such that model-dataset pairs can be searched and matched easily based on their semantics. It will serve as a sharing portal for models and datasets matched via contextual information, captured as ‘metadata’ that relies upon innovations in metadata methods and tools in the application context. More importantly, the confidential and private information embedded in the models and datasets will be protected by developing novel, rigorous privacy techniques. This way, our system would allow clinicians to upload a patient imaging dataset and issue a query, such as “Coronavirus hazard assessment from chest CT”, and it would then return suitable AI models and related datasets without the risk of leaking patient information.
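As a rough illustration of matching a query against model and dataset metadata, the sketch below ranks catalog entries by bag-of-words cosine similarity. It is not the project's method: plain lexical overlap stands in for semantic matching, the privacy-preserving component is not modeled, and the catalog entries and the `search` helper are hypothetical names made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, catalog, top_k=3):
    """Rank catalog entries by similarity between the query and their metadata."""
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(description))), name)
        for name, description in catalog.items()
    ]
    # Highest score first; ties broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Drop entries that share no tokens with the query.
    return [(name, round(score, 3)) for score, name in scored[:top_k] if score > 0]


# Hypothetical metadata descriptions for two resources and one unrelated model.
CATALOG = {
    "ct-severity-model": "CNN model for coronavirus severity assessment from chest CT scans",
    "sensor-tagger": "sequence tagger for building sensor metadata names",
    "chest-ct-dataset": "de-identified chest CT imaging dataset of coronavirus patients",
}

results = search("Coronavirus hazard assessment from chest CT", CATALOG)
for name, score in results:
    print(name, score)
```

In this toy catalog, the query surfaces the chest-CT model and dataset and filters out the unrelated sensor tagger. A real system would presumably replace the token counts with learned representations of the metadata.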

[Marketing Video]

Team

PI/Co-PI: Jingbo Shang, Rajesh Gupta, Lucila Ohno-Machado, Arun Kumar, Giorgio Quer
Senior Personnel: Luca Bonomi, Dezhi Hong
Graduate Student Research Assistants: Ranak Roy Chowdhury, Zichao Li, Vraj Shah, Xianjie Shen, Zihan Wang, Zeyun Wu
External Partners: IEEE DataPort, Google TensorFlow Extended Team, Amazon AWS, Databricks, OpenML, Snowflake, Tempus Lab

Publication & Pre-Prints

Unsupervised Deep Keyphrase Generation
Xianjie Shen, Yinghan Wang, Rui Meng and Jingbo Shang. AAAI 2022. [arXiv:2104.08729] [code]
UniTS: Short-Time Fourier Inspired Neural Networks for Sensory Time Series Classification
Shuheng Li, Ranak Roy Chowdhury, Jingbo Shang, Rajesh K. Gupta and Dezhi Hong. SenSys 2021. [code]
Coarse2Fine: Fine-grained Text Classification on Coarsely-grained Annotated Data
Dheeraj Mekala, Varun Gangal and Jingbo Shang. EMNLP 2021.
“Average” Approximates “First Principal Component”? An Empirical Analysis on Representations from Neural Language Models
Zihan Wang, Chengyu Dong and Jingbo Shang. EMNLP 2021. (short) [arXiv:2104.08673] [code]
BFClass: A Backdoor-free Text Classification Framework
Zichao Li, Dheeraj Mekala, Chengyu Dong and Jingbo Shang. EMNLP (Findings) 2021.
UCPhrase: Unsupervised Context-aware Quality Phrase Tagging
Xiaotao Gu*, Zihan Wang*, Zhenyu Bi, Yu Meng, Liyuan Liu, Jiawei Han and Jingbo Shang. KDD 2021. [arXiv:2105.14078] [code]
X-Class: Text Classification with Extremely Weak Supervision
Zihan Wang, Dheeraj Mekala and Jingbo Shang. NAACL 2021. [code]
TaxoClass: Hierarchical Multi-Label Text Classification Using Only Class Names
Jiaming Shen, Wenda Qiu, Yu Meng, Jingbo Shang, Xiang Ren and Jiawei Han. NAACL 2021.
“Misc”-Aware Weakly Supervised Aspect Classification
Peiran Li*, Fang Guo* and Jingbo Shang. SDM 2021. [arXiv:2004.14555] [code]
Sensei: Self-Supervised Sensor Name Segmentation
Jiaman Wu, Dezhi Hong, Rajesh Gupta and Jingbo Shang. ACL (Findings) 2021. [arXiv:2101.00130] [code]
SeNsER: Learning Cross-Building Sensor Metadata Tagger
Yang Jiao*, Jiacheng Li*, Jiaman Wu, Dezhi Hong, Rajesh Gupta and Jingbo Shang. EMNLP (Findings) 2020. [code]
META: Metadata-Empowered Weak Supervision for Text Classification
Dheeraj Mekala, Xinyang Zhang and Jingbo Shang. EMNLP 2020. [code]
Towards Benchmarking Feature Type Inference for AutoML Platforms
Vraj Shah, Jonathan Lacanlale, Premanand Kumar, Kevin Yang, and Arun Kumar. SIGMOD 2021. [code] [project page]
A Comprehensive Explanation Framework for Biomedical Time Series Classification
Praharsh Ivaturi, Matteo Gadaleta, Amitabh C Pandey, Michael Pazzani, Steven R Steinhubl, and Giorgio Quer. IEEE J Biomed Health Inform. Feb. 2021.
Machine Learning and the Future of Cardiovascular Care: JACC State-of-the-Art Review
Giorgio Quer, Ramy Arnaout, Michael Henne, and Rima Arnaout. J Am Coll Cardiol. Jan. 2021.
Improving Feature Type Inference Accuracy of TFDV with SortingHat
Vraj Shah, Kevin Yang, and Arun Kumar. Preprint.

Contact

jshang [at] ucsd [dot] edu

Acknowledgment

This project is supported in part by the NSF Convergence Accelerator under award OIA-2040727.

SIGKDD Dissertation Award Runner-up!

2020-08-25T07:00:00-07:00

My PhD thesis “Constructing and Mining Heterogeneous Information Networks From Massive Text” has been awarded Runner-up in the SIGKDD Dissertation Award competition. Thanks a lot to my advisor Dr. Jiawei Han and the award committee! Here is a brief video introduction to my work.

KDD 2020 Tutorial

2020-08-19T07:00:00-07:00

We are going to discuss “Scientific Text Mining and Knowledge Graphs”. More details can be found here.

ICML Paper Accepted!

2020-06-01T07:00:00-07:00

Our novel training method, LipGrow, can save ~50% of the training time for deep ResNets, with theory to back it up!

Two ACL Papers!

2020-04-04T07:00:00-07:00

Two of our papers were recently accepted by ACL. Travel plans are not clear yet due to COVID-19. The camera-ready versions are coming soon. Please stay tuned.

Contextualized Weak Supervision for Text Classification. Dheeraj Mekala, Jingbo Shang. ACL 2020.
Empower Entity Set Expansion via Language Model Probing. Yunyi Zhang, Jiaming Shen, Jingbo Shang and Jiawei Han. ACL 2020.

Topcoder X UCSD Lightning Marathon Match

2020-02-25T06:00:00-08:00

To celebrate the annual event of Halıcıoğlu Data Science Institute, we collaborate with Topcoder to bring a Lightning Data Science Marathon Match to UCSD! It will run from 5 PM PST Feb 26 (Wed) to 5 AM PST Mar 1 (Sunday). We will announce the winners during the HDSI annual event on Mar 2.

The match will let you exercise your brain and help solve a real-world problem. Make sure you have your coffee ready to compete in this intense ~84-hour battle! While you hone your skills and prove yourself the best of the lot, you also get a chance to win a share of $2000 in prizes, plus Topcoder T-shirts.

Problem Statement: While the problem will be kept secret until the contest launches, it will certainly carry the following tags: interesting, real-world, data science, machine learning, prediction! The ranking is based purely on the accuracy of your predictions.

Duration: ~84 Hours

Prizes: $1000, $500, $250, $150, $100 and Topcoder T-shirts for Top 20

How to compete?

In order to compete in a Topcoder Marathon Match, you will need to click the Register button next to the appropriate Marathon Match within the Active Challenge List and agree to the rules of the event. Once you register, make sure you go into the challenge forums and check out the discussions there.

Please fill out the following form if you want to participate in the challenge: https://forms.gle/d4Mcsskn6FfcoSNq9

Want to practice?

To learn how to compete and submit on the Topcoder platform, and to get a feel for a Topcoder Marathon Match, you can try this practice match: https://www.topcoder.com/challenges/30094385

VLDB 2019 Tutorial (Tutorial Page)

2019-08-23T07:00:00-07:00

VLDB 2019 Tutorial:

Tutorial 6: TextCube: Automated Construction and Multidimensional Exploration

Yu Meng, Jiaxin Huang, Jingbo Shang, Jiawei Han
Computer Science Department, University of Illinois at Urbana-Champaign
Time: 2:00PM - 5:30PM, Aug 29, 2019
Location: Avalon

Slides

Preliminary Version

Abstract

Today’s society is immersed in a wealth of text data, ranging from news articles, to social media, research literature, medical records, and corporate reports. A grand challenge of data science and engineering is to develop effective and scalable methods to extract structures and knowledge from massive text data to satisfy diverse applications, without extensive, corpus-specific human annotations.

In this tutorial, we show that TextCube provides a critical information organization structure that will satisfy such an information need. We give an overview of a set of recently developed data-driven methods that facilitate automated construction of TextCubes from massive, domain-specific text corpora, and show that TextCubes so constructed will enhance text exploration and analysis for various applications. We focus on new TextCube construction methods that are scalable, weakly-supervised, domain-independent, language-agnostic, and effective (i.e., generating quality TextCubes from large corpora of various domains). We will demonstrate, with real datasets (including news articles, scientific publications, and product reviews), how TextCubes can be constructed to assist multidimensional analysis of massive text corpora.

Outline

Introduction
- Motivations & Prior Arts
- Overview of Multidimensional Text Analysis
Phrase Mining
- What are quality phrases?
- Supervised Methods
  - Noun Phrase Chunking Methods
  - Parsing-based Methods
  - How to rank entities at the corpus-level?
- Unsupervised Methods
  - Raw Frequency based Methods
  - Concordance based Methods
  - Topic Model based Methods
  - Comparative Methods
- Weakly/Distantly Supervised Methods
  - Phrasal Segmentation and its Variants
  - How to leverage distant supervision?
- System demos and software introduction
  - A multilingual phrase mining system which integrates AutoPhrase, SegPhrase, and TopMine together and supports phrase mining in multiple languages (e.g., English, Spanish, Chinese, Arabic, and Japanese).
Text Representation
- Unsupervised Word Embedding
  - Context-free representation
  - Contextualized representation
- Other embeddings: Network embedding
  - DeepWalk, LINE, node2vec, …
- Category name-guided word embedding
  - CatE
- System demos and software introduction
  - Our CatE system demo
Entity Recognition
- What is named entity recognition?
- Handcrafted Features + Human Supervision
  - Classical Models: Conditional Random Field
  - Stanford NER
  - Twitter NER
- Automated Features + Human Supervision
  - LSTM-CRF, LSTM-CNN-CRF, …
  - LM-LSTM-CRF, ELMo, Flair, …
  - Multi-task learning
- Automated Features + Distant Supervision
  - AutoEntity, SwellShark, ClusType, Distant-LSTM-CRF, …
  - FuzzyCRF & AutoNER
- System Demos and Software
  - Named entity recognition inference Python package: LightNER. This module helps users easily apply the pre-trained NER models to their own corpus in an efficient and portable manner.
Text Cube Construction
- Taxonomy Basics and Construction
- Cluster-based Taxonomy Construction
  - Hierarchical Topic Modeling
  - General Graphical Model Approach
  - Hierarchical Clustering
- Text Cube Basics and Construction
  - What is Text Cube?
  - Automatic document allocation for Text Cube construction
- System Demos and Software
  - Publication Dataset Analysis Demo
Text Cube Exploration
- Cube-based Multidimensional Analysis
  - Statistical Measures Aggregation
  - Phrase-based Cell Summarization
  - Key N-gram based Ranking and Exploration
- System Demos and Software
  - Demo: MissionCube
Summary and Future Directions
- Summary of Text Cube
  - Principles and Techniques
  - Advantages and Limitations
  - How to build a text cube based on your application?
- Future Directions

Presenters

Yu Meng, Ph.D. student, Computer Science, UIUC. His research focuses on mining structured knowledge from massive text corpora with minimum human supervision.

Jiaxin Huang, Ph.D. student, Computer Science, UIUC. Her research focuses on mining structured knowledge from massive text corpora. She is the recipient of Chirag Foundation Graduate Fellowship in Computer Science.

Jingbo Shang, Ph.D. candidate, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research focuses on mining and constructing structured knowledge from massive text corpora with minimum human effort. His research has been recognized by multiple prestigious awards, including the Grand Prize of the Yelp Dataset Challenge (2015) and a Google PhD Fellowship (2017-2019) in Structured Data and Database Management. Mr. Shang has rich experience in delivering tutorials at major conferences (SIGMOD’17, WWW’17, SIGKDD’17, SIGKDD’18, SIGKDD’19).

Jiawei Han is Abel Bliss Professor in the Department of Computer Science at the University of Illinois. His research covers data mining, information network analysis, and database systems, with over 600 publications. He served as the founding Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data (TKDD). Jiawei has received the ACM SIGKDD Innovation Award (2004), the IEEE Computer Society Technical Achievement Award (2005), the IEEE Computer Society W. Wallace McDowell Award (2009), and the Daniel C. Drucker Eminent Faculty Award at UIUC (2011). He is a Fellow of ACM and a Fellow of IEEE. He is currently the Director of the Information Network Academic Research Center (INARC) supported by the Network Science-Collaborative Technology Alliance (NS-CTA) program of the U.S. Army Research Lab. His co-authored textbook “Data Mining: Concepts and Techniques” (Morgan Kaufmann) has been adopted worldwide.

Joining UC San Diego as Assistant Professor!

2019-07-24T12:00:00-07:00

I’m joining UC San Diego as Assistant Professor starting from Jan 2020. I will be jointly appointed by Computer Science and Engineering (CSE) and Halıcıoğlu Data Science Institute (HDSI). I’m looking for talented & dedicated students to work together on data-driven, large-scale text mining and network mining problems.

KDD 2019 Tutorial Accepted! (Tutorial Page)

2019-04-22T12:00:00-07:00

SIGKDD 2019 Tutorial:

T17: Constructing and Mining Heterogeneous Information Networks from Massive Text

Jingbo Shang, Jiaming Shen, Liyuan Liu, Jiawei Han
Computer Science Department, University of Illinois at Urbana-Champaign
Time: 1:00PM - 5:00PM, Aug 4, 2019
Location: Kahtnu 2-Level 2, Dena’ina

Slides

Preliminary Version

Abstract

Real-world data exists largely in the form of unstructured text. A grand challenge in data mining research is to develop effective and scalable methods that can transform unstructured text into structured knowledge. In our vision, it is highly beneficial to transform such text into structured heterogeneous information networks, from which actionable knowledge can be generated based on the user’s need.

In this tutorial, we provide a comprehensive overview of recent research and development in this direction. First, we introduce a series of effective methods that construct heterogeneous information networks from massive, domain-specific text corpora. Then we discuss methods that mine such text-rich networks based on the user’s need. Specifically, we focus on scalable, effective, weakly supervised, language-agnostic methods that work on various kinds of text. We further demonstrate, on real datasets (including news articles, scientific publications, and product reviews), how information networks can be constructed and how they can assist further exploratory analysis.

Outline

Introduction
- Motivations: Why construction and mining of heterogeneous information networks from massive text?
- An overview of network construction from massive texts
- An overview on exploration of applications of constructed networks
Phrase Mining
- Why phrase mining and how to define high-quality phrases?
- Supervised Methods
  - Noun Phrase Chunking Methods
  - Parsing-based Methods
  - How to rank entities at the corpus-level?
- Unsupervised Methods
  - Raw Frequency based Methods
  - Concordance based Methods
  - Topic Model based Methods
  - Comparative Methods
- Weakly/Distantly Supervised Methods
  - Phrasal Segmentation and its Variants
  - How to leverage distant supervision?
- System demos and software introduction
  - A multilingual phrase mining system which integrates AutoPhrase, SegPhrase, and TopMine together and supports phrase mining in multiple languages (e.g., English, Spanish, Chinese, Arabic, and Japanese).
Information Extraction: Entity, Attribute, and Relation
- What is Named Entity Recognition (NER)?
- Traditional Supervised Methods
  - CoNLL03 shared task
  - Sequence labeling framework
  - Conditional random fields
  - Handcrafted features
- Modern End-to-End Neural Models
  - Bidirectional LSTM-based models
  - Language model and contextualized representations
  - Raw-to-end models
- Distantly Supervised Models
  - Data programming for entity typing
  - Learning from domain-specific dictionaries
- Meta-Pattern based Information Extraction
  - Meta-Pattern Discovery
  - Meta-Pattern-Enhanced NER
- System Demos and Software
  - Named entity recognition inference Python package: LightNER. This module helps users easily apply the pre-trained NER models to their own corpus in an efficient and portable manner.
Taxonomy Construction
- Taxonomy Basics
  - Taxonomy Definition
  - Taxonomy Application
  - Taxonomy Construction Pipeline
- Instance-based Taxonomy Construction
  - Used Resources Overview
  - Pattern-based Methods
  - Supervised Methods
  - Weakly-supervised Methods
- Cluster-based Taxonomy Construction
  - Hierarchical Topic Modeling
  - General Graphical Model Approach
  - Hierarchical Clustering
Mining Heterogeneous Information Networks (Structured Analysis)
- Basic Analysis System Demo
  - AutoNet system: It constructs a huge structured network from the PubMed papers (title & abstract) and supports online construction (new documents) and intelligent exploration (search).
- Summarization
  - Graph-based Summarization
  - Clustering and Ranking for Summarization
- Meta-Path Guided Exploration
  - Meta-Path based Similarity
  - Meta-Path guided Node Embedding
- Link Prediction
  - Task-Guided Node Embedding
  - Link Enrichment in Constructed Networks
Summary and Future Directions
- Summary
  - Principles and Techniques
  - Advantages and Limitations
- Challenges and Future Research Directions
- Interaction with the Audience
  - How to construct and mine heterogeneous information networks based on your text data and application need?
Question Answering and Discussions

Presenters

Jiaming Shen, Ph.D. candidate, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research focuses on turning massive unstructured text corpora into structured knowledge, for better retrieval, exploration, and analysis of domain-specific corpora. He is the recipient of the Brian Totty Graduate Fellowship in 2016.

Liyuan Liu, Ph.D. candidate, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research interest mainly lies in data-driven text mining, including contextualized representations with language modeling, weak and heterogeneous supervision.

WWW Paper Accepted!

2019-01-11T06:00:00-08:00

Our NetTaxo paper has been accepted by WWW 2020. The camera-ready version is coming soon. Please stay tuned.

Jingbo Shang*, Xinyang Zhang*, Liyuan Liu, Sha Li and Jiawei Han, “NetTaxo: Automated Topic Taxonomy Construction from Large-Scale Text-Rich Network”, in Proc. 2020 Int. World Wide Web Conf. (WWW’20), Taipei, Taiwan, Apr. 2020