Learning Named Entity Tagger using Domain-Specific Dictionary

Jingbo Shang, Liyuan Liu, Xiang Ren, Xiaotao Gu, Teng Ren, Jiawei Han

TL;DR

Without any line-by-line annotations, AutoNER trains named entity taggers using distant supervision.

Benchmark Results

We list the performance comparison on the BC5CDR dataset, which is annotated with disease and chemical entities. With our proposed labeling scheme and distant training, AutoNER achieves performance competitive with the supervised method.

Method                 Precision   Recall   F1
Supervised Benchmark   88.84       85.16    86.96
Dictionary Match       93.93       58.35    71.98
Fuzzy-LSTM-CRF         88.27       76.75    82.11
AutoNER                88.96       81.00    84.80
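
As a quick sanity check on the table above, the F1 column can be recomputed as the harmonic mean of precision and recall. The snippet below is only an illustrative sketch; the helper name `f1_score` and the `rows` dictionary are introduced here for the example and are not part of the AutoNER code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "Supervised Benchmark": (88.84, 85.16, 86.96),
    "Dictionary Match": (93.93, 58.35, 71.98),
    "Fuzzy-LSTM-CRF": (88.27, 76.75, 82.11),
    "AutoNER": (88.96, 81.00, 84.80),
}

for method, (p, r, reported) in rows.items():
    print(f"{method}: computed F1 = {f1_score(p, r):.2f}, reported = {reported:.2f}")
```

The recomputed values agree with the reported ones to within 0.01; differences of that size are expected because the precision and recall figures are themselves rounded.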

A Not-Too-Short Note

Below is a brief introduction. Please refer to our paper for more details.

Problem Setting

The input contains only a domain-specific entity dictionary and raw texts. There are no human annotations.

Specifically, we require two dictionaries as input.

Challenges

We obtain the distant supervision through string matching between the dictionary and the corpus. When a high-quality phrase is matched, we mark the corresponding token span as "unknown". Such a token span can be an entity of a type of interest, an entity of another type, part of some entity, or a non-entity token span.
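
To make the string-matching step concrete, here is a minimal sketch. It assumes greedy longest-match over lowercased tokens, which is a simplification chosen for illustration and not necessarily the matching rule used in the paper. The function name `match_dictionary`, the `max_len` parameter, and the toy dictionary entries are all hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def match_dictionary(tokens: List[str],
                     dictionary: Dict[str, str],
                     max_len: int = 5) -> List[Span]:
    """Greedy longest-match of dictionary phrases against a token list.

    `dictionary` maps a lowercased phrase to either an entity type or
    the string "unknown" (for high-quality phrases without a type).
    """
    spans: List[Span] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in dictionary:
                spans.append((i, i + n, dictionary[phrase]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


# Toy example (hypothetical entries, not taken from the released dictionaries).
toy_dictionary = {
    "prostaglandin": "Chemical",
    "kidney failure": "Disease",
    "prostaglandin synthesis": "unknown",
}
tokens = "Prostaglandin synthesis may lead to kidney failure".split()
print(match_dictionary(tokens, toy_dictionary))
# → [(0, 2, 'unknown'), (5, 7, 'Disease')]
```

In this toy sentence the longer phrase "prostaglandin synthesis" wins over the typed entry "prostaglandin", so the span is marked "unknown", which illustrates why such spans cannot simply be treated as entities or as non-entities.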

State-of-the-art NER models mainly follow the BIO/BIOES labeling scheme, which is not compatible with "unknown" token spans. We tried extending the BIOES labeling scheme with a fuzzy CRF layer (shown below) to model "unknown" token spans, but the performance was not satisfactory.

[Figure: framework with the fuzzy CRF layer]

Our Solution

We propose a new labeling scheme, Tie or Break, as well as a new neural framework, AutoNER, to better leverage the distant supervision.

Tie or Break. Instead of annotating each token, we choose to annotate the gap between two adjacent tokens.

A token span is then defined by two consecutive Breaks. For each token span, we treat predicting its type as a multi-label, multi-class classification problem. We also add a new None type for non-entity spans.
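
The way spans fall out of the gap labels can be sketched as follows. This is an illustrative decoder only, not the released implementation: it assumes that each gap carries either "Tie" or "Break" at decoding time and that the sentence start and end act as implicit Breaks. The function name `spans_from_breaks` and the example sentence are hypothetical.

```python
from typing import List, Tuple


def spans_from_breaks(tokens: List[str],
                      gap_labels: List[str]) -> List[Tuple[int, int]]:
    """Cut a sentence into token spans at every "Break" gap.

    gap_labels[i] labels the gap between tokens[i] and tokens[i + 1],
    so a sentence of n tokens has n - 1 gap labels. Returned spans are
    (start, end_exclusive) token offsets.
    """
    if not tokens:
        return []
    if len(gap_labels) != len(tokens) - 1:
        raise ValueError("expected one label per gap between adjacent tokens")

    spans: List[Tuple[int, int]] = []
    start = 0  # sentence start is treated as an implicit Break (assumption)
    for i, label in enumerate(gap_labels):
        if label == "Break":
            spans.append((start, i + 1))
            start = i + 1
    spans.append((start, len(tokens)))  # sentence end as an implicit Break
    return spans


tokens = ["aspirin", "reduces", "heart", "attack", "risk"]
gap_labels = ["Break", "Break", "Tie", "Break"]
print(spans_from_breaks(tokens, gap_labels))
# → [(0, 1), (1, 2), (2, 4), (4, 5)]
```

Each resulting span, such as `(2, 4)` covering "heart attack", would then be passed to the type classifier, which assigns either one or more entity types or None.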

There are two important observations:

AutoNER. Using our Tie or Break labeling scheme, we design a new neural framework that leverages both character- and word-level information, as shown below.

[Figure: the AutoNER framework]

BibTeX

Please cite the following papers if you find the code and datasets useful.

@inproceedings{shang2018learning,
  title = {Learning Named Entity Tagger using Domain-Specific Dictionary},
  author = {Shang, Jingbo and Liu, Liyuan and Ren, Xiang and Gu, Xiaotao and Ren, Teng and Han, Jiawei},
  booktitle = {EMNLP},
  year = {2018}
}

@article{shang2018automated,
  title = {Automated phrase mining from massive text corpora},
  author = {Shang, Jingbo and Liu, Jialu and Jiang, Meng and Ren, Xiang and Voss, Clare R and Han, Jiawei},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  year = {2018},
  publisher = {IEEE}
}