Sequence Labeling 2: Named Entity Recognition

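SIGHAN is the Chinese language processing group of the Association for Computational Linguistics (ACL); its full English name is "Special Interest Group for Chinese Language Processing of the Association for Computational Linguistics". Bakeoff is the international Chinese language processing evaluation organized by SIGHAN. The first was held in 2003 in Sapporo, Japan (Bakeoff 2003) and the second in 2005 on Jeju Island, Korea (Bakeoff 2005). The third, held in Sydney in 2006 (Bakeoff 2006), added a Chinese named entity recognition evaluation to the tasks of the first two. To date, SIGHAN Bakeoff has been held six times. The data and results of Bakeoff 2005 are freely and publicly available on its home page, but note that they may only be used non-commercially: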

    The data and results for the 2nd International Chinese Word Segmentation Bakeoff are now available for non-commercial use.

The Third SIGHAN Chinese Language Processing Bakeoff will feature two tasks:

  • Word Segmentation
  • Named Entity Recognition

Word Segmentation Task

    The Word Segmentation task requires identification of word boundaries in running Chinese text.

The following resources will be available:

Matched training and (new) test sets from:

| Source Institution | Character Encoding | Approximate Size (chars) |
| --- | --- | --- |
| CKIP, Academia Sinica, Taiwan | Traditional, Big5 | 8.3M |
| City University of Hong Kong | Traditional, Big5 | 2.4M |
| Microsoft Research | Simplified, CP936 | 5M |
| University of Pennsylvania / University of Colorado, Boulder | Simplified | 1M |

    Segmentation guidelines for the following corpora are available. These were supplied to SIGHAN by each data provider, and converted into PDF by the organizer:

| Corpus | MS Word | PDF |
| --- | --- | --- |
| Academia Sinica | 516 KB | 336 KB |
| City University of Hong Kong | 154 KB | 237 KB |
| Microsoft Research | 41 KB | 70 KB |

Named Entity Recognition Task

    The Named Entity Recognition Task requires participants to identify named entities (person, location, and organization) in running unsegmented Chinese text.

The following resources will be available:

Matched training and (new) test sets from:

| Source Institution | Character Encoding |
| --- | --- |
| City University of Hong Kong | Traditional, Big5 |
| Microsoft Research | Simplified, CP936 |
| Linguistic Data Consortium | Simplified, CP936 |

    You may declare that you will return results on any subset of these corpora. For example, you may decide that you will test on the Sinica Corpus and the City University corpus. The only constraint is that you must not select a corpus where you have knowingly had previous access to the testing portion of the corpus. A corollary of this is that a team may not test on the data from their own institution.

Data Formats

     Training data will be available for CityU and MSRA in two formats. The primary format will be similar to that of the Co-NLL NER task 2002, adapted for Chinese. The data will be presented in two-column format, where the first column consists of the character and the second is a tag. The tag is specified as follows:

| Tag | Meaning |
| --- | --- |
| 0 (zero) | Not part of a named entity |
| B-PER | Beginning character of a person name |
| I-PER | Non-beginning character of a person name |
| B-ORG | Beginning character of an organization name |
| I-ORG | Non-beginning character of an organization name |
| B-LOC | Beginning character of a location name |
| I-LOC | Non-beginning character of a location name |
| B-GPE | Beginning character of a geopolitical entity |
| I-GPE | Non-beginning character of a geopolitical entity |
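
To make the two-column format concrete, here is a minimal Python sketch that reads such data and groups the tagged characters back into entities. It assumes whitespace-separated columns and a blank line between sentences (the usual CoNLL convention, not something the task description states); the sample sentence and the function names `read_sentences` and `extract_entities` are made up for illustration.

```python
from typing import List, Tuple

def read_sentences(text: str) -> List[List[Tuple[str, str]]]:
    """Parse "character tag" lines; a blank line (assumed) ends a sentence."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        char, tag = line.split()
        current.append((char, tag))
    if current:
        sentences.append(current)
    return sentences

def extract_entities(sentence: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged characters into (entity text, entity type) pairs."""
    entities, chars, etype = [], [], None
    for char, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype
        )
        if starts_new:
            # Close any open entity; a stray I- tag is leniently treated as a start.
            if chars:
                entities.append(("".join(chars), etype))
            chars, etype = [char], tag[2:]
        elif tag.startswith("I-"):
            chars.append(char)
        else:
            # Any other tag (such as "0") is outside every entity.
            if chars:
                entities.append(("".join(chars), etype))
            chars, etype = [], None
    if chars:
        entities.append(("".join(chars), etype))
    return entities

# Made-up sample in the two-column format.
sample = """\
张 B-PER
三 I-PER
在 0
北 B-LOC
京 I-LOC
"""

for sent in read_sentences(sample):
    print(extract_entities(sent))  # [('张三', 'PER'), ('北京', 'LOC')]
```

Treating every tag that does not start with `B-` or `I-` as "outside" keeps the sketch indifferent to whether the outside tag is written as the digit `0` or the letter `O`.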

Test data

    Test data will be provided one-sentence per line, unsegmented with no tags. Participants should format their results to conform to the training data format described above. Scoring will be done automatically using a variant of the Co-NLL 2003 scoring script. Comments at the beginning of the file describe usage.
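
The official scoring script is not reproduced here. As a rough illustration of what CoNLL-style scoring measures, the sketch below computes exact-match, entity-level precision, recall and F1 for a single sentence. The names `tag_spans` and `prf` and the example tags are assumptions for illustration, and the real script aggregates counts over the whole test set.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)

def tag_spans(tags: List[str]) -> Set[Span]:
    """Convert a B-/I- tag sequence into a set of entity spans."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["0"]):  # the sentinel closes a trailing entity
        inside = tag.startswith("I-") and tag[2:] == etype
        if not inside:
            if start is not None:
                spans.add((start, i, etype))
                start, etype = None, None
            if tag.startswith(("B-", "I-")):
                start, etype = i, tag[2:]
    return spans

def prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1; a span counts only on exact match."""
    g, p = tag_spans(gold), tag_spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = ["B-PER", "I-PER", "0", "B-LOC", "I-LOC"]
pred = ["B-PER", "I-PER", "0", "B-LOC", "0"]  # location boundary is wrong
print(prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Because matching is exact, the truncated location in `pred` earns no partial credit: it counts as one wrong prediction and one missed gold entity.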

Paper summary

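Paper title: Chinese Named Entity Recognition with Conditional Random Fields

This paper uses basic features and auxiliary features, and adds a post-processing step that corrects wrong results. Most of the work is in the feature design.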

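1. Basic features: $C_n\ (n=-2,-1,0,1,2)$ and $C_nC_{n+1}\ (n=-1,0)$

Features are defined through feature templates. Within a context window, a template generates features according to Unigram, Bigram, Trigram and similar patterns. For example, in the sentence “中国和日本是邻邦” ("China and Japan are neighbors"), if the current character is “和”, the Bigram templates $C_{-2}C_{-1}$, $C_{-1}C_0$, $C_0C_1$ and $C_1C_2$ expand to the four features “中国”, “国和”, “和日” and “日本” (the basic features listed above use only the middle two, $n=-1,0$). The indices 0, -1, -2, 1, 2 are positions relative to the current character: the character itself and the one or two characters before and after it. The CRF model builds its discriminant function from these features and the class labels, and at test time tags every token by choosing the labeling with the highest probability.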
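
The template expansion can be sketched in a few lines of Python. This is only an illustration: the function names, the feature-name strings and the padding symbol `_` for positions outside the sentence are assumptions, not something taken from the paper or from a particular CRF toolkit.

```python
from typing import Dict, Sequence

PAD = "_"  # hypothetical padding symbol for positions outside the sentence

def char_at(sentence: str, i: int) -> str:
    """Character at position i, or PAD when i falls outside the sentence."""
    return sentence[i] if 0 <= i < len(sentence) else PAD

def expand_templates(
    sentence: str,
    i: int,
    unigram_offsets: Sequence[int] = (-2, -1, 0, 1, 2),
    bigram_offsets: Sequence[int] = (-1, 0),
) -> Dict[str, str]:
    """Expand the C_n and C_nC_{n+1} templates around position i."""
    feats = {}
    for n in unigram_offsets:
        feats[f"C{n}"] = char_at(sentence, i + n)
    for n in bigram_offsets:
        feats[f"C{n}C{n + 1}"] = char_at(sentence, i + n) + char_at(sentence, i + n + 1)
    return feats

sentence = "中国和日本是邻邦"
i = sentence.index("和")

# Basic templates: five unigrams plus the two bigrams with n = -1, 0.
print(expand_templates(sentence, i))

# The wider bigram window used in the example in the text (n = -2 .. 1).
wide = expand_templates(sentence, i, unigram_offsets=(), bigram_offsets=(-2, -1, 0, 1))
print(list(wide.values()))  # ['中国', '国和', '和日', '日本']
```

Each (feature name, value) pair would then be paired with the candidate label of the current character to form one CRF feature.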

2. Auxiliary features:

1) Word boundary features:

For person names, “先生” ("Mr.") is an important cue.

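  1. Extract all n-grams ($2 \le n \le 10$, frequency $\ge 10$) to get the list $W_1$

  2. Apply the SSR (Statistical Substring Reduction) algorithm to get the list $W_2$

  3. Build a character list $CH$ (the 20 most frequent characters in the training corpus) in order to collect characters such as “的” and “了”.

  4. Remove from $W_2$ the entries containing characters that appear in $CH$ to get $W_3$

    Use the list $W_3$ as a dictionary and segment the text by left-to-right maximum matching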
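
The steps above can be sketched roughly as follows. This is a toy illustration under several assumptions: `reduce_substrings` is a simplified stand-in that drops an n-gram when a longer n-gram contains it with the same frequency, not the published SSR algorithm; step 4 is read as "drop every entry containing a character from $CH$"; the corpus is made up; and the thresholds are lowered so the toy data produces something.

```python
from collections import Counter
from typing import Dict, List, Set

def ngram_counts(lines: List[str], min_n: int = 2, max_n: int = 10,
                 min_freq: int = 10) -> Dict[str, int]:
    """Step 1: n-grams with min_n <= n <= max_n and frequency >= min_freq (W1)."""
    counts = Counter()
    for line in lines:
        for n in range(min_n, max_n + 1):
            for s in range(len(line) - n + 1):
                counts[line[s:s + n]] += 1
    return {g: c for g, c in counts.items() if c >= min_freq}

def reduce_substrings(w1: Dict[str, int]) -> Dict[str, int]:
    """Step 2 stand-in (NOT the real SSR): drop an n-gram if a longer
    n-gram contains it and has the same frequency."""
    return {g: c for g, c in w1.items()
            if not any(g != h and g in h and c == w1[h] for h in w1)}

def top_chars(lines: List[str], k: int = 20) -> Set[str]:
    """Step 3: the k most frequent characters (CH)."""
    counts = Counter(ch for line in lines for ch in line)
    return {ch for ch, _ in counts.most_common(k)}

def forward_max_match(text: str, lexicon: Set[str], max_len: int = 10) -> List[str]:
    """Left-to-right maximum matching; unmatched characters stand alone."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in lexicon:
                words.append(text[i:i + n])
                i += n
                break
        else:
            words.append(text[i])
            i += 1
    return words

lines = ["王先生来了", "李先生走了", "他们来了"]  # made-up toy corpus
w1 = ngram_counts(lines, min_freq=2)   # the text uses frequency >= 10
w2 = reduce_substrings(w1)
ch = top_chars(lines, k=1)             # the text uses the top 20
w3 = {g for g in w2 if not set(g) & ch}  # step 4: drop entries touching CH
print(forward_max_match("王先生来了", w3))  # ['王', '先生', '来', '了']
```

In the toy run “来了” is frequent enough to enter $W_1$ but is filtered out in step 4 because it contains “了”, which leaves “先生” as the only dictionary entry.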

2) Character features

PSur: uni-gram characters, first characters of Person Name. (surname)

PC: uni-gram characters in Person Name.

PPre: bi-gram characters before Person Name. (prefix of Person Name)

PSuf: bi-gram characters after Person Name. (suffix of Person Name)

LC: uni-gram characters in Location Name or Geopolitical entity.

LSuf: uni-gram characters, the last characters of Location Name or Geopolitical Entity. (suffix of Location Name or Geopolitical Entity)

OC: uni-gram characters in Organization Name.

OSuf: uni-gram characters, the last characters of Organization Name. (suffix of Organization Name)

OBSuf: bi-gram characters, the last two characters of Organization Name. (suffix of Organization Name)
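
As a rough sketch of how such character lists might be gathered from annotated training data, the snippet below collects them from one sentence with entity spans. The function name, the span representation and the example sentence are assumptions for illustration; any frequency filtering of the lists and the way they are turned into CRF features are not covered here.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)

def collect_char_lists(sentence: str, spans: List[Span]) -> Dict[str, Set[str]]:
    """Collect the PSur/PC/PPre/PSuf/LC/LSuf/OC/OSuf/OBSuf lists from one sentence."""
    lists = defaultdict(set)
    for start, end, etype in spans:
        name = sentence[start:end]
        if etype == "PER":
            lists["PSur"].add(name[0])      # first character (surname)
            lists["PC"].update(name)        # every character of the name
            if start >= 2:
                lists["PPre"].add(sentence[start - 2:start])  # bigram before the name
            if end + 2 <= len(sentence):
                lists["PSuf"].add(sentence[end:end + 2])      # bigram after the name
        elif etype in ("LOC", "GPE"):
            lists["LC"].update(name)
            lists["LSuf"].add(name[-1])     # last character
        elif etype == "ORG":
            lists["OC"].update(name)
            lists["OSuf"].add(name[-1])     # last character
            if len(name) >= 2:
                lists["OBSuf"].add(name[-2:])  # last two characters
    return dict(lists)

# Made-up example: "北京大学" is an organization and "张三" a person.
sentence = "北京大学邀请张三先生"
spans = [(0, 4, "ORG"), (6, 8, "PER")]
lists = collect_char_lists(sentence, spans)
print(lists["PSur"], lists["PPre"], lists["PSuf"])  # {'张'} {'邀请'} {'先生'}
print(lists["OSuf"], lists["OBSuf"])                # {'学'} {'大学'}
```

Here the person suffix list picks up “先生”, the cue mentioned earlier for person names.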

3. Post-processing

The post-processing tries to assign the correct tags according to n-best results for every sentence.

References:

  1. http://sighan.cs.uchicago.edu/bakeoff2006/
  2. Chinese Named Entity Recognition with Conditional Random Fields. Wenliang Chen

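Post title: Sequence Labeling 2: Named Entity Recognition

Author: goingcoder

Published: January 17, 2018 - 22:01

Last updated: January 23, 2018 - 14:01

Original link: https://goingcoder.github.io/2018/01/17/ner2/

License: Attribution-NonCommercial-NoDerivatives 4.0 International. Please keep the original link and the author's name when reposting.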
