SIGHAN is short for the Special Interest Group for Chinese Language Processing of the Association for Computational Linguistics (ACL). Bakeoff is the international Chinese language processing competition organized by SIGHAN: the first was held in Sapporo, Japan in 2003 (Bakeoff 2003), the second on Jeju Island, South Korea in 2005 (Bakeoff 2005), and the third, held in Sydney in 2006 (Bakeoff 2006), added a Chinese named entity recognition evaluation on top of the previous two tasks. SIGHAN Bakeoff has now been held six times. The data and results of Bakeoff 2005 are freely and publicly available on its homepage, but note that they are restricted to non-commercial use:
The data and results for the 2nd International Chinese Word Segmentation Bakeoff are now available for non-commercial use.
The Third SIGHAN Chinese Language Processing Bakeoff will feature two tasks:
- Word Segmentation
- Named Entity Recognition
Word Segmentation Task
The Word Segmentation task requires identification of word boundaries in running Chinese text.
The following resources will be available:
Matched training and (new) test sets from:
Source Institution | Character Encoding | Approximate Size (chars) |
---|---|---|
CKIP, Academia Sinica, Taiwan | Traditional, Big5 | 8.3M |
City University of Hong Kong | Traditional, Big5 | 2.4M |
Microsoft Research | Simplified, CP936 | 5M |
University of Pennsylvania / University of Colorado, Boulder | Simplified | 1M |
Segmentation guidelines for the following corpora are available. These were supplied to SIGHAN by each data provider, and converted into PDF by the organizer:
Corpus | MS Word | PDF |
---|---|---|
Academia Sinica | 516 KB | 336 KB |
City University of Hong Kong | 154 KB | 237 KB |
Microsoft Research | 41 KB | 70 KB |
Named Entity Recognition Task
The Named Entity Recognition Task requires participants to identify named entities (person, location, and organization) in running unsegmented Chinese text.
The following resources will be available:
Matched training and (new) test sets from:
Source Institution | Character Encoding |
---|---|
City University of Hong Kong | Traditional, Big5 |
Microsoft Research | Simplified, CP936 |
Linguistic Data Consortium | Simplified, CP936 |
You may declare that you will return results on any subset of these corpora. For example, you may decide that you will test on the Sinica Corpus and the City University corpus. The only constraint is that you must not select a corpus where you have knowingly had previous access to the testing portion of the corpus. A corollary of this is that a team may not test on the data from their own institution.
Data Formats
Training data will be available for CityU and MSRA in two formats. The primary format will be similar to that of the CoNLL 2002 NER task, adapted for Chinese. The data will be presented in two-column format, where the first column consists of the character and the second is a tag (a small parsing sketch follows the tag table below). The tag is specified as follows:
Tag | Meaning |
---|---|
0 (zero) | Not part of a named entity |
B-PER | Beginning character of a person name |
I-PER | Non-beginning character of a person name |
B-ORG | Beginning character of an organization name |
I-ORG | Non-beginning character of an organization name |
B-LOC | Beginning character of a location name |
I-LOC | Non-beginning character of a location name |
B-GPE | Beginning character of a geopolitical entity |
I-GPE | Non-beginning character of a geopolitical entity |
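To make the two-column format concrete, here is a minimal Python parsing sketch. It assumes one character and one tag per whitespace-separated line and CoNLL-style blank lines between sentences; the file name in the usage comment is hypothetical.

```python
# Minimal sketch: read two-column (character, tag) training data.
# Assumptions: UTF-8 (or pre-converted) text, one "character tag" pair per
# line, and CoNLL-style blank lines as sentence separators.

def read_ner_corpus(path):
    """Return a list of sentences, each a list of (char, tag) pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():          # blank line = sentence boundary (assumed)
                if current:
                    sentences.append(current)
                    current = []
                continue
            char, tag = line.split()[:2]  # first column: character, second: tag
            current.append((char, tag))
    if current:
        sentences.append(current)
    return sentences

# Example (hypothetical file name):
# corpus = read_ner_corpus("cityu_train.txt")
# print(corpus[0])   # e.g. [('澳', 'B-LOC'), ('門', 'I-LOC'), ...]
```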
Test Data
Test data will be provided one sentence per line, unsegmented and untagged. Participants should format their results to conform to the training data format described above. Scoring will be done automatically using a variant of the CoNLL 2003 scoring script; comments at the beginning of the script describe its usage.
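The official scorer is a variant of the CoNLL 2003 script; the sketch below is only a rough approximation of that idea, illustrating entity-level precision, recall, and F1 over (type, start, end) spans recovered from the B-/I-/0 tags of the format above.

```python
# Rough sketch of entity-level scoring in the spirit of the CoNLL evaluation
# (not the official scorer): an entity counts as correct only if both its
# span and its type match the gold standard exactly.

def extract_entities(tags):
    """Turn a tag sequence into a set of (type, start, end) spans."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["0"]):          # sentinel flushes last entity
        if tag.startswith("B-") or tag == "0" or \
           (tag.startswith("I-") and etype != tag[2:]):
            if etype is not None:
                entities.add((etype, start, i))
            start, etype = (i, tag[2:]) if tag.startswith("B-") else (None, None)
    return entities

def prf(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 for one tag sequence pair."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Example:
# gold = ["B-PER", "I-PER", "0", "B-LOC"]
# pred = ["B-PER", "I-PER", "0", "0"]
# print(prf(gold, pred))   # (1.0, 0.5, 0.666...)
```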
Paper Summary
Paper title: Chinese Named Entity Recognition with Conditional Random Fields
The paper uses basic features and auxiliary features, plus a post-processing step that corrects erroneous results; the main contribution lies in the feature design.
1. Basic features: $C_n\ (n=-2,-1,0,1,2)$, $C_nC_{n+1}\ (n=-2,-1,0,1)$
Which features are used is defined by feature templates, which instantiate unigram, bigram, trigram, etc. features within a context window. For example, in the sentence "中国和日本是邻邦" ("China and Japan are neighbors"), if the current character is "和", the bigram templates $C_{-2}C_{-1}$, $C_{-1}C_0$, $C_0C_1$, $C_1C_2$ expand to the four features "中国", "国和", "和日", "日本" respectively. The indices -2, -1, 0, 1, 2 denote positions relative to the current character (up to two characters before or after it). The CRF model builds its discriminative functions from these features and the class labels, and at test time assigns each token the most probable label.
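As a concrete illustration of this template expansion, here is a minimal Python sketch. The window size and the unigram/bigram template set mirror the basic features listed above; the function names and the `<PAD>` boundary symbol are my own choices, not the paper's.

```python
# Minimal sketch of character-window feature templates (basic features only):
# unigrams C_n (n = -2..2) and bigrams C_n C_{n+1} (n = -2..1), as above.
# The <PAD> symbol for sentence boundaries is an assumption.

def char_at(chars, i, pad="<PAD>"):
    return chars[i] if 0 <= i < len(chars) else pad

def basic_features(chars, i):
    """Instantiate unigram/bigram templates around position i."""
    feats = []
    for n in (-2, -1, 0, 1, 2):          # unigram templates C_n
        feats.append(f"C{n}={char_at(chars, i + n)}")
    for n in (-2, -1, 0, 1):             # bigram templates C_n C_{n+1}
        feats.append(f"C{n}C{n+1}={char_at(chars, i + n)}{char_at(chars, i + n + 1)}")
    return feats

# Example: current character "和" in "中国和日本是邻邦"
# print(basic_features(list("中国和日本是邻邦"), 2))
# -> ['C-2=中', 'C-1=国', 'C0=和', 'C1=日', 'C2=本',
#     'C-2C-1=中国', 'C-1C0=国和', 'C0C1=和日', 'C1C2=日本']
```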
2. Auxiliary features:
1) Word boundary features (a rough code sketch of this pipeline follows the steps below):
For person names, "先生" (Mr.) is an important clue.
- Extract all n-grams ($2 \le n \le 10$, frequency $\ge 10$) to obtain list $W_1$.
- Apply the SSR (Statistical Substring Reduction) algorithm to obtain list $W_2$.
- Build a character list $CH$ (the 20 most frequent characters in the training corpus) to collect characters such as "的" and "了".
- Remove from $W_2$ the entries containing characters in $CH$, obtaining $W_3$.
- Use $W_3$ as a dictionary to segment the text with left-to-right maximum matching.
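Below is a rough Python sketch of this word-boundary pipeline, under stated assumptions: the substring-reduction step is only a naive stand-in for the full SSR algorithm (it drops any n-gram whose frequency equals that of a longer n-gram containing it), and the thresholds simply mirror the numbers quoted above.

```python
from collections import Counter

# Rough sketch of the word-boundary feature pipeline described above.
# The substring-reduction step is a naive, quadratic stand-in for SSR,
# used here for illustration only.

def extract_ngrams(text, min_n=2, max_n=10, min_freq=10):
    """W1: all n-grams with 2 <= n <= 10 occurring at least 10 times."""
    counts = Counter(text[i:i + n]
                     for n in range(min_n, max_n + 1)
                     for i in range(len(text) - n + 1))
    return {g: c for g, c in counts.items() if c >= min_freq}

def naive_ssr(ngrams):
    """W2: drop any n-gram with the same frequency as a longer one containing it."""
    kept = {}
    for g, c in ngrams.items():
        if not any(g != h and g in h and c == ngrams[h] for h in ngrams):
            kept[g] = c
    return kept

def remove_stop_chars(ngrams, text, top_k=20):
    """W3: remove entries containing any of the top-k most frequent characters."""
    ch = {c for c, _ in Counter(text).most_common(top_k)}   # e.g. "的", "了"
    return {g: c for g, c in ngrams.items() if not set(g) & ch}

def max_match(sentence, dictionary, max_len=10):
    """Left-to-right maximum matching segmentation with W3 as the dictionary."""
    tokens, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 1, -1):
            if sentence[i:i + n] in dictionary:
                tokens.append(sentence[i:i + n])
                i += n
                break
        else:                     # no dictionary entry starts here: emit one char
            tokens.append(sentence[i])
            i += 1
    return tokens
```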
2) Character features (a lookup sketch follows the list below):
PSur: uni-gram characters, first characters of Person Name. (surname)
PC: uni-gram characters in Person Name.
PPre: bi-gram characters before Person Name. (prefix of Person Name)
PSuf: bi-gram characters after Person Name. (suffix of Person Name)
LC: uni-gram characters in Location Name or Geopolitical entity.
LSuf: uni-gram characters, the last characters of Location Name or Geopolitical Entity. (suffix of Location Name or Geopolitical Entity)
OC: uni-gram characters in Organization Name.
OSuf: uni-gram characters, the last characters of Organization Name. (suffix of Organization Name)
OBSuf: bi-gram characters, the last two characters of Organization Name. (suffix of Organization Name)
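One plausible way to turn such character lists into CRF features is simple membership lookups around the current character, as in the sketch below. The tiny lists and feature names here are illustrative placeholders, not the lists actually compiled in the paper.

```python
# Illustrative sketch: character lists such as PSur (person surnames) or
# LSuf (location/GPE suffix characters) can be compiled from the labeled
# training data and then used as binary lookup features. The lists below
# are placeholders, not the lists used in the paper.

PSUR = {"王", "李", "张", "陈"}        # first characters of person names
LSUF = {"省", "市", "县", "区", "国"}  # last characters of locations / GPEs
OSUF = {"部", "局", "会", "司"}        # last characters of organizations

def char_list_features(chars, i):
    """Binary features from membership of the current character in each list."""
    c = chars[i]
    feats = []
    if c in PSUR:
        feats.append("PSur=1")
    if c in LSUF:
        feats.append("LSuf=1")
    if c in OSUF:
        feats.append("OSuf=1")
    return feats

# Example:
# print(char_list_features(list("北京市政府"), 2))   # ['LSuf=1']
```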
3. Post-processing
The post-processing tries to assign the correct tags according to n-best results for every sentence.
References:
- http://sighan.cs.uchicago.edu/bakeoff2006/
- Wenliang Chen, Yujie Zhang, Hitoshi Isahara. Chinese Named Entity Recognition with Conditional Random Fields. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, 2006.