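Content Term Extraction using POS Tagging
Project description
This package determines the important terms within a given piece of content. It uses linguistic tools such as Parts-Of-Speech (POS) tagging, together with some simple statistical analysis, to determine the terms and their strength.
Detailed Documentation
Term Extraction
This package implements text term extraction by making use of a simple Parts-Of-Speech (POS) tagging algorithm.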
http://bioie.ldc.upenn.edu/wiki/index.php/Part-of-Speech
The POS Tagger
POS taggers use a lexicon to mark each word with a tag. The list of available tags can be found at:
http://bioie.ldc.upenn.edu/wiki/index.php/POS_tags
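Since a word can have multiple tags, determining the correct tag is not always simple. This implementation, however, does not try to infer linguistic use; it simply chooses the first tag listed for the word in the lexicon.

>>> from topia.termextract import tag
>>> tagger = tag.Tagger()
>>> tagger
<Tagger for english>
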
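The "first tag in the lexicon" policy described above can be sketched in a few lines. This is only an illustration under stated assumptions: the tiny `LEXICON` dictionary, the `first_tag` helper and the fallback to `NN` for unknown words are hypothetical and are not taken from the package's source.

```python
# Hypothetical miniature lexicon: each word maps to its candidate tags,
# listed in the order in which they appear in the lexicon file.
LEXICON = {
    "the": ["DT"],
    "fox": ["NN"],
    "can": ["MD", "NN", "VB"],
    "jump": ["VB", "NN"],
}


def first_tag(word, lexicon=LEXICON, default="NN"):
    """Return the first tag listed for `word`, ignoring all context.

    Unknown words fall back to `default` (an assumption made for this sketch).
    """
    tags = lexicon.get(word) or lexicon.get(word.lower())
    return tags[0] if tags else default


print(first_tag("can"))   # first listed tag wins, whatever the sentence says
print(first_tag("Ikea"))  # not in the lexicon, so the default is used
```

The point of the sketch is that no context is consulted: "can" receives the same tag whether it is used as a modal verb or as a noun.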
To get the tagger ready for work, we need to initialize it. In this implementation, initialization loads the lexicon.

>>> tagger.initialize()

Now we are ready to go.
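Tokenizing
The first step of tagging is to tokenize the text into terms.

>>> tagger.tokenize('This is a simple example.')
['This', 'is', 'a', 'simple', 'example', '.']

While most tokenizers ignore punctuation, it is important for us to keep it, since we need it later for term extraction. Let's now look at some more complex cases.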
Quoted text.

>>> tagger.tokenize('This is a "simple" example.')
['This', 'is', 'a', '"', 'simple', '"', 'example', '.']
>>> tagger.tokenize('"This is a simple example."')
['"', 'This', 'is', 'a', 'simple', 'example', '."']

Non-letters within words.

>>> tagger.tokenize('Parts-Of-Speech')
['Parts-Of-Speech']
>>> tagger.tokenize('amazon.com')
['amazon.com']
>>> tagger.tokenize('Go to amazon.com.')
['Go', 'to', 'amazon.com', '.']

Various punctuation.

>>> tagger.tokenize('Quick, go to amazon.com.')
['Quick', ',', 'go', 'to', 'amazon.com', '.']
>>> tagger.tokenize('Live free; or die?')
['Live', 'free', ';', 'or', 'die', '?']

Tolerance to incorrect punctuation.

>>> tagger.tokenize('Hi , I am here.')
['Hi', ',', 'I', 'am', 'here', '.']

Possessive structures.

>>> tagger.tokenize("my parents' car")
['my', 'parents', "'", 'car']
>>> tagger.tokenize("my father's car")
['my', 'father', "'s", 'car']

Numbers.

>>> tagger.tokenize("12.4")
['12.4']
>>> tagger.tokenize("-12.4")
['-12.4']
>>> tagger.tokenize("$12.40")
['$12.40']

Dates.

>>> tagger.tokenize("10/3/2009")
['10/3/2009']
>>> tagger.tokenize("3.10.2009")
['3.10.2009']

Okay, that's it.
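To make the idea of a punctuation-preserving tokenizer concrete, here is a deliberately simplified sketch. The regular expression and the `tokenize` function are assumptions made for illustration; they are not the package's tokenizer and cover only some of the cases shown above (plain words, words with internal `-` or `.`, and stand-alone punctuation). Possessives, signed numbers, currency amounts and slash-separated dates would need additional patterns.

```python
import re

# Simplified, hypothetical pattern:
#   - a run of word characters that may contain internal '-' or '.'
#     (so 'amazon.com' and 'Parts-Of-Speech' stay whole), or
#   - any single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:[-.]\w+)*|[^\w\s]")


def tokenize(text):
    """Split `text` into word and punctuation tokens, keeping the punctuation."""
    return TOKEN_RE.findall(text)


print(tokenize("Quick, go to amazon.com."))
```

Because the trailing period is not followed by a word character, it is not absorbed into `amazon.com` and comes out as its own token.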
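Tagging
The next step is tagging, which is done in two phases. In the first phase, each term is assigned a tag by looking it up in the lexicon, and its normalized form is set to the term itself. In the second phase, a set of rules is applied to each tagged term to adjust the tag and the normalized form.

>>> tagger('This is a simple example.')
[['This', 'DT', 'This'],
 ['is', 'VBZ', 'is'],
 ['a', 'DT', 'a'],
 ['simple', 'JJ', 'simple'],
 ['example', 'NN', 'example'],
 ['.', '.', '.']]
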
Every tag here is correct. Let's try a plural noun and see what happens.

>>> tagger('These are simple examples.')
[['These', 'DT', 'These'],
 ['are', 'VBP', 'are'],
 ['simple', 'JJ', 'simple'],
 ['examples', 'NNS', 'example'],
 ['.', '.', '.']]

So far so good. Let's test a few more cases.

>>> tagger("The fox's tail is red.")
[['The', 'DT', 'The'],
 ['fox', 'NN', 'fox'],
 ["'s", 'POS', "'s"],
 ['tail', 'NN', 'tail'],
 ['is', 'VBZ', 'is'],
 ['red', 'JJ', 'red'],
 ['.', '.', '.']]
>>> tagger("The fox can't really jump over the fox's tail.")
[['The', 'DT', 'The'],
 ['fox', 'NN', 'fox'],
 ['can', 'MD', 'can'],
 ["'t", 'RB', "'t"],
 ['really', 'RB', 'really'],
 ['jump', 'VB', 'jump'],
 ['over', 'IN', 'over'],
 ['the', 'DT', 'the'],
 ['fox', 'NN', 'fox'],
 ["'s", 'POS', "'s"],
 ['tail', 'NN', 'tail'],
 ['.', '.', '.']]

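The two-phase structure described in this section (lexicon lookup first, rule-based correction second) can be sketched as follows. Everything here is an assumption for illustration: the miniature `LEXICON`, the decision to list "jump" as a noun so that a rule has something to correct, and the `verb_after_modal` rule are hypothetical and do not reproduce the package's code.

```python
# Hypothetical one-tag-per-word lexicon; "jump" is deliberately listed as a
# noun so that the phase-two rule has something to correct.
LEXICON = {"The": "DT", "fox": "NN", "can": "MD", "really": "RB", "jump": "NN"}


def lookup(tokens):
    """Phase 1: build [term, tag, normalized] triples from the lexicon."""
    return [[tok, LEXICON.get(tok, "NN"), tok] for tok in tokens]


def verb_after_modal(idx, tagged):
    """Phase 2 rule: retag a noun as VB if a modal precedes it.

    Adverbs (RB) between the modal and the word are skipped.
    """
    if tagged[idx][1] != "NN":
        return
    prev = idx - 1
    while prev >= 0 and tagged[prev][1] == "RB":
        prev -= 1
    if prev >= 0 and tagged[prev][1] == "MD":
        tagged[idx][1] = "VB"


RULES = [verb_after_modal]


def tag(tokens):
    """Run the lookup phase, then apply every rule to every tagged term."""
    tagged = lookup(tokens)
    for idx in range(len(tagged)):
        for rule in RULES:
            rule(idx, tagged)
    return tagged


for triple in tag(["The", "fox", "can", "really", "jump"]):
    print(triple)
```

Keeping the rules in a plain list makes the second phase easy to extend: each rule sees the whole sentence but only edits the triple at its own index.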
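Rules
Correct the default noun tag.

>>> tagger('Ikea')
[['Ikea', 'NN', 'Ikea']]
>>> tagger('Ikeas')
[['Ikeas', 'NNS', 'Ikea']]

Verify proper nouns at the beginning of a sentence. In the output below, sentence-initial "Police" is lowercased and tagged NN, while "Stephan" keeps its capital letter and the NNP tag.

>>> tagger('. Police')
[['.', '.', '.'], ['police', 'NN', 'police']]
>>> tagger('Police')
[['police', 'NN', 'police']]
>>> tagger('. Stephan')
[['.', '.', '.'], ['Stephan', 'NNP', 'Stephan']]
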
Determine the verb after a modal verb.

>>> tagger('The fox can jump')
[['The', 'DT', 'The'],
 ['fox', 'NN', 'fox'],
 ['can', 'MD', 'can'],
 ['jump', 'VB', 'jump']]
>>> tagger("The fox can't jump")
[['The', 'DT', 'The'],
 ['fox', 'NN', 'fox'],
 ['can', 'MD', 'can'],
 ["'t", 'RB', "'t"],
 ['jump', 'VB', 'jump']]
>>> tagger('The fox can really jump')
[['The', 'DT', 'The'],
 ['fox', 'NN', 'fox'],
 ['can', 'MD', 'can'],
 ['really', 'RB', 'really'],
 ['jump', 'VB', 'jump']]

Normalize plural forms.

>>> tagger('examples')
[['examples', 'NNS', 'example']]
>>> tagger('stresses')
[['stresses', 'NNS', 'stress']]
>>> tagger('cherries')
[['cherries', 'NNS', 'cherry']]

Some cases where normalization does not work (irregular plurals are left unchanged).

>>> tagger('men')
[['men', 'NNS', 'men']]
>>> tagger('feet')
[['feet', 'NNS', 'feet']]

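The plural normalization shown above can be approximated with a small suffix heuristic. This is a sketch under assumptions: the `normalize_plural` function and its suffix list are hypothetical, and the package may well decide differently (for example by checking candidate singular forms against its lexicon). Like the examples above, the sketch leaves irregular plurals untouched.

```python
# Hypothetical list of endings where the plural adds "es" to the singular.
ES_SUFFIXES = ("sses", "xes", "zes", "ches", "shes")


def normalize_plural(term, tag):
    """Return a guessed singular form for terms tagged as plural nouns."""
    if tag != "NNS":
        return term
    if term.endswith("ies") and len(term) > 3:
        return term[:-3] + "y"        # cherries -> cherry
    if term.endswith(ES_SUFFIXES):
        return term[:-2]              # stresses -> stress
    if term.endswith("s") and not term.endswith("ss"):
        return term[:-1]              # examples -> example
    return term                       # men, feet: no regular suffix to strip


for word in ("examples", "stresses", "cherries", "men", "feet"):
    print(word, "->", normalize_plural(word, "NNS"))
```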
Term Extraction
Now that we can tag a text, let's have a look at term extraction.

>>> from topia.termextract import extract
>>> extractor = extract.TermExtractor()
>>> extractor
<TermExtractor using <Tagger for english>>

As you can see, the extractor maintains a tagger:

>>> extractor.tagger
<Tagger for english>

When creating an extractor, you can also pass in a tagger to avoid initializing a tagger repeatedly:

>>> extractor = extract.TermExtractor(tagger)
>>> extractor.tagger is tagger
True

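Let's get the terms for a simple text.

>>> extractor("The fox can't jump over the fox's tail.")
[]

We got no terms. That is because, by default, a term consisting of a single word must occur at least 3 times to be selected.
The extractor maintains a filter component. Let's register the trivial permissive filter, which simply returns everything the extractor suggests. Each result is a tuple of the term, its number of occurrences, and its strength:

>>> extractor.filter = extract.permissiveFilter
>>> extractor("The fox can't jump over the fox's tail.")
[('tail', 1, 1), ('fox', 2, 1)]

But let's look at the default filter again, since it allows its parameters to be tweaked:

>>> extractor.filter = extract.DefaultFilter(singleStrengthMinOccur=2)
>>> extractor("The fox can't jump over the fox's tail.")
[('fox', 2, 1)]
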
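Let's now look at multi-word terms. Multi-word nouns and proper names often occur only once or twice in a text, yet they are frequently great terms. To handle this scenario, the concept of "strength" was introduced. Currently, the strength is simply the number of words in the term. By default, all terms with a strength greater than 1 are selected, regardless of their number of occurrences.

>>> extractor('The German consul of Boston resides in Newton.')
[('German consul', 1, 2)]
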
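The selection rule described above (single-word terms need a minimum number of occurrences, stronger terms are always kept) can be written down as a small predicate. This is a sketch, not the package's `DefaultFilter`: the function names and the `no_limit_strength` parameter are hypothetical, while the default of 3 occurrences and the "strength greater than 1" rule come from the text.

```python
def default_filter(term, occurrences, strength,
                   single_strength_min_occur=3, no_limit_strength=2):
    """Keep a candidate if it is frequent enough or strong enough.

    `no_limit_strength` is a hypothetical name for the strength from which
    terms are accepted regardless of how often they occur.
    """
    if strength >= no_limit_strength:
        return True
    return strength == 1 and occurrences >= single_strength_min_occur


def apply_filter(candidates, **options):
    """Filter (term, occurrences, strength) tuples with `default_filter`."""
    return [c for c in candidates if default_filter(*c, **options)]


candidates = [("tail", 1, 1), ("fox", 2, 1), ("German consul", 1, 2)]
print(apply_filter(candidates))
print(apply_filter(candidates, single_strength_min_occur=2))
```

With the default threshold only the two-word term survives; lowering the threshold to 2 also lets "fox" through, mirroring the `singleStrengthMinOccur=2` example above.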
Example - A News Article
This document provides a simple example of extracting the terms from a BBC article published on May 29, 2009. We will use several term extraction tools and compare their results.

>>> text = '''
... Police shut Palestinian theatre in Jerusalem.
...
... Israeli police have shut down a Palestinian theatre in East Jerusalem.
...
... The action, on Thursday, prevented the closing event of an international
... literature festival from taking place.
...
... Police said they were acting on a court order, issued after intelligence
... indicated that the Palestinian Authority was involved in the event.
...
... Israel has occupied East Jerusalem since 1967 and has annexed the
... area. This is not recognised by the international community.
...
... The British consul-general in Jerusalem , Richard Makepeace, was
... attending the event.
...
... "I think all lovers of literature would regard this as a very
... regrettable moment and regrettable decision," he added.
...
... Mr Makepeace said the festival's closing event would be reorganised to
... take place at the British Council in Jerusalem.
...
... The Israeli authorities often take action against events in East
... Jerusalem they see as connected to the Palestinian Authority.
...
... Saturday's opening event at the same theatre was also shut down.
...
... A police notice said the closure was on the orders of Israel's internal
... security minister on the grounds of a breach of interim peace accords
... from the 1990s.
...
... These laid the framework for talks on establishing a Palestinian state
... alongside Israel, but left the status of Jerusalem to be determined by
... further negotiation.
...
... Israel has annexed East Jerusalem and declares it part of its eternal
... capital.
...
... Palestinians hope to establish their capital in the area.
... '''

Yahoo Keyword Extractor
Yahoo provides a service that extracts terms from a piece of content using its vast search database.
http://developer.yahoo.com/search/content/V1/termExtraction.html
As you can see, the results are quite good:

<ResultSet>
  <Result>british consul general</Result>
  <Result>east jerusalem</Result>
  <Result>literature festival</Result>
  <Result>richard makepeace</Result>
  <Result>international literature</Result>
  <Result>israeli authorities</Result>
  <Result>eternal capital</Result>
  <Result>peace accords</Result>
  <Result>security minister</Result>
  <Result>israeli police</Result>
  <Result>internal security</Result>
  <Result>palestinian state</Result>
  <Result>palestinian authority</Result>
  <Result>british council</Result>
  <Result>palestinians</Result>
  <Result>negotiation</Result>
  <Result>breach</Result>
  <Result>1990s</Result>
  <Result>closure</Result>
  <Result>israel</Result>
</ResultSet>

Unfortunately, the service allows only 5,000 requests per day. Also, the terms do not carry a strength indicator.
TreeTagger
A POS tagger that uses some linguistics to tag text. Here is its output, one token per line (token, tag, lemma):

Police NNS Police
shut VVD shut
Palestinian JJ Palestinian
theatre NN theatre
in IN in
Jerusalem NP Jerusalem
. SENT .
Israeli JJ Israeli
police NNS police
have VHP have
shut VVN shut
down RP down
a DT a
Palestinian JJ Palestinian
theatre NN theatre
in IN in
East NP East
Jerusalem NP Jerusalem
. SENT .
The DT the
action NN action
, , ,
on IN on
Thursday NP Thursday
, , ,
prevented VVD prevent
the DT the
closing NN closing
event NN event
of IN of
an DT an
international JJ international
literature NN literature
festival NN festival
from IN from
taking VVG take
place NN place
. SENT .
Police NNS Police
said VVD say
they PP they
were VBD be
acting VVG act
on IN on
a DT a
court NN court
order NN order
, , ,
issued VVN issue
after IN after
intelligence NN intelligence
indicated VVN indicate
that IN that
the DT the
Palestinian NP Palestinian
Authority NP Authority
was VBD be
involved VVN involve
in IN in
the DT the
event NN event
. SENT .
Israel NP Israel
has VHZ have
occupied VVN occupy
East NP East
Jerusalem NP Jerusalem
since IN since
1967 CD @card@
and CC and
has VHZ have
annexed VVN annex
the DT the
area NN area
. SENT .
This DT this
is VBZ be
not RB not
recognised VVN recognise
by IN by
the DT the
international JJ international
community NN community
. SENT .
The DT the
British JJ British
consul-general NN <unknown>
in IN in
Jerusalem NP Jerusalem
, , ,
Richard NP Richard
Makepeace NP Makepeace
, , ,
was VBD be
attending VVG attend
the DT the
event NN event
. SENT .
" `` "
I PP I
think VVP think
all DT all
lovers NNS lover
of IN of
literature NN literature
would MD would
regard VV regard
this DT this
as IN as
a DT a
very RB very
regrettable JJ regrettable
moment NN moment
and CC and
regrettable JJ regrettable
decision NN decision
, , ,
" '' "
he PP he
added VVD add
. SENT .
Mr NP Mr
Makepeace NP Makepeace
said VVD say
the DT the
festival NN festival
's POS 's
closing NN closing
event NN event
would MD would
be VB be
reorganised VVN <unknown>
to TO to
take VV take
place NN place
at IN at
the DT the
British NP British
Council NP Council
in IN in
Jerusalem NP Jerusalem
. SENT .
The DT the
Israeli JJ Israeli
authorities NNS authority
often RB often
take VVP take
action NN action
against IN against
events NNS event
in IN in
East NP East
Jerusalem NP Jerusalem
they PP they
see VVP see
as RB as
connected VVN connect
to TO to
the DT the
Palestinian JJ Palestinian
Authority NP Authority
. SENT .
Saturday NP Saturday
's POS 's
opening NN opening
event NN event
at IN at
the DT the
same JJ same
theatre NN theatre
was VBD be
also RB also
shut VVN shut
down RP down
. SENT .
A DT a
police NN police
notice NN notice
said VVD say
the DT the
closure NN closure
was VBD be
on IN on
the DT the
orders NNS order
of IN of
Israel NP Israel
's POS 's
internal JJ internal
security NN security
minister NN minister
on IN on
the DT the
grounds NNS ground
of IN of
a DT a
breach NN breach
of IN of
interim JJ interim
peace NN peace
accords NNS accord
from IN from
the DT the
1990s NNS 1990s
. SENT .
These DT these
laid VVD lay
the DT the
framework NN framework
for IN for
talks NNS talk
on IN on
establishing VVG establish
a DT a
Palestinian JJ Palestinian
state NN state
alongside IN alongside
Israel NP Israel
, , ,
but CC but
left VVD leave
the DT the
status NN status
of IN of
Jerusalem NP Jerusalem
to TO to
be VB be
determined VVN determine
by IN by
further JJR further
negotiation NN negotiation
. SENT .
Israel NP Israel
has VHZ have
annexed VVN annex
East NP East
Jerusalem NP Jerusalem
and CC and
declares VVZ declare
it PP it
part NN part
of IN of
its PP$ its
eternal JJ eternal
capital NN capital
. SENT .
Palestinians NPS Palestinians
hope VVP hope
to TO to
establish VV establish
their PP$ their
capital NN capital
in IN in
the DT the
area NN area
. SENT .

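As you can see, TreeTagger's tagging is quite good, but the output needs further analysis before it yields a useful set of terms. Also, TreeTagger is not free for commercial use.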
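Topia's Term Extractor
Topia's Term Extractor tries to produce results that fall somewhere between those of a POS tagger like TreeTagger and those of the Yahoo Keyword Extractor.
Since we are only interested in nouns, a very simple POS tagging algorithm can be deployed, which provides good results most of the time. We then use some simple statistics and linguistics to produce a narrow but strong list of terms.

>>> from topia.termextract import extract
>>> extractor = extract.TermExtractor()
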
Let's first look at the tagger's result (term, tag, normalized form):

>>> printTaggedTerms(extractor.tagger(text)) #doctest: +REPORT_NDIFF
police NN police
shut VBN shut
Palestinian JJ Palestinian
theatre NN theatre
in IN in
Jerusalem NNP Jerusalem
. . .
Israeli JJ Israeli
police NN police
have VBP have
shut VBN shut
down RB down
a DT a
Palestinian JJ Palestinian
theatre NN theatre
in IN in
East NNP East
Jerusalem NNP Jerusalem
. . .
The DT The
action NN action
, , ,
on IN on
Thursday NNP Thursday
, , ,
prevented VBN prevented
the DT the
closing VBG closing
event NN event
of IN of
an DT an
international JJ international
literature NN literature
festival NN festival
from IN from
taking VBG taking
place NN place
. . .
police NN police
said VBD said
they PRP they
were VBD were
acting VBG acting
on IN on
a DT a
court NN court
order NN order
, , ,
issued VBN issued
after IN after
intelligence NN intelligence
indicated VBD indicated
that IN that
the DT the
Palestinian JJ Palestinian
Authority NNP Authority
was VBD was
involved VBN involved
in IN in
the DT the
event NN event
. . .
Israel NNP Israel
has VBZ has
occupied VBN occupied
East NNP East
Jerusalem NNP Jerusalem
since IN since
1967 NN 1967
and CC and
has VBZ has
annexed VBD annexed
the DT the
area NN area
. . .
This DT This
is VBZ is
not RB not
recognised VBD recognised
by IN by
the DT the
international JJ international
community NN community
. . .
The DT The
British JJ British
consul-general NN consul-general
in IN in
Jerusalem NNP Jerusalem
, , ,
Richard NNP Richard
Makepeace NNP Makepeace
, , ,
was VBD was
attending VBG attending
the DT the
event NN event
. . .
" " "
I PRP I
think VBP think
all DT all
lovers NNS lover
of IN of
literature NN literature
would MD would
regard VB regard
this DT this
as IN as
a DT a
very RB very
regrettable JJ regrettable
moment NN moment
and CC and
regrettable JJ regrettable
decision NN decision
," , ,"
he PRP he
added VBD added
. . .
Mr NNP Mr
Makepeace NNP Makepeace
said VBD said
the DT the
festival NN festival
's POS 's
closing VBG closing
event NN event
would MD would
be VB be
reorganised NN reorganised
to TO to
take VB take
place NN place
at IN at
the DT the
British JJ British
Council NNP Council
in IN in
Jerusalem NNP Jerusalem
. . .
The DT The
Israeli JJ Israeli
authorities NNS authority
often RB often
take VB take
action NN action
against IN against
events NNS event
in IN in
East NNP East
Jerusalem NNP Jerusalem
they PRP they
see VB see
as IN as
connected VBN connected
to TO to
the DT the
Palestinian JJ Palestinian
Authority NNP Authority
. . .
Saturday NNP Saturday
's POS 's
opening NN opening
event NN event
at IN at
the DT the
same JJ same
theatre NN theatre
was VBD was
also RB also
shut VBN shut
down RB down
. . .
A DT A
police NN police
notice NN notice
said VBD said
the DT the
closure NN closure
was VBD was
on IN on
the DT the
orders NNS order
of IN of
Israel NNP Israel
's POS 's
internal JJ internal
security NN security
minister NN minister
on IN on
the DT the
grounds NNS ground
of IN of
a DT a
breach NN breach
of IN of
interim JJ interim
peace NN peace
accords NNS accord
from IN from
the DT the
1990 NN 1990
s PRP s
. . .
These DT These
laid VBN laid
the DT the
framework NN framework
for IN for
talks NNS talk
on IN on
establishing VBG establishing
a DT a
Palestinian JJ Palestinian
state NN state
alongside IN alongside
Israel NNP Israel
, , ,
but CC but
left VBN left
the DT the
status NN status
of IN of
Jerusalem NNP Jerusalem
to TO to
be VB be
determined VBN determined
by IN by
further JJ further
negotiation NN negotiation
. . .
Israel NNP Israel
has VBZ has
annexed VBD annexed
East NNP East
Jerusalem NNP Jerusalem
and CC and
declares VBZ declares
it PRP it
part NN part
of IN of
its PRP$ its
eternal JJ eternal
capital NN capital
. . .
Palestinians NNPS Palestinian
hope NN hope
to TO to
establish VB establish
their PRP$ their
capital NN capital
in IN in
the DT the
area NN area
. . .

Now let's apply the extractor.

>>> sorted(extractor(text))
[('British Council', 1, 2),
 ('British consul-general', 1, 2),
 ('East', 4, 1),
 ('East Jerusalem', 4, 2),
 ('Israel', 4, 1),
 ('Israeli authorities', 1, 2),
 ('Israeli police', 1, 2),
 ('Jerusalem', 8, 1),
 ('Mr Makepeace', 1, 2),
 ('Palestinian', 6, 1),
 ('Palestinian Authority', 2, 2),
 ('Palestinian state', 1, 2),
 ('Palestinian theatre', 2, 2),
 ('Palestinians hope', 1, 2),
 ('Richard Makepeace', 1, 2),
 ('court order', 1, 2),
 ('event', 6, 1),
 ('literature festival', 1, 2),
 ('opening event', 1, 2),
 ('peace accords', 1, 2),
 ('police', 4, 1),
 ('police notice', 1, 2),
 ('security minister', 1, 2),
 ('theatre', 3, 1)]

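The extractor returns unranked `(term, occurrences, strength)` tuples, so a caller who wants the most prominent terms first has to choose an ordering. One possible choice, shown here purely as an illustration, is to sort by strength, then by occurrence count, then alphabetically. The `rank_terms` helper is hypothetical and not part of the package; the sample tuples are copied from the output above.

```python
def rank_terms(terms):
    """Order (term, occurrences, strength) tuples, strongest and most frequent first."""
    return sorted(terms, key=lambda t: (-t[2], -t[1], t[0]))


# A few tuples taken from the extractor output shown above.
sample = [
    ("Jerusalem", 8, 1),
    ("police notice", 1, 2),
    ("East Jerusalem", 4, 2),
    ("event", 6, 1),
    ("Palestinian Authority", 2, 2),
]

for term, occurrences, strength in rank_terms(sample):
    print(term, occurrences, strength)
```

Whether strength should outrank frequency depends on the application; swapping the first two key components gives the opposite preference.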
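Changes
1.1.0 (2009-06-29)
Tweaked the dictionary slightly to improve results in real-world scenarios.
1.0.0 (2009-05-30)
Initial release.
Part-of-speech text tagging using an existing vocabulary and very simple linguistic rules.
Term extraction based on occurrence counts and term strength.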
Project details
Hashes for topia.termextract-1.1.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 86f82423613d6c33a975e666484af936d7b0c54ff5af6264d3b007c1a168f6bf
MD5 | 1db016dd114588b75bdc3c458eb97f01
BLAKE2b-256 | d1b9452257976ebee91d07c74bc4b34cfce416f45b94af1d62902ae39bf902cf
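A downloaded archive can be checked against the SHA256 digest listed above using only the Python standard library. The helper names below are hypothetical, and the local file name in the commented usage is an assumption about where the archive was saved.

```python
import hashlib

# SHA256 digest copied from the table above.
EXPECTED_SHA256 = "86f82423613d6c33a975e666484af936d7b0c54ff5af6264d3b007c1a168f6bf"


def sha256_hex(stream, chunk_size=65536):
    """Hash a binary file-like object in chunks and return the hex digest."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_expected(stream, expected=EXPECTED_SHA256):
    """Return True if the stream's SHA256 digest equals `expected`."""
    return sha256_hex(stream) == expected.lower()


# Usage, assuming the archive sits in the current directory:
# with open("topia.termextract-1.1.0.tar.gz", "rb") as archive:
#     print(matches_expected(archive))
```

Reading in chunks keeps memory use constant regardless of the archive size.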