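Content term extraction using POS tagging

Project description

This package determines the important terms within a given piece of content. It uses linguistic tools such as Parts-Of-Speech (POS) tagging and some simple statistical analysis to determine the terms and their strength.

Detailed Documentation

Term Extraction

This package implements text term extraction by making use of a simple Parts-Of-Speech (POS) tagging algorithm.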

http://bioie.ldc.upenn.edu/wiki/index.php/Part-of-Speech

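POS Tagger

The POS tagger uses a lexicon to assign a tag to each word. The list of available tags can be found at:

http://bioie.ldc.upenn.edu/wiki/index.php/POS_tags

Since a word can have multiple tags, determining the correct tag is not always simple. This implementation, however, does not try to infer linguistic use; it simply chooses the first tag listed in the lexicon.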

>>> from topia.termextract import tag
>>> tagger = tag.Tagger()
>>> tagger
<Tagger for english>

To get the tagger ready for work, we need to initialize it. In this implementation, initialization loads the lexicon.

>>> tagger.initialize()
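
The "first tag wins" lookup described above can be pictured with a small sketch. The toy `LEXICON` dictionary, the `first_tag` function and the `'NN'` fallback for unknown words are assumptions made for illustration; they are not the package's actual data or code.

```python
# Toy lexicon: each word maps to its possible tags, in lexicon order.
# The entries are illustrative only, not taken from the package's lexicon.
LEXICON = {
    "can": ["MD", "NN", "VB"],
    "jump": ["VB", "NN"],
    "fox": ["NN"],
    "the": ["DT"],
}


def first_tag(word, default="NN"):
    """Return the first tag listed for ``word``, or ``default`` if unknown."""
    tags = LEXICON.get(word) or LEXICON.get(word.lower())
    return tags[0] if tags else default


print([first_tag(w) for w in ["The", "fox", "can", "jump"]])
# prints ['DT', 'NN', 'MD', 'VB']
```

Because no context is consulted, "can" is always tagged as a modal here, even where it is used as a noun. That is the trade-off the text describes.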

Now we are ready to go.

Tokenizing

The first step of tagging is to tokenize the text into terms.

>>> tagger.tokenize('This is a simple example.')
['This', 'is', 'a', 'simple', 'example', '.']

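While most tokenizers ignore punctuation, it is important for us to keep it, since we need it later for the term extraction. Let's now look at some more complex cases:

  • Quoted text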

    >>> tagger.tokenize('This is a "simple" example.')
    ['This', 'is', 'a', '"', 'simple', '"', 'example', '.']
    
    >>> tagger.tokenize('"This is a simple example."')
    ['"', 'This', 'is', 'a', 'simple', 'example', '."']
    
  • Non-letter characters within words.

    >>> tagger.tokenize('Parts-Of-Speech')
    ['Parts-Of-Speech']
    
    >>> tagger.tokenize('amazon.com')
    ['amazon.com']
    
    >>> tagger.tokenize('Go to amazon.com.')
    ['Go', 'to', 'amazon.com', '.']
    
  • Various punctuation.

    >>> tagger.tokenize('Quick, go to amazon.com.')
    ['Quick', ',', 'go', 'to', 'amazon.com', '.']
    
    >>> tagger.tokenize('Live free; or die?')
    ['Live', 'free', ';', 'or', 'die', '?']
    
  • Tolerance of incorrect punctuation.

    >>> tagger.tokenize('Hi , I am here.')
    ['Hi', ',', 'I', 'am', 'here', '.']
    
  • Possessive structures.

    >>> tagger.tokenize("my parents' car")
    ['my', 'parents', "'", 'car']
    >>> tagger.tokenize("my father's car")
    ['my', 'father', "'s", 'car']
    
  • Numbers.

    >>> tagger.tokenize("12.4")
    ['12.4']
    >>> tagger.tokenize("-12.4")
    ['-12.4']
    >>> tagger.tokenize("$12.40")
    ['$12.40']
    
  • Dates.

    >>> tagger.tokenize("10/3/2009")
    ['10/3/2009']
    >>> tagger.tokenize("3.10.2009")
    ['3.10.2009']
    

Okay, that's it.
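
To make the idea of a punctuation-preserving tokenizer concrete, here is a minimal regular-expression sketch. The pattern and the `simple_tokenize` name are assumptions for illustration and are not the package's tokenizer; the sketch handles several of the cases above but not all of them.

```python
import re

# One alternative per token kind, tried left to right:
#   1. the possessive suffix 's
#   2. a word or number, optionally joined by internal '-', '.' or '/'
#   3. any other single non-space character (punctuation)
TOKEN = re.compile(r"'s\b|[^\W_]+(?:[-./][^\W_]+)*|\S")


def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN.findall(text)


print(simple_tokenize("Quick, go to amazon.com."))
# prints ['Quick', ',', 'go', 'to', 'amazon.com', '.']
print(simple_tokenize("my father's car"))
# prints ['my', 'father', "'s", 'car']
```

Unlike the tagger's tokenizer shown above, this sketch splits a leading sign or currency symbol from a number ('-12.4' becomes '-' and '12.4') and separates a closing period from a closing quote.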

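Tagging

The next step is tagging. Tagging is done in two phases. In the first phase, each term is assigned a tag by looking it up in the lexicon, and its normalized form is set to the term itself. In the second phase, a set of rules is applied to each tagged term, and the tag and normalized form are adjusted where needed.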

>>> tagger('This is a simple example.')
[['This', 'DT', 'This'],
 ['is', 'VBZ', 'is'],
 ['a', 'DT', 'a'],
 ['simple', 'JJ', 'simple'],
 ['example', 'NN', 'example'],
 ['.', '.', '.']]

Wow, every tag is spot-on. Let's try a plural noun and see what happens.

>>> tagger('These are simple examples.')
[['These', 'DT', 'These'],
 ['are', 'VBP', 'are'],
 ['simple', 'JJ', 'simple'],
 ['examples', 'NNS', 'example'],
 ['.', '.', '.']]

So far so good. Let's test a few more cases.

>>> tagger("The fox's tail is red.")
[['The', 'DT', 'The'],
 ['fox', 'NN', 'fox'],
 ["'s", 'POS', "'s"],
 ['tail', 'NN', 'tail'],
 ['is', 'VBZ', 'is'],
 ['red', 'JJ', 'red'],
 ['.', '.', '.']]
>>> tagger("The fox can't really jump over the fox's tail.")
[['The', 'DT', 'The'],
 ['fox', 'NN', 'fox'],
 ['can', 'MD', 'can'],
 ["'t", 'RB', "'t"],
 ['really', 'RB', 'really'],
 ['jump', 'VB', 'jump'],
 ['over', 'IN', 'over'],
 ['the', 'DT', 'the'],
 ['fox', 'NN', 'fox'],
 ["'s", 'POS', "'s"],
 ['tail', 'NN', 'tail'],
 ['.', '.', '.']]

Rules

  • Correct default noun tag

    >>> tagger('Ikea')
    [['Ikea', 'NN', 'Ikea']]
    >>> tagger('Ikeas')
    [['Ikeas', 'NNS', 'Ikea']]
    
  • Verify proper nouns at the beginning of a sentence.

    >>> tagger('. Police')
    [['.', '.', '.'], ['police', 'NN', 'police']]
    >>> tagger('Police')
    [['police', 'NN', 'police']]
    >>> tagger('. Stephan')
    [['.', '.', '.'], ['Stephan', 'NNP', 'Stephan']]
    
  • Determine the verb after a modal verb

    >>> tagger('The fox can jump')
    [['The', 'DT', 'The'],
     ['fox', 'NN', 'fox'],
     ['can', 'MD', 'can'],
     ['jump', 'VB', 'jump']]
    >>> tagger("The fox can't jump")
    [['The', 'DT', 'The'],
     ['fox', 'NN', 'fox'],
     ['can', 'MD', 'can'],
     ["'t", 'RB', "'t"],
     ['jump', 'VB', 'jump']]
    >>> tagger('The fox can really jump')
    [['The', 'DT', 'The'],
     ['fox', 'NN', 'fox'],
     ['can', 'MD', 'can'],
     ['really', 'RB', 'really'],
     ['jump', 'VB', 'jump']]
    
  • Normalize plural forms

    >>> tagger('examples')
    [['examples', 'NNS', 'example']]
    >>> tagger('stresses')
    [['stresses', 'NNS', 'stress']]
    >>> tagger('cherries')
    [['cherries', 'NNS', 'cherry']]
    

    Some cases where the rule does not apply (irregular plurals are left unchanged):

    >>> tagger('men')
    [['men', 'NNS', 'men']]
    >>> tagger('feet')
    [['feet', 'NNS', 'feet']]
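
The rule phase can be pictured as a list of small functions, each of which may adjust one tagged term in place. The sketch below implements simplified versions of two of the rules listed above: the verb-after-modal rule and plural normalization. The function names, the rule signature and the suffix handling are assumptions chosen for illustration, not the package's implementation.

```python
def verb_after_modal(index, tagged):
    """Retag a noun as a verb if a modal precedes it (adverbs may intervene)."""
    term = tagged[index]
    if term[1] != "NN":
        return
    for previous in reversed(tagged[:index]):
        if previous[1] == "RB":
            continue  # skip adverbs such as "really" or "'t"
        if previous[1] == "MD":
            term[1] = "VB"
        return  # stop at the first non-adverb token


def normalize_plural(index, tagged):
    """Strip a regular plural suffix from the normalized form of NNS terms."""
    term = tagged[index]
    if term[1] != "NNS":
        return
    norm = term[2]
    if norm.endswith("ies"):
        term[2] = norm[:-3] + "y"      # cherries -> cherry
    elif norm.endswith("sses"):
        term[2] = norm[:-2]            # stresses -> stress
    elif norm.endswith("s"):
        term[2] = norm[:-1]            # examples -> example


RULES = [verb_after_modal, normalize_plural]


def apply_rules(tagged):
    """Run every rule over every [term, tag, norm] triple, in order."""
    for index in range(len(tagged)):
        for rule in RULES:
            rule(index, tagged)
    return tagged


print(apply_rules([["can", "MD", "can"], ["really", "RB", "really"],
                   ["jump", "NN", "jump"], ["cherries", "NNS", "cherries"]]))
```

Irregular plurals such as 'men' and 'feet' match none of the suffixes and pass through unchanged, which mirrors the behaviour shown above. A fuller version would need more suffix cases (for example words ending in 'xes').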
    

Term Extraction

Now that we can tag a text, let's have a look at the term extraction.

>>> from topia.termextract import extract
>>> extractor = extract.TermExtractor()
>>> extractor
<TermExtractor using <Tagger for english>>

As you can see, the extractor maintains a tagger:

>>> extractor.tagger
<Tagger for english>

When creating an extractor, you can also pass in a tagger to avoid initializing a new tagger each time:

>>> extractor = extract.TermExtractor(tagger)
>>> extractor.tagger is tagger
True

Let's get the terms for a simple text.

>>> extractor("The fox can't jump over the fox's tail.")
[]

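We got no terms. That is because, by default, a single-word term must occur at least 3 times to be selected.

The extractor maintains a filter component. Let's register the trivial permissive filter, which simply returns everything the extractor suggests: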

>>> extractor.filter = extract.permissiveFilter
>>> extractor("The fox can't jump over the fox's tail.")
[('tail', 1, 1), ('fox', 2, 1)]

But let's look at the default filter again, since it allows its parameters to be tweaked:

>>> extractor.filter = extract.DefaultFilter(singleStrengthMinOccur=2)
>>> extractor("The fox can't jump over the fox's tail.")
[('fox', 2, 1)]

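Let's now have a look at multi-word terms. Multi-word nouns and proper names often occur only once or twice in a text, but they usually make great terms! To handle this scenario, the concept of "strength" was introduced. Currently the strength is simply the number of words in the term. By default, all terms with a strength greater than 1 are selected, regardless of their number of occurrences.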

>>> extractor('The German consul of Boston resides in Newton.')
[('German consul', 1, 2)]
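
The selection logic described above, a minimum occurrence count for single-word terms and unconditional acceptance of multi-word terms, can be sketched as a small callable. The class name `SketchFilter` and the `no_limit_strength` parameter are hypothetical; only `singleStrengthMinOccur` appears in the text above, and the package's real `DefaultFilter` may differ.

```python
class SketchFilter:
    """Accept a term if it is strong enough or frequent enough."""

    def __init__(self, single_strength_min_occur=3, no_limit_strength=2):
        self.single_strength_min_occur = single_strength_min_occur
        self.no_limit_strength = no_limit_strength

    def __call__(self, term, occurrences, strength):
        if strength >= self.no_limit_strength:
            return True  # multi-word terms pass regardless of frequency
        return occurrences >= self.single_strength_min_occur


candidates = [("tail", 1, 1), ("fox", 2, 1), ("German consul", 1, 2)]
keep = SketchFilter(single_strength_min_occur=2)
print([c for c in candidates if keep(*c)])
# prints [('fox', 2, 1), ('German consul', 1, 2)]
```

Writing the filter as a callable that takes the term, its occurrence count and its strength keeps it interchangeable with simpler functions, such as one that always returns `True`.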

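Example - A News Article

This document provides a simple example of extracting the terms of a BBC article from May 29, 2009. We will use several term extraction tools to compare the results.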

>>> text ='''
... Police shut Palestinian theatre in Jerusalem.
...
... Israeli police have shut down a Palestinian theatre in East Jerusalem.
...
... The action, on Thursday, prevented the closing event of an international
... literature festival from taking place.
...
... Police said they were acting on a court order, issued after intelligence
... indicated that the Palestinian Authority was involved in the event.
...
... Israel has occupied East Jerusalem since 1967 and has annexed the
... area. This is not recognised by the international community.
...
... The British consul-general in Jerusalem , Richard Makepeace, was
... attending the event.
...
... "I think all lovers of literature would regard this as a very
... regrettable moment and regrettable decision," he added.
...
... Mr Makepeace said the festival's closing event would be reorganised to
... take place at the British Council in Jerusalem.
...
... The Israeli authorities often take action against events in East
... Jerusalem they see as connected to the Palestinian Authority.
...
... Saturday's opening event at the same theatre was also shut down.
...
... A police notice said the closure was on the orders of Israel's internal
... security minister on the grounds of a breach of interim peace accords
... from the 1990s.
...
... These laid the framework for talks on establishing a Palestinian state
... alongside Israel, but left the status of Jerusalem to be determined by
... further negotiation.
...
... Israel has annexed East Jerusalem and declares it part of its eternal
... capital.
...
... Palestinians hope to establish their capital in the area.
... '''

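Yahoo Keyword Extraction

Yahoo provides a service that extracts terms from content using its vast search database.

http://developer.yahoo.com/search/content/V1/termExtraction.html

As you can see, the results are pretty good: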

<ResultSet>
   <Result>british consul general</Result>
   <Result>east jerusalem</Result>
   <Result>literature festival</Result>
   <Result>richard makepeace</Result>
   <Result>international literature</Result>
   <Result>israeli authorities</Result>
   <Result>eternal capital</Result>
   <Result>peace accords</Result>
   <Result>security minister</Result>
   <Result>israeli police</Result>
   <Result>internal security</Result>
   <Result>palestinian state</Result>
   <Result>palestinian authority</Result>
   <Result>british council</Result>
   <Result>palestinians</Result>
   <Result>negotiation</Result>
   <Result>breach</Result>
   <Result>1990s</Result>
   <Result>closure</Result>
   <Result>israel</Result>
</ResultSet>

Unfortunately, the service only allows 5,000 requests per day. Also, the terms do not carry a strength indicator.
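
Turning a response like the one above into a Python list takes only a few lines with the standard library's `xml.etree.ElementTree`. The sketch assumes the response has exactly the un-namespaced shape shown here; a real response may carry an XML namespace or other attributes, which would require adjusting the lookup.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the response shown above.
response = """
<ResultSet>
   <Result>british consul general</Result>
   <Result>east jerusalem</Result>
   <Result>literature festival</Result>
</ResultSet>
"""

root = ET.fromstring(response.strip())
terms = [result.text for result in root.iter("Result")]
print(terms)
# prints ['british consul general', 'east jerusalem', 'literature festival']
```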

TreeTagger

A POS tagger that uses some linguistic analysis to tag text. Here is its output:

Police          NNS       Police
shut            VVD       shut
Palestinian     JJ        Palestinian
theatre         NN        theatre
in              IN        in
Jerusalem       NP        Jerusalem
.               SENT      .
Israeli         JJ        Israeli
police          NNS       police
have            VHP       have
shut            VVN       shut
down            RP        down
a               DT        a
Palestinian     JJ        Palestinian
theatre         NN        theatre
in              IN        in
East            NP        East
Jerusalem       NP        Jerusalem
.               SENT      .
The             DT        the
action          NN        action
,               ,         ,
on              IN        on
Thursday        NP        Thursday
,               ,         ,
prevented       VVD       prevent
the             DT        the
closing         NN        closing
event           NN        event
of              IN        of
an              DT        an
international   JJ        international
literature      NN        literature
festival        NN        festival
from            IN        from
taking          VVG       take
place           NN        place
.               SENT      .
Police          NNS       Police
said            VVD       say
they            PP        they
were            VBD       be
acting          VVG       act
on              IN        on
a               DT        a
court           NN        court
order           NN        order
,               ,         ,
issued          VVN       issue
after           IN        after
intelligence    NN        intelligence
indicated       VVN       indicate
that            IN        that
the             DT        the
Palestinian     NP        Palestinian
Authority       NP        Authority
was             VBD       be
involved        VVN       involve
in              IN        in
the             DT        the
event           NN        event
.               SENT      .
Israel          NP        Israel
has             VHZ       have
occupied        VVN       occupy
East            NP        East
Jerusalem       NP        Jerusalem
since           IN        since
1967            CD        @card@
and             CC        and
has             VHZ       have
annexed         VVN       annex
the             DT        the
area            NN        area
.               SENT      .
This            DT        this
is              VBZ       be
not             RB        not
recognised      VVN       recognise
by              IN        by
the             DT        the
international   JJ        international
community       NN        community
.               SENT      .
The             DT        the
British         JJ        British
consul-general  NN        <unknown>
in              IN        in
Jerusalem       NP        Jerusalem
,               ,         ,
Richard         NP        Richard
Makepeace       NP        Makepeace
,               ,         ,
was             VBD       be
attending       VVG       attend
the             DT        the
event           NN        event
.               SENT      .
"               ``        "
I               PP        I
think           VVP       think
all             DT        all
lovers          NNS       lover
of              IN        of
literature      NN        literature
would           MD        would
regard          VV        regard
this            DT        this
as              IN        as
a               DT        a
very            RB        very
regrettable     JJ        regrettable
moment          NN        moment
and             CC        and
regrettable     JJ        regrettable
decision        NN        decision
,               ,         ,
"               ''        "
he              PP        he
added           VVD       add
.               SENT      .
Mr              NP        Mr
Makepeace       NP        Makepeace
said            VVD       say
the             DT        the
festival        NN        festival
's              POS       's
closing         NN        closing
event           NN        event
would           MD        would
be              VB        be
reorganised     VVN       <unknown>
to              TO        to
take            VV        take
place           NN        place
at              IN        at
the             DT        the
British         NP        British
Council         NP        Council
in              IN        in
Jerusalem       NP        Jerusalem
.               SENT      .
The             DT        the
Israeli         JJ        Israeli
authorities     NNS       authority
often           RB        often
take            VVP       take
action          NN        action
against         IN        against
events          NNS       event
in              IN        in
East            NP        East
Jerusalem       NP        Jerusalem
they            PP        they
see             VVP       see
as              RB        as
connected       VVN       connect
to              TO        to
the             DT        the
Palestinian     JJ        Palestinian
Authority       NP        Authority
.               SENT      .
Saturday        NP        Saturday
's              POS       's
opening         NN        opening
event           NN        event
at              IN        at
the             DT        the
same            JJ        same
theatre         NN        theatre
was             VBD       be
also            RB        also
shut            VVN       shut
down            RP        down
.               SENT      .
A               DT        a
police          NN        police
notice          NN        notice
said            VVD       say
the             DT        the
closure         NN        closure
was             VBD       be
on              IN        on
the             DT        the
orders          NNS       order
of              IN        of
Israel          NP        Israel
's              POS       's
internal        JJ        internal
security        NN        security
minister        NN        minister
on              IN        on
the             DT        the
grounds         NNS       ground
of              IN        of
a               DT        a
breach          NN        breach
of              IN        of
interim         JJ        interim
peace           NN        peace
accords         NNS       accord
from            IN        from
the             DT        the
1990s           NNS       1990s
.               SENT      .
These           DT        these
laid            VVD       lay
the             DT        the
framework       NN        framework
for             IN        for
talks           NNS       talk
on              IN        on
establishing    VVG       establish
a               DT        a
Palestinian     JJ        Palestinian
state           NN        state
alongside       IN        alongside
Israel          NP        Israel
,               ,         ,
but             CC        but
left            VVD       leave
the             DT        the
status          NN        status
of              IN        of
Jerusalem       NP        Jerusalem
to              TO        to
be              VB        be
determined      VVN       determine
by              IN        by
further         JJR       further
negotiation     NN        negotiation
.               SENT      .
Israel          NP        Israel
has             VHZ       have
annexed         VVN       annex
East            NP        East
Jerusalem       NP        Jerusalem
and             CC        and
declares        VVZ       declare
it              PP        it
part            NN        part
of              IN        of
its             PP$       its
eternal         JJ        eternal
capital         NN        capital
.               SENT      .
Palestinians    NPS       Palestinians
hope            VVP       hope
to              TO        to
establish       VV        establish
their           PP$       their
capital         NN        capital
in              IN        in
the             DT        the
area            NN        area
.               SENT      .

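As you can see, TreeTagger's recognition is fairly good, but its output requires some analysis before it yields a useful set of terms. Also, TreeTagger is not free for commercial use.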
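
As an illustration of the analysis that TreeTagger output still needs, the following sketch parses the three-column format shown above (token, tag, lemma) and counts noun lemmas. The set of tags treated as nouns and the fallback to the token when the lemma is `<unknown>` are assumptions made for this example.

```python
from collections import Counter

NOUN_TAGS = {"NN", "NNS", "NP", "NPS"}  # assumed noun tags for this sketch


def count_noun_lemmas(tagger_output):
    """Count noun lemmas in whitespace-separated token/tag/lemma lines."""
    counts = Counter()
    for line in tagger_output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token, tag, lemma = parts
        if tag in NOUN_TAGS:
            # Fall back to the surface token when no lemma is known.
            counts[token if lemma == "<unknown>" else lemma] += 1
    return counts


sample = """\
The             DT        the
British         JJ        British
consul-general  NN        <unknown>
attending       VVG       attend
the             DT        the
event           NN        event
"""
print(count_noun_lemmas(sample).most_common())
```

Counting single lemmas is only a first step; multi-word terms such as "East Jerusalem" would still have to be assembled from adjacent tokens.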

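Topia's Term Extractor

Topia's term extractor tries to produce results somewhere between those of a POS tagger such as TreeTagger and the Yahoo keyword extractor.

Since we are only interested in nouns, we can deploy a very simple POS tagging algorithm that provides good results in most cases. We then use some simple statistics and linguistics to produce a narrow but strong list of terms.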

>>> from topia.termextract import extract
>>> extractor = extract.TermExtractor()

Let's first look at the result of the tagger:

>>> printTaggedTerms(extractor.tagger(text)) #doctest: +REPORT_NDIFF
police          NN    police
shut            VBN   shut
Palestinian     JJ    Palestinian
theatre         NN    theatre
in              IN    in
Jerusalem       NNP   Jerusalem
.               .     .
Israeli         JJ    Israeli
police          NN    police
have            VBP   have
shut            VBN   shut
down            RB    down
a               DT    a
Palestinian     JJ    Palestinian
theatre         NN    theatre
in              IN    in
East            NNP   East
Jerusalem       NNP   Jerusalem
.               .     .
The             DT    The
action          NN    action
,               ,     ,
on              IN    on
Thursday        NNP   Thursday
,               ,     ,
prevented       VBN   prevented
the             DT    the
closing         VBG   closing
event           NN    event
of              IN    of
an              DT    an
international   JJ    international
literature      NN    literature
festival        NN    festival
from            IN    from
taking          VBG   taking
place           NN    place
.               .     .
police          NN    police
said            VBD   said
they            PRP   they
were            VBD   were
acting          VBG   acting
on              IN    on
a               DT    a
court           NN    court
order           NN    order
,               ,     ,
issued          VBN   issued
after           IN    after
intelligence    NN    intelligence
indicated       VBD   indicated
that            IN    that
the             DT    the
Palestinian     JJ    Palestinian
Authority       NNP   Authority
was             VBD   was
involved        VBN   involved
in              IN    in
the             DT    the
event           NN    event
.               .     .
Israel          NNP   Israel
has             VBZ   has
occupied        VBN   occupied
East            NNP   East
Jerusalem       NNP   Jerusalem
since           IN    since
1967            NN    1967
and             CC    and
has             VBZ   has
annexed         VBD   annexed
the             DT    the
area            NN    area
.               .     .
This            DT    This
is              VBZ   is
not             RB    not
recognised      VBD   recognised
by              IN    by
the             DT    the
international   JJ    international
community       NN    community
.               .     .
The             DT    The
British         JJ    British
consul-general  NN    consul-general
in              IN    in
Jerusalem       NNP   Jerusalem
,               ,     ,
Richard         NNP   Richard
Makepeace       NNP   Makepeace
,               ,     ,
was             VBD   was
attending       VBG   attending
the             DT    the
event           NN    event
.               .     .
"               "     "
I               PRP   I
think           VBP   think
all             DT    all
lovers          NNS   lover
of              IN    of
literature      NN    literature
would           MD    would
regard          VB    regard
this            DT    this
as              IN    as
a               DT    a
very            RB    very
regrettable     JJ    regrettable
moment          NN    moment
and             CC    and
regrettable     JJ    regrettable
decision        NN    decision
,"              ,     ,"
he              PRP   he
added           VBD   added
.               .     .
Mr              NNP   Mr
Makepeace       NNP   Makepeace
said            VBD   said
the             DT    the
festival        NN    festival
's              POS   's
closing         VBG   closing
event           NN    event
would           MD    would
be              VB    be
reorganised     NN    reorganised
to              TO    to
take            VB    take
place           NN    place
at              IN    at
the             DT    the
British         JJ    British
Council         NNP   Council
in              IN    in
Jerusalem       NNP   Jerusalem
.               .     .
The             DT    The
Israeli         JJ    Israeli
authorities     NNS   authority
often           RB    often
take            VB    take
action          NN    action
against         IN    against
events          NNS   event
in              IN    in
East            NNP   East
Jerusalem       NNP   Jerusalem
they            PRP   they
see             VB    see
as              IN    as
connected       VBN   connected
to              TO    to
the             DT    the
Palestinian     JJ    Palestinian
Authority       NNP   Authority
.               .     .
Saturday        NNP   Saturday
's              POS   's
opening         NN    opening
event           NN    event
at              IN    at
the             DT    the
same            JJ    same
theatre         NN    theatre
was             VBD   was
also            RB    also
shut            VBN   shut
down            RB    down
.               .     .
A               DT    A
police          NN    police
notice          NN    notice
said            VBD   said
the             DT    the
closure         NN    closure
was             VBD   was
on              IN    on
the             DT    the
orders          NNS   order
of              IN    of
Israel          NNP   Israel
's              POS   's
internal        JJ    internal
security        NN    security
minister        NN    minister
on              IN    on
the             DT    the
grounds         NNS   ground
of              IN    of
a               DT    a
breach          NN    breach
of              IN    of
interim         JJ    interim
peace           NN    peace
accords         NNS   accord
from            IN    from
the             DT    the
1990            NN    1990
s               PRP   s
.               .     .
These           DT    These
laid            VBN   laid
the             DT    the
framework       NN    framework
for             IN    for
talks           NNS   talk
on              IN    on
establishing    VBG   establishing
a               DT    a
Palestinian     JJ    Palestinian
state           NN    state
alongside       IN    alongside
Israel          NNP   Israel
,               ,     ,
but             CC    but
left            VBN   left
the             DT    the
status          NN    status
of              IN    of
Jerusalem       NNP   Jerusalem
to              TO    to
be              VB    be
determined      VBN   determined
by              IN    by
further         JJ    further
negotiation     NN    negotiation
.               .     .
Israel          NNP   Israel
has             VBZ   has
annexed         VBD   annexed
East            NNP   East
Jerusalem       NNP   Jerusalem
and             CC    and
declares        VBZ   declares
it              PRP   it
part            NN    part
of              IN    of
its             PRP$  its
eternal         JJ    eternal
capital         NN    capital
.               .     .
Palestinians    NNPS  Palestinian
hope            NN    hope
to              TO    to
establish       VB    establish
their           PRP$  their
capital         NN    capital
in              IN    in
the             DT    the
area            NN    area
.               .     .
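
`printTaggedTerms` is not defined anywhere in this document; it is presumably a small test helper. A minimal stand-in that produces the column layout shown above might look as follows. The function name `format_tagged_terms` and the column widths (16 and 6 characters, read off the listing) are assumptions.

```python
def format_tagged_terms(tagged_terms):
    """Render [term, tag, norm] triples as aligned three-column text."""
    return "\n".join(
        f"{term:<16}{tag:<6}{norm}" for term, tag, norm in tagged_terms
    )


print(format_tagged_terms([
    ["police", "NN", "police"],
    ["lovers", "NNS", "lover"],
    ["its", "PRP$", "its"],
]))
```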

Let's now apply the extractor.

>>> sorted(extractor(text))
[('British Council', 1, 2),
 ('British consul-general', 1, 2),
 ('East', 4, 1),
 ('East Jerusalem', 4, 2),
 ('Israel', 4, 1),
 ('Israeli authorities', 1, 2),
 ('Israeli police', 1, 2),
 ('Jerusalem', 8, 1),
 ('Mr Makepeace', 1, 2),
 ('Palestinian', 6, 1),
 ('Palestinian Authority', 2, 2),
 ('Palestinian state', 1, 2),
 ('Palestinian theatre', 2, 2),
 ('Palestinians hope', 1, 2),
 ('Richard Makepeace', 1, 2),
 ('court order', 1, 2),
 ('event', 6, 1),
 ('literature festival', 1, 2),
 ('opening event', 1, 2),
 ('peace accords', 1, 2),
 ('police', 4, 1),
 ('police notice', 1, 2),
 ('security minister', 1, 2),
 ('theatre', 3, 1)]
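
The result above pairs each term with its occurrence count and its strength. One plausible way to arrive at such triples is to collect runs of adjacent noun-tagged tokens, count each run and each of its words, and take the word count as the strength. The sketch below does this under assumptions inferred from the output above rather than from the package's source: tags starting with `NN` extend a run, and a capitalized adjective (`JJ`) may start one. The function name and the input format are hypothetical, and no filter is applied.

```python
from collections import Counter


def extract_candidates(tagged_terms):
    """Return (term, occurrences, strength) for runs of noun-like tokens."""
    counts = Counter()
    run = []

    def flush():
        if len(run) > 1:
            counts[" ".join(run)] += 1   # the multi-word term itself
        for word in run:
            counts[word] += 1            # and each of its words
        run.clear()

    for term, tag, norm in tagged_terms:
        if tag.startswith("NN"):
            run.append(norm)
        elif tag == "JJ" and term[:1].isupper():
            flush()                      # a capitalized adjective starts a new run
            run.append(norm)
        else:
            flush()
    flush()
    return [(t, n, len(t.split())) for t, n in counts.items()]


tagged = [["Israeli", "JJ", "Israeli"], ["police", "NN", "police"],
          ["have", "VBP", "have"], ["shut", "VBN", "shut"],
          ["a", "DT", "a"], ["Palestinian", "JJ", "Palestinian"],
          ["theatre", "NN", "theatre"], ["in", "IN", "in"],
          ["East", "NNP", "East"], ["Jerusalem", "NNP", "Jerusalem"]]
for candidate in sorted(extract_candidates(tagged)):
    print(candidate)
```

Passing these candidates through a filter that requires several occurrences of single-word terms would then discard rare single words and keep the multi-word terms.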

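Changes

1.1.0 (2009-06-29)

  • Tweaked the lexicon slightly to improve results for real-world text.

1.0.0 (2009-05-30)

  • Initial release

    • POS text tagging using an existing lexicon and very simple linguistic rules.

    • Term extraction based on occurrence frequency and term strength.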
