A tokenizer, text cleaner, and phonemizer for many human languages.

Gruut

A tokenizer, text cleaner, and IPA phonemizer for several human languages that supports SSML (Speech Synthesis Markup Language).

from gruut import sentences

text = 'He wound it around the wound, saying "I read it was $10 to read."'

for sent in sentences(text, lang="en-us"):
    for word in sent:
        if word.phonemes:
            print(word.text, *word.phonemes)

Output:

He h ˈi
wound w ˈaʊ n d
it ˈɪ t
around ɚ ˈaʊ n d
the ð ə
wound w ˈu n d
, |
saying s ˈeɪ ɪ ŋ
I ˈaɪ
read ɹ ˈɛ d
it ˈɪ t
was w ə z
ten t ˈɛ n
dollars d ˈɑ l ɚ z
to t ə
read ɹ ˈi d
. ‖

Note that "wound" and "read" are pronounced differently depending on their (grammatical) context.

A subset of SSML is also supported:

from gruut import sentences

ssml_text = """<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
    xml:lang="en-US">
<s>Today at 4pm, 2/1/2000.</s>
<s xml:lang="it">Un mese fà, 2/1/2000.</s>
</speak>"""

for sent in sentences(ssml_text, ssml=True):
    for word in sent:
        if word.phonemes:
            print(sent.idx, word.lang, word.text, *word.phonemes)

Output:

0 en-US Today t ə d ˈeɪ
0 en-US at ˈæ t
0 en-US four f ˈɔ ɹ
0 en-US P p ˈi
0 en-US M ˈɛ m
0 en-US , |
0 en-US February f ˈɛ b j u ˌɛ ɹ i
0 en-US first f ˈɚ s t
0 en-US , |
0 en-US two t ˈu
0 en-US thousand θ ˈaʊ z ə n d
0 en-US . ‖
1 it Un u n
1 it mese ˈm e s e
1 it fà f a
1 it , |
1 it due d j u
1 it gennaio d͡ʒ e n n ˈa j o
1 it duemila d u e ˈm i l a
1 it . ‖

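See the documentation for more details.

Installation

pip install gruut

Languages other than English can be added during installation. For example, to add support for French and Italian:

pip install -f 'https://synesthesiam.github.io/prebuilt-apps/' gruut[fr,it]

The extra pip repository is needed for an updated num2words fork that includes support for more languages.

You may also manually download language files and put them in $XDG_CONFIG_HOME/gruut/ (which defaults to $HOME/.config/gruut).

If the corresponding Python package is not installed, gruut looks for language files in the directory $XDG_CONFIG_HOME/gruut/<lang>/. Note that <lang> here is the full language name, e.g. de-de rather than just de.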
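As a rough illustration of the directory layout described above, the following sketch computes where manually downloaded files for a language would be expected. The helper name language_dir and its env parameter are hypothetical and not part of gruut's API; the code only mirrors the documented path rule.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def language_dir(lang: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return <config home>/gruut/<lang> following the rule described above.

    Hypothetical helper for illustration; gruut's own lookup code may differ.
    """
    env = os.environ if env is None else env
    config_home = env.get("XDG_CONFIG_HOME")
    if config_home:
        base = Path(config_home)
    else:
        # Documented default: $HOME/.config
        home = env.get("HOME")
        base = (Path(home) if home else Path.home()) / ".config"
    # <lang> is the full language name, e.g. "de-de" rather than "de"
    return base / "gruut" / lang


if __name__ == "__main__":
    print(language_dir("de-de"))
```

Passing an explicit mapping as env makes the function easy to try without touching the real environment.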

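Supported Languages

gruut currently supports:

Arabic (ar)
Czech (cs or cs-cz)
German (de or de-de)
English (en or en-us)
Spanish (es or es-es)
Farsi/Persian (fa)
French (fr or fr-fr)
Italian (it or it-it)
Luxembourgish (lb)
Dutch (nl)
Russian (ru or ru-ru)
Swedish (sv or sv-se)
Swahili (sw)

The goal is to support all of voice2json's languages.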

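Dependencies

Python 3.7 or higher
Linux
- Tested on Debian Bullseye
num2words fork and Babel
- Currency/number handling
- The num2words fork includes additional language support (Arabic, Farsi, Swedish, Swahili)
gruut-ipa
- IPA pronunciation manipulation
pycrfsuite
- Part-of-speech tagging and grapheme-to-phoneme models
pydateparser
- Multi-language date parsing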

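Numbers, Dates, and More

gruut can automatically convert numbers, dates, and other expressions into words. Both parsing and verbalization are locale-aware, so "1/1/2020" may be interpreted as "M/D/Y" or "D/M/Y" depending on the language of the word or sentence (e.g., <s lang="...">).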
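To make the ambiguity concrete, here is a small standard-library sketch of locale-dependent date parsing. The DATE_FORMATS table and parse_numeric_date function are assumptions made for illustration; gruut itself delegates date parsing to its dependencies and does not necessarily work this way.

```python
from datetime import date, datetime

# Assumed day/month ordering per language, for illustration only.
DATE_FORMATS = {
    "en-us": "%m/%d/%Y",  # month first
    "it-it": "%d/%m/%Y",  # day first
}


def parse_numeric_date(text: str, lang: str) -> date:
    """Parse a numeric date using the ordering assumed for `lang`."""
    return datetime.strptime(text, DATE_FORMATS[lang]).date()


if __name__ == "__main__":
    print(parse_numeric_date("2/1/2000", "en-us"))  # 2000-02-01
    print(parse_numeric_date("2/1/2000", "it-it"))  # 2000-01-02
```

The same string yields February 1 under the month-first ordering and January 2 under the day-first ordering, which is why the language attached to a word or sentence matters.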

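gruut can automatically expand the following types of expressions into words:

Numbers - "123" to "one hundred and twenty three" (disable with verbalize_numbers=False or --no-numbers)
- Relies on Babel for parsing and num2words for verbalization
Dates - "1/1/2020" to "January first, twenty twenty" (disable with verbalize_dates=False or --no-dates)
- Relies on pydateparser for parsing and on both Babel and num2words for verbalization
Currency - "$10" to "ten dollars" (disable with verbalize_currency=False or --no-currency)
- Relies on Babel for parsing and on both Babel and num2words for verbalization
Times - "12:01am" to "twelve oh one A M" (disable with verbalize_times=False or --no-times)
- English only
- Relies on num2words for verbalization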

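Command-Line Usage

The gruut module can be executed with python3 -m gruut --language <LANGUAGE> <TEXT> or with the gruut command (from setup.py).

The gruut command is line-oriented, consuming text and producing JSONL. You will probably want to install jq to manipulate the JSONL output from gruut.

Plain Text

Takes raw text and outputs JSONL with cleaned words/tokens.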

echo 'This, right here, is some "RAW" text!' \
   | gruut --language en-us \
   | jq --raw-output '.words[].text'
This
,
right
here
,
is
some
"
RAW
"
text
!
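If jq is not available, the same extraction can be sketched in Python with only the standard library. The function word_texts below is a hypothetical helper, and the sample record is hand-written to resemble the JSONL shape described here rather than captured from a real gruut run.

```python
import json
from typing import Iterable, Iterator


def word_texts(jsonl_lines: Iterable[str]) -> Iterator[str]:
    """Yield the text of every word, like jq's '.words[].text' filter."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        sentence = json.loads(line)
        for word in sentence.get("words", []):
            yield word["text"]


if __name__ == "__main__":
    # Hand-written sample record (one JSON object per line).
    sample = ['{"idx": 0, "text": "Hi!", "words": [{"text": "Hi"}, {"text": "!"}]}']
    print(list(word_texts(sample)))  # ['Hi', '!']
```

In a pipeline, sys.stdin could be passed in place of the sample list so that each line produced by gruut is decoded as it arrives.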

More information is available in the full JSON output:

gruut --language en-us 'More  text.' | jq .

Output:

{
  "idx": 0,
  "text": "More text.",
  "text_with_ws": "More text.",
  "text_spoken": "More text",
  "par_idx": 0,
  "lang": "en-us",
  "voice": "",
  "words": [
    {
      "idx": 0,
      "text": "More",
      "text_with_ws": "More ",
      "leading_ws": "",
      "trailing_ws": " ",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": "JJR",
      "phonemes": [
        "m",
        "ˈɔ",
        "ɹ"
      ],
      "is_major_break": false,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": false,
      "is_spoken": true,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    },
    {
      "idx": 1,
      "text": "text",
      "text_with_ws": "text",
      "leading_ws": "",
      "trailing_ws": "",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": "NN",
      "phonemes": [
        "t",
        "ˈɛ",
        "k",
        "s",
        "t"
      ],
      "is_major_break": false,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": false,
      "is_spoken": true,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    },
    {
      "idx": 2,
      "text": ".",
      "text_with_ws": ".",
      "leading_ws": "",
      "trailing_ws": "",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": null,
      "phonemes": [
        "‖"
      ],
      "is_major_break": true,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": true,
      "is_spoken": false,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    }
  ],
  "pause_before_ms": 0,
  "pause_after_ms": 0
}

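For the whole input line and for every word, the text property contains the processed input text with normalized whitespace, while text_with_ws retains the original whitespace. The text_spoken property contains only the words that are spoken, so it excludes punctuation and breaks.

Within each word:

idx - zero-based index of the word in the sentence
sent_idx - zero-based index of the sentence in the input text
pos - part-of-speech tag (if available)
phonemes - list of IPA phonemes for the word (if available)
is_minor_break - true if the "word" separates phrases (comma, semicolon, etc.)
is_major_break - true if the "word" separates sentences (period, question mark, etc.)
is_break - true if the "word" is a major or minor break
is_punctuation - true if the "word" is surrounding punctuation (quotes, brackets, etc.)
is_spoken - true if the word is not a break or punctuation

See python3 -m gruut <LANGUAGE> --help for more options.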
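The relationship between the per-word flags and text_spoken can be sketched as follows. The record is a trimmed, hand-copied version of the "More text." output, and joining spoken words with a single space is an assumption about how text_spoken is assembled, not a statement about gruut's internals.

```python
import json

# Trimmed copy of the example record, keeping only the fields used here.
RECORD = json.loads("""
{"text": "More text.",
 "words": [
   {"text": "More", "is_spoken": true,  "is_break": false, "is_punctuation": false},
   {"text": "text", "is_spoken": true,  "is_break": false, "is_punctuation": false},
   {"text": ".",    "is_spoken": false, "is_break": true,  "is_punctuation": false}
 ]}
""")


def spoken_text(sentence: dict) -> str:
    """Join only the words flagged as spoken (no breaks, no punctuation)."""
    return " ".join(w["text"] for w in sentence["words"] if w["is_spoken"])


if __name__ == "__main__":
    print(spoken_text(RECORD))  # More text
```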

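SSML

A subset of SSML (Speech Synthesis Markup Language) is supported:

<speak> - wrap around SSML text
- lang - set language for document
<p> - paragraph
- lang - set language for paragraph
<s> - sentence (disables automatic sentence breaking)
- lang - set language for sentence
<w> / <token> - word (disables automatic tokenization)
- lang - set language for word
- role - set word role (see word roles)
<lang lang="..."> - set language of inner text
<voice name="..."> - set voice of inner text
<say-as interpret-as=""> - force interpretation of inner text
- interpret-as - one of "spell-out", "date", "number", "time", or "currency"
- format - how to format text depending on interpret-as
  - number - one of "cardinal", "ordinal", "digits", or "year"
  - date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
<break time=""> - pause for given amount of time
- time - seconds ("123s") or milliseconds ("123ms")
<mark name=""> - user-defined marks (marks_before and marks_after attributes of words/sentences)
- name - name of mark
<sub alias=""> - substitute alias for inner text
<phoneme ph="..."> - supply phonemes for inner text
- ph - phonemes for each word of inner text, separated by whitespace
<lexicon id="..."> - inline or external pronunciation lexicon
- id - unique id of lexicon (used in <lookup ref="...">)
- uri - if empty or missing, lexicon is inline
- one or more <lexeme> child elements
  - optional role="..." (word roles separated by whitespace)
  - <grapheme>WORD</grapheme> - word text
  - <phoneme>P H O N E M E S</phoneme> - word pronunciation (phonemes separated by whitespace)
<lookup ref="..."> - use pronunciation lexicon for child elements
- ref - id from a <lexicon id="...">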
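As one example of handling these attributes, the two time formats accepted by <break> can be normalized to milliseconds, the unit used by the pause_before_ms and pause_after_ms fields in the JSON output. The function break_time_to_ms is a hypothetical sketch limited to whole numbers; it is not gruut's parser.

```python
import re

# Whole number followed by "ms" or "s", e.g. "123ms" or "123s".
_BREAK_TIME = re.compile(r"^\s*(\d+)\s*(ms|s)\s*$")


def break_time_to_ms(value: str) -> int:
    """Convert a <break time="..."> value to milliseconds."""
    match = _BREAK_TIME.match(value)
    if match is None:
        raise ValueError(f"unsupported break time: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * 1000 if unit == "s" else amount


if __name__ == "__main__":
    print(break_time_to_ms("123s"))   # 123000
    print(break_time_to_ms("123ms"))  # 123
```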

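Word Roles

During phonemization, word roles are used to disambiguate pronunciations. Unless manually specified, a word's role is derived from its part-of-speech tag as gruut:<TAG>. For initialisms and spell-out, the role gruut:letter is used to indicate that, e.g., "a" should be spoken as /eɪ/ instead of /ə/.

For en-us, the following additional roles are available from the part-of-speech tagger:

gruut:CD - number
gruut:DT - determiner
gruut:IN - preposition or subordinating conjunction
gruut:JJ - adjective
gruut:NN - noun
gruut:PRP - personal pronoun
gruut:RB - adverb
gruut:VB - verb
gruut:VBD - verb (past tense)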
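A minimal sketch of role-based disambiguation follows. The phonemes for "wound" are taken from the example output near the top of this page, but which role maps to which pronunciation, the toy LEXICON structure, and the helper names are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

DEFAULT_ROLE = ""  # hypothetical marker meaning "no specific role"

# Toy lexicon: word -> role -> phonemes. Role assignments are assumed.
LEXICON: Dict[str, Dict[str, List[str]]] = {
    "wound": {
        "gruut:VBD": ["w", "ˈaʊ", "n", "d"],  # past tense of "wind"
        "gruut:NN": ["w", "ˈu", "n", "d"],    # an injury
    },
    "a": {
        DEFAULT_ROLE: ["ə"],
        "gruut:letter": ["eɪ"],               # spelled out as a letter
    },
}


def role_from_pos(pos: Optional[str]) -> str:
    """Derive a role of the form gruut:<TAG> from a part-of-speech tag."""
    return f"gruut:{pos}" if pos else DEFAULT_ROLE


def pronounce(word: str, pos: Optional[str] = None, role: Optional[str] = None) -> List[str]:
    """Pick a pronunciation by explicit role, else by the POS-derived role."""
    by_role = LEXICON[word.lower()]
    chosen = role or role_from_pos(pos)
    if chosen in by_role:
        return by_role[chosen]
    # Fall back to the default entry, or to the first listed pronunciation.
    return by_role.get(DEFAULT_ROLE) or next(iter(by_role.values()))


if __name__ == "__main__":
    print(pronounce("wound", pos="VBD"))
    print(pronounce("wound", pos="NN"))
    print(pronounce("a", role="gruut:letter"))
```

An explicitly supplied role takes priority over the tag-derived one, matching the "unless manually specified" rule described above.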

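Inline Lexicons

Inline pronunciation lexicons are supported via the <lexicon> and <lookup> tags. gruut diverges slightly from the SSML standard here by allowing lexicons to be defined within the SSML document itself (uri is blank or missing). Additionally, the id attribute of the <lexicon> element can be left off to indicate a "default" inline lexicon that does not require a corresponding <lookup> tag.

For example, the following document will yield three different pronunciations for the word "tomato":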

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        <!-- Individual phonemes are separated by whitespace -->
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">
        tomato
      </grapheme>
      <phoneme>
        <!-- Made up pronunciation for fake word role -->
        t ə m ˈi t oʊ
      </phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
  <lookup ref="test">
    <w>tomato</w>
    <w role="fake-role">tomato</w>
  </lookup>
</speak>

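The first "tomato" will be looked up in the U.S. English lexicon (/t ə m ˈeɪ t oʊ/). Within the scope of the <lookup> tag, the second and third "tomato" words will be looked up in the inline lexicon. The third "tomato" word has a role attached (selecting a made-up pronunciation in this case).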
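To show how such an inline lexicon could be read, the sketch below parses a shortened version of the document with Python's standard xml.etree.ElementTree and builds a table keyed by word and role. This is an illustrative reading of the markup, not gruut's implementation; read_inline_lexicons is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

NS = "{http://www.w3.org/2001/10/synthesis}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SSML = """<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>tomato</grapheme>
      <phoneme>t ə m ˈɑ t oʊ</phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">tomato</grapheme>
      <phoneme>t ə m ˈi t oʊ</phoneme>
    </lexeme>
  </lexicon>
</speak>"""

Lexicon = Dict[Tuple[str, str], List[str]]


def read_inline_lexicons(ssml: str) -> Dict[str, Lexicon]:
    """Map lexicon id -> {(word, role): phonemes}; "" stands for a missing id or role."""
    lexicons: Dict[str, Lexicon] = {}
    for lexicon in ET.fromstring(ssml).iter(f"{NS}lexicon"):
        lexicon_id = lexicon.get(XML_ID) or lexicon.get("id") or ""
        entries = lexicons.setdefault(lexicon_id, {})
        for lexeme in lexicon.iter(f"{NS}lexeme"):
            grapheme = lexeme.find(f"{NS}grapheme")
            phoneme = lexeme.find(f"{NS}phoneme")
            if grapheme is None or phoneme is None:
                continue  # skip incomplete entries
            word = "".join(grapheme.itertext()).strip()
            role = lexeme.get("role") or grapheme.get("role", "")
            # Individual phonemes are separated by whitespace.
            entries[(word, role)] = "".join(phoneme.itertext()).split()
    return lexicons


if __name__ == "__main__":
    for key, phonemes in read_inline_lexicons(SSML)["test"].items():
        print(key, phonemes)
```

Keying on the (word, role) pair is what lets the same spelling carry a separate pronunciation for "fake-role".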

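Going even further from the SSML standard, gruut lets you leave off the <lexicon> id entirely. With no id, a <lookup> tag is no longer needed, allowing you to override the pronunciation of any word in the document: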

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <!-- No id means change all words without a lookup -->
  <lexicon>
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
</speak>

This yields the pronunciation /t ə m ˈɑ t oʊ/ for every instance of "tomato" in the document (unless they have a <lookup>).

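Intended Audience

gruut is useful for transforming raw text into phonetic pronunciations, similar to phonemizer. Unlike phonemizer, gruut looks up words in a pre-built lexicon (pronunciation dictionary) or guesses word pronunciations with a pre-trained grapheme-to-phoneme model. Phonemes for each language come from a carefully chosen inventory.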
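The lookup-then-guess strategy can be sketched in a few lines. The lexicon entries reuse phonemes from the JSON example on this page; letters_as_phonemes is a deliberately naive placeholder standing in for a trained grapheme-to-phoneme model, and all names here are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# Tiny stand-in lexicon; real lexicons are built from open source data.
LEXICON: Dict[str, List[str]] = {
    "more": ["m", "ˈɔ", "ɹ"],
    "text": ["t", "ˈɛ", "k", "s", "t"],
}


def letters_as_phonemes(word: str) -> List[str]:
    """Placeholder guesser: one symbol per letter. NOT a real g2p model."""
    return list(word.lower())


def phonemize(
    words: Sequence[str],
    lexicon: Dict[str, List[str]],
    guess: Callable[[str], List[str]],
) -> List[List[str]]:
    """Prefer the lexicon entry; fall back to guessing for unknown words."""
    return [lexicon.get(word.lower()) or guess(word) for word in words]


if __name__ == "__main__":
    for word, phonemes in zip(["More", "zzz"], phonemize(["More", "zzz"], LEXICON, letters_as_phonemes)):
        print(word, *phonemes)
```

Known words always come from the lexicon, so the guesser only affects out-of-vocabulary words.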

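For each supported language, gruut includes:

A word pronunciation lexicon built from open source data
- See pron_dict
A pre-trained grapheme-to-phoneme model for guessing word pronunciations

Some languages also include:

A pre-trained part-of-speech tagger built from open source data
- See universal dependencies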

Release history

2.4.0 (this version) - July 3, 2024
2.3.4 - June 17, 2022
2.3.3 - June 17, 2022
2.3.2 - May 11, 2022
2.3.1 - May 11, 2022
2.3.0 - March 30, 2022
2.2.3 - March 17, 2022
2.2.2 - March 11, 2022
2.2.0 - December 6, 2021
2.1.1 - December 3, 2021
2.1.0 - November 10, 2021
2.0.4 - November 5, 2021
2.0.3 - November 1, 2021
2.0.2 - October 19, 2021
2.0.1 - October 15, 2021
2.0.0 (yanked) - October 14, 2021
- Reason this release was yanked: bug fix for Python 3.6
1.3.1 - August 2, 2021
1.3.0 - July 22, 2021
1.2.3 - July 11, 2021
1.2.2 - June 18, 2021
1.2.1 - June 16, 2021
1.1.0 - June 9, 2021
1.0.0 - June 1, 2021
0.9.5 - April 27, 2021
0.9.4 - April 14, 2021
0.9.3 - April 12, 2021
0.9.2 - March 31, 2021
0.9.1 - March 26, 2021
0.8.0 - March 5, 2021
0.7.0 - March 3, 2021
0.3.0 - October 26, 2020
0.2.1 - October 9, 2020

Download files

Source distribution

gruut-2.4.0.tar.gz (85.3 kB), uploaded July 3, 2024

Hashes for gruut-2.4.0.tar.gz
Algorithm	Hash digest
SHA256	`a49f693266a3a1ab5a6bde77a8f560ef27712b4169b5a6b02e6a1a873342e19e`
MD5	`bd39118707abc1b256f296e4f7bf779a`
BLAKE2b-256	`fce16b5a01ef36b5341d5d0899401e4413594dfaa21f86cfc05be8efb25baf81`
