SacreMoses

Sacremoses

License

Install

pip install -U sacremoses

Note: Sacremoses supports only Python 3 as of sacremoses>=0.0.41. If you are on Python 2, the last usable version is sacremoses==0.0.40.

Usage (Python)

Tokenizer and Detokenizer

>>> from sacremoses import MosesTokenizer, MosesDetokenizer
>>> mt = MosesTokenizer(lang='en')
>>> text = 'This, is a sentence with weird\xbb symbols\u2026 appearing everywhere\xbf'
>>> expected_tokenized = 'This , is a sentence with weird \xbb symbols \u2026 appearing everywhere \xbf'
>>> tokenized_text = mt.tokenize(text, return_str=True)
>>> tokenized_text == expected_tokenized
True


>>> mt, md = MosesTokenizer(lang='en'), MosesDetokenizer(lang='en')
>>> sent = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [ ] & You're gonna shake it off? Don't?"
>>> expected_tokens = ['This', 'ain', '&apos;t', 'funny', '.', 'It', '&apos;s', 'actually', 'hillarious', ',', 'yet', 'double', 'Ls', '.', '&#124;', '&#91;', '&#93;', '&lt;', '&gt;', '&#91;', '&#93;', '&amp;', 'You', '&apos;re', 'gonna', 'shake', 'it', 'off', '?', 'Don', '&apos;t', '?']
>>> expected_detokens = "This ain't funny. It's actually hillarious, yet double Ls. | [] < > [] & You're gonna shake it off? Don't?"
>>> tokens = mt.tokenize(sent)
>>> tokens == expected_tokens
True
>>> md.detokenize(tokens) == expected_detokens
True
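
The entities in expected_tokens (for example &apos;t and &#124;) come from the tokenizer's XML-style escaping of special characters, which the detokenizer reverses. The sketch below reproduces only the mappings visible in the example above, in plain Python. ESCAPES, escape_token, and unescape_token are hypothetical helpers for illustration, not part of sacremoses, and the library's own rule set may cover more characters.

```python
# Mappings copied from the expected_tokens example above.
# "&" is escaped first so that the "&" introduced by the other
# entities is not escaped a second time.
ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_token(token):
    """Replace each special character with its entity."""
    for char, entity in ESCAPES:
        token = token.replace(char, entity)
    return token


def unescape_token(token):
    """Invert escape_token; "&amp;" is restored last."""
    for char, entity in reversed(ESCAPES):
        token = token.replace(entity, char)
    return token


tokens = ["ain", "'t", "|", "[", "]", "<", ">", "&"]
escaped = [escape_token(t) for t in tokens]
print(escaped)
print([unescape_token(t) for t in escaped] == tokens)
```

Escaping "&" first and unescaping it last keeps the round trip lossless even for text that already contains an entity such as &lt;.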

Truecaser

>>> from sacremoses import MosesTruecaser, MosesTokenizer

# Train a new truecaser from a 'big.txt' file.
>>> mtr = MosesTruecaser()
>>> mtok = MosesTokenizer(lang='en')

# Save the truecase model to 'big.truecasemodel' using `save_to`
>>> tokenized_docs = [mtok.tokenize(line) for line in open('big.txt')]
>>> mtr.train(tokenized_docs, save_to='big.truecasemodel')

# Save the truecase model to 'big.truecasemodel' after training
# (just in case you forgot to use `save_to`)
>>> mtr = MosesTruecaser()
>>> mtr.train('big.txt')
>>> mtr.save_model('big.truecasemodel')

# Truecase a string after training a model.
>>> mtr = MosesTruecaser()
>>> mtr.train('big.txt')
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
['the', 'adventures', 'of', 'Sherlock', 'Holmes']

# Load a model and truecase a string using the trained model.
>>> mtr = MosesTruecaser('big.truecasemodel')
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES")
['the', 'adventures', 'of', 'Sherlock', 'Holmes']
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", use_known=True)
['the', 'ADVENTURES', 'OF', 'SHERLOCK', 'HOLMES']
>>> mtr.truecase("THE ADVENTURES OF SHERLOCK HOLMES", return_str=True)
'the adventures of Sherlock Holmes'
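
Conceptually, truecasing restores each word's most likely casing from statistics gathered on training text. The following is a minimal sketch of that idea under a simplifying assumption: for each lowercased word, keep the surface form seen most often. train_casing and truecase_tokens are hypothetical names, and this is not a description of how MosesTruecaser works internally.

```python
from collections import Counter, defaultdict


def train_casing(tokenized_docs):
    """Count how often each surface form occurs per lowercased word."""
    counts = defaultdict(Counter)
    for tokens in tokenized_docs:
        for token in tokens:
            counts[token.lower()][token] += 1
    # Keep only the most frequent surface form of each word.
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def truecase_tokens(tokens, model):
    """Look up each token's preferred casing; unseen tokens are left as-is
    (an assumption of this sketch)."""
    return [model.get(token.lower(), token) for token in tokens]


docs = [
    ["the", "adventures", "of", "Sherlock", "Holmes"],
    ["Sherlock", "Holmes", "read", "the", "paper"],
]
model = train_casing(docs)
print(truecase_tokens("THE ADVENTURES OF SHERLOCK HOLMES".split(), model))
```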

Normalizer

>>> from sacremoses import MosesPunctNormalizer
>>> mpn = MosesPunctNormalizer()
>>> mpn.normalize('THIS EBOOK IS OTHERWISE PROVIDED TO YOU "AS-IS."')
'THIS EBOOK IS OTHERWISE PROVIDED TO YOU "AS-IS."'
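
To give a feel for what punctuation normalization means, here is a minimal sketch that maps typographic double quotes to the ASCII double quote. It assumes only this one family of replacements; QUOTE_TABLE and normalize_quotes are hypothetical names, and MosesPunctNormalizer applies a much larger rule set than this stand-in.

```python
# Hypothetical subset of replacements: typographic double quotes to ASCII.
QUOTE_TABLE = str.maketrans({"\u201c": '"', "\u201d": '"', "\u201e": '"'})


def normalize_quotes(text):
    """Map typographic double quotes to the ASCII double quote."""
    return text.translate(QUOTE_TABLE)


print(normalize_quotes("PROVIDED TO YOU \u201cAS-IS.\u201d"))
```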

Usage (CLI)

As of version 0.0.42, the CLI supports pipelining, so the global options must be set first, before any command is called:

- language
- processes
- encoding
- quiet

$ pip install -U "sacremoses>=0.1"

$ sacremoses --help
Usage: sacremoses [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

Options:
  -l, --language TEXT      Use language specific rules when tokenizing
  -j, --processes INTEGER  No. of processes.
  -e, --encoding TEXT      Specify encoding of file.
  -q, --quiet              Disable progress bar.
  --version                Show the version and exit.
  -h, --help               Show this message and exit.

Commands:
  detokenize
  detruecase
  normalize
  tokenize
  train-truecase
  truecase

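Pipeline

Example: chain the following commands.

normalize with the -c option, to remove control characters.
tokenize with the -a option, for aggressive dash splitting.
truecase with the -a option, to indicate that the model is for ASR.
- If big.truemodel exists, the -m option loads the model;
- otherwise, a model is trained and saved to the big.truemodel file given with -m.
Redirect the console output to the big.txt.norm.tok.true file.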

cat big.txt | sacremoses -l en -j 4 \
    normalize -c tokenize -a truecase -a -m big.truemodel \
    > big.txt.norm.tok.true
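
Because the global options must precede the chained commands, it can help to assemble the command line programmatically. This is a minimal sketch that uses only the flags shown above. build_pipeline is a hypothetical helper, not part of sacremoses, and the commented-out run step assumes sacremoses is installed and big.txt exists.

```python
import shlex


def build_pipeline(language, processes, steps):
    """Return the argv list for a chained sacremoses call.

    `steps` is a list of (command, [options]) pairs, in pipeline order.
    Global options come first, then each command with its own options.
    """
    argv = ["sacremoses", "-l", language, "-j", str(processes)]
    for command, options in steps:
        argv.append(command)
        argv.extend(options)
    return argv


argv = build_pipeline(
    "en",
    4,
    [
        ("normalize", ["-c"]),
        ("tokenize", ["-a"]),
        ("truecase", ["-a", "-m", "big.truemodel"]),
    ],
)
print(shlex.join(argv))
# To run it (assumes sacremoses is installed and big.txt exists):
#   import subprocess
#   with open("big.txt") as src, open("big.txt.norm.tok.true", "w") as dst:
#       subprocess.run(argv, stdin=src, stdout=dst, check=True)
```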

Tokenizer

$ sacremoses tokenize --help
Usage: sacremoses tokenize [OPTIONS]

Options:
  -a, --aggressive-dash-splits   Triggers dash split rules.
  -x, --xml-escape               Escape special characters for XML.
  -p, --protected-patterns TEXT  Specify file with patters to be protected in
                                 tokenisation.
  -c, --custom-nb-prefixes TEXT  Specify a custom non-breaking prefixes file,
                                 add prefixes to the default ones from the
                                 specified language.
  -h, --help                     Show this message and exit.


$ sacremoses -l en -j 4 tokenize < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 24363.39it/s]

$ wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/basic-protected-patterns
$ sacremoses -l en -j 4 tokenize -p basic-protected-patterns < big.txt > big.txt.tok
100%|██████████████████████████████████| 128457/128457 [00:05<00:00, 22183.94it/s]
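
The -p option names a file of patterns whose matches should survive tokenization unsplit. One common way to get that effect is to swap each match for a placeholder, tokenize, and then swap the originals back. The sketch below illustrates that idea only: the splitter is a toy stand-in for the Moses rules, the pattern is a hypothetical example rather than a line from basic-protected-patterns, and tokenize_with_protection is not a sacremoses function.

```python
import re


def tokenize_with_protection(text, protected_patterns):
    """Split punctuation from words, but keep spans matching any protected
    pattern intact. The splitter is a toy stand-in, not the Moses rules."""
    protected = []

    def stash(match):
        protected.append(match.group(0))
        # Placeholder made of letters and digits so the splitter leaves it alone.
        return " PROTECTED{:03d} ".format(len(protected) - 1)

    for pattern in protected_patterns:
        text = re.sub(pattern, stash, text)

    # Toy splitter: put spaces around every character that is neither
    # a word character nor whitespace.
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    tokens = text.split()

    # Swap the placeholders back for the original spans.
    return [
        protected[int(tok[len("PROTECTED"):])]
        if re.fullmatch(r"PROTECTED\d+", tok)
        else tok
        for tok in tokens
    ]


print(tokenize_with_protection("See <b>this</b>, ok?", [r"</?\S+?>"]))
```

Without the protected pattern, the same toy splitter would break <b> into three separate tokens.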

Detokenizer

$ sacremoses detokenize --help
Usage: sacremoses detokenize [OPTIONS]

Options:
  -x, --xml-unescape  Unescape special characters for XML.
  -h, --help          Show this message and exit.

$ sacremoses -l en -j 4 detokenize < big.txt.tok > big.txt.tok.detok
100%|██████████████████████████████████| 128457/128457 [00:16<00:00, 7931.26it/s]

Truecaser

$ sacremoses truecase --help
Usage: sacremoses truecase [OPTIONS]

Options:
  -m, --modelfile TEXT            Filename to save/load the modelfile.
                                  [required]
  -a, --is-asr                    A flag to indicate that model is for ASR.
  -p, --possibly-use-first-token  Use the first token as part of truecase
                                  training.
  -h, --help                      Show this message and exit.

$ sacremoses -j 4 truecase -m big.model < big.txt.tok > big.txt.tok.true
100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 14257.27it/s]

Detruecaser

$ sacremoses detruecase --help
Usage: sacremoses detruecase [OPTIONS]

Options:
  -j, --processes INTEGER  No. of processes.
  -a, --is-headline        Whether the file are headlines.
  -e, --encoding TEXT      Specify encoding of file.
  -h, --help               Show this message and exit.

$ sacremoses -j 4 detruecase  < big.txt.tok.true > big.txt.tok.true.detrue
100%|█████████████████████████████████| 128457/128457 [00:04<00:00, 26945.16it/s]

Normalizer

$ sacremoses normalize --help
Usage: sacremoses normalize [OPTIONS]

Options:
  -q, --normalize-quote-commas  Normalize quotations and commas.
  -d, --normalize-numbers       Normalize number.
  -p, --replace-unicode-puncts  Replace unicode punctuations BEFORE
                                normalization.
  -c, --remove-control-chars    Remove control characters AFTER normalization.
  -h, --help                    Show this message and exit.

$ sacremoses -j 4 normalize < big.txt > big.txt.norm
100%|██████████████████████████████████| 128457/128457 [00:09<00:00, 13096.23it/s]

Release history

Version	Date	Notes
0.1.1	2023-10-30	This version
0.1.0	2023-10-30
0.0.53	2022-05-03
0.0.52	2022-05-03	Yanked. Reason: broken.
0.0.51	2022-05-02	Yanked. Reason: overthinking.
0.0.50	2022-05-02	Yanked. Reason: overthinking.
0.0.49	2022-03-15
0.0.48	2022-03-15
0.0.47	2022-01-09
0.0.46	2021-09-25
0.0.45	2021-04-19
0.0.44	2021-04-03
0.0.43	2020-05-04
0.0.42	2020-05-04
0.0.41	2020-04-14
0.0.40	2020-04-13
0.0.39	2020-04-13
0.0.38	2020-01-06
0.0.35	2019-10-03
0.0.34	2019-09-20
0.0.33	2019-08-14
0.0.32	2019-08-14
0.0.31	2019-08-06
0.0.30	2019-08-06
0.0.29	2019-08-06
0.0.28	2019-08-06
0.0.27	2019-08-06
0.0.26	2019-08-06
0.0.25	2019-08-06
0.0.24	2019-07-29
0.0.22	2019-07-16
0.0.21	2019-07-16
0.0.20	2019-07-16
0.0.19	2019-04-12
0.0.18	2019-04-12
0.0.17	2019-04-12
0.0.16	2019-04-12
0.0.15	2019-04-12
0.0.14	2019-04-12
0.0.13	2019-03-19
0.0.12	2019-03-19
0.0.11	2019-03-19
0.0.10	2019-03-07
0.0.9	2019-03-06
0.0.8	2019-03-06
0.0.7	2019-01-14
0.0.5	2018-09-20
0.0.4	2018-08-07
0.0.3	2018-06-19
0.0.2	2018-04-24
0.0.1	2018-04-20
0.0.0	2018-04-20

Download files

Download the file for your platform.

Source distribution

sacremoses-0.1.1.tar.gz (883.2 kB), uploaded 2023-10-30, Source

Built distribution

sacremoses-0.1.1-py3-none-any.whl (897.5 kB), uploaded 2023-10-30, Python 3

Hashes for sacremoses-0.1.1.tar.gz

Algorithm	Hash digest
SHA256	`b6fd5d3a766b02154ed80b962ddca91e1fd25629c0978c7efba21ebccf663934`
MD5	`db513aea014345ad8e76295ba058159f`
BLAKE2b-256	`1d51fbdc4af4f6e85d26169e28be3763fe50ddfd0d4bf8b871422b0788dcc4d2`

Hashes for sacremoses-0.1.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`31e04c98b169bfd902144824d191825cd69220cdb4ae4bcf1ec58a7db5587b1a`
MD5	`c60f9116eca30734668c38ba1f09fb7f`
BLAKE2b-256	`0bf089ee2bc9da434bd78464f288fdb346bc2932f2ee80a90b2a4bbbac262c74`
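
A downloaded archive can be checked against the published SHA256 digest with the standard library. This is a minimal sketch: sha256_of and matches_published_hash are hypothetical helper names, and the expected value is the SHA256 digest listed above for sacremoses-0.1.1.tar.gz.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest published above for sacremoses-0.1.1.tar.gz.
EXPECTED_SHA256 = "b6fd5d3a766b02154ed80b962ddca91e1fd25629c0978c7efba21ebccf663934"


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected digest."""
    return sha256_of(path) == expected
```

Calling matches_published_hash("sacremoses-0.1.1.tar.gz") assumes the archive has been downloaded to the current directory. Reading in chunks keeps memory use flat regardless of file size.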
