Tools to download and clean Common Crawl

cc_net

Tools to download and clean Common Crawl, as introduced in our paper CCNet.

If you found these resources useful, please consider citing:

@article{wenzek2019ccnet,
  title={Ccnet: Extracting high quality monolingual datasets from web crawl data},
  author={Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzman, Francisco and Joulin, Armand and Grave, Edouard},
  journal={arXiv preprint arXiv:1911.00359},
  year={2019}
}

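Installation

We have only tried this on Linux, but installation should also be possible on MacOS.

Create a `data` folder, or a symlink to the folder where you want to download the corpus.
Run `make install`. This downloads some resources and installs the required packages.
If you have a C++ 17 compiler, you can also run `pip install .[getpy]`, which provides a more efficient hashset.
If `make install` fails, install the following tools manually:

`lmplz` and `build_binary` from KenLM
`spm_train` and `spm_encode` from Sentence Piece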
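
Before launching anything, it can help to check that the four external binaries listed above are reachable. The following is a minimal sketch, not part of cc_net: the helper name `missing_tools` is hypothetical, and it assumes the binaries are expected on `PATH`. If your installation places them in a project-local directory instead, adjust the lookup accordingly.

```python
import shutil

# Binaries named in the installation notes above.
REQUIRED_TOOLS = ["lmplz", "build_binary", "spm_train", "spm_encode"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the subset of `tools` that cannot be found on PATH.

    `which` is injectable so the check can be exercised without the
    real binaries being installed.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All external tools found on PATH.")
```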

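Training language models

The Makefile is used to train Sentence Piece and LM on Wikipedia data.

`make help` shows help
`make lang=de lm` trains a Sentence Piece model and a LM on German Wikipedia
`make all_lm` trains the same models as in the paper
`make lang=de dl_lm` downloads the LM trained for the paper
`make dl_all_lm` downloads all of the LMs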

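Pipeline overview

The full mining pipeline is divided into 3 steps:

`hashes` downloads one Common-Crawl snapshot and computes a hash for each paragraph
`mine` removes duplicates, detects the language, runs the LM, and splits the output by language/perplexity bucket
`regroup` regroups the files created by `mine` into chunks of 4 GB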
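
To make the `hashes` and `mine` steps more concrete, here is a small sketch of paragraph-level deduplication. It is an illustration under stated assumptions, not the cc_net implementation: the normalization rules, the truncation of SHA-1 to 8 bytes, and the policy of keeping the first occurrence are all assumptions, and the function names are hypothetical.

```python
import hashlib
import re


def normalize(line: str) -> str:
    """Lowercase, map digits to 0, drop punctuation (assumed normalization)."""
    line = line.strip().lower()
    line = re.sub(r"\d", "0", line)
    line = re.sub(r"[^\w\s]", "", line)
    return re.sub(r"\s+", " ", line).strip()


def paragraph_hash(line: str) -> bytes:
    """First 8 bytes of the SHA-1 of the normalized line (assumed truncation)."""
    return hashlib.sha1(normalize(line).encode("utf-8")).digest()[:8]


def deduplicate(documents):
    """Yield each document with paragraphs seen in any earlier position removed."""
    seen = set()
    for doc in documents:
        kept = []
        for line in doc.split("\n"):
            h = paragraph_hash(line)
            if h in seen:
                continue  # duplicate paragraph: drop it
            seen.add(h)
            kept.append(line)
        yield "\n".join(kept)


docs = [
    "Cookie policy\nFirst article body.",
    "cookie policy!\nSecond article body.",
]
print(list(deduplicate(docs)))
```

Hashing normalized paragraphs rather than raw text lets near-identical boilerplate (different casing, punctuation, or numbers) collapse to the same key, and storing short fixed-size hashes keeps the in-memory set small.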

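Each step needs the previous step to finish before it can start. You can launch the full pipeline using `python -m cc_net`.

`python -m cc_net --help` shows help
`python -m cc_net --dump 2019-13` processes a specific snapshot
`python -m cc_net -l my -l gu` restricts processing to specific languages
`python -m cc_net --lm_dir my_lms/` uses custom language models
`python -m cc_net --lang_threshold 0.3` sets a specific field in `mine.Config`
`python -m cc_net --config test` runs on a small part of a snapshot
`python -m cc_net --config config/my_config.json` uses the configuration from the given config file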

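Reproducing our work

Since running the full pipeline requires a lot of CPU resources, we share a mapping from URL to the information we computed. You can reconstruct the corpus used in the paper with: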

python -m cc_net --conf reproduce --dump 2019-09

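Extracting XLM-R data

The XLM-RoBERTa model from the paper Unsupervised Cross-lingual Representation Learning at Scale was trained on data extracted by an internal version of cc_net.

Because the format is slightly different, please use the following commands instead: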

python cc_net/tools/dl_cc_100.py --help
python cc_net/tools/dl_cc_100.py --outdir data_cc100 --process 8

If you use this version of the data, please consider citing:

@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

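Adapting to your infrastructure

Since the full pipeline is computationally expensive, we distributed the computation on a Slurm cluster using submitit. If no Slurm cluster is found, submitit defaults to spawning processes on your machine. You should tweak `--task_parallelism` to fit your machine. The defaults are 512 for mining and 20 for reproducing.

To run the tasks in-process, use `--execution debug`.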
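
As an illustration of tuning these options on a single machine, the sketch below assembles a command line from the flags mentioned in this README. The helper `build_command` is hypothetical, and using one task per CPU core is only an assumed starting point: memory, not core count, may be the real limit on your machine.

```python
import os
import sys


def build_command(dump, task_parallelism=None, in_process=False):
    """Assemble a cc_net command line using flags named in this README."""
    cmd = [sys.executable, "-m", "cc_net", "--dump", dump]
    if in_process:
        # Run every task in the current process, e.g. for debugging.
        cmd += ["--execution", "debug"]
    else:
        # Assumed heuristic: default to one task per available CPU core.
        workers = task_parallelism or os.cpu_count() or 1
        cmd += ["--task_parallelism", str(workers)]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_command("2019-09")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once cc_net is installed.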

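Output format

The generated files are compressed JSON files, with one JSON object per line.

List of fields:

url: webpage URL (part of CC)
date_download: date of download (part of CC)
digest: sha1 digest of the webpage (part of CC)
length: number of characters
nlines: number of lines
source_domain: web domain of the webpage
title: page title (part of CC)
raw_content: webpage content after deduplication
original_nlines: number of lines before deduplication
original_length: number of characters before deduplication
language: language detected by FastText LID
language_score: language score
perplexity: perplexity of a LM trained on Wikipedia

Sample JSON object: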

{
  "url": "http://www.pikespeakhospice.org/members/1420",
  "date_download": "2019-02-15T18:40:25Z",
  "digest": "sha1:VQW3KXUOALO543IJGTK2JLVEAN2XXKHI",
  "length": 752,
  "nlines": 5,
  "source_domain": "www.pikespeakhospice.org",
  "title": "LeeRoy Aragon",
  "raw_content": "Date Honored: March 2017\nHe was a man of integrity, a hard worker, and a dedicated family man. He loved spending time with family camping, fishing, hunting, boating and just hanging out.\nHis Catholic faith was extremely important to him as he gave of his time and talents to the community. He had many friends through church and the Knights of Columbus. He was a meticulous handyman, and enjoyed building and fixing things and restoring antique furniture to perfection. He was a fan and supported his Colorado Rockies and Denver Broncos. Throughout the years he had devoted four-legged friends (his dogs and a horse named Sunny Boy).\nWe have many cherished memories of him that we will treasure until we are with him again.\n~ Family of LeeRoy F. Aragon",
  "original_nlines": 7,
  "original_length": 754,
  "language": "en",
  "language_score": 0.99,
  "perplexity": 255.11
}

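You can peek at these files using the UNIX tools zcat and jq, for example: `zcat data/mined/2019-09/en_head_0000.json.gz | head -1 | jq .`

jq can do some complicated filtering. jsonql.py provides a Python API with multiprocess support for more complicated operations, such as LM scoring of the documents.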
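
For simple filtering in Python without either tool, the standard library is enough. The sketch below reads one output file and keeps documents by language, language score, and perplexity. It does not use jsonql.py; the function names and the threshold values are hypothetical and chosen only for illustration.

```python
import gzip
import json


def read_documents(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def select(docs, language="en", min_language_score=0.8, max_perplexity=300.0):
    """Keep documents in `language` that pass both (illustrative) thresholds."""
    for doc in docs:
        if (
            doc["language"] == language
            and doc["language_score"] >= min_language_score
            and doc["perplexity"] <= max_perplexity
        ):
            yield doc


if __name__ == "__main__":
    path = "data/mined/2019-09/en_head_0000.json.gz"
    for doc in select(read_documents(path)):
        print(doc["url"], doc["perplexity"])
```

Both functions are generators, so a 4 GB chunk is streamed line by line instead of being loaded into memory.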

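License

By contributing to cc_net, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree.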

Project details

Release history

1.0.0 (this version): November 2, 2020
0.0.0: October 30, 2019

Source distribution

cc_net-1.0.0.tar.gz (81.3 kB), uploaded November 2, 2020

Hashes for cc_net-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`60131c23498bfa1428b4c6d311cceb60e8f298d5d4d1900eca4f8402543d75f3`
MD5	`7591c4493e2127f2971dee5fe32830c4`
BLAKE2b-256	`bb46ca08b95f9164a01a01448cd41524c9b21c34b1b79b1cacb92ad7a14be608`
