A text augmentation library for natural language processing applications.

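Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Operating System
- OS Independent
Programming Language
- Python :: 3
- Python :: Implementation :: PyPy
Topic
- Text Processing :: Linguistic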

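Project description

TextAugment: Improving Short Text Classification through Global Augmentation Methods

You have found TextAugment.

TextAugment is a Python 3 library for augmenting text in natural language processing applications. It stands on the giant shoulders of NLTK, Gensim v3.x, and TextBlob, and plays nicely with them.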

Fasttext/Word2vec-based augmentation

Basic example

>>> from textaugment import Word2vec, Fasttext
>>> # model is either a path to a saved Gensim model or an already loaded Gensim model object
>>> t = Word2vec(model='path/to/gensim/model')
>>> t.augment('The stories are good')
The films are good
>>> t = Fasttext(model='path/to/gensim/model')
>>> t.augment('The stories are good')
The films are good

Advanced example

>>> runs = 1 # Number of times to augment a sentence. Default is 1.
>>> v = False # Verbose mode: replace all the words. If enabled, runs has no effect. Used in this paper (https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf)
>>> p = 0.5 # Probability of success of an individual trial (0.1 < p < 1.0). Default is 0.5. Used by the geometric distribution to select words from a sentence.

>>> # model is either a path to a saved Gensim model or an already loaded Gensim model object
>>> word = Word2vec(model='path/to/gensim/model', runs=5, v=False, p=0.5)
>>> word.augment('The stories are good', top_n=10)
The movies are excellent
>>> fast = Fasttext(model='path/to/gensim/model', runs=5, v=False, p=0.5)
>>> fast.augment('The stories are good', top_n=10)
The movies are excellent
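
The p parameter is described as the success probability of a geometric distribution used to select words. As a rough illustration of that idea only, the sketch below draws a geometric sample and uses it as the number of word positions to pick. The helper names `sample_geometric` and `pick_word_indices` are hypothetical, and how TextAugment actually maps the sample onto words is an assumption here, not taken from its source.

```python
import random


def sample_geometric(p, rng):
    """Return the number of Bernoulli(p) trials up to and including the first success."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    trials = 1
    while rng.random() >= p:  # failure with probability 1 - p
        trials += 1
    return trials


def pick_word_indices(words, p, rng):
    """Pick word positions; the count is a geometric sample capped at the sentence length.

    Hypothetical helper: an assumption about how a geometric sample could drive selection.
    """
    count = min(sample_geometric(p, rng), len(words))
    return sorted(rng.sample(range(len(words)), count))


rng = random.Random(0)
words = "The stories are good".split()
print(pick_word_indices(words, p=0.5, rng=rng))
```

Under this reading, a larger p makes the first success arrive sooner, so fewer words tend to be selected per run.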

WordNet-based augmentation

Basic example

>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, John is walking to town

Advanced example

>>> v = True # Enable verb augmentation. Default is True.
>>> n = False # Enable noun augmentation. Default is False.
>>> runs = 1 # Number of times to augment a sentence. Default is 1.
>>> p = 0.5 # Probability of success of an individual trial (0.1 < p < 1.0). Default is 0.5. Used by the geometric distribution to select words from a sentence.

>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town', top_n=10)
In the afternoon, Joseph is going to town.

RTT-based augmentation

Example

>>> src = "en" # source language of the sentence
>>> to = "fr" # target language
>>> from textaugment import Translate
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town

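EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

This is the implementation of EDA by Jason Wei and Kai Zou.

https://www.aclweb.org/anthology/D19-1670.pdf

See this notebook for an example.

Synonym Replacement

Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms, chosen at random.

Basic example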

>>> from textaugment import EDA
>>> t = EDA()
>>> t.synonym_replacement("John is going to town", top_n=10)
John is give out to town
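
To make the selection and replacement steps concrete, here is a minimal standalone sketch of synonym replacement. It is not the library's code: `STOP_WORDS` and `TOY_SYNONYMS` are tiny hypothetical stand-ins for a real stop-word list and for WordNet, and `synonym_replacement_sketch` is an illustrative name.

```python
import random

# Hypothetical toy resources standing in for a stop-word list and WordNet.
STOP_WORDS = {"is", "to", "the", "a", "in"}
TOY_SYNONYMS = {
    "going": ["travelling", "heading"],
    "town": ["city", "village"],
}


def synonym_replacement_sketch(sentence, n=1, rng=None):
    """Replace up to n non-stop-words that have a known synonym."""
    rng = rng or random.Random()
    words = sentence.split()
    # Positions of words that are not stop words and have at least one synonym.
    candidates = [
        i for i, w in enumerate(words)
        if w.lower() not in STOP_WORDS and w.lower() in TOY_SYNONYMS
    ]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(TOY_SYNONYMS[words[i].lower()])
    return " ".join(words)


print(synonym_replacement_sketch("John is going to town", n=2, rng=random.Random(0)))
```

Words without an entry in the toy dictionary, such as "John", are left untouched in this sketch.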

Random Deletion

Randomly remove each word in the sentence with probability p.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_deletion("John is going to town", p=0.2)
is going to town
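
The rule above can be sketched in a few lines of plain Python. This is an illustrative version, not the library's implementation; the name `random_deletion_sketch` and the fallback of keeping one random word when every word is dropped are assumptions.

```python
import random


def random_deletion_sketch(sentence, p=0.2, rng=None):
    """Drop each word independently with probability p."""
    rng = rng or random.Random()
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    # Keep a word when its draw falls outside the deletion probability.
    kept = [w for w in words if rng.random() >= p]
    if not kept:
        # Assumed fallback: never return an empty sentence.
        kept = [rng.choice(words)]
    return " ".join(kept)


print(random_deletion_sketch("John is going to town", p=0.2, rng=random.Random(1)))
```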

Random Swap

Randomly choose two words in the sentence and swap their positions. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_swap("John is going to town")
John town going to is
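
A minimal standalone sketch of the swap step follows. It is an illustration under the description above rather than the library's code, and `random_swap_sketch` is a hypothetical name.

```python
import random


def random_swap_sketch(sentence, n=1, rng=None):
    """Swap two randomly chosen word positions, repeated n times."""
    rng = rng or random.Random()
    words = sentence.split()
    if len(words) < 2:
        return sentence
    for _ in range(n):
        # Two distinct positions, so each iteration performs a real swap.
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


print(random_swap_sketch("John is going to town", n=1, rng=random.Random(2)))
```

The output always contains the same words as the input; only their order changes.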

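Random Insertion

Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.

Basic example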

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_insertion("John is going to town")
John is going to make up town
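
The insertion step can be sketched the same way. Again this is not the library's code: `STOP_WORDS` and `TOY_SYNONYMS` are hypothetical toy stand-ins for a stop-word list and WordNet, and `random_insertion_sketch` is an illustrative name.

```python
import random

# Hypothetical toy resources standing in for a stop-word list and WordNet.
STOP_WORDS = {"is", "to", "the", "a", "in"}
TOY_SYNONYMS = {
    "going": ["travelling", "heading"],
    "town": ["city", "village"],
}


def random_insertion_sketch(sentence, n=1, rng=None):
    """Insert a synonym of a random non-stop-word at a random position, n times."""
    rng = rng or random.Random()
    words = sentence.split()
    for _ in range(n):
        candidates = [
            w for w in words
            if w.lower() not in STOP_WORDS and w.lower() in TOY_SYNONYMS
        ]
        if not candidates:
            break  # nothing in the sentence has a known synonym
        source = rng.choice(candidates)
        synonym = rng.choice(TOY_SYNONYMS[source.lower()])
        # randint is inclusive, so the synonym may also land at the very end.
        words.insert(rng.randint(0, len(words)), synonym)
    return " ".join(words)


print(random_insertion_sketch("John is going to town", n=2, rng=random.Random(5)))
```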

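AEDA: An Easier Data Augmentation Technique for Text Classification

This is the implementation of AEDA by Karimi et al., a variant of EDA. It is based on the random insertion of punctuation marks.

https://aclanthology.org/2021.findings-emnlp.234.pdf

Implementation

See this notebook for an example.

Random Insertion of Punctuation Marks

Basic example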

>>> from textaugment import AEDA
>>> t = AEDA()
>>> t.punct_insertion("John is going to town")
! John is going to town
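
For intuition, punctuation insertion can be sketched as below. The punctuation set and the choice of inserting between one mark and roughly a third of the sentence length follow my reading of the AEDA paper and should be treated as assumptions; `punct_insertion_sketch` is a hypothetical name and not the library's code.

```python
import random

# Assumed punctuation set, following the AEDA paper as I understand it.
PUNCTUATION = [".", ";", "?", ":", "!", ","]


def punct_insertion_sketch(sentence, rng=None):
    """Insert between 1 and len(words) // 3 punctuation marks at random positions."""
    rng = rng or random.Random()
    words = sentence.split()
    max_marks = max(1, len(words) // 3)
    count = rng.randint(1, max_marks)
    for _ in range(count):
        # randint is inclusive, so a mark may land before the first or after the last word.
        words.insert(rng.randint(0, len(words)), rng.choice(PUNCTUATION))
    return " ".join(words)


print(punct_insertion_sketch("John is going to town", rng=random.Random(3)))
```

Because no word is replaced or removed, the original tokens and their order are preserved.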

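Mixup augmentation

This is the implementation of mixup augmentation by Hongyi Zhang, Moustapha Cisse, Yann Dauphin, and David Lopez-Paz, adapted to NLP.

Used in Augmenting Data with Mixup for Sentence Classification: An Empirical Study.

Mixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularises the neural network to favour simple linear behaviour in between training examples.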
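
The convex combination can be written out directly. The sketch below mixes one pair of feature vectors and their one-hot labels with a weight drawn from a Beta(alpha, alpha) distribution, as in the original mixup formulation. It uses plain Python lists for self-containment; `mixup_pair` is a hypothetical helper and not TextAugment's interface.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha=1.0, rng=None):
    """Return a convex combination of two examples and of their labels."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


# Two toy examples with one-hot labels for a two-class problem.
x, y, lam = mixup_pair([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
                       alpha=1.0, rng=random.Random(0))
print(lam, x, y)
```

For text, the vectors being mixed would typically be word or sentence embeddings rather than raw tokens, since tokens cannot be interpolated.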

Implementation

See this notebook for an example.

Built with ❤ on Python

Authors

Acknowledgements

Cite this paper when using this library. arXiv version

@inproceedings{marivate2020improving,
  title={Improving short text classification through global augmentation methods},
  author={Marivate, Vukosi and Sefara, Tshephisho},
  booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
  pages={385--399},
  year={2020},
  organization={Springer}
}

Licence

MIT licensed. See the bundled LICENCE file for more details.


Release history

- 2.0.0: Nov 16, 2023 (this version)
- 1.3.4: Nov 5, 2020
- 1.3.3: Oct 21, 2020
- 1.3.2: Jun 10, 2020
- 1.3.1: May 29, 2020
- 1.3: May 28, 2020
- 1.2: May 23, 2020
- 1.1: Jul 15, 2019
- 1.0: Jul 15, 2019

Download files

Download the file for your platform.

Source distribution

textaugment-2.0.0.tar.gz (20.4 kB), uploaded Nov 16, 2023, source

Built distribution

textaugment-2.0.0-py3-none-any.whl (19.3 kB), uploaded Nov 16, 2023, Python 3

Hashes for textaugment-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`1964518a1a27e53919ea3199801a01f956441f028c787af69c45c799b9520c83`
MD5	`3bb2fb0c3ee77789ae9bbb0c068e9b4b`
BLAKE2b-256	`1406c8655c49032f0ffe351bbd0a3bc20f80674beaa6867510af04a879bff098`

Hashes for textaugment-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`823ea30743711375d6ae95ca81eae5e350e3acc13641d60d368f94283fc22182`
MD5	`fa447288daf066c075ee7f41840af025`
BLAKE2b-256	`fdfda2a4b69a8fe92e5c79783a408cc393a24994cf425b39ac01cd41453fedc7`
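
A downloaded file can be checked against the SHA256 digests listed above with Python's standard hashlib module. This is a small sketch: the digests are copied from the tables above, while the helper names and the assumption that the files sit in the current directory under their published names are mine.

```python
import hashlib

# SHA256 digests as published for the 2.0.0 release files.
EXPECTED_SHA256 = {
    "textaugment-2.0.0.tar.gz":
        "1964518a1a27e53919ea3199801a01f956441f028c787af69c45c799b9520c83",
    "textaugment-2.0.0-py3-none-any.whl":
        "823ea30743711375d6ae95ca81eae5e350e3acc13641d60d368f94283fc22182",
}


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, filename):
    """Compare a local file's digest with the published one for that file name."""
    return sha256_of_file(path) == EXPECTED_SHA256[filename]

# Example (assumes the file was downloaded to the current directory):
# matches_published_digest("textaugment-2.0.0.tar.gz", "textaugment-2.0.0.tar.gz")
```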
