fasttext

fasttext Python bindings

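fastText is a library for efficient learning of word representations and sentence classification.

This document describes how to use fastText in Python.

Requirements

fastText builds on modern macOS and Linux distributions. Because it uses C++11 features, it requires a compiler with good C++11 support. You will need Python (version 2.7 or ≥ 3.4), NumPy and SciPy, and pybind11.

Installation

To install the latest release, run: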

$ pip install fasttext

Or, to get the latest development version of fastText, install from our GitHub repository:

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ sudo pip install .
$ # or :
$ sudo python setup.py install

Usage overview

Word representation model

To learn word vectors, we can use the fasttext.train_unsupervised function as follows:

import fasttext

# Skipgram model :
model = fasttext.train_unsupervised('data.txt', model='skipgram')

# or, cbow model :
model = fasttext.train_unsupervised('data.txt', model='cbow')

where data.txt is a training file containing UTF-8 encoded text.

The returned model object represents your learned model, and you can use it to retrieve information.

print(model.words)   # list of words in dictionary
print(model['king']) # get the vector of the word 'king'

Saving and loading a model object

You can save your trained model object by calling the function save_model.

model.save_model("model_filename.bin")

and retrieve it later with the function load_model:

model = fasttext.load_model("model_filename.bin")

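For more information about word representation usage in fastText, see our word representations tutorial.

Text classification model

To train a text classifier, we can use the fasttext.train_supervised function as follows: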

import fasttext

model = fasttext.train_supervised('data.train.txt')

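where data.train.txt is a text file containing one training sentence per line along with its labels. By default, we assume that labels are words prefixed with the string __label__.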
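The line format described above can be sketched as follows. This is a minimal illustration, not part of the fastText API: the helper name to_fasttext_line and the example labels are hypothetical, and it assumes that several labels may appear on one line, each carrying the prefix.

```python
def to_fasttext_line(labels, text, prefix="__label__"):
    """Format one training example as a single line (hypothetical helper)."""
    # Collapse all whitespace, including newlines, so the example stays on one line.
    clean_text = " ".join(text.split())
    tagged = " ".join(prefix + label for label in labels)
    return (tagged + " " + clean_text).strip()


# Hypothetical examples; the labels are made up for illustration.
examples = [
    (["baking"], "Which baking dish is best to bake a banana bread ?"),
    (["equipment", "cleaning"], "Why not put knives\nin the dishwasher?"),
]
lines = [to_fasttext_line(labels, text) for labels, text in examples]
print("\n".join(lines))
```

Writing these lines to a UTF-8 file, one per line, would give a file in the shape that train_supervised expects.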

Once the model is trained, we can retrieve the list of words and labels:

print(model.words)
print(model.labels)

To evaluate the model by computing the precision at 1 (P@1) and the recall on a test set, we use the test function:

def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*model.test('test.txt'))
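To make the two numbers concrete, here is a small sketch of the usual definitions of precision and recall at k over a labelled set. It is an assumption that model.test follows these definitions; the function and the toy data below are hypothetical and do not call fastText.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """gold: list of sets of true labels; predicted: list of ranked label lists."""
    n_correct = n_predicted = n_gold = 0
    for true_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        # Count how many of the top-k predictions are among the true labels.
        n_correct += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    return precision, recall


# Toy data: three examples, the second one has two true labels.
gold = [{"__label__a"}, {"__label__a", "__label__b"}, {"__label__c"}]
predicted = [["__label__a"], ["__label__b"], ["__label__a"]]
p, r = precision_recall_at_k(gold, predicted, k=1)
print("P@1 = {:.3f}, R@1 = {:.3f}".format(p, r))
```

With one prediction per example, precision divides by the number of examples, while recall divides by the total number of true labels, which is why the two can differ on multi-label data.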

We can also predict labels for a specific text:

model.predict("Which baking dish is best to bake a banana bread ?")

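By default, predict returns only one label: the one with the highest probability. You can also predict more than one label by specifying the parameter k: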

model.predict("Which baking dish is best to bake a banana bread ?", k=3)

If you want to predict labels for more than one sentence, you can pass an array of strings:

model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)
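The following sketch shows one way to post-process a prediction for a single string. It assumes that predict returns a pair of sequences (labels, probabilities) of equal length; the sample_result value is a hand-written stand-in, not real model output, and pair_predictions is a hypothetical helper.

```python
def pair_predictions(result, prefix="__label__"):
    """Turn a (labels, probabilities) pair into a list of (label, probability) tuples."""
    labels, probs = result
    pairs = []
    for label, prob in zip(labels, probs):
        # Drop the label prefix when present so the output is easier to read.
        name = label[len(prefix):] if label.startswith(prefix) else label
        pairs.append((name, float(prob)))
    return pairs


# Hypothetical stand-in for the value of model.predict(text, k=3).
sample_result = (
    ("__label__baking", "__label__bread", "__label__equipment"),
    (0.71, 0.18, 0.05),
)
pairs = pair_predictions(sample_result)
for name, prob in pairs:
    print("{}\t{:.2f}".format(name, prob))
```

When a list of strings is passed to predict, the same helper could be applied to each per-sentence pair in turn.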

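Of course, you can also save a model to a file and load it back, as in the word representation usage.

For more information about text classification with fastText, see our text classification tutorial.

Compress model files with quantization

When you save a supervised model, fastText can compress it to produce a much smaller model file at the cost of a small loss in performance.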

# with the previously trained `model` object, call :
model.quantize(input='data.train.txt', retrain=True)

# then display results and save the new model
# (valid_data is assumed to hold the path to a validation file) :
print_results(*model.test(valid_data))
model.save_model("model_filename.ftz")

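model_filename.ftz will be much smaller than model_filename.bin.

For further reading on quantization, see the relevant paragraph of our blog post.

IMPORTANT: Preprocessing data / encoding conventions

In general, it is important to preprocess your data properly. In particular, the example scripts in the root folder of the repository do this.

fastText assumes UTF-8 encoded text. All text must be unicode in Python 2 and str in Python 3. The passed text is encoded as UTF-8 by pybind11 before it is passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using iconv.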
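As an alternative to iconv, the conversion can be sketched in Python 3 with the standard library only. The helper name convert_to_utf8, the file names, and the Latin-1 source encoding are assumptions chosen for illustration; you need to know the real encoding of your own data.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file to UTF-8, leaving line endings untouched."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demo on a temporary Latin-1 file (hypothetical content).
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "latin1.txt")
dst_file = os.path.join(workdir, "utf8.txt")
with open(src_file, "wb") as handle:
    handle.write("café au lait\n".encode("latin-1"))

convert_to_utf8(src_file, dst_file, "latin-1")
with open(dst_file, "rb") as handle:
    converted = handle.read()
```

The converted file can then be used as the training input.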

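fastText tokenizes (splits text into pieces) on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise users to convert UTF-8 whitespace and word boundaries into one of the following symbols, as appropriate.

space
tab
vertical tab
carriage return
formfeed
the null character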
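One possible way to follow this advice is to map every non-ASCII whitespace character to a plain space before training. The sketch below relies on Python's str.isspace to decide what counts as whitespace, which is an assumption about what you want to treat as a word boundary; normalize_whitespace is a hypothetical helper, and newlines are kept because they delimit lines.

```python
def normalize_whitespace(text):
    """Replace every whitespace character except newline with an ASCII space."""
    out = []
    for ch in text:
        if ch != "\n" and ch.isspace():
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)


# U+00A0 (no-break space), U+3000 (ideographic space), U+2003 (em space).
sample = "banana\u00a0bread\u3000recipe\nnext\u2003line"
normalized = normalize_whitespace(sample)
print(normalized)
```

Characters that Python does not classify as whitespace, such as the zero-width space U+200B, are left as they are and would need separate handling.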

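The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text when a newline character is encountered. The only exception is when the number of tokens exceeds the MAX_LINE_SIZE constant defined in the Dictionary header. This means that text not separated by newlines, such as the fil9 dataset, is broken into chunks of MAX_LINE_SIZE tokens, and the EOS token is not appended.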
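The behavior described above can be approximated in a few lines of Python. This is a simplified sketch, not the library's implementation: the EOS string "</s>" and the default of 1024 tokens are assumptions made for illustration (the actual values are defined in the Dictionary header), and iter_token_lines is a hypothetical name.

```python
import re

EOS = "</s>"  # assumed EOS token, for illustration only
SEPARATORS = re.compile(r"[ \t\v\r\f\x00]+")  # the six ASCII separators


def iter_token_lines(text, max_line_size=1024):
    """Yield token lists: EOS is appended at a newline, not at a size-based cut."""
    current = []
    pieces = text.split("\n")
    for index, piece in enumerate(pieces):
        for token in SEPARATORS.split(piece):
            if not token:
                continue
            current.append(token)
            if len(current) >= max_line_size:
                # Chunk boundary without a newline: no EOS is appended.
                yield current
                current = []
        if index < len(pieces) - 1:
            # A newline followed this piece, so the line ends with EOS.
            current.append(EOS)
            yield current
            current = []
    if current:
        yield current


chunks = list(iter_token_lines("a b\nc d e f g", max_line_size=3))
print(chunks)
```

In this toy run the first line ends with the EOS marker, while the long second line is cut after three tokens and its pieces carry no EOS.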

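The length of a token is its number of UTF-8 characters, determined by inspecting the leading two bits of each byte to identify the continuation bytes of a multi-byte sequence. Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the Dictionary header) is considered a single character and is not broken into subwords.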
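The sketch below illustrates why byte length and character length differ, and how minn and maxn bound the character n-grams of a word. Counting bytes whose top two bits are not 10 follows the description above; wrapping the word in "<" and ">" boundary markers is an assumption about the library's behavior, and both helper names are hypothetical.

```python
def utf8_char_count(token):
    """Count characters by skipping UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    data = token.encode("utf-8")
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def char_ngrams(word, minn=3, maxn=6):
    """List character n-grams of the boundary-wrapped word (assumed markers < and >)."""
    wrapped = "<" + word + ">"
    grams = []
    for start in range(len(wrapped)):
        # Python str is indexed by code point, which matches the character count above.
        for n in range(max(minn, 1), maxn + 1):
            if start + n <= len(wrapped):
                grams.append(wrapped[start:start + n])
    return grams


print(len("café".encode("utf-8")), utf8_char_count("café"))
print(char_ngrams("king", minn=3, maxn=3))
```

A word with accented or non-Latin characters therefore yields n-grams measured in characters, not bytes, which matters when picking minn and maxn for such languages.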

API

train_unsupervised parameters

input             # training file path (required)
model             # unsupervised fasttext model {cbow, skipgram} [skipgram]
lr                # learning rate [0.05]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [5]
minn              # min length of char ngram [3]
maxn              # max length of char ngram [6]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [ns]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
verbose           # verbose [2]

train_supervised parameters

input             # training file path (required)
lr                # learning rate [0.1]
dim               # size of word vectors [100]
ws                # size of the context window [5]
epoch             # number of epochs [5]
minCount          # minimal number of word occurrences [1]
minCountLabel     # minimal number of label occurrences [1]
minn              # min length of char ngram [0]
maxn              # max length of char ngram [0]
neg               # number of negatives sampled [5]
wordNgrams        # max length of word ngram [1]
loss              # loss function {ns, hs, softmax, ova} [softmax]
bucket            # number of buckets [2000000]
thread            # number of threads [number of cpus]
lrUpdateRate      # change the rate of updates for the learning rate [100]
t                 # sampling threshold [0.0001]
label             # label prefix ['__label__']
verbose           # verbose [2]
pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
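Because these parameters are passed by keyword, a misspelled name is easy to introduce. The sketch below checks keyword names against the list above before training. The helpers check_params and train_classifier are hypothetical and not part of fastText; the import is done lazily so that the check itself runs without the package installed.

```python
SUPERVISED_PARAMS = {
    "lr", "dim", "ws", "epoch", "minCount", "minCountLabel", "minn", "maxn",
    "neg", "wordNgrams", "loss", "bucket", "thread", "lrUpdateRate", "t",
    "label", "verbose", "pretrainedVectors",
}


def check_params(**params):
    """Return params unchanged, or raise if a name is not in the list above."""
    unknown = sorted(set(params) - SUPERVISED_PARAMS)
    if unknown:
        raise ValueError("unknown train_supervised parameter(s): " + ", ".join(unknown))
    return params


def train_classifier(train_path, **params):
    """Hypothetical wrapper: validate names, then call fasttext.train_supervised."""
    import fasttext  # lazy import, only needed when training actually runs

    return fasttext.train_supervised(input=train_path, **check_params(**params))


checked = check_params(lr=0.5, epoch=25, wordNgrams=2)
print(checked)
```

A call such as train_classifier('data.train.txt', lr=0.5, epoch=25, wordNgrams=2) would then train with those overrides and leave the remaining parameters at their defaults.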

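model object

The train_supervised, train_unsupervised and load_model functions return an instance of the _FastText class, which we generally call the model object.

This object exposes the training arguments as properties: lr, dim, ws, epoch, minCount, minCountLabel, minn, maxn, neg, wordNgrams, loss, bucket, thread, lrUpdateRate, t, label, verbose, pretrainedVectors. So model.wordNgrams gives the maximum length of word n-grams used to train this model.

In addition, the object exposes several functions: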

get_dimension           # Get the dimension (size) of a lookup vector (hidden layer).
                        # This is equivalent to `dim` property.
get_input_vector        # Given an index, get the corresponding vector of the Input Matrix.
get_input_matrix        # Get a copy of the full input matrix of a Model.
get_labels              # Get the entire list of labels of the dictionary
                        # This is equivalent to `labels` property.
get_line                # Split a line of text into words and labels.
get_output_matrix       # Get a copy of the full output matrix of a Model.
get_sentence_vector     # Given a string, get a single vector representation. This function
                        # assumes to be given a single line of text. We split words on
                        # whitespace (space, newline, tab, vertical tab) and the control
                        # characters carriage return, formfeed and the null character.
get_subword_id          # Given a subword, return the index (within input matrix) it hashes to.
get_subwords            # Given a word, get the subwords and their indices.
get_word_id             # Given a word, get the word id within the dictionary.
get_word_vector         # Get the vector representation of word.
get_words               # Get the entire list of words of the dictionary
                        # This is equivalent to `words` property.
is_quantized            # whether the model has been quantized
predict                 # Given a string, get a list of labels and a list of corresponding probabilities.
quantize                # Quantize the model, reducing the size of the model and its memory footprint.
save_model              # Save the model to the given path
test                    # Evaluate supervised model using file given by path
test_label              # Return the precision and recall score for each label.

The properties words and labels return the words and labels from the dictionary:

model.words         # equivalent to model.get_words()
model.labels        # equivalent to model.get_labels()

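The object overrides the __getitem__ and __contains__ functions in order to return the representation of a word and to check whether a word is in the vocabulary.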

model['king']       # equivalent to model.get_word_vector('king')
'king' in model     # equivalent to `'king' in model.get_words()`
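A common use of the vectors returned by model['king'] is to compare two words. The sketch below computes cosine similarity in plain Python; the two short vectors are made-up stand-ins for real word vectors, and cosine_similarity is a hypothetical helper, not a fastText function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up stand-ins for model['king'] and model['queen'].
king = [0.2, 0.1, 0.4]
queen = [0.4, 0.2, 0.8]
similarity = cosine_similarity(king, queen)
print("{:.3f}".format(similarity))
```

With a trained model, the same function could be called on model['king'] and model['queen'] after guarding each lookup with a membership test such as 'king' in model.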

Join the fastText community

Release history

Version	Release date
0.9.3	June 12, 2024 (this version)
0.9.2	April 28, 2020
0.9.1	June 27, 2019
0.8.4	June 24, 2019
0.8.3	February 14, 2017
0.8.2	December 21, 2016
0.8.1	November 10, 2016
0.8.0	October 5, 2016
0.7.6	September 6, 2016
0.7.5	September 1, 2016
0.7.4	September 1, 2016
0.7.3	September 1, 2016
0.7.2	August 24, 2016
0.7.1	August 22, 2016
0.7.0	August 22, 2016
0.6.4	August 20, 2016
0.6.3	August 19, 2016
0.6.2	August 15, 2016
0.6.1	August 15, 2016
0.6.0	August 14, 2016
0.5.19	August 12, 2016
0.5.18	August 12, 2016
0.5.17	August 10, 2016
0.5.16	August 10, 2016
0.5.15	August 10, 2016
0.5.14	August 10, 2016
0.5.13	August 10, 2016
0.5.12	August 10, 2016
0.5.1	August 10, 2016
0.5.0	August 10, 2016
0.4.0	August 9, 2016
0.3.1	August 9, 2016
0.3.0	August 9, 2016
0.2.1	August 7, 2016
0.2.0	August 6, 2016
0.1.0	August 5, 2016

Download files

Download the file for your platform.

Source distribution

fasttext-0.9.3.tar.gz (73.4 kB)
Uploaded June 12, 2024, Source

Built distribution

fasttext-0.9.3-cp39-cp39-macosx_14_0_arm64.whl (282.8 kB)
Uploaded June 12, 2024, CPython 3.9, macOS 14.0+ ARM64

Hashes for fasttext-0.9.3.tar.gz
Algorithm	Hash digest
SHA256	`eb03f2ef6340c6ac9e4398a30026f05471da99381b307aafe2f56e4cd26baaef`
MD5	`0ccab4e897c80f2d85402d03c8f7734b`
BLAKE2b-256	`9f3b9a10b95eaf565358339162848863197c3f0a29b540ca22b2951df2d66a48`

Hashes for fasttext-0.9.3-cp39-cp39-macosx_14_0_arm64.whl
Algorithm	Hash digest
SHA256	`8b39f3ac5df43873648ea400cb75d4f7f9455730ac5105490b23b70f14e03ea7`
MD5	`22b3f774ae0ad110b1e2e4bee2e90b64`
BLAKE2b-256	`a8f28d3c5969c4cca40752b9da43978b5c65b74a2eab48a2964dacfd2a53c5d5`
