datasette-faiss

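Maintain a FAISS index for specified Datasette tables

For background on this project, see Semantic search answers: Q&A against documentation with GPT3 + OpenAI embeddings.

Installation

Install this plugin in the same environment as Datasette.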

datasette install datasette-faiss

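Usage

On startup, this plugin creates an in-memory FAISS index for each specified table, using the IndexFlatL2 FAISS index type.

If a table is modified after the server has started, its index will not (yet) pick up those changes.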
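To make the index type concrete, here is a minimal pure-Python sketch of the exact, brute-force search that a flat L2 index performs: every stored vector is compared with the query and ranked by squared Euclidean (L2) distance. It does not use FAISS or this plugin; the function name flat_l2_search and the sample vectors are assumptions made for illustration.

```python
import heapq


def flat_l2_search(vectors, query, k):
    """Return the k (id, squared L2 distance) pairs closest to query.

    vectors maps an ID to a list of floats. Illustrative stand-in only,
    not the plugin's implementation.
    """
    def squared_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Score every stored vector against the query (no approximation)
    scored = ((squared_l2(vec, query), id_) for id_, vec in vectors.items())
    return [(id_, dist) for dist, id_ in heapq.nsmallest(k, scored)]


# Hypothetical two-dimensional vectors keyed by ID
vectors = {
    1: [0.0, 0.0],
    2: [1.0, 0.0],
    3: [3.0, 4.0],
}
results = flat_l2_search(vectors, [0.0, 0.0], 2)
print(results)
```

Because every vector is scanned on each query, the cost of a search like this grows linearly with the number of rows.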

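Configuration

Each table to be indexed must have id and embedding columns. The embedding column must contain blobs, each one an array of floating point numbers encoded using the following Python function: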

import struct

def encode(vector):
    # Pack the values as consecutive 4-byte floats
    return struct.pack("f" * len(vector), *vector)

You can import that function from this package like so:

from datasette_faiss import encode
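As a quick check of the encoding, the sketch below reproduces the encode() function shown above using only the standard library and pairs it with a decode() helper. The decode() helper and the sample vector are assumptions for illustration, not part of the package's documented Python API.

```python
import struct


def encode(vector):
    # Same packing as the function shown above: one 4-byte float per value
    return struct.pack("f" * len(vector), *vector)


def decode(blob):
    # Hypothetical inverse: unpack 4 bytes per value back into floats
    return list(struct.unpack("f" * (len(blob) // 4), blob))


blob = encode([2.4, 4.1, 1.8])
print(len(blob))     # three values at 4 bytes each
print(decode(blob))  # values close to, but not exactly, the inputs
```

The blob length is always four times the number of dimensions, which gives a cheap sanity check before inserting rows.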

You can specify which tables should have indexes created for them by adding this to metadata.json:

{
    "plugins": {
        "datasette-faiss": {
            "tables": [
                ["blog", "embeddings"]
            ]
        }
    }
}

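Each table is given as an array containing the database name and the table name.

If you are using metadata.yml, the configuration should look like this: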

plugins:
  datasette-faiss:
    tables:
    - ["blog", "embeddings"]
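For a table to be indexable it needs the id and embedding columns described above. The following sketch creates such a table with Python's sqlite3 module. The column types, the helper name create_embeddings_table and the sample vectors are assumptions; it uses an in-memory database so that it runs anywhere, whereas a file such as blog.db would be needed to match the configuration above.

```python
import sqlite3
import struct


def encode(vector):
    # Same packing as the encode() function documented above
    return struct.pack("f" * len(vector), *vector)


def create_embeddings_table(conn, rows):
    """Create an embeddings table and insert (id, vector) rows."""
    conn.execute(
        "create table if not exists embeddings "
        "(id integer primary key, embedding blob)"
    )
    conn.executemany(
        "insert into embeddings (id, embedding) values (?, ?)",
        [(id_, encode(vector)) for id_, vector in rows],
    )
    conn.commit()


# In-memory database for demonstration only
conn = sqlite3.connect(":memory:")
create_embeddings_table(conn, [(1, [0.1, 0.2, 0.3]), (2, [0.4, 0.5, 0.6])])
stored = conn.execute(
    "select id, length(embedding) from embeddings order by id"
).fetchall()
print(stored)
```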

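SQL functions

This plugin makes six new SQL functions available within Datasette, two of which are aggregate functions:

faiss_search(database, table, embedding, k)

Returns the k nearest neighbors to the given embedding in the specified database and table. For example: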

select faiss_search('blog', 'embeddings', (select embedding from embeddings where id = 3), 5)

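This returns a JSON array of the IDs of the five records in the embeddings table of the blog database that are closest to the specified embedding. The return value looks like this: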

["1", "1249", "1011", "5", "10"]

You can use the SQLite json_each() function to turn that into a table-like sequence that you can join against.

Here's an example query that does that:

with related as (
  select value from json_each(
    faiss_search(
      'blog',
      'embeddings',
      (select embedding from embeddings limit 1),
      5
    )
  )
)
select * from blog_entry, related
where id = value
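The join pattern can be tried without the plugin by passing a literal JSON array to json_each(). The sketch below does this with Python's sqlite3 module, assuming the SQLite build includes the JSON functions. The blog_entry rows and the ID list are made up, and the explicit cast is a defensive assumption because the IDs arrive as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table blog_entry (id integer primary key, title text)")
conn.executemany(
    "insert into blog_entry (id, title) values (?, ?)",
    [(1, "First"), (5, "Fifth"), (10, "Tenth"), (42, "Unrelated")],
)

# Stand-in for the JSON array that faiss_search() would return
ids_json = '["1", "5", "10"]'

rows = conn.execute(
    """
    with related as (
      select value from json_each(?)
    )
    select blog_entry.id, blog_entry.title
    from blog_entry, related
    where blog_entry.id = cast(related.value as integer)
    order by blog_entry.id
    """,
    (ids_json,),
).fetchall()
print(rows)
```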

faiss_search_with_scores(database, table, embedding, k)

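Takes the same arguments as above, but the return value is a JSON array of pairs, each containing an ID and a score, something like this: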

[
    ["1", 0.0],
    ["1249", 0.21042244136333466],
    ["1011", 0.29391372203826904],
    ["5", 0.29505783319473267],
    ["10", 0.31554925441741943]
]
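Client code that receives this value can parse it with a standard JSON library. The sketch below turns the array into (id, score) tuples and optionally drops weaker matches. It assumes that lower scores mean closer matches, as the 0.0 score of the first entry suggests; the helper name parse_scored_ids and the 0.25 cutoff are hypothetical.

```python
import json

# A shortened copy of the example return value shown above
raw = (
    '[["1", 0.0], ["1249", 0.21042244136333466], '
    '["1011", 0.29391372203826904]]'
)


def parse_scored_ids(raw_json, max_score=None):
    """Parse the JSON array into (id, score) tuples.

    If max_score is given, keep only pairs at or below that score.
    """
    pairs = [(id_, float(score)) for id_, score in json.loads(raw_json)]
    if max_score is not None:
        pairs = [pair for pair in pairs if pair[1] <= max_score]
    return pairs


close = parse_scored_ids(raw, max_score=0.25)
print(close)
```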

faiss_encode(json_vector)

Given a JSON array of floats, returns the binary embedding blob that can be used with the other functions:

select faiss_encode('[2.4, 4.1, 1.8]')
-- Returns a 12 byte blob
select hex(faiss_encode('[2.4, 4.1, 1.8]'))
-- Returns 9A991940333383406666E63F

faiss_decode(vector_blob)

The opposite of faiss_encode().

select faiss_decode(X'9A991940333383406666E63F')

[2.4000000953674316, 4.099999904632568, 1.7999999523162842]

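Note that because of floating point arithmetic, the decoded numbers may not round-trip to exactly the values that were originally encoded.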
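The difference comes from storing each value as a 32-bit float (the "f" format in the encode() function above), while Python floats are 64-bit. This short standard-library sketch shows the effect for a single value and why comparisons should use a tolerance; the tolerance of 1e-6 is an arbitrary choice for illustration.

```python
import math
import struct

original = 2.4
# Round-trip the value through a 4-byte float, as the blob encoding does
stored = struct.unpack("f", struct.pack("f", original))[0]

print(stored)                                        # slightly different from 2.4
print(stored == original)                            # exact comparison fails
print(math.isclose(stored, original, rel_tol=1e-6))  # tolerant comparison passes
```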

faiss_agg(id, embedding, compare_embedding, k)

This aggregate function can be used to find the k nearest neighbors to compare_embedding for each unique value of id in the table. For example:

select faiss_agg(
    id, embedding, (select embedding from embeddings where id = 3), 5
) from embeddings

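Unlike the faiss_search() function, this does not depend on the per-table index that the plugin creates when it first starts running. Instead, an index is built every time the aggregate function is run.

This means it should only be used on smaller sets of values. Once you get above 10,000 or so, this function is likely to become prohibitively expensive.

The function returns a JSON array of the IDs of the k rows with the closest distance scores, like this: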

[1324, 344, 5562, 553, 2534]

You can use the json_each() function to turn that into a table-like sequence that you can join against.

Try an example faiss_agg() query.

faiss_agg_with_scores(id, embedding, compare_embedding, k)

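This is similar to the faiss_agg() aggregate function, but it returns a list of pairs, each containing an ID and the corresponding score, something like this (if k is 2):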

[[2412, 0.25], [1245, 24.25]]

Try an example faiss_agg_with_scores() query.

Development

To set up this plugin locally, first check out the code. Then create a new virtual environment:

cd datasette-faiss
python3 -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest

Release history

0.2.1 (this version): June 17, 2024
0.2: January 19, 2023
0.1a0 (pre-release): January 11, 2023

