ckanext-resource_indexer

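Find more results in dataset search by searching through the contents of resources.

This extension indexes the contents of files attached to resources, so users are more likely to find relevant results when using the site search.

The indexing process can be customized for each file format by means of resource indexers. The following formats are supported out of the box:

Plain text
PDF
JSON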

Structure

Installation
Configuration
Indexers
- Register own indexer
- Built-in indexers

Installation

Install the package as a CKAN extension:
```
pip install ckanext-resource-indexer
```
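Add resource_indexer to the list of enabled plugins.
Optional. Enable the built-in indexers by adding any of the following to the list of enabled plugins: plain_resource_indexer, pdf_resource_indexer, json_resource_indexer.

Configuration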

# Attempt to index remote files (fetch them into a tmp folder
# using the URL)
# (optional, default: false).
ckanext.resource_indexer.allow_remote = 1

# Timeout for the attempt to download a remote file
# (optional, default: 2).
ckanext.resource_indexer.remote_timeout = 10

# The size threshold (MB) for remote resources
# (optional, default: 4).
ckanext.resource_indexer.max_remote_size = 4

# List of resource formats(lowercase) that should be
# indexed.
# (optional, default: None)
ckanext.resource_indexer.indexable_formats = txt pdf

# Store the data extracted from resource inside specified field in the index.
# If empty, store data inside the general-purpose `text` field.
# (optional, default: text)
ckanext.resoruce_indexer.index_field = extras_res_attachment

# Boost matches by resource's content. Set a value greater than 1 to
# promote such matches, or a value between 0 and 1 to push such
# matches further down in search results. Works only when using a custom
# index field (`ckanext.resoruce_indexer.index_field`)
# (optional, default: 1)
ckanext.resoruce_indexer.search_boost = 0.5

##### Indexer-specific options ###############

### Plain
# Space-separated list of formats that can be indexed as a plain text
# (optional, default: txt csv json yaml yml html)
ckanext.resource_indexer.plain.indexable_formats = xml txt csv

### PDF
# Transform the text of a single page before it is added to the index
# (optional, default: builtins:str)
ckanext.resoruce_indexer.pdf.page_processor = custom.module:value_processor

### JSON
# Index JSON files as plain text(in addition to indexing as mapping)
# (optional, default: false)
ckanext.resoruce_indexer.json.add_as_plain = true

# Change a key before it's used for patching the package dictionary
# (optional, default: builtins:str)
ckanext.resoruce_indexer.json.key_processor = custom.module:key_processor

# Change a value before it's used for patching the package dictionary
# (optional, default: builtins:str)
ckanext.resoruce_indexer.json.value_processor = custom.module:value_processor
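
The options above are plain strings in the CKAN configuration. As a rough illustration of how such values might be interpreted, the sketch below reads a boolean flag and a space-separated format list from an ordinary dictionary. The helper names `read_bool` and `read_formats`, the set of accepted truthy strings, and the use of a plain `dict` in place of CKAN's config object are all assumptions made for this example, not the extension's own code.

```python
from typing import Mapping

# Hypothetical helpers: they only illustrate how the string values shown
# above could be interpreted. The extension's real parsing may differ.
TRUTHY = {"1", "true", "yes", "on"}


def read_bool(config: Mapping[str, str], key: str, default: bool = False) -> bool:
    """Interpret a config string such as "1" or "true" as a boolean."""
    raw = config.get(key)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def read_formats(config: Mapping[str, str], key: str) -> list[str]:
    """Split a space-separated list of formats and lowercase each entry."""
    return [fmt.lower() for fmt in config.get(key, "").split()]


config = {
    "ckanext.resource_indexer.allow_remote": "1",
    "ckanext.resource_indexer.indexable_formats": "txt pdf",
}

allow_remote = read_bool(config, "ckanext.resource_indexer.allow_remote")
formats = read_formats(config, "ckanext.resource_indexer.indexable_formats")
```

Lowercasing the formats mirrors the note above that the list of indexable formats is expected in lowercase.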

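Indexers

To extract data from resources, this extension uses indexers. These are CKAN plugins that implement the IResourceIndexer interface.

For every resource whose format is listed in the ckanext.resource_indexer.indexable_formats config option, the extension searches for a suitable indexer. If no indexer is found (or the resource's format is not listed in ckanext.resource_indexer.indexable_formats), the resource is skipped.

Note: indexing can be temporarily disabled using one of the following methods:

Set the environment variable CKANEXT_RESOURCE_INDEXER_BYPASS (any non-empty value), and the plugin will not interfere with the standard dataset indexing process.

Use the ckanext.resource_indexer.utils.disabled_indexation context manager: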

from ckanext.resource_indexer.utils import disabled_indexation

with disabled_indexation():
    here_indexation_does_not_happen()

here_indexation_happens()
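
To make the two bypass mechanisms easier to picture, here is a minimal stand-alone sketch of how an environment-variable check and a context manager could be combined. It is a hypothetical re-implementation for illustration only: the names `disabled_indexation_sketch`, `indexation_enabled`, and the module-level counter are assumptions, and the extension's real `disabled_indexation` may work differently.

```python
import contextlib
import os

# Hypothetical re-implementation for illustration only; the extension ships
# its own `disabled_indexation` in ckanext.resource_indexer.utils.
BYPASS_VAR = "CKANEXT_RESOURCE_INDEXER_BYPASS"
_disabled_depth = 0


@contextlib.contextmanager
def disabled_indexation_sketch():
    """Temporarily mark indexing as disabled, restoring state on exit."""
    global _disabled_depth
    _disabled_depth += 1
    try:
        yield
    finally:
        # Runs even if the body raises, so indexing is always re-enabled.
        _disabled_depth -= 1


def indexation_enabled() -> bool:
    """Indexing runs only when neither bypass mechanism is active."""
    if os.environ.get(BYPASS_VAR):  # any non-empty value disables indexing
        return False
    return _disabled_depth == 0
```

A counter rather than a boolean keeps nested `with` blocks safe: indexing resumes only after the outermost block exits.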

Every indexer has a weight (priority). The indexer with the highest weight is used to index the resource.
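
The selection rule can be sketched as follows: skip resources whose format is not listed as indexable, ask every indexer for its weight, and keep the one with the highest non-zero weight. `StubIndexer` and `select_indexer` are hypothetical names introduced for this example, and the plain list of indexers stands in for whatever plugin lookup the extension actually performs.

```python
from typing import Any, Optional, Sequence


class StubIndexer:
    """Hypothetical stand-in for a plugin implementing IResourceIndexer."""

    def __init__(self, name: str, weights: dict[str, int]):
        self.name = name
        self._weights = weights

    def get_resource_indexer_weight(self, resource: dict[str, Any]) -> int:
        # Weight 0 means "skip this handler" for the given resource.
        return self._weights.get(resource.get("format", "").lower(), 0)


def select_indexer(
    indexers: Sequence[StubIndexer],
    resource: dict[str, Any],
    indexable_formats: Sequence[str],
) -> Optional[StubIndexer]:
    """Return the highest-weight indexer, or None if the resource is skipped."""
    if resource.get("format", "").lower() not in indexable_formats:
        return None
    weighted = [(i.get_resource_indexer_weight(resource), i) for i in indexers]
    candidates = [(weight, i) for weight, i in weighted if weight > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


plain = StubIndexer("plain", {"txt": 10, "json": 10})
json_indexer = StubIndexer("json", {"json": 20})
chosen = select_indexer([plain, json_indexer], {"format": "JSON"}, ["txt", "json"])
```

Here the JSON resource goes to the indexer that reports weight 20, while the weight-10 indexer would only be used if nothing better were available.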

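Indexing consists of two steps:

Extract meaningful chunks of data from the resource
Merge those chunks into the package dictionary that the search engine (Solr) uses for indexing

This means the format of the extracted chunks must be compatible with the merge logic of the second step. Beyond that, there are no specific requirements for the format of the extracted data.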
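
The two steps can be sketched as a tiny pipeline. In this hypothetical example the chunks are the non-empty lines of a text file, and the merge step appends them to a list stored under `text`. `LineIndexer`, `index_resource`, and the list-valued `text` entry are assumptions for illustration; the only real constraint is that the merge step understands whatever the extraction step returns.

```python
import os
import tempfile
from typing import Any


class LineIndexer:
    """Hypothetical indexer: chunks are the non-empty lines of a text file."""

    def extract_indexable_chunks(self, path: str) -> list[str]:
        with open(path, encoding="utf-8") as src:
            return [line.strip() for line in src if line.strip()]

    def merge_chunks_into_index(self, pkg_dict: dict[str, Any], chunks: list[str]):
        # The merge step only needs to understand what extraction produced.
        pkg_dict.setdefault("text", []).extend(chunks)


def index_resource(indexer: LineIndexer, path: str, pkg_dict: dict[str, Any]) -> None:
    """Run both steps: extract chunks, then merge them into the package."""
    chunks = indexer.extract_indexable_chunks(path)
    indexer.merge_chunks_into_index(pkg_dict, chunks)


# Usage: index a throwaway file into an otherwise empty package dictionary.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\n\nsecond line\n")

pkg = {"name": "demo-dataset"}
index_resource(LineIndexer(), tmp.name, pkg)
os.remove(tmp.name)
```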

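Data extraction happens locally. If the resource was uploaded to the local filesystem, data is extracted directly from the resource's file. If the resource is stored remotely (either uploaded to the cloud or linked via a remote URL), it can be temporarily downloaded to the local filesystem and removed once processing is complete. By default, non-local resources are ignored; this can be changed with the ckanext.resource_indexer.allow_remote config option.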
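
A temporary download with a timeout and a size cap might look like the sketch below. It uses only the standard library; `fetch_remote_resource` and its parameters are hypothetical names that loosely mirror the `remote_timeout` and `max_remote_size` options, and the extension's actual download logic is not shown in this document.

```python
import os
import tempfile
import urllib.request
from typing import Optional


def fetch_remote_resource(url: str, timeout: float, max_size_mb: float) -> Optional[str]:
    """Download `url` into a temporary file and return its path.

    Returns None when the payload exceeds `max_size_mb`. This helper is a
    hypothetical illustration, not the extension's implementation.
    """
    limit = int(max_size_mb * 1024 * 1024)
    fd, path = tempfile.mkstemp()
    written = 0
    try:
        with os.fdopen(fd, "wb") as dest, urllib.request.urlopen(
            url, timeout=timeout
        ) as resp:
            # Read in chunks so an oversized file is abandoned early.
            while chunk := resp.read(64 * 1024):
                written += len(chunk)
                if written > limit:
                    break
                dest.write(chunk)
    except Exception:
        os.remove(path)
        raise
    if written > limit:
        os.remove(path)
        return None
    return path
```

The caller is expected to delete the returned file once the chunks have been extracted, matching the "download, process, remove" flow described above.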

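Register own indexer

Implement ckanext.resource_indexer.interface.IResourceIndexer by providing the following methods:

from typing import Any

import ckan.plugins as plugins
from ckanext.resource_indexer.interface import IResourceIndexer

class CustomIndexerPlugin(plugins.SingletonPlugin):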
    plugins.implements(IResourceIndexer)

    def get_resource_indexer_weight(self, resource: dict[str, Any]) -> int:
        """Define priority of the indexer

        Args:
            resource: resource's details

        Returns:
            the weight of the indexer
            Expected values:
               0: skip handler
               10: use handler if no other handlers found
               20: use handler as a default one for the resource
               30: use handler as an optimal one for the resource
               40: use handler as a special-case handler for the resource
               50: ignore all the other handlers and use this one instead
        """
        # Weight.fallback is the lowest non-zero priority (10 in the list above)
        return Weight.fallback

    def extract_indexable_chunks(self, path: str) -> Any:
        """Extract indexable data from the resource

        The result can have any form as long as it can be merged into the
        package dictionary by the implementation of `merge_chunks_into_index`.

        Args:
            path: path to resource file

        Returns:
            all meaningful pieces of data with no type assumption

        """
        return []

    def merge_chunks_into_index(self, pkg_dict: dict[str, Any], chunks: Any):
        """Merge data into the package dictionary.


        Args:
            pkg_dict: package that is going to be indexed
            chunks: collection of data fragments extracted from resource

        Returns:
            nothing; the chunks are merged into `pkg_dict` itself
        """
        pass
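
As a more concrete illustration of the three methods, the sketch below indexes only the column names of CSV files. It is duck-typed so that it runs without CKAN installed; a real plugin would subclass `plugins.SingletonPlugin` and declare `plugins.implements(IResourceIndexer)` as shown above. `WeightSketch` mirrors the documented numeric values, but only the name `fallback` appears in the original example; the other member names, the `CsvHeaderIndexer` class, and the `extras_res_columns` field are assumptions.

```python
import csv
import enum
import os
import tempfile
from typing import Any


class WeightSketch(enum.IntEnum):
    """Hypothetical mirror of the documented weight values.

    Only `fallback` is named in the example above; the other member names
    are assumptions chosen for this sketch.
    """

    skip = 0
    fallback = 10
    default = 20
    optimal = 30
    special = 40
    override = 50


class CsvHeaderIndexer:
    """Duck-typed sketch of an indexer that handles CSV resources only."""

    def get_resource_indexer_weight(self, resource: dict[str, Any]) -> int:
        is_csv = resource.get("format", "").lower() == "csv"
        return WeightSketch.default if is_csv else WeightSketch.skip

    def extract_indexable_chunks(self, path: str) -> list[str]:
        # Index only the column names from the first row.
        with open(path, newline="", encoding="utf-8") as src:
            return next(csv.reader(src), [])

    def merge_chunks_into_index(self, pkg_dict: dict[str, Any], chunks: list[str]):
        # `extras_res_columns` is a made-up field name for this example.
        pkg_dict["extras_res_columns"] = " ".join(chunks)


# Usage: run the indexer against a throwaway CSV file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("city,population\nOslo,700000\n")

indexer = CsvHeaderIndexer()
pkg: dict[str, Any] = {}
indexer.merge_chunks_into_index(pkg, indexer.extract_indexable_chunks(tmp.name))
os.remove(tmp.name)
```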

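Built-in indexers

Plain indexer

Indexes resources whose format is listed in ckanext.resource_indexer.indexable_formats, provided the format also matches the value of the ckanext.resource_indexer.plain.indexable_formats config option, unless another handler with a non-fallback weight (>10) is found.

Resources are indexed as-is. The file is read and sent to the index without any additional changes.

Enable it by adding plain_resource_indexer to the list of enabled plugins.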

PDF indexer

Extracts text from PDF files and indexes it.

To enable it:

Install the current extension with the pdf extra:
```
pip install 'ckanext-resource-indexer[pdf]'
```
Or, if you have already installed the extension itself, just install pdftotext:
```
pip install pdftotext
```
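Add pdf_resource_indexer to the list of enabled plugins.

Install the system packages needed for working with PDF. The exact packages depend on your system, for example:

CentOS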

yum install -y pulseaudio-libs-devel \
   gcc-c++ pkgconfig \
   python3-devel \
   libxml2-devel libxslt-devel \
   poppler poppler-utils poppler-cpp-devel

Debian

apt install build-essential libpoppler-cpp-dev pkg-config python3-dev

macOS
```
brew install pkg-config poppler python
```

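If PDF content needs to be preprocessed, specify a function that transforms the text of each individual page as ckanext.resoruce_indexer.pdf.page_processor. It uses the standard import-string format: module.import.path:function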
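
An import string of this form can be resolved with the standard library, as the following sketch shows. `resolve_import_string` is a hypothetical helper written for this example (the extension's own resolver is not shown here), and `collapse_whitespace` is just one possible page processor.

```python
import importlib
from typing import Any, Callable


def resolve_import_string(spec: str) -> Callable[..., Any]:
    """Resolve "package.module:attribute" into the object it names."""
    module_path, _, attr = spec.partition(":")
    if not module_path or not attr:
        raise ValueError(f"Expected 'module.path:function', got {spec!r}")
    return getattr(importlib.import_module(module_path), attr)


def collapse_whitespace(page: str) -> str:
    """Example page processor: squeeze runs of whitespace into one space."""
    return " ".join(page.split())


# The documented default, `builtins:str`, resolves to the built-in `str`,
# which leaves page text unchanged.
default_processor = resolve_import_string("builtins:str")
```

If a function like `collapse_whitespace` lived in a module `custom.module`, it would be referenced in the config as `custom.module:collapse_whitespace`.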

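JSON indexer

Reads a dictionary from the JSON file, converts all non-string values to strings (i.e. nested values are not allowed), and applies it as a patch to the indexed dataset.

If the ckanext.resoruce_indexer.json.add_as_plain flag is enabled, the file content is additionally indexed as plain text (similar to the plain indexer).

If keys or values need preprocessing, specify a function that transforms the data as ckanext.resoruce_indexer.json.key_processor or ckanext.resoruce_indexer.json.value_processor. It uses the standard import-string format: module.import.path:function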
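
The patching behaviour can be approximated with the sketch below, which flattens a JSON object into string keys and string values and applies it to a package dictionary. `json_to_patch` is a hypothetical helper, and the `extras_` key prefix used in the usage lines is an assumption chosen for the example rather than something the extension is known to add.

```python
import json
from typing import Any, Callable


def json_to_patch(
    raw: str,
    key_processor: Callable[[Any], str] = str,
    value_processor: Callable[[Any], str] = str,
) -> dict[str, str]:
    """Turn a JSON object into a flat str -> str patch."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    # With the default processors, nested values are simply stringified.
    return {key_processor(k): value_processor(v) for k, v in data.items()}


pkg_dict = {"title": "Demo"}
patch = json_to_patch(
    '{"station": "A1", "depth": 12.5, "tags": ["x", "y"]}',
    key_processor=lambda key: f"extras_{key}",
)
pkg_dict.update(patch)
```

Both processors default to `str`, matching the documented `builtins:str` default for the key and value processors.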

Enable it by adding json_resource_indexer to the list of enabled plugins.


Release history

Version	Release date
0.4.2 (this version)	July 28, 2024
0.4.1	February 21, 2023
0.4.0	February 13, 2023
0.3.2	February 10, 2023
0.3.1	December 1, 2022
0.3.0	December 1, 2022
0.2.1	July 26, 2022
0.2.0	July 25, 2022
0.1.1	June 16, 2021
0.1.0	June 15, 2021
0.0.11	May 25, 2021
0.0.10	May 25, 2021
0.0.9	May 14, 2021
0.0.8	April 20, 2021
0.0.7.post1	March 31, 2021
0.0.7	March 31, 2021
0.0.6	January 8, 2021
0.0.5	November 2, 2020
0.0.4	September 14, 2020
0.0.3	September 14, 2020
0.0.2	May 5, 2020
0.0.1	April 29, 2020

Download files

Source distribution

ckanext_resource_indexer-0.4.2.tar.gz (27.9 kB), uploaded July 28, 2024

Built distribution

ckanext_resource_indexer-0.4.2-py3-none-any.whl (28.6 kB), uploaded July 28, 2024, Python 3

Hashes for ckanext_resource_indexer-0.4.2.tar.gz

Algorithm	Hash digest
SHA256	`511ecf0806a3e320e32a7e20b8c6cfa480e14ba6570c5735c3af488c448704ab`
MD5	`d50c505b7181ec5b63b3acf592748345`
BLAKE2b-256	`19f4abcdba1c2fe14e358819584ce9ab252719ecfe58749f3702ef8b5819a9e8`

Hashes for ckanext_resource_indexer-0.4.2-py3-none-any.whl

Algorithm	Hash digest
SHA256	`81e89364cd45ab57b838e34570baecd19db2219a2d6bff609aca3e294dff8943`
MD5	`0a5938a031ef99bd3d6b5b9b36e18a7e`
BLAKE2b-256	`46bf0a45ba78193a50eb6f0db48ecd34daadc259e19f8c85bd315ffeaac503d0`
