lxml-based XML to OCDS parser for the TEDective project

ETL

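The code in this repository is part of the TEDective project. It defines an ETL pipeline that transforms European public procurement data from Tenders Electronic Daily (TED) into a format that is easier to handle and analyse. Primarily, it transforms TED XML (and eForms, WIP) into Open Contracting Data Standard (OCDS) JSON and Parquet files to ease importing the data into:

  • a graph database (we use KuzuDB, but the processed data should be generic enough to support any graph database), and
  • a search engine (we use Meilisearch).

Before being imported into the graph database, organizations are deduplicated using Splink and linked to their GLEIF identifiers (WIP).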
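To make the XML to OCDS step more concrete, here is a minimal sketch of mapping a notice onto a few OCDS release fields. It is not the project's parser: the sample notice, its element names, the `notice_to_release` helper, and the `ocds-example-` prefix are all hypothetical, and the standard library's `xml.etree.ElementTree` stands in for lxml so the snippet runs without extra dependencies.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified notice. Real TED XML is namespaced and far richer.
SAMPLE_NOTICE = """
<NOTICE id="2017-000001">
  <TITLE>Road maintenance services</TITLE>
  <BUYER>
    <NAME>Example City Council</NAME>
  </BUYER>
</NOTICE>
"""


def notice_to_release(xml_text: str) -> dict:
    """Map the simplified notice onto a handful of OCDS release fields."""
    root = ET.fromstring(xml_text.strip())
    notice_id = root.get("id")
    buyer_name = root.findtext("BUYER/NAME")
    buyer_id = f"buyer-{notice_id}"  # illustrative identifier scheme (assumption)
    return {
        "ocid": f"ocds-example-{notice_id}",  # placeholder prefix, not a registered one
        "id": notice_id,
        "tag": ["tender"],
        "tender": {"id": notice_id, "title": root.findtext("TITLE")},
        "parties": [{"id": buyer_id, "name": buyer_name, "roles": ["buyer"]}],
        "buyer": {"id": buyer_id, "name": buyer_name},
    }


if __name__ == "__main__":
    print(json.dumps(notice_to_release(SAMPLE_NOTICE), indent=2))
```

The buyer appears both in `parties` and as a short reference in `buyer`, which is the pattern OCDS uses to avoid repeating organization details.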

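Background

The TEDective project aims to make European public procurement data explorable for non-experts. The transformation is largely based on the Open Contracting Data Standard (OCDS) EU profile.

The pipeline can therefore be used as a standalone component or as part of your own project that works with TED data. We use it for our own TEDective API, which powers the TEDective UI.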

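Install

:construction: Disclaimer: the installation instructions are valid as of 12 April 2024 but may change.

The ETL consists of two parts: the pipeline and the Luigi server (scheduler).

Using the PyPI package

The easiest way to install TEDective ETL is to install the PyPI package with pipx: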

pipx install tedective-etl
pipx ensurepath # to make sure it has been added to your path
run-pipeline --help

Using Nix

# Install flake into your profile
nix profile install git+https://git.fsfe.org/TEDective/etl
run-pipeline --help

Alternatively, you can clone this repository and build it yourself with Nix.

# Cloning the repository and entering it
git clone https://git.fsfe.org/TEDective/etl && cd etl

# Nix build using the provided flake
nix build

# Disclaimer: nix-command and flakes are experimental features, so you may need to add this flag to the command to be able to run it:
--extra-experimental-features 'nix-command flakes'
# You will also be prompted to accept or decline some extra configuration. Add this flag to accept it without a prompt, or omit it to decide manually:
--accept-flake-config
# For example:
# nix build --extra-experimental-features 'nix-command flakes' --accept-flake-config

Manual

Another option is to use Poetry directly.

After cloning this repository:

poetry install
poetry run run-pipeline --help

Usage

:construction: Disclaimer: the usage instructions are valid as of 12 April 2024 but may change.

General usage options

run-pipeline [-h] [--first-month FIRST_MONTH] [--last-month LAST_MONTH]
                    [--meilisearch-url MEILISEARCH_URL] [--in-dir IN_DIR]
                    [--output-dir OUTPUT_DIR] [--graph-dir GRAPH_DIR] [--local-scheduler]

options:
  -h, --help            show this help message and exit
  --first-month FIRST_MONTH
                        The first month to process. Defaults to '2017-01'.
  --last-month LAST_MONTH
                        The last month to process. Defaults to the last month.
  --meilisearch-url MEILISEARCH_URL
                        The URL of the Meilisearch server. Defaults to
                        'http://localhost:7700'
  --in-dir IN_DIR       The directory to store the TED XMLs. Defaults to '/tmp/ted_notices'
  --output-dir OUTPUT_DIR
                        The directory to store the output data. Defaults to '/tmp/output'
  --graph-dir GRAPH_DIR
                        The name of the KuzuDB graph. Defaults to '/tmp/graph'
  --local-scheduler     Use the local scheduler.
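For readers who want to script around these options, the following sketch rebuilds a parser with the same flags and defaults as the help text above using `argparse`. It is an illustration only, not the project's own code; in particular, `previous_month` is a hypothetical helper that encodes one reading of "defaults to the last month".

```python
import argparse
from datetime import date


def previous_month(today: date) -> str:
    """Return the month before `today` as 'YYYY-MM' (an assumed reading of 'the last month')."""
    if today.month == 1:
        return f"{today.year - 1:04d}-12"
    return f"{today.year:04d}-{today.month - 1:02d}"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options listed in the help text."""
    parser = argparse.ArgumentParser(prog="run-pipeline")
    parser.add_argument("--first-month", default="2017-01",
                        help="The first month to process.")
    parser.add_argument("--last-month", default=previous_month(date.today()),
                        help="The last month to process.")
    parser.add_argument("--meilisearch-url", default="http://localhost:7700",
                        help="The URL of the Meilisearch server.")
    parser.add_argument("--in-dir", default="/tmp/ted_notices",
                        help="The directory to store the TED XMLs.")
    parser.add_argument("--output-dir", default="/tmp/output",
                        help="The directory to store the output data.")
    parser.add_argument("--graph-dir", default="/tmp/graph",
                        help="The name of the KuzuDB graph.")
    parser.add_argument("--local-scheduler", action="store_true",
                        help="Use the local scheduler.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--last-month", "2017-02", "--local-scheduler"])
    print(args.first_month, args.last_month, args.local_scheduler)
```

Note that `argparse` exposes `--first-month` as `args.first_month`, with the hyphen replaced by an underscore.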

Using the PyPI package

After installation, you should be able to run the Luigi scheduler and the pipeline.

run-server
# In a different window
run-pipeline

You can also run an additional Meilisearch instance so that the search index can be built. This can be done inside the devenv; more details are given below.
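Before starting the pipeline, it can be useful to confirm that Meilisearch is reachable. The sketch below queries Meilisearch's `GET /health` endpoint at the default URL from the options above. The helper names are hypothetical and this check is not part of the pipeline itself.

```python
import http.client
import json
import urllib.request


def health_url(base_url: str) -> str:
    """Build the health endpoint URL from the server's base URL."""
    return base_url.rstrip("/") + "/health"


def is_available(body: bytes) -> bool:
    """Return True if a /health response body reports status 'available'."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "available"


def meilisearch_is_up(base_url: str = "http://localhost:7700", timeout: float = 2.0) -> bool:
    """Return True if the server answers the health check, False on any connection problem."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as response:
            return is_available(response.read())
    except (OSError, http.client.HTTPException):
        return False


if __name__ == "__main__":
    print("Meilisearch reachable:", meilisearch_is_up())
```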

Using Nix

# nix build creates a result folder that contains these scripts.
# This shows the arguments you can pass to the script:
result/bin/run-pipeline --help

# IMPORTANT: as mentioned above, the ETL has two parts. This is how to start Luigi so the pipeline can run:
result/bin/run-server

# For development purposes, we suggest using the --last-month flag to get set up quickly. You can also set --first-month if you want a specific time window of data. By default, the first month is 2017-01.
result/bin/run-pipeline --last-month 2017-02

In this case you can also run Meilisearch to build the search index. This can be done inside the devenv; more details are given below.
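To illustrate what a time window set with `--first-month` and `--last-month` covers, here is a small sketch that enumerates the months between two `YYYY-MM` values. The `month_range` helper is hypothetical, and treating both ends as inclusive is an assumption rather than documented pipeline behaviour.

```python
def month_range(first: str, last: str) -> list[str]:
    """List every month from `first` to `last` (both 'YYYY-MM', assumed inclusive)."""
    first_year, first_month = map(int, first.split("-"))
    last_year, last_month = map(int, last.split("-"))
    # Count months from year 0 so that year boundaries need no special handling.
    start = first_year * 12 + (first_month - 1)
    end = last_year * 12 + (last_month - 1)
    return [f"{index // 12:04d}-{index % 12 + 1:02d}" for index in range(start, end + 1)]


if __name__ == "__main__":
    print(month_range("2017-01", "2017-02"))
```

With the defaults and `--last-month 2017-02`, such a window would contain only two months, which is why a short window is convenient for development.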

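Manual (using Poetry)

Running the pipeline requires the Luigi daemon to be running. It is included in the project, and you can start it with the following command: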

poetry run run-server
# And the pipeline itself in a different window
poetry run run-pipeline

Running Meilisearch alongside the pipeline is recommended. With this method you will have to install it manually.

Maintainers

@linozen
@micgor32

Contributing

1. Nix development environment

If you use Nix, the easiest way to start developing is to use devenv through the provided flake.nix:

# If you have Nix installed
nix develop --impure
# This will drop you into a shell with all the dependencies installed.
# Disclaimer: nix-command and flakes are experimental features, so you may need to add this flag to the command to be able to run it:
--extra-experimental-features 'nix-command flakes'
# You will also be prompted to accept or decline some extra configuration. Add this flag to accept it without a prompt, or omit it to decide manually:
--accept-flake-config

# Inside the shell you have all the tools you need.

# These commands provide kuzu-explorer, which lets you run queries against the database.
kuzu-up
# And
kuzu-down

# Inside the devenv you also have access to Meilisearch.

# Inside the devenv, pre-commit hooks are set up with all the other checks, so this is the easiest way to make a commit to the repo.

2. Editing the documentation

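Tip: if you edit the README, please conform to the standard-readme specification. Also, make sure the documentation stays in sync with the code. Note that the main documentation repository has been added to this repository via git-subrepo. To update the documentation, use the following commands: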

git-subrepo pull docs
cd ./docs

# Make your changes
git commit -am "docs: update documentation for new feature"

# Preview your changes
pnpm install
pnpm run dev

# If you're happy with your changes, push them
git-subrepo push docs

License

EUPL-1.2 © 2024 Free Software Foundation Europe e.V.
