lxml-based XML-to-OCDS parser for the TEDective project
ETL
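The code in this repository is part of the TEDective project. It defines an ETL pipeline that transforms European public procurement data from Tenders Electronic Daily (TED) into a format that is easier to process and analyse. Primarily, it converts TED XML (and eForms, WIP) into Open Contracting Data Standard (OCDS) JSON and Parquet files to simplify importing the data into:
- a graph database (we use KuzuDB, but the processed data should be generic enough to support any graph database), and
- a search engine (we use Meilisearch).
Before import into the graph database, organisations are deduplicated using Splink and linked to their GLEIF identifiers (WIP).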
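To make the XML-to-OCDS step more concrete, here is a minimal sketch of the idea, not the project's actual parser. It uses Python's standard-library `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies. The input element names (`NOTICE`, `BUYER_NAME`, `TITLE`), the `ocds-example-` prefix and the function name `notice_to_release` are invented for illustration; real TED XML is far more complex. The output keys (`ocid`, `id`, `tag`, `initiationType`, `buyer`, `tender`) are taken from the OCDS release schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified notice; real TED XML looks different.
SAMPLE_XML = """
<NOTICE id="123456-2017">
  <BUYER_NAME>Example City Council</BUYER_NAME>
  <TITLE>Road maintenance services</TITLE>
</NOTICE>
"""


def notice_to_release(xml_text: str) -> dict:
    """Map the simplified notice above to an OCDS-style release dict."""
    root = ET.fromstring(xml_text.strip())
    notice_id = root.get("id")
    return {
        # "ocds-example-" is a placeholder, not a registered OCID prefix.
        "ocid": f"ocds-example-{notice_id}",
        "id": notice_id,
        "tag": ["tender"],
        "initiationType": "tender",
        "buyer": {"name": root.findtext("BUYER_NAME")},
        "tender": {"id": notice_id, "title": root.findtext("TITLE")},
    }


release = notice_to_release(SAMPLE_XML)
print(json.dumps(release, indent=2))
```

A dict of this shape can be serialised to JSON as shown, which is the kind of intermediate output the description above refers to.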
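Background
The TEDective project aims to make European public procurement data explorable for non-experts. The transformation is largely based on the Open Contracting Data Standard (OCDS) EU profile.
The pipeline can therefore be used as a standalone component or as part of your own project that does something interesting with TED data. We use it for our own TEDective API, which powers the TEDective UI.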
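Installation
:construction: Disclaimer: these installation instructions are valid as of 12 April 2024 and may change.
The ETL consists of two parts: the pipeline and the Luigi server (scheduler).
Using the PyPI package
The easiest way to install the TEDective ETL is from the PyPI package via pipx.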
pipx install tedective-etl
pipx ensurepath # to make sure it has been added to your path
run-pipeline --help
Using Nix
# Install flake into your profile
nix profile install git+https://git.fsfe.org/TEDective/etl
run-pipeline --help
Alternatively, you can clone this repository and build it yourself with Nix.
# Cloning the repository and entering it
git clone https://git.fsfe.org/TEDective/etl && cd etl
# Nix build using the provided flake
nix build
# Note: nix-command and flakes are experimental features, so you may need to add this flag to the command to be able to run it:
--extra-experimental-features 'nix-command flakes'
# You will also be prompted to accept or decline some extra configuration. Add this flag to accept it without a prompt, or omit it to decide manually:
--accept-flake-config
Manual
Another option is to use poetry directly.
After cloning this repository:
poetry install
poetry run run-pipeline --help
Usage
:construction: Disclaimer: these usage instructions are valid as of 12 April 2024 and may change.
General usage options
run-pipeline [-h] [--first-month FIRST_MONTH] [--last-month LAST_MONTH]
[--meilisearch-url MEILISEARCH_URL] [--in-dir IN_DIR]
[--output-dir OUTPUT_DIR] [--graph-dir GRAPH_DIR] [--local-scheduler]
options:
-h, --help show this help message and exit
--first-month FIRST_MONTH
The first month to process. Defaults to '2017-01'.
--last-month LAST_MONTH
The last month to process. Defaults to the last month.
--meilisearch-url MEILISEARCH_URL
The URL of the Meilisearch server. Defaults to
'http://localhost:7700'
--in-dir IN_DIR The directory to store the TED XMLs. Defaults to '/tmp/ted_notices'
--output-dir OUTPUT_DIR
The directory to store the output data. Defaults to '/tmp/output'
--graph-dir GRAPH_DIR
The name of the KuzuDB graph. Defaults to '/tmp/graph'
--local-scheduler Use the local scheduler.
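The month options above can be illustrated with a short sketch. It assumes that `--first-month` and `--last-month` describe an inclusive range of `YYYY-MM` values, which the help text suggests but does not state. `month_range` and `build_command` are hypothetical helpers, not part of the package; only flags listed in the help output are used.

```python
from datetime import date


def month_range(first: str, last: str) -> list[str]:
    """Return every 'YYYY-MM' from first to last, assuming both ends are inclusive."""
    start = date.fromisoformat(first + "-01")
    end = date.fromisoformat(last + "-01")
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(f"{year:04d}-{month:02d}")
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def build_command(first: str, last: str, local_scheduler: bool = False) -> list[str]:
    """Assemble a run-pipeline argument list using only documented flags."""
    cmd = ["run-pipeline", "--first-month", first, "--last-month", last]
    if local_scheduler:
        cmd.append("--local-scheduler")
    return cmd


print(month_range("2017-11", "2018-02"))
print(build_command("2017-01", "2017-02", local_scheduler=True))
```

Once the package is installed, the list returned by `build_command` could be passed to `subprocess.run`.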
Using the PyPI package
After installation, you should be able to run the Luigi scheduler and the pipeline.
run-server
# In a different window
run-pipeline
You can additionally run a Meilisearch instance so that the search index is built. This can be done inside the devenv; more details follow below.
Using Nix
# The nix build creates a result folder; inside it you will find these scripts.
# Show the arguments the script accepts:
result/bin/run-pipeline --help
# IMPORTANT: the ETL has two parts, so start the Luigi server so that the pipeline can run:
result/bin/run-server
# For development, we suggest using the --last-month flag to keep the run short. You can also set --first-month to select a specific time window. By default, the first month is 2017-01.
run-pipeline --last-month 2017-02
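In this case you can also run Meilisearch to build the search index. This can be done inside the devenv; more details follow below.
Manual (using poetry)
Running the pipeline requires a running Luigi daemon. It is included in the project, and you can start it with the following command: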
poetry run run-server
# And the pipeline itself in a different window
poetry run run-pipeline
Running Meilisearch alongside the pipeline is recommended; with this method you will have to install it manually.
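If you run Meilisearch alongside the pipeline, it can be useful to check that the server is reachable before starting a run. The sketch below queries Meilisearch's `GET /health` endpoint, which reports `{"status": "available"}` on a healthy instance. The helper name `meilisearch_available` and the two-second timeout are assumptions for illustration and are not part of the ETL; the default URL matches the documented default of `--meilisearch-url`.

```python
import json
import urllib.request


def meilisearch_available(base_url: str = "http://localhost:7700", timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/health reports status 'available'."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or a non-JSON body.
        return False
    return isinstance(payload, dict) and payload.get("status") == "available"


if __name__ == "__main__":
    print("Meilisearch reachable:", meilisearch_available())
```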
Maintainers
Contributing
1. Nix development environment
If you use Nix, the easiest way to start developing is to use the provided flake.nix via devenv.
# If you have Nix installed
nix develop --impure
# This will drop you into a shell with all the dependencies installed
# It also requires the experimental flags.
# Note: nix-command and flakes are experimental features, so you may need to add this flag to the command to be able to run it:
--extra-experimental-features 'nix-command flakes'
# You will also be prompted to accept or decline some extra configuration. Add this flag to accept it without a prompt, or omit it to decide manually:
--accept-flake-config
# Inside the shell you have all the needed tools.
# These commands give you kuzu-explorer, which lets you run queries against the database:
kuzu-up
# And
kuzu-down
# Inside the devenv you also have access to Meilisearch.
# Pre-commit hooks are set up inside the devenv together with all other checks, so this is the easiest way to make a commit to the repo.
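2. Editing the documentation
Small note: if you edit the README, please conform to the standard-readme specification. Also, please make sure the documentation stays in sync with the code. Note that the main documentation repository has been added to this repository via git-subrepo. To update the documentation, use the following commands: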
git-subrepo pull docs
cd ./docs
# Make your changes
git commit -am "docs: update documentation for new feature"
# Preview your changes
pnpm install
pnpm run dev
# If you're happy with your changes, push them
git-subrepo push docs
License
EUPL-1.2 © 2024 Free Software Foundation Europe e.V.