A dataframe modelling library built on top of polars and pydantic.

Patito

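Patito combines pydantic and polars to provide a simple way of writing modern, type-annotated data frame logic.

Patito offers a simple way to declare pydantic data models which double as schemas for your polars data frames. These schemas can be used for:

👮 Simple and performant data frame validation.
🧪 Easy generation of valid mock data frames for tests.
🐍 Retrieving and representing single rows in an object-oriented manner.
🧠 Providing a single source of truth for the core data models in your code base.

Patito has first-class support for polars, a "blazingly fast DataFrames library written in Rust".

Installation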

pip install patito

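Documentation

The full documentation of Patito is available online.

👮 Data validation

Patito allows you to specify the type of each column in your data frame by creating a type-annotated subclass of patito.Model: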

# models.py
from typing import Literal

import patito as pt


class Product(pt.Model):
    product_id: int = pt.Field(unique=True)
    temperature_zone: Literal["dry", "cold", "frozen"]
    is_for_sale: bool

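The class Product represents the schema of the data frame, while instances of Product represent single rows of the data frame. Patito can efficiently validate the content of arbitrary data frames and provide human-readable error messages: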

import polars as pl

df = pl.DataFrame(
    {
        "product_id": [1, 1, 3],
        "temperature_zone": ["dry", "dry", "oven"],
    }
)
try:
    Product.validate(df)
except pt.exceptions.DataFrameValidationError as exc:
    print(exc)
# 3 validation errors for Product
# is_for_sale
#   Missing column (type=type_error.missingcolumns)
# product_id
#   2 rows with duplicated values. (type=value_error.rowvalue)
# temperature_zone
#   Rows with invalid values: {'oven'}. (type=value_error.rowvalue)

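Summary of dataframe-compatible type annotations:

Regular Python data types such as int, float, bool, str, and date are validated against compatible polars data types.
Wrapping a type with typing.Optional indicates that the given column accepts missing values.
Model fields annotated with typing.Literal[...] are checked to contain only the restricted set of values, stored either as the native dtype (e.g. pl.Utf8) or as pl.Categorical.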

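Additionally, you can assign patito.Field to your class variables in order to specify additional checks:

Field(dtype=...) ensures that a specific dtype is used in cases where several data types are compatible with the annotated Python type, for example product_id: int = Field(dtype=pl.UInt32).
Field(unique=True) checks that every row has a unique value.
Field(gt=..., ge=..., le=..., lt=...) allows you to specify bound checks for any combination of > (gt), >= (ge), <= (le), and < (lt).
Field(multiple_of=divisor) checks that the given column only contains multiples of the given value.
Field(default=default_value, const=True) indicates that the given column is required and must take the given default value.
String fields annotated with Field(regex=r"<regex-pattern>"), Field(max_length=bound), and/or Field(min_length) are validated with polars' efficient string processing capabilities.
Custom constraints can be specified with Field(constraints=...), either as a single polars expression or as a list of expressions. All rows of the data frame must satisfy the given constraint(s) in order to be considered valid. Example: even_field: int = pt.Field(constraints=pl.col("even_field") % 2 == 0).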

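Although Patito supports pandas, it is highly recommended to use it in combination with polars. For a more feature-complete library, take a look at pandera.

🧪 Synthesize valid test data

Patito encourages you to strictly validate data frame inputs, thus ensuring correctness at runtime. But enforced correctness brings friction, especially during testing. Take the following function as an example: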

import polars as pl

def num_products_for_sale(products: pl.DataFrame) -> int:
    Product.validate(products)
    return products.filter(pl.col("is_for_sale")).height

The following test would fail with a patito.exceptions.DataFrameValidationError:

def test_num_products_for_sale():
    products = pl.DataFrame({"is_for_sale": [True, True, False]})
    assert num_products_for_sale(products) == 2

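In order for the test to pass, we would have to add valid dummy data for the temperature_zone and product_id columns. This quickly introduces a lot of boilerplate into every test involving data frames, obscuring what each test is actually testing. For this reason, Patito provides the examples constructor for generating test data that is fully compliant with the given model schema.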

Product.examples({"is_for_sale": [True, True, False]})
# shape: (3, 3)
# ┌─────────────┬──────────────────┬────────────┐
# │ is_for_sale ┆ temperature_zone ┆ product_id │
# │ ---         ┆ ---              ┆ ---        │
# │ bool        ┆ str              ┆ i64        │
# ╞═════════════╪══════════════════╪════════════╡
# │ true        ┆ dry              ┆ 0          │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ true        ┆ dry              ┆ 1          │
# ├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
# │ false       ┆ dry              ┆ 2          │
# └─────────────┴──────────────────┴────────────┘

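The examples() method accepts the same arguments as a regular data frame constructor, the main difference being that it fills in valid dummy data for any unspecified columns. The test can therefore be rewritten as: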

def test_num_products_for_sale():
    products = Product.examples({"is_for_sale": [True, True, False]})
    assert num_products_for_sale(products) == 2

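🖼️ A model-aware data frame class

Patito offers patito.DataFrame, a class that extends polars.DataFrame in order to provide utility methods related to patito.Model. The schema of a data frame can be specified at runtime by invoking patito.DataFrame.set_model(model), after which a set of contextualized methods become available:

DataFrame.validate() - Validate the given data frame and return itself.
DataFrame.drop() - Drop all superfluous columns not specified as fields in the model.
DataFrame.cast() - Cast any columns that are not compatible with the given type annotations. When Field(dtype=...) is specified, the given dtype is always enforced, even for columns whose current dtype is already compatible.
DataFrame.get(predicate) - Retrieve a single row from the data frame as an instance of the model. An exception is raised if the filter predicate does not return exactly one row.
DataFrame.fill_null(strategy="defaults") - Fill in missing values according to the default values set on the model schema.
DataFrame.derive() - A model field annotated with Field(derived_from=...) indicates that a column should be defined by some arbitrary polars expression. If derived_from is specified as a string, the value is interpreted as a column name and wrapped in polars.col(). These columns are created and populated according to their derived_from expressions when DataFrame.derive() is invoked.

These methods are best illustrated with an example: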

from typing import Literal

import patito as pt
import polars as pl


class Product(pt.Model):
    product_id: int = pt.Field(unique=True)
    # Specify a specific dtype to be used
    popularity_rank: int = pt.Field(dtype=pl.UInt16)
    # Field with default value "for-sale"
    status: Literal["draft", "for-sale", "discontinued"] = "for-sale"
    # The eurocent cost is extracted from the Euro cost string "€X.Y EUR"
    eurocent_cost: int = pt.Field(
        derived_from=100 * pl.col("cost").str.extract(r"€(\d+\.+\d+)").cast(float).round(2)
    )


products = pt.DataFrame(
    {
        "product_id": [1, 2],
        "popularity_rank": [2, 1],
        "status": [None, "discontinued"],
        "cost": ["€2.30 EUR", "€1.19 EUR"],
    }
)
product = (
    products
    # Specify the schema of the given data frame
    .set_model(Product)
    # Derive the `eurocent_cost` int column from the `cost` string column using regex
    .derive()
    # Drop the `cost` column as it is not part of the model
    .drop()
    # Cast the popularity rank column to an unsigned 16-bit integer and cents to an integer
    .cast()
    # Fill missing values with the default values specified in the schema
    .fill_null(strategy="defaults")
    # Assert that the data frame now complies with the schema
    .validate()
    # Retrieve a single row and cast it to the model class
    .get(pl.col("product_id") == 1)
)
print(repr(product))
# Product(product_id=1, popularity_rank=2, status='for-sale', eurocent_cost=230)

Every Patito model automatically gets a .DataFrame attribute, a custom data frame subclass where .set_model() is invoked at instantiation. In other words, pt.DataFrame(...).set_model(Product) is equivalent to Product.DataFrame(...).

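🐍 Representing rows as classes

Data frames are tailored for performing vectorized operations over a set of objects. But when you need to retrieve a single row and operate on it, the data frame construct naturally falls short. Patito allows you to embed row-level logic in methods defined on the model.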

# models.py
import patito as pt

class Product(pt.Model):
    product_id: int = pt.Field(unique=True)
    name: str

    @property
    def url(self) -> str:
        return (
            "https://example.com/no/products/"
            f"{self.product_id}-"
            f"{self.name.lower().replace(' ', '-')}"
        )

The class can be instantiated from a single row of a data frame by using the from_row() method:

import polars as pl

products = pl.DataFrame(
    {
        "product_id": [1, 2],
        "name": ["Skimmed milk", "Eggs"],
    }
)
# The comparison belongs outside pl.col(...), not inside its argument
milk_row = products.filter(pl.col("product_id") == 1)
milk = Product.from_row(milk_row)
print(milk.url)
# https://example.com/no/products/1-skimmed-milk

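If you "connect" the Product model with the DataFrame by using patito.DataFrame.set_model(), or alternatively by using Product.DataFrame directly, you can use the .get() method to filter the data frame down to a single row and cast it to the respective model class: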

products = Product.DataFrame(
    {
        "product_id": [1, 2],
        "name": ["Skimmed milk", "Eggs"],
    }
)
milk = products.get(pl.col("product_id") == 1)
print(milk.url)
# https://example.com/no/products/1-skimmed-milk

Release history

0.7.0 (this version): July 12, 2024
0.6.2: July 12, 2024
0.6.1: March 3, 2024
0.5.1: July 18, 2023
0.5.0: June 21, 2023
0.4.4: December 2, 2022
0.4.3: October 12, 2022
0.4.2: September 5, 2022
0.4.1: September 5, 2022
0.4.0: September 2, 2022
0.3.2: August 15, 2022
0.3.1: August 13, 2022
0.3.0: August 11, 2022
0.2.3: August 4, 2022
0.2.2: July 5, 2022
0.2.1: June 15, 2022
0.2.0: June 14, 2022
0.1.8: June 1, 2022
0.1.7: May 30, 2022
0.1.6: May 27, 2022
0.1.5: May 27, 2022
0.1.4: May 27, 2022
0.1.3: May 27, 2022
0.1.2: May 10, 2022
0.1.1: May 10, 2022
0.1.0: May 4, 2022

Download files

Download the file for your platform.

Source distribution

patito-0.7.0.tar.gz (41.3 kB), uploaded July 12, 2024, Source

Built distribution

patito-0.7.0-py3-none-any.whl (42.2 kB), uploaded July 12, 2024, Python 3

Hashes for patito-0.7.0.tar.gz
Algorithm	Hash digest
SHA256	`736a41894280462710c1cf1dfdd5cc278c0885c941ce782df3569551f2d11b7a`
MD5	`0cde69ed16d5893dc3488b6687a80cc1`
BLAKE2b-256	`3c591080fd302f32bdca2382dd1d96b6a21396d563e181d7c32eb14f746c99e5`

Hashes for patito-0.7.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fa92ce44d865d4936227c77d8f1d38348b9b69b090cf615af7827893f7754a33`
MD5	`15f491873da118772e265db2bbaf97b6`
BLAKE2b-256	`df028f281dc62272650c783daf23942b4b63274ddf0b38823c3d02dd4b8ccd08`
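A downloaded file can be checked against the SHA256 digests listed above with the standard library alone. This is a minimal sketch: the helper names and the chunk size are arbitrary choices, and the file path passed in is assumed to point at a local copy of the archive.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for patito-0.7.0.tar.gz
EXPECTED_SDIST_SHA256 = (
    "736a41894280462710c1cf1dfdd5cc278c0885c941ce782df3569551f2d11b7a"
)


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a published one, ignoring letter case."""
    return sha256_of(path) == expected_hex.lower()
```

For example, matches_published_hash("patito-0.7.0.tar.gz", EXPECTED_SDIST_SHA256) would be the check to run, assuming the archive was saved under that name in the working directory.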
