json2parquet · PyPI · Python Package Index

A simple Parquet converter for JSON/Python data

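Project description

This library wraps pyarrow and provides tools for converting JSON data to Parquet format. It is written mostly in Python, iterates over files, and copies the data several times in memory. It is not the fastest option, but it is convenient for smaller datasets or for users who are not overly concerned about speed.

Installation

With pip

pip install json2parquet

With conda

conda install -c conda-forge json2parquet

Usage

The following shows how to convert an arbitrary JSON dataset.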

from json2parquet import convert_json

# Infer the schema (requires reading the dataset to collect column names)
convert_json(input_filename, output_filename)

# Given columns
convert_json(input_filename, output_filename, ["my_column", "my_int"])

# Given columns and custom field names
field_aliases = {'my_column': 'my_updated_column_name', "my_int": "my_integer"}
convert_json(input_filename, output_filename, ["my_column", "my_int"], field_aliases=field_aliases)

# Given a PyArrow schema
import pyarrow as pa
# PyArrow type factories must be called: pa.string(), not pa.string
schema = pa.schema([
    pa.field('my_column', pa.string()),
    pa.field('my_int', pa.int64()),
])
convert_json(input_filename, output_filename, schema)
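
The calls above leave `input_filename` undefined. As a minimal sketch, the snippet below writes a small sample input using only the standard library. It assumes the input is newline-delimited JSON (one object per line), which this page does not state explicitly; the file name and the records are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample records; the column names match the examples above.
records = [
    {"my_column": "alpha", "my_int": 1},
    {"my_column": "beta", "my_int": 2},
]

# Assumption: one JSON object per line (newline-delimited JSON).
input_filename = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(input_filename, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back to confirm that every line parses on its own.
with open(input_filename, encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f if line.strip()]
```

The resulting `input_filename` could then be passed to `convert_json` together with an output path of your choice.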

You can also work with Python data structures directly

from json2parquet import load_json, ingest_data, write_parquet, write_parquet_dataset

# Loading JSON to a PyArrow RecordBatch (schema is optional as above)
load_json(input_filename, schema)

# Working with a list of dictionaries
ingest_data(input_data, schema)

# Working with a list of dictionaries and custom field names
field_aliases = {'my_column': 'my_updated_column_name', "my_int": "my_integer"}
ingest_data(input_data, schema, field_aliases)

# Writing Parquet Files from PyArrow Record Batches
write_parquet(data, destination)

# You can also pass any keyword arguments that PyArrow accepts
write_parquet(data, destination, compression='snappy')

# You can also write partitioned data
write_parquet_dataset(data, destination_dir, partition_cols=["foo", "bar", "baz"])
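
To make the effect of `field_aliases` concrete, here is a plain-Python sketch of the renaming it describes. It assumes each alias maps an input key to the output column name, as the dictionary above suggests. The helper `apply_aliases` is hypothetical and is not part of json2parquet; the library's own implementation may differ.

```python
def apply_aliases(rows, field_aliases):
    """Return new rows with keys renamed according to field_aliases.

    Hypothetical helper for illustration only; keys without an alias
    are kept unchanged.
    """
    return [
        {field_aliases.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


# Hypothetical input in the "list of dictionaries" shape used above.
input_data = [{"my_column": "alpha", "my_int": 1, "other": True}]
field_aliases = {"my_column": "my_updated_column_name", "my_int": "my_integer"}

renamed = apply_aliases(input_data, field_aliases)
print(renamed)
# [{'my_updated_column_name': 'alpha', 'my_integer': 1, 'other': True}]
```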

If you know your schema, you can specify a custom datetime format (only one for now). This format is ignored if you don't pass a PyArrow schema.

from json2parquet import convert_json

# Given PyArrow schema
import pyarrow as pa
schema = pa.schema([
    pa.field('my_column', pa.string()),
    pa.field('my_int', pa.int64()),
])
date_format = "%Y-%m-%dT%H:%M:%S.%fZ"
convert_json(input_filename, output_filename, schema, date_format=date_format)
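
The `date_format` string uses Python's standard `strftime`/`strptime` directives. The sketch below shows which timestamp strings that format matches, using only the standard library. Whether json2parquet parses values with `datetime.strptime` internally is an assumption, and the sample strings are made up.

```python
from datetime import datetime

date_format = "%Y-%m-%dT%H:%M:%S.%fZ"

# Hypothetical timestamp string in the shape the format expects:
# date, literal "T", time, fractional seconds, literal "Z".
raw = "2024-01-10T08:30:15.123456Z"
parsed = datetime.strptime(raw, date_format)
print(parsed)  # 2024-01-10 08:30:15.123456

# A string in a different shape does not match and raises ValueError.
mismatch = False
try:
    datetime.strptime("2024-01-10 08:30:15", date_format)
except ValueError:
    mismatch = True
```

Note that the trailing `Z` is matched as a literal character, so the parsed value carries no timezone information.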

Although json2parquet can infer schemas, it also has helpers for pulling in external ones

from json2parquet import load_json
from json2parquet.helpers import get_schema_from_redshift

# Fetch the schema from Redshift (requires psycopg2)
schema = get_schema_from_redshift(redshift_schema, redshift_table, redshift_uri)

# Load JSON with the Redshift schema
load_json(input_filename, schema)

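Operational Notes

If you are using this library to convert JSON data to be read by Spark, Athena, Spectrum or Presto, make sure you pass use_deprecated_int96_timestamps when writing your Parquet files; otherwise the timestamps will be read back as incorrect dates.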

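Contributing

Code Changes

Clone a fork of the library
Run make setup
Run make test
Apply your changes (don't bump the version)
Add tests if needed
Run make test to ensure nothing broke
Submit a PR

Documentation Changes

It is always a struggle to keep documentation correct and up to date. Any fixes are welcome. If you don't want to clone the repo to work locally, feel free to edit on Github and submit pull requests via Github's built-in features.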

Release history

Version	Release date
2.2.0 (current version)	Jan 10, 2024
2.1.0	Jan 19, 2023
2.0.0	Mar 14, 2022
1.0.0	Sep 14, 2020
0.0.28	Jul 17, 2019
0.0.27	Feb 19, 2019
0.0.26	Jan 22, 2019
0.0.25	Jan 3, 2019
0.0.24	Nov 13, 2018
0.0.23	Aug 13, 2018
0.0.22	Jul 9, 2018
0.0.21	Apr 23, 2018
0.0.20	Apr 23, 2018
0.0.19	Apr 13, 2018
0.0.18	Apr 6, 2018
0.0.17	Mar 23, 2018
0.0.16	Mar 19, 2018
0.0.15	Mar 7, 2018
0.0.14	Jan 31, 2018
0.0.13	Jan 25, 2018
0.0.12	Jan 25, 2018
0.0.11	Jan 25, 2018
0.0.10	Dec 8, 2017
0.0.9	Sep 25, 2017
0.0.8	Sep 25, 2017
0.0.7	Sep 25, 2017
0.0.6	Sep 22, 2017
0.0.5	Sep 22, 2017
0.0.4	Sep 19, 2017
0.0.3	Sep 18, 2017
0.0.2	Sep 13, 2017
0.0.1	Sep 13, 2017

Download files

Download the file for your platform.

Source Distribution

json2parquet-2.2.0.tar.gz (10.5 kB)

Uploaded Jan 10, 2024, Source

Built Distribution

json2parquet-2.2.0-py3-none-any.whl (7.7 kB)

Uploaded Jan 10, 2024, Python 3

Hashes for json2parquet-2.2.0.tar.gz

Algorithm	Hash digest
SHA256	`b40b2d6e2d98c6fe01a5b35e1a0d6685e24200b237c7e69ea64c00a36f555e59`
MD5	`e9376f12010e8dccd93a3ae26c7f04da`
BLAKE2b-256	`ad9af89cf9347e1c3bf3d93fc5a37495ddd5b7d0f04a916d59598b52bfed6044`

Hashes for json2parquet-2.2.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`c0c2d458e15805e369445bfbec0461fc102380a5953f3c0c0ace87256710d6ce`
MD5	`1250442095f2a58c3177ba7b8e80a09a`
BLAKE2b-256	`1466c27e1c0db2299ab437284933ee63de4fb40c35dbb1b3a15028b8c4758351`
