Bags: Fast format for datasets

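Bags is a library for reading and writing multimodal datasets. Each dataset is a collection of linked files in the bag file format, a simple seekable container structure.

Features

🚀 Performance: Minimal overhead for maximum read and write throughput.
🔎 Seeking: Fast random access from disk by datapoint index.
🎞️ Sequences: Datapoints can contain seekable ranges of a modality.
🤸 Flexibility: The user provides encoders and decoders; examples are available.
👥 Sharding: Store datasets in shards to split processing workloads.

Installation

Bags is a single file, so you can just copy it into your project directory. Alternatively, install the package: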

pip install bags

Quickstart

Writing

import bags
import msgpack
import numpy as np

encoders = {
    'utf8': lambda x: x.encode('utf-8'),
    'int': lambda x, size: x.to_bytes(int(size), 'little'),
    'msgpack': msgpack.packb,
}

spec = {
    'foo': 'int(8)',   # 8-byte integer
    'bar': 'utf8[]',   # list of strings
    'baz': 'msgpack',  # packed structure
}

shardsize = 10 * 1024 ** 3  # 10 GiB shards
directory = './dataset'  # placeholder output path; adjust as needed

with bags.ShardedDatasetWriter(directory, spec, encoders, shardsize) as writer:
  writer.append({'foo': 42, 'bar': ['hello', 'world'], 'baz': {'a': 1}})
  # ...

Files

$ tree directory
.
├── 000000
│  ├── spec.json
│  ├── refs.bag
│  ├── foo.bag
│  ├── bar.bag
│  └── baz.bag
├── 000001
│  ├── spec.json
│  ├── refs.bag
│  ├── foo.bag
│  ├── bar.bag
│  └── baz.bag
└── ...

Reading

decoders = {
    'utf8': lambda x: x.decode('utf-8'),
    'int': lambda x, size=None: int.from_bytes(x, 'little'),  # match the encoder's byte order
    'msgpack': msgpack.unpackb,
}

with bags.ShardedDatasetReader(directory, decoders) as reader:
  print(len(reader))  # Total number of datapoints.
  print(reader.size)  # Total dataset size in bytes.
  print(reader.shards)

  # Read data points by index. This will read only the relevant bytes from
  # disk. An additional small read is used when caching index tables is
  # disabled, supporting arbitrarily large datasets with minimal overhead.
  assert reader[0] == {'foo': 42, 'bar': ['hello', 'world'], 'baz': {'a': 1}}

  # Read a subset of keys of a datapoint. For example, this allows quickly
  # iterating over the metadata fields of all datapoints without accessing
  # expensive image or video modalities.
  assert reader[0, {'foo': True, 'baz': True}] == {'foo': 42, 'baz': {'a': 1}}

  # Read only a slice of the 'bar' list. Only the requested slice will be
  # fetched from disk. For example, this could be used to load a subsequence of
  # a long video that is stored as list of consecutive MP4 clips.
  assert reader[0, {'bar': range(1, 2)}] == {'bar': ['world']}

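For small datasets that do not need sharding, you can also use DatasetReader and DatasetWriter. These can also be used to look at the individual shards of a sharded dataset.

For distributed processing using multiple processes or machines, use ShardedDatasetReader and ShardedDatasetWriter and set shard_start to the worker index and shard_stop to the total number of workers.

Formats

Bags does not impose a serialization solution on the user. Any word can be used as a type, as long as an encoder and a decoder are provided for it.

Examples of encode and decode functions for common types are provided in formats.py, including: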
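As an illustration of a user-defined type, the sketch below pairs an encoder and a decoder under the type name 'json'. The name 'json' and the sample record are assumptions made for this example; the only requirement taken from the text above is that the encoder returns bytes and the decoder inverts it.

```python
import json

# Hypothetical type name 'json'; any word works as long as both dicts agree.
encoders = {'json': lambda value: json.dumps(value, sort_keys=True).encode('utf-8')}
decoders = {'json': lambda blob: json.loads(blob.decode('utf-8'))}

record = {'label': 'cat', 'bbox': [1, 2, 3, 4]}  # made-up sample value
blob = encoders['json'](record)      # bytes, ready to be stored
restored = decoders['json'](blob)    # the original structure again
```

A spec entry such as {'meta': 'json'} would then refer to this pair by name, in the same way the quickstart spec refers to 'utf8' and 'msgpack'.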

Numpy
JPEG
PNG
MP4

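Types can be parameterized with args that are passed on to the encoder and decoder, for example array(float32,64,128).

Bag

The Bag format is a simple container file type. It stores a list of byte blobs, followed by an integer index table of the start locations of all blobs in the file. The start locations are encoded as 8-byte unsigned little-endian integers, and the table also includes the end offset of the last blob.

This format allows for fast random access, either by loading the index table into memory upfront, or by doing one small read to find the start and end locations followed by a targeted large read for the blob content.

Bags builds on top of Bag to read and write datasets with multiple modalities, where datapoints can contain sequences of blobs of a modality, with efficient seeking both for datapoints and for range queries into modalities.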
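To make the type-string convention concrete, here is a minimal sketch of how strings such as 'int(8)', 'utf8[]', or 'array(float32,64,128)' could be split into a type name, its args, and a list flag. The helpers parse_type and encode_value are hypothetical and are not part of the library; the library's own parser may differ. Args are kept as strings, which matches the quickstart encoder that calls int(size).

```python
import re

def parse_type(spec):
    """Split a type string into (name, args, is_list). Hypothetical helper."""
    match = re.fullmatch(r'(\w+)(?:\(([^)]*)\))?(\[\])?', spec.strip())
    if not match:
        raise ValueError(f'Unrecognized type string: {spec!r}')
    name, args, is_list = match.groups()
    # Args stay as strings; the encoder decides how to interpret them.
    args = tuple(a.strip() for a in args.split(',')) if args else ()
    return name, args, bool(is_list)

def encode_value(encoders, spec, value):
    """Apply the matching encoder, forwarding the parsed args. Hypothetical helper."""
    name, args, is_list = parse_type(spec)
    fn = encoders[name]
    if is_list:
        return [fn(item, *args) for item in value]
    return fn(value, *args)

encoders = {
    'utf8': lambda x: x.encode('utf-8'),
    'int': lambda x, size: x.to_bytes(int(size), 'little'),
}
encoded_int = encode_value(encoders, 'int(8)', 42)
encoded_list = encode_value(encoders, 'utf8[]', ['hello', 'world'])
```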
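The layout described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, and the reader assumes that the last table entry (the end offset of the last blob) is also the position where the table begins, so reading the final 8 bytes of the file is enough to locate the table.

```python
import io
import struct

def write_bag(stream, blobs):
    """Write blobs back to back, then a table of 8-byte little-endian offsets."""
    offsets = [0]
    for blob in blobs:
        stream.write(blob)
        offsets.append(offsets[-1] + len(blob))
    # The table holds every start position plus the end of the last blob.
    stream.write(struct.pack(f'<{len(offsets)}Q', *offsets))

def table_start(stream):
    """Assumption: the final table entry equals the offset where the table begins."""
    stream.seek(-8, io.SEEK_END)
    (start,) = struct.unpack('<Q', stream.read(8))
    return start

def count_blobs(stream):
    size = stream.seek(0, io.SEEK_END)
    return (size - table_start(stream)) // 8 - 1

def read_blob(stream, index):
    """One small read for the start and end offsets, then one targeted read."""
    stream.seek(table_start(stream) + 8 * index)
    start, stop = struct.unpack('<2Q', stream.read(16))
    stream.seek(start)
    return stream.read(stop - start)

buffer = io.BytesIO()
write_bag(buffer, [b'hello', b'', b'world!'])
```

Loading the whole table into memory upfront trades a little RAM for one fewer read per lookup; reading two offsets on demand, as read_blob does here, keeps memory use constant regardless of file size.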

Questions

If you have a question, please file an issue.

Release history

0.5.1 (this version): June 29, 2024
0.5.0: June 29, 2024
0.4.0: June 28, 2024
0.3.1: June 28, 2024
0.3.0: June 27, 2024
0.2.0: June 27, 2024
0.1.0: June 27, 2024

Download files

Source distribution

bags-0.5.1.tar.gz (13.2 kB), uploaded June 29, 2024

Hashes for bags-0.5.1.tar.gz

Algorithm	Hash digest
SHA256	`a47da7962708a26eb37c49193d34e0def34818fe99c21861c423527444669620`
MD5	`0df5ccb7a656d17a13807a84373b8c13`
BLAKE2b-256	`78c1f3b453f5b4e06ce62c11ffeca7339b72a71d8a07c0f8435b013eb3e3cf16`
