A tool for cleaning COCO datasets

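These details have not been verified by PyPI

Development status
- 3 - Alpha
Intended audience
- Developers
License
- OSI Approved :: Apache Software License
Natural language
- English

Project description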

Cocorepr

A tool for converting COCO datasets between different representations (currently only object detection is supported).

Installation

$ pip install -U cocorepr

Basic usage

$ cocorepr --help
usage: cocorepr [-h] [--in_json_file [IN_JSON_FILE [IN_JSON_FILE ...]]]
                [--in_json_tree [IN_JSON_TREE [IN_JSON_TREE ...]]]
                [--in_crop_tree [IN_CROP_TREE [IN_CROP_TREE ...]]] --out_path
                OUT_PATH --out_format {json_file,json_tree,crop_tree}
                [--seed SEED] [--max_crops_per_class MAX_CROPS_PER_CLASS]
                [--overwrite] [--indent INDENT] [--debug]

Tool for converting datasets in COCO format between different representations

optional arguments:
  -h, --help            show this help message and exit
  --in_json_file [IN_JSON_FILE [IN_JSON_FILE ...]]
                        Path to one or multiple json files storing COCO
                        dataset in `json_file` representation (all json-based
                        datasets will be merged).
  --in_json_tree [IN_JSON_TREE [IN_JSON_TREE ...]]
                        Path to one or multiple directories storing COCO
                        dataset in `json_tree` representation (all json-based
                        datasets will be merged).
  --in_crop_tree [IN_CROP_TREE [IN_CROP_TREE ...]]
                        Path to one or multiple directories storing COCO
                        dataset in `crop_tree` representation (all crop-based
                        datasets will be merged and will overwrite the json-
                        based datasets).
  --out_path OUT_PATH   Path to the output dataset (file or directory: depends
                        on `--out_format`)
  --out_format {json_file,json_tree,crop_tree}
  --seed SEED           Random seed.
  --max_crops_per_class MAX_CROPS_PER_CLASS
                        If set, the tool will randomly select up to this
                        number of crops (annotations) per each class
                        (category) and drop the others.
  --overwrite           If set, will delete the output file/directory before
                        dumping the result dataset.
  --indent INDENT       Indentation in the output json files.
  --debug
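
The --max_crops_per_class and --seed options describe a seeded, per-class random subsample. A minimal sketch of that idea, assuming annotations are plain dicts with a category_id field; the helper name sample_per_class is hypothetical and is not part of cocorepr:

```python
import random
from collections import defaultdict


def sample_per_class(annotations, max_per_class, seed=None):
    """Keep at most `max_per_class` randomly chosen annotations per category."""
    rng = random.Random(seed)  # a local generator keeps the selection reproducible
    by_class = defaultdict(list)
    for ann in annotations:
        by_class[ann["category_id"]].append(ann)

    kept = []
    for category_id in sorted(by_class):
        group = by_class[category_id]
        if len(group) > max_per_class:
            # Only classes with too many crops are subsampled.
            group = rng.sample(group, max_per_class)
        kept.extend(group)
    return kept


# Five annotations of class 1 and two of class 2 (made-up data).
annotations = [{"id": i, "category_id": 1 if i < 5 else 2} for i in range(7)]
subset = sample_per_class(annotations, max_per_class=3, seed=0)
```

Calling the function again with the same seed should return the same subset, which is the usual reason for exposing a seed option.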

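This tool converts datasets between three formats:

json_file (a single json file) - a common ML format,
json_tree (a set of json chunks) - Git-friendly,
crop_tree (a set of png crops of object detection annotations) - used for cleaning object detection datasets.

While the json-based formats are self-contained, the crop-based format requires at least one json-based path in order to reconstruct the dataset: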

$ cocorepr \
    --in_crop_tree /path/to/tree  \
    --out_path /tmp/crop_tree \
    --out_format crop_tree
INFO: Arguments: Namespace(debug=False, in_crop_tree=[PosixPath('/path/to/tree')], in_json_file=[], in_json_tree=[], indent=4, out_format='crop_tree', out_path=PosixPath('/tmp/crop_tree'), overwrite=False)
Traceback (most recent call last):
  File "/home/ay/.pyenv/versions/3.7.6/bin/cocorepr", line 33, in <module>
    sys.exit(load_entry_point('cocorepr', 'console_scripts', 'cocorepr')())
  File "/plain/github/nm/cocorepr/cocorepr/main.py", line 66, in main
    raise ValueError(f'Not found base dataset, please specify either of: '
ValueError: Not found base dataset, please specify either of: --in_json_tree / --in_json_file (multiple arguments allowed)

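The options --in_json_tree, --in_json_file and --in_crop_tree each expect one or more paths to the corresponding dataset representation. If multiple values are passed, the datasets are merged (all elements are required to have a unique id field).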

$ cocorepr \
    --in_json_file /tmp/json_file/file1.json /tmp/json_file/file2.json \
    --in_json_tree /tmp/json_tree/dir1 /tmp/json_tree/dir2 /tmp/json_tree/dir3 \
    --in_crop_tree /tmp/crop_tree/dir1 /tmp/crop_tree/dir2 \
    --out_path /tmp/json_tree \
    --out_format json_tree

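The command above loads the json_file dataset from /tmp/json_file/file1.json, then loads /tmp/json_file/file2.json and merges it with the first one, then loads the json_tree from /tmp/json_tree/dir1 and merges it with the previous result, and so on. It then uses the meta-information of the dataset built so far to load the crop_tree from /tmp/crop_tree/dir1 and merges it with /tmp/crop_tree/dir2. The result is written to /tmp/json_tree in the json_tree format (if the directory already exists, the tool fails unless --overwrite is specified).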
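
The merge order described here can be illustrated with a short sketch. It assumes each dataset is a dict with images, annotations and categories lists; the function name merge_datasets and the decision to raise an error when two different elements share an id are assumptions for illustration, not cocorepr's documented behavior:

```python
def merge_datasets(*datasets):
    """Merge COCO-style dicts section by section, keyed by each element's id."""
    sections = ("images", "annotations", "categories")
    merged = {name: {} for name in sections}
    for dataset in datasets:
        for name in sections:
            for element in dataset.get(name, []):
                # The first element seen for an id wins; identical repeats are ignored.
                existing = merged[name].setdefault(element["id"], element)
                if existing != element:
                    # Assumption: two different elements sharing an id is an error.
                    raise ValueError(f"conflicting {name} element with id={element['id']}")
    return {name: list(items.values()) for name, items in merged.items()}


first = {"categories": [{"id": 1, "name": "person"}], "images": [{"id": 49428}]}
second = {"categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "bicycle"}]}
merged = merge_datasets(first, second)
```

Datasets are folded in left to right, mirroring the order in which the paths appear on the command line.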

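Motivation

This tool was born at Neu.ro, when we were developing an ML project for a client who needed to process photos, detect objects in them, and classify those objects into a large number of classes. The client had a lot of data, but the data was very noisy.

Our solution consisted of roughly two models:

an object detection (OD) model: trained on a dataset to find generic objects (COCO-like: bottle, laptop, bus),
an object classification (CL) model: fine-tuned on the client's domain (for example: exactly which brand of bottle, which type of laptop).

While the first model could be built on generic datasets, the second problem required a lot of work with the client to clean the noisy data and prepare a classification dataset for fine-tuning.

For historical reasons, both datasets were collected, cleaned, and stored in the COCO format. Fortunately, we did not need to store the image blobs: the client's API guaranteed their availability and immutability, so we could store only the image URL and some other metadata (coco_url and id; the other fields are optional):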

{
    "id": 49428,  // image ID
    "coco_url": "http://images.cocodataset.org/train2017/000000049428.jpg",  // URL of the immutable image blob
    // "license": 6,
    // "file_name": "000000049428.jpg",
    // "height": 427,
    // "width": 640,
    // "date_captured": "2013-11-15 04:30:29",
    // "flickr_url": "http://farm7.staticflickr.com/6014/5923365195_bee5603371_z.jpg"
},

Although the COCO format is native to OD datasets, it can be excessive for CL datasets, which focus on the category of each annotation:

{
    "id": 124710,  // annotation ID
    "image_id": 140006,  // image ID in the section "images"
    "category_id": 2,  // class ID in the section "categories"
    "bbox": [496.52, 125.94, 143.48, 113.54],  // crop coordinates in pixels: [x,y,w,h] (from top-left, x=horizontal)
}

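To train the CL model, we wanted a specific number of "clean" crops per class (by crop we mean the small picture cut out of a given image using the coordinates of a given annotation). To make it easier to select clean crops manually, we wanted them grouped into directories by class (category). After cleaning, we wanted to reconstruct this subset of the COCO dataset, commit it to Git, and then use it to train the model. This is why cocorepr was created: to automate conversion between different representations of a COCO dataset.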
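
A crop can be derived from the bbox field of the annotation shown above. The sketch below converts [x, y, w, h] into a (left, top, right, bottom) box of integer pixel coordinates. The helper name bbox_to_box, the rounding choice (floor the origin, ceil the far corner, clamp to the image) and the 640x427 image size used in the example are all assumptions made for illustration:

```python
import math


def bbox_to_box(bbox, image_width, image_height):
    """Convert a COCO [x, y, w, h] bbox into an integer (left, top, right, bottom) box."""
    x, y, w, h = bbox
    left = max(0, math.floor(x))
    top = max(0, math.floor(y))
    # Clamp the far corner so the box never extends past the image.
    right = min(image_width, math.ceil(x + w))
    bottom = min(image_height, math.ceil(y + h))
    return left, top, right, bottom


box = bbox_to_box([496.52, 125.94, 143.48, 113.54], image_width=640, image_height=427)
print(box)
```

A box in this form is what Pillow's Image.crop accepts, so it could be used to cut the crop out of the downloaded image.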

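The representations of a COCO dataset are discussed in detail below.

Representations of a COCO dataset

Json file

This is the regular format of a COCO dataset: all annotations are stored in a single json file: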

$ cat examples/coco_chunk/json_file/instances_train2017_chunk3x2.json
{
    "licenses": [
        {
            "url": "http://creativecommons.org/licenses/by-nc-sa/2.0/",
            "id": 1,
            "name": "Attribution-NonCommercial-ShareAlike License"
        },
        ...
    ],
    "info": {
        "description": "COCO 2017 Dataset",
        "url": "http://cocodataset.org",
        "version": "1.0",
        "year": 2017,
        "contributor": "COCO Consortium",
        "date_created": "2017/09/01"
    },
    "categories": [
        {
            "supercategory": "person",
            "id": 1,
            "name": "person"
        },
        ...
    ],
    "images": [
        {
            "license": 6,
            "file_name": "000000049428.jpg",
            "coco_url": "http://images.cocodataset.org/train2017/000000049428.jpg",
            "height": 427,
            "width": 640,
            "date_captured": "2013-11-15 04:30:29",
            "flickr_url": "http://farm7.staticflickr.com/6014/5923365195_bee5603371_z.jpg",
            "id": 49428
        },
        ...
    ],
    "annotations": [
        {
            "image_id": 140006,
            "bbox": [
                496.52,
                125.94,
                143.48,
                113.54
            ],
            "category_id": 2,
            "id": 124710
        },
        ...
    ]
}

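This format is used as the input format by many ML frameworks, but the json file is usually too large to store in a Git repository (over 50M), so we need to either store it in Git LFS (which shows no diff, only the hash) or use another representation that works better with Git.

Json tree

This format makes the dataset Git-friendly: it stores each element in a separate json chunk, which lets Git compute diffs at the level of individual chunks.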

$ cocorepr \
    --in_json_file examples/coco_chunk/json_file/instances_train2017_chunk3x2.json \
    --out_path $TMP \
    --out_format json_tree  # --overwrite
INFO:root:Arguments: Namespace(in_crop_tree_path=None, in_json_path=PosixPath('examples/coco_chunk/json_file/instances_train2017_chunk3x2.json'), out_format='json_tree', out_path=PosixPath('/tmp/json_tree'), overwrite=False)
INFO:root:Loading json file from file: examples/coco_chunk/json_file/instances_train2017_chunk3x2.json
INFO:root:Loaded: images=6, annotations=6, categories=3
INFO:root:Dumping json tree to dir: /tmp/json_tree
INFO:root:[+] Success: json_tree dumped to /tmp/json_tree: ['info.json', 'info', 'categories', 'annotations', 'licenses', 'images']

$ tree /tmp/json_tree
/tmp/json_tree
├── annotations
│   ├── 124710.json
│   ├── 124713.json
│   ├── 131774.json
│   ├── 131812.json
│   ├── 183020.json
│   └── 183030.json
├── categories
│   ├── 1.json
│   ├── 2.json
│   └── 3.json
├── images
│   ├── 117891.json
│   ├── 140006.json
│   ├── 289949.json
│   ├── 49428.json
│   ├── 537548.json
│   └── 71345.json
├── info
├── info.json
└── licenses
    ├── 1.json
    ├── 2.json
    ├── 3.json
    ├── 4.json
    ├── 5.json
    ├── 6.json
    ├── 7.json
    └── 8.json

5 directories, 24 files
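
The layout above suggests how a json_file maps onto a json_tree: every list section becomes a directory with one <id>.json file per element, and info is written to info.json. A minimal sketch under that assumption follows; the function name dump_json_tree is hypothetical, and the real tool may differ in details (for example, it also creates an info directory):

```python
import json
import tempfile
from pathlib import Path


def dump_json_tree(dataset, out_dir, indent=4):
    """Write each element of a COCO-style dict to <out_dir>/<section>/<id>.json."""
    out_dir = Path(out_dir)
    for section, content in dataset.items():
        if isinstance(content, list):
            section_dir = out_dir / section
            section_dir.mkdir(parents=True, exist_ok=True)
            for element in content:
                path = section_dir / f"{element['id']}.json"
                path.write_text(json.dumps(element, indent=indent))
        else:
            # Non-list sections such as "info" go into a single file.
            out_dir.mkdir(parents=True, exist_ok=True)
            (out_dir / f"{section}.json").write_text(json.dumps(content, indent=indent))


# A tiny made-up dataset with the same section names as the example above.
dataset = {
    "info": {"year": 2017},
    "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "bicycle"}],
    "annotations": [{"id": 124710, "image_id": 140006, "category_id": 2}],
}
with tempfile.TemporaryDirectory() as tmp:
    dump_json_tree(dataset, tmp)
    written = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*.json"))
```

Because each element lives in its own small file, changing one annotation touches one file, which is what makes the diff readable in Git.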

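Crop tree

This format is meant to simplify manual cleaning of a CL dataset. The crops directory contains one directory per class, named {sanitized-class-name}--{class-id}, so that classes with similar names end up next to each other when directories are sorted alphabetically (for example, the car classes Bugatti Veyron EB 16.4 and Bugatti Veyron 16.4 Grand Sport become Bugatti_Veyron_EB_16_4--103209 and Bugatti_Veyron_16_4_Grand_Sport--376319). A human then goes through the crop images, deletes the "dirty" ones, and makes sure each class contains enough "clean" crops. Afterwards, we can reconstruct the dataset in the json tree representation and commit it to Git.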

$ cocorepr \
    --in_json_file examples/coco_chunk/json_file/instances_train2017_chunk3x2.json \
    --out_path /tmp/crop_tree \
    --out_format crop_tree
INFO:root:Arguments: Namespace(in_crop_tree_path=None, in_json_path=PosixPath('examples/coco_chunk/json_file/instances_train2017_chunk3x2.json'), indent=4, out_format='crop_tree', out_path=PosixPath('/tmp/crop_tree'), overwrite=False)
INFO:root:Loading json file from file: examples/coco_chunk/json_file/instances_train2017_chunk3x2.json
INFO:root:Loaded: images=6, annotations=6, categories=3
INFO:root:Detected input dataset type: json_file: examples/coco_chunk/json_file/instances_train2017_chunk3x2.json
INFO:root:Dumping crop tree to dir: /tmp/crop_tree
Processing images: 100%|                                           | 6/6 [00:03<00:00,  1.60it/s]
INFO:root:[+] Success: crop_tree dumped to /tmp/crop_tree: ['crops', 'images']

$ tree /tmp/crop_tree
/tmp/crop_tree
├── crops
│   ├── bicycle--2
│   │   ├── 124710.png
│   │   └── 124713.png
│   ├── car--3
│   │   ├── 131774.png
│   │   └── 131812.png
│   └── person--1
│       ├── 183020.png
│       └── 183030.png
└── images
    ├── 000000049428.jpg
    ├── 000000071345.jpg
    ├── 000000117891.jpg
    ├── 000000140006.jpg
    ├── 000000289949.jpg
    └── 000000537548.jpg

5 directories, 12 files
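
The class directory names in this listing follow the {sanitized-class-name}--{class-id} pattern. One way such names could be produced is sketched below. The exact sanitization rule used by cocorepr is not stated here, so replacing every run of non-alphanumeric characters with an underscore is an assumption, as is the helper name class_dir_name:

```python
import re


def class_dir_name(class_name, class_id):
    """Build a '{sanitized-class-name}--{class-id}' directory name."""
    # Assumed rule: collapse anything that is not a letter or digit into '_'.
    sanitized = re.sub(r"[^0-9A-Za-z]+", "_", class_name).strip("_")
    return f"{sanitized}--{class_id}"


names = [
    class_dir_name("Bugatti Veyron EB 16.4", 103209),
    class_dir_name("Bugatti Veyron 16.4 Grand Sport", 376319),
    class_dir_name("bicycle", 2),
]
```

Keeping the class id as a suffix means the id can be recovered from the directory name even when two sanitized names collide or look alike.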

This tree can now be cleaned manually (by deleting the "dirty" crops), after which we can reconstruct the dataset.
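
Reconstruction is possible because each crop file is named after its annotation id and each class directory ends with the class id. The sketch below reads the surviving annotations back from a cleaned tree, assuming the crops/<name>--<class-id>/<annotation-id>.png layout of the example tree; the function name surviving_annotations is hypothetical, and the real tool additionally needs a json-based dataset to supply the remaining fields:

```python
import tempfile
from pathlib import Path


def surviving_annotations(crop_tree_dir):
    """Return {annotation_id: category_id} for the crops still present in the tree."""
    result = {}
    for crop_path in Path(crop_tree_dir, "crops").glob("*/*.png"):
        # Directory names look like 'bicycle--2'; the class id follows the last '--'.
        category_id = int(crop_path.parent.name.rsplit("--", 1)[1])
        result[int(crop_path.stem)] = category_id
    return result


# Build a tiny cleaned tree with empty placeholder files, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("crops/bicycle--2/124710.png", "crops/car--3/131774.png"):
        path = Path(tmp, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    kept = surviving_annotations(tmp)
```

Annotation ids missing from the result correspond to crops deleted during manual cleaning, so they would presumably be dropped from the base dataset on reconstruction.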

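Showcase: a single iteration of the dataset cleaning process

Our setup

Our dataset is stored in the json_tree representation in the git repository /project/my-dataset. The dataset is incomplete: some classes lack "clean" annotations.
The client has provided additional data as two json_file datasets: /inputs/annotations-new-1.json and /inputs/annotations-new-2.json.
We want to merge these datasets into a single crop_tree representation, clean it manually, and then reconstruct the new dataset in our git repository, saving it in place.

Step 1: merge the datasets: json_tree + 2 x json_file -> crop_tree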

cocorepr \
    --in_json_tree /project/my-dataset \
    --in_json_file /inputs/annotations-new-1.json /inputs/annotations-new-2.json \
    --out_path /temp/my-dataset-crops \
    --out_format crop_tree \
    --overwrite \
    --debug
ls /temp/my-dataset-crops

Step 2: manually clean the crop_tree in /temp/my-dataset-crops

Step 3: reconstruct the cleaned dataset

# first, verify that your original dataset has no uncommitted changes (they'll be lost)
cd /project/my-dataset
git diff-index --quiet HEAD

cocorepr \
    --in_crop_tree /temp/my-dataset-crops \
    --in_json_tree /project/my-dataset \
    --out_path /project/my-dataset \
    --out_format json_tree \
    --overwrite \
    --debug

You can now commit the changes to the dataset in /project/my-dataset.


Release history

0.1.0 (this version): May 13, 2021
0.0.27: May 12, 2021
0.0.26: May 12, 2021
0.0.25: May 12, 2021
0.0.24: May 12, 2021
0.0.23: May 12, 2021
0.0.22: May 12, 2021
0.0.21: May 12, 2021
0.0.20: May 12, 2021
0.0.19: May 12, 2021
0.0.18: May 12, 2021
0.0.17: May 12, 2021
0.0.16.1: May 12, 2021
0.0.16: May 12, 2021
0.0.15: May 12, 2021
0.0.14: May 12, 2021
0.0.13: May 12, 2021
0.0.12: May 12, 2021
0.0.11: May 12, 2021
0.0.10: May 12, 2021
0.0.9: May 12, 2021
0.0.8: May 12, 2021
0.0.7: May 11, 2021
0.0.6: May 11, 2021
0.0.5: May 11, 2021
0.0.4: May 11, 2021
0.0.3: May 11, 2021
0.0.2: May 11, 2021
0.0.1: May 10, 2021

Download files

Source distribution

cocorepr-0.1.0.tar.gz (26.0 kB), uploaded May 13, 2021, Source

Built distribution

cocorepr-0.1.0-py3-none-any.whl (34.0 kB), uploaded May 13, 2021, Python 3

Hashes for cocorepr-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8b8e5dea6d881d523769353ec4fb13ce93c54048453614c157166714a07d1269`
MD5	`f473b8b576edbc94dea9548bceb4d816`
BLAKE2b-256	`3341ce4b453ba848ddb043a4c75dcbcca628d615d9c920c5ccab7a4536b74b51`

Hashes for cocorepr-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c188e582963f61e23965b71109efdbebda385865e5a900ac79bfa96d0adfcfa2`
MD5	`7e44435f06e550cc094dd3b855294f68`
BLAKE2b-256	`5b7c00bd87a991e1ee271cdaccab5ef7a28ae27fa43fc7d16c9c3b06aa34c935`
