

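[arXiv link]

Getting Started • Usage • Benchmarks & Models • Credit & Citation

Vision-Language Model Evaluation Repository

This repository is designed to simplify the evaluation of vision-language models (VLMs). It provides a comprehensive set of tools and scripts for evaluating VLMs across benchmarks. We offer 60 VLMs, including recent large-scale models such as EVACLIP, with scales reaching 4.3B parameters and 12.8B training samples. We also provide implementations of 40 evaluation benchmarks.

Coming Soon

L-VLMs (e.g., PaliGemma, LlavaNext)

Getting Started

Install the package: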

pip install unibench -U

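[Option 2] Install dependencies

Install the necessary dependencies in one of two ways:
- Option 1, create a new conda environment: conda env create -f environment.yml
- Option 2, update your existing conda environment with the required libraries: conda env update --file environment.yml --prune
Activate the environment: conda activate unibench
Install the spaCy English language model: python -m spacy download en_core_web_sm
Install the package: pip install git+https://github.com/facebookresearch/unibench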
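
After either installation route, it can be useful to confirm that the package is visible to the current Python environment. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of UniBench.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str = "unibench"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
print(f"unibench version: {found}" if found else "unibench is not installed")
```

If the check reports that the package is missing, make sure the conda environment is activated before rerunning pip.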

Usage

Print out results from evaluated models

The following command prints the evaluation results for all benchmarks and models:

unibench show_results

Run evaluation using the command line

The following command runs the evaluation on all benchmarks and models:

unibench evaluate

Run evaluation using a custom script

The following script runs the evaluation on all benchmarks and models:

import unibench as vlm

evaluator = vlm.Evaluator()
evaluator.evaluate()

Arguments for evaluation

The evaluate function takes the following arguments:

Args:
    save_freq (int): The frequency at which to save results. Defaults to 1000.
    face_blur (bool): Whether to use face blurring during evaluation. Defaults to False.
    device (str): The device to use for evaluation. Defaults to "cuda" if available otherwise "cpu".
    batch_per_gpu (int): Evaluation batch size per GPU. Defaults to 32.

The Evaluator class takes the following arguments:

Args:
    seed (int): Random seed for reproducibility.
    num_workers (int): Number of workers for data loading.
    models (Union[List[str], str]): List of models to evaluate or "all" to evaluate all available models.
    benchmarks (Union[List[str], str]): List of benchmarks to evaluate or "all" to evaluate all available benchmarks.
    model_id (Union[int, None]): Specific model ID to evaluate.
    benchmark_id (Union[int, None]): Specific benchmark ID to evaluate.
    output_dir (str): Directory to save evaluation results.
    benchmarks_dir (str): Directory containing benchmark data.
    download_aggregate_precomputed (bool): Whether to download aggregate precomputed results.
    download_all_precomputed (bool): Whether to download all precomputed results.
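
To show how the two argument lists above fit together, here is a hedged sketch that collects constructor and evaluate arguments into dictionaries and passes them through. The argument names come from the lists above; every value (the seed, worker count, output directory, and so on) is an illustrative assumption, and run_evaluation is a hypothetical helper.

```python
# Keyword arguments named in the argument lists above; the values are examples only.
evaluator_kwargs = {
    "seed": 1337,
    "num_workers": 4,
    "models": ["clip_resnet50"],
    "benchmarks": ["fer2013", "pcam"],
    "output_dir": "./unibench_outputs",  # hypothetical location
}
evaluate_kwargs = {
    "save_freq": 500,
    "face_blur": False,
    "device": "cpu",
    "batch_per_gpu": 16,
}


def run_evaluation():
    """Build an Evaluator and run it (requires unibench to be installed)."""
    import unibench as vlm  # imported lazily so the config can be inspected without it

    evaluator = vlm.Evaluator(**evaluator_kwargs)
    evaluator.evaluate(**evaluate_kwargs)
```

Calling run_evaluation() starts the run. Keeping the settings in plain dictionaries makes it easy to log them next to the results.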

Example

The following command runs the vg_relation, clevr_distance, fer2013, pcam, and imageneta benchmarks on openclip_vitB32 trained on metaclip400m and on CLIP ResNet50:

unibench evaluate --models=[openclip_vitB32_metaclip_400m,clip_resnet50] --benchmarks=[vg_relation,clevr_distance,fer2013,pcam,imageneta]

In addition to saving the results in ~/.cache/unibench, the output includes a summary of the evaluation results:

  model_name                      non-natural images   reasoning   relation   robustness  
 ──────────────────────────────────────────────────────────────────────────────────────── 
  clip_resnet50                   63.95                 14.89       54.13      23.27       
  openclip_vitB32_metaclip_400m   63.87                 19.46       51.54      28.71
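
When the command in this example is launched from a Python script rather than typed by hand, the bracketed list arguments can be assembled programmatically. The sketch below only reproduces the syntax shown in the example; build_evaluate_command is a hypothetical helper, not part of UniBench.

```python
import shlex


def build_evaluate_command(models, benchmarks):
    """Return the argv list for `unibench evaluate` with bracketed list arguments."""
    return [
        "unibench",
        "evaluate",
        f"--models=[{','.join(models)}]",
        f"--benchmarks=[{','.join(benchmarks)}]",
    ]


argv = build_evaluate_command(
    ["openclip_vitB32_metaclip_400m", "clip_resnet50"],
    ["vg_relation", "clevr_distance", "fer2013", "pcam", "imageneta"],
)
# shlex.join quotes the bracketed arguments, which protects them in shells
# that treat [ and ] as glob characters.
print(shlex.join(argv))
```

The list can be passed directly to subprocess.run(argv, check=True), which avoids shell quoting altogether.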

Supported Models and Benchmarks

The full lists of models and benchmarks are available in models_zoo and benchmarks_zoo. You can also run the following commands:

unibench list_models
# or
unibench list_benchmarks

Sample Models

	Dataset Size (Million)	Number of Parameters (Million)	Learning Objective	Architecture	Model Name
blip_vitB16_14m	14	86	BLIP	vit	BLIP ViT B 16
blip_vitL16_129m	129	307	BLIP	vit	BLIP ViT L 16
blip_vitB16_129m	129	86	BLIP	vit	BLIP ViT B 16
blip_vitB16_coco	129	86	BLIP	vit	BLIP ViT B 16
blip_vitB16_flickr	129	86	BLIP	vit	BLIP ViT B 16

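Sample Benchmarks

	benchmark	benchmark_type
clevr_distance	zero-shot	vtab
fgvc_aircraft	zero-shot	transfer
objectnet	zero-shot	robustness
winoground	relation	relation
imagenetc	zero-shot	corruption

Benchmarks Overview

benchmark type	number of benchmarks
ImageNet	1
vtab	18
transfer	7
robustness	6
relation	6
corruption	1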

How results are saved

For each model, the results are saved in the output directory defined in constants: ~/.cache/unibench/outputs.
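
To see what a run has written, the output directory can be inspected from Python. This is a minimal sketch that assumes the directory is ~/.cache/unibench/outputs (consistent with the ~/.cache/unibench location mentioned in the example section); it makes no assumption about the file names or formats inside it.

```python
from pathlib import Path

# "~" expands to the current user's home directory.
output_dir = Path("~/.cache/unibench/outputs").expanduser()

if output_dir.is_dir():
    # Print every entry below the outputs directory, relative to it.
    for entry in sorted(output_dir.rglob("*")):
        print(entry.relative_to(output_dir))
else:
    print(f"No results found yet at {output_dir}")
```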

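Add a new benchmark

To add a new benchmark, you can simply inherit from the torch.utils.data.Dataset class and implement the __getitem__ and __len__ methods. For example, here is how to add FashionMNIST as a new benchmark: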

from functools import partial
from unibench import Evaluator
from unibench.benchmarks_zoo import ZeroShotBenchmarkHandler
from torchvision.datasets import FashionMNIST

class_names = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]

templates = ["an image of {}"]

benchmark = partial(
    FashionMNIST, root="/fsx-robust/haideraltahan", train=False, download=True
)
handler = partial(
    ZeroShotBenchmarkHandler,
    benchmark_name="fashion_mnist_new",
    classes=class_names,
    templates=templates,
)


eval = Evaluator()

eval.add_benchmark(
    benchmark,
    handler,
    meta_data={
        "benchmark_type": "object recognition",
    },
)
eval.update_benchmark_list(["fashion_mnist_new"])
eval.update_model_list(["blip_vitB16_129m"])
eval.evaluate()
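
The example above wraps an existing torchvision dataset. For data that is not already packaged that way, a custom dataset only needs __len__ and __getitem__. The sketch below is a hypothetical in-memory dataset; it assumes that items are (image, label) pairs as in torchvision datasets, and the transform parameter follows torchvision's convention rather than a documented UniBench requirement.

```python
try:
    from torch.utils.data import Dataset
except ImportError:  # lets the sketch run without PyTorch installed
    Dataset = object


class ToyPairsDataset(Dataset):
    """Hypothetical dataset holding (image, label) pairs in memory."""

    def __init__(self, images, labels, transform=None):
        if len(images) != len(labels):
            raise ValueError("images and labels must have the same length")
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[index]


# Strings stand in for real image objects in this toy example.
dataset = ToyPairsDataset(images=["img0", "img1", "img2"], labels=[0, 1, 0])
print(len(dataset), dataset[1])
```

Such a class could then be wrapped in functools.partial and passed to add_benchmark in the same way as FashionMNIST above.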

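Add a new model

The most important part of adding a new model is creating (or reusing) a subclass of AbstractModel that implements compute_zeroshot_weights, get_image_embeddings, and get_text_embeddings, similar to how ClipModel does it: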

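import inspect

import torch

# AbstractModel is UniBench's base wrapper class; its import is omitted here.


class ClipModel(AbstractModel):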
    def __init__(
        self,
        model,
        model_name,
        **kwargs,
    ):
        super(ClipModel, self).__init__(model, model_name, **kwargs)

    def compute_zeroshot_weights(self):
        zeroshot_weights = []
        for class_name in self.classes:
            texts = [template.format(class_name) for template in self.templates]

            class_embedding = self.get_text_embeddings(texts)

            class_embedding = class_embedding.mean(dim=0)
            class_embedding /= class_embedding.norm(dim=-1, keepdim=True)

            zeroshot_weights.append(class_embedding)
        self.zeroshot_weights = torch.stack(zeroshot_weights).T

    @torch.no_grad()
    def get_image_embeddings(self, images):
        image_features = self.model.encode_image(images.to(self.device))
        image_features /= image_features.norm(dim=1, keepdim=True)
        return image_features.unsqueeze(1)

    @torch.no_grad()
    def get_text_embeddings(self, captions):
        if (
            "truncate" in inspect.getfullargspec(self.tokenizer.__call__)[0]
            or "truncate" in inspect.getfullargspec(self.tokenizer)[0]
        ):
            caption_tokens = self.tokenizer(
                captions, context_length=self.context_length, truncate=True
            ).to(self.device)
        else:
            caption_tokens = self.tokenizer(
                captions, context_length=self.context_length
            ).to(self.device)

        caption_embeddings = self.model.encode_text(caption_tokens)
        caption_embeddings /= caption_embeddings.norm(dim=-1, keepdim=True)

        return caption_embeddings
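
To make the logic of compute_zeroshot_weights concrete, here is a dependency-free sketch of the same idea: average the text embeddings of all templates for a class, renormalize, and classify an image by the highest dot product. The 2-D vectors are made-up numbers standing in for real embeddings, all helper names are hypothetical, and the prediction step is an assumption about how such weights are typically used rather than UniBench's actual code.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def class_weight(template_embeddings):
    """Average one class's per-template embeddings, then renormalize."""
    dim = len(template_embeddings[0])
    mean = [
        sum(emb[i] for emb in template_embeddings) / len(template_embeddings)
        for i in range(dim)
    ]
    return normalize(mean)


def predict(image_embedding, weights):
    """Return the index of the class whose weight has the highest dot product."""
    image_embedding = normalize(image_embedding)
    scores = [sum(a * b for a, b in zip(image_embedding, w)) for w in weights]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2-D "text embeddings": two templates per class (made-up numbers).
toy_text_embeddings = {
    "cat": [normalize([1.0, 0.1]), normalize([0.9, 0.2])],
    "dog": [normalize([0.1, 1.0]), normalize([0.2, 0.8])],
}
class_names = list(toy_text_embeddings)
weights = [class_weight(toy_text_embeddings[name]) for name in class_names]

print(class_names[predict([0.8, 0.3], weights)])
```

Because both the class weights and the image embedding are unit vectors, the dot product here equals cosine similarity.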

Using the class above, we can add a model to the list of models. Here is an example that adds and evaluates ViTamin-L:

from functools import partial
from unibench import Evaluator
from unibench.models_zoo.wrappers.clip import ClipModel
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    "ViTamin-L", pretrained="datacomp1b"
)

tokenizer = open_clip.get_tokenizer("ViTamin-L")

model = partial(
    ClipModel,
    model=model,
    model_name="vitamin_l_comp1b",
    tokenizer=tokenizer,
    input_resolution=model.visual.image_size[0],
    logit_scale=model.logit_scale,
)


eval = Evaluator(benchmarks_dir="/fsx-checkpoints/haideraltahan/.cache/unibench/data")

eval.add_model(model=model)
eval.update_benchmark_list(["imagenet1k"])
eval.update_model_list(["vitamin_l_comp1b"])
eval.evaluate()
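
Both the benchmark and model examples pass functools.partial objects instead of ready-made instances. How UniBench consumes them internally is not shown here; the sketch below only illustrates what partial does, using a hypothetical DummyWrapper class that is not part of UniBench.

```python
from functools import partial


class DummyWrapper:
    """Hypothetical stand-in for a model wrapper; not part of UniBench."""

    def __init__(self, model, model_name, **kwargs):
        print(f"constructing {model_name}")
        self.model = model
        self.model_name = model_name
        self.extra = kwargs


# partial stores the class and its arguments without calling __init__ yet.
factory = partial(DummyWrapper, model=object(), model_name="toy_model", tokenizer=None)
print(factory.keywords["model_name"])  # the name is readable before construction

# Construction happens only when the factory is called; extra kwargs can be added here.
wrapper = factory(device="cpu")
print(wrapper.extra)
```

Deferring construction this way lets the caller decide when the object is built and which extra keyword arguments to supply at that point.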

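Contributing

Contributions (e.g., adding new benchmarks or models), issues, and feature requests are welcome! For any changes, please open an issue first to discuss what you would like to change or improve.

License

The majority of UniBench is licensed under CC-BY-NC; however, portions of the project are available under separate license terms:

License	Libraries
MIT license	zipp, tabulate, rich, openai-clip, latextable, gdown
Apache 2.0 license	transformers, timm, opencv-python, open-clip-torch, ftfy, fire, debtcollector, datasets, oslo.concurrency
BSD license	torchvision, torch, seaborn, scipy, scikit-learn, fairscale, cycler, contourpy, click, GitPython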

Citation

If you use this repository in your research, please cite it as follows:

@inproceedings{altahan2024unibenchvisualreasoningrequires,
      title={UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling}, 
      author={Haider Al-Tahan and Quentin Garrido and Randall Balestriero and Diane Bouchacourt and Caner Hazirbas and Mark Ibrahim},
      year={2024},
      eprint={2408.04810},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.04810}, 
}

Recognition

The library structure was inspired by Robert Geirhos's work: https://github.com/bethgelab/model-vs-human


Release history

0.3.1 (this version): September 26, 2024
0.3.0: September 3, 2024
0.2.0: August 16, 2024

