
Intel® oneCCL Bindings for PyTorch (formerly known as torch_ccl)

Reason this release was yanked:

This is a dummy package

Project description


This repository holds the PyTorch bindings maintained by Intel for the Intel® oneAPI Collective Communications Library (oneCCL).

Introduction

PyTorch is an open-source machine learning framework.

Intel® oneCCL (collective communications library) is a library for efficient distributed deep learning training, implementing collectives such as allreduce, allgather, and alltoall. For more information on oneCCL, please refer to the oneCCL documentation.

The oneccl_bindings_for_pytorch module implements the PyTorch C10D ProcessGroup API and can be dynamically loaded as an external ProcessGroup. It currently works only on Linux platforms.
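Loading the extension is just a Python import. The sketch below is not taken from this page; it reuses the MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE environment variables from the Usage section later in this document to illustrate that importing the module is what makes the ccl backend selectable.

```python
# Minimal single-process sketch: importing oneccl_bindings_for_pytorch registers
# the "ccl" backend with torch.distributed as an external ProcessGroup.
import os

import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  (import side effect: registers "ccl")

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

dist.init_process_group(backend="ccl")
print(dist.get_backend())  # -> ccl
dist.destroy_process_group()
```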

Capability

The table below shows which functions are available for use with CPU / Intel dGPU tensors.

|                | CPU | GPU |
| -------------- | --- | --- |
| send           | ×   | ✓   |
| recv           | ×   | ✓   |
| broadcast      | ✓   | ✓   |
| all_reduce     | ✓   | ✓   |
| reduce         | ✓   | ✓   |
| all_gather     | ✓   | ✓   |
| gather         | ✓   | ✓   |
| scatter        | ×   | ×   |
| reduce_scatter | ✓   | ✓   |
| all_to_all     | ✓   | ✓   |
| barrier        | ✓   | ✓   |

PyTorch API Align

We recommend using Anaconda as the Python package management system. The corresponding branches (tags) of oneccl_bindings_for_pytorch and the supported PyTorch versions are listed below.

| torch    | oneccl_bindings_for_pytorch |
| -------- | --------------------------- |
| master   | master                      |
| v2.2.0   | ccl_torch2.2.0+cpu          |
| v2.1.0   | ccl_torch2.1.0+cpu          |
| v2.0.1   | ccl_torch2.0.100            |
| v1.13    | ccl_torch1.13               |
| v1.12.1  | ccl_torch1.12.100           |
| v1.12.0  | ccl_torch1.12               |
| v1.11.0  | ccl_torch1.11               |
| v1.10.0  | ccl_torch1.10               |
| v1.9.0   | ccl_torch1.9                |
| v1.8.1   | ccl_torch1.8                |
| v1.7.1   | ccl_torch1.7                |
| v1.6.0   | ccl_torch1.6                |
| v1.5-rc3 | beta09                      |

Usage details can be found in the README of the corresponding branch. The following section describes usage based on the v1.9 tag. If you want to use another version of torch-ccl, please check out that branch (tag). For pytorch-1.5.0-rc3, #PR28068 and #PR32361 are required to dynamically register the external ProcessGroup and to enable the alltoall collective communication primitive. The patch files for these two PRs are under the patches directory and can be applied directly.
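A quick way to check which pairing you actually have installed is sketched below; it assumes the package exposes a __version__ attribute, which is not stated in the table above.

```python
# Sketch: print the installed versions and compare them against the table above.
import torch
import oneccl_bindings_for_pytorch as torch_ccl

print("torch:", torch.__version__)
print("oneccl_bindings_for_pytorch:", torch_ccl.__version__)  # assumption: __version__ exists
```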

Requirements

  • Python 3.8 or later and a C++17 compiler

  • PyTorch v2.2.0

Build Option List

The following build options are supported by Intel® oneCCL Bindings for PyTorch*.

| Build Option | Default Value | Description |
| ------------ | ------------- | ----------- |
| COMPUTE_BACKEND |  | Set the oneCCL COMPUTE_BACKEND; set to dpcpp and use the DPC++ compiler to enable support for Intel XPU |
| USE_SYSTEM_ONECCL | OFF | Use the oneCCL library installed in the system |
| CCL_PACKAGE_NAME | oneccl-bind-pt | Set the wheel name |
| ONECCL_BINDINGS_FOR_PYTORCH_BACKEND | cpu | Set the BACKEND |
| CCL_SHA_VERSION | False | Add the git head sha version into the wheel name |

Launch Option List

The following launch options are supported by Intel® oneCCL Bindings for PyTorch*.

| Launch Option | Default Value | Description |
| ------------- | ------------- | ----------- |
| ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE | 0 | Set the verbose level of oneccl_bindings_for_pytorch |
| ONECCL_BINDINGS_FOR_PYTORCH_ENV_WAIT_GDB | 0 | Set to 1 to force oneccl_bindings_for_pytorch to wait for GDB to attach |
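Both launch options are plain environment variables, so they can be exported in the launching shell or, as the sketch below assumes, set from Python before the bindings are imported.

```python
# Hedged sketch: set the launch options from the table above via the environment.
import os

os.environ["ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE"] = "1"   # enable verbose output
os.environ["ONECCL_BINDINGS_FOR_PYTORCH_ENV_WAIT_GDB"] = "0"  # do not wait for GDB

import oneccl_bindings_for_pytorch  # noqa: F401  (assumption: options are read at import/init time)
```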

Installation

Install from Source

  1. Clone the oneccl_bindings_for_pytorch repository:

    git clone https://github.com/intel/torch-ccl.git && cd torch-ccl
    git submodule sync
    git submodule update --init --recursive
    
  2. Install oneccl_bindings_for_pytorch:

    # for CPU Backend Only
    python setup.py install
    # for XPU Backend: use DPC++ Compiler to enable support for Intel XPU
    # build with oneCCL from third party
    COMPUTE_BACKEND=dpcpp python setup.py install
    # build without oneCCL
    export INTELONEAPIROOT=${HOME}/intel/oneapi
    USE_SYSTEM_ONECCL=ON COMPUTE_BACKEND=dpcpp python setup.py install
    

Install Prebuilt Wheels

Wheel files are available for the following Python versions.

| Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | Python 3.11 |
| ----------------- | ---------- | ---------- | ---------- | ---------- | ----------- | ----------- |
| 2.2.0             |            |            | ✓          | ✓          | ✓           | ✓           |
| 2.1.0             |            |            | ✓          | ✓          | ✓           | ✓           |
| 2.0.100           |            |            | ✓          | ✓          | ✓           | ✓           |
| 1.13              |            | ✓          | ✓          | ✓          | ✓           |             |
| 1.12.100          |            | ✓          | ✓          | ✓          | ✓           |             |
| 1.12.0            |            | ✓          | ✓          | ✓          | ✓           |             |
| 1.11.0            |            | ✓          | ✓          | ✓          | ✓           |             |
| 1.10.0            | ✓          | ✓          | ✓          | ✓          |             |             |

python -m pip install oneccl_bind_pt==2.0.100 -f https://developer.intel.com/ipex-whl-stable-xpu

Runtime Dynamic Linking

  • If oneccl_bindings_for_pytorch was built without oneCCL and uses the oneCCL library in the system, dynamically link oneCCL from the oneAPI basekit (recommended usage):
source $basekit_root/ccl/latest/env/vars.sh

Note: Make sure the basekit is installed when using Intel® oneCCL Bindings for PyTorch* on Intel® GPUs.

  • If oneccl_bindings_for_pytorch was built with the bundled third-party oneCCL, or installed from a prebuilt wheel: dynamically link oneCCL and the Intel MPI library:
source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh

Dynamically link oneCCL only (without Intel MPI):

source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/vars.sh
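After sourcing one of the scripts above, a quick sanity check can be run; this is a sketch rather than part of the official instructions, relying only on the torch_ccl.cwd attribute already used in the commands above.

```python
# Sketch: confirm the bindings import cleanly and locate the bundled env scripts.
import oneccl_bindings_for_pytorch as torch_ccl

print(torch_ccl.cwd)  # install directory containing env/setvars.sh and env/vars.sh
```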

Usage

example.py

import torch.nn.parallel
import torch.distributed as dist
import oneccl_bindings_for_pytorch
import os

...

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['RANK'] = str(os.environ.get('PMI_RANK', 0))
os.environ['WORLD_SIZE'] = str(os.environ.get('PMI_SIZE', 1))

backend = 'ccl'
dist.init_process_group(backend, ...)
my_rank = dist.get_rank()
my_size = dist.get_world_size()
print("my rank = %d  my size = %d" % (my_rank, my_size))

...

model = torch.nn.parallel.DistributedDataParallel(model, ...)

...

(When oneccl_bindings_for_pytorch is built without oneCCL, use the oneCCL and, if needed, MPI from the system:)

source $basekit_root/ccl/latest/env/vars.sh
source $basekit_root/mpi/latest/env/vars.sh

mpirun -n <N> -ppn <PPN> -f <hostfile> python example.py


Performance Debugging

To debug the performance of communication primitives, PyTorch's [Autograd profiler](https://pytorch.org/docs/stable/autograd.html#profiler)
can be used to inspect the time spent inside oneCCL calls.

Example:

profiling.py

```python

import torch.nn.parallel
import torch.distributed as dist
import oneccl_bindings_for_pytorch
import os

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['RANK'] = str(os.environ.get('PMI_RANK', 0))
os.environ['WORLD_SIZE'] = str(os.environ.get('PMI_SIZE', 1))

backend = 'ccl'
dist.init_process_group(backend)
my_rank = dist.get_rank()
my_size = dist.get_world_size()
print("my rank = %d  my size = %d" % (my_rank, my_size))

x = torch.ones([2, 2])
y = torch.ones([4, 4])
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for _ in range(10):
        dist.all_reduce(x)
        dist.all_reduce(y)
dist.barrier()
print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_total"))
```

```
mpirun -n 2 -l python profiling.py
[0] my rank = 0  my size = 2
[0] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------
[0]                                                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls          Input Shapes
[0] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------
[0]                oneccl_bindings_for_pytorch::allreduce        91.41%     297.900ms        91.41%     297.900ms      29.790ms            10              [[2, 2]]
[0]     oneccl_bindings_for_pytorch::wait::cpu::allreduce         8.24%      26.845ms         8.24%      26.845ms       2.684ms            10      [[2, 2], [2, 2]]
[0]     oneccl_bindings_for_pytorch::wait::cpu::allreduce         0.30%     973.651us         0.30%     973.651us      97.365us            10      [[4, 4], [4, 4]]
[0]                oneccl_bindings_for_pytorch::allreduce         0.06%     190.254us         0.06%     190.254us      19.025us            10              [[4, 4]]
[0] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------
[0] Self CPU time total: 325.909ms
[0]
[1] my rank = 1  my size = 2
[1] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------
[1]                                                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls          Input Shapes
[1] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------
[1]                oneccl_bindings_for_pytorch::allreduce        96.03%     318.551ms        96.03%     318.551ms      31.855ms            10              [[2, 2]]
[1]     oneccl_bindings_for_pytorch::wait::cpu::allreduce         3.62%      12.019ms         3.62%      12.019ms       1.202ms            10      [[2, 2], [2, 2]]
[1]                oneccl_bindings_for_pytorch::allreduce         0.33%       1.082ms         0.33%       1.082ms     108.157us            10              [[4, 4]]
[1]     oneccl_bindings_for_pytorch::wait::cpu::allreduce         0.02%      56.505us         0.02%      56.505us       5.651us            10      [[4, 4], [4, 4]]
[1] -----------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------
[1] Self CPU time total: 331.708ms
[1]
```

Known Issues

For point-to-point communication, directly calling dist.send/recv after initializing the process group in the launch script will trigger a runtime error. This is because, in the current implementation, all processes in the group are expected to participate in the call that creates the communicator, while dist.send/recv involves only a pair of processes. Therefore, dist.send/recv should be used after a collective call, which ensures the participation of all processes. A further solution that supports calling dist.send/recv directly after initializing the process group is still under investigation.
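A hedged illustration of the workaround is sketched below; the xpu device and the exact ordering are assumptions on top of the description above, and per the capability table, send/recv with the ccl backend is only available for GPU tensors.

```python
# Sketch of the workaround: run a collective right after init_process_group so the
# communicator involving all ranks gets created, then use point-to-point send/recv.
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401

dist.init_process_group(backend="ccl")
rank = dist.get_rank()

dist.barrier()  # collective call first: all processes participate

device = "xpu"  # assumption: an Intel GPU device (CPU send/recv is not supported)
if rank == 0:
    dist.send(torch.ones(4, device=device), dst=1)
elif rank == 1:
    buf = torch.empty(4, device=device)
    dist.recv(buf, src=0)
    print(buf)
```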

License

BSD License



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

No source distribution files are available for this release. See the tutorial on generating distribution archives.

Built Distribution

oneccl_binding_pt-0.0.4-py3-none-any.whl (6.5 kB)

