# Intel® oneCCL Bindings for PyTorch (formerly known as torch_ccl)

This repository holds Intel-maintained PyTorch bindings for the Intel® oneAPI Collective Communications Library (oneCCL).

## Introduction

PyTorch is an open-source machine learning framework.

Intel® oneCCL (collective communications library) is a library for efficient distributed deep learning training, implementing collectives such as `allreduce`, `allgather`, and `alltoall`. For more information on oneCCL, please refer to the oneCCL documentation.

The `oneccl_bindings_for_pytorch` module implements the PyTorch C10D ProcessGroup API and can be dynamically loaded as an external ProcessGroup. It currently works only on Linux.
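A minimal sketch (single process, placeholder address and port) of how the binding plugs into the standard `torch.distributed` API; importing the module is what makes the `ccl` backend selectable:

```python
import os
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # importing registers the "ccl" backend

# Placeholder rendezvous settings for a single-process illustration only.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')

dist.init_process_group(backend='ccl', rank=0, world_size=1)
print(dist.get_backend())  # -> ccl
dist.destroy_process_group()
```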
## Capabilities

The table below shows which collective operations can be used with CPU and Intel dGPU tensors.
|  | CPU | GPU |
| --- | --- | --- |
| send | × | √ |
| recv | × | √ |
| broadcast | √ | √ |
| all_reduce | √ | √ |
| reduce | √ | √ |
| all_gather | √ | √ |
| gather | √ | √ |
| scatter | × | × |
| reduce_scatter | √ | √ |
| all_to_all | √ | √ |
| barrier | √ | √ |
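As a small hedged sketch (launched with `mpirun`, as in the usage example later in this README), the snippet below exercises two of the CPU-supported collectives from the table:

```python
import os
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # registers the "ccl" backend

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['RANK'] = str(os.environ.get('PMI_RANK', 0))
os.environ['WORLD_SIZE'] = str(os.environ.get('PMI_SIZE', 1))
dist.init_process_group('ccl')

# all_reduce and all_gather are marked as supported for CPU tensors above.
t = torch.ones(2, 2)
dist.all_reduce(t)  # every rank now holds a tensor filled with world_size
gathered = [torch.empty_like(t) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, t)

# scatter is marked as unsupported (× for both CPU and GPU) and should not
# be used with this backend.
dist.barrier()
```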
## PyTorch API Alignment

We recommend using Anaconda as the Python package management system. The table below lists the `oneccl_bindings_for_pytorch` branch (tag) corresponding to each supported PyTorch version.
| torch | oneccl_bindings_for_pytorch |
| --- | --- |
| master | master |
| v2.2.0 | ccl_torch2.2.0+cpu |
| v2.1.0 | ccl_torch2.1.0+cpu |
| v2.0.1 | ccl_torch2.0.100 |
| v1.13 | ccl_torch1.13 |
| v1.12.1 | ccl_torch1.12.100 |
| v1.12.0 | ccl_torch1.12 |
| v1.11.0 | ccl_torch1.11 |
| v1.10.0 | ccl_torch1.10 |
| v1.9.0 | ccl_torch1.9 |
| v1.8.1 | ccl_torch1.8 |
| v1.7.1 | ccl_torch1.7 |
| v1.6.0 | ccl_torch1.6 |
| v1.5-rc3 | beta09 |
Usage details can be found in the README of the corresponding branch. The sections below describe usage for the v1.9 tag; if you want to use another version of torch-ccl, please switch to that branch (tag), as sketched below. For pytorch-1.5.0-rc3, #PR28068 and #PR32361 are required to dynamically register the external ProcessGroup and enable the `alltoall` collective communication primitive. Patch files for these two PRs are provided in the `patches` directory and can be applied directly.
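For example, to pair the bindings with PyTorch 2.2.0 you would check out the `ccl_torch2.2.0+cpu` tag from the table above (the clone itself is covered in the Installation section; this is only a sketch of the tag switch):

```bash
cd torch-ccl
git checkout ccl_torch2.2.0+cpu           # tag taken from the table above
git submodule sync
git submodule update --init --recursive   # keep submodules in step with the tag
```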
## Requirements

- Python 3.8 or later and a C++17 compiler
- PyTorch v2.2.0
## Build Option List

The following build options are supported by Intel® oneCCL Bindings for PyTorch*.
| Build Option | Default Value | Description |
| --- | --- | --- |
| COMPUTE_BACKEND |  | Set the oneCCL `COMPUTE_BACKEND`; set to `dpcpp` to use the DPC++ compiler and enable support for Intel XPU |
| USE_SYSTEM_ONECCL | OFF | Use the oneCCL library installed in the system |
| CCL_PACKAGE_NAME | oneccl-bind-pt | Set the wheel name |
| ONECCL_BINDINGS_FOR_PYTORCH_BACKEND | cpu | Set the backend |
| CCL_SHA_VERSION | False | Add the git head sha version to the wheel name |
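As a hedged illustration (the values chosen here are examples, not recommendations), these options are passed as environment variables on the build command line:

```bash
# Build a CPU wheel under a custom name with the git sha appended to the version.
CCL_PACKAGE_NAME=oneccl-bind-pt CCL_SHA_VERSION=True python setup.py bdist_wheel

# Build against the system oneCCL with the DPC++ backend for Intel XPU.
export INTELONEAPIROOT=${HOME}/intel/oneapi
USE_SYSTEM_ONECCL=ON COMPUTE_BACKEND=dpcpp python setup.py install
```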
## Launch Option List

The following launch options are supported by Intel® oneCCL Bindings for PyTorch*.

| Launch Option | Default Value | Description |
| --- | --- | --- |
| ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE | 0 | Set the verbosity level of ONECCL_BINDINGS_FOR_PYTORCH |
| ONECCL_BINDINGS_FOR_PYTORCH_ENV_WAIT_GDB | 0 | Set to 1 to force oneccl_bindings_for_pytorch to wait for GDB to attach |
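These are environment variables read at runtime, so they can simply be prefixed to the launch command; a short sketch (the `mpirun` invocation mirrors the usage example later in this README):

```bash
# Verbose logging from the bindings for a 2-process run.
ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1 mpirun -n 2 python example.py

# Make every process wait for a debugger to attach before continuing.
ONECCL_BINDINGS_FOR_PYTORCH_ENV_WAIT_GDB=1 mpirun -n 2 python example.py
```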
## Installation

### Install from Source
- Clone the `oneccl_bindings_for_pytorch` repository.

  ```bash
  git clone https://github.com/intel/torch-ccl.git && cd torch-ccl
  git submodule sync
  git submodule update --init --recursive
  ```
- Install `oneccl_bindings_for_pytorch`.

  ```bash
  # for CPU Backend Only
  python setup.py install

  # for XPU Backend: use DPC++ Compiler to enable support for Intel XPU
  # build with oneCCL from third party
  COMPUTE_BACKEND=dpcpp python setup.py install

  # build without oneCCL (use the oneCCL from the system oneAPI installation)
  export INTELONEAPIROOT=${HOME}/intel/oneapi
  USE_SYSTEM_ONECCL=ON COMPUTE_BACKEND=dpcpp python setup.py install
  ```
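A quick smoke test (not part of the official instructions) to confirm the extension imports alongside PyTorch after installation:

```bash
python -c "import torch; import oneccl_bindings_for_pytorch; print(torch.__version__)"
```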
### Install Prebuilt Wheel

Wheel files are available for the following Python versions.
| Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 | Python 3.11 |
| --- | --- | --- | --- | --- | --- | --- |
| 2.2.0 |  |  | √ | √ | √ | √ |
| 2.1.0 |  |  | √ | √ | √ | √ |
| 2.0.100 |  |  | √ | √ | √ | √ |
| 1.13 |  | √ | √ | √ | √ |  |
| 1.12.100 |  | √ | √ | √ | √ |  |
| 1.12.0 |  | √ | √ | √ | √ |  |
| 1.11.0 |  | √ | √ | √ | √ |  |
| 1.10.0 | √ | √ | √ | √ |  |  |
```bash
python -m pip install oneccl_bind_pt==2.0.100 -f https://developer.intel.com/ipex-whl-stable-xpu
```
### Runtime Dynamic Linking

- If oneccl_bindings_for_pytorch was built without oneCCL and uses the oneCCL from the system, dynamically link oneCCL from the oneAPI basekit (recommended usage):

  ```bash
  source $basekit_root/ccl/latest/env/vars.sh
  ```

  Note: Make sure you have installed the basekit when using Intel® oneCCL Bindings for PyTorch* on Intel® GPUs.

- If oneccl_bindings_for_pytorch was built with the bundled third-party oneCCL, or installed from a prebuilt wheel, dynamically link oneCCL and the Intel MPI library:

  ```bash
  source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/setvars.sh
  ```

  To dynamically link oneCCL only (without Intel MPI):

  ```bash
  source $(python -c "import oneccl_bindings_for_pytorch as torch_ccl;print(torch_ccl.cwd)")/env/vars.sh
  ```
## Usage
example.py
```python
import os
import torch.nn.parallel
import torch.distributed as dist
import oneccl_bindings_for_pytorch

...

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['RANK'] = str(os.environ.get('PMI_RANK', 0))
os.environ['WORLD_SIZE'] = str(os.environ.get('PMI_SIZE', 1))

backend = 'ccl'
dist.init_process_group(backend, ...)
my_rank = dist.get_rank()
my_size = dist.get_world_size()
print("my rank = %d my size = %d" % (my_rank, my_size))

...

model = torch.nn.parallel.DistributedDataParallel(model, ...)

...
```
(When oneccl_bindings_for_pytorch is built without oneCCL, source the oneCCL and, if needed, MPI environments from the system:)

```bash
source $basekit_root/ccl/latest/env/vars.sh
source $basekit_root/mpi/latest/env/vars.sh

mpirun -n <N> -ppn <PPN> -f <hostfile> python example.py
```
## Performance Debugging

For debugging the performance of communication primitives, PyTorch's [Autograd profiler](https://pytorch.org/docs/stable/autograd.html#profiler) can be used to inspect time spent inside oneCCL calls.
Example:
profiling.py
```python
import torch.nn.parallel
import torch.distributed as dist
import oneccl_bindings_for_pytorch
import os

os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '29500'
os.environ['RANK'] = str(os.environ.get('PMI_RANK', 0))
os.environ['WORLD_SIZE'] = str(os.environ.get('PMI_SIZE', 1))

backend = 'ccl'
dist.init_process_group(backend)
my_rank = dist.get_rank()
my_size = dist.get_world_size()
print("my rank = %d my size = %d" % (my_rank, my_size))

x = torch.ones([2, 2])
y = torch.ones([4, 4])
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for _ in range(10):
        dist.all_reduce(x)
        dist.all_reduce(y)
dist.barrier()
print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_total"))
```
```bash
mpirun -n 2 -l python profiling.py
```

```
[0] my rank = 0 my size = 2
[0] ----------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------
[0] Name                                                   Self CPU %   Self CPU     CPU total %  CPU total    CPU time avg # of Calls   Input Shapes
[0] ----------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------
[0] oneccl_bindings_for_pytorch::allreduce                 91.41%       297.900ms    91.41%       297.900ms    29.790ms     10           [[2, 2]]
[0] oneccl_bindings_for_pytorch::wait::cpu::allreduce      8.24%        26.845ms     8.24%        26.845ms     2.684ms      10           [[2, 2], [2, 2]]
[0] oneccl_bindings_for_pytorch::wait::cpu::allreduce      0.30%        973.651us    0.30%        973.651us    97.365us     10           [[4, 4], [4, 4]]
[0] oneccl_bindings_for_pytorch::allreduce                 0.06%        190.254us    0.06%        190.254us    19.025us     10           [[4, 4]]
[0] ----------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------
[0] Self CPU time total: 325.909ms
[0]
[1] my rank = 1 my size = 2
[1] ----------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------
[1] Name                                                   Self CPU %   Self CPU     CPU total %  CPU total    CPU time avg # of Calls   Input Shapes
[1] ----------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------
[1] oneccl_bindings_for_pytorch::allreduce                 96.03%       318.551ms    96.03%       318.551ms    31.855ms     10           [[2, 2]]
[1] oneccl_bindings_for_pytorch::wait::cpu::allreduce      3.62%        12.019ms     3.62%        12.019ms     1.202ms      10           [[2, 2], [2, 2]]
[1] oneccl_bindings_for_pytorch::allreduce                 0.33%        1.082ms      0.33%        1.082ms      108.157us    10           [[4, 4]]
[1] oneccl_bindings_for_pytorch::wait::cpu::allreduce      0.02%        56.505us     0.02%        56.505us     5.651us      10           [[4, 4], [4, 4]]
[1] ----------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------
[1] Self CPU time total: 331.708ms
[1]
```
## Known Issues

For point-to-point communication, directly calling `dist.send`/`dist.recv` right after initializing the process group in the launch script triggers a runtime error. This is because, in the current implementation, all processes in the group are expected to participate in the call that creates the communicator, whereas `dist.send`/`dist.recv` only involves a pair of processes. Therefore, `dist.send`/`dist.recv` should be used after a collective call, which guarantees the participation of all processes, as sketched below. A further solution that supports calling `dist.send`/`dist.recv` directly after process group initialization is still under investigation.
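A rough sketch of this workaround, assuming an XPU-enabled build (the capability table lists send/recv as GPU-only) and that intel_extension_for_pytorch is available to provide the `xpu` device:

```python
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch              # registers the "ccl" backend
import intel_extension_for_pytorch  # noqa: F401  assumed; provides the xpu device

dist.init_process_group('ccl')  # MASTER_ADDR/PORT, RANK, WORLD_SIZE set as in the usage example

# Workaround: run a collective first so that every rank takes part in
# creating the communicator before any point-to-point call is made.
dist.barrier()

t = torch.ones(2, 2).to('xpu')
if dist.get_rank() == 0:
    dist.send(t, dst=1)
elif dist.get_rank() == 1:
    dist.recv(t, src=0)
```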
对于点对点通信,在启动脚本中初始化进程组后直接调用dist.send/recv将触发运行时错误。因为在我们当前实现中,组中的所有进程都应参与此调用以创建通信器,而dist.send/recv只有一对进程的参与。因此,应在集体调用之后使用dist.send/recv,以确保所有进程的参与。支持在初始化进程组后直接调用dist.send/recv的进一步解决方案仍在研究中。