A fast tool to calculate Hamming distances

A small and fast C++ tool to calculate pairwise distances between gene sequences given in fasta format.

Python interface

To use the Python interface, install it from PyPI:

python -m pip install hammingdist

Distance matrix

You can then use it from Python as follows:

import hammingdist

# To see the different optional arguments available:
help(hammingdist.from_fasta)

# To import all sequences from a fasta file
data = hammingdist.from_fasta("example.fasta")

# To import only the first 100 sequences from a fasta file
data = hammingdist.from_fasta("example.fasta", n=100)

# To import all sequences and remove any duplicates
data = hammingdist.from_fasta("example.fasta", remove_duplicates=True)

# To import all sequences from a fasta file, also treating 'X' as a valid character
data = hammingdist.from_fasta("example.fasta", include_x=True)

# The distance data can be accessed point-wise, though looping over all distances might be quite inefficient
print(data[14,42])

Output formats

The constructed distance matrix can be written to disk in several formats:

# The data can be written to disk in csv format (default `distance` Ripser format) and retrieved:
data.dump("backup.csv")
retrieval = hammingdist.from_csv("backup.csv")

# It can also be written in lower triangular format (comma-delimited row-major, `lower-distance` Ripser format):
data.dump_lower_triangular("lt.txt")
retrieval = hammingdist.from_lower_triangular("lt.txt")

# Or in sparse format (`sparse` Ripser format: space-delimited triplet of `i j d(i,j)`
# with one line for each distance entry i > j which is not above threshold):
data.dump_sparse("sparse.txt", threshold=3)

# If the `remove_duplicates` option was used, the sequence indices can also be written.
# For each input sequence, this prints the corresponding index in the output:
data.dump_sequence_indices("indices.txt")

# The lower-triangular distance elements can also be directly accessed as a 1-d numpy array:
lt_array = data.lt_array
# The elements in this array correspond to the 2-d indices (row=1,col=0), (row=2,col=0), (row=2,col=1), ...
# These indices can be generated using the numpy tril_indices function, e.g. to construct the lower-triangular matrix:
import math
import numpy as np

# Number of sequences, recovered from the array length n_seq * (n_seq - 1) / 2
n_seq = (1 + math.isqrt(1 + 8 * len(lt_array))) // 2
lt_matrix = np.zeros((n_seq, n_seq))
lt_matrix[np.tril_indices(n_seq, -1)] = lt_array
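
The ordering of `lt_array` described above can also be written out without numpy. The following is a minimal sketch and not part of hammingdist: the helper names `lt_index` and `full_matrix` are hypothetical, and the input is a hand-written list standing in for `data.lt_array`.

```python
def lt_index(row, col):
    """Position of element (row, col), with row > col, in the flattened
    strictly-lower-triangular array ordered (1,0), (2,0), (2,1), ..."""
    if row <= col:
        raise ValueError("expected row > col")
    return row * (row - 1) // 2 + col


def full_matrix(lt_values, n_seq):
    """Expand the flattened lower triangle into a symmetric n_seq x n_seq
    list of lists with zeros on the diagonal."""
    if len(lt_values) != n_seq * (n_seq - 1) // 2:
        raise ValueError("lt_values has the wrong length for n_seq")
    matrix = [[0] * n_seq for _ in range(n_seq)]
    for row in range(1, n_seq):
        for col in range(row):
            d = lt_values[lt_index(row, col)]
            matrix[row][col] = d
            matrix[col][row] = d
    return matrix


# Hypothetical stand-in for data.lt_array with 3 sequences:
# d(1,0) = 0, d(2,0) = 2, d(2,1) = 2
example_lt = [0, 2, 2]
print(full_matrix(example_lt, 3))
```

Element (row, col) sits at offset `row * (row - 1) / 2 + col` because the rows above it hold 0 + 1 + ... + (row - 1) elements in total.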

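Duplicates

When from_fasta is called with the option remove_duplicates=True, duplicate sequences are removed before the distance matrix is constructed.

For example, given the following three input sequences:

Index  Sequence
0      ACG
1      ACG
2      TAG

The distance matrix will be the 2x2 matrix of distances between ACG and TAG:

     ACG  TAG
ACG  0    2
TAG  2    0

Each index in the original list of sequences corresponds to a row of the distance matrix:

Index  Sequence  Row in distance matrix
0      ACG       0
1      ACG       0
2      TAG       1

The last column is what DataSet.dump_sequence_indices writes to disk.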
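
The mapping in the last column can be reproduced with a few lines of plain Python. This is only an illustrative sketch, not the library's implementation: the function name `sequence_indices` is hypothetical, and it assumes that rows are numbered in order of first appearance, which matches the table above.

```python
def sequence_indices(sequences):
    """For each input sequence, return the row it maps to in the
    de-duplicated distance matrix (rows numbered by first occurrence)."""
    row_of = {}    # unique sequence -> row in the distance matrix
    indices = []
    for seq in sequences:
        if seq not in row_of:
            row_of[seq] = len(row_of)
        indices.append(row_of[seq])
    return indices


print(sequence_indices(["ACG", "ACG", "TAG"]))  # [0, 0, 1]
```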

It can also be constructed (as a numpy array) without calculating the distance matrix by using hammingdist.fasta_sequence_indices:

import hammingdist

sequence_indices = hammingdist.fasta_sequence_indices(fasta_file)

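Maximum distance values

By default, the elements of the distance matrix returned by hammingdist.from_fasta have a maximum value of 255. You can also set a smaller maximum value using the max_distance argument. For distances larger than 255, hammingdist.from_fasta_large supports distances up to 65535 (but uses twice as much RAM).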
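
To get a feel for this memory trade-off, here is a back-of-the-envelope estimate. It rests on assumptions that are not stated above: that only the n(n-1)/2 strictly-lower-triangular distances are stored, that each takes one byte for the 255 limit and two bytes for the 65535 limit, and that larger distances are clamped to the maximum. The function names are hypothetical and actual memory use may differ.

```python
def lt_elements(n_seq):
    """Number of pairwise distances between n_seq sequences."""
    return n_seq * (n_seq - 1) // 2


def estimated_bytes(n_seq, max_value=255):
    """Rough storage estimate, assuming 1 byte per element when the
    maximum fits in 8 bits and 2 bytes otherwise."""
    bytes_per_element = 1 if max_value <= 255 else 2
    return lt_elements(n_seq) * bytes_per_element


def saturate(distance, max_value=255):
    """Clamp a distance to the maximum representable value (assumed behavior)."""
    return min(distance, max_value)


n_seq = 100_000
print(f"{estimated_bytes(n_seq) / 1e9:.2f} GB with a maximum of 255")
print(f"{estimated_bytes(n_seq, 65535) / 1e9:.2f} GB with a maximum of 65535")
```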

Distances from reference sequence

The distance of each sequence in a fasta file from a given reference sequence can be calculated using:

import hammingdist

distances = hammingdist.fasta_reference_distances(sequence, fasta_file, include_x=True)

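This function returns a numpy array containing the distance of each sequence from the reference sequence.

You can also calculate the distance between two individual sequences: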

import hammingdist

distance = hammingdist.distance("ACGTX", "AAGTX", include_x=True)
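
For reference, the quantity being computed is the number of positions at which two equal-length sequences differ. The sketch below is a plain-Python illustration, not the library's implementation: the function names are hypothetical, and it does not reproduce any special treatment hammingdist may apply to 'X' or to invalid characters, so results may differ for sequences containing them.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def reference_distances(reference, sequences):
    """Distance of every sequence from a single reference sequence."""
    return [hamming_distance(reference, seq) for seq in sequences]


print(hamming_distance("ACG", "TAG"))                     # 2
print(reference_distances("ACG", ["ACG", "ACG", "TAG"]))  # [0, 0, 2]
```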

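OpenMP on Linux

On Linux, hammingdist is built with OpenMP (multithreading) support and automatically makes use of all available CPU threads.

CUDA on Linux

On Linux, hammingdist is also built with CUDA (Nvidia GPU) support. To use the GPU instead of the CPU, set use_gpu=True when calling from_fasta. Here we also set the maximum distance to 2: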

import hammingdist

data = hammingdist.from_fasta("example.fasta", use_gpu=True, max_distance=2)

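Additionally, the lower triangular matrix file can now be constructed directly from a fasta file on the GPU using the from_fasta_to_lower_triangular function. This avoids storing the entire distance matrix in memory and interleaves computation on the GPU with disk I/O on the CPU, so it requires less RAM and runs faster.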

import hammingdist

hammingdist.from_fasta_to_lower_triangular('input_fasta.txt', 'output_lower_triangular.txt', use_gpu=True, max_distance=2)
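
The memory saving from writing rows as they are computed can be illustrated on the CPU in plain Python. This is a rough sketch only: the function names are hypothetical, it assumes the output file has one comma-delimited line per row of the lower triangle and that distances are clamped at max_distance, and it does not use the GPU or overlap computation with disk I/O.

```python
import os
import tempfile


def clamped_distance(a, b, max_distance=255):
    """Hamming distance of two equal-length sequences, clamped at max_distance."""
    d = sum(1 for x, y in zip(a, b) if x != y)
    return min(d, max_distance)


def write_lower_triangular(sequences, path, max_distance=255):
    """Write one row of the lower triangle at a time, so only a single
    row of distances is held in memory at any point."""
    with open(path, "w") as f:
        for row in range(1, len(sequences)):
            distances = [
                clamped_distance(sequences[row], sequences[col], max_distance)
                for col in range(row)
            ]
            f.write(",".join(str(d) for d in distances) + "\n")


path = os.path.join(tempfile.mkdtemp(), "lt.txt")
write_lower_triangular(["ACG", "ACG", "TAG"], path, max_distance=2)
with open(path) as f:
    print(f.read())
```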

Performance history

A rough measure of the impact of the different performance improvements in hammingdist:

[Figure: overview]
