deduplify

A Python tool to search for and remove duplicated files in messy datasets.

Overview
Installation
- From PyPI
- Manual install
Usage
Contributing

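Overview

deduplify is a Python command line tool that searches a directory tree for duplicated files and optionally removes them. It recursively generates an MD5 hash for every file under the target directory and identifies which filepaths produce unique hashes and which produce duplicated ones. When deleting duplicated files, it removes those deepest in the directory tree first, leaving the last remaining copy in place.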
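
The hashing step described above can be illustrated with a minimal sketch. This is not deduplify's own code: the function names `md5_of_file` and `group_by_hash` and the chunk size are hypothetical, chosen only to show how files under a directory tree might be grouped by MD5 digest using the standard library.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(root):
    """Walk root recursively and map each MD5 digest to the paths that produced it."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[md5_of_file(path)].append(path)
    return dict(groups)
```

Any digest that maps to more than one path marks a set of files with identical content, which is the candidate set for deduplication.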

Installation

deduplify requires Python v3.7 or later and has been developed against v3.8.

From PyPI

First, make sure your version of pip is up to date.

python -m pip install --upgrade pip

Then install deduplify.

pip install deduplify

Manual install

Begin by cloning this repository and changing into it.

git clone https://github.com/Living-with-machines/deduplify.git
cd deduplify

Now run the setup script. This installs all requirements and places the CLI tool on your Python $PATH.

python setup.py install

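Usage

deduplify has 3 commands: hash, compare and clean.

Hashing files

The hash command takes the path to a target directory as an argument. It walks the structure of this directory tree, generates an MD5 hash for every file, and outputs a database as a JSON file. The name of that file can be overridden with the --dbfile [-f] flag.

Each document in the generated database can be described as a dictionary with the following properties: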

{
  "filepath": "",     # String. The full path to a given file.
  "hash": "",         # String. The MD5 hash of the given file.
  "duplicate": bool,  # Boolean. Whether this hash is repeated in the database (True) or not (False).
}
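
As an illustration of the record shape above, the following sketch flattens a mapping of hashes to filepaths into such dictionaries. The function name `build_records` and the input mapping are hypothetical, and the sketch makes no claim about how deduplify lays out the JSON file on disk.

```python
import json


def build_records(groups):
    """Flatten a {hash: [filepaths]} mapping into one record per file."""
    records = []
    for file_hash, paths in groups.items():
        # A hash is a duplicate when more than one filepath produced it.
        is_duplicate = len(paths) > 1
        for path in paths:
            records.append(
                {"filepath": path, "hash": file_hash, "duplicate": is_duplicate}
            )
    return records


# Hypothetical input: two files share one hash, a third is unique.
example = {"abc123": ["data/a.txt", "data/old/a.txt"], "def456": ["data/b.txt"]}
print(json.dumps(build_records(example), indent=2))
```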

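By default, deduplify generates hashes for all files under the directory. To restrict the search to one or more specific file extensions, use the --exts flag.

Command line usage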

usage: deduplify hash [-h] [-c COUNT] [-v] [-f DBFILE] [--exts [EXTS]] [--restart] dir

positional arguments:
  dir                   Path to directory to begin search from

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        Number of threads to parallelise over. Default: 1
  -v, --verbose         Print logging messages to the console
  -f DBFILE, --dbfile DBFILE
                        Destination database for file hashes. Must be a JSON file. Default: file_hashes.json
  --exts [EXTS]         A list of file extensions to search for.
  --restart             Restart a run of hashing files and skip over files that have already been hashed. Output file containing a database of
                        filenames and hashes must already exist.

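Comparing files

The compare command reads in the JSON database generated by the hash command. If the data were saved under a different name, the database name can be overridden with the --infile [-f] flag. For each hash, the command checks whether the base names of all filepaths that generated that hash are equivalent. A match indicates that the files are true duplicates, since both their names and their contents match. If the names do not match, the same content has been saved under different filenames. In that case, a warning is raised asking the user to investigate these files manually.

If all the filenames for a given hash match, the shortest filepath is removed from the list and the remaining filepaths are returned for deletion. To actually delete the files, run the compare command with the --purge flag.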
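
The selection rule described above can be sketched as follows. This is an illustrative reading rather than deduplify's implementation: the function name `select_for_deletion` is hypothetical, "shortest" is assumed to mean the fewest characters in the path string, and mismatched names are returned for review instead of raising a warning.

```python
import os


def select_for_deletion(paths):
    """Given filepaths that share one hash, return (to_delete, needs_review)."""
    names = {os.path.basename(path) for path in paths}
    if len(names) > 1:
        # Same content under different filenames: leave for manual investigation.
        return [], sorted(paths)
    # All names match: keep the shortest filepath, return the rest for deletion.
    keep = min(paths, key=len)
    return [path for path in paths if path != keep], []


to_delete, needs_review = select_for_deletion(
    ["data/a.txt", "data/x/y/a.txt", "data/x/a.txt"]
)
print(to_delete)
```

Separating "what would be deleted" from the deletion itself mirrors the role of the --purge flag: the list can be inspected before anything is removed.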

To ensure that all duplicated files are removed, the recommended workflow is as follows:

deduplify hash target_dir  # First pass at hashing files
deduplify compare --purge  # Delete duplicated files
deduplify hash target_dir  # Second pass at hashing files
deduplify compare          # Compare the filenames again. The code should return nothing to compare

Command line usage

usage: deduplify compare [-h] [-c COUNT] [-v] [-f INFILE] [--list-files] [--purge]

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        Number of threads to parallelise over. Default: 1
  -v, --verbose         Print logging messages to the console
  -f INFILE, --infile INFILE
                        Database to analyse. Must be a JSON file. Default: file_hashes.json
  --list-files          List duplicated files. Default: False
  --purge               Deletes duplicated files. Default: False

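Cleaning up

After duplicated files have been purged, the target directory may be left with empty subdirectories. Running the clean command locates and deletes these empty subdirectories.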
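
A minimal sketch of this kind of clean-up, assuming a bottom-up walk with the standard library, is shown below. The function name `remove_empty_dirs` is hypothetical, and keeping the root directory itself is an assumption rather than documented deduplify behaviour.

```python
import os


def remove_empty_dirs(root):
    """Delete empty subdirectories under root, deepest first; return removed paths."""
    removed = []
    # topdown=False visits children before parents, so a directory that only
    # contained empty subdirectories is itself empty by the time it is checked.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue  # assumption: the root directory is always kept
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```

The explicit `os.listdir` check is used instead of the lists yielded by `os.walk`, because those lists were collected before any child directories were removed.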

Command line usage

usage: deduplify clean [-h] [-c COUNT] [-v] dir

positional arguments:
  dir                   Path to directory to begin search from

optional arguments:
  -h, --help            show this help message and exit
  -c COUNT, --count COUNT
                        Number of threads to parallelise over. Default: 1
  -v, --verbose         Print logging messages to the console

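Global arguments

The following flags can be passed to any of the deduplify commands.

--verbose [-v]: Prints logging messages to the console instead of saving them to the deduplify.log file.
--count [-c]: Some processes in deduplify can be parallelised over multiple threads when working with large datasets. To do this, pass the --count flag with the (integer) number of threads to parallelise over. An error is raised if more threads are requested than there are CPUs available on the host machine.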
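
The thread-count check described for --count could look something like the sketch below. The function name `validate_count`, the exception type, and the lower bound of 1 are assumptions made for illustration; only the upper bound against the CPU count comes from the description above.

```python
import os


def validate_count(count):
    """Return count if it is usable on this machine, otherwise raise ValueError."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    if count < 1:
        # Assumption: a non-positive thread count is rejected.
        raise ValueError("count must be at least 1")
    if count > available:
        raise ValueError(
            f"Requested {count} threads but only {available} CPUs are available"
        )
    return count
```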

Contributing

Thank you for wanting to contribute to deduplify! :tada: :sparkling_heart: To get you started, please read our Code of Conduct and Contributing Guidelines.


Release history

- 20.9.0 (yanked; reason: incorrect version number): September 11, 2020
- 0.5.0 (this version): April 24, 2022
- 0.4.2: March 9, 2022
- 0.4.1: March 6, 2022
- 0.4.0: February 28, 2022
- 0.3.0: February 28, 2022
- 0.2.0: February 26, 2022
- 0.1.5: February 26, 2022
- 0.1.4: February 26, 2022
- 0.1.3: February 26, 2022
- 0.1.2: October 1, 2020

Download files

Source distribution

deduplify-0.5.0.tar.gz (12.2 kB), uploaded April 24, 2022

Algorithm	Hash digest
SHA256	`91c23f348bf4a5c46d33535388827e10872fb328d567f5c81a6f0629262ac94f`
MD5	`d6fa8011b2a1e459a8fd3ff2c5c80bb7`
BLAKE2b-256	`309518145ab4d547784bdd32df8775f281a761e66ad70a07237d76c9cbc2d315`

Built distribution

deduplify-0.5.0-py3-none-any.whl (11.4 kB), uploaded April 24, 2022, Python 3

Algorithm	Hash digest
SHA256	`6d4508981367b6c6947d945347323e062d769fe02602172f305f30e8f050c8c1`
MD5	`43a2efee719d0827aaa017ba3a73afe3`
BLAKE2b-256	`427f974127b0ea7a92d4ef6574567b3349b80c8cea2194f7293b8f0afe9939f1`
