rpatterson.stripdupes · PyPI · The Python Package Index

Strip duplicated sequences of lines from files

Project description

Installation

$ easy_install rpatterson.stripdupes

Usage

See the help for the stripdupes console script.

>>> import subprocess
>>> popen = subprocess.Popen(
...     [stripdupes_script, '--help'],
...     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>>> print popen.stdout.read()
Usage: stripdupes [options]
Strip duplicated sequences of lines.
Options:
  -h, --help  show this help message and exit
  -m NUM, --min=NUM  Minimum length of duplicated sequence.  If
                     NUM is less than one, use a proportion of the
                     total number of lines, otherwise NUM is a
                     number of lines. [default: 0.01]
  -p REGEXP, --pattern=REGEXP
                        Regular expression pattern used to
                        normalize strings in sequences of strings.
                        The default matches all whitespace. Use an
                        empty string to disable. [default: '\s+']
  -r STRING, --repl=STRING
                        String to replace matches of pattern with
                        for normalizing strings in sequences of
                        strings. [default: ' ']

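If the given input file contains sequences of lines that are longer than the threshold and are duplicated elsewhere in the input file, the output will not include the duplicated sequences.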

>>> input = """\
... foo
... foo
... bar
... baz
... qux
... quux
... foo
... bar
... baz
... qux
... bah
... blah1
... quux
... blah
... quux
... fin
... """

>>> import cStringIO
>>> from rpatterson import stripdupes
>>> for line in stripdupes.strip(
...     cStringIO.StringIO(input).readlines()): print line,
foo
bar
baz
qux
quux
bah
blah1
blah
fin
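
The behaviour shown above can be approximated with a short sketch. This is not the library's implementation, which is not shown here; the function name `strip_repeated_runs` and its `min_len` parameter are hypothetical, and whitespace normalization is left out. It assumes that a run of lines counts as a duplicate when the same run already occurred earlier in the input.

```python
def strip_repeated_runs(lines, min_len=1):
    """Yield lines, skipping runs of at least min_len lines seen earlier.

    Hypothetical sketch: a naive search, not the stripdupes algorithm.
    """
    lines = list(lines)
    i = 0
    while i < len(lines):
        best = 0
        # For every earlier start position j, measure how long the run
        # starting at j matches the run starting at i (without overlap).
        for j in range(i):
            length = 0
            while (j + length < i and i + length < len(lines)
                   and lines[j + length] == lines[i + length]):
                length += 1
            best = max(best, length)
        if best > 0 and best >= min_len:
            i += best       # drop the whole repeated run
        else:
            yield lines[i]  # keep a line that starts no long-enough repeat
            i += 1
```

With `min_len=1` this sketch reproduces the output of the example above. With a larger `min_len`, short repeats such as the second `foo` would be kept and only longer repeated runs dropped.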

>>> input = """\
... blah
... quux
... bah
... foo
... foo\t
... bar
... baz
... qux
... quux
... foo
... bar
... baz
... qux
... fin
... fin
... fin
... null
... fin
... """

>>> for line in stripdupes.strip(
...     cStringIO.StringIO(input).readlines()): print line,
blah
quux
bah
foo
bar
baz
qux
fin
null
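
In this second example the line `foo\t` is treated as a duplicate of `foo`. The help text suggests why: lines are normalized with the `--pattern` and `--repl` options before they are compared. A minimal sketch of that normalization step, assuming it amounts to a plain regular expression substitution (the function name `normalize` is hypothetical, and the library's internals may differ):

```python
import re

def normalize(line, pattern=r'\s+', repl=' '):
    """Return a comparison key for a line.

    Hypothetical sketch: the defaults mirror the documented option
    defaults; an empty or None pattern disables normalization.
    """
    if not pattern:
        return line
    return re.sub(pattern, repl, line)

# Both lines collapse to the same key, so they compare as equal.
same_key = normalize('foo\n') == normalize('foo\t\n')
```

Under these assumptions, `foo` followed by a newline and `foo` followed by a tab and a newline both normalize to `foo` plus a single space, which would explain why only one of them survives in the output.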

Make sure that odd sequences, such as empty or single-item ones, are handled.

>>> list(stripdupes.strip([]))
[]
>>> list(stripdupes.strip(['foo']))
['foo']

If the duplicated sequence is 1% of the length of the original sequence or shorter, the duplicate is not removed.

>>> seq = range(149)+[0]
>>> len(seq)
150
>>> seq[0] == seq[149]
True
>>> len(list(stripdupes.strip(seq, pattern=None)))
150

>>> seq = range(148)+[0]
>>> len(seq)
149
>>> seq[0] == seq[148]
True
>>> len(list(stripdupes.strip(seq, pattern=None)))
148
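
The arithmetic behind these two results can be sketched from the `--min` help text: a value below one is a proportion of the total number of lines, and any other value is a number of lines. The function name `min_run_length` is hypothetical, and the sketch deliberately leaves the result unrounded, because the rounding rule is not documented here.

```python
def min_run_length(num, total_lines):
    """Resolve a --min style value into a threshold measured in lines.

    Hypothetical sketch based only on the help text: values below one
    are a proportion of the total, other values are a line count.
    """
    if num < 1:
        return num * total_lines
    return num

threshold_150 = min_run_length(0.01, 150)  # proportion of 150 lines
threshold_149 = min_run_length(0.01, 149)  # proportion of 149 lines
```

For 150 items the default proportion gives 1.5 lines, and for 149 items it gives 1.49. The one-item repeat is kept in the first case and removed in the second, which would be consistent with the threshold being rounded to the nearest whole line, though that is only an inference from these two examples.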

Changelog

0.1 - 2009-05-27

Initial release

Project details

This version

0.1

May 28, 2009

Download files

Source distribution

rpatterson.stripdupes-0.1.tar.gz (6.3 kB)

Uploaded: May 28, 2009 (source)

Hashes for rpatterson.stripdupes-0.1.tar.gz

Algorithm	Hash digest
SHA256	`561004ae1cb2bdd70b1d636890ba7defb7aea64e997eec5162c8b4bd3da1eb62`
MD5	`7aff2d3323800088d519c3bfed82ac76`
BLAKE2b-256	`48bf27cd00d8e34e7eeaacd019c2329bb141d3d608ae68a72ffb899f6bddd43d`
