Strip duplicated sequences of lines from a file
Installation
All you need is easy_install:
$ easy_install rpatterson.stripdupes
Usage
See the help output of the stripdupes console script.
>>> import subprocess
>>> popen = subprocess.Popen(
...     [stripdupes_script, '--help'],
...     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>>> print popen.stdout.read()
Usage: stripdupes [options]
Strip duplicated sequences of lines.
Options:
  -h, --help            show this help message and exit
  -m NUM, --min=NUM     Minimum length of duplicated sequence. If NUM is less
                        than one, use a proportion of the total number of
                        lines, otherwise NUM is a number of lines. [default:
                        0.01]
  -p REGEXP, --pattern=REGEXP
                        Regular expression pattern used to normalize strings
                        in sequences of strings. The default matches all
                        whitespace. Use an empty string to disable.
                        [default: '\s+']
  -r STRING, --repl=STRING
                        String to replace matches of pattern with for
                        normalizing strings in sequences of strings.
                        [default: ' ']
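The --pattern and --repl options describe a normalization step applied before lines are compared. The following is a minimal sketch of what such a step could look like, assuming it amounts to a plain `re.sub` call; `normalize` is a hypothetical helper written for illustration and is not part of rpatterson.stripdupes.

```python
import re


def normalize(line, pattern=r'\s+', repl=' '):
    """Hypothetical helper: replace every match of pattern with repl.

    An empty (or None) pattern disables normalization, mirroring the
    "Use an empty string to disable" note in the help text.
    """
    if not pattern:
        return line
    return re.sub(pattern, repl, line)


# With the defaults, trailing whitespace differences disappear.
print(normalize('foo\t\n') == normalize('foo\n'))
# With normalization disabled, the two lines stay distinct.
print(normalize('foo\t\n', pattern='') == normalize('foo\n', pattern=''))
```

Under these assumptions, a line such as "foo" followed by a tab would compare equal to a plain "foo" line.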
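If the contents of the given input file include sequences of lines that meet the minimum length and are duplicated elsewhere in the input file, the output will not include those duplicated sequences.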
>>> input = """\
... foo
... foo
... bar
... baz
... qux
... quux
... foo
... bar
... baz
... qux
... bah
... blah1
... quux
... blah
... quux
... fin
... """
>>> import cStringIO
>>> from rpatterson import stripdupes
>>> for line in stripdupes.strip(
...     cStringIO.StringIO(input).readlines()): print line,
foo
bar
baz
qux
quux
bah
blah1
blah
fin

>>> input = """\
... blah
... quux
... bah
... foo
... foo\t
... bar
... baz
... qux
... quux
... foo
... bar
... baz
... qux
... fin
... fin
... fin
... null
... fin
... """
>>> for line in stripdupes.strip(
...     cStringIO.StringIO(input).readlines()): print line,
blah
quux
bah
foo
bar
baz
qux
fin
null
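In both examples above the input is short enough that even a single repeated line is stripped, so the output coincides with keeping only the first occurrence of each whitespace-normalized line. The sketch below illustrates that special case only. It is not the library's algorithm, it does not handle longer minimum sequence lengths, and `strip_repeated_lines` is a hypothetical name.

```python
import re


def strip_repeated_lines(lines, pattern=r'\s+', repl=' '):
    """Hypothetical sketch: yield each line the first time its
    normalized form is seen, preserving the original order."""
    seen = set()
    for line in lines:
        # Normalize only for comparison; the original line is yielded.
        key = re.sub(pattern, repl, line) if pattern else line
        if key not in seen:
            seen.add(key)
            yield line


sample = [
    'blah\n', 'quux\n', 'bah\n', 'foo\n', 'foo\t\n', 'bar\n', 'baz\n',
    'qux\n', 'quux\n', 'foo\n', 'bar\n', 'baz\n', 'qux\n', 'fin\n',
    'fin\n', 'fin\n', 'null\n', 'fin\n',
]
result = [line.strip() for line in strip_repeated_lines(sample)]
print(result)
```

Run on the second input above, this sketch yields the same nine lines as the doctest: blah, quux, bah, foo, bar, baz, qux, fin, null.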
Make sure odd sequences, such as an empty sequence or a single item, are handled.
>>> list(stripdupes.strip([]))
[]
>>> list(stripdupes.strip(['foo']))
['foo']
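Duplicated sequences shorter than the minimum length, by default about 1% of the original sequence, are not removed. Below, a single repeated item is kept in a 150-item sequence but removed from a 149-item one.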
>>> seq = range(149)+[0]
>>> len(seq)
150
>>> seq[0] == seq[149]
True
>>> len(list(stripdupes.strip(seq, pattern=None)))
150

>>> seq = range(148)+[0]
>>> len(seq)
149
>>> seq[0] == seq[148]
True
>>> len(list(stripdupes.strip(seq, pattern=None)))
148
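The difference between the 150-item and 149-item results can be explained if the proportional minimum is rounded to a whole number of items. The sketch below shows one reading that is consistent with the two examples. The rounding rule is an assumption inferred from those results, not something the package documents, and `min_sequence_length` is a hypothetical helper.

```python
def min_sequence_length(num, total):
    """Hypothetical helper, not part of rpatterson.stripdupes.

    If num is less than one, treat it as a proportion of total;
    otherwise treat it as a number of lines. Rounding to the nearest
    whole number is an assumption made for this sketch.
    """
    if num < 1:
        return int(round(num * total))
    return int(num)


# 1% of 150 is 1.5, which rounds to 2: a single repeated item is too short.
print(min_sequence_length(0.01, 150))
# 1% of 149 is 1.49, which rounds to 1: a single repeated item qualifies.
print(min_sequence_length(0.01, 149))
```

Under this assumption, a duplicated sequence is stripped when its length is at least the computed minimum.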
Changelog
0.1 - 2009-05-27
Initial release
Hashes for rpatterson.stripdupes-0.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 561004ae1cb2bdd70b1d636890ba7defb7aea64e997eec5162c8b4bd3da1eb62
MD5 | 7aff2d3323800088d519c3bfed82ac76
BLAKE2b-256 | 48bf27cd00d8e34e7eeaacd019c2329bb141d3d608ae68a72ffb899f6bddd43d