Strip duplicated sequences of lines from files
Detailed description of the rpatterson.stripdupes Python project
Usage
See the help message of the stripdupes console script.
>>> import subprocess
>>> popen = subprocess.Popen(
...     [stripdupes_script, '--help'],
...     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>>> print popen.stdout.read()
Usage: stripdupes [options]
Strip duplicated sequences of lines.
Options:
  -h, --help            show this help message and exit
  -m NUM, --min=NUM     Minimum length of duplicated sequence. If NUM is
                        less than one, use a proportion of the total
                        number of lines, otherwise NUM is a number of
                        lines. [default: 0.01]
  -p REGEXP, --pattern=REGEXP
                        Regular expression pattern used to normalize
                        strings in sequences of strings. The default
                        matches all whitespace. Use an empty string to
                        disable. [default: '\s+']
  -r STRING, --repl=STRING
                        String to replace matches of pattern with for
                        normalizing strings in sequences of strings.
                        [default: ' ']
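When the combined contents of the given input files include sequences of lines longer than the threshold that are duplicated within the input, the output does not repeat those sequences.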
>>> input = """\
... foo
... foo
... bar
... baz
... qux
... quux
... foo
... bar
... baz
... qux
... bah
... blah1
... quux
... blah
... quux
... fin
... """
>>> import cStringIO
>>> from rpatterson import stripdupes
>>> for line in stripdupes.strip(
...     cStringIO.StringIO(input).readlines()): print line,
foo
bar
baz
qux
quux
bah
blah1
blah
fin

>>> input = """\
... blah
... quux
... bah
... foo
... foo\t
... bar
... baz
... qux
... quux
... foo
... bar
... baz
... qux
... fin
... fin
... fin
... null
... fin
... """
>>> for line in stripdupes.strip(
...     cStringIO.StringIO(input).readlines()): print line,
blah
quux
bah
foo
bar
baz
qux
fin
null
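In the second example above, the line `foo\t` is treated as a duplicate of `foo`, which matches the documented defaults for `--pattern` (`'\s+'`) and `--repl` (`' '`). The following is a minimal sketch of that kind of normalization, not the package's actual code. The helper name `normalize` is hypothetical, and the sketch assumes that lines are compared after every match of the pattern is replaced with the replacement string. It is written to run on current Python versions.

```python
import re


def normalize(line, pattern=r'\s+', repl=' '):
    """Hypothetical helper: return the form of a line used for comparison.

    Assumption: every match of `pattern` is replaced with `repl`.
    An empty or None pattern disables normalization, as the help text
    describes for an empty string.
    """
    if not pattern:
        return line
    return re.sub(pattern, repl, line)


# Trailing whitespace differences disappear after normalization.
same = normalize('foo\t\n') == normalize('foo\n')

# With normalization disabled, the two lines stay distinct.
distinct = normalize('foo\t\n', pattern=None) != normalize('foo\n', pattern=None)
```

Under this assumption only the comparison uses the normalized form. The first example's output suggests the lines themselves are emitted unchanged.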
Ensure that odd sequences, such as an empty sequence or a single item, can be handled.
>>> list(stripdupes.strip([]))
[]
>>> list(stripdupes.strip(['foo']))
['foo']
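With the default minimum of 0.01, the minimum length of a duplicated sequence is a proportion of the length of the whole sequence. In the examples below, a single repeated item is kept in a 150-item sequence but stripped from a 149-item one.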
>>> seq = range(149)+[0]
>>> len(seq)
150
>>> seq[0] == seq[149]
True
>>> len(list(stripdupes.strip(seq, pattern=None)))
150

>>> seq = range(148)+[0]
>>> len(seq)
149
>>> seq[0] == seq[148]
True
>>> len(list(stripdupes.strip(seq, pattern=None)))
148
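The two results above can be read as a threshold calculation. The following sketch shows one way the `--min` value could be resolved to a number of lines. The function name `min_length` is hypothetical, and rounding to the nearest integer is an assumption, chosen only because it is consistent with the two results shown above. The package may compute the threshold differently.

```python
def min_length(total, minimum=0.01):
    """Hypothetical helper: resolve a --min value to a number of lines.

    Per the help text, a value below one is a proportion of the total
    number of lines; otherwise it is a line count.
    Assumption: the proportional result is rounded to the nearest integer.
    """
    if minimum < 1:
        return int(round(total * minimum))
    return int(minimum)


# 150 items: 150 * 0.01 = 1.5, which rounds to a threshold of 2 lines,
# so a single repeated item is too short to be stripped.
threshold_150 = min_length(150)

# 149 items: 149 * 0.01 = 1.49, which rounds to a threshold of 1 line,
# so a single repeated item is stripped.
threshold_149 = min_length(149)
```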
Changelog
0.1 - 2009-05-27
- Initial release