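Python - Best way to compare two strings and record statistics comparing the serial position of a particular item?
I'm working with two files, both of which contain lines that look like this:
This is || an example || line.
In one of the files the line above appears; in the other file the corresponding line is the same, but the "||" may be in different positions:
This || is an || example || line.
I just need to count how many times, in the second file, a "||" falls in the "right" place (we assume the "||" positions in the first file are always correct), how many times a "||" falls in a place where the first file has none, and how the total number of "||" in the line differs.
I know I could do this on my own, but I wondered whether you brilliant people know of some especially easy way to do it? The basics (like reading the files) are all familiar to me; I'm really just looking for advice on how to do the line comparison and collect the statistics!
Best wishes,
Georgina
2 Answers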
1
Is this what you're looking for?
This code assumes that every line is formatted the same way as your example.
# Compares the "||"-separated segments of each pair of lines, index by index.
with open('theCorrectFile', 'r') as fileOne, open('theSecondFile', 'r') as fileTwo:
    for correctLine in fileOne:
        otherLine = fileTwo.readline()
        correctParts = correctLine.split("||")
        otherParts = otherLine.split("||")
        # Reset the counters once per line rather than once per segment.
        count = 0
        wrongPlacement = 0
        for i in range(len(correctParts)):
            if len(otherParts) >= i + 1 and correctParts[i] == otherParts[i]:
                count += 1
            else:
                wrongPlacement += 1
        print('there are %d out of %d "||" in the correct places and %d in the wrong places'
              % (count, len(correctParts), wrongPlacement))
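The loop above compares the text segments between delimiters, so a single misplaced "||" can make every later segment differ as well. Below is a minimal sketch of a position-based alternative, written for Python 3. It assumes the two lines contain the same text apart from the delimiters and the whitespace around them. The helper names delimiter_offsets and compare_line are hypothetical, and measuring a position as the number of non-whitespace characters before each delimiter is an assumption made for illustration, not something the question specifies.

```python
def delimiter_offsets(line, delimiter="||"):
    """Return the set of offsets at which ``delimiter`` occurs in ``line``.

    An offset is the number of non-whitespace characters that precede the
    delimiter (an assumed definition), so it does not shift when other
    delimiters are added or removed earlier in the line.
    """
    offsets = set()
    seen = 0
    pieces = line.split(delimiter)
    for piece in pieces[:-1]:  # the last piece is not followed by a delimiter
        seen += sum(1 for ch in piece if not ch.isspace())
        offsets.add(seen)
    return offsets


def compare_line(expected, actual, delimiter="||"):
    """Summarise how the delimiters in ``actual`` compare with ``expected``."""
    exp = delimiter_offsets(expected, delimiter)
    act = delimiter_offsets(actual, delimiter)
    return {
        "correct": len(exp & act),   # delimiters at an expected offset
        "extra": len(act - exp),     # delimiters where none was expected
        "count_diff": actual.count(delimiter) - expected.count(delimiter),
    }


stats = compare_line("This is || an example || line.",
                     "This || is an || example || line.")
print(stats)  # {'correct': 1, 'extra': 2, 'count_diff': 1}
```

For the question's example pair, only the last "||" sits at an expected offset; the other two are extra, and the second line has one more delimiter overall.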
0
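I'm not sure how simple this is, since it uses some fairly advanced concepts such as generators, but it is at least robust and well documented. The actual code is at the end and is fairly concise.
The basic idea is that the function iter_delim_sets returns an iterator that yields a sequence of tuples, each with three parts: the line index (starting at 0), the set of indices at which the delimiter was found in the "expected" string, and the corresponding set for the "actual" string. One such tuple is generated for each pair of (expected, actual) lines. These tuples are neatly packaged as a collections.namedtuple type called DelimLocations.
The function analyze then returns higher-level information based on one such data set, stored in a DelimAnalysis namedtuple. This is done with basic set arithmetic.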
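To make the set arithmetic concrete, here is a small sketch for a single pair of lines, written for Python 3. The two sample strings come from the answer's own test data. The standalone helper delimiter_locations mirrors the nested function of the same name in the code, but this version takes the delimiter as an argument and is only an illustration.

```python
import re


def delimiter_locations(delimiter, string):
    """Return the set of start indices of ``delimiter`` in ``string``."""
    return {m.start() for m in re.finditer(re.escape(delimiter), string)}


expected = delimiter_locations("||", "one||fish||two||fish")   # {3, 9, 14}
actual = delimiter_locations("||", "red||fish||blue||fish")    # {3, 9, 15}

correct = len(expected & actual)          # delimiters at an expected index
incorrect = len(actual - expected)        # delimiters at an unexpected index
count_diff = len(actual) - len(expected)  # difference in total delimiter count

print(correct, incorrect, count_diff)  # prints: 2 1 0
```

Intersection gives the correctly placed delimiters and set difference gives the misplaced ones, which is why no explicit loop over positions is needed.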
"""Compare two sequences of strings.

Test data:

>>> from pprint import pprint
>>> delimiter = '||'
>>> expected = (
...     delimiter.join(("one", "fish", "two", "fish")),
...     delimiter.join(("red", "fish", "blue", "fish")),
...     delimiter.join(("I do not like them", "Sam I am")),
...     delimiter.join(("I do not like green eggs and ham.",)))
>>> actual = (
...     delimiter.join(("red", "fish", "blue", "fish")),
...     delimiter.join(("one", "fish", "two", "fish")),
...     delimiter.join(("I do not like spam", "Sam I am")),
...     delimiter.join(("I do not like", "green eggs and ham.")))

The results:

>>> pprint([analyze(v) for v in iter_delim_sets(delimiter, expected, actual)])
[DelimAnalysis(index=0, correct=2, incorrect=1, count_diff=0),
 DelimAnalysis(index=1, correct=2, incorrect=1, count_diff=0),
 DelimAnalysis(index=2, correct=1, incorrect=0, count_diff=0),
 DelimAnalysis(index=3, correct=0, incorrect=1, count_diff=1)]

What they mean:

>>> pprint(delim_analysis_doc)
(('index',
  ('The number of the lines from expected and actual',
   'used to perform this analysis.')),
 ('correct',
  ('The number of delimiter placements in ``actual``',
   'which were correctly placed.')),
 ('incorrect', ('The number of incorrect delimiters in ``actual``.',)),
 ('count_diff',
  ('The difference between the number of delimiters',
   'in ``expected`` and ``actual`` for this line.')))

And a trace of the processing stages:

>>> def dump_it(it):
...     '''Wraps an iterator in code that dumps its values to stdout.'''
...     for v in it:
...         print v
...         yield v

>>> for v in iter_delim_sets(delimiter,
...         dump_it(expected), dump_it(actual)):
...     print v
...     print analyze(v)
...     print '======'
one||fish||two||fish
red||fish||blue||fish
DelimLocations(index=0, expected=set([9, 3, 14]), actual=set([9, 3, 15]))
DelimAnalysis(index=0, correct=2, incorrect=1, count_diff=0)
======
red||fish||blue||fish
one||fish||two||fish
DelimLocations(index=1, expected=set([9, 3, 15]), actual=set([9, 3, 14]))
DelimAnalysis(index=1, correct=2, incorrect=1, count_diff=0)
======
I do not like them||Sam I am
I do not like spam||Sam I am
DelimLocations(index=2, expected=set([18]), actual=set([18]))
DelimAnalysis(index=2, correct=1, incorrect=0, count_diff=0)
======
I do not like green eggs and ham.
I do not like||green eggs and ham.
DelimLocations(index=3, expected=set([]), actual=set([13]))
DelimAnalysis(index=3, correct=0, incorrect=1, count_diff=1)
======
"""

from collections import namedtuple

# Data types

## Here ``expected`` and ``actual`` are sets
DelimLocations = namedtuple('DelimLocations', 'index expected actual')

DelimAnalysis = namedtuple('DelimAnalysis',
                           'index correct incorrect count_diff')

## Explanation of the elements of DelimAnalysis.
## There's no real convenient way to add a docstring to a variable.
delim_analysis_doc = (
    ('index', ("The number of the lines from expected and actual",
               "used to perform this analysis.")),
    ('correct', ("The number of delimiter placements in ``actual``",
                 "which were correctly placed.")),
    ('incorrect', ("The number of incorrect delimiters in ``actual``.",)),
    ('count_diff', ("The difference between the number of delimiters",
                    "in ``expected`` and ``actual`` for this line.")))

# Actual functionality

def iter_delim_sets(delimiter, expected, actual):
    """Yields a DelimLocations tuple for each pair of strings.

    ``expected`` and ``actual`` are sequences of strings.
    """
    from re import escape, compile as compile_
    from itertools import count, izip

    index = count()

    re = compile_(escape(delimiter))
    def delimiter_locations(string):
        """Set of the locations of matches of ``re`` in ``string``."""
        return set(match.start() for match in re.finditer(string))

    string_pairs = izip(expected, actual)

    return (DelimLocations(index=index.next(),
                           expected=delimiter_locations(e),
                           actual=delimiter_locations(a))
            for e, a in string_pairs)

def analyze(locations):
    """Returns an analysis of a DelimLocations tuple.

    ``locations.expected`` and ``locations.actual`` are sets.
    """
    return DelimAnalysis(
        index=locations.index,
        correct=len(locations.expected & locations.actual),
        incorrect=len(locations.actual - locations.expected),
        count_diff=(len(locations.actual) - len(locations.expected)))
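The module above targets Python 2: its doctests use the print statement, and the code relies on itertools.izip and index.next(). As a hedged sketch, the same approach might be adapted to Python 3 roughly as follows. This is an adaptation for illustration rather than the answerer's code; it swaps the counter and izip for enumerate and zip, and turns iter_delim_sets into a generator function.

```python
import re
from collections import namedtuple

DelimLocations = namedtuple("DelimLocations", "index expected actual")
DelimAnalysis = namedtuple("DelimAnalysis", "index correct incorrect count_diff")


def iter_delim_sets(delimiter, expected, actual):
    """Yield a DelimLocations tuple for each pair of strings."""
    pattern = re.compile(re.escape(delimiter))

    def delimiter_locations(string):
        """Set of the start indices of the delimiter in ``string``."""
        return {match.start() for match in pattern.finditer(string)}

    # zip() is lazy in Python 3, so lines are still consumed pair by pair.
    for index, (e, a) in enumerate(zip(expected, actual)):
        yield DelimLocations(index, delimiter_locations(e), delimiter_locations(a))


def analyze(locations):
    """Return a DelimAnalysis for one DelimLocations tuple."""
    return DelimAnalysis(
        index=locations.index,
        correct=len(locations.expected & locations.actual),
        incorrect=len(locations.actual - locations.expected),
        count_diff=len(locations.actual) - len(locations.expected),
    )


expected = ["one||fish||two||fish", "I do not like green eggs and ham."]
actual = ["red||fish||blue||fish", "I do not like||green eggs and ham."]
results = [analyze(v) for v in iter_delim_sets("||", expected, actual)]
for row in results:
    print(row)
```

Because both arguments are only iterated over, two open file objects could be passed in place of the lists, in which case each line's trailing newline stays part of the string but does not affect the delimiter indices.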