Passing a list of lists to a function
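I have a multidimensional array that I want to pass to difflib.get_close_matches().
My array looks like this: array[(ORIGINAL, FILTERED)]. Here ORIGINAL is a string, and FILTERED is the ORIGINAL string with common words removed.
Currently I build a new array that contains only the FILTERED words and pass that to difflib.get_close_matches(). I then try to match difflib's result back against array[(ORIGINAL, FILTERED)]. My problem is that two or more FILTERED words are often identical, so matching back this way does not work.
Is there a way to pass the whole array[(ORIGINAL,FILTERED)] to difflib but have it look only at the FILTERED part (while still returning [(ORIGINAL,FILTERED)])?
Thanks in advance!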
import time
import csv
import difflib
import sys
import os.path
import datetime
### Filters out common words in an attempt to get better results ###
def ignoredWords (word):
    filtered = word.lower()
    #Common Full Words
    ## Majority of filters were edited out
    #Common Abbreviations
    if "univ" in filtered:
        filtered = filtered.replace("univ","")
    #Special Characters
    if "  " in filtered: #Two White Spaces
        filtered = filtered.replace("  "," ")
    if "-" in filtered:
        filtered = filtered.replace("-"," ")
    if "\'" in filtered:
        filtered = filtered.replace("\'"," ")
    if " & " in filtered:
        filtered = filtered.replace(" &","")
    if "(\"" in filtered:
        filtered = filtered.replace("(\"","")
    if "\")" in filtered:
        filtered = filtered.replace("\")","")
    if "\t" in filtered:
        filtered = filtered.replace("\t"," ")
    return filtered
### Takes in a list, then outputs a 2D list. array[Original, Filtered] ###
### For XXX: array[Original, Filtered, Account Number, Code] ###
def create2DArray (list):
    array = []
    for item in list:
        clean = ignoredWords(item[2])
        entry = (item[2].lower(), clean, item[0],item[1])
        array.append(entry)
    return array
def main(argv):
    if(len(argv) < 3):
        print "Not enough parameters. Please enter two file names"
        sys.exit(2)
    elif (not os.path.isfile(argv[1])):
        print "%s is not found" %(argv[1])
        sys.exit(2)
    elif (not os.path.isfile(argv[2])):
        print "%s is not found" %(argv[2])
        sys.exit(2)
    #Recode File ----- Not yet implemented
    # if(len(argv) == 4):
    #     if(not os.path.isfile(argv[3])):
    #         print "%s is not found" %(argv[3])
    #         sys.exit(2)
    #
    #     recode = open(argv[1], 'r')
    #     try:
    #         setRecode = c.readlines()
    #     finally:
    #         recode.close()
    #     setRecode.sort()
    #     print setRecode[0]
    #Measure execution time
    t0 = time.time()
    cReader = csv.reader(open(argv[1], 'rb'), delimiter='|')
    try:
        setC = []
        for row in cReader:
            setC.append(row)
    finally:
        setC.sort()
    aReader = csv.reader(open(argv[2], 'rb'), delimiter='|')
    try:
        setA = []
        for row in aReader:
            setA.append(row)
    finally:
        setA.sort()
    #Put Set A and Set C into their own 2 dimensional arrays. array[Original Word] [Cleaned Up Word]
    arrayC = create2DArray(setC)
    arrayA = create2DArray(setA)
    #Create clean list versions for use with difflib
    cleanListC = []
    for item in arrayC:
        cleanListC.append(item[1])
    cleanListA = []
    for item in arrayA:
        cleanListA.append(item[1])
    ############OUTPUT FILENAME############
    fMatch75 = open("Match75.csv", 'w')
    Match75 = csv.writer(fMatch75, dialect='excel')
    try:
        header = "Fuzzy Matching Report. Generated: "
        header += str(datetime.date.today())
        Match75.writerow([header])
        Match75.writerow(['C','A','C Cleaned','A Cleaned','C Account', 'C Group','A Account', 'A Group', 'Filtered Ratio %','Unfiltered Ratio %','Average Ratio %'])
        for item in cleanListC:
            match = difflib.get_close_matches(item,cleanListA,1,0.75)
            if len(match) > 0:
                filteredratio = difflib.SequenceMatcher(None,item,match[0]).ratio()
                strfilteredratio = '%.2f' % (filteredratio*100)
                found = 0
                for group in arrayA:
                    if match[0] == group[1]:
                        origA = group[0]
                        acode = group[3]
                        aaccount = group[2]
                        found = found + 1
                for group in arrayC:
                    if item == group[1]:
                        origC = group[0]
                        ccode = group[3]
                        caccount = group[2]
                        found = found + 2
                if found == 3:
                    unfilteredratio = difflib.SequenceMatcher(None,origC,origA).ratio()
                    strunfilteredratio = '%.2f' % (unfilteredratio*100)
                    averageratio = (filteredratio+unfilteredratio)/2
                    straverageratio = '%.2f' % (averageratio*100)
                    row = [origC.rstrip(),origA.rstrip(),item.rstrip(),match[0].rstrip(),caccount,ccode,aaccount,acode,strfilteredratio,strunfilteredratio,straverageratio]
                    Match75.writerow(row)
                #These Else Ifs are for debugging. If NULL is found anywhere in the CSV, then an error has occurred
                elif found == 2:
                    row = [origC.rstrip(),"NULL",item.rstrip(),match[0].rstrip(),caccount,ccode,"NULL","NULL",strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
                elif found == 1:
                    row = ["NULL",origA.rstrip(),item.rstrip(),match[0].rstrip(),"NULL","NULL",aaccount,acode,strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
                else:
                    row = ["NULL","NULL",item.rstrip(),match[0].rstrip(),"NULL","NULL","NULL","NULL",strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
    finally:
        Match75.writerow(["A Proprietary and Confidential. Do Not Distribute"])
        fMatch75.close()
    print (time.time()-t0,"seconds")
if __name__ == "__main__":
    main(argv=sys.argv)
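What I am trying to achieve:
- Read the input files
- Strip common words from the names so that fuzzy matching ('difflib.get_close_matches()') returns more accurate results
- Compare the names in file A against the names in file B and find the most likely match.
- Print the original (unfiltered) names and the match percentage.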
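The filtering step in the list above could also be written as a small table-driven helper. This is only a minimal sketch under assumptions: COMMON_WORDS and strip_common_words are made-up names, the word list is invented (the real filters were edited out of the code above), and it drops whole words rather than doing the substring replacement that ignoredWords does.

```python
import re

# Hypothetical stop-word list; the real filters are not shown in the question.
COMMON_WORDS = {"univ", "university", "inc", "co", "company"}

def strip_common_words(name, common_words=COMMON_WORDS):
    """Lower-case a name, treat punctuation as spaces, and drop common words."""
    # Split on any run of characters that is not a lower-case letter or digit.
    tokens = re.split(r"[^a-z0-9]+", name.lower())
    # Skip empty tokens (from leading/trailing punctuation) and common words.
    return " ".join(t for t in tokens if t and t not in common_words)
```

Because it works on whole tokens, "university" is removed entirely instead of being cut down to "ersity", which may or may not be what you want.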
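Why this is hard
The naming conventions in the two input files differ widely. Some names are partially abbreviated (e.g. file A: Acme Company; file B: Acme Co). Because the naming is inconsistent, I can't use 'FileA.intersect(FileB)', which would have been the ideal approach.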
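To illustrate the point with made-up data (the names in file_a and file_b and the helper similarity are assumptions for this example only): an exact set intersection finds nothing, while a SequenceMatcher ratio still ranks the abbreviated name as the closest candidate.

```python
import difflib

# Hypothetical names standing in for the two input files.
file_a = {"acme company", "globex corporation"}
file_b = {"acme co", "globex corp"}

# Exact matching: no name appears verbatim in both files.
exact = file_a & file_b

def similarity(a, b):
    """Similarity ratio between two strings, in the range 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# Fuzzy matching: pick the file B name most similar to one file A name.
best_for_acme = max(file_b, key=lambda b: similarity("acme company", b))
```

Note that the best candidate is not guaranteed to clear a given cutoff, which is one reason for filtering common words before comparing.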
Where the change should happen
for item in cleanListC:
    match = difflib.get_close_matches(item,cleanListA,1,0.75)
cleanListA is created as follows:
cleanListA = []
for item in arrayA:
    cleanListA.append(item[1])
This loses the (ORIGINAL,FILTERED) pairing.
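One possible workaround, sketched here under assumptions rather than taken from the code above, is to keep get_close_matches() but group the records by their FILTERED value first, so a matched string maps back to every record that produced it. The sample data and the names by_filtered and close_records are hypothetical.

```python
import difflib
from collections import defaultdict

# Hypothetical sample in the question's layout:
# (original, filtered, account number, code)
arrayA = [
    ("acme univ", "acme", "A-1", "X"),
    ("acme-univ", "acme", "A-2", "Y"),
    ("globex corp", "globex corp", "A-3", "Z"),
]

# Map each FILTERED string to all records that share it.
by_filtered = defaultdict(list)
for record in arrayA:
    by_filtered[record[1]].append(record)

def close_records(word, n=1, cutoff=0.75):
    """Return every full record whose FILTERED value is a close match."""
    keys = difflib.get_close_matches(word, list(by_filtered), n, cutoff)
    return [record for key in keys for record in by_filtered[key]]
```

With this layout, close_records("acme") returns both "acme" records instead of silently keeping only one of them.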
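Ultimate goal
I would like to pass arrayA to difflib.get_close_matches() instead of cleanListA, so that the (ORIGINAL,FILTERED) pairing is preserved. difflib.get_close_matches() would look only at the 'FILTERED' part of each pair when determining close matches, but would return the whole pair.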
1 Answer
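Since you are already using SequenceMatcher directly to get the match ratios, the simplest change is to do what get_close_matches does yourself.
You can look at the source of get_close_matches() [e.g. http://svn.python.org/view/python/tags/r271/Lib/difflib.py?revision=86833&view=markup, around line 737]. The function returns the n sequences with the highest match ratios. Since you only want the best match, you can keep track of the (original, filtered, ratio) triple with the highest ratio seen so far, instead of using heapq to track the n highest matches as the original method does.
For example, in your main loop you could do something like this: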
seqm = difflib.SequenceMatcher()
for i in arrayC:
    origC, cleanC, caccount, ccode = i
    seqm.set_seq2(cleanC)
    bestRatio = 0
    for j in arrayA:
        origA, cleanA = j[:2]
        seqm.set_seq1(cleanA)
        if (seqm.real_quick_ratio() >= bestRatio and
                seqm.quick_ratio() >= bestRatio):
            r = seqm.ratio()
            if r >= bestRatio:
                bestRatio = r
                bestA = j
    if bestRatio >= 0.75: # the cutoff from the original get_close_matches() call
        origA, cleanA, aaccount, acode = bestA
        filteredratio = bestRatio
        strfilteredratio = '%.2f' % (filteredratio*100)
        seqm.set_seqs( origC, origA )
        unfilteredratio = seqm.ratio()
        strunfilteredratio = '%.2f' % (unfilteredratio*100)
        averageratio = (filteredratio+unfilteredratio)/2
        straverageratio = '%.2f' % (averageratio*100)
        row = [origC.rstrip(),origA.rstrip(),cleanC.rstrip(),cleanA.rstrip(),caccount,ccode,aaccount,acode,strfilteredratio,strunfilteredratio,straverageratio]
    else:
        row = ["NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","0.00","NULL","NULL"]
    Match75.writerow(row)
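If you later need more than the single best match, the same idea can be wrapped in a reusable helper. The following is a minimal sketch, not part of difflib: get_close_records and its key_index parameter are names made up for illustration, and it assumes each record is a tuple whose compared string sits at key_index (index 1 for your FILTERED value).

```python
import difflib
import heapq

def get_close_records(word, records, key_index=1, n=1, cutoff=0.75):
    """Return up to n (ratio, record) pairs, best first.

    Only records[i][key_index] is compared against word, but the whole
    record is returned, so duplicate key strings stay distinguishable.
    """
    matcher = difflib.SequenceMatcher()
    matcher.set_seq2(word)  # seq2 is cached, so set the fixed string there
    scored = []
    for position, record in enumerate(records):
        matcher.set_seq1(record[key_index])
        # Cheap upper bounds first, as get_close_matches() does.
        if (matcher.real_quick_ratio() >= cutoff and
                matcher.quick_ratio() >= cutoff):
            ratio = matcher.ratio()
            if ratio >= cutoff:
                # -position breaks ties in favour of earlier records and
                # keeps the records themselves from ever being compared.
                scored.append((ratio, -position, record))
    best = heapq.nlargest(n, scored)
    return [(ratio, record) for ratio, _, record in best]
```

Called as get_close_records(cleanC, arrayA), it would give you the matching arrayA tuple together with its filtered ratio, so the lookup loops over arrayA and arrayC are no longer needed.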