Passing a list of lists to a function

Asked 2025-04-16 16:59

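I have a multi-dimensional array that I want to feed into difflib.get_close_matches().

My array looks like this: array[(ORIGINAL, FILTERED)]. ORIGINAL is a string, and FILTERED is the same string with common words removed.

Right now I build a new list that contains only the FILTERED strings and pass that to difflib.get_close_matches(). I then try to match the difflib result back against array[(ORIGINAL, FILTERED)]. The problem is that two or more FILTERED strings are often identical, so this lookup cannot tell which entry was actually matched.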
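The collision can be reproduced with a few made-up entries. The names below are hypothetical and not taken from the real files: two different ORIGINAL strings share one FILTERED string, so looking the result up by its FILTERED value returns more than one candidate.

```python
import difflib

# Hypothetical (ORIGINAL, FILTERED) pairs; two originals share the same filtered text.
array_a = [
    ("university of foo", "foo"),
    ("foo univ", "foo"),
    ("bar college", "bar"),
]

# Same approach as in the script: keep only the FILTERED column for difflib.
clean_list_a = [filtered for _, filtered in array_a]
match = difflib.get_close_matches("foo", clean_list_a, 1, 0.75)

# Mapping the matched string back to the array is ambiguous.
candidates = [orig for orig, filtered in array_a if filtered == match[0]]
print(candidates)
```

With a loop like the one in the script, the last candidate silently overwrites the earlier ones.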

Is there a way to pass the whole array[(ORIGINAL, FILTERED)] to difflib, have it look only at the FILTERED part, and still get [(ORIGINAL, FILTERED)] back?

Thanks in advance!

import time
import csv
import difflib
import sys
import os.path
import datetime

### Filters out common words in an attempt to get better results ###
def ignoredWords (word):
    filtered = word.lower()
    #Common Full Words
## Majority of filters were edited out
    #Common Abbreviations
    if "univ" in filtered:
        filtered = filtered.replace("univ","")
    #Special Characters
    if "  " in filtered: #Two White Spaces
        filtered = filtered.replace("  "," ")
    if "-" in filtered:
        filtered = filtered.replace("-"," ")
    if "\'" in filtered:
        filtered = filtered.replace("\'"," ")
    if " & " in filtered:
        filtered = filtered.replace(" &","")
    if "(\"" in filtered:
        filtered = filtered.replace("(\"","")
    if "\")" in filtered:
        filtered = filtered.replace("\")","")
    if "\t" in filtered:
        filtered = filtered.replace("\t"," ")
    return filtered

### Takes in a list, then outputs a 2D list. array[Original, Filtered] ###
### For XXX: array[Original, Filtered, Account Number, Code] ###
def create2DArray (rows):  #parameter renamed from "list" so the builtin is not shadowed
    array = []
    for item in rows:
        clean = ignoredWords(item[2])
        entry = (item[2].lower(), clean, item[0],item[1])
        array.append(entry)
    return array

def main(argv):
    if(len(argv) < 3):
        print "Not enough parameters. Please enter two file names"
        sys.exit(2)
    elif (not os.path.isfile(argv[1])):
        print "%s is not found" %(argv[1])
        sys.exit(2)
    elif (not os.path.isfile(argv[2])):
        print "%s is not found" %(argv[2])
        sys.exit(2)
    #Recode File ----- Not yet implemented
#       if(len(argv) == 4):
#       if(not os.path.isfile(argv[3])):
#           print "%s is not found" %(argv[3])
#           sys.exit(2)
#           
#       recode = open(argv[1], 'r')
#       try:
#           setRecode = c.readlines()
#       finally:
#           recode.close()
#           setRecode.sort()
#           print setRecode[0]
    #Measure execution time
    t0 = time.time()

    cReader = csv.reader(open(argv[1], 'rb'), delimiter='|')
    try:
        setC = []
        for row in cReader:
            setC.append(row)
    finally:
        setC.sort()

    aReader = csv.reader(open(argv[2], 'rb'), delimiter='|')
    try:
        setA = []
        for row in aReader:
            setA.append(row)
    finally:
        setA.sort()

    #Put Set A and Set C into their own 2-dimensional arrays: array[Original Word][Cleaned Up Word]
    arrayC = create2DArray(setC)
    arrayA = create2DArray(setA)

    #Create clean list versions for use with difflib
    cleanListC = []
    for item in arrayC:
        cleanListC.append(item[1])

    cleanListA = []
    for item in arrayA:
        cleanListA.append(item[1])

    ############OUTPUT FILENAME############
    fMatch75 = open("Match75.csv", 'w')
    Match75 = csv.writer(fMatch75, dialect='excel')
    try:
        header = "Fuzzy Matching Report. Generated: "
        header += str(datetime.date.today())
        Match75.writerow([header])
        Match75.writerow(['C','A','C Cleaned','A Cleaned','C Account', 'C Group','A Account', 'A Group', 'Filtered Ratio %','Unfiltered Ratio %','Average Ratio %'])
        for item in cleanListC:
            match = difflib.get_close_matches(item,cleanListA,1,0.75)
            
            if len(match) > 0:
                filteredratio = difflib.SequenceMatcher(None,item,match[0]).ratio()
                strfilteredratio = '%.2f' % (filteredratio*100)
                found = 0
                for group in arrayA:
                    if match[0] == group[1]:
                        origA = group[0]
                        acode = group[3]
                        aaccount = group[2]
                        found = found + 1
                for group in arrayC:
                    if item == group[1]:
                        origC = group[0]
                        ccode = group[3]
                        caccount = group[2]
                        found = found + 2
                if found == 3:
                    unfilteredratio = difflib.SequenceMatcher(None,origC,origA).ratio()
                    strunfilteredratio = '%.2f' % (unfilteredratio*100)
                    averageratio = (filteredratio+unfilteredratio)/2
                    straverageratio = '%.2f' % (averageratio*100)
                    
                    row = [origC.rstrip(),origA.rstrip(),item.rstrip(),match[0].rstrip(),caccount,ccode,aaccount,acode,strfilteredratio,strunfilteredratio,straverageratio]
                    Match75.writerow(row)
                #These Else Ifs are for debugging. If NULL is found anywhere in the CSV, then an error has occurred
                elif found == 2:
                    row = [origC.rstrip(),"NULL",item.rstrip(),match[0].rstrip(),caccount,ccode,"NULL","NULL",strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
                elif found == 1:
                    row = ["NULL",origA.rstrip(),item.rstrip(),match[0].rstrip(),"NULL","NULL",aaccount,acode,strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
                else:
                    row = ["NULL","NULL",item.rstrip(),match[0].rstrip(),"NULL","NULL","NULL","NULL",strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
                
    finally:
        Match75.writerow(["A Proprietary and Confidential. Do Not Distribute"])
        fMatch75.close()

    print "%.2f seconds" % (time.time()-t0)

if __name__ == "__main__":
    main(argv=sys.argv)

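What I'm trying to achieve:

  1. Read the input files.
  2. Strip common words from the names so that fuzzy matching ('difflib.get_close_matches()') returns more accurate results.
  3. Compare the names in file A with the names in file B and find the most likely match.
  4. Print the original (unfiltered) names and the match percentage.

Why this is hard

The naming conventions in the two input files differ a lot. Some names are partially abbreviated (for example, file A: Acme Company; file B: Acme Co). Because the naming is inconsistent, I can't use 'FileA.intersect(FileB)', which would otherwise be the ideal approach.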
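A small sketch of that difference, using made-up sample names because the real files are not shown. Exact set intersection finds nothing, while difflib still pairs the abbreviated name with its longer form. The cutoff of 0.6 here is difflib's default, not the 0.75 used in the script.

```python
import difflib

# Made-up sample names; the real input files are not shown in the question.
file_a = {"acme company", "globex corporation"}
file_b = {"acme co", "globex corp"}

# Exact matching finds nothing, because no name is spelled identically in both files.
# (For Python sets the method is intersection(), or the & operator.)
exact = file_a & file_b

# Fuzzy matching still pairs the abbreviated name with its longer form.
close = difflib.get_close_matches("acme co", sorted(file_a), n=1, cutoff=0.6)

# Similarity of that pair: 2 * 7 matching characters / 19 characters in total.
score = difflib.SequenceMatcher(None, "acme co", "acme company").ratio()

print(exact, close, round(score, 2))
```

This particular pair scores about 0.74, just under a 0.75 cutoff, which is one reason stripping common words before comparing can help.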

Where the change needs to happen

for item in cleanListC:
    match = difflib.get_close_matches(item,cleanListA,1,0.75)

cleanListA is created like this:

cleanListA = []
for item in arrayA:
    cleanListA.append(item[1])

This loses the (ORIGINAL, FILTERED) pairing.

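End goal

I would like to pass arrayA to difflib.get_close_matches() instead of cleanListA, so that the (ORIGINAL, FILTERED) pairing is preserved. difflib.get_close_matches() should look only at the 'FILTERED' part of each pair when deciding what is a close match, but return the whole pair.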

1 Answer


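Since you are already using SequenceMatcher directly to get the match ratios, the simplest change is to do the work of get_close_matches yourself.

You can look at the source of get_close_matches() [for example, http://svn.python.org/view/python/tags/r271/Lib/difflib.py?revision=86833&view=markup at around line 737]. The function returns the n sequences with the highest match ratios. Since you only want the best match, you can keep track of the (original, filtered, ratio) triple with the highest ratio seen so far, instead of using heapq to track the n best matches as the original function does.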
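If the n best matches are still wanted, the same idea can be written as a key-aware variant of that function. This is only a sketch under assumptions: the name get_close_matches_by_key and its key parameter are hypothetical and not part of difflib, and the sample entries are made up.

```python
import difflib
import heapq


def get_close_matches_by_key(word, possibilities, key, n=3, cutoff=0.6):
    """Hypothetical helper: compare key(item) to word, return (score, item) pairs."""
    matcher = difflib.SequenceMatcher()
    matcher.set_seq2(word)  # SequenceMatcher caches details of seq2, so set it once
    scored = []
    for index, item in enumerate(possibilities):
        matcher.set_seq1(key(item))
        # Cheap upper bounds first, then the full ratio only when needed.
        if (matcher.real_quick_ratio() >= cutoff
                and matcher.quick_ratio() >= cutoff):
            ratio = matcher.ratio()
            if ratio >= cutoff:
                # -index breaks ties in favour of earlier items and keeps
                # the items themselves from ever being compared.
                scored.append((ratio, -index, item))
    best = heapq.nlargest(n, scored)
    return [(score, item) for score, _, item in best]


# Made-up entries shaped like the question's (ORIGINAL, FILTERED, account, code).
array_a = [
    ("university of foo", "foo", "A1", "X"),
    ("foo univ", "foo", "A2", "Y"),
    ("bar college", "bar", "A3", "Z"),
]
result = get_close_matches_by_key("foo", array_a, key=lambda e: e[1],
                                  n=2, cutoff=0.75)
```

Each returned item is the whole tuple, so entries with identical FILTERED text stay distinct.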

For example, in your main loop you could do something like this:

seqm = difflib.SequenceMatcher()

for i in arrayC:
  origC, cleanC, caccount, ccode = i
  seqm.set_seq2(cleanC)

  bestRatio = 0
  bestA = None  # best arrayA entry seen so far for this cleanC

  for j in arrayA:
    origA, cleanA = j[:2]
    seqm.set_seq1(cleanA)

    if (seqm.real_quick_ratio() >= bestRatio and
        seqm.quick_ratio() >= bestRatio):
      r = seqm.ratio()
      if r >= bestRatio:
        bestRatio = r
        bestA = j

  if bestRatio >= 0.75: # the cutoff from the original get_close_matches() call
    origA, cleanA, aaccount, acode = bestA

    filteredratio = bestRatio
    strfilteredratio = '%.2f' % (filteredratio*100)

    seqm.set_seqs( origC, origA )
    unfilteredratio = seqm.ratio()
    strunfilteredratio = '%.2f' % (unfilteredratio*100)

    averageratio = (filteredratio+unfilteredratio)/2
    straverageratio = '%.2f' % (averageratio*100)

    row = [origC.rstrip(),origA.rstrip(),cleanC.rstrip(),cleanA.rstrip(),caccount,ccode,aaccount,acode,strfilteredratio,strunfilteredratio,straverageratio]
  else:
    row = ["NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","0.00","NULL","NULL"]

  Match75.writerow(row)
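A different option, not the approach in the loop above, is to keep calling get_close_matches() but index arrayA by its FILTERED string first, so that a match maps back to every entry that shares it. The sketch below assumes made-up entries in the question's tuple layout; by_filtered is a hypothetical name.

```python
import difflib
from collections import defaultdict

# Made-up entries shaped like the question's (ORIGINAL, FILTERED, account, code).
array_a = [
    ("university of foo", "foo", "A1", "X"),
    ("foo univ", "foo", "A2", "Y"),
    ("bar college", "bar", "A3", "Z"),
]

# Group the full entries under their FILTERED string.
by_filtered = defaultdict(list)
for entry in array_a:
    by_filtered[entry[1]].append(entry)

# difflib only sees each distinct FILTERED string once.
match = difflib.get_close_matches("foo", list(by_filtered), 1, 0.75)

# Every entry behind the matched string is recovered, not just the last one.
entries = by_filtered[match[0]] if match else []
```

This keeps the duplicates visible, but it still leaves you to decide which of the tied entries to report, for example by comparing the unfiltered strings.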
