Passing a list of lists to a function

Question

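I have a multidimensional array that I want to pass to difflib.get_close_matches().

My array looks like this: array[(ORIGINAL, FILTERED)]. ORIGINAL is a string, and FILTERED is the same string with common words removed.

Right now I build a new list containing only the FILTERED strings and pass that to difflib.get_close_matches(). I then try to map the results from difflib back to array[(ORIGINAL, FILTERED)]. The problem is that two or more FILTERED strings are often identical, so the results can't be mapped back unambiguously this way.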
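
To make the collision concrete, here is a small sketch with made-up names (the sample pairs and the variable names are hypothetical, not taken from the real files). Two different ORIGINAL strings reduce to the same FILTERED string, so a lookup by FILTERED value alone yields more than one candidate:

```python
from collections import defaultdict

# Hypothetical (ORIGINAL, FILTERED) pairs; not from the real input files.
pairs = [
    ("acme inc", "acme"),
    ("acme corp", "acme"),
    ("widget co", "widget"),
]

# Group every ORIGINAL under its FILTERED string.
by_filtered = defaultdict(list)
for original, filtered in pairs:
    by_filtered[filtered].append(original)

# Keep only the FILTERED strings that map back to several ORIGINALs.
ambiguous = {f: o for f, o in by_filtered.items() if len(o) > 1}
print(ambiguous)
```

Here "acme" maps back to both "acme inc" and "acme corp", which is exactly the case where a result from difflib can no longer be tied to a single entry.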

Is there a way to pass the whole array[(ORIGINAL, FILTERED)] to difflib, have it look only at the FILTERED part, and still get back [(ORIGINAL, FILTERED)]?

Thanks in advance!

import time
import csv
import difflib
import sys
import os.path
import datetime

### Filters out common words in an attempt to get better results ###
def ignoredWords (word):
    filtered = word.lower()
    #Common Full Words
## Majority of filters were edited out
    #Common Abbreviations
    if "univ" in filtered:
        filtered = filtered.replace("univ","")
    #Special Characters
    if "  " in filtered: #Two White Spaces
        filtered = filtered.replace("  "," ")
    if "-" in filtered:
        filtered = filtered.replace("-"," ")
    if "\'" in filtered:
        filtered = filtered.replace("\'"," ")
    if " & " in filtered:
        filtered = filtered.replace(" &","")
    if "(\"" in filtered:
        filtered = filtered.replace("(\"","")
    if "\")" in filtered:
        filtered = filtered.replace("\")","")
    if "\t" in filtered:
        filtered = filtered.replace("\t"," ")
    return filtered

### Takes in a list, then outputs a 2D list. array[Original, Filtered] ###
### For XXX: array[Original, Filtered, Account Number, Code] ###
def create2DArray (list):
    array = []
    for item in list:
        clean = ignoredWords(item[2])
        entry = (item[2].lower(), clean, item[0],item[1])
        array.append(entry)
    return array

def main(argv):
    if(len(argv) < 3):
        print "Not enough parameters. Please enter two file names"
        sys.exit(2)
    elif (not os.path.isfile(argv[1])):
        print "%s is not found" %(argv[1])
        sys.exit(2)
    elif (not os.path.isfile(argv[2])):
        print "%s is not found" %(argv[2])
        sys.exit(2)
    #Recode File ----- Not yet implemented
#       if(len(argv) == 4):
#       if(not os.path.isfile(argv[3])):
#           print "%s is not found" %(argv[3])
#           sys.exit(2)
#           
#       recode = open(argv[1], 'r')
#       try:
#           setRecode = c.readlines()
#       finally:
#           recode.close()
#           setRecode.sort()
#           print setRecode[0]
    #Measure execution time
    t0 = time.time()

    cReader = csv.reader(open(argv[1], 'rb'), delimiter='|')
    try:
        setC = []
        for row in cReader:
            setC.append(row)
    finally:
        setC.sort()

    aReader = csv.reader(open(argv[2], 'rb'), delimiter='|')
    try:
        setA = []
        for row in aReader:
            setA.append(row)
    finally:
        setA.sort()

    #Put Set A and Set C into their own 2 dimensional arrays. array[Original Word][Cleaned Up Word]
    arrayC = create2DArray(setC)
    arrayA = create2DArray(setA)

    #Create clean list versions for use with difflib
    cleanListC = []
    for item in arrayC:
        cleanListC.append(item[1])

    cleanListA = []
    for item in arrayA:
        cleanListA.append(item[1])

    ############OUTPUT FILENAME############
    fMatch75 = open("Match75.csv", 'w')
    Match75 = csv.writer(fMatch75, dialect='excel')
    try:
        header = "Fuzzy Matching Report. Generated: "
        header += str(datetime.date.today())
        Match75.writerow([header])
        Match75.writerow(['C','A','C Cleaned','A Cleaned','C Account', 'C Group','A Account', 'A Group', 'Filtered Ratio %','Unfiltered Ratio %','Average Ratio %'])
        for item in cleanListC:
            match = difflib.get_close_matches(item,cleanListA,1,0.75)
            
            if len(match) > 0:
                filteredratio = difflib.SequenceMatcher(None,item,match[0]).ratio()
                strfilteredratio = '%.2f' % (filteredratio*100)
                found = 0
                for group in arrayA:
                    if match[0] == group[1]:
                        origA = group[0]
                        acode = group[3]
                        aaccount = group[2]
                        found = found + 1
                for group in arrayC:
                    if item == group[1]:
                        origC = group[0]
                        ccode = group[3]
                        caccount = group[2]
                        found = found + 2
                if found == 3:
                    unfilteredratio = difflib.SequenceMatcher(None,origC,origA).ratio()
                    strunfilteredratio = '%.2f' % (unfilteredratio*100)
                    averageratio = (filteredratio+unfilteredratio)/2
                    straverageratio = '%.2f' % (averageratio*100)
                    
                    row = [origC.rstrip(),origA.rstrip(),item.rstrip(),match[0].rstrip(),caccount,ccode,aaccount,acode,strfilteredratio,strunfilteredratio,straverageratio]
                    Match75.writerow(row)
                #These Else Ifs are for debugging. If NULL is found anywhere in the CSV, then an error has occurred
                elif found == 2:
                    row = [origC.rstrip(),"NULL",item.rstrip(),match[0].rstrip(),caccount,ccode,"NULL","NULL",strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
                elif found == 1:
                    row = ["NULL",origA.rstrip(),item.rstrip(),match[0].rstrip(),"NULL","NULL",aaccount,acode,strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
                else:
                    row = ["NULL","NULL",item.rstrip(),match[0].rstrip(),"NULL","NULL","NULL","NULL",strfilteredratio,"NULL","NULL"]
                    Match75.writerow(row)
                
    finally:
        Match75.writerow(["A Proprietary and Confidential. Do Not Distribute"])
        fMatch75.close()

    print time.time()-t0, "seconds"

if __name__ == "__main__":
    main(argv=sys.argv)

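What I'm trying to accomplish:

Read the input files
Remove common words from the names so that fuzzy matching ('difflib.get_close_matches()') returns more accurate results
Compare the names in File A with the names in File B and find the most likely match
Print the original (unfiltered) names and the match percentage

Why this is hard

The naming conventions in the two input files differ widely. Some names are partially abbreviated (for example, File A: Acme Company; File B: Acme Co). Because the naming conventions are inconsistent, I can't use 'FileA.intersect(FileB)', which would otherwise be the ideal approach.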
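
As a rough illustration of why an exact intersection does not help, the sketch below compares the two example names both ways. The set contents and variable names are hypothetical, and Python's built-in set intersection (the & operator) stands in for the 'intersect' idea:

```python
import difflib

# Hypothetical one-entry "files", lowercased as in the script above.
file_a = {"acme company"}
file_b = {"acme co"}

# Exact intersection: the names differ, so nothing is shared.
exact = file_a & file_b

# Fuzzy comparison: SequenceMatcher scores how similar the two strings are.
ratio = difflib.SequenceMatcher(None, "acme company", "acme co").ratio()

print(exact)
print(round(ratio, 2))
```

The intersection is empty even though the two names clearly refer to the same thing, while the similarity ratio stays fairly high. That gap is why I am using difflib at all.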

Where the change should happen

for item in cleanListC:
    match = difflib.get_close_matches(item,cleanListA,1,0.75)

cleanListA is created as follows:

cleanListA = []
for item in arrayA:
    cleanListA.append(item[1])

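This loses the (ORIGINAL, FILTERED) pairing.

End goal

I would like to pass arrayA to difflib.get_close_matches() instead of cleanListA, so that the (ORIGINAL, FILTERED) pairing is preserved. difflib.get_close_matches() would look only at the 'FILTERED' part of each pair when determining close matches, but would return the whole pair.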
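
To show the behaviour I am after, here is a minimal sketch of a hypothetical helper, get_close_matches_by_key. It is not part of difflib; it only reuses difflib.SequenceMatcher and heapq.nlargest, and the key parameter, the sample data, and the account numbers are all my own assumptions. It scores only the FILTERED field of each entry but returns the whole entry:

```python
import difflib
from heapq import nlargest


def get_close_matches_by_key(word, entries, n=3, cutoff=0.6,
                             key=lambda entry: entry[1]):
    """Score key(entry) against word, but return the whole entries.

    Hypothetical helper; by default the key picks the FILTERED field
    (index 1) of each tuple.
    """
    matcher = difflib.SequenceMatcher()
    matcher.set_seq2(word)
    scored = []
    for entry in entries:
        matcher.set_seq1(key(entry))
        # Cheap upper bounds first, so the full ratio is computed rarely.
        if (matcher.real_quick_ratio() >= cutoff
                and matcher.quick_ratio() >= cutoff):
            score = matcher.ratio()
            if score >= cutoff:
                scored.append((score, entry))
    # Best scores first; entries with equal scores keep their input order.
    best = nlargest(n, scored, key=lambda pair: pair[0])
    return [entry for score, entry in best]


# Hypothetical entries shaped like arrayA: (original, filtered, account, code).
array_a = [
    ("acme company inc", "acme company", "1001", "X"),
    ("acme company llc", "acme company", "1002", "Y"),
    ("widget works", "widget works", "2001", "Z"),
]

matches = get_close_matches_by_key("acme co", array_a, n=2, cutoff=0.7)
for original, filtered, account, code in matches:
    print(original, filtered, account, code)
```

With something like this, two entries that share the same FILTERED string both come back as distinct tuples, so the original name, account, and code never have to be looked up again by the filtered text.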
