Comparing words in different files

import glob
from collections import Counter

path = "c-darwin-chapter-?.txt"
wordcount = {}

for filename in glob.glob(path):
    with open("c-darwin-chapter-1.txt", 'r') as f1, open("c-darwin-chapter-2.txt", 'r') as f2:
        f1_word_list = Counter(f1.read().replace(',','').replace('.','').replace("'",'').replace('!','').replace('&','').replace(';','').replace('(','').replace(')','').replace(':','').replace('?','').lower().split())
        print("Total word count per file: ", sum(f1_word_list.values()))
        print("Total unique word count: ", len(f1_word_list))

        f2_word_list = Counter(f2.read().replace(',','').replace('.','').replace("'",'').replace('!','').replace('&','').replace(';','').replace('(','').replace(')','').replace(':','').replace('?','').lower().split())
        print("Total word count per file: ", sum(f2_word_list.values()))
        print("Total unique word count: ", len(f2_word_list))

#if/main commented out but final code must use if/main and loop
#if __name__ == '__main__':
#    main()

Total word count
Chapter1 = 11615
Chapter2 = 4837

Unique word count
Chapter1 = 1991
Chapter2 = 1025

Words in Chapter1 and Chapter2: 623
Words in Chapter1 not in Chapter2: 1368
Words in Chapter2 not in Chapter1: 402

1 Answer

User

#1 · Posted 2024-05-13 14:58:13

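Read in both files and convert the text you read into lists/sets. With sets, you can use the set operators to compute the intersection/difference between them: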

s.intersection(t)    s & t    new set with elements common to s and t  
s.difference(t)      s - t    new set with elements in s but not in t
An explanatory table of set operations can be found in the Python 2.x documentation; it is valid for 3.7 as well.

Demo:

file1 = "This is some text in some file that you can preprocess as you " +\
        "like. This is some text in some file that you can preprocess as you like."

file2 = "this is other text about animals and flowers and flowers and " +\
        "animals but not animal-flowers that has to be processed as well"

# split into list - no .lower().replace(...) - you solved that already
list_f1 = file1.split() 
list_f2 = file2.split()

# create sets from list (case sensitive)
set_f1 = set( list_f1 )
set_f2 = set( list_f2 )

print(f"Words: {len(list_f1)} vs {len(list_f2)} Unique {len(set_f1)} vs {len(set_f2)}.")
# difference
print(f"Only in 1: {set_f1-set_f2} [{len(set_f1-set_f2)}]")
# intersection
print(f"In both {set_f1&set_f2} [{len(set_f1&set_f2)}]")
# difference the other way round
print(f"Only in 2:{set_f2-set_f1} [{len(set_f2-set_f1)}]")

Output:

Words: 28 vs 22 Unique 12 vs 18.
Only in 1: {'like.', 'in', 'you', 'can', 'file', 'This', 'preprocess', 'some'} [8]
In both {'is', 'that', 'text', 'as'} [4]
Only in 2:{'animals', 'not', 'but', 'animal-flowers', 'to', 'processed',
           'has', 'be', 'and', 'well', 'this', 'about', 'other', 'flowers'} [14]

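You already handle the file reading and the "unifying" to lowercase, so I left that out here. The output uses the string interpolation syntax of Python 3.6: see PEP 498.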
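For completeness, here is a minimal sketch of how the Counter from the question and the set operators from the demo might be combined. The helper names `count_words` and `compare` are made up for illustration, and the inline strings only stand in for the chapter files; with real files you would pass the result of `f.read()` instead. It strips the same punctuation characters the question removes with chained `.replace()` calls.

```python
from collections import Counter

# Same characters the question strips with chained .replace() calls.
PUNCTUATION = ",.'!&;():?"
STRIP_TABLE = str.maketrans("", "", PUNCTUATION)


def count_words(text):
    """Drop punctuation, lowercase the text, and count each word."""
    return Counter(text.translate(STRIP_TABLE).lower().split())


def compare(text1, text2):
    """Return totals, unique counts, and set-based overlaps for two texts."""
    c1, c2 = count_words(text1), count_words(text2)
    words1, words2 = set(c1), set(c2)  # the Counter keys are the unique words
    return {
        "total": (sum(c1.values()), sum(c2.values())),
        "unique": (len(words1), len(words2)),
        "in_both": len(words1 & words2),      # intersection
        "only_in_1": len(words1 - words2),    # difference
        "only_in_2": len(words2 - words1),    # difference the other way round
    }


def main():
    # Placeholder texts; replace with the contents of the two chapter files.
    chapter1 = "This is some text. This is some file!"
    chapter2 = "this is other text about animals"
    for name, value in compare(chapter1, chapter2).items():
        print(f"{name}: {value}")


if __name__ == "__main__":
    main()
```

Because the overlaps are computed on the Counter keys, the word frequencies stay available in `c1` and `c2` if you need them later, while the set operators only look at which words occur at all.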
