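I have a corpus of more than 100,000 sentences and a dictionary. I want to match the dictionary words against the corpus and tag them in the sentences.
Corpus file "testing.txt"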
Hello how are you doing. HiV is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.
Dictionary file "dict.csv"
abc, anxiety, disorder
def, HIV, virus
hij, Malaria, virus
klm, headache, symptom
My Python program
import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams

with open('dictionary.csv', 'r') as csvFile:
    reader = csv.reader(csvFile)
    # Read the whole corpus into memory once
    with open("testing.txt", "rt") as myfile:
        hay = myfile.read()
    with open("match.txt", "w") as my2file:
        for row in reader:
            needle = row[1]
            needle_length = len(needle.split())
            max_sim_val = 0.9  # minimum similarity to count as a match
            max_sim_string = u""
            # Compare the needle against every n-gram of roughly its length
            for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
                hay_ngram = u" ".join(ngram)
                similarity = SM(None, hay_ngram, needle).ratio()
                if similarity > max_sim_val:
                    max_sim_val = similarity
                    max_sim_string = hay_ngram
                    line = [row[1], ' ', str(max_sim_val), ' ', max_sim_string, '\n']
                    my2file.writelines(line)
                    print(line)
My current output is
disorder 0.9333333333333333 anxiety
virus 0.9333333333333333 Malaria
The output I want is
Hello how are you doing. HIV [virus] is dangerous
Malaria [virus] can be cure.
he has anxiety [disorder] thats why he is behaving like that
You can iterate over the lines of testing.txt and replace the matched values; something like this should work.
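The code for this answer did not survive on the page, so here is a minimal sketch of the line-by-line approach, not the original answer's code. It assumes single-word dictionary terms, a case-insensitive comparison, and the same 0.9 similarity threshold as in the question. The helper names `load_terms` and `tag_line` are hypothetical, and the sample data is inlined with `io.StringIO` so the snippet runs on its own.

```python
import csv
import io
import re
from difflib import SequenceMatcher


def load_terms(csv_file):
    """Map each dictionary term (2nd column) to its category (3rd column)."""
    # skipinitialspace drops the blank after each comma in "abc, anxiety, disorder"
    reader = csv.reader(csv_file, skipinitialspace=True)
    return {row[1].strip(): row[2].strip() for row in reader if len(row) >= 3}


def tag_line(line, terms, threshold=0.9):
    """Replace every word that closely matches a term with 'term [category]'."""
    def replace(match):
        word = match.group(0)
        best_term, best_ratio = None, threshold
        for term in terms:
            # Case-insensitive similarity, so "HiV" still matches "HIV"
            ratio = SequenceMatcher(None, word.lower(), term.lower()).ratio()
            if ratio >= best_ratio:
                best_term, best_ratio = term, ratio
        if best_term is None:
            return word  # no term is similar enough; keep the word unchanged
        return f"{best_term} [{terms[best_term]}]"

    # \w+ picks out words and leaves punctuation and spacing untouched
    return re.sub(r"\w+", replace, line)


# Sample data from the question, inlined in place of dict.csv and testing.txt
dict_csv = io.StringIO(
    "abc, anxiety, disorder\n"
    "def, HIV, virus\n"
    "hij, Malaria, virus\n"
    "klm, headache, symptom\n"
)
corpus = [
    "Hello how are you doing. HiV is dangerous",
    "Malaria can be cure",
    "he has anxiety thats why he is behaving like that.",
]

terms = load_terms(dict_csv)
tagged = [tag_line(line, terms) for line in corpus]
for line in tagged:
    print(line)
```

With real files, the same functions can be fed `open("dict.csv", newline="")` and the lines of testing.txt, writing each tagged line to an output file. Every word is compared against every term, so for a corpus of 100,000 sentences and a large dictionary it may be worth trying an exact lowercase lookup first and falling back to the fuzzy comparison only when that fails.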