Tagging words in sentences using a custom dictionary

Posted 2024-05-23 14:21:14

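I have a corpus of more than 100,000 sentences and a dictionary. I want to find the dictionary words in the corpus and tag them in the sentences.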

Corpus file "testing.txt"

Hello how are you doing. HiV is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.

Dictionary file "dict.csv"

abc, anxiety, disorder
def, HIV, virus
hij, Malaria, virus
klm, headache, symptom

My Python program

import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams

# Read the whole corpus into memory once.
with open("testing.txt", "rt") as myfile:
    hay = myfile.read()

with open('dictionary.csv', 'r') as csvFile, open("match.txt", "w") as my2file:
    reader = csv.reader(csvFile)
    # The loop has to run while the CSV file is still open.
    for row in reader:
        needle = row[1]
        needle_length = len(needle.split())
        max_sim_val = 0.9
        max_sim_string = u""
        for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
            hay_ngram = u" ".join(ngram)

            similarity = SM(None, hay_ngram, needle).ratio()
            if similarity > max_sim_val:
                max_sim_val = similarity
                max_sim_string = hay_ngram
                # "parts" avoids shadowing the built-in str.
                parts = [row[1], ' ', str(max_sim_val), ' ', max_sim_string, '\n']
                my2file.writelines(parts)
                print(parts)

My current output

 disorder 0.9333333333333333 anxiety
 virus 0.9333333333333333 Malaria

My desired output

 Hello how are you doing. HIV [virus] is dangerous
 Malaria [virus] can be cure.
 he has anxiety [disorder] thats why he is behaving like that

1 answer
User
#1 · Posted 2024-05-23 14:21:14

You can iterate over the lines of testing.txt and replace the matched values. Something like this should work:

...
if similarity > max_sim_val:
    max_sim_val = similarity
    max_sim_string = hay_ngram
    parts = [row[1], ' ', str(max_sim_val), ' ', max_sim_string, '\n']
    my2file.writelines(parts)
    print(parts)

    # Print the first corpus line that contains the match, tagged with the
    # category from the third CSV column (row[2]), e.g. "anxiety [disorder]".
    for line in hay.splitlines():
        if max_sim_string in line:
            print(line.replace(max_sim_string, f"{max_sim_string} [{row[2].strip()}]"))
            break
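
For reference, here is a minimal self-contained sketch of the same idea applied line by line, so that every sentence is emitted with its tags. It is not the asker's program: the helper names `load_terms` and `tag_line` are hypothetical, it assumes single-word dictionary terms, it compares case-insensitively (so "HiV" matches "HIV" but keeps its original spelling), and it passes `skipinitialspace=True` to `csv.reader` so that the blank after each comma does not become part of the term. The sample data is inlined with `io.StringIO` so the sketch runs without any files.

```python
import csv
import io
from difflib import SequenceMatcher


def load_terms(csv_file):
    """Return (term, tag) pairs from rows shaped like 'abc, anxiety, disorder'."""
    # skipinitialspace drops the blank after each comma, so the term is
    # 'anxiety' rather than ' anxiety'.
    reader = csv.reader(csv_file, skipinitialspace=True)
    return [(row[1], row[2]) for row in reader if len(row) >= 3]


def tag_line(line, terms, threshold=0.9):
    """Append ' [tag]' after each word that is similar enough to a term."""
    tagged = []
    for word in line.split():
        # Compare without surrounding punctuation, ignoring case.
        core = word.strip(".,;:!?").lower()
        for term, tag in terms:
            ratio = SequenceMatcher(None, core, term.lower()).ratio()
            if ratio >= threshold:
                word = f"{word} [{tag}]"
                break
        tagged.append(word)
    return " ".join(tagged)


# Sample data copied from the question (assumed file contents).
DICT_TEXT = """abc, anxiety, disorder
def, HIV, virus
hij, Malaria, virus
klm, headache, symptom
"""
CORPUS_TEXT = """Hello how are you doing. HiV is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.
"""

terms = load_terms(io.StringIO(DICT_TEXT))
for sentence in CORPUS_TEXT.splitlines():
    print(tag_line(sentence, terms))
```

Comparing every word with every term costs time proportional to words × terms, which may be slow for 100,000 sentences. If exact (case-insensitive) matches are enough, a dict lookup on the lowercased word would avoid the `SequenceMatcher` call entirely.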
