Tagging words in sentences using a custom dictionary

Posted 2024-05-23 14:21:14


User


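I have a corpus of more than 100,000 sentences and a dictionary. I want to match the dictionary words against the corpus and tag them in the sentences.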

Corpus file "testing.txt"

Hello how are you doing. HiV is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.

Dictionary file "dict.csv"

abc, anxiety, disorder
def, HIV, virus
hij, Malaria, virus
klm, headache, symptom

My Python program

import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams

# Read the whole corpus into memory once.
with open("testing.txt", "rt") as myfile:
    hay = myfile.read()

# Keep the dictionary open while iterating: csv.reader reads lazily.
with open('dict.csv', 'r') as csvFile, open("match.txt", "w") as my2file:
    reader = csv.reader(csvFile)
    for row in reader:
        needle = row[1]  # second column: the term to search for
        needle_length = len(needle.split())
        max_sim_val = 0.9  # minimum similarity required for a match
        max_sim_string = ""
        # Compare the term with every n-gram of roughly the same word count.
        for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
            hay_ngram = " ".join(ngram)

            similarity = SM(None, hay_ngram, needle).ratio()
            if similarity > max_sim_val:
                max_sim_val = similarity
                max_sim_string = hay_ngram
                # "out" instead of "str", to avoid shadowing the built-in
                out = [row[1], ' ', str(max_sim_val), ' ', max_sim_string, '\n']
                my2file.writelines(out)
                print(out)

My current output is

 disorder 0.9333333333333333 anxiety
 virus 0.9333333333333333 Malaria

The output I want

 Hello how are you doing. HIV [virus] is dangerous
 Malaria [virus] can be cure.
 he has anxiety [disorder] thats why he is behaving like that


1 answer

User

#1 · Posted 2024-05-23 14:21:14

You can iterate over the lines of testing.txt and replace the matched values. Something like this should work:

...
if similarity > max_sim_val:
    max_sim_val = similarity
    max_sim_string = hay_ngram
    out = [row[1], ' ', str(max_sim_val), ' ', max_sim_string, '\n']
    my2file.writelines(out)
    print(out)

    # Print the first corpus line containing the match, with the tag appended.
    # row[2] is the label column (e.g. "virus"); strip the space after the comma.
    for line in hay.splitlines():
        if max_sim_string in line:
            print(line.replace(max_sim_string, f"{max_sim_string} [{row[2].strip()}]"))
            break
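
For reference, here is a minimal self-contained sketch of the same idea that goes line by line instead of n-gram by n-gram. It is not the asker's exact method, and it rests on several assumptions: every dictionary term is a single word, the CSV columns are id, term, label, and matching should ignore case. The names `load_terms`, `tag_line`, `DICT_CSV` and `CORPUS` are made up for illustration, and the sample data is inlined so the snippet runs without any files.

```python
import csv
import io
import re
from difflib import SequenceMatcher

# Sample data copied from the question, inlined so the sketch runs on its own.
DICT_CSV = """abc, anxiety, disorder
def, HIV, virus
hij, Malaria, virus
klm, headache, symptom
"""

CORPUS = """Hello how are you doing. HiV is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.
"""


def load_terms(csv_file):
    """Read (id, term, label) rows and return {lowercased term: label}."""
    # skipinitialspace drops the blank that follows each comma in the file.
    reader = csv.reader(csv_file, skipinitialspace=True)
    return {row[1].strip().lower(): row[2].strip()
            for row in reader if len(row) >= 3}


def tag_line(line, terms, threshold=0.9):
    """Append ' [label]' after every word that matches a dictionary term."""
    def annotate(match):
        word = match.group(0)
        key = word.lower()
        if key in terms:  # cheap exact lookup first
            return f"{word} [{terms[key]}]"
        # Fuzzy fallback: keep the best-scoring term at or above the threshold.
        best_label, best_score = None, threshold
        for term, label in terms.items():
            score = SequenceMatcher(None, key, term).ratio()
            if score >= best_score:
                best_label, best_score = label, score
        return f"{word} [{best_label}]" if best_label else word

    return re.sub(r"[A-Za-z]+", annotate, line)


terms = load_terms(io.StringIO(DICT_CSV))
for sentence in CORPUS.splitlines():
    print(tag_line(sentence, terms))
```

With real files, replace `io.StringIO(DICT_CSV)` with an open handle to dict.csv and loop over the lines of testing.txt. The original word is kept as written, so "HiV" is tagged but not rewritten to "HIV". The fuzzy fallback compares every unmatched word with every term, which may be slow on 100,000 sentences; caching results per distinct word (for example with `functools.lru_cache`) is one possible way to cut that down.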
