Return sentences from a sentence list based on user-specified keywords

0 votes
1 answer
60 views
Asked 2025-04-14 17:26

I have a list of sentences (about 20,000) stored in an Excel file named list.xlsx, in a sheet named Sentence, under a column also named Sentence.

My goal is to take one or more words entered by the user and return the sentences that contain those exact words.

The code I currently have, written with spaCy, does this, but it takes a long time to check and return the results.

Is there a less time-consuming way to achieve this?

I have noticed that the search function in an editor like Geany, or in LibreOffice Calc, returns matching sentences almost instantly.

How is that done?

Please help.

import pandas as pd
import spacy

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Return the sentences in `text` that contain the keyword (case-insensitive)
def extract_sentences_with_keyword(text, keyword):
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents if keyword in sent.text.lower()]
    return sentences

# Lowercase the input once so the comparison above is case-insensitive
keyword = input("Enter Keyword(s): ").lower()

# Read the Excel file
file_path = "list.xlsx"
sheet_name = "Sentence"  # Update with your sheet name
column_name = "Sentence"  # Update with the column containing text data

data = pd.read_excel(file_path, sheet_name=sheet_name)

# Iterate over the rows and extract sentences containing the keyword
for index, row in data.iterrows():
    text = str(row[column_name])  # Guard against non-string cells
    sentences = extract_sentences_with_keyword(text, keyword)

    if sentences:
        for sentence in sentences:
            print(sentence)
        print("\n")

1 Answer

1

You can use SQLite with a full-text index. I tried the example code below on a 6 MB text file and it runs very fast. You will of course need to adapt the code to your needs; splitting sentences with spaCy, as you mentioned, is also a good option.

import sqlite3
import re

# In-memory database; pass a file path instead to persist the index
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# FTS5 virtual table with one full-text-indexed column
cursor.execute('CREATE VIRTUAL TABLE fts_sentences USING fts5(content)')

def load_and_split_file(file_path):
    # Naive sentence splitting on ., ! or ? followed by whitespace
    sentence_endings = r'[.!?]\s+|\s*$'
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
        sentences = re.split(sentence_endings, text)
        return sentences

def insert_sentences(sentences):
    for sentence in sentences:
        cursor.execute('INSERT INTO fts_sentences (content) VALUES (?)', (sentence,))
    conn.commit()

def search_word(word):
    # MATCH uses the full-text index, so lookups stay fast on large tables
    cursor.execute('SELECT content FROM fts_sentences WHERE fts_sentences MATCH ?', (word,))
    return cursor.fetchall()

file_path = 'big.txt'
sentences = load_and_split_file(file_path)
insert_sentences(sentences)

while True:
    word_to_search = input('Enter a word to search for: ')
    matching_sentences = search_word(word_to_search)

    for sentence in matching_sentences:
        print(sentence[0])
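
Since your sentences are already in list.xlsx rather than a plain text file, loading them into the index is even simpler. Here is a minimal sketch, assuming the same fts_sentences table as above and the sheet/column names from your question:

import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE VIRTUAL TABLE fts_sentences USING fts5(content)')

# Read the existing spreadsheet: sheet "Sentence", column "Sentence"
df = pd.read_excel('list.xlsx', sheet_name='Sentence')

# executemany inserts all ~20,000 rows in one call, which is faster
# than looping over single INSERT statements in Python
rows = [(str(s),) for s in df['Sentence'].dropna()]
cursor.executemany('INSERT INTO fts_sentences (content) VALUES (?)', rows)
conn.commit()

One note: MATCH treats its argument as an FTS5 query expression, so words like AND/OR/NOT are interpreted as operators. If you want user input matched literally, wrap it in double quotes (a quoted string is a phrase query in FTS5).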

Your spaCy code runs slowly because you do not disable any pipeline components, so it also performs work you do not need, such as part-of-speech tagging. For more details see: https://spacy.io/usage/processing-pipelines

Quoting the documentation (you may need to disable more or fewer components):

import spacy

texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
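
One caveat for your use case: in en_core_web_sm the sentence boundaries behind doc.sents come from the dependency parser, so disabling "parser" would break your sentence splitting. If sentence splitting is all you need from spaCy, a rule-based Sentencizer in a blank pipeline is far faster than the full model. A minimal sketch, assuming English text:

import spacy

# Blank English pipeline: just the tokenizer, no statistical components
nlp = spacy.blank("en")

# Rule-based, punctuation-driven sentence splitter; much faster than
# inferring sentence boundaries with the dependency parser
nlp.add_pipe("sentencizer")

text = "Net income was $9.4 million. Revenue exceeded twelve billion dollars."
doc = nlp(text)
print([sent.text for sent in doc.sents])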
