Return sentences from a sentence list based on user-specified keywords
I have a list of sentences (about 20,000 of them) stored in an Excel file named list.xlsx, in a sheet named Sentence, where the column is also named Sentence.
My goal is to take words entered by the user and return the sentences that contain those exact words.
The code I currently have, built with spaCy, does this, but it takes a long time to check and return results.
Is there another, more time-efficient way to achieve this?
I have noticed that the search feature in editors like Geany, or in LibreOffice Calc, returns matching sentences almost instantly.
How do they do that?
Any help is appreciated.
import pandas as pd
import spacy

# Load the English language model in spaCy
nlp = spacy.load("en_core_web_sm")

# Function to extract sentences containing the keyword
def extract_sentences_with_keyword(text, keyword):
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents if keyword in sent.text.lower()]
    return sentences

i = input("Enter Keyword(s):")

# Read the Excel file
file_path = "list.xlsx"
sheet_name = "Sentence"   # Update with your sheet name
column_name = "Sentence"  # Update with the column containing text data
data = pd.read_excel(file_path, sheet_name=sheet_name)

# Iterate over the rows and extract sentences with the keyword
keyword = i.lower()  # Lowercased to match the lowercase comparison above
for index, row in data.iterrows():
    text = row[column_name]
    sentences = extract_sentences_with_keyword(text, keyword)
    if sentences:
        for sentence in sentences:
            print(sentence)
        print("\n")
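For comparison, the instant search in Geany or LibreOffice Calc is essentially a plain substring scan. The same idea can be sketched with pandas alone, with no spaCy at all (a minimal sketch, assuming the list.xlsx layout described above; the helper name find_sentences is mine):

```python
import pandas as pd

def find_sentences(df, keyword, column="Sentence"):
    # Case-insensitive literal substring match; regex=False makes
    # str.contains treat the keyword as plain text, not a pattern.
    mask = df[column].str.lower().str.contains(keyword.lower(), regex=False)
    return df.loc[mask, column].tolist()

# Usage with the file from the question:
# data = pd.read_excel("list.xlsx", sheet_name="Sentence")
# for sentence in find_sentences(data, input("Enter Keyword(s):")):
#     print(sentence)
```

Because str.contains is vectorized over the whole column, this scans 20,000 rows in a fraction of a second.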
1 Answer
You can use SQLite with a full-text index. I tried the sample code below on a 6 MB text file and it ran very fast. You will of course need to adapt the code to your needs; splitting into sentences with spaCy, as you mentioned, is also a fine option.
import sqlite3
import re

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE VIRTUAL TABLE fts_sentences USING fts5(content)')

def load_and_split_file(file_path):
    sentence_endings = r'[.!?]\s+|\s*$'
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    sentences = re.split(sentence_endings, text)
    return sentences

def insert_sentences(sentences):
    for sentence in sentences:
        cursor.execute('INSERT INTO fts_sentences (content) VALUES (?)', (sentence,))
    conn.commit()

def search_word(word):
    cursor.execute('SELECT content FROM fts_sentences WHERE fts_sentences MATCH ?', (word,))
    return cursor.fetchall()

file_path = 'big.txt'
sentences = load_and_split_file(file_path)
insert_sentences(sentences)

while True:
    word_to_search = input('Enter a word to search for: ')
    matching_sentences = search_word(word_to_search)
    for sentence in matching_sentences:
        print(sentence[0])
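Note that the MATCH argument is an FTS5 query expression, not a plain string, so you get prefix and phrase search essentially for free. A small self-contained sketch of the syntax (the sample sentences here are mine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE VIRTUAL TABLE fts_sentences USING fts5(content)")
cur.executemany("INSERT INTO fts_sentences (content) VALUES (?)",
                [("The quick brown fox.",), ("A quiet evening.",)])

# Prefix query: "qui*" matches any token starting with "qui",
# so it hits both "quick" and "quiet".
prefix = cur.execute(
    "SELECT content FROM fts_sentences WHERE fts_sentences MATCH ?",
    ("qui*",)).fetchall()

# Phrase query: double quotes require the tokens to appear adjacently.
phrase = cur.execute(
    "SELECT content FROM fts_sentences WHERE fts_sentences MATCH ?",
    ('"quick brown"',)).fetchall()
```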
Your spaCy code runs slowly because you have not disabled any pipeline components, so spaCy also performs work you do not need, such as part-of-speech tagging. For more detail, see https://spacy.io/usage/processing-pipelines
Quoting the documentation (you may need to disable more or fewer components):
import spacy

texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])