Indexing a search list to improve Python performance

Published 2024-03-29 12:39:16


I have a huge list of concepts and a huge list of sentences. I want to identify the concepts in each sentence, in the order in which they appear in that sentence. I am using multithreading and for loops to perform this task.

import queue
import threading

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

def func(sentence):
    # collect every concept found in the sentence together with the
    # character index of its first occurrence
    sentence_tokens = []
    for item in concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    # keep only the concept strings, ordered by position of appearance
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    return sentence_tokens

def do_find_all_concepts(q_in, l_out):
    while True:
        sentence = q_in.get()
        l_out.append(func(sentence))
        q_in.task_done()

# Queue with default maxsize of 0, infinite queue size
sentences_q = queue.Queue()
output = []
counting = 0

# any reasonable number of workers
num_threads = 4
for i in range(num_threads):
    worker = threading.Thread(target=do_find_all_concepts, args=(sentences_q, output))
    # once there's nothing but daemon threads left, Python exits the program
    worker.daemon = True
    worker.start()

# put all the input on the queue
for s in sentences:
    sentences_q.put(s)
    counting = counting + 1
    print(counting)

# wait for the entire queue to be processed
sentences_q.join()
print(output)

This is the most efficient code I have come up with so far. However, it is still too slow on my real dataset.

My concepts list is sorted alphabetically. I would therefore like to know whether Python offers any indexing or serialisation mechanism that would let me search only the part of the concepts list matching the characters of the first word in a phrase of the sentence, instead of scanning the entire concepts list every time.
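
To illustrate the kind of mechanism I have in mind, here is a rough sketch (assuming concepts only need to match on whole-word boundaries, which is slightly stricter than the str.find used above; concept_index and func_indexed are just names I made up, not an existing Python feature): bucket the concepts by their first word once, then for each word of a sentence only scan that small bucket.

from collections import defaultdict

# index built once: first word of each concept -> concepts starting with it
concept_index = defaultdict(list)
for c in concepts:
    concept_index[c.split()[0]].append(c)

def func_indexed(sentence):
    words = sentence.split()
    hits = {}  # concept -> word position of its first occurrence
    for pos, word in enumerate(words):
        # only the few concepts whose first word is `word` are checked
        for concept in concept_index.get(word, ()):
            c_words = concept.split()
            if concept not in hits and words[pos:pos + len(c_words)] == c_words:
                hits[concept] = pos
    # same output shape as func(): concepts ordered by first appearance
    return [c for c, _ in sorted(hits.items(), key=lambda kv: kv[1])]

print(func_indexed(sentences[0]))

Since my real concepts list is sorted alphabetically, I imagine the standard bisect module could achieve something similar by slicing out only the concepts that start with the characters of the current word, but I am not sure which approach is idiomatic.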

My main concern is time complexity (based on my current estimate, running the full data would take nearly 1.5 weeks). Space complexity is not an issue.

I am happy to provide more details if needed.

