I have a huge list of concepts and a huge list of sentences. I want to identify the concepts in each sentence, in the order they appear in that sentence. I am using multithreading and for loops to perform this task:
import queue
import threading

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

def func(sentence):
    sentence_tokens = []
    for item in concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    # sort the matches by their position in the sentence, then drop the positions
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    return sentence_tokens

def do_find_all_concepts(q_in, l_out):
    while True:
        sentence = q_in.get()
        # list.append is thread-safe, but results arrive in completion order,
        # not necessarily in the input order of the sentences
        l_out.append(func(sentence))
        q_in.task_done()

# Queue with default maxsize of 0, i.e. an unbounded queue
sentences_q = queue.Queue()
output = []
counting = 0

# any reasonable number of workers
num_threads = 4
for i in range(num_threads):
    worker = threading.Thread(target=do_find_all_concepts,
                              args=(sentences_q, output))
    # once there's nothing but daemon threads left, Python exits the program
    worker.daemon = True
    worker.start()

# put all the input on the queue
for s in sentences:
    sentences_q.put(s)
    counting = counting + 1
    print(counting)

# wait for the entire queue to be processed
sentences_q.join()
print(output)
This is the most efficient code I have come up with so far, but it is still too slow on my real data set.
My concepts list is sorted alphabetically. So I am wondering whether Python has any indexing or serialisation mechanism that would let me search only the part of the concepts list that shares its first characters with a word in the sentence, instead of scanning the entire concepts list every time.
My main concern is time complexity: by my current estimate, running the full data set would take nearly 1.5 weeks. Space complexity is not a problem. I am happy to provide more details if needed.