Answers to the question: Match a large number of keywords in each line of a big file (>3 million lines; ~4 GB in size)


I have a large CSV file (more than 3 million lines; ~4 GB in size) with the following columns:

post_category, post_content, post_date

I am particularly interested in the "post_content" column, which contains medical-domain text of the following form:

Extracorporeal shock wave lithotripsy (ESWL) is a commonly used treatment for obstructive kidney stones, and has been shown to be effective in treating pancreatic stones in patients with symptomatic chronic calcifying pancreatitis

I have a separate file containing several thousand (about 10,000) medical-domain keywords, like this:

File: lookup.txt
pancreas
pancreatic
pancreatitis
acute pancreatitis
chronic pancreatitis
chronic calcifying pancreatitis
....
....

Now I want to search each "post_message" in every line of the big CSV file, extract all matching keywords, and append them to a new column, "keywords".

So the example text above should return: ('pancreatic', 'chronic calcifying pancreatitis')

Note: "pancreatitis" also matches inside "chronic calcifying pancreatitis", but it should not be counted, because the exact match is with the more specific keyword.

The desired output is a CSV file with the following columns:

post_category, post_content, post_date, keywords
"organ: pancreas", "Extracorporeal shock wave lithotripsy…", "July 24, 2014, 10:00 AM", "pancreatic; chronic calcifying pancreatitis"

I tried the code below, but it kept running for 2 days on my 8-core machine without finishing, so I killed it.

How can I reduce the processing time of this code efficiently?

<pre><code># -*- coding: utf-8 -*-
import datetime
import multiprocessing as mp
import numpy as np
import os
import pandas as pd
import re
import sys

KEYWORDS = set(line.strip() for line in open('keywords.txt'))

def clean_raw_text(series):
    # Perform some text pre-processing to remove accented/non-ascii text etc
    # return processed_text

def match_indications(series):
    # Perform actual keyword search of text
    matches = []
    for indication in KEYWORDS:
        matchObj = re.search(indication, str(series['cleaned_post']), flags=re.U)
        if matchObj:
            matches.append(matchObj.group(0))
    return ";".join(matches)

def worker(df):
    name = mp.current_process().name
    print '%s: Processing Started...' % name
    # Holds the new columns for this chunk, aligned on the chunk's index
    result = pd.DataFrame(index=df.index)
    result["cleaned_post"] = df.apply(clean_raw_text, axis=1)
    print "%s : Text Cleaning done.." % name
    result["keywords"] = result.apply(match_indications, axis=1)
    print "%s : 'Keywords matching done.." % name
    return result

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print "%d Arguments Given : Exiting..." % (len(sys.argv)-1)
        print "Usage: python %s <Path-to-Input-File> <Path-to-Output-File>" % sys.argv[0]
        exit(1)

    ifpath = os.path.abspath(sys.argv[1])
    oppath = os.path.abspath(sys.argv[2])

    big_df = pd.read_csv(ifpath, header=0, engine='python', quotechar='"')
    print big_df.info()

    num_processes = mp.cpu_count()
    p = mp.Pool(processes=num_processes)
    split_dfs = np.array_split(big_df, num_processes)
    pool_results = p.map(worker, split_dfs)
    p.close()
    p.join()

    # join parts along rows
    parts = pd.concat(pool_results, axis=0)

    # merge parts to big_df
    big_df = pd.concat([big_df, parts], axis=1)
    print big_df.info()
    big_df.drop('cleaned_post', axis=1, inplace=True)

    # Drop all rows where no keyword was found
    # (match_indications returns an empty string in that case)
    processed_df = big_df[big_df['keywords'] != '']
    print processed_df.info()

    ctime = datetime.datetime.now().strftime('%d-%m-%Y_%H-%M-%S')
    ofpath = os.path.join(oppath, "%s.csv" % ctime)
    processed_df.to_csv(ofpath, sep=",", index=False)
</code></pre>
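The requirement described above (find every keyword in a post, but let a longer keyword win over a shorter one contained in it) can be illustrated with a small sketch. This is not the asker's code and not a benchmarked solution. It is one possible approach in Python 3 using only the standard library: compile all keywords into a single regular expression, ordered longest first, and stream the CSV row by row instead of loading it whole. The function names `build_matcher`, `extract_keywords`, and `annotate_csv` are hypothetical, the column name `post_content` is taken from the question, and case-insensitive matching on word boundaries is an assumption.

```python
import csv
import re


def build_matcher(keywords):
    """Compile every keyword into a single case-insensitive regex."""
    unique = {k.strip() for k in keywords if k.strip()}
    # Longest keywords first: at any position the alternation then tries
    # "chronic calcifying pancreatitis" before plain "pancreatitis".
    ordered = sorted(unique, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    # The lookarounds keep a keyword from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def extract_keywords(matcher, text):
    """Return the distinct matched keywords, in order of first appearance."""
    found = []
    for match in matcher.finditer(text):
        keyword = match.group(0).lower()
        if keyword not in found:
            found.append(keyword)
    return ";".join(found)


def annotate_csv(src, dst, matcher, text_column="post_content"):
    """Copy rows from src to dst, adding a 'keywords' column.

    src and dst are open text file objects. Rows without any keyword
    are skipped, mirroring the filtering step in the question.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["keywords"])
    writer.writeheader()
    for row in reader:
        row["keywords"] = extract_keywords(matcher, row[text_column])
        if row["keywords"]:
            writer.writerow(row)
```

Because `finditer` returns non-overlapping matches, a short keyword that sits inside an already matched longer keyword is not reported separately. Two keywords that only partly overlap would yield just the leftmost one, so this sketch may under-report in that case. For a keyword list of this size, a dedicated multi-pattern algorithm such as Aho-Corasick may scale better than a regex alternation, which has not been measured here.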


1 answer

Anonymous, 1 day ago

