Parsing and avoiding nested loops in Python

Posted 2024-06-16 11:39:03


I have a SQL table with 12,000 entries stored in a dataframe df1, like this (sample rows, from the data used in the answers below):

    id      name
    00001   angiocarcoma
    00261   shrimp allergy
    ...

I have another table with 20,000 entries, stored in dataframe df:

    Entry_name   CA
    TRGV2        3BHS1 HSD3B1 3BH HSDB3
    TRGJ1        3BP1 SH3BP1 IF
    ...

The goal is to match, within a sentence, every possible combination of a name from df1 with a name from CA in df (space-separated), with the condition that a CA value must be longer than 2 characters. The simplest logic is to search the sentence for every name value in df1 and, when a match is found, search the same sentence for the CA values. But doing it that way strains resources.

Below is the code I have tried; I could only come up with nested loops for the task. Using two functions creates function-call overhead, and if I try recursion I hit Python's recursion limit, which forces the kernel to shut down. The following function is called with each sentence (I have to parse 500k sentences):

def disease_search(nltk_tokens_sen):
    for dis_index in range(len(df1)):
        disease_name = df1.at[dis_index, 'name']
        regex_for_dis = rf"\b{disease_name}\b"
        matches_for_dis = re.findall(regex_for_dis, nltk_tokens_sen, re.IGNORECASE | re.MULTILINE)
        if len(matches_for_dis) != 0:
            disease_marker(nltk_tokens_sen, disease_name)

If the function above finds a match, it calls this function:

def disease_marker(nltk_tokens_sen, disease_name):
    for zz in range(len(df)):
        biomarker_txt = df.at[zz, 'CA']
        biomarker = biomarker_txt.split(" ")
        for tt in range(len(biomarker)):
            if len(biomarker[tt]) > 2:
                matches_for_marker = re.findall(rf"\b{re.escape(biomarker[tt])}\b", nltk_tokens_sen)
                if len(matches_for_marker) != 0:
                    print("Match_found:", disease_name, biomarker[tt])

Do I need to change my logic entirely, or is there a Pythonic, runtime-efficient way to do this?


3 Answers

Based on the link pasted in the comments, you are looping through all available disease names to find diseases in a given paragraph of words. I suggest instead looping over the words in the paragraph and finding the matching rows in the dataframe.

You can try the following steps:

  1. Split nltk_tokens into a list of words and call it nltk_tokens_words.

  2. Instead of looping over the whole dataframe, use DataFrame string filters such as match and contains to find rows matching the given word list. This reduces looping over the entire DF.

    filtered_rows = [df1['name'].str.contains(word, regex=False) for word in nltk_tokens_words]

  3. Create a combined mask with np and apply it to get the filtered DF.

    combined_mask = np.vstack(filtered_rows).any(axis=0)  # a row qualifies if any word hits

    df1[combined_mask]

  4. Repeat the same steps for the second DF.

Try this and let me know if it helps.
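Putting the steps above together, a minimal sketch (using a toy frame with the question's columns, `regex=False` to treat words literally, and an any-match mask, since a row should qualify when any sentence word hits its name):

```python
import numpy as np
import pandas as pd

# Toy frame with the question's columns
df1 = pd.DataFrame({'id': ['00001', '00261'],
                    'name': ['angiocarcoma', 'shrimp allergy']})

# Step 1: split the sentence into words
nltk_tokens_words = 'very hard angiocarcoma diagnosed 3BHS1'.split()

# Step 2: one boolean mask per word (regex=False treats the word literally)
filtered_rows = [df1['name'].str.contains(w, regex=False) for w in nltk_tokens_words]

# Step 3: combine the masks — a row qualifies if any sentence word matched its name
combined_mask = np.vstack(filtered_rows).any(axis=0)
print(df1[combined_mask])
```

Note that word-by-word matching only partially covers multi-word names such as 'shrimp allergy'; a word of the sentence must equal or contain part of the name for the row to surface.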

Optimizations and modifications

  • Use the Aho-Corasick algorithm to find multiple substrings in a string (much faster than checking each substring against the string in turn)
  • Since diseases and biomarkers are independent, find all diseases and all biomarkers in the string and take the product of the two results (i.e. look up biomarkers once, not once per disease found)
  • The OP wants the disease name and id, plus the biomarker CA and Entry_name, for each result

Results:

  • 33X speedup on the first nltk sentence (107 ms vs. 3.54 s)
  • ~538,000X speedup on subsequent nltk sentences (6.57 µs vs. 3.54 s, when the same keys are reused across sentences)

New code

File: main.py

##################################################################
# Imports keyword data from simulate_data module and processes
# on nltk sentence
##################################################################
if __name__ == "__main__":
    from process import Finder, format_result
    from simulate_data import df_disease, df_biomarker

    # Sentence to process
    nltk_sentence = 'very hard angiocarcoma diagnosed 3BHS1'

    # Set up search engine
    finder = Finder(df_disease, df_biomarker)

    # Process and loop through results
    for d, m in finder.process(nltk_sentence):
        format_result(d, m)

File: process.py

from itertools import product
import ahocorasick as ahc

def make_aho_automaton(keywords):
    A = ahc.Automaton()  # initialize
    for (key, cat) in keywords:
        A.add_word(key, (cat, key)) # add keys and categories
    A.make_automaton() # generate automaton
    return A

class Finder():
    def __init__(self, df_disease, df_biomarker):
        ' Initialize automatons for diseases and biomarkers '
        # Disease keywords: pairs of (name, id) with name as key and id as category
        #   note: underscore on variable names (e.g. _id) guards against shadowing builtins
        disease_keys = ((_name.lower(), _id) for _name, _id in zip(df_disease['name'], df_disease['id']))
        # Biomarker keywords: pairs of (CA, Entry_name) with CA as key and Entry_name as category
        biomarker_keys = (
            (_ca_entry, _entry_name)
            for _ca, _entry_name in zip(df_biomarker['CA'], df_biomarker['Entry_name'])
            for _ca_entry in _ca.split() if len(_ca_entry) >= 3
        )

        # Create Aho-Corasick automatons
        #   Surround keywords with spaces so we match only on word boundaries
        self.disease_automaton = make_aho_automaton((f' {keyw} ', cat) for keyw, cat in disease_keys)
        self.biomarker_automaton = make_aho_automaton((f' {keyw} ', cat) for keyw, cat in biomarker_keys)

    def find_keywords(self, line, A):
        ' Find keywords in line that exist in automaton A (as a generator) '
        # ensure line has spaces at beginning and end, but strip them from the key
        return ((keyw.strip(), cat) for end_index, (cat, keyw) in A.iter(f' {line} '))

    def process(self, nltk_sentence):
        sentence = f' {nltk_sentence} '  # ensure the sentence has a space at both ends (for keyword matching)
        yield from product(self.find_keywords(sentence, self.disease_automaton),
                           self.find_keywords(sentence, self.biomarker_automaton))

def format_result(d, m):
    ' Format results for printing '
    print(f'Match_found: Disease: name {d[0].title()} with id {d[1]}, Biomarker CA {m[0]} with Entry_name {m[1]}')
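The space-padding trick used for the automaton keys can be illustrated on its own (a plain-substring sketch, independent of the Aho-Corasick library):

```python
def pad(s):
    # Surrounding a key with spaces turns a substring test into a
    # word-boundary test, provided the line is padded the same way
    return f' {s} '

sentence = pad('very hard angiocarcoma diagnosed 3BHS1')

assert pad('3BHS1') in sentence    # whole token matches
assert pad('3BH') not in sentence  # '3BH' is only a prefix of '3BHS1'
assert '3BH' in sentence           # an unpadded test would falsely match
```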

File: simulate_data.py

import string
from random import randint, choice
import pandas as pd

def random_word(min_length = 2, max_length = 5, upper = False):
    ' Generate random word '
    if upper:
        # Used for biomarker
        letters = string.ascii_uppercase + string.digits
    else:
        # used by disease
        letters = string.ascii_lowercase
    
    return ''.join(choice(letters) for _ in range(randint(min_length, max_length)))

def random_sentence(min_length = 1, max_length = 3, upper = False):
    ''' Random sentence; upper=True generates letters and digits,
        lowercase letters only otherwise
    '''
    return ' '.join(random_word(upper = upper) for _ in range(randint(min_length, max_length)))

diseases = {"id":['00001', '00261'],
        'name':['angiocarcoma', 'shrimp allergy']}
biomarkers = {'Entry_name':['TRGV2', 'TRGJ1'],
          'CA':['3BHS1 HSD3B1 3BH HSDB3', '3BP1 SH3BP1 IF']}

N = 10000
for i in range(N):
    diseases['id'].append(str(i+300).zfill(5))
    diseases['name'].append(random_sentence(min_length = 1, max_length = 5, upper = False))
    biomarkers['Entry_name'].append(random_word(4, 5, True))
    biomarkers['CA'].append(random_sentence(2, 6, True))
    
df_disease = pd.DataFrame(diseases)
df_biomarker = pd.DataFrame(biomarkers)

Output

Example 1: a single nltk sentence

nltk_sentence = 'allergy very hard angiocarcoma diagnosed 3BH 3BHS1'
finder = Finder(df_disease, df_biomarker)  # Finder for disease and biomarker keys
for d, m in finder.process(nltk_sentence): # apply to nltk_sentence
    format_result(d, m)
    
 # Out:
 # Match_found: Disease: name Angiocarcoma with id 00001, Biomarker CA 3BH with Entry_name TRGV2
 # Match_found: Disease: name Angiocarcoma with id 00001, Biomarker CA 3BHS1 with Entry_name TRGV2

Example 2: looping over multiple nltk sentences

sentences = ("Very hard angiocarcoma diagnosed 3BHS1\n"                # Note: Un-Capitalized A in Angiocarcoma
             "very hard Angiocarcoma diagnosed IF\n"                   # Note: Biomarker IF is too short
             "very hard angiocarcoma diagnosed 3BP0\n"                 # Note: Un-Capitalized A in Angiocarcoma
             "Very hard Angiocarcoma diagnosed 3BP0 3BHS1\n"           # Note: Capitalized A in Angiocarcoma
             "Fish allergy very hard angiocarcoma diagnosed 3BP0 3BHS1"
            )

for sentence in sentences.split('\n'):
    print(sentence)
    for d, m in finder.process(sentence):
        format_result(d, m)
    print()
 
# Out:
# Very hard angiocarcoma diagnosed 3BHS1
# Match_found: Disease: name Angiocarcoma with id 00001, Biomarker CA 3BHS1 with Entry_name TRGV2
#
# very hard Angiocarcoma diagnosed IF
#
# very hard angiocarcoma diagnosed 3BP0
#
# Very hard Angiocarcoma diagnosed 3BP0 3BHS1
#
# Fish allergy very hard angiocarcoma diagnosed 3BP0 3BHS1
# Match_found: Disease: name Angiocarcoma with id 00001, Biomarker CA 3BHS1 with Entry_name TRGV2

Timing tests

Using a dataset with over 10K records (see simulate_data.py above).

Modifications to the OP's code (to make the timing comparison fair):

  • Use generators to avoid printing (which slows timing down)
  • Use more meaningful names than df and df1 (irrelevant to timing)

Modified OP code (used for timing the original approach):

def disease_search(nltk_tokens_sen):
    for dis_index in range(len(df_disease)):
        disease_name = df_disease.at[dis_index, 'name']
        regex_for_dis = rf"\b{disease_name}\b"
        matches_for_dis = re.findall(regex_for_dis, nltk_tokens_sen, re.IGNORECASE | re.MULTILINE)
        if len(matches_for_dis) != 0:
            yield from disease_marker(nltk_tokens_sen, disease_name)

def disease_marker(nltk_tokens_sen, disease_name):
    for zz in range(len(df_biomarker)):
        biomarker_txt = df_biomarker.at[zz, 'CA']
        biomarker = biomarker_txt.split(" ")
        for tt in range(len(biomarker)):
            if len(biomarker[tt]) > 2:
                matches_for_marker = re.findall(rf"\b{re.escape(biomarker[tt])}\b", nltk_tokens_sen)
                if len(matches_for_marker) != 0:
                    yield disease_name, biomarker[tt]

    

Timing results

Using Jupyter notebook magic commands.

Timing the new code:

%timeit finder = Finder(df_disease, df_biomarker) # Initialization
107 ms ± 4.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(finder.process(nltk_sentence))       # Per nltk string
6.57 µs ± 618 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Total: ~107 ms to set up and 6.57 µs per nltk sentence

Timing the original code:

 %timeit list(disease_search(nltk_sentence))
 3.54 s ± 446 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Try this and let me know. It should be more time-efficient, since access structures such as lists and dicts are faster than pandas dataframes, and there is a fast preliminary selection of valid items (without the re library):

# necessary imports
import pandas as pd
import itertools
import re

# test dataframes
df1 = pd.DataFrame({
    'id': ['00001','00261','00002'],
    'name': ['angiocarcoma', 'shrimp allergy', 'fish allergy']
})

df = pd.DataFrame({
    'Entry_name': ['TRGV2','TRGJ1','TRGJ2'],
    'CA': ['3BHS1 HSD3B1 3BH HSDB3', '3BP1 SH3BP1 IF', '3BP0']
})

# redesign data structures you work with
# set() will deduplicate for you
disease_list = list(set(df1['name']))
CA_list = list(set(df['CA']))
valid_CA_list_tmp = list(itertools.chain(*[x.split() for x in CA_list]))
valid_CA_list = [x for x in valid_CA_list_tmp if len(x)>2]

# the function
def disease_search_v2(nltk_tokens_sen):
    """Takes a string as input"""
    sen_lower = nltk_tokens_sen.lower()  # lowercase once, not per list item

    # preliminary selection via cheap substring tests
    found_diseases_preliminary = [x for x in disease_list if x.lower() in sen_lower]
    found_CA_preliminary = [x for x in valid_CA_list if x.lower() in sen_lower]

    # exact word-boundary check with re, only on the preliminary hits
    found_diseases = [x for x in found_diseases_preliminary if re.search(rf"\b{x}\b", nltk_tokens_sen)]
    found_CA = [x for x in found_CA_preliminary if re.search(rf"\b{x}\b", nltk_tokens_sen)]

    if len(found_diseases) > 0 and len(found_CA) > 0:
        return {x: found_CA for x in found_diseases}
    else:
        return {}

# testing cases
disease_search_v2('very hard angiocarcoma diagnosed 3BHS1')
disease_search_v2('very hard angiocarcoma diagnosed IF')
disease_search_v2('very hard angiocarcoma diagnosed 3BP0')
disease_search_v2('very hard angiocarcoma diagnosed 3BP0 3BHS1')
disease_search_v2('fish allergy very hard angiocarcoma diagnosed 3BP0 3BHS1')
disease_search_v2('fish allergy very hard angiocarcoma diagnosed 3BP0 3BHS1\nfish allergy very hard angiocarcoma diagnosed 3BP0 3BHS1\n')
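The effect of the two-phase filter shows up on the first test case: the cheap substring pass lets 3BH through (it occurs inside 3BHS1), and the regex pass then rejects it. A condensed, self-contained sketch:

```python
import itertools
import re
import pandas as pd

df = pd.DataFrame({'Entry_name': ['TRGV2', 'TRGJ1', 'TRGJ2'],
                   'CA': ['3BHS1 HSD3B1 3BH HSDB3', '3BP1 SH3BP1 IF', '3BP0']})

# flatten CA cells into individual markers, keeping only those longer than 2 chars
valid_CA_list = [x for x in itertools.chain(*[ca.split() for ca in set(df['CA'])])
                 if len(x) > 2]

sen = 'very hard angiocarcoma diagnosed 3BHS1'

# Phase 1: cheap substring test — '3BH' passes because it occurs inside '3BHS1'
prelim = [x for x in valid_CA_list if x.lower() in sen.lower()]

# Phase 2: word-boundary regex only on the survivors — '3BH' is rejected
found = [x for x in prelim if re.search(rf"\b{x}\b", sen)]

print(sorted(prelim), found)  # → ['3BH', '3BHS1'] ['3BHS1']
```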
