How to write a regex pattern to extract substrings from larger text strings in Python

Posted 2024-05-16 01:21:24


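I have a dataframe, data (with lengthy and inconsistent free-text notes), and matching IDs. My goal is to use a list of substrings to extract the relevant substrings and to create a new column holding the extracted substrings. I was told that regex is a good place to start, but I have not yet come up with a pattern that produces the matching results. I hope someone sees this and can point me in the right direction.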

list = ['sentara williamsburg regional medical',
       'shady grove adventist hospital',
       'sibley memorial hospital',
       'southern maryland hospital center',
       'st. mary`s hospital',
       'suburban hospital healthcare system',
       'the cancer center at lake manassas',
       'ucla medical center',
       'united medical center- greater southeast community',
       'univ of md charles regional medical ctr',
       'university of maryland medical center',
       'university of north carolina hospital',
       'university of virginia health system',
       'unknown facility',
       'va medical center',
       'virginia hospital center-arlington',
       'walter reed army medical center',
       'washington adventist hospital',
       'washington hospital center',
       'wellstar health system, inc',
       'winchester medical center']

 data:
     ID     Notes                             
     530.0  Cancer is best diag @Wwashington Adventist Hospital
     651.0  nan
     692.0  GMC-009 can be accessed at ST. Mary`s but not in UCLA Med. Center
     993.0  I'm not sure of Sibley; however, Shady Grove Adventist Hosp. is great hospital
     044.0  nan
     055.0  2015-01-20 was the day she visited WR Army Medical Center in WDC
     476.0  nan

Expected output (case really doesn't matter!):

 data_out: 
     ID     Notes                             
     530.0  Washington Adventist Hospital
     651.0  nan
     692.0  ST. Mary`s Hospital, UCLA Medical Center
     993.0  Sibley Memorial Hospital, Shady Grove Adventist Hospital
     044.0  nan
     055.0  Walter Reed Army Medical Center
     476.0  nan

2 answers

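Update: this code iterates over the facility names in the list and compares them with the words of the "Notes" column. If a word from a note also occurs in one of the names, that word is written to the new "output" column. You will still need to refine the regular expressions to get the desired result. Note: the names in the list can look quite different from the text in the column while meaning the same thing (abbreviations, misspellings, differences in case), so it is hard to cover every case. It may therefore be useful to approach the problem with a bag-of-words method.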

import re

# Split each note into words; missing notes (NaN) become empty lists
newlist = [note.split() if isinstance(note, str) else [] for note in data["Notes"]]

# Create the new column "output" and default the values to be the same as in the column "Notes"
data["output"] = data["Notes"]

# Run through all names in the list and all words of every note
# ("list" is the variable from the question; it shadows the built-in of the same name)
for name in list:
    for j, words in enumerate(newlist):
        for word in words:
            # re.escape keeps characters such as "." or "`" from being read as regex syntax
            if re.search(re.escape(word), name, flags=re.IGNORECASE):
                data.loc[data.index[j], "output"] = word

If there is a more vectorized approach, I would appreciate comments.
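The bag-of-words idea mentioned in this answer could be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's code: the helper names (`tokens`, `overlap_score`, `match_names`), the rule that a note word of three or more letters counts as an abbreviation when it is a prefix of a name word, and the 0.75 threshold are all hypothetical choices made for the example.

```python
import re

def tokens(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def token_matches(name_tok, note_tok):
    """A note word matches a name word if it is equal to it, or if it is a
    prefix of at least 3 letters (assumed abbreviation, e.g. "hosp" for "hospital")."""
    return name_tok == note_tok or (len(note_tok) >= 3 and name_tok.startswith(note_tok))

def overlap_score(name, note):
    """Fraction of the words in `name` that also occur (possibly abbreviated) in `note`."""
    name_toks = tokens(name)
    note_toks = tokens(note)
    if not name_toks:
        return 0.0
    hits = sum(any(token_matches(nt, t) for t in note_toks) for nt in name_toks)
    return hits / len(name_toks)

def match_names(note, names, threshold=0.75):
    """Return the names whose word overlap with the note reaches the threshold."""
    if not isinstance(note, str):  # NaN / missing notes
        return []
    return [n for n in names if overlap_score(n, note) >= threshold]

# A few names taken from the list in the question
facilities = [
    "shady grove adventist hospital",
    "sibley memorial hospital",
    "ucla medical center",
    "washington adventist hospital",
    "washington hospital center",
]

print(match_names("GMC-009 can be accessed at ST. Mary`s but not in UCLA Med. Center", facilities))
```

With this threshold, "UCLA Med. Center" and "Shady Grove Adventist Hosp." are recognised, but a lone "Sibley" or a misspelling such as "Wwashington" is not, so the threshold and the matching rule would need tuning against the real notes.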

I would do something like:

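import re
# Escape each name so that "." and "`" are matched literally, and ignore case
reg = re.compile('|'.join(map(re.escape, your_list)), re.IGNORECASE)
# findall returns every listed name that occurs anywhere in the text (match only looks at the start)
results = reg.findall(your_data)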
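A slightly fuller sketch of this alternation idea, applied to one note at a time, might look like the following. The function names `build_pattern` and `extract_names` and the short `facilities` list are assumptions made for the example; only the standard `re` module is used.

```python
import re

def build_pattern(names):
    """Compile one case-insensitive alternation of all names, longest first,
    so that a longer name is preferred over a shorter one starting at the same position."""
    ordered = sorted(names, key=len, reverse=True)
    return re.compile("|".join(re.escape(n) for n in ordered), re.IGNORECASE)

def extract_names(note, pattern):
    """Return the listed names found verbatim in `note`, lower-cased, in order of appearance."""
    if not isinstance(note, str):  # NaN / missing notes
        return []
    found = []
    for m in pattern.finditer(note):
        name = m.group(0).lower()
        if name not in found:  # drop repeated mentions
            found.append(name)
    return found

# A few names taken from the list in the question
facilities = [
    "washington adventist hospital",
    "washington hospital center",
    "st. mary`s hospital",
    "ucla medical center",
]
pattern = build_pattern(facilities)

print(extract_names("Cancer is best diag @Wwashington Adventist Hospital", pattern))
```

On the sample notes this finds `washington adventist hospital` in the first row but nothing for "UCLA Med. Center", because a plain alternation only catches names that are written out in full. It could be applied row by row, for example with pandas `Series.apply`.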
