How to write a regex pattern to extract substrings from larger text strings in Python

list = ['sentara williamsburg regional medical', 'shady grove adventist hospital',
        'sibley memorial hospital', 'southern maryland hospital center',
        'st. mary`s hospital', 'suburban hospital healthcare system',
        'the cancer center at lake manassas', 'ucla medical center',
        'united medical center- greater southeast community',
        'univ of md charles regional medical ctr',
        'university of maryland medical center',
        'university of north carolina hospital',
        'university of virginia health system', 'unknown facility',
        'va medical center', 'virginia hospital center-arlington',
        'walter reed army medical center', 'washington adventist hospital',
        'washington hospital center', 'wellstar health system, inc',
        'winchester medical center']

data:

ID     Notes
530.0  Cancer is best diag @Wwashington Adventist Hospital
651.0  nan
692.0  GMC-009 can be accessed at ST. Mary`s but not in UCLA Med. Center
993.0  I'm not sure of Sibley; however, Shady Grove Adventist Hosp. is great hospital
044.0  nan
055.0  2015-01-20 was the day she visited WR Army Medical Center in WDC
476.0  nan

data_out:

ID     Notes
530.0  Washington Adventist Hospital
651.0  nan
692.0  ST. Mary`s Hospital, UCLA Medical Center
993.0  Sibley Memorial Hospital, Shady Grove Adventist Hospital
044.0  nan
055.0  Walter Reed Army Medical Center
476.0  nan

2 answers

User

#1 · edited 2024-05-16 01:21:24

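Update: this code splits each entry of the "Notes" column into words and compares every word with the names in the list. If a word appears in both the list and the notes, it is written to the new "output" column. You have to use regular expressions to get the desired result. Note: a name in the list can look quite different from the text in the column while meaning the same thing (abbreviations, spelling, errors, case sensitivity), so it is hard to cover every case. A bag-of-words approach might therefore be useful for this problem.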

import re

# Split each note into words; missing notes (NaN) become an empty list
newlist = [note.split() if isinstance(note, str) else [] for note in data["Notes"]]

# Create the new column "output", defaulting to the values in the column "Notes"
data["output"] = data["Notes"]

# Compare every word of every note with every name in the list
# (`list` is the variable from the question; note that it shadows the built-in)
for name in list:
    for j, words in enumerate(newlist):
        for word in words:
            # re.escape keeps characters such as "." literal; IGNORECASE handles capitalization
            if re.search(re.escape(word), name, flags=re.IGNORECASE):
                data.loc[data.index[j], "output"] = "' '{0}".format(word)
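The note above about abbreviations and misspellings suggests approximate rather than exact matching. As a rough sketch of that idea (not part of the original answer), the standard-library `difflib.SequenceMatcher` can compare each run of words in a note against each name. The helper name `fuzzy_find` and the `cutoff` value of 0.85 are assumptions chosen for illustration and would need tuning on real data.

```python
from difflib import SequenceMatcher


def fuzzy_find(note, names, cutoff=0.85):
    """Return the names that closely resemble some run of words in `note`."""
    words = note.lower().split()
    found = []
    for name in names:
        size = len(name.split())
        # Slide a window with as many words as the name over the note
        for start in range(len(words) - size + 1):
            window = " ".join(words[start:start + size])
            # ratio() is a similarity score between 0.0 and 1.0
            if SequenceMatcher(None, window, name).ratio() >= cutoff:
                found.append(name)
                break
    return found


# A few names from the question's list
names = [
    "shady grove adventist hospital",
    "sibley memorial hospital",
    "washington adventist hospital",
    "ucla medical center",
]
note = "I'm not sure of Sibley; however, Shady Grove Adventist Hosp. is great hospital"
print(fuzzy_find(note, names))
```

This tolerates small deviations such as "Hosp." for "hospital" or a stray leading character, but a lone "Sibley" or an initialism like "WR" is too far from the full name to pass the cutoff. Cases like those would probably need an explicit alias table.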

If there is a more vectorized approach, I would really appreciate a comment.
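One possible vectorized variant, offered as a sketch rather than a tested solution for the full dataset, builds a single alternation from the names and lets pandas scan the whole column with `Series.str.findall`. The small `names` list and `data` frame below are a hypothetical sample taken from the question. Only names written out in full are found this way.

```python
import re

import pandas as pd

# Hypothetical sample: a few names from the question's list and three of its notes
names = ["washington adventist hospital", "sibley memorial hospital", "ucla medical center"]
data = pd.DataFrame({
    "ID": [530.0, 651.0, 692.0],
    "Notes": [
        "Cancer is best diag @Wwashington Adventist Hospital",
        float("nan"),
        "GMC-009 can be accessed at ST. Mary`s but not in UCLA Med. Center",
    ],
})

# One alternation of escaped names; longer names come first so they are tried first
pattern = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))

# str.findall scans every note in one call; missing notes stay NaN
matches = data["Notes"].str.findall(pattern, flags=re.IGNORECASE)

# Join the matches of each row into one comma-separated string
data["output"] = matches.str.join(", ")
print(data)
```

The first row picks up the hospital name despite the stray "@W" prefix, because the pattern may match anywhere inside the note. The third row finds nothing, since "UCLA Med. Center" is not a literal occurrence of any listed name.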

User

#2 · edited 2024-05-16 01:21:24

I would do something like:

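import re
# Escape each name so that characters such as "." are matched literally
reg = re.compile('|'.join(map(re.escape, your_list)), flags=re.IGNORECASE)
# match() only tests the start of the string; findall() scans the whole text
results = reg.findall(your_data)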
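To make the behaviour of such an alternation concrete, here is a small self-contained sketch. The helper `extract_names` and the shortened `your_list` are assumptions for illustration. It scans a whole note with `finditer`, since `re.match` only looks at the beginning of the string, and lower-cases each hit so it lines up with the spelling used in the list.

```python
import re

# A few names from the question's list (shortened here for illustration)
your_list = [
    "washington adventist hospital",
    "washington hospital center",
    "st. mary`s hospital",
]

# Escape each name and put longer names first so they are tried first
ordered = sorted(your_list, key=len, reverse=True)
reg = re.compile("|".join(map(re.escape, ordered)), flags=re.IGNORECASE)


def extract_names(note):
    """Return the listed names found in `note`, normalized to lower case."""
    return [m.group(0).lower() for m in reg.finditer(note)]


print(extract_names("Cancer is best diag @Wwashington Adventist Hospital"))
# prints ['washington adventist hospital']
print(extract_names("GMC-009 can be accessed at ST. Mary`s but not in UCLA Med. Center"))
# prints []
```

The second call returns nothing because the note says "ST. Mary`s" without the word "hospital". A plain alternation of full names cannot recover shortened forms like that one.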
