Unable to match a specific string with a regular expression

"[CLASS 8B PHY | TUE | 9AM to 9:40AM BigBlueButtonBN, BigBlueButtonBN, CLASS 8AB 2ND LG (HINDI) | TUE | 10AM to 10:40AM BigBlueButtonBN, BigBlueButtonBN, CLASS 8AB 2ND LG (BENGALI) | TUE | 10AM to 10:40AM BigBlueButtonBN, BigBlueButtonBN, CLASS 8AB 2ND LG (NEPALI) | TUE | 10AM to 10:40AM BigBlueButtonBN, BigBlueButtonBN, CLASS 8B GEOG | TUE | 11AM to 11:40AM BigBlueButtonBN, BigBlueButtonBN, CLASS 8B BIO | TUE | 12NOON to 12:40PM BigBlueButtonBN, BigBlueButtonBN, CLASS 8AB CP APP | TUE | 5PM to 5:40PM BigBlueButtonBN, BigBlueButtonBN, CLASS 8AB CM APP | TUE | 5PM to 5:40PM BigBlueButtonBN, BigBlueButtonBN]"

import re

# The scraped text, written here as a string literal so the snippet runs
html_text = "[CLASS 8B PHY | TUE | 9AM to 9:40AM BigBlueButtonBN, BigBlueButtonBN, CLASS 8AB 2ND LG (HINDI) | TUE | 10AM to 10:40AM BigBlueButtonBN, BigBlueButtonBN, CLASS 8AB 2ND LG (BENGALI) | TUE | 10AM to 10:40AM BigBlueButtonBN, BigBlueButtonBN, CLASS 8AB 2ND LG (NEPALI) | TUE | 10AM to 10:40AM BigBlueButtonBN, BigBlueButtonBN, CLASS 8B GEOG | TUE | 11AM to 11:40AM BigBlueButtonBN, BigBlueButtonBN, CLASS 8B BIO | TUE | 12NOON to 12:40PM BigBlueButtonBN, BigBlueButtonBN, CLASS 8AB CP APP | TUE | 5PM to 5:40PM BigBlueButtonBN, BigBlueButtonBN, CLASS 8AB CM APP | TUE | 5PM to 5:40PM BigBlueButtonBN, BigBlueButtonBN]"

# The attempted pattern; it finds no matches
regex = re.compile(r'^[CLASS]*[M]')
match = regex.findall(str(html_text))
print(match)

2 Answers

User

#1 · Edited 2024-05-14 14:29:53

Try this:

regex = re.compile(r'CLASS.*?[\d:]+[AP]M to [\d:]+[AP]M')

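You should not start the pattern with ^, because then it only matches at the beginning of the string and will not find all the matches.
CLASS should not be inside square brackets. [CLASS] matches a single character, which can be C, L, A, or S.
You need .* to match any text after CLASS, and ? to make it non-greedy.
You cannot just match M at the end, because that would match the next M anywhere in the string. Match it only after a time followed by A or P. You also need to match the start time and to, so the match does not stop at the start time.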
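
To make these points concrete, here is a minimal sketch run against a shortened copy of the question's text. One caveat worth checking on your own data: the start time 12NOON does not end in AM or PM, so a pattern that only allows [AP]M there skips past it and merges that entry with the next one. The NOON alternative in the last pattern is my assumption about how to handle that, not part of the original answer.

```python
import re

# Shortened sample in the same shape as the question's text
text = (
    "[CLASS 8B PHY | TUE | 9AM to 9:40AM BigBlueButtonBN, BigBlueButtonBN, "
    "CLASS 8B BIO | TUE | 12NOON to 12:40PM BigBlueButtonBN, BigBlueButtonBN, "
    "CLASS 8AB CP APP | TUE | 5PM to 5:40PM BigBlueButtonBN, BigBlueButtonBN]"
)

# The question's pattern: anchored at the start, then any run of the
# characters C, L, A, S, then a single M. The text starts with "[",
# so there is nothing to match.
original = re.compile(r'^[CLASS]*[M]')
no_matches = original.findall(text)
print(no_matches)

# Pattern that only accepts AM/PM for the start time: the 12NOON entry
# gets merged with the entry that follows it.
ampm_only = re.compile(r'CLASS.*?[\d:]+[AP]M to [\d:]+[AP]M')
merged = ampm_only.findall(text)
print(len(merged))

# Assumed variant: also accept NOON as the suffix of the start time.
with_noon = re.compile(r'CLASS.*?[\d:]+(?:[AP]M|NOON) to [\d:]+[AP]M')
matches = with_noon.findall(text)
for m in matches:
    print(m)
```

If other suffixes can appear in the real data (MIDNIGHT, for example), the alternation would need to be extended in the same way.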

User

#2 · Edited 2024-05-14 14:29:53

You are working with HTML, so it makes sense to parse it with BeautifulSoup in Python.

from bs4 import BeautifulSoup

s = """Your HTML goes here"""  # s is the string that holds the HTML document
doc = BeautifulSoup(s, 'html.parser')
for span in doc.find_all("span", attrs={'class':"instancename"}):
    # Detach the nested accesshide spans so their text is not printed
    innerspans = [x.extract() for x in span.find_all("span", attrs={'class':'accesshide'})]
    print(span.text)

Output:

CLASS 8B PHY  | TUE | 9AM to 9:40AM
CLASS 8AB 2ND LG (HINDI)  | TUE | 10AM to 10:40AM
CLASS 8AB 2ND LG (BENGALI)  | TUE | 10AM to 10:40AM 
CLASS 8AB 2ND LG (NEPALI)  | TUE | 10AM to 10:40AM
CLASS 8B GEOG | TUE | 11AM to 11:40AM
CLASS 8B BIO | TUE | 12NOON to 12:40PM
CLASS 8AB CP APP | TUE | 5PM to 5:40PM
CLASS 8AB CM APP | TUE | 5PM to 5:40PM

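Note that [x.extract() for x in span.find_all("span", attrs={'class':'accesshide'})] extracts the span elements with the accesshide class and removes them from the outer span. What remains is the outer span's own text, without the text of the inner spans.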
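
If installing BeautifulSoup is not an option, the same idea (keep the outer span's text, skip the nested accesshide span) can be sketched with the standard library's html.parser. The page's HTML is not shown in this thread, so the markup below is an assumed shape, and InstanceNameParser is a hypothetical helper name rather than anything from the answer.

```python
from html.parser import HTMLParser


class InstanceNameParser(HTMLParser):
    """Collect the text of span.instancename, skipping nested span.accesshide."""

    def __init__(self):
        super().__init__()
        self.names = []      # finished entries
        self._stack = []     # class list of every currently open <span>
        self._buffer = None  # text pieces of the entry being read, or None

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        self._stack.append(classes)
        if "instancename" in classes and self._buffer is None:
            self._buffer = []

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        classes = self._stack.pop()
        if "instancename" in classes and self._buffer is not None:
            self.names.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is None:
            return  # not inside an instancename span
        if any("accesshide" in classes for classes in self._stack):
            return  # inside a hidden inner span: drop its text
        self._buffer.append(data)


# Assumed markup, modelled on the class names used in the answer
sample_html = (
    '<span class="instancename">CLASS 8B PHY | TUE | 9AM to 9:40AM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>'
    '<span class="instancename">CLASS 8B BIO | TUE | 12NOON to 12:40PM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>'
)

parser = InstanceNameParser()
parser.feed(sample_html)
parser.close()
for name in parser.names:
    print(name)
```

For anything beyond a quick script, BeautifulSoup is the more forgiving choice, since it copes with malformed markup that this small state machine does not try to handle.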
