Unable to match specific strings with a regular expression

Posted 2024-05-14 14:29:53


I'm trying to match HTML text that has been converted to a string, but none of my regular expressions work.

The HTML text I'm trying to match:

"[<span class="instancename">CLASS 8B PHY  | TUE | 9AM to 9:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (HINDI)  | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (BENGALI)  | TUE | 10AM to 10:40AM <span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (NEPALI)  | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B GEOG | TUE | 11AM to 11:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B BIO | TUE | 12NOON to 12:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CP APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CM APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>]"

The sentences I want to match are:

  1. CLASS 8B PHY | TUE | 9AM to 9:40AM

  2. CLASS 8AB 2ND LG (HINDI) | TUE | 10AM to 10:40AM

  3. CLASS 8B GEOG | TUE | 11AM to 11:40AM

The HTML text above contains more entries of the same form.

The code I'm using to match them doesn't seem to work:

import re
html_text = [<span class="instancename">CLASS 8B PHY  | TUE | 9AM to 9:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (HINDI)  | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (BENGALI)  | TUE | 10AM to 10:40AM <span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (NEPALI)  | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B GEOG | TUE | 11AM to 11:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B BIO | TUE | 12NOON to 12:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CP APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CM APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>]

regex = re.compile(r'^[CLASS]*[M]')
match = regex.findall(str(html_text))
print(match)

I suspect I'm not supplying a suitable regular expression.


2 answers

Try this:

regex = re.compile(r'CLASS.*?[\d:]+[AP]M to [\d:]+[AP]M')
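  1. Don't start the pattern with ^. That anchors it to the beginning of the string, so it can only match there and will not find all the matches.
  2. CLASS should not go inside square brackets. [CLASS] matches a single character, which can be C, L, A, or S.
  3. You need .* to match any text after CLASS, followed by ? to make it non-greedy.
  4. You can't match only M at the end, because it would then match the next M anywhere in the string. Match it only after a time and an A or P. You also need to match the start time and to, so the match doesn't stop at the start time.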
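
As a rough illustration of the points above, the sketch below runs the original pattern and the suggested one against a shortened three-entry sample of the HTML from the question. The variable names and the final CLASS[^<]* pattern are assumptions added for illustration and are not part of the answer. That last pattern assumes the entry text never contains a < character.

```python
import re

# Shortened sample of the HTML from the question (three entries only).
sample = (
    '<span class="instancename">CLASS 8B PHY  | TUE | 9AM to 9:40AM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>, '
    '<span class="instancename">CLASS 8B BIO | TUE | 12NOON to 12:40PM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>, '
    '<span class="instancename">CLASS 8AB CP APP | TUE | 5PM to 5:40PM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>'
)

# The pattern from the question: anchored at the start of the string and
# built from a character class, so it finds nothing here.
original_matches = re.findall(r'^[CLASS]*[M]', sample)
print(original_matches)

# The pattern suggested in the answer.
answer_pattern = re.compile(r'CLASS.*?[\d:]+[AP]M to [\d:]+[AP]M')
answer_matches = answer_pattern.findall(sample)
for m in answer_matches:
    print(repr(m))

# Hypothetical alternative: take everything from CLASS up to the next tag.
tag_bounded = re.findall(r'CLASS[^<]*', sample)
for m in tag_bounded:
    print(repr(m))
```

One caveat shows up on this sample: the start time 12NOON does not fit [\d:]+[AP]M, so the answer's pattern keeps scanning and its second match runs from the BIO entry through the end of the CP APP entry. The tag-bounded variant returns the three entries separately, but it keeps any trailing whitespace that precedes the inner span.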

You are working with HTML, so it makes sense to parse it in Python with BeautifulSoup.

from bs4 import BeautifulSoup
s = """Your HTML goes here"""  # s holds the HTML string that is parsed into doc
doc = BeautifulSoup(s, 'html.parser')
for span in doc.find_all("span", attrs={'class':"instancename"}):
    # Detach the nested accesshide spans so their text is not printed below
    innerspans = [x.extract() for x in span.find_all("span", attrs={'class':'accesshide'})]
    print(span.text)

Output:

CLASS 8B PHY  | TUE | 9AM to 9:40AM
CLASS 8AB 2ND LG (HINDI)  | TUE | 10AM to 10:40AM
CLASS 8AB 2ND LG (BENGALI)  | TUE | 10AM to 10:40AM 
CLASS 8AB 2ND LG (NEPALI)  | TUE | 10AM to 10:40AM
CLASS 8B GEOG | TUE | 11AM to 11:40AM
CLASS 8B BIO | TUE | 12NOON to 12:40PM
CLASS 8AB CP APP | TUE | 5PM to 5:40PM
CLASS 8AB CM APP | TUE | 5PM to 5:40PM

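Note that [x.extract() for x in span.find_all("span", attrs={'class':'accesshide'})] extracts the accesshide span elements and removes them from the enclosing span. What remains is the text of the span itself, without the text of the inner spans.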
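
If installing BeautifulSoup is not an option, the same idea (keep the text of each instancename span, skip the text of any accesshide span) can be sketched with the standard library's html.parser module. This swaps in a different technique from the answer's. The class name InstanceNameParser and the sample string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class InstanceNameParser(HTMLParser):
    """Collect the text of span.instancename, skipping span.accesshide."""

    def __init__(self):
        super().__init__()
        self.names = []    # finished entries
        self._stack = []   # class attribute of each currently open <span>
        self._buffer = []  # text pieces of the entry being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        css_class = dict(attrs).get("class") or ""
        # Start a fresh entry when an outermost instancename span opens.
        if css_class == "instancename" and "instancename" not in self._stack:
            self._buffer = []
        self._stack.append(css_class)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        closed = self._stack.pop()
        # The entry is complete once the outermost instancename span closes.
        if closed == "instancename" and "instancename" not in self._stack:
            self.names.append("".join(self._buffer).strip())

    def handle_data(self, data):
        # Keep text only inside instancename and outside accesshide.
        if "instancename" in self._stack and "accesshide" not in self._stack:
            self._buffer.append(data)


# Shortened sample of the HTML from the question.
sample = (
    '<span class="instancename">CLASS 8B PHY  | TUE | 9AM to 9:40AM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>, '
    '<span class="accesshide"> BigBlueButtonBN</span>, '
    '<span class="instancename">CLASS 8B GEOG | TUE | 11AM to 11:40AM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>'
)

parser = InstanceNameParser()
parser.feed(sample)
parser.close()
names = parser.names
for name in names:
    print(name)
```

Unlike span.text in the BeautifulSoup version, this sketch strips leading and trailing whitespace from each entry, and it only compares the class attribute against a single exact value.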
