I am trying to match HTML text that has been converted to a string, but none of my regular expressions work.
The HTML text I am trying to match:
"[<span class="instancename">CLASS 8B PHY | TUE | 9AM to 9:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (HINDI) | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (BENGALI) | TUE | 10AM to 10:40AM <span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (NEPALI) | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B GEOG | TUE | 11AM to 11:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B BIO | TUE | 12NOON to 12:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CP APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CM APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>]"
The sentences I want to match are:
CLASS 8B PHY | TUE | 9AM to 9:40AM
CLASS 8AB 2ND LG (HINDI) | TUE | 10AM to 10:40AM
CLASS 8B GEOG | TUE | 11AM to 11:40AM
There are more of them in the HTML text above.
The code I am using to match them does not seem to work:
import re
html_text = '[<span class="instancename">CLASS 8B PHY | TUE | 9AM to 9:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (HINDI) | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (BENGALI) | TUE | 10AM to 10:40AM <span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (NEPALI) | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B GEOG | TUE | 11AM to 11:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B BIO | TUE | 12NOON to 12:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CP APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CM APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>]'
regex = re.compile(r'^[CLASS]*[M]')
match = regex.findall(str(html_text))
print(match)
I think I have not provided a suitable regular expression.
Try the following:
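As a minimal sketch, one pattern that applies the points made in this answer (no `^` anchor, a literal `CLASS`, a lazy `.*?`, and explicit start and end times around `to`) might look like the code below. The `TIME` helper and the `NOON` alternative are assumptions added for illustration so that the `12NOON` entry is handled, and `html_text` is a shortened sample of the string from the question.

```python
import re

# Shortened sample of the HTML string from the question (three of the eight entries).
html_text = (
    '[<span class="instancename">CLASS 8B PHY | TUE | 9AM to 9:40AM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>, '
    '<span class="accesshide"> BigBlueButtonBN</span>, '
    '<span class="instancename">CLASS 8AB 2ND LG (HINDI) | TUE | 10AM to 10:40AM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>, '
    '<span class="accesshide"> BigBlueButtonBN</span>, '
    '<span class="instancename">CLASS 8B BIO | TUE | 12NOON to 12:40PM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>, '
    '<span class="accesshide"> BigBlueButtonBN</span>]'
)

# Hypothetical helper: an hour, optional minutes, then AM, PM or NOON.
TIME = r'\d{1,2}(?::\d{2})?(?:[AP]M|NOON)'

# Literal CLASS, lazy filler, then "<start time> to <end time>".
regex = re.compile(rf'CLASS.*?{TIME} to {TIME}')

matches = regex.findall(html_text)
print(matches)
```

Because the pattern contains only non-capturing groups, `findall` returns the full matched text for each entry.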
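- Remove the `^` at the start of the pattern. With it, the regex only matches at the beginning of the string and does not find all matches.
- `CLASS` should not be inside square brackets. `[CLASS]` matches a single character, which can be `C`, `L`, `A`, or `S`.
- Use `.*` to match any text after `CLASS`, and add `?` to make it non-greedy.
- Do not match just `M`, because the match would then end at the next `M` in the string. Match it only after a time and an `A` or `P`. You also need to match the start time and `to`, so that the match does not stop at the start time.

You are working with HTML, so it makes sense to parse it with BeautifulSoup in Python.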
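A minimal sketch of the BeautifulSoup approach, assuming the `bs4` package is installed and that the string is parsed with the built-in `html.parser`. The variable names and the shortened `html_text` sample are illustrative, not taken from the original answer.

```python
from bs4 import BeautifulSoup

# Shortened sample of the HTML string from the question (two of the eight entries).
html_text = (
    '<span class="instancename">CLASS 8B PHY | TUE | 9AM to 9:40AM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>, '
    '<span class="instancename">CLASS 8B GEOG | TUE | 11AM to 11:40AM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>'
)

soup = BeautifulSoup(html_text, "html.parser")

names = []
for span in soup.find_all("span", class_="instancename"):
    # Detach the nested accesshide spans so their text is not included.
    for hidden in span.find_all("span", attrs={"class": "accesshide"}):
        hidden.extract()
    # Only the outer span's own text is left at this point.
    names.append(span.get_text(strip=True))

print(names)
```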
Output:
Note
`[x.extract() for x in span.find_all("span", attrs={'class':'accesshide'})]` extracts the span elements with the `accesshide` class and removes them from `span`. What remains is the text of the span itself, without the text of the inner span.