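I'm trying to scrape paragraphs of text from a website with RegEx and put them into a Python list, but for this particular website I'm having trouble writing a RegEx that captures every event. Can anyone help me collect the results for all instances? Or at least tell me if this is impractical, and I'll find an alternative website.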
from re import findall
from urllib.request import urlopen
## Create Empty List
EventInfoListBEC = []
## Assign Website URL to a Variable
WebsiteBEC = 'https://www.brisent.com.au/Event-Calendar'
## Download the Page (findall must search the HTML, not the URL string)
## Assumes the page is UTF-8; undecodable bytes are replaced
HtmlBEC = urlopen(WebsiteBEC).read().decode('utf-8', errors='replace')
## Search for Event Info
EventInfoBEC = findall('<p class="event-description">(.+?)</p>', HtmlBEC)
## Add Event Info to Event Info List and Print Details
print('Event Info appears', len(EventInfoBEC), 'times (BEC).')
for EventInfo in EventInfoBEC:
    EventInfoListBEC.append(EventInfo)
print(EventInfoListBEC)
## There are Three Styles of Input from the HTML File
# One
<p class="event-description"><p>This is a sport where 8 seconds can cost you everything. Welcome to the world of the PBR.</p>
</p>
# Two
<p class="event-description"><p style="text-align: justify; color: rgb(0, 0, 0); font-family: sans-serif; font-size: 12px;">Fresh off the back of winning a Brit Award for ‘British Artist Video of the Year’ for ‘Woman Like Me’, and two Global Awards for ‘Best Group’ and ‘Best Song’; pop superstars Little Mix today announce that five new Australian shows have been added to 'LM5 - The Tour' for 2019!</p>
</p>
# Three
<p class="event-description"><p style="font-family: sans-serif; font-size: 12px; color: rgb(0, 0, 0); text-align: center;"><strong>OPENING NIGHT PERFORMANCE ADDED!</strong></p>
<p style="font-family: sans-serif; font-size: 12px; color: #000000; text-align: justify;">The world’s most beloved movie-musical comes to life on the arena stage like you’ve never seen it before! From the producers of GREASE - THE ARENA EXPERIENCE comes this lavish new arena production of THE WIZARD OF OZ.</p>
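As many have pointed out, there are better tools for this than regex: I like to use lxml (lxml.html), but bs4 would work as well. Anyway, here is a solution using the regex module (in this module, unlike in re, lookbehinds can have variable length).

The solution relies on a regular expression that captures the content of the paragraph inside the event-description class. The custom character class [\w\s\#\;\(\)\"\=\:\-\,] contains all the characters used in the style attribute. Finally, the star * lets it match an empty style as well.

We still need to post-process the results to strip the <strong> tags. Also, the last line of the source provided above is not in the event-description class, so it will not be captured by the regex.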
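The answer's own pattern is not reproduced above, so the following is only a minimal sketch of the approach it describes, not the original code. It swaps the regex module's variable-length lookbehind for a plain capture group so that it runs with the standard re module. The names SAMPLE_HTML, EVENT_RE, STRONG_RE and extract_event_descriptions are hypothetical, and the sample markup is an abridged copy of the three styles from the question.

```python
import re

# Abridged copy of the three markup styles from the question
# (text shortened, tag structure unchanged).
SAMPLE_HTML = '''<p class="event-description"><p>Welcome to the world of the PBR.</p>
</p>
<p class="event-description"><p style="text-align: justify; color: rgb(0, 0, 0); font-family: sans-serif; font-size: 12px;">Five new Australian shows have been added.</p>
</p>
<p class="event-description"><p style="font-family: sans-serif; font-size: 12px; color: rgb(0, 0, 0); text-align: center;"><strong>OPENING NIGHT PERFORMANCE ADDED!</strong></p>
<p style="font-family: sans-serif; font-size: 12px; color: #000000; text-align: justify;">A lavish new arena production of THE WIZARD OF OZ.</p>
'''

# Hypothetical pattern: a capture group stands in for the lookbehind.
EVENT_RE = re.compile(
    r'<p class="event-description">'      # outer wrapper paragraph
    r'<p[\w\s\#\;\(\)\"\=\:\-\,]*>'       # inner <p>; the * also allows no style at all
    r'(.+?)</p>',                         # paragraph content, non-greedy
    re.DOTALL,
)

# Removes opening and closing <strong> tags left inside the captured text.
STRONG_RE = re.compile(r'</?strong>')


def extract_event_descriptions(html):
    """Return the first paragraph of each event-description block."""
    return [STRONG_RE.sub('', text).strip() for text in EVENT_RE.findall(html)]


if __name__ == '__main__':
    for description in extract_event_descriptions(SAMPLE_HTML):
        print(description)
```

On this sample the function returns one entry per event-description block. The second paragraph of the third event is skipped because it does not directly follow the event-description wrapper, which matches the limitation noted in the answer.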