RegEx for capturing HTML text with Python

Posted 2024-06-16 13:55:38


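I am trying to scrape paragraphs of text from a website with RegEx and put them into a Python list, but for this particular website I am struggling to format the RegEx so that it captures every occurrence. Can anyone help me collect the results for all instances? Or at least tell me if this is impractical, and I will find an alternative website.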

from re import findall
from urllib.request import urlopen

## Create Empty List
EventInfoListBEC = []

## Assign Website to a Variable
WebsiteBEC = 'https://www.brisent.com.au/Event-Calendar'

## Download the Page Source (the pattern must run on the HTML, not on the URL string)
## Assumes the page is served as UTF-8
PageSourceBEC = urlopen(WebsiteBEC).read().decode('utf-8')

## Search for Event Info
EventInfoBEC = findall('<p class="event-description">(.+?)</p>', PageSourceBEC)

## Add Event Info to Event Info List and Print Details
print('Event Info appears', len(EventInfoBEC), 'times (BEC).')
for EventInfo in EventInfoBEC:
    EventInfoListBEC.append(EventInfo)
print(EventInfoListBEC)

## There are Three Styles of Input from the HTML File
# One
<p class="event-description"><p>This is a sport where 8 seconds can cost you everything. Welcome to the world of the PBR.</p>

</p>

# Two
<p class="event-description"><p style="text-align: justify; color: rgb(0, 0, 0); font-family: sans-serif; font-size: 12px;">Fresh off the back of winning a Brit Award for &lsquo;British Artist Video of the Year&rsquo; for &lsquo;Woman Like Me&rsquo;, and two Global Awards for &lsquo;Best Group&rsquo; and &lsquo;Best Song&rsquo;; pop superstars Little Mix today announce that five new Australian shows have been added to &#39;LM5 - The Tour&#39; for 2019!</p>

</p>

# Three
<p class="event-description"><p style="font-family: sans-serif; font-size: 12px; color: rgb(0, 0, 0); text-align: center;"><strong>OPENING NIGHT PERFORMANCE ADDED!</strong></p>



<p style="font-family: sans-serif; font-size: 12px; color: #000000; text-align: justify;">The world&rsquo;s most beloved movie-musical comes to life on the arena stage&nbsp;like you&rsquo;ve never seen it before! From the producers of GREASE - THE ARENA EXPERIENCE comes this lavish new arena production of THE WIZARD OF OZ.</p>

1 Answer
User
Answer 1 · Posted 2024-06-16 13:55:38

As many have pointed out, there are better approaches than regex: I like using lxml (lxml.html), but bs4 works too.
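To illustrate the parser-based route without installing anything, here is a minimal sketch that uses the standard library's html.parser rather than lxml or bs4 (a swapped-in substitute, not the libraries named above). The class name EventDescriptionParser and the shortened sample_html are hypothetical. Because the page nests <p> inside <p class="event-description">, the sketch counts <p> depth itself instead of relying on a tree builder.

```python
from html.parser import HTMLParser


class EventDescriptionParser(HTMLParser):
    """Collect the text inside every <p class="event-description"> element."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities in text
        self.descriptions = []  # one cleaned string per event
        self._depth = 0         # <p> nesting depth inside the current event
        self._chunks = []       # text fragments of the event being read

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return  # inline tags such as <strong> are simply skipped
        if self._depth:
            self._depth += 1
        elif "event-description" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join fragments and collapse runs of whitespace
                self.descriptions.append(" ".join("".join(self._chunks).split()))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


# Hypothetical, shortened copies of the snippets from the question
sample_html = (
    '<p class="event-description"><p>Welcome to the world of the PBR.</p>\n\n</p>\n'
    '<p class="event-description"><p style="text-align: center;">'
    '<strong>OPENING NIGHT PERFORMANCE ADDED!</strong></p>\n\n'
    '<p style="text-align: justify;">The world&rsquo;s most beloved movie-musical.</p>\n</p>'
)

parser = EventDescriptionParser()
parser.feed(sample_html)
parser.close()
print(parser.descriptions)
```

Unlike a pattern that stops at the first closing tag, this sketch keeps the text of every paragraph inside one event, and the <strong> tags never reach the output.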

In any case, here is a solution using the regex module (in which, unlike re, lookbehinds may have variable length). The solution relies on the regular expression:

(?<=class="event-description"><p[\w\s\#\;\(\)\"\=\:\-\,]*>).*(?=</p>)

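It captures the content of the paragraph inside the event-description class. The custom character class [\w\s\#\;\(\)\"\=\:\-\,] contains every character used in the style attributes. Finally, the star * also allows an empty style to match.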
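Before running the full script against the live site, the idea can be tried offline. The sketch below is an assumption-laden variant, not the pattern above: it uses the standard re module, which does not support variable-length lookbehind, so a capture group is swapped in; [^>]* stands in for the custom character class, and a lazy .*? replaces the greedy .* so the match stops at the first </p>. The name EVENT_PATTERN and the shortened sample_html are hypothetical.

```python
import re

# Capture group instead of the variable-length lookbehind; [^>]* skips any
# attributes on the inner <p>, and .*? stops at the first closing </p>.
EVENT_PATTERN = re.compile(r'class="event-description"><p[^>]*>(.*?)</p>')

# Hypothetical, shortened copies of the snippets from the question
sample_html = '''<p class="event-description"><p>Welcome to the world of the PBR.</p>

</p>
<p class="event-description"><p style="text-align: center;"><strong>OPENING NIGHT PERFORMANCE ADDED!</strong></p>

<p style="text-align: justify;">The world&rsquo;s most beloved movie-musical.</p>
'''

matches = EVENT_PATTERN.findall(sample_html)
print(matches)
```

Only the first paragraph after each class="event-description" marker is returned; the second paragraph of the last snippet is not preceded by that marker, so it is skipped.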

import regex
import requests

# Assign Website to a Variable
WebsiteBEC = 'https://www.brisent.com.au/Event-Calendar'

# Get source code
req = requests.get(WebsiteBEC, timeout=5)
source_code = req.text

# Extract data
EventInfoBEC = regex.findall(r'(?<=class="event-description"><p[\w\s\#\;\(\)\"\=\:\-\,]*>).*(?=</p>)', source_code)
# ['This is a sport where 8 seconds can cost you everything. Welcome to the world of the PBR.',
#  'See fearless Moana with demigod Maui, follow Dory through the Pacific Ocean, join the Toy Story pals on an exciting adventure and discover true love with Elsa and Anna. Buckle in for the emotional rollercoaster of Inside Out and &ldquo;Live Your Story&rdquo; alongside Disney Princesses as they celebrate their favourite Disney memories!',
#  'Fresh off the back of winning a Brit Award for &lsquo;British Artist Video of the Year&rsquo; for &lsquo;Woman Like Me&rsquo;, and two Global Awards for &lsquo;Best Group&rsquo; and &lsquo;Best Song&rsquo;; pop superstars Little Mix today announce that five new Australian shows have been added to &#39;LM5 - The Tour&#39; for 2019!',
#  '<strong>OPENING NIGHT PERFORMANCE ADDED!</strong>',
#  '<strong>THIRD SHOW ANNOUNCED - ON SALE FROM 2PM FRI 1 FEB!</strong>',
#  '<strong>COMING TO AUSTRALIA FOR THE VERY FIRST TIME.&nbsp;</strong>',
#  'WWE LIVE is returning to Australia!&nbsp;Fans will be able to see their favorite WWE Superstars for the first time since last year&rsquo;s incredible Super Show-Down',
#  '<strong>SHAWN MENDES ANNOUNCES RUEL AS SPECIAL GUEST + ADDITIONAL TICKETS AVAILABLE FOR ALL SHOWS!</strong>',
#  'Steve Martin and Martin Short will bring their critically acclaimed comedy tour Now You See Them, Soon You Won&rsquo;t for the first time to Australian audiences in November.&nbsp;',
#  'After an epic and storied 45-year career that launched an era of rock n roll legends, KISS announced that they will launch their final tour ever in 2019, appropriately named END OF THE ROAD.',
#  '<strong>ELTON JOHN ANNOUNCES 3RD BRISBANE SHOW!</strong>']

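The results still need post-processing to strip the <strong> tags. Also, the last line of the source snippet given in the question does not belong to the event-description class, so the regex does not capture it.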
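One possible way to do that post-processing, sketched with the standard library only: remove any remaining inline tags with re.sub, then decode entities with html.unescape. The helper name clean_description is hypothetical, and raw_results holds a few strings copied or shortened from the output listed earlier.

```python
import html
import re


def clean_description(raw):
    """Strip inline tags, decode HTML entities, and normalise whitespace."""
    # Remove tags first so that decoded characters are never mistaken for markup
    without_tags = re.sub(r"<[^>]+>", "", raw)
    # e.g. &#39; becomes an apostrophe and &nbsp; becomes U+00A0
    decoded = html.unescape(without_tags)
    # str.split() with no argument also splits on U+00A0
    return " ".join(decoded.split())


raw_results = [
    '<strong>OPENING NIGHT PERFORMANCE ADDED!</strong>',
    '<strong>COMING TO AUSTRALIA FOR THE VERY FIRST TIME.&nbsp;</strong>',
    'added to &#39;LM5 - The Tour&#39; for 2019!',
]
cleaned = [clean_description(item) for item in raw_results]
print(cleaned)
```

The tag-stripping pattern is deliberately crude and is only meant for short inline markup such as <strong>; for anything more complex, an HTML parser is the safer choice.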
