RegEx to capture HTML text with Python

from re import *
from urllib.request import urlopen

## Create Empty List
EventInfoListBEC = []

## Assign Website to a Variable
WebsiteBEC = 'https://www.brisent.com.au/Event-Calendar'

## Search for Event Info
EventInfoBEC = findall('(.+?)', WebsiteBEC)

## Add Event Info to Event Info List and Print Details
print('Event Info appears', len(EventInfoBEC), 'times (BEC).')
for EventInfo in EventInfoBEC:
    EventInfoListBEC.append(EventInfo)
print(EventInfoListBEC)

## There are Three Styles of Input from the HTML File

# One
This is a sport where 8 seconds can cost you everything. Welcome to the world of the PBR.

# Two
Fresh off the back of winning a Brit Award for ‘British Artist Video of the Year’ for ‘Woman Like Me’, and two Global Awards for ‘Best Group’ and ‘Best Song’; pop superstars Little Mix today announce that five new Australian shows have been added to 'LM5 - The Tour' for 2019!

# Three
OPENING NIGHT PERFORMANCE ADDED! The world’s most beloved movie-musical comes to life on the arena stage like you’ve never seen it before! From the producers of GREASE - THE ARENA EXPERIENCE comes this lavish new arena production of THE WIZARD OF OZ.

1 answer

User

Reply #1 · Posted on 2024-06-16 13:55:38

As many people have pointed out, there are better tools for this than regex: I like lxml (lxml.html), but bs4 works too.
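To illustrate the parser-based approach without extra dependencies, here is a minimal sketch that uses the standard library's html.parser as a stand-in for lxml or bs4. The class name EventDescriptionParser and the sample markup (a div container around each paragraph) are assumptions made for illustration; the real page structure may differ.

```python
from html.parser import HTMLParser


class EventDescriptionParser(HTMLParser):
    """Collect the text of <p> elements nested in class="event-description"."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.descriptions = []
        self._container = None  # tag name of the open event-description element
        self._depth = 0         # nesting depth of tags with that same name
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._container is None:
            classes = (dict(attrs).get("class") or "").split()
            if "event-description" in classes:
                self._container = tag
                self._depth = 1
        elif tag == self._container:
            self._depth += 1
        elif tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if self._container is None:
            return
        if tag == "p" and self._in_p:
            self._in_p = False
            self.descriptions.append("".join(self._buffer).strip())
        elif tag == self._container:
            self._depth -= 1
            if self._depth == 0:
                self._container = None

    def handle_data(self, data):
        # Text inside nested tags such as <strong> is kept, the tags are not.
        if self._in_p:
            self._buffer.append(data)


# Assumed sample markup, loosely modelled on the snippets in the question.
sample = (
    '<div class="event-description"><p style="text-align: center;">'
    '<strong>OPENING NIGHT PERFORMANCE ADDED!</strong></p></div>'
    '<div class="event-description">'
    '<p>last year&rsquo;s incredible Super Show-Down</p></div>'
    '<div class="other"><p>Not an event description.</p></div>'
)

parser = EventDescriptionParser()
parser.feed(sample)
parser.close()
descriptions = parser.descriptions
print(descriptions)
```

Because the parser sees the document as elements rather than characters, nested tags and character references are handled along the way, which is the main advantage over a regex.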

Anyway, here is a solution using the regex module (where, unlike in re, lookbehinds can be variable-length). The solution relies on the regular expression

(?<=class="event-description"><p[\w\s\#\;\(\)\"\=\:\-\,]*>).*(?=</p>)

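It captures the content of paragraphs inside the event-description class. The custom character class [\w\s\#\;\(\)\"\=\:\-\,] contains all the characters used in the style attribute. Finally, the star * also allows an empty style to match.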
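As a rough check of the point about variable-length lookbehinds, the sketch below tries to compile the same pattern with the standard re module, then falls back to a capturing group, which expresses the same idea without lookarounds. The names LOOKBEHIND, CAPTURE and the sample markup are assumptions made for illustration and are not part of the original answer.

```python
import re

# The pattern from the answer, unchanged.
LOOKBEHIND = r'(?<=class="event-description"><p[\w\s\#\;\(\)\"\=\:\-\,]*>).*(?=</p>)'

# The standard re module only supports fixed-width lookbehinds,
# so compiling this pattern is expected to fail there.
try:
    re.compile(LOOKBEHIND)
    stdlib_accepts_lookbehind = True
except re.error:
    stdlib_accepts_lookbehind = False

# Same idea with a capturing group: the prefix and suffix are matched
# normally and findall returns only the group.
CAPTURE = r'class="event-description"><p[\w\s\#\;\(\)\"\=\:\-\,]*>(.*)</p>'

# Assumed sample markup: one paragraph without a style, one with a style.
sample = (
    '<div class="event-description"><p>Welcome to the world of the PBR.</p></div>\n'
    '<div class="event-description"><p style="text-align: center;">'
    '<strong>OPENING NIGHT PERFORMANCE ADDED!</strong></p></div>\n'
)

matches = re.findall(CAPTURE, sample)
print(stdlib_accepts_lookbehind)
print(matches)
```

Note that .* is greedy and does not cross newlines by default, so both variants assume each paragraph sits on a single line with one closing </p>.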

import regex     # third-party: pip install regex
import requests  # third-party: pip install requests

# Assign Website to a Variable
WebsiteBEC = 'https://www.brisent.com.au/Event-Calendar'

# Get source code
req = requests.get(WebsiteBEC, timeout=5)
source_code = req.text

# Extract data
EventInfoBEC = regex.findall(r'(?<=class="event-description"><p[\w\s\#\;\(\)\"\=\:\-\,]*>).*(?=</p>)', source_code)
# ['This is a sport where 8 seconds can cost you everything. Welcome to the world of the PBR.',
#  'See fearless Moana with demigod Maui, follow Dory through the Pacific Ocean, join the Toy Story pals on an exciting adventure and discover true love with Elsa and Anna. Buckle in for the emotional rollercoaster of Inside Out and &ldquo;Live Your Story&rdquo; alongside Disney Princesses as they celebrate their favourite Disney memories!',
#  'Fresh off the back of winning a Brit Award for &lsquo;British Artist Video of the Year&rsquo; for &lsquo;Woman Like Me&rsquo;, and two Global Awards for &lsquo;Best Group&rsquo; and &lsquo;Best Song&rsquo;; pop superstars Little Mix today announce that five new Australian shows have been added to &#39;LM5 - The Tour&#39; for 2019!',
#  '<strong>OPENING NIGHT PERFORMANCE ADDED!</strong>',
#  '<strong>THIRD SHOW ANNOUNCED - ON SALE FROM 2PM FRI 1 FEB!</strong>',
#  '<strong>COMING TO AUSTRALIA FOR THE VERY FIRST TIME.&nbsp;</strong>',
#  'WWE LIVE is returning to Australia!&nbsp;Fans will be able to see their favorite WWE Superstars for the first time since last year&rsquo;s incredible Super Show-Down',
#  '<strong>SHAWN MENDES ANNOUNCES RUEL AS SPECIAL GUEST + ADDITIONAL TICKETS AVAILABLE FOR ALL SHOWS!</strong>',
#  'Steve Martin and Martin Short will bring their critically acclaimed comedy tour Now You See Them, Soon You Won&rsquo;t for the first time to Australian audiences in November.&nbsp;',
#  'After an epic and storied 45-year career that launched an era of rock n roll legends, KISS announced that they will launch their final tour ever in 2019, appropriately named END OF THE ROAD.',
#  '<strong>ELTON JOHN ANNOUNCES 3RD BRISBANE SHOW!</strong>']

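We still need to post-process the results to strip the markup. Also, the last line of the source shown in the question is not in the event-description class, so the regex does not capture it.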
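One possible way to do that post-processing, sketched with the standard library only: drop the remaining tags with a simple pattern and decode character references with html.unescape. The helper name clean and the short sample list are assumptions for illustration; the tag pattern is deliberately naive and only meant for small fragments like the ones above.

```python
import re
from html import unescape

# Naive tag matcher: good enough for short inline fragments such as <strong>.
TAG = re.compile(r'<[^>]+>')


def clean(fragment):
    """Remove tags, decode entities and normalise non-breaking spaces."""
    text = TAG.sub('', fragment)
    text = unescape(text)  # &rsquo; -> ’, &#39; -> ', &nbsp; -> \xa0
    return text.replace('\xa0', ' ').strip()


# Fragments taken from the output listed above.
raw = [
    '<strong>COMING TO AUSTRALIA FOR THE VERY FIRST TIME.&nbsp;</strong>',
    'added to &#39;LM5 - The Tour&#39; for 2019!',
    'Soon You Won&rsquo;t',
]

cleaned = [clean(item) for item in raw]
print(cleaned)
```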
