How to extract all text before a comma, period, or colon from a website using Python

location_address1 = soup.select_one(f"[data-id='{num}'] .heading:contains('Address') + p").contents[0].strip()
location_address2 = ','.join(location_address1.split(',|.|:')[1:])

<p> 2 Hemlock Rd. PO Box 904 <br> Corner Brook, NL <br> A2H 6J2 </p>

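2 Answers

User

Answer 1 · edited 2024-05-23 15:03:00

You can count how many lines the address has and assign the text to variables accordingly. See the example below.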

num = 267
location_address = soup.select_one(f"[data-id='{num}'] .heading:contains('Address') + p")
print(location_address)

# Count the <br> tags that separate the address lines
total_breaks = len(location_address.find_all('br'))
print(total_breaks)

# contents alternates text, <br>, text, ..., so line 2 (if present) is at index 2
line1 = location_address.contents[0].strip()
line2 = ''
if total_breaks >= 1 and len(location_address.contents) > 2:
    line2 = location_address.contents[2].strip()

print('Address Line1:', line1)
print('Address Line2:', line2)

Output:

Address Line1: 2 Hemlock Rd. PO Box 904
Address Line2: Corner Brook, NL
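
If the number of address lines varies from one location to the next, fixed indexes into contents become fragile. A minimal sketch of an alternative, assuming the same `<p>` markup shown in the question (the sample HTML is hard-coded here rather than fetched from the site), uses BeautifulSoup's stripped_strings to collect however many lines there are:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; in practice this would come from the page.
html = "<p> 2 Hemlock Rd. PO Box 904 <br> Corner Brook, NL <br> A2H 6J2 </p>"

soup = BeautifulSoup(html, "html.parser")
paragraph = soup.find("p")

# stripped_strings yields each text node with surrounding whitespace removed,
# skipping the <br> tags and any whitespace-only nodes.
address_lines = list(paragraph.stripped_strings)

for number, line in enumerate(address_lines, start=1):
    print(f"Address Line{number}: {line}")
```

This avoids counting `<br>` tags altogether, at the cost of no longer knowing which position each line held if some lines are empty.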

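User

Answer 2 · edited 2024-05-23 15:03:00

You need to target the HTML more precisely; here is one solution. I used a CSS selector because it is more accurate, and because XPath is not available here. Once the element is selected, convert it to text and work with what you have; from there, split the lines and strip the newline characters for cleaner formatting.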

Note: This has been tested and runs correctly.

Code:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.winmar.ca/find-a-location/#267")

soup = BeautifulSoup(page.content, 'html.parser')

address = soup.select('#box-309 > div:nth-child(2) > p:nth-child(5)')

text = address[0].get_text()
print(text)

Output:

 358 Keltic Drive Sydney River ,NS B1R 1V7
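
Once the text is in hand, the part of the question about taking everything before a comma, period, or colon still remains. The question's `split(',|.|:')` does not do this, because str.split treats its argument as a literal string rather than a pattern. A minimal sketch using re.split follows; the helper name text_before_delimiter and the sample lines (taken from the markup in the question) are illustrative assumptions, not part of either answer:

```python
import re

# Hypothetical sample lines, copied from the markup in the question.
address_lines = ["2 Hemlock Rd. PO Box 904", "Corner Brook, NL", "A2H 6J2"]


def text_before_delimiter(text):
    """Return the part of text before the first comma, period, or colon."""
    # str.split(',|.|:') looks for that literal five-character string;
    # re.split with a character class splits on any one of the three.
    return re.split(r"[,.:]", text, maxsplit=1)[0].strip()


for line in address_lines:
    print(text_before_delimiter(line))
```

Note that a period inside an abbreviation such as "Rd." also counts as a delimiter, so the first line is cut after "Rd"; whether that is desirable depends on the data.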
