How to scrape a <script> tag with Beautiful Soup (bs4) in Python

Posted 2024-06-16 09:11:36


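If you view the page source of the link below

https://www.zoopla.co.uk/for-sale/details/53818653?search_identifier=7e57533214fc2402ba53dd6c14b624f8

there is a <script> tag on line 89, with information below it running down to line 164. I am trying to extract this with Beautiful Soup but have not been able to. I can successfully extract other tags such as "h2" or "div" using the following approach: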

From line 1028 of the page source:

for item_name in soup.find_all('h2', {'class': 'ui-property-summary__address'}):
    ad = item_name.get_text(strip=True)

Can you tell me how to extract the script tag from line 89? Thanks.


2 Answers

This example locates the <script> tag and parses some data from it:

import re
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.zoopla.co.uk/for-sale/details/53818653?search_identifier=7e57533214fc2402ba53dd6c14b624f8'

# locate the tag
# (:-soup-contains needs soupsieve 2.1+; older versions only have the deprecated :contains)
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
script = soup.select_one('script:-soup-contains("ZPG.trackData.taxonomy")')

# parse some data from script
data1 = re.findall(r'ZPG\.trackData\.ecommerce = ({.*?});', script.text, flags=re.S)[0]
data1 = json.loads( re.sub(r'([^"\s]+):\s', r'"\1": ', data1) )

data2 = re.findall(r'ZPG\.trackData\.taxonomy = ({.*?});', script.text, flags=re.S)[0]
data2 = json.loads( re.sub(r'([^"\s]+):\s', r'"\1": ', data2) )

# print the data
print(json.dumps(data1, indent=4))
print(json.dumps(data2, indent=4))

Prints:

{
    "detail": {
        "products": [
            {
                "brand": "Walton and Allen Estate Agents Ltd",
                "category": "for-sale/resi/agent/pre-owned/gb",
                "id": 53818653,
                "name": "FS_Contact",
                "price": 1,
                "quantity": 1,
                "variant": "standard"
            }
        ]
    }
}
{
    "signed_in_status": "signed out",
    "acorn": 44,
    "acorn_type": 44,
    "area_name": "Aspley, Nottingham",
    "beds_max": 3,
    "beds_min": 3,
    "branch_id": "43168",
    "branch_logo_url": "https://st.zoocdn.com/zoopla_static_agent_logo_(586192).png",
    "branch_name": "Walton & Allen Estate Agents",
    "brand_name": "Walton and Allen Estate Agents Ltd",
    "chain_free": false,
    "company_id": "21619",
    "country_code": "gb",
    "county_area_name": "Nottingham",
    "currency_code": "GBP",
    "display_address": "Melbourne Road, Aspley, Nottingham NG8",
    "furnished_state": "",
    "group_id": "",
    "has_epc": false,
    "has_floorplan": true,
    "incode": "5HN",
    "is_retirement_home": false,
    "is_shared_ownership": false,
    "listing_condition": "pre-owned",
    "listing_id": 53818653,
    "listing_status": "for_sale",
    "listings_category": "residential",
    "location": "Aspley",
    "member_type": "agent",
    "num_baths": 1,
    "num_beds": 3,
    "num_images": 15,
    "num_recepts": 1,
    "outcode": "NG8",
    "post_town_name": "Nottingham",
    "postal_area": "NG",
    "price": 150000,
    "price_actual": 150000,
    "price_max": 150000,
    "price_min": 150000,
    "price_qualifier": "guide_price",
    "property_highlight": "",
    "property_type": "semi_detached",
    "region_name": "East Midlands",
    "section": "for-sale",
    "size_sq_feet": "",
    "tenure": "",
    "zindex": "129806"
}
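
The two-step parse above (pull the object literal out with `re.findall`, then quote its bare keys with `re.sub` so that `json.loads` accepts it) can be tried offline. The following is a minimal sketch under stated assumptions: `script_text` is a made-up stand-in for the tag's contents, the logo URL in it is invented, and `extract_object` is a hypothetical helper name, not part of any library.

```python
import json
import re

# Hypothetical stand-in for the <script> text; the real page's content may differ.
script_text = """
ZPG.trackData.taxonomy = {
    signed_in_status: "signed out",
    num_beds: 3,
    branch_logo_url: "https://st.zoocdn.com/logo.png",
    chain_free: false
};
"""


def extract_object(name, text):
    """Find `name = {...};` in text and return the object as a Python dict."""
    # Non-greedy match from the opening brace up to the first "};".
    match = re.search(re.escape(name) + r'\s*=\s*({.*?});', text, flags=re.S)
    if match is None:
        raise ValueError(f'{name} not found')
    literal = match.group(1)
    # Wrap each bare key (non-quote, non-space characters before ": ") in quotes.
    as_json = re.sub(r'([^"\s]+):\s', r'"\1": ', literal)
    return json.loads(as_json)


data = extract_object('ZPG.trackData.taxonomy', script_text)
print(data['num_beds'], data['chain_free'])
```

The URL value survives because the colon in `https:` is followed by `/` rather than whitespace. A string value that contains a colon followed by a space would be mangled by this substitution, so the approach only suits simple literals like the ones on this page.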

Find all <script> tags, then search them for the one that contains ZPG.trackData.ecommerce:

ecommerce = None
for item in soup.find_all('script'):
    # item.string is None for tags without inline text (e.g. <script src=...>)
    if item.string and 'ZPG.trackData.ecommerce' in item.string:
        ecommerce = item.string
        break
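
The same loop can be checked without fetching the live page. This is a small sketch, assuming a made-up HTML fragment (`html`) and a hypothetical helper name (`find_script_containing`); it shows why the check on `.string` matters when a page mixes external and inline scripts.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: one external script (no inline text) and one inline script.
html = """
<html><head>
<script src="https://example.com/app.js"></script>
<script>ZPG.trackData.ecommerce = {detail: {products: []}};</script>
</head><body><h2 class="ui-property-summary__address">Melbourne Road</h2></body></html>
"""

soup = BeautifulSoup(html, 'html.parser')


def find_script_containing(soup, needle):
    """Return the text of the first inline <script> containing needle, or None."""
    for tag in soup.find_all('script'):
        # tag.string is None when the tag has no inline content, so test it first.
        if tag.string and needle in tag.string:
            return tag.string
    return None


ecommerce = find_script_containing(soup, 'ZPG.trackData.ecommerce')
print(ecommerce)
```

Without the `tag.string and` guard, the first tag here would raise a TypeError, because `in` cannot be applied to None.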
