How do I extract data from two HTML pages?

Posted 2024-05-23 13:59:23


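I want to extract data from two HTML pages. When I move from one page to the other, some elements change, so the values I need end up at different positions in the list of results.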

Here is the code in question:

details_containers = soup_page.findAll("div", {"id": "RESTAURANT_DETAILS"})
details_container = details_containers[0].findAll("div", {"class": "content"})
cuisine = details_container[0].text.strip()
print(cuisine)
meals = details_container[1].text.strip()
print(meals)
hotel_features = details_container[2].text.strip()
print(hotel_features)

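From the first HTML page I want the content values for Cuisine, Meals, and Restaurant features. This page also has some extra content values: Open Hours and Average prices.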

<div id="RESTAURANT_DETAILS" class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS">
    <div class="header_with_improve wrap">
        <a href="/UpdateListing-g297595-d6384395-Ocellus-Raipur_Raipur_District_Chhattisgarh.html" onclick="ta.setEvtCookie('UpdateListing', 'entry-detail-moreinfo', null, 0, '/UpdateListingRedesign')">
            <div class="improve_listing_btn ui_button primary">Improve this listing</div>
        </a>
        <h3 class="tabs_header">Restaurant Details</h3> </div>
    <div class="details_tab">
        <div class="table_section">
            <div class="row">
                <div class="ratingSummary wrap">
                    <div class="histogramCommon bubbleHistogram wrap">
                        <div class="colTitle">
                            Rating summary
                        </div>
                        <ul class="barChart">
                            <li>
                                <div class="ratingRow wrap">
                                    <div class="label part ">
                                        <span class="text">Food</span>
                                    </div>
                                    <div class="wrap row part ">
                                        <span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span>
                                    </div>
                                </div>
                                <div class="ratingRow wrap">
                                    <div class="label part ">
                                        <span class="text">Service</span>
                                    </div>
                                    <div class="wrap row part ">
                                        <span class="ui_bubble_rating bubble_35" alt="3.5 of 5 bubbles"></span>
                                    </div>
                                </div>
                            </li>
                            <li>
                                <div class="ratingRow wrap">
                                    <div class="label part ">
                                        <span class="text">Value</span>
                                    </div>
                                    <div class="wrap row part ">
                                        <span class="ui_bubble_rating bubble_35" alt="3.5 of 5 bubbles"></span>
                                    </div>
                                </div>
                            </li>
                        </ul>
                    </div>
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Average prices
                </div>
                <div class="content">
                    <span>₹&nbsp;448 -
₹&nbsp;768</span>
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Cuisine
                </div>
                <div class="content">
                    <a href="/Restaurants-g297595-c24-Raipur_Raipur_District_Chhattisgarh.html">Indian</a>, <a href="/Restaurants-g297595-c3-Raipur_Raipur_District_Chhattisgarh.html">Asian</a>, <a href="/Restaurants-g297595-c26-Raipur_Raipur_District_Chhattisgarh.html">Italian</a>, <a href="/Restaurants-g297595-c20-Raipur_Raipur_District_Chhattisgarh.html">French</a>, <a href="/Restaurants-g297595-c11-Raipur_Raipur_District_Chhattisgarh.html">Chinese</a>, <a href="/Restaurants-g297595-c22-Raipur_Raipur_District_Chhattisgarh.html">International</a>, <a href="/Restaurants-g297595-zfz10665-Raipur_Raipur_District_Chhattisgarh.html">Vegetarian Friendly</a>
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Meals
                </div>
                <div class="content">
                    Breakfast, Lunch, Dinner, Brunch
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Restaurant features
                </div>
                <div class="content">
                    Reservations, Seating, Takeout, Private Dining, Waitstaff
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Good for
                </div>
                <div class="content">
                    Groups, Business meetings, Child-friendly
                </div>
            </div>
            <div class="row">
                <div class="hours title">
                    Open Hours
                </div>
                <div class="hours content">
                    <div class="detail">
                        <span class="day">Sunday</span>
                        <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
                    </div>
                    <div class="detail">
                        <span class="day">Monday</span>
                        <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
                    </div>
                    <div class="detail">
                        <span class="day">Tuesday</span>
                        <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
                    </div>
                    <div class="detail">
                        <span class="day">Wednesday</span>
                        <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
                    </div>
                    <div class="detail">
                        <span class="day">Thursday</span>
                        <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
                    </div>
                    <div class="detail">
                        <span class="day">Friday</span>
                        <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
                    </div>
                    <div class="detail">
                        <span class="day">Saturday</span>
                        <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
                    </div>
                </div>
            </div>
        </div>
        <div class="additional_info">
            <div class="title">
                Location and Contact Information </div>
            <div class="content">
                <ul class="detailsContent">
                    <li>
                        <div class="detail">Address:
                            <span> <span class="format_address"><span class="street-address">G.E. Road</span> | <span class="extended-address">Mayura Hotel</span>, <span class="locality">Raipur 492001, </span><span class="country-name">India</span> </span>
                            </span>
                        </div>
                    </li>
                    <li>
                        <div class="detail">Location:
                            <span> Asia</span>
                            <span> &nbsp;&gt;&nbsp; India</span>
                            <span> &nbsp;&gt;&nbsp; Chhattisgarh</span>
                            <span> &nbsp;&gt;&nbsp; Raipur District</span>
                            <span> &nbsp;&gt;&nbsp; Raipur</span>
                        </div>
                    </li>
                    <li>
                        <div class="detail">Phone Number:
                            <span>+91 77142 00500</span>
                        </div>
                    </li>
                    <li>
                        <span class="ui_icon email"></span>
                        <a target="_blank&quot;" href="mailto:banquet@themayurahotels.com" onclick="ta.trackEventOnPage('Eatery_Listing','Email','6384395')">
E-mail </a>
                    </li>
                    <!--trkP:waypoint_for_poi_2-->
                    <!-- PLACEMENT waypoint_for_poi -->
                    <div id="taplc_waypoint_for_poi_1" class="ppr_rup ppr_priv_waypoint_for_poi" data-placement-name="waypoint_for_poi">
                    </div>
                    <!--etk-->
                </ul>
            </div>
        </div>
        <!--[if lte IE 9]>
            <style>
                .details_block .threeColumnList{
                    height: 350px;
                    overflow: auto;
                }
            </style>
            <![endif]-->
    </div>
</div>

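From the second HTML page I want the content values for Cuisine, Meals, and Restaurant features, just as in the HTML above. On this page, however, the extra content values Open Hours and Average prices are not present.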

<div id="RESTAURANT_DETAILS" class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS">
    <div class="header_with_improve wrap">
        <a href="/UpdateListing-g297595-d8595502-Barbeque_Nation-Raipur_Raipur_District_Chhattisgarh.html" onclick="ta.setEvtCookie('UpdateListing', 'entry-detail-moreinfo', null, 0, '/UpdateListingRedesign')">
            <div class="improve_listing_btn ui_button primary">Improve this listing</div>
        </a>
        <h3 class="tabs_header">Restaurant Details</h3> </div>
    <div class="details_tab">
        <div class="table_section">
            <div class="row">
                <div class="ratingSummary wrap">
                    <div class="histogramCommon bubbleHistogram wrap">
                        <div class="colTitle">
                            Rating summary
                        </div>
                        <ul class="barChart">
                            <li>
                                <div class="ratingRow wrap">
                                    <div class="label part ">
                                        <span class="text">Food</span>
                                    </div>
                                    <div class="wrap row part ">
                                        <span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span>
                                    </div>
                                </div>
                                <div class="ratingRow wrap">
                                    <div class="label part ">
                                        <span class="text">Service</span>
                                    </div>
                                    <div class="wrap row part ">
                                        <span class="ui_bubble_rating bubble_45" alt="4.5 of 5 bubbles"></span>
                                    </div>
                                </div>
                            </li>
                            <li>
                                <div class="ratingRow wrap">
                                    <div class="label part ">
                                        <span class="text">Value</span>
                                    </div>
                                    <div class="wrap row part ">
                                        <span class="ui_bubble_rating bubble_40" alt="4.0 of 5 bubbles"></span>
                                    </div>
                                </div>
                            </li>
                        </ul>
                    </div>
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Cuisine
                </div>
                <div class="content">
                    <a href="/Restaurants-g297595-c24-Raipur_Raipur_District_Chhattisgarh.html">Indian</a>, <a href="/Restaurants-g297595-c6-Raipur_Raipur_District_Chhattisgarh.html">Barbecue</a>, <a href="/Restaurants-g297595-c3-Raipur_Raipur_District_Chhattisgarh.html">Asian</a>, <a href="/Restaurants-g297595-zfz10665-Raipur_Raipur_District_Chhattisgarh.html">Vegetarian Friendly</a>, <a href="/Restaurants-g297595-zfz10697-Raipur_Raipur_District_Chhattisgarh.html">Vegan Options</a>, <a href="/Restaurants-g297595-zfz10992-Raipur_Raipur_District_Chhattisgarh.html">Gluten Free Options</a>
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Meals
                </div>
                <div class="content">
                    Lunch, Dinner
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Restaurant features
                </div>
                <div class="content">
                    Reservations, Seating, Waitstaff, Wheelchair Accessible, Validated Parking
                </div>
            </div>
            <div class="row">
                <div class="title">
                    Good for
                </div>
                <div class="content">
                    Groups, Special Occasion Dining, Kids, Child-friendly
                </div>
            </div>
        </div>
        <div class="additional_info">
            <div class="title">
                Location and Contact Information </div>
            <div class="content">
                <ul class="detailsContent">
                    <li>
                        <div class="detail">Address:
                            <span> <span class="format_address"> | <span class="extended-address">Magneto The Mall, 2nd Floor</span>, <span class="locality">Raipur 429010, </span><span class="country-name">India</span> </span>
                            </span>
                        </div>
                    </li>
                    <li>
                        <div class="detail">Location:
                            <span> Asia</span>
                            <span> &nbsp;&gt;&nbsp; India</span>
                            <span> &nbsp;&gt;&nbsp; Chhattisgarh</span>
                            <span> &nbsp;&gt;&nbsp; Raipur District</span>
                            <span> &nbsp;&gt;&nbsp; Raipur</span>
                        </div>
                    </li>
                    <li>
                        <div class="detail">Phone Number:
                            <span>+91 77160 60008</span>
                        </div>
                    </li>
                    <li>
                        <span class="ui_icon email"></span>
                        <a target="_blank&quot;" href="mailto:feedback@barbeque-nation.com" onclick="ta.trackEventOnPage('Eatery_Listing','Email','8595502')">
    E-mail </a>
                    </li>
                    <!--trkP:waypoint_for_poi_2-->
                    <!-- PLACEMENT waypoint_for_poi -->
                    <div id="taplc_waypoint_for_poi_1" class="ppr_rup ppr_priv_waypoint_for_poi" data-placement-name="waypoint_for_poi">
                    </div>
                    <!--etk-->
                </ul>
            </div>
        </div>
        <!--[if lte IE 9]>
                <style>
                    .details_block .threeColumnList{
                        height: 350px;
                        overflow: auto;
                    }
                </style>
                <![endif]-->
    </div>
</div>

1 answer
User
#1 · posted 2024-05-23 13:59:23

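Instead of getting a list of all <div class="content"> blocks and picking a few of them by index (the indexes shift between the first page and the second), you can find all the <div class="row"> elements, each of which contains a title and its corresponding content.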

# details_containers[0] is the <div id="RESTAURANT_DETAILS"> Tag from the question's code
rows = details_containers[0].findAll('div', {'class': 'row'})

# used to store data extracted from HTML <div class="row"> elements
data = {}

for row in rows:
    title = row.find('div', {'class': 'title'})
    content = row.find('div', {'class': 'content'})

    # rows without both a title and a content div (e.g. the rating summary) are skipped
    if title and content:
        # format the dict key to be more Pythonic; totally optional
        title = title.text.strip().lower().replace(' ', '-')
        data[title] = content

# with the HTML from the first page (Python 3.7+ dicts keep document order)
print(list(data.keys()))
#=> ['average-prices', 'cuisine', 'meals', 'restaurant-features', 'good-for', 'open-hours']
print(type(data['cuisine']))
#=> <class 'bs4.element.Tag'>
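
If plain text is wanted rather than Tag objects, the same loop can be wrapped in a small helper. The following is only a sketch: the function name extract_details, the trimmed SAMPLE_HTML, and the "n/a" default are made up for illustration, and it assumes the pages keep the row/title/content layout shown in the question.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the question's HTML (hypothetical sample data)
SAMPLE_HTML = """
<div id="RESTAURANT_DETAILS">
  <div class="row">
    <div class="title">Cuisine</div>
    <div class="content"><a href="#">Indian</a>, <a href="#">Asian</a></div>
  </div>
  <div class="row">
    <div class="title">Meals</div>
    <div class="content">Lunch, Dinner</div>
  </div>
</div>
"""


def extract_details(html):
    """Return a dict mapping each row title to its content as plain text."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", {"id": "RESTAURANT_DETAILS"})
    details = {}
    if block is None:
        return details
    for row in block.find_all("div", {"class": "row"}):
        title = row.find("div", {"class": "title"})
        content = row.find("div", {"class": "content"})
        if title is not None and content is not None:
            key = title.get_text(strip=True).lower().replace(" ", "-")
            # collapse newlines and runs of spaces into single spaces
            details[key] = " ".join(content.get_text().split())
    return details


details = extract_details(SAMPLE_HTML)
print(details.get("cuisine"))
# a field missing from the page falls back to a default instead of raising
print(details.get("average-prices", "n/a"))
```

Using dict.get for the lookups means a page that lacks Average prices or Open Hours does not raise a KeyError.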

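Now you can extract the content items from the HTML page regardless of the order in which they appear. This code should work for any HTML that has the same general structure as the two pages you provided. I hope this helps!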
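
The Open Hours row is the one case where the content is itself nested, with one <div class="detail"> per day. A possible way to flatten it into a day-to-hours mapping is sketched below. The helper name parse_open_hours and the shortened HOURS_HTML are assumptions for illustration; the class names day and hoursRange are taken from the first page's HTML.

```python
from bs4 import BeautifulSoup

# Shortened copy of the "Open Hours" row from the first page (two days only)
HOURS_HTML = """
<div class="row">
  <div class="hours title">Open Hours</div>
  <div class="hours content">
    <div class="detail">
      <span class="day">Sunday</span>
      <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
    </div>
    <div class="detail">
      <span class="day">Monday</span>
      <span class="hours"><div class="hoursRange">07:00 - 23:00</div></span>
    </div>
  </div>
</div>
"""


def parse_open_hours(content):
    """Map each day name to the list of hour ranges found for that day."""
    hours = {}
    for detail in content.find_all("div", {"class": "detail"}):
        day = detail.find("span", {"class": "day"})
        ranges = detail.find_all("div", {"class": "hoursRange"})
        if day is not None:
            hours[day.get_text(strip=True)] = [r.get_text(strip=True) for r in ranges]
    return hours


soup = BeautifulSoup(HOURS_HTML, "html.parser")
# matching on "content" also matches class="hours content"
content = soup.find("div", {"class": "content"})
open_hours = parse_open_hours(content)
print(open_hours)
```

Each day maps to a list so that a day with more than one hoursRange entry would keep all of them. On a page without an Open Hours row there is simply no content Tag to pass in, so the caller should check for that first.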
