BeautifulSoup: extract <li> and <ul> tags and write the results to CSV

Posted 2024-06-10 19:54:00


I am trying to extract the text from every row (each "li") in the markup below:

<ul id="tco_detail_data">
        <li>
            <ul class="list-title">
                <li class="first"> </li>
                <li>Year 1</li>
                <li>Year 2</li>
                <li>Year 3</li>
                <li>Year 4</li>
                <li>Year 5</li>
                <li class="last">5 Yr Total</li>
            </ul>
        </li>
        <hr class="loose-dotted" />
        <li class="first">
            <ul class="first">
                <li class="first">Depreciation</li>
                <li>$5,390</li>
                <li>$1,658</li>
                <li>$1,459</li>
                <li>$1,293</li>
                <li>$1,161</li>
                <li class="last">$10,961</li>
            </ul>
        </li>
        <hr class="loose-dotted" />
        <li>
            <ul>
                <li class="first">Taxes &amp; Fees</li>
                <li>$1,424</li>
                <li>$61</li>
                <li>$61</li>
                <li>$61</li>
                <li>$61</li>
                <li class="last">$1,668</li>
            </ul>
        </li>
        <hr class="loose-dotted" />
        <li>
            <ul>
                <li class="first">Financing</li>
                <li>$1,022</li>
                <li>$817</li>
                <li>$603</li>
                <li>$375</li>
                <li>$135</li>
                <li class="last">$2,952</li>
            </ul>

To get this far, I used the following:

^{pr2}$

Now, to extract all the rows under ^{cl1}$:

details = soup.find_all("li", {"class":"first"})

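However, this only returns the parent li tags and the child li tags whose class is "first". How can I repeat this process to select the section under each li with class "first" and write the results to CSV? Any guidance would be appreciated.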


2 Answers

Here is an approach similar to the earlier answer. It retrieves the table from the web page as a nested list (i.e. [[table row], [table row], ...]):

import csv

# `soup` is the BeautifulSoup object already built from the page
data = soup.find_all("ul", {"id": "tco_detail_data"})

# get all list elements
lis = data[0].find_all('li')

# add a helper lambda, just for readability
find_ul = lambda x: x.find_all('ul')
uls = [find_ul(elem) for elem in lis if find_ul(elem) != []]

# use a nested list comprehension to iterate over the <ul> tags
# and extract text from each <li> into sublists
# (on Python 3 keep the text as str; encoded bytes would be written as b'...')
text = [[li.text for li in ul[0].find_all('li')] for ul in uls]

# [
#   ['\xa0', 'Year 1', 'Year 2', 'Year 3', 'Year 4', 'Year 5', '5 Yr Total'],
#   ['Depreciation', '$4,853', '$1,658', '$1,459', '$1,293', '$1,161', '$10,424'],
#   ['Taxes & Fees', '$2,057', '$21', '$66', '$21', '$66', '$2,231'],
#   ['Financing', '$1,026', '$821', '$605', '$376', '$136', '$2,964'],
#   ['Fuel', '$1,606', '$1,654', '$1,704', '$1,755', '$1,808', '$8,527'],
#   ['Insurance', '$764', '$791', '$818', '$847', '$877', '$4,097'],
#   ['Maintenance', '$230', '$601', '$385', '$1,653', '$1,504', '$4,373'],
#   ['Repairs', '$0', '$0', '$109', '$257', '$374', '$740'],
#   ['Tax Credit', '$0', '', '', '', '', '$0'],
#   ['True Cost to Own ®', '$10,536', '$5,546', '$5,146', '$6,202', '$5,926', '$33,356']
# ]

# write "text" list to csv (newline='' avoids blank lines between rows on Windows)
with open('ford_escape_2017.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(text)
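
For readers who want to try the row extraction without fetching the live page, here is a minimal sketch. It assumes a shortened, hypothetical copy of the markup from the question (the `HTML` string is not the real page), and it walks only the direct `<li>` children of the outer list instead of filtering on class "first". The names `HTML`, `outer`, `rows`, and `csv_text` are illustrative only.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical, shortened copy of the markup shown in the question.
HTML = """
<ul id="tco_detail_data">
  <li>
    <ul class="list-title">
      <li class="first">&nbsp;</li>
      <li>Year 1</li>
      <li class="last">5 Yr Total</li>
    </ul>
  </li>
  <hr class="loose-dotted" />
  <li class="first">
    <ul class="first">
      <li class="first">Depreciation</li>
      <li>$5,390</li>
      <li class="last">$10,961</li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")
outer = soup.find("ul", id="tco_detail_data")

rows = []
# recursive=False keeps only the top-level <li> items, one per table row.
for item in outer.find_all("li", recursive=False):
    inner = item.find("ul")
    if inner is None:
        continue  # skip items that do not wrap a nested list
    rows.append(
        [li.get_text(strip=True) for li in inner.find_all("li", recursive=False)]
    )

# Write to an in-memory buffer; a real file would be opened with newline="".
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
```

Using `recursive=False` is a design choice in this sketch: it avoids visiting the nested `<li>` cells twice, which is what happens when `find_all('li')` is called on the outer list without it.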

I am not sure whether the output I got is what you want, since you did not provide sample output.

Code:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.edmunds.com/ford/escape/2017/cost-to-own/').text
soup = BeautifulSoup(page, 'html.parser')
uls = soup.find_all('ul', id='tco_detail_data')
for ul in uls:
    # search the matched <ul> directly; re-parsing it into a new soup is unnecessary
    for li in ul.find_all('li'):
        print(li.text)

Output:

^{pr2}$

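To be able to save the results to a CSV file, I used cmaher's answer, since it helps with creating the CSV file. My own code simply gives you the text between all the li tags. Note that I use a pipe rather than a comma as the delimiter in the CSV file, because the data itself contains commas.

Code: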
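
As a side note on the delimiter choice, the sketch below uses only the standard `csv` module and a couple of hypothetical rows shaped like the table in the question. It illustrates two options: passing `delimiter="|"` to `csv.writer`, or keeping the default comma and relying on the writer to quote fields that contain commas. The variable names are illustrative and not part of either answer.

```python
import csv
import io

# Hypothetical rows shaped like the table in the question.
rows = [
    ["", "Year 1", "5 Yr Total"],
    ["Depreciation", "$5,390", "$10,961"],
]

# Default comma delimiter: fields containing commas are quoted automatically.
comma_buf = io.StringIO()
csv.writer(comma_buf).writerows(rows)
comma_text = comma_buf.getvalue()

# Pipe delimiter: no quoting is needed because no field contains "|".
pipe_buf = io.StringIO()
csv.writer(pipe_buf, delimiter="|").writerows(rows)
pipe_text = pipe_buf.getvalue()

# Reading the comma-delimited text back recovers the original fields.
round_trip = list(csv.reader(io.StringIO(comma_text)))
```

Either form keeps the dollar amounts intact; which one is preferable mostly depends on what will read the file afterwards.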

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.edmunds.com/ford/escape/2017/cost-to-own/').text
soup = BeautifulSoup(page, 'html.parser')
data = soup.find_all("ul", {"id": "tco_detail_data"})
lis = data[0].find_all('li')
find_ul = lambda x: x.find_all('ul')
uls = [find_ul(elem) for elem in lis if find_ul(elem) != []]
text = [[li for li in ul[0].find_all('li')] for ul in uls]
with open('csvfile.csv', 'w') as file:
    for lis in text:
        temp = ''
        for li in lis:
            temp += li.text + '|'
        temp += '\n'
        file.write(temp)

Output:

 |Year 1|Year 2|Year 3|Year 4|Year 5|5 Yr Total|
Depreciation|$5,219|$1,658|$1,459|$1,293|$1,161|$10,790|
Taxes & Fees|$2,257|$195|$184|$175|$166|$2,977|
Financing|$1,051|$842|$620|$386|$139|$3,038|
Fuel|$1,906|$1,963|$2,022|$2,083|$2,146|$10,120|
Insurance|$1,160|$1,201|$1,243|$1,286|$1,331|$6,221|
Maintenance|$274|$716|$447|$1,849|$1,637|$4,923|
Repairs|$0|$0|$134|$318|$465|$917|
Tax Credit|$0|||||$0|
True Cost to Own ®|$11,867|$6,575|$6,109|$7,390|$7,045|$38,986|
