Extracting all a href and title from a number of <div class="xxx"> elements



I'm doing some web scraping, and so far I have this -

import requests
from bs4 import BeautifulSoup

page = requests.get('http://abcdefgh.in')
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')
all_p = soup.find_all(class_="p-list-sec")
print(all_p)

After doing this, when I print all_p I get -

<div class = "p-list-sec">
<UI> <li>  < a href = "link1", title = "tltle1">title1<a/></li>
     <li>  < a href = "link2", title = "tltle2">title2<a/></li>
     <li>  < a href = "link3", title = "tltle3">title3<a/></li>
</ui>
</div>

<div class = "p-list-sec">
<UI> <li>  < a href = "link1", title = "tltle1">title1<a/></li>
     <li>  < a href = "link2", title = "tltle2">title2<a/></li>
     <li>  < a href = "link3", title = "tltle3">title3<a/></li>
</ui>
</div>

<div class = "p-list-sec">
<UI> <li>  < a href = "link1", title = "tltle1">title1<a/></li>
     <li>  < a href = "link2", title = "tltle2">title2<a/></li>
     <li>  < a href = "link3", title = "tltle3">title3<a/></li>
</ui>
</div>

and so on, up to around 40 such divs.

Now I want to extract all the a href and title values inside the p-list-sec classes and store them in a file. I know how to write them to a file, but extracting all the a href and title values from all the p-list-sec divs is the part I'm stuck on. I'm using Python 3.9 with the requests and BeautifulSoup libraries, on Windows 10 from the command prompt.

Thanks, Akshi


3 Answers

If you don't mind about the div names, here is a one-liner:

import re

with open("data.html", "r") as msg:
    data = msg.readlines()

data = [tuple(re.sub(r'.+href = "(.+)",.+title = "(.+)".+',r'\1'+' '+r'\2',v).split()) for v in [v.strip() for v in data if "href" in v]]

Output:

[('link1', 'tltle1'), ('link2', 'tltle2'), ('link3', 'tltle3'), ('link1', 'tltle1'), ('link2', 'tltle2'), ('link3', 'tltle3'), ('link1', 'tltle1'), ('link2', 'tltle2'), ('link3', 'tltle3')]
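
If you prefer, the same pairs can be pulled out with re.findall instead of the re.sub/split round-trip; this is just an equivalent sketch, assuming the same data.html file as above:

import re

# Sketch: non-greedy groups grab each href/title pair straight out of the
# malformed markup, one tuple per match.
with open("data.html", "r") as msg:
    data = msg.read()

pairs = re.findall(r'href = "(.+?)", title = "(.+?)"', data)
print(pairs)  # [('link1', 'tltle1'), ('link2', 'tltle2'), ...]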

Otherwise:

with open("data.html", "r") as msg:
    data = msg.readlines()

div_write = False
href_write = False

wdata = []; odata = []

for line in data:
    if '<div class =' in line:
        class_name = line.split("<div class =")[1].split(">")[0].strip()
        div_write = True
    if "</div>" in line and div_write == True:
        odata.append(wdata)
        wdata = []
        div_write = False

    if div_write == True and "< a href" in line:
        href = line.strip().split("< a href =")[1].split(",")[0].strip()
        title = line.strip().split("title =")[1].split(">")[0].strip()
        wdata.append(class_name+" "+href+" "+title)

with open("out.dat", "w") as msg:
    for wdata in odata:
        msg.write("\n".join(wdata)+"\n\n")

This way you save a file in which you keep track of both the information and the section name.

Output:

"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"

"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"

"p-list-sec" "link1" "tltle1"
"p-list-sec" "link2" "tltle2"
"p-list-sec" "link3" "tltle3"

Would this work?

...

for p in all_p:
    for link in p.find_all('a'):
        print(link['href'])
        print(link.text) # or link['title']
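
And since the goal is to store them in a file, a minimal sketch of the writing step (reusing all_p from the question; the file name links.txt is just an example):

# Minimal sketch: one tab-separated "href<TAB>title" pair per line.
# "links.txt" is an arbitrary example name; .get() guards against
# anchors that are missing a title attribute.
with open('links.txt', 'w') as f:
    for p in all_p:
        for link in p.find_all('a'):
            f.write(link['href'] + '\t' + link.get('title', '') + '\n')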

Just in case.

To avoid looping twice, you can also use BeautifulSoup's CSS selectors and chain the class with the <a> tag. So take your soup and select like this:

soup.select('.p-list-sec a')

To shape the information the way you want to process it, you can use a for loop, or a list comprehension to do it all in one line:

[{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]

Output:

[{'url': 'link1', 'title': 'tltle1'},
 {'url': 'link2', 'title': 'tltle2'},
 {'url': 'link3', 'title': 'tltle3'},
 {'url': 'link1', 'title': 'tltle1'},
 {'url': 'link2', 'title': 'tltle2'},
 {'url': 'link3', 'title': 'tltle3'},
 {'url': 'link1', 'title': 'tltle1'},
 {'url': 'link2', 'title': 'tltle2'},
 {'url': 'link3', 'title': 'tltle3'}]

To store it as CSV, feel free to push it through pandas or the csv module:

Pandas:

import pandas as pd

pd.DataFrame([{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]).to_csv('url.csv', index=False)

CSV:

import csv
data_list = [{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]

keys = data_list[0].keys()

with open('url.csv', 'w', newline='') as output_file:  # newline='' avoids blank rows on Windows
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(data_list)
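
As a quick optional check, you can read the file back with csv.DictReader to confirm the rows landed as expected:

import csv

# Optional sanity check: print each stored row back out.
with open('url.csv', newline='') as f:
    for row in csv.DictReader(f):
        print(row['url'], row['title'])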
