How do I append new rows to a CSV in Python (BeautifulSoup) for a multi-URL scraper?

Posted 2024-04-25 12:54:44


I am very new to Python and I am testing my first scraper (using some code I found here and there). I can write the CSV with all the information I need, but now that I'm trying to feed in more than one URL, the script only writes the last URL in the list — it's as if the new URLs are never appended, and the rows for the first URL just get overwritten.

I've searched everywhere and tried many things, but I think I need some help. Thanks!

from bs4 import BeautifulSoup
import requests
from csv import writer

urls = ['https://example.com/1', 'https://example.com/2']

for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html,'html.parser')

    info = []

print (urls)

lists = soup.find_all('div', class_="profile-info-holder")
links = soup.find_all('a', class_="intercept")

with open('multi.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Name', 'Location', 'Link', 'Link2', 'Link3']
    thewriter.writerow(header)

    for list in lists:
        name = list.find('div', class_="profile-name").text
        location = list.find('div', class_="profile-location").text

        social1 = links[0]
        social2 = links[1]
        social3 = links[2]

        info = [name, location, social1.get('href'),social2.get('href'),social3.get('href')]
        thewriter.writerow(info)
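For context on the symptom described above: the requests and parsing happen inside the `for url in urls:` loop, but `lists`, `links`, and the CSV writing sit outside it, so by the time anything is written, `soup` only holds the last response. A minimal sketch of the same rebinding pattern (hypothetical values, not from the scraper):

```python
items = ['a', 'b', 'c']
for item in items:
    result = item.upper()  # rebound on every pass through the loop

# the loop is over; only the final binding survives
print(result)  # → 'C'
```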

1 Answer

#1 · Posted 2024-04-25 12:54:44 by a forum user

Basic approach

  • Open the file in append mode ('a')
  • The write cursor points to the end of the file
  • Use the write() function to append a "\n" at the end of the file
  • Use the write() function to append the given line to the file
  • Close the file
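The steps above can be sketched as a minimal example (the filename `log.txt` is an assumption for illustration):

```python
# Open in append mode: the write cursor starts at the end of the file,
# so existing content is never overwritten.
with open('log.txt', 'a', encoding='utf8') as f:
    f.write('new line\n')  # append the line plus its trailing "\n"
# leaving the with-block closes the file
```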

with open('multi.csv', 'a', encoding='utf8', newline='') as f:

You may have to arrange the loop another way, but without the real URLs it is hard to say for sure:

from bs4 import BeautifulSoup
import requests
from csv import writer

urls = ['https://example.com/1', 'https://example.com/2']


with open('multi.csv', 'a', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Name', 'Location', 'Link', 'Link2', 'Link3']
    thewriter.writerow(header)

    
    # parse each URL inside the with-block so every page is written to the same file
    for url in urls:
        my_url = requests.get(url)
        html = my_url.content
        soup = BeautifulSoup(html, 'html.parser')

        lists = soup.find_all('div', class_="profile-info-holder")

        for l in lists:
            name = l.find('div', class_="profile-name").text
            location = l.find('div', class_="profile-location").text
            links = l.find_all('a', class_="intercept")
            social1 = links[0]
            social2 = links[1]
            social3 = links[2]

            info = [name, location, social1.get('href'), social2.get('href'), social3.get('href')]
            thewriter.writerow(info)
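One side effect of opening with 'a' is that the script above writes the header row again on every run. A common guard (not part of the answer above; the path name is reused only for illustration) is to write the header only when the file is missing or empty:

```python
import csv
import os

path = 'multi.csv'
# Write the header only if the file does not exist yet or has no content.
need_header = not os.path.exists(path) or os.path.getsize(path) == 0
with open(path, 'a', encoding='utf8', newline='') as f:
    w = csv.writer(f)
    if need_header:
        w.writerow(['Name', 'Location', 'Link', 'Link2', 'Link3'])
```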
