How to scrape data from the web with Python and write it to a CSV file

Posted 2024-05-13 21:30:26


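I am trying to scrape data from a web page, and the scraping itself works. After collecting all the matching div elements with the script below, I am not sure how to write the data to a CSV file.

The first-name data should go in the first-name column, the last-name data in the last-name column, and so on.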

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'

page = urlopen(html)

data = BeautifulSoup(page, 'html.parser')

name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'}) #edited companyName_99a4824b -> companyName__99a4824b

for i in range(len(name_box)):
    data = name_box[i].text.strip()

Data:

^{pr2}$

This is the data I get after running the code above.

Edit

for i in range(len(name_box)):
    data = name_box[i].text.strip()
    print (data)
    fname = 'out.csv'
    with open(fname) as f:
        next(f)
        for line in f:
            head = []
            value = []
            for row in line:
                head.append(row)
            print (row)

Expected output:

Information Type | First  | Middle Name | Last Name | ......
Individual       | KACHAM |             | RAJESHWAR | .....

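I have 200 URLs, but the data is not the same for every URL: some fields are missing on some pages. I want to write the file so that a blank is written wherever a value is not available.

Please advise. Thanks in advance.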


2 Answers

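To write a CSV you need to know which values belong in the header row and which in the body. In this case, the header values are the HTML elements that contain <label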

import csv
from urllib.request import urlopen  # Python 3; Python 2 used urllib2
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'

page = urlopen(html)

data = BeautifulSoup(page, 'html.parser')

name_box = data.find_all('div', attrs={'class': 'col-md-3 col-sm-3'})

heads = []
values = []

for box in name_box:
    text = box.text.strip()
    boxHTML = str(box)
    if 'PInfoType' in boxHTML:
        # <div class="col-md-3 col-sm-3" id="PInfoType">
        # empty value, maybe additional data for "Information Type"
        continue

    if 'for="2"' in boxHTML:
        # <label for="2">No</label>
        # looks like a header because of the label, but is actually a value
        values.append(text)

    elif '<label' in boxHTML:
        # <label for="PersonalInfoModel_InfoTypeValue">Information Type</label>
        # header (top row)
        heads.append(text)

    else:
        # <div class="col-md-3 col-sm-3">Individual</div>
        # value for the second row
        values.append(text)

# csv.writer quotes fields that contain commas, which a plain join would not
with open("results.csv", 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(heads)
    writer.writerow(values)

print("finish.")
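
For the 200-URL case in the question, where some pages lack some fields, one possible approach is to keep each page as a dict of header to value and let csv.DictWriter fill the gaps. The following is only a sketch under assumptions: the helper name records_to_csv is hypothetical, the second record is made-up sample data standing in for a page with missing fields, and the CSV is built in memory so that the example runs without network access.

```python
import csv
import io


def records_to_csv(records):
    """Return CSV text for a list of dicts whose keys may differ."""
    # Collect every header that appears on any page, in first-seen order.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buf = io.StringIO()
    # restval is what DictWriter writes when a record lacks a column.
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# One dict per scraped page. The second record is hypothetical sample data
# for a page where "Middle Name" and "Last Name" were not present.
records = [
    {"Information Type": "Individual", "First Name": "KACHAM",
     "Middle Name": "", "Last Name": "RAJESHWAR"},
    {"Information Type": "Individual", "First Name": "EXAMPLE"},
]

csv_text = records_to_csv(records)
print(csv_text)
```

To write to disk instead, the same DictWriter can be given a file opened with open("results.csv", "w", newline="").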

Question: How to write csv file from scraped data

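Read the Data into a dict and write it to a CSV file using csv.DictWriter(...).
Documentation for the following: csv.DictWriter, while, next, break, Mapping Types — dict

  1. Skip the first line, because it is the title
  2. Loop over the Data
    1. key = next(data)
    2. value = next(data)
    3. If there is no more data, break out of the loop
    4. Set dict[key] = value
  3. After the loop finishes, write the dict to the CSV file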

Output:

{'Individual': '', 'Father Full Name': 'RAMAIAH', 'First Name': 'KACHAM', 'Middle Name': '', 'Last Name': 'RAJESHWAR',... (omitted for brevity)
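
The steps listed in this answer can be sketched roughly as follows. This is not the answerer's original code, which is not shown in the source; the function names pairs_to_dict and dict_to_csv, the sample list, and its "Personal Information" title item are all assumptions made for illustration.

```python
import csv
import io


def pairs_to_dict(items):
    """Pair consecutive items as key/value after skipping the title item."""
    it = iter(items)
    next(it, None)              # step 1: skip the first item (the title)
    record = {}
    while True:                 # step 2: loop over the data
        try:
            key = next(it)
            value = next(it)
        except StopIteration:
            break               # no more data: leave the loop
        record[key] = value     # a trailing key without a value is dropped
    return record


def dict_to_csv(record):
    """Step 3: write one dict as a header row plus one data row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(record))
    writer.writeheader()
    writer.writerow(record)
    return buf.getvalue()


# Hypothetical stand-in for the stripped div texts of one page.
sample = ["Personal Information",
          "First Name", "KACHAM",
          "Middle Name", "",
          "Last Name", "RAJESHWAR"]

record = pairs_to_dict(sample)
csv_text = dict_to_csv(record)
print(csv_text)
```

This pairing relies on the scraped texts strictly alternating between label and value; if a page omits a value entirely, the pairs shift, so checking for the label element as in the first answer may be more robust.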
