Python web scraping and saving to a pandas DataFrame

import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
detail_title = soup.find_all(class_='detail-title')
details_t = pd.DataFrame(detail_title)

2 Answers

User

Answer 1 · edited 2024-05-26 21:50:44

You can try this. I assumed you only want the text, but feel free to adapt my example.

import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
detail_title = soup.find_all(class_='detail-title')

# Collect the text of each Tag; the Tag objects themselves are not usable cell values.
ls = [tag.text for tag in detail_title]

df = pd.DataFrame(data=ls)

print(df)

Edit: print(type(detail_title)) gives a bs4.element.ResultSet, which is not an accepted data type. From https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html:

data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
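
To make this concrete without hitting the network, here is a minimal sketch that runs against a made-up HTML snippet. The markup and the column name "detail_title" are assumptions for illustration, not the real listing page. It shows that find_all hands back Tag objects and that reducing them to text first gives pandas plain strings to work with.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real listing page.
html = """
<div><span class="detail-title">Property Type</span></div>
<div><span class="detail-title">Bedrooms</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
detail_title = soup.find_all(class_="detail-title")

# find_all returns a ResultSet of Tag objects, not plain strings.
print(type(detail_title).__name__)     # ResultSet
print(type(detail_title[0]).__name__)  # Tag

# Pull the text out of each Tag before handing it to pandas.
titles = [tag.get_text(strip=True) for tag in detail_title]
df = pd.DataFrame({"detail_title": titles})
print(df)
```

Passing a dict gives the column a name instead of the default 0, which tends to be easier to work with later.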

User

Answer 2 · edited 2024-05-26 21:50:44

detail_title does not contain anything that can go straight into a DataFrame: it is a list of BeautifulSoup "bs4.element.Tag" objects (see what type(detail_title[0]) gives). Try the following:

Step 1. Extract the column headings

import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.remax.ca/ab/calgary-real-estate/720-37-st-nw-wp_id251536557-lst'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
detail_title = soup.find_all(class_='detail-title')

headings = [d.text for d in detail_title]
details_t = pd.DataFrame(columns = headings)

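Step 2. Go up one level in the HTML to get the paired detail names and values. (The detail names are the ones extracted in Step 1.) Write a helper function that returns the value for a given name.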
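
As a rough illustration of what such a helper might look like, the sketch below works on a made-up snippet rather than the real page. The markup, the "detail-value" class name, and the function body are all assumptions; the actual page structure may differ, so inspect type(detail_title[0]) and its parent before relying on this.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each detail is a wrapper holding a title and a value.
html = """
<ul>
  <li><span class="detail-title">Property Type</span><span class="detail-value">House</span></li>
  <li><span class="detail-title">Bedrooms</span><span class="detail-value">3</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# One level up from each title: the element that wraps the name/value pair.
details = [title.parent for title in soup.find_all(class_="detail-title")]


def get_detail_value(name, details):
    """Return the value text paired with `name`, or None if it is absent."""
    for block in details:
        title = block.find(class_="detail-title")
        if title is not None and title.get_text(strip=True) == name:
            value = block.find(class_="detail-value")  # assumed class name
            return value.get_text(strip=True) if value is not None else None
    return None


print(get_detail_value("Bedrooms", details))  # 3
print(get_detail_value("Lot Size", details))  # None
```

Returning None for a missing name keeps the helper usable on pages that lack some of the details.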

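This is a little odd if you are only scraping one page. I think what you want to do is run Step 1 once to get the detail names, then run Step 2 on every page you want to scrape.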

Step 3. For each page scraped, append the detail values found to the DataFrame.

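# DataFrame.append was removed in pandas 2.0, so wrap the row in a one-row frame and concatenate it.
details_t = pd.concat([details_t, pd.DataFrame([{deet: get_detail_value(deet, details) for deet in details_t.columns}])], ignore_index=True)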
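
When many pages are involved, one possible variation is to collect one dict per page and build the DataFrame once at the end, which avoids growing the frame row by row. The sketch below assumes made-up page strings, a hypothetical "detail-value" class, and a hypothetical parse_details function; none of these come from the real site.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical pages; in practice each string would be requests.get(url).text.
pages = [
    '<li><span class="detail-title">Bedrooms</span><span class="detail-value">3</span></li>'
    '<li><span class="detail-title">Bathrooms</span><span class="detail-value">2</span></li>',
    '<li><span class="detail-title">Bedrooms</span><span class="detail-value">4</span></li>',
]


def parse_details(html):
    """Map each detail name on a page to its value text (None if missing)."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = {}
    for title in soup.find_all(class_="detail-title"):
        value = title.parent.find(class_="detail-value")  # assumed class name
        name = title.get_text(strip=True)
        pairs[name] = value.get_text(strip=True) if value is not None else None
    return pairs


# One dict per page; names absent from a page simply have no key.
rows = [parse_details(html) for html in pages]

# Take the headings from the first page, then build the frame in one go.
headings = list(rows[0])
details_t = pd.DataFrame(rows, columns=headings)
print(details_t)
```

Details that a page does not list come out as NaN in that page's row, so the columns stay aligned across pages.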
