Python web scraping: NoneType object failure, broken HTML?

import csv

import requests
from bs4 import BeautifulSoup

url = 'http://financials.morningstar.com/ratios/r.html?t=SBUX&region=USA&culture=en_US'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html, "html.parser")
table = soup.find('table', attrs={'class': 'r_table1 text2'})
print(table.prettify())  # debugging; fails here because table is None

# collect the text of every header/data cell, one list per table row
list_of_rows = []
for row in table.find_all('tr'):
    list_of_cells = []
    for cell in row.find_all(['th', 'td']):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

print(list_of_rows)  # debugging

# write the collected rows to a CSV file
with open("./test.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(list_of_rows)

1 answer

User

#1 · Posted on 2024-05-17 00:12:48

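The table is loaded dynamically through a separate XHR call to an endpoint that returns a JSONP response, so it is not present in the HTML of the page you requested. Simulate that request, extract the JSON string from the JSONP response, load it with json, take the HTML from the componentData key, and parse it with BeautifulSoup: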

import json
import re

import requests
from bs4 import BeautifulSoup

# make a request
url = 'http://financials.morningstar.com/financials/getFinancePart.html?&callback=jsonp1450279445504&t=XNAS:SBUX&region=usa&culture=en-US&cur=&order=asc&_=1450279445578'
response = requests.get(url)

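# strip the JSONP wrapper (leading "callback(" and trailing ");") and extract the
# HTML under "componentData"; response.text is a str, so the str pattern also
# works on Python 3, where response.content is bytes
data = json.loads(re.sub(r'^\s*[\w.]*\(|\);?\s*$', '', response.text))["componentData"]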

# parse HTML
soup = BeautifulSoup(data, "html.parser")
table = soup.find('table', attrs={'class': 'r_table1 text2'})
print(table.prettify())
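
If you would rather not rely on a regular expression for the JSONP step, here is a minimal sketch of a regex-free alternative. It assumes the response has the usual shape callback(<json>); and simply slices between the first "(" and the last ")", so parentheses inside the payload are left alone. The helper name unwrap_jsonp and the sample string are made up for illustration; they are not real output from the endpoint.

```python
import json


def unwrap_jsonp(text):
    """Decode the JSON payload of a JSONP string such as 'cb({...});'.

    Hypothetical helper: assumes a single callback wrapper around the JSON.
    str.index / str.rindex raise ValueError if a parenthesis is missing.
    """
    start = text.index("(") + 1  # the first "(" ends the callback name
    end = text.rindex(")")       # the last ")" closes the wrapper
    return json.loads(text[start:end])


# Made-up sample standing in for the real response body.
sample = 'jsonp123({"componentData": "<td>Item (unit)</td>"});'

payload = unwrap_jsonp(sample)
print(payload["componentData"])
```

With the real request you would call unwrap_jsonp(response.text)["componentData"] and hand the result to BeautifulSoup exactly as in the code above.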
