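How to scrape an embedded script from a web page in Python
For example, I have a web page: http://www.amazon.com/dp/1597805483.
I want to use XPath to scrape this sentence: Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime.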
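import requests
from lxml import html
url = "http://www.amazon.com/dp/1597805483"
page = requests.get(url)
tree = html.fromstring(page.text)
# XPath copied from the browser's rendered DOM
feature_bullets = tree.xpath('//*[@id="iframeContent"]/div/text()')
print(feature_bullets)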
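The code above returns nothing. The reason is that the XPath the browser sees differs from the one in the page source, but I don't know how to get the XPath from the source.
1 Answer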
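A lot goes into building the page you are trying to scrape.
Specifically, the HTML you are after is not in the page source as markup; it is built by a JavaScript function: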
<script type="text/javascript">
P.when('DynamicIframe').execute(function (DynamicIframe) {
var BookDescriptionIframe = null,
bookDescEncodedData = "%3Cdiv%3E%3CB%3EA%20Fantastic%20Anthology%20Combining%20the%20Love%20of%20Science%20Fiction%20with%20Our%20National%20Pastime%3C%2FB%3E%3CBR%3E%3CBR%3EOf%20all%20the%20sports%20played%20across%20the%20globe%2C%20none%20has%20more%20curses%20and%20superstitions%20than%20baseball%2C%20America%26%238217%3Bs%20national%20pastime.%3Cbr%3E%3CBR%3E%3CI%3EField%20of%20Fantasies%3C%2FI%3E%20delves%20right%20into%20that%20superstition%20with%20short%20stories%20written%20by%20several%20key%20authors%20about%20baseball%20and%20the%20supernatural.%20%20Here%20you%27ll%20encounter%20ghostly%20apparitions%20in%20the%20stands%2C%20a%20strangely%20charming%20vampire%20double-play%20combination%2C%20one%20fan%20who%20can%20call%20every%20shot%20and%20another%20who%20can%20see%20the%20past%2C%20a%20sad%20alternate-reality%20for%20the%20game%27s%20most%20famous%20player%2C%20unlikely%20appearances%20on%20the%20field%20by%20famous%20personalities%20from%20Stephen%20Crane%20to%20Fidel%20Castro%2C%20a%20hilariously%20humble%20teenage%20phenom%2C%20and%20much%20more.%20In%20this%20wonderful%20anthology%20are%20stories%20from%20such%20award-winning%20writers%20as%3A%3CBR%3E%3CBR%3EStephen%20King%20and%20Stewart%20O%26%238217%3BNan%3Cbr%3EJack%20Kerouac%3CBR%3EKaren%20Joy%20Fowler%3CBR%3ERod%20Serling%3CBR%3EW.%20P.%20Kinsella%3CBR%3EAnd%20many%20more%21%3CBR%3E%3CBR%3ENever%20has%20a%20book%20combined%20the%20incredible%20with%20great%20baseball%20fiction%20like%20%3CI%3EField%20of%20Fantasies%3C%2FI%3E.%20This%20wide-ranging%20collection%20reaches%20from%20some%20of%20the%20earliest%20classics%20from%20the%20pulp%20era%20and%20baseball%27s%20golden%20age%2C%20all%20the%20way%20to%20material%20appearing%20here%20for%20the%20first%20time%20in%20a%20print%20edition.%20Whether%20you%20love%20the%20game%20or%20just%20great%20fiction%2C%20these%20stories%20will%20appeal%20to%20all%2C%20as%20the%20writers%20in%20this%20anthology%20bring%20great%20storytelling%20of%20the%20
strange%20and%20supernatural%20to%20the%20plate%2C%20inning%20after%20inning.%3CBR%3E%3C%2Fdiv%3E",
bookDescriptionAvailableHeight,
minBookDescriptionInitialHeight = 112,
options = {};
...
</script>
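To see what the encoded string holds, here is a minimal sketch that decodes a short fragment copied from the bookDescEncodedData value above. It assumes Python 3 and uses only the standard library (urllib.parse.unquote and html.unescape); the variable names are illustrative.

```python
from html import unescape
from urllib.parse import unquote

# A short fragment copied from the bookDescEncodedData string shown above
fragment = "America%26%238217%3Bs%20national%20pastime.%3Cbr%3E"

# Step 1: undo the percent-encoding, which yields ordinary HTML
as_html = unquote(fragment)
print(as_html)

# Step 2: resolve HTML character references such as &#8217;
as_text = unescape(as_html)
print(as_text)
```

In other words, the value is ordinary HTML that was percent-encoded. An HTML parser takes care of the second step once the string has been unquoted.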
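The idea here is to get the text of this script tag, extract the description value with a regular expression, unquote it (undo the percent-encoding) to recover the HTML, parse that with lxml.html, and finally get the .text_content():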
import re
from urllib.parse import unquote  # Python 3; in Python 2 this lived in urlparse
import requests
from lxml import html
url = "http://rads.stackoverflow.com/amzn/click/1597805483"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
tree = html.fromstring(page.content)
# Find the <script> whose text defines bookDescEncodedData
script = tree.xpath('//script[contains(., "bookDescEncodedData")]')[0]
match = re.search(r'bookDescEncodedData = "(.*?)",', script.text)
if match:
    # Undo the percent-encoding, then parse the resulting HTML fragment
    description_html = html.fromstring(unquote(match.group(1)))
    print(description_html.text_content())
The output is:
A Fantastic Anthology Combining the Love of Science Fiction with Our National Pastime.
Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime.Field of Fantasies delves right into that superstition with short stories written by several key authors about baseball and the supernatural.
Here you'll encounter ghostly apparitions in the stands, a strangely charming vampire double-play combination, one fan who can call every shot and another who can see the past, a sad alternate-reality for the game's most famous player, unlikely appearances on the field by famous personalities from Stephen Crane to Fidel Castro, a hilariously humble teenage phenom, and much more.
In this wonderful anthology are stories from such award-winning writers as:Stephen King and Stewart O’NanJack KerouacKaren Joy FowlerRod SerlingW. P. KinsellaAnd many more!Never has a book combined the incredible with great baseball fiction like Field of Fantasies.
This wide-ranging collection reaches from some of the earliest classics from the pulp era and baseball's golden age, all the way to material appearing here for the first time in a print edition. Whether you love the game or just great fiction, these stories will appeal to all, as the writers in this anthology bring great storytelling of the strange and supernatural to the plate, inning after inning.
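Note that .text_content() drops the <BR> tags, which is why items run together in the output above (for example "writers as:Stephen King"). Below is a minimal sketch of one way to keep the line breaks. It is not the lxml approach: it swaps in the standard library's html.parser.HTMLParser, runs the same regex and unquote steps on a small made-up script string, and emits a newline for each <br>. The sample script text and the BreakAwareText class are hypothetical, for illustration only.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote


class BreakAwareText(HTMLParser):
    """Hypothetical helper: collects text and turns each <br> into a newline."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BR> and <br> both arrive as "br"
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


# Made-up stand-in for the text of the real <script> tag
sample_script = (
    'var BookDescriptionIframe = null,\n'
    'bookDescEncodedData = "%3Cdiv%3Ewriters%20as%3A%3CBR%3EStephen%20King'
    '%3Cbr%3EJack%20Kerouac%3C%2Fdiv%3E",\n'
    'options = {};'
)

description = None
match = re.search(r'bookDescEncodedData = "(.*?)",', sample_script)
if match:
    parser = BreakAwareText()
    parser.feed(unquote(match.group(1)))  # same unquote step as above
    parser.close()
    description = parser.text()
    print(description)
```

On the real page, script.text from the code above would take the place of sample_script.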
A similar solution, but using BeautifulSoup:
import re
from urllib.parse import unquote  # Python 3; in Python 2 this lived in urlparse
from bs4 import BeautifulSoup
import requests
url = "http://rads.stackoverflow.com/amzn/click/1597805483"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(page.content, 'html.parser')
# `s` is None for <script> tags without inline text, so guard before using `in`
script = soup.find('script', string=lambda s: s and 'bookDescEncodedData' in s)
match = re.search(r'bookDescEncodedData = "(.*?)",', script.string)
if match:
    description_html = BeautifulSoup(unquote(match.group(1)), 'html.parser')
    print(description_html.get_text())
Alternatively, you can take a higher-level approach and use a real browser with the help of selenium:
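from selenium import webdriver
from selenium.webdriver.common.by import By
url = "http://rads.stackoverflow.com/amzn/click/1597805483"
driver = webdriver.Firefox()
driver.get(url)
# The description is rendered inside an iframe, so switch into it first
iframe = driver.find_element(By.ID, 'bookDesc_iframe')
driver.switch_to.frame(iframe)
print(driver.find_element(By.ID, 'iframeContent').text)
# quit() closes the window and ends the browser session
driver.quit()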
This gives better-formatted output:
A Fantastic Anthology Combining the Love of Science Fiction with Our National Pastime
Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime.
Field of Fantasies delves right into that superstition with short stories written by several key authors about baseball and the supernatural. Here you'll encounter ghostly apparitions in the stands, a strangely charming vampire double-play combination, one fan who can call every shot and another who can see the past, a sad alternate-reality for the game's most famous player, unlikely appearances on the field by famous personalities from Stephen Crane to Fidel Castro, a hilariously humble teenage phenom, and much more. In this wonderful anthology are stories from such award-winning writers as:
Stephen King and Stewart O’Nan
Jack Kerouac
Karen Joy Fowler
Rod Serling
W. P. Kinsella
And many more!
Never has a book combined the incredible with great baseball fiction like Field of Fantasies. This wide-ranging collection reaches from some of the earliest classics from the pulp era and baseball's golden age, all the way to material appearing here for the first time in a print edition. Whether you love the game or just great fiction, these stories will appeal to all, as the writers in this anthology bring great storytelling of the strange and supernatural to the plate, inning after inning.