我正在使用python(特别是jupyter笔记本)开发一个web抓取工具,它可以抓取一些不动产页面,并保存价格、地址等数据
对于我挑选的一个页面来说,它工作得很好,但是当我试图抓取这个页面时:sreality.cz(抱歉,该页面是捷克语,但实际内容现在不是那么重要)使用reguests.获取()我得到这个结果:
<!doctype html>
<html lang="{{ html.lang }}" ng-app="sreality" ng-controller="MainCtrl">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width,initial-scale=1,minimal-ui">
<!--- Nastaveni meta pres JS a ne pres Angular, aby byla nastavena default hodnota pro agenty co nezvladaji PhantomJS --->
<title ng:bind-template="{{metaSeo.title}}">Sreality.cz • reality a nemovitosti z celé ČR</title>
<meta name="description" content="Největší nabídka nemovitostí v ČR. Nabízíme byty, domy, novostavby, nebytové prostory, pozemky a další reality k prodeji i pronájmu. Sreality.cz">
<meta property="og:title" content="Sreality.cz • reality a nemovitosti z celé ČR">
<meta property="og:type" content="website">
<meta property="og:image" content="https://www.sreality.cz/img/sreality-logo-og.png">
<meta property="og:description" content="Největší nabídka nemovitostí v ČR. Nabízíme byty, domy, novostavby, nebytové prostory, pozemky a další reality k prodeji i pronájmu. Sreality.cz">
<meta property="og:url" content="https://www.sreality.cz/">
<meta ng-if="metaStatus.value" name="szn:status" content="{{metaStatus.value}}">
<meta http-equiv="imagetoolbar" content="no">
<link rel="icon" sizes="16x16 32x32 48x48 64x64" href="/img/icons/favicon.ico">
<link rel="apple-touch-icon" sizes="57x57" href="/img/icons/apple-touch-icon-57x57.png?3">
<link rel="apple-touch-icon" sizes="60x60" href="/img/icons/apple-touch-icon-60x60.png?3">
<link rel="apple-touch-icon" sizes="72x72" href="/img/icons/apple-touch-icon-72x72.png?3">
<link rel="apple-touch-icon" sizes="76x76" href="/img/icons/apple-touch-icon-76x76.png?3">
<link rel="apple-touch-icon" sizes="114x114" href="/img/icons/apple-touch-icon-114x114.png?3">
<link rel="apple-touch-icon" sizes="120x120" href="/img/icons/apple-touch-icon-120x120.png?3">
<link rel="apple-touch-icon" sizes="144x144" href="/img/icons/apple-touch-icon-144x144.png?3">
<link rel="apple-touch-icon" sizes="152x152" href="/img/icons/apple-touch-icon-152x152.png?3">
<link rel="apple-touch-icon" sizes="180x180" href="/img/icons/apple-touch-icon-180x180.png?3">
<link rel="icon" type="image/png" sizes="192x192" href="/img/icons/android-chrome-192x192.png">
<link rel="icon" type="image/png" sizes="32x32" href="/img/icons/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="96x96" href="/img/icons/favicon-96x96.png">
<link rel="icon" type="image/png" sizes="16x16" href="/img/icons/favicon-16x16.png">
<link rel="manifest" href="/img/icons/android-chrome-manifest.json">
<meta name="msapplication-TileColor" content="#2b5797">
<meta name="msapplication-TileImage" content="/img/icons/ms-icon-144x144.png">
<meta name="msapplication-config" content="/img/icons/browserconfig.xml" />
<link rel="alternate" type="application/rss+xml" ng-href="{{ rss.url }}" ng-if="rss.url">
<link ng-repeat="lang in metaSeo.languages" rel="alternate" hreflang="{{lang.code}}" ng-href="{{lang.url}}">
<link rel="stylesheet" href="/css/all.css?2e96626">
<!-- Begin Inspectlet Embed Code -->
<script type="text/javascript" id="inspectletjs">
window.__insp = window.__insp || [];
__insp.push(['wid', 821249485]);
__insp.push(["virtualPage"]);
(function() {
function ldinsp(){if(typeof window.__inspld != "undefined") return; window.__inspld = 1; var insp = document.createElement('script'); insp.type = 'text/javascript'; insp.async = true; insp.id = "inspsync"; insp.src = ('https:' == document.location.protocol ? 'https' : 'http') + '://cdn.inspectlet.com/inspectlet.js'; var x = document.getElementsByTagName('script')[0]; x.parentNode.insertBefore(insp, x); };
setTimeout(ldinsp, 500); document.readyState != "complete" ? (window.attachEvent ? window.attachEvent('onload', ldinsp) : window.addEventListener('load', ldinsp, false)) : ldinsp();
})();
</script>
<!-- End Inspectlet Embed Code -->
<!--[if lte IE 8]>
<script>
document.createElement('popover');
document.createElement('mortgage');
document.createElement('vendor');
document.createElement('hp-signpost');
document.createElement('category-switcher');
document.createElement('feedback');
document.createElement('bottom');
document.createElement('panorama');
document.createElement('panorama-prev');
document.createElement('sphere-viewer');
document.createElement('sphere-viewer-prev');
document.createElement('save-filter');
</script>
<![endif]-->
<!-- Statistiky -->
<script src="https://h.imedia.cz/js/dot-small.js" type="text/javascript"></script>
<script type="text/javascript">
(function() {
try {
// Při přesměrování na hashbang URL (IE8-9) ztrácíme referrer,
// který je potřeba pro správné počítání statistik.
if (window.sessionStorage) { // někdo může mít DOM storage zakázaný
var l = document.createElement('a');
l.href = document.referrer;
var referrerHostname = l.hostname;
if (window.location.hostname != referrerHostname) {
window.sessionStorage.setItem('referrer', l.href);
}
}
// Starý android (< 4.0) v kombinaci s angularem špatně pracuje s hashem v URL.
// Považuje ho za součást query případně path.
// Na takových zařízech se budeme tvářit, že žádný hash nebyl.
if (parseInt((/android (\d+)/.exec(window.navigator.userAgent.toLowerCase()) || [])[1], 10) < 4) {
var hrefWithoutHashbang = window.location.href.replace('/#!', '');
var hashIndex = hrefWithoutHashbang.indexOf('#');
if (hashIndex != -1) {
window.location.replace(hrefWithoutHashbang.substring(0, hashIndex));
}
}
} catch (e) {}
})();
</script>
<!-- API mapy.cz -->
<script type="text/javascript" src="https://api4.mapy.cz/loader.js"></script>
<script type="text/javascript">Loader.load(null, {poi: true, pano: true})</script>
<!-- Login reklama -->
<script src="https://i.imedia.cz/js/im3.js" type="text/javascript"></script>
<script src="https://1.im.cz/software/promo/promo-sbrowser.js"></script>
<!-- Rozkopírování SID cookie -->
<script src="https://h.imedia.cz/js/sid.js"></script>
<!-- Login -->
<script src="https://login.szn.cz/js/api/login.js"></script>
<script>
login.cfg({
serviceId: "sreality"
});
</script>
<!-- KONFIGURACE -->
<script src="/js/conf/config.js?2e96626"></script>
<script src="/js/advert.js"></script>
<script src="/js/all.js?2e96626"></script>
<script type="text/javascript">
if (window.DOT) {
var dotCfg = {
service: 'sreality'
};
if (window.SrealityABTest && window.SrealityABTest.getVariant()) {
dotCfg.abtest = window.SrealityABTest.getVariant();
}
DOT.cfg(dotCfg);
}
</script>
<noscript>
<meta http-equiv="refresh" content="0;url=?_escaped_fragment_="/>
</noscript>
<meta name="fragment" content="!" ng-if="metaSeo.showMetaFragment" />
</head>
<!--[if IE 8]> <body class="ie8"> <![endif]-->
<!--[if IE 9]> <body class="notie8 ie9"> <![endif]-->
<!--[if gt IE 9]><!-->
<body class="notie8 notie9 lang-{{html.lang}}">
<!--<![endif]-->
<div loading-line></div>
<div page-layout>
<div ng-view></div>
</div>
</body>
</html>
虽然这与我在Chrome的开发工具中看到的页面不同,但有一部分代码在这里(整个代码不适合这里,uploadtext由于某些原因无法正常工作):
我可以从第一个html代码中看到请求.get下载页面运行的一些脚本可能会导致html不同。在
我已经尝试过使用urllib,但是结果html doc还是一样的。在
有没有办法下载我在Chromes的开发工具中打开页面时看到的html,这样我就可以抓取它了?在
如果最终数据来自您所追求的那个页面,那么使用selenium和BeautifulSoup可以非常容易地获得它。它给你所有公寓的链接。在
^{1}$相关问题 更多 >
编程相关推荐