Web scraping with requests in Python: the response contains a script

Posted 2024-04-18 23:27:43

I am trying to scrape this link, and this is the code I wrote:

import requests
from bs4 import BeautifulSoup
rlink = requests.get('http://videohost.site/play/A11QStEaNdVZfvV/')
print(rlink.content)

Now, when I open the link in a browser, I get well-formed HTML from which I can select tags.

But the requests module returns a script that is meant to be executed in the browser:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <meta charset="UTF-8" />
      <title>Banjo HD</title>
      <meta property="og:image" content="https://lh6.googleusercontent.com/Eo6aYbkMPiltQ1HE8QXK-2RvCOB8wCgzvqiJqIYEu9DJMSodJwd24g=w1200-h630-p" />
      <link rel="stylesheet" type="text/css" href="http://videohost.site/player/jwplayer/assets/style.css">
      <script src="http://videohost.site/player/jwplayer/assets/jwplayer.js"></script> <script>jwplayer.key = "qCeaX98IpNerwNN2Vlz69NLXFAyMM5a4dyK7Pw==";</script>
   </head>
   <body>
      <div id="player"></div>
      <script type="text/javascript"> eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--){d[e(c)]=k[c]||e(c)}k=[function(e){return d[e]}];e=function(){return'\\w+'};c=1};while(c--){if(k[c]){p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c])}}return p}('1k 5=v("5");5.1l({1m:"14%",1i:"14%",1h:"1n",1q:"w",1p:17,1o:w,1r:"O://15.19/5/v/1a/v.1b.1g",1f:"16:9",1c:"17",1e:"1d",1j:"O",1w:w,1G:[{"3":"t:\\/\\/s.p.q\\/r?0=x&y=D&E=1F&C=B&o=A&F=m&c=d:e:G:7::b&a=6&8=f&g=0%h%i%n%j%k%l%z%P%10%X%U%V&W=1I.1K&Y=Z&13=12&11=S-L-K&J=H&I=M&T=u&N=R","Q":"1J","2":"1\\/4"},{"3":"t:\\/\\/s.p.q\\/r?0=x&y=D&E=1E&C=B&o=A&F=m&c=d:e:G:7::b&a=6&8=f&g=0%h%i%n%j%k%l%z%P%10%X%U%V&W=1C.1D&Y=Z&13=12&11=S-L-K&J=H&I=M&T=u&N=R","Q":"1s","2":"1\\/4"},{"3":"t:\\/\\/s.p.q\\/r?0=x&y=D&E=18&C=B&o=A&F=m&c=d:e:G:7::b&a=6&8=f&g=0%h%i%n%j%k%l%z%P%10%X%U%V&W=1z.1A&Y=Z&13=12&11=S-L-K&J=H&I=M&T=u&N=R","Q":"1B","2":"1\\/4"}],2:"1/4",1y:{3:"",1x:"",},1t:"1u 1v",1H:"O://15.19"});',62,109,'requiressl|video|type|file|mp4|player|32||expire||ipbits|3e0|ip|2001|67c|1483730468|sparams|2Cid|2Citag|2Cttl|2Cip|2Cipbits|explorer|2Csource|ttl|googlevideo|com|videoplayback|redirector|https||jwplayer|false|yes|id|2Cexpire|transient|webdrive|source|99e7c0d36ff950d2|itag|app|2db8|au|mt|ms|vu2e|bungvh5op5|1483715949|pl|http|2Cmm|label|48|sn|mv|2Cmv|2Cpl|signature|2Cms|key|ck2|2Cmn|mn|31|mm|100|videohost||true||site|assets|flash|fullscreen|html5|primary|aspectratio|swf|skin|height|provider|var|setup|width|seven|displaytitle|controls|preload|flashplayer|480P|abouttext|Video|Host|autostart|link|logo|3648867A489010D7BFA1A2E6C64F4035FDEB3814|6617735E622564ACA4793459986706DA936E58DE|360P|9FBCFB9752833B2DD83BFD6547551604AA6A340D|A55D1440195C2AF6945EE4A20DB8147CDC50F337|59|22|sources|aboutlink|7EFB542F7CE372D5DAD8376254F577926AF8CBEA|720P|857A11ACEB6C65D5D075759B557CE1E114F94F03'.split('|'),0,{})) </script><!-- Code --><script 
type="text/javascript" data-cfasync="false"> var _pop = _pop || []; _pop.push(['siteId', 1630926]); _pop.push(['minBid', 0]); _pop.push(['popundersPerIP', 0]); _pop.push(['delayBetween', 0]); _pop.push(['default', false]); _pop.push(['defaultPerDay', 0]); _pop.push(['topmostLayer', false]); (function() { var pa = document.createElement('script'); pa.type = 'text/javascript'; pa.async = true; var s = document.getElementsByTagName('script')[0]; pa.src = '//c1.popads.net/pop.js'; pa.onerror = function() { var sa = document.createElement('script'); sa.type = 'text/javascript'; sa.async = true; sa.src = '//c2.popads.net/pop.js'; s.parentNode.insertBefore(sa, s); }; s.parentNode.insertBefore(pa, s); })();</script><!-- Code End --><script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-88363984-1', 'auto'); ga('send', 'pageview');</script>
   </body>
</html>
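
The `eval(function(p,a,c,k,e,d){...})` block in the response above looks like the output of a p,a,c,k,e,d-style JavaScript packer: the `sources` list is stored as short base-62 tokens plus a `|`-separated keyword table. As a browser-free alternative, the sketch below shows how such a payload could be expanded in plain Python. It is an assumption about the packing format, not something verified against this site; `decode_token` and `unpack` are hypothetical helper names, and the tiny payload is made up for illustration rather than taken from the page.

```python
import re

# Digit alphabet used by the packer's encoder: 0-9, then a-z, then A-Z.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def decode_token(token, radix):
    """Convert a packed token such as '1k' to its keyword index."""
    value = 0
    for char in token:
        digit = ALPHABET.index(char)  # raises ValueError for e.g. "_"
        if digit >= radix:
            raise ValueError(f"{char!r} is not a base-{radix} digit")
        value = value * radix + digit
    return value


def unpack(payload, radix, count, keywords):
    """Replace every packed token in payload with its keyword.

    Tokens whose keyword is empty or out of range are left unchanged.
    """
    def substitute(match):
        word = match.group(0)
        try:
            index = decode_token(word, radix)
        except ValueError:
            return word
        if index < count and index < len(keywords) and keywords[index]:
            return keywords[index]
        return word

    return re.sub(r"\b\w+\b", substitute, payload)


# Illustrative input only (not the page's real data).
payload = '1 0=2("0");'
keywords = "player|var|jwplayer".split("|")
unpacked = unpack(payload, 62, len(keywords), keywords)
print(unpacked)
```

For the real page, the four arguments would have to be pulled out of the script text first (the packed string, `62`, `109`, and the keyword list), and the expanded text would still need to be searched for the `file` entries. The URLs also carry an `expire` parameter, so they may stop working after a while.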

Any pointers on how to proceed to get the final HTML would be highly appreciated.

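Any thoughts on PhantomJS? I ran the code the same way as suggested below, but with the PhantomJS driver, and the search for the video tag times out, presumably because the script is not executed the way it is in Firefox.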

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get('http://videohost.site/play/A11QStEaNdVZfvV/')
# driver.execute_script('')

# wait for "video" to be present
video = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "video")))

# get the src value
print(video.get_attribute("src"))

driver.close()

2 Answers

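To expand on Emett's answer, here is working sample code using selenium. It opens Firefox (you don't have to use Firefox; multiple browsers are supported, including headless PhantomJS), waits for the video element to appear, and gets its src value: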

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://videohost.site/play/A11QStEaNdVZfvV/')

# wait for "video" to be present
video = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "video")))

# get the src value
print(video.get_attribute("src"))

driver.close()
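
The explicit wait in the code above can be pictured as a polling loop: the condition is called repeatedly until it returns something truthy or the timeout runs out. The sketch below imitates that idea with the standard library only. It is not Selenium's actual implementation (which, among other things, also ignores certain lookup exceptions while polling), and `wait_until`, `WaitTimeout`, and `fake_find_video` are hypothetical names used for illustration.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once the deadline has passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulate an element that only "appears" on the third lookup,
# the way a <video> tag appears once the page's script has run.
attempts = {"count": 0}


def fake_find_video():
    attempts["count"] += 1
    return "video-element" if attempts["count"] >= 3 else None


found = wait_until(fake_find_video, timeout=2.0, poll_interval=0.01)
print(found, "after", attempts["count"], "attempts")
```

Seen this way, a timeout means only that the condition never became truthy within the limit. If a driver does not execute the page's script, the `video` element is never created and the wait expires.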

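requests and plain web scraping do not render JavaScript. You need to run something like Selenium. The only drawback is that it opens a browser and is fairly slow. To get around that, you would use a headless browser system such as ghost.py.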
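
As one way to picture the headless route mentioned above, here is a minimal sketch that swaps ghost.py for Firefox's own headless mode driven through Selenium. It assumes Selenium 4 with Firefox and geckodriver available, and it has not been tried against this site; `fetch_video_src` is a hypothetical helper name.

```python
def fetch_video_src(url, timeout=10):
    """Return the src of the first <video> element, rendered headlessly."""
    # Imported inside the function so this sketch loads even without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")  # run Firefox without a visible window

    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Wait for the page's script to create the <video> element.
        video = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "video"))
        )
        return video.get_attribute("src")
    finally:
        # quit() ends the whole browser session even if the wait times out.
        driver.quit()
```

It would be called as `fetch_video_src('http://videohost.site/play/A11QStEaNdVZfvV/')`. The `try`/`finally` is there so that a timeout does not leave a hidden browser process running.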
