Web scraping with urlopen in Python

1 answer

Anonymous, 1 day ago

<p>My suggestion:</p>
<pre><code># Python 2.7
import urllib

url = 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'
sock = urllib.urlopen(url)   # open the page
content = sock.read()        # raw page source as a str
sock.close()

print content
</code></pre>
<p>I suppose you are French. Welcome to stackoverflow.com!</p>
<h3>Update 1</h3>
<p>In fact, I now prefer to use the following code, because it is faster:</p>
<pre><code># Python 2.7
import httplib
import socket

conn = httplib.HTTPConnection(host='www.boursorama.com', timeout=30)
req = '/includes/cours/last_transactions.phtml?symbole=1xEURUS'

try:
    conn.request('GET', req)
except (httplib.HTTPException, socket.error):
    print 'echec de connexion'   # French for "connection failed"
    content = ''                 # nothing to read if the request failed
else:
    content = conn.getresponse().read()
finally:
    conn.close()

print content
</code></pre>
<p>To adapt this code to Python 3, changing <code>httplib</code> to <code>http.client</code> and turning the <code>print</code> statements into <code>print()</code> calls should be enough.</p>
<p>Yes, I confirm that with both snippets I get the page source containing the data you seem to be interested in:</p>
<pre><code>    <td class="L20" width="33%" align="center">11:57:44</td>
    <td class="L20" width="33%" align="center">1.4486</td>
    <td class="L20" width="33%" align="center">0</td>
</tr>
<tr>
    <td width="33%" align="center">11:57:43</td>
    <td width="33%" align="center">1.4486</td>
    <td width="33%" align="center">0</td>
</tr>
</code></pre>
<h3>Update 2</h3>
<p>Adding the following snippet to the code above lets you extract the data you want:</p>
<pre><code>import re

# Dump every line with its index and repr() to see the exact whitespace.
for i, line in enumerate(content.splitlines(True)):
    print str(i) + ' ' + repr(line)

print '\n\n'

# Three consecutive cells: time, then a decimal number, then an integer.
regx = re.compile('\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d\d:\d\d:\d\d)</td>\r\n'
                  '\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">([\d.]+)</td>\r\n'
                  '\t\t\t\t\t\t<td class="(?:gras )?L20" width="33%" align="center">(\d+)</td>\r\n')

print regx.findall(content)
</code></pre>
<p>Result (end only):</p>
<pre><code>.......................................
.......................................
.......................................
.......................................
98 'window.config.graphics = {};\n'
99 'window.config.accordions = {};\n'
100 '\n'
101 "window.addEvent('domready', function(){\n"
102 '});\n'
103 '</script>\n'
104 '<script type="text/javascript">\n'
105 '\t\t\t\tsas_tmstp = Math.round(Math.random()*10000000000);\n'
106 '\t\t\t\tsas_pageid = "177/(includes/cours/last_transactions)"; // Page : boursorama.com/smartad_test\n'
107 '\t\t\t\tvar sas_formatids = "8968";\n'
108 '\t\t\t\tsas_target = "symb=1xEURUS#"; // TargetingArray\n'
109 '\t\t\t\tdocument.write("<scr"+"ipt src=\\"http://ads.boursorama.com/call2/pubjall/" + sas_pageid + "/" + sas_formatids + "/" + sas_tmstp + "/" + escape(sas_target) + "?\\"></scr"+"ipt>");\t\t\t\t\n'
110 '\t\t\t</script><div id="_smart1"><script language="javascript">sas_script(1,8968);</script></div><script type="text/javascript">\r\n'
111 "\twindow.addEvent('domready', function(){\r\n"
112 'sas_move(1,8968);\t});\r\n'
113 '</script>\n'
114 '<script type="text/javascript">\n'
115 'var _gaq = _gaq || [];\n'
116 "_gaq.push(['_setAccount', 'UA-1623710-1']);\n"
117 "_gaq.push(['_setDomainName', 'www.boursorama.com']);\n"
118 "_gaq.push(['_setCustomVar', 1, 'segment', 'WEB-VISITOR']);\n"
119 "_gaq.push(['_setCustomVar', 4, 'version', '18']);\n"
120 "_gaq.push(['_trackPageLoadTime']);\n"
121 "_gaq.push(['_trackPageview']);\n"
122 '(function() {\n'
123 "var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;\n"
124 "ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';\n"
125 "var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);\n"
126 '})();\n'
127 '</script>\n'
128 '</body>\n'
129 '</html>'

[('12:25:36', '1.4478', '0'), ('12:25:33', '1.4478', '0'), ('12:25:31', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:29', '1.4478', '0')]
</code></pre>
<p>I hope you are not planning to "play" at Forex trading: it is one of the best ways to lose money quickly.</p>
<h3>Update 3</h3>
<p>Sorry! I forgot that you are using Python 3. So I think you have to define the regex like this:</p>
<blockquote> <p>regx = re.compile(<strong>b</strong>'\t\t\t\t\t......)</p> </blockquote>
<p>That is, put <strong>b</strong> in front of the string; otherwise you will get an error like the one in <a href="https://stackoverflow.com/questions/7139225/typeerror-str-does-not-support-the-buffer-interface">this question</a>.</p>
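
For readers on Python 3, the pieces of the answer above could be combined roughly as follows. This is only a sketch under assumptions, not the answerer's code: the helper names `fetch` and `extract_rows` and the constant `ROW` are made up for illustration, and `\s*` stands in for the exact tab and CRLF runs of the original pattern. The request is not executed here, since the page may no longer be served at that address; the check at the bottom runs offline against a fragment shaped like the HTML quoted in the answer.

```python
# Python 3 sketch: read the page as bytes and parse it with a bytes regex.
import http.client
import re

HOST = 'www.boursorama.com'
PATH = '/includes/cours/last_transactions.phtml?symbole=1xEURUS'

# The rb prefix makes this a bytes pattern, matching what read() returns.
# Assumption: \s* is used instead of the literal tabs and \r\n of the answer.
ROW = re.compile(
    rb'<td class="(?:gras )?L20" width="33%" align="center">(\d\d:\d\d:\d\d)</td>\s*'
    rb'<td class="(?:gras )?L20" width="33%" align="center">([\d.]+)</td>\s*'
    rb'<td class="(?:gras )?L20" width="33%" align="center">(\d+)</td>'
)


def fetch(host=HOST, path=PATH, timeout=30):
    """Hypothetical helper: return the raw response body as bytes."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request('GET', path)
        return conn.getresponse().read()
    finally:
        conn.close()


def extract_rows(content):
    """Hypothetical helper: return the three captured cells of each row as str."""
    return [tuple(field.decode('ascii') for field in row)
            for row in ROW.findall(content)]


# Offline check on a hand-written fragment (not live data).
sample = (
    b'\t\t\t\t\t\t<td class="L20" width="33%" align="center">11:57:44</td>\r\n'
    b'\t\t\t\t\t\t<td class="L20" width="33%" align="center">1.4486</td>\r\n'
    b'\t\t\t\t\t\t<td class="L20" width="33%" align="center">0</td>\r\n'
)
print(extract_rows(sample))
```

Calling `extract_rows(fetch())` would apply the same pattern to the live page. Decoding the matches to `str` is a convenience choice; keeping them as bytes works just as well if the values are only compared or stored.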
