Source page information missing when using urllib2

1 vote

2 answers

761 views

Asked 2025-04-18 01:07

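I am trying to scrape "game tag" data (not the same thing as HTML tags) for games listed on the digital game distribution site Steam (store.steampowered.com). As far as I can tell, this information is not available through Steam's API.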

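Once I have the raw source of a page, I want to pass it to beautifulsoup for further parsing, but I have run into a problem: urllib2 does not seem to read the information I want (neither does requests), even though it is clearly there when I view the page source in a browser.
For example, I might download the page for the game "7 Days to Die" (http://store.steampowered.com/app/251570/). Viewing the page source in Chrome, I can see the relevant information about the game's "tags" near the end of the page, starting at line 1615: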

<script type="text/javascript">
      $J( function() {
          InitAppTagModal( 251570,    
          {"tagid":1662,"name":"Survival","count":283,"browseable":true},
          {"tagid":1659,"name":"Zombies","count":274,"browseable":true},
          {"tagid":1702,"name":"Crafting","count":248,"browseable":true},...

Inside InitAppTagModal are the tags "Survival", "Zombies", "Crafting", and so on, which is the information I want to collect.

But when I fetch the page source with urllib2:

import urllib2  
url = "http://store.steampowered.com/app/224600/" #7 Days to Die page  
page = urllib2.urlopen(url).read()

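The part of the page source I am interested in is not in my "page" variable. Instead, from line 1555 onward everything is blank until the closing body and html tags. This is what I get (including the line breaks):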

</div><!-- End Footer -->





</body>  
</html>

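That blank area is where the source I need (along with other code) should be.
I have tried this on several different computers with different Python 2.7 installations (Windows and Mac), and the result is always the same.

How can I get the data I am after?

Thanks for your attention.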

urllib2 html-parsing web-crawler beautifulsoup data-scraping api-limits steam game-tags

2 Answers

-1

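With urllib2, a call to read(size) returns at most size bytes, so keep reading in chunks until you reach end of file (EOF) to be sure you have the complete HTML source.

import urllib2
url = "http://store.steampowered.com/app/224600/" #7 Days to Die page
url_handle = urllib2.urlopen(url)
chunks = []
while True:
    chunk = url_handle.read(8192)  # read at most 8 KB per call
    if not chunk:                  # an empty string means EOF
        break
    chunks.append(chunk)
data = "".join(chunks)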
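
As a minimal sketch of the same chunked-read loop, here it is wrapped in a helper that accepts any file-like object. The names `read_all` and `chunk_size` are made up for illustration, and `io.BytesIO` stands in for the network response so the snippet runs offline. Keep in mind that `read()` with no argument already reads until EOF, so the loop only matters when a size is passed.

```python
import io

def read_all(handle, chunk_size=8192):
    """Read a file-like object to EOF in fixed-size chunks."""
    chunks = []
    while True:
        chunk = handle.read(chunk_size)  # returns at most chunk_size bytes
        if not chunk:                    # an empty result means EOF
            break
        chunks.append(chunk)
    return b"".join(chunks)

# io.BytesIO is a stand-in for the object returned by urlopen()
fake_response = io.BytesIO(b"<html>" + b"x" * 20000 + b"</html>")
data = read_all(fake_response)
print(len(data))
```

If the result still lacks the expected markup, the server probably sent a different page, and reading in chunks will not change that.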

Another option is the requests module, which can be used like this:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://store.steampowered.com/app/251570/')
soup = BeautifulSoup(r.text, 'html.parser')

Answered 2025-04-18 by Python大师


Hmm, I don't know if I'm missing something, but it works fine for me with the requests library:

import requests

# Getting html code
url = "http://store.steampowered.com/app/251570/"
html = requests.get(url).text

Better still, the data you are after is in JSON format, so extracting it is straightforward:

import json

# Extract the JavaScript object (a JSON-like array)
start_tag = 'InitAppTagModal( 251570,'
end_tag = '],'
startIndex = html.find(start_tag) + len(start_tag)
endIndex = html.find(end_tag, startIndex) + len(end_tag) - 1
raw_data = html[startIndex:endIndex]

# Load the raw data into Python objects
data = json.loads(raw_data)

You will get a nice JSON object like this (this is the information you need, right?):

[
  {
    "count": 283,
    "browseable": true,
    "tagid": 1662,
    "name": "Survival"
  },
  {
    "count": 274,
    "browseable": true,
    "tagid": 1659,
    "name": "Zombies"
  },
  {
    "count": 248,
    "browseable": true,
    "tagid": 1702,
    "name": "Crafting"
  }......
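
If only the tag names are needed, one more short step is enough. The sketch below inlines the three entries shown above in a string called `sample` (a made-up name) instead of fetching the page, so it runs offline. With the real page you would start from the list produced by `json.loads`.

```python
import json

# The three entries shown above, inlined so the example needs no network access
sample = '''[
  {"tagid": 1662, "name": "Survival", "count": 283, "browseable": true},
  {"tagid": 1659, "name": "Zombies", "count": 274, "browseable": true},
  {"tagid": 1702, "name": "Crafting", "count": 248, "browseable": true}
]'''

tags = json.loads(sample)

# Keep just the names, ordered by the "count" field (highest first)
ordered = sorted(tags, key=lambda t: t["count"], reverse=True)
tag_names = [t["name"] for t in ordered]
print(tag_names)
```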

Hope this helps...

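Update:

OK, now I see your problem: it seems to be with page 224600. In that case, the site asks you to confirm your age before it shows the game information. This is easy to work around, though: just submit the age confirmation form. Here is the updated code (I also wrapped it in a function):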

import json
import requests

def extract_info_games(page_id):
    # Create a session so cookies persist between requests
    session = requests.Session()

    # Get initial html
    html = session.get("http://store.steampowered.com/app/%s/" % page_id).text

    # Check whether we landed on the age check page (its form is in the html)
    if ('<form action="http://store.steampowered.com/agecheck/app/%s/"' % page_id) in html:
        # We were redirected to the age check page,
        # so confirm our age with a POST:
        post_data = {
            'snr': '1_agecheck_agecheck__age-gate',
            'ageDay': 1,
            'ageMonth': 'January',
            'ageYear': '1960'
        }
        html = session.post('http://store.steampowered.com/agecheck/app/%s/' % page_id, data=post_data).text

    # Extract the JavaScript object (a JSON-like array)
    start_tag = 'InitAppTagModal( %s,' % page_id
    end_tag = '],'
    startIndex = html.find(start_tag) + len(start_tag)
    endIndex = html.find(end_tag, startIndex) + len(end_tag) - 1
    raw_data = html[startIndex:endIndex]

    # Load the raw data into Python objects
    data = json.loads(raw_data)
    return data

Usage:

extract_info_games(224600)
extract_info_games(251570)
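
The `'],'` end marker could cut the data short if that sequence ever showed up inside a tag entry. One possible alternative, sketched here under the assumption that the first `[` after `InitAppTagModal( <id>,` opens the tag array, is to let `json.JSONDecoder.raw_decode` find where the array ends. The function name `parse_tags`, the `sample_html` string, and its trailing placeholder argument are made up for illustration. The sample only mimics the snippet from the question.

```python
import json

def parse_tags(html, app_id):
    """Return the tag list passed to InitAppTagModal, or None if absent."""
    marker = "InitAppTagModal( %s," % app_id
    start = html.find(marker)
    if start == -1:
        return None  # e.g. an age check page was served instead
    # Assumption: the first '[' after the marker opens the tag array
    array_start = html.find("[", start + len(marker))
    if array_start == -1:
        return None
    # raw_decode parses one JSON value and ignores whatever follows it
    tags, _end = json.JSONDecoder().raw_decode(html, array_start)
    return tags

# Hypothetical page fragment; the last argument is a placeholder
sample_html = '''<script type="text/javascript">
$J( function() {
    InitAppTagModal( 251570,
    [{"tagid":1662,"name":"Survival","count":283,"browseable":true},
     {"tagid":1659,"name":"Zombies","count":274,"browseable":true}],
    "placeholder for further arguments" );
});
</script>'''

print(parse_tags(sample_html, 251570))
print(parse_tags("<html></html>", 224600))
```

Returning `None` when the marker is missing makes it easy to tell that a different page came back, rather than failing inside `json.loads` with an unrelated error.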

Have fun!

Answered 2025-04-18 by Python大师
