Extracting data from a URL result with a special format

4 votes

2 answers

21748 views

Asked 2025-04-16 06:18

I have a URL:
http://somewhere.com/relatedqueries?limit=2&query=seedterm

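Changing the limit and query parameters in this URL produces the data we want. limit is the maximum number of entries to return, and query is the keyword to search for.

The URL returns a result in this format: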
oo.visualization.Query.setResponse({version:'0.5',reqId:'0',status:'ok',sig:'1303596067112929220',table:{cols:[{id:'score',label:'Score',type:'number',pattern:'#,##0.###'},{id:'query',label:'Query',type:'string',pattern:''}],rows:[{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm1'}]},{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm2'}]}],p:{'totalResultsCount':'7727'}}});

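I want to write a Python script that takes two arguments (the limit and the query keyword), fetches the data from the web, parses the result, and returns a list of the new terms, for example ['newterm1','newterm2'].

I would appreciate some help, especially with fetching the data from the URL, because I have never done that before.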


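2 Answers

I don't fully understand your question, because judging from your code you seem to be using the Visualization API (which, by the way, I had never heard of before).

However, if you just want a way to fetch data from a web page, you can use urllib2, which exists for exactly that purpose. If you then want to process the fetched data, you will need a more suitable library, such as BeautifulSoup.

If you are dealing with another kind of web service (RSS, Atom, RPC) rather than an ordinary web page, there are plenty of Python libraries that handle those services well.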

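# Python 2 code: urllib2 was split into urllib.request and urllib.error in Python 3
import urllib2

from BeautifulSoup import BeautifulSoup

# Fetch the page; the two values fill in the limit and query parameters
result = urllib2.urlopen('http://somewhere.com/relatedqueries?limit=%s&query=%s' % (2, 'seedterm'))
htmltext = result.read()
result.close()

soup = BeautifulSoup(htmltext, convertEntities="html")

# You can parse your data now; check the BeautifulSoup API.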

Answered 2025-04-16 by Python大师


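It sounds like you can break this problem down into a few smaller ones.

Sub-problems

Several problems need solving before the whole script comes together:

Build the request URL: create a configured request URL from a template
Retrieve the data: actually make the request
Unwrap the JSONP: the returned data looks like JSON wrapped in a JavaScript function call
Traverse the object graph: find the information you want in the result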

Building the request URL

This is just simple string formatting.

url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
url = url_template.format(limit=2, seedterm='seedterm')
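
One caveat with plain string formatting is that the values are not escaped, so a search term containing spaces, `&`, or non-ASCII characters would produce a malformed URL. A minimal sketch of an alternative using the standard library's `urllib.parse.urlencode` follows; the names `build_url` and `BASE_URL` are made up for illustration and are not part of the original answer.

```python
from urllib.parse import urlencode

BASE_URL = 'http://somewhere.com/relatedqueries'  # endpoint from the question


def build_url(limit, seedterm):
    """Return the request URL with both parameters percent-encoded."""
    query_string = urlencode({'limit': limit, 'query': seedterm})
    return '{}?{}'.format(BASE_URL, query_string)


# Spaces and '&' in the search term are escaped instead of breaking the URL
print(build_url(2, 'seed term & more'))
```

For a simple term such as 'seedterm' this yields the same URL as the template above, so it can be swapped in without changing the rest of the steps.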

Python 2 note

str.format is only available from Python 2.6 onward; on older versions, use the string formatting operator (%) instead.
url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
url = url_template % dict(limit=2, seedterm='seedterm')

Retrieving the data

You can use the built-in urllib.request module for this.

import urllib.request
data = urllib.request.urlopen(url) # url from previous section

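This returns a file-like object called data. You can also use a with statement here, which closes the connection for you when the block ends:

with urllib.request.urlopen(url) as data:
    pass  # do processing here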
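
In a real script the request can fail or hang, and `read()` returns bytes rather than text. The following is a minimal sketch of a fetch helper that adds a timeout, decodes the body, and reports failures; the name `fetch_text`, the 10-second default, and the UTF-8 fallback when the server declares no charset are all assumptions made for illustration.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10.0):
    """Fetch url and return the response body decoded to str."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Use the charset from the Content-Type header when present;
            # falling back to UTF-8 is an assumption
            charset = response.headers.get_content_charset() or 'utf-8'
            return response.read().decode(charset)
    except urllib.error.URLError as exc:
        # HTTPError is a subclass of URLError, so HTTP errors land here too
        raise RuntimeError('request failed: {}'.format(exc.reason)) from exc
```

Calling `fetch_text(url)` with the URL built in the previous section would return the raw JSONP text as a str, ready for the unwrapping step.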

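Python 2 note

Import urllib2 instead of urllib.request. In Python 2 the object returned by urllib2.urlopen is not a context manager, so wrap it in contextlib.closing if you want to use it in a with statement.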

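Unwrapping the JSONP

The result you pasted looks like JSONP. Because the wrapper function being called (oo.visualization.Query.setResponse) never changes, we can simply strip off the function call.

# read() returns bytes in Python 3; UTF-8 is assumed here
result = data.read().decode('utf-8').strip()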

prefix = 'oo.visualization.Query.setResponse('
suffix = ');'

if result.startswith(prefix) and result.endswith(suffix):
    result = result[len(prefix):-len(suffix)]
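
If the callback name ever varies, the wrapper can also be removed without hard-coding it. This is a rough sketch using a regular expression; the names `strip_jsonp` and `_JSONP` are hypothetical, and it assumes the whole response is a single call of the form `name(payload);`.

```python
import re

# Callback name (possibly dotted), "(", payload, ")", then an optional ";"
_JSONP = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def strip_jsonp(text):
    """Return the payload inside a JSONP wrapper, or text unchanged."""
    match = _JSONP.match(text)
    return match.group(1) if match else text


print(strip_jsonp("oo.visualization.Query.setResponse({status:'ok'});"))
```

Text that does not look like a function call is returned untouched, which mirrors the behaviour of the prefix/suffix check above.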

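Parsing the JSON

Once the wrapper is gone, the remaining result string should be JSON data. Parse it with the built-in json module. Note that json.loads only accepts strict JSON, whereas the sample response above uses unquoted keys and single-quoted strings, so a payload like that would have to be normalized first.

import json

result_object = json.loads(result)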

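Traversing the object graph

You now have a result_object that represents the JSON response. The object itself is a dict with keys such as version and reqId. Based on your question, here is what you need to do to build your list.

# Get the rows in the table, then get the second column's value
# (index 1, because list indices start at 0) for each row
terms = [row['c'][1]['v'] for row in result_object['table']['rows']]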
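
Hard-coding the column position works for the sample response, but the response also describes its columns under `cols`. The sketch below looks the column up by its id instead; the helper name `column_values` is hypothetical, and the small dict only mimics the shape of the sample response.

```python
def column_values(result_object, column_id):
    """Return the 'v' value of the column with the given id for every row."""
    table = result_object['table']
    # Find the position of the wanted column from the column descriptions
    index = [col['id'] for col in table['cols']].index(column_id)
    return [row['c'][index]['v'] for row in table['rows']]


# A cut-down object with the same shape as the sample response
example = {
    'table': {
        'cols': [{'id': 'score'}, {'id': 'query'}],
        'rows': [
            {'c': [{'v': 0.99}, {'v': 'newterm1'}]},
            {'c': [{'v': 0.99}, {'v': 'newterm2'}]},
        ],
    }
}
print(column_values(example, 'query'))
```

This keeps working if the server ever reorders its columns, at the cost of raising ValueError when the requested id is missing.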

Putting it all together

#!/usr/bin/env python3

"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python3 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import json
import re
import sys
import urllib.request

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

# Matches double-quoted strings, single-quoted strings and bare object keys.
# Assumption: the payload is a simple JavaScript object literal like the
# sample response, with no backslash escapes in single-quoted strings.
JS_TOKEN = re.compile(
    r'''"(?:[^"\\]|\\.)*"|'([^'\\]*)'|([{,]\s*)([A-Za-z_]\w*)(\s*:)''')

def to_strict_json(text):
    """Quote bare keys and single-quoted strings so json.loads accepts them"""
    def fix(match):
        if match.group(1) is not None:
            return json.dumps(match.group(1))
        if match.group(3) is not None:
            return '%s"%s"%s' % match.group(2, 3, 4)
        return match.group(0)
    return JS_TOKEN.sub(fix, text)

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'

    # Strip JSONP function wrapper
    result = result.strip()
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]

    # Deserialize to Python objects; the sample response is a JavaScript
    # object literal rather than strict JSON, so normalize it first
    result_object = json.loads(to_strict_json(result))

    # Get the rows in the table, then get the second column's value
    # (index 1) for each row
    return [row['c'][1]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
    url = url_template.format(limit=limit, seedterm=seedterm)

    try:
        with urllib.request.urlopen(url) as data:
            # read() returns bytes; UTF-8 is assumed here
            result = data.read().decode('utf-8')
    except OSError:
        print('Could not request data from server', file=sys.stderr)
        sys.exit(E_OPERATION_ERROR)

    return parse_result(result)

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print(term)

if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        error_message = '''Usage: {} limit seedterm

limit must be an integer'''.format(sys.argv[0])
        print(error_message, file=sys.stderr)
        sys.exit(E_INVALID_PARAMS)

    sys.exit(main(limit, seedterm))

Python 2.7 version

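#!/usr/bin/env python2.7

"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python2.7 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import contextlib
import json
import re
import sys
import urllib2

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

# Matches double-quoted strings, single-quoted strings and bare object keys.
# Assumption: the payload is a simple JavaScript object literal like the
# sample response, with no backslash escapes in single-quoted strings.
JS_TOKEN = re.compile(
    r'''"(?:[^"\\]|\\.)*"|'([^'\\]*)'|([{,]\s*)([A-Za-z_]\w*)(\s*:)''')

def to_strict_json(text):
    """Quote bare keys and single-quoted strings so json.loads accepts them"""
    def fix(match):
        if match.group(1) is not None:
            return json.dumps(match.group(1))
        if match.group(3) is not None:
            return '%s"%s"%s' % match.group(2, 3, 4)
        return match.group(0)
    return JS_TOKEN.sub(fix, text)

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'

    # Strip JSONP function wrapper
    result = result.strip()
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]

    # Deserialize to Python objects; the sample response is a JavaScript
    # object literal rather than strict JSON, so normalize it first
    result_object = json.loads(to_strict_json(result))

    # Get the rows in the table, then get the second column's value
    # (index 1) for each row
    return [row['c'][1]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
    url = url_template % dict(limit=limit, seedterm=seedterm)

    try:
        # urllib2 responses are not context managers in Python 2, so
        # contextlib.closing supplies the with-statement support
        with contextlib.closing(urllib2.urlopen(url)) as data:
            result = data.read()
    except IOError:
        sys.stderr.write('%s\n' % 'Could not request data from server')
        sys.exit(E_OPERATION_ERROR)

    return parse_result(result)

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print term

if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        error_message = '''Usage: {} limit seedterm

limit must be an integer'''.format(sys.argv[0])
        sys.stderr.write('%s\n' % error_message)
        sys.exit(E_INVALID_PARAMS)

    sys.exit(main(limit, seedterm))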