How do I get structured JSON output with Scrapy?

def parse(self, response):
    sel = Selector(response)
    tuples = sel.xpath('//*[td[@class = "caption"]]')
    items = []
    for tuple in tuples:
        item = DataTuple()
        keyTemp = tuple.xpath('td[1]').extract()[0]
        key = html2text.html2text(keyTemp).rstrip()
        valueTemp = tuple.xpath('td[2]').extract()[0]
        value = html2text.html2text(valueTemp).rstrip()
        item[key] = value
        items.append(item)
    return items

1 answer

User

#1 · Posted 2024-06-16 18:47:35

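Below is a quick mock-up of how I would suggest doing this, provided you know the number of TDs per page. Feel free to take some or all of it. It may be over-engineered for your problem (sorry!); you could just take the chunk_by_number part and be done.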

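A few things to note:

1) Avoid using "tuple" as a variable name. It is not a reserved keyword, but it shadows the built-in tuple type.

2) Learn to use generators and built-ins, since they are faster and lighter if you are processing many sites at once (see parse_to_kv and chunk_by_number below).

3) Try to isolate the parsing logic so that if it changes you can easily swap it out in one place (see extract_td below).

4) Your function does not use "self", so you should use the @staticmethod decorator and remove this parameter from the function.

5) The output is currently a dict, but if you need a JSON object you can import json and dump it.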
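As a rough sketch of point 5, assuming the key/value pairs have already been extracted as plain strings, dumping them to JSON could look like the following. The helper name `pairs_to_json` and the sample pairs are made up for illustration and do not come from the original spider.

```python
import json


def pairs_to_json(pairs):
    # Hypothetical helper: build a dict from (key, value) pairs and
    # serialise it. ensure_ascii=False keeps non-ASCII text readable,
    # sort_keys=True makes the output order deterministic.
    return json.dumps(dict(pairs), ensure_ascii=False, sort_keys=True)


# Made-up sample data standing in for the scraped table cells.
sample = [("Name", "Widget"), ("Price", "9.99")]
print(pairs_to_json(sample))
```

Using `json.dumps` returns a string; `json.dump` would write to an open file object instead, which may be more convenient when saving one result per page.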

import html2text
from scrapy.selector import Selector


def extract_td(row, index):
    # Page-specific extraction logic: pulls either the key or the
    # value out of a table row.
    # index is the 1-based XPath position of the <td> (1 = key, 2 = value).
    # Returns the cell's content as plain text.
    td_as_str = "td[%i]" % index
    val = row.xpath(td_as_str).extract()[0]
    return html2text.html2text(val).rstrip()

def parse_to_kv(rows):
    # Yields (key, value) pairs from the given row selectors.
    # This is also page specific.
    # XPath positions are 1-based, so td[1] is the key and td[2] the value.
    for row in rows:
        yield extract_td(row, 1), extract_td(row, 2)

def chunk_by_number(iterable, num):
    # Splits iterable into tuples of num items each.
    # This is a very generic, reusable operation.
    # Note: a trailing group with fewer than num items is dropped.
    for chunk in zip(*(iter(iterable),) * num):
        yield chunk

def parse(response, td_per_page):
    # Extracts key/value pairs from the table rows in response and
    # prints one dict per group of td_per_page pairs.
    # This is very specific to our parse patterns.
    sel = Selector(response)
    rows = sel.xpath('//*[td[@class = "caption"]]')
    kv_generator = parse_to_kv(rows)

    for page in chunk_by_number(kv_generator, td_per_page):
        print(dict(page))
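
To see what the chunking step does without running a spider, here is a small self-contained sketch. It re-declares chunk_by_number so that it runs on its own, and the sample pairs are invented stand-ins for what parse_to_kv would yield.

```python
def chunk_by_number(iterable, num):
    # Same idea as above: zip over num references to ONE iterator,
    # so each output tuple takes num consecutive items.
    # A trailing group shorter than num is dropped.
    yield from zip(*(iter(iterable),) * num)


# Invented (key, value) pairs standing in for parse_to_kv output.
pairs = (("k%d" % i, "v%d" % i) for i in range(5))
pages = [dict(chunk) for chunk in chunk_by_number(pairs, 2)]
print(pages)
```

With five pairs and a chunk size of 2, the fifth pair is silently discarded. If partial groups matter for your pages, `itertools.zip_longest` with a fill value is one possible alternative.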
