Parsing HTML elements while scraping with Python

import urllib2
from BeautifulSoup import BeautifulSoup
import re

listofads = []
page = urllib2.urlopen("http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1").read()
soup = BeautifulSoup(page)
for a in soup.findAll("div", {"class":re.compile("lista")}):
    for i in a:
        c = soup.findAll('h2')
        y = soup.findAll("span", {"class":re.compile("right")})
        listofads.append(c)
        listofads.append(y)
print listofads

<div class="listofads">
<div class="lista " style="cursor: pointer;">
<div class="lista " style="cursor: pointer;">
  <div class="li_image">
  <div class="li_desc">
    <a href="http://www.custojusto.pt/Lisboa/Laptops/Macbook+pro+15-11018054.htm?xtcr=2&" name="11018054">
      <h2> Macbook pro 15 </h2>
    </a>
    <div class="clear"></div>
    <span class="li_date largedate listline"> Informática & Acessórios - Loures </span>
    <span class="li_date largedate listline">
  </div>
  <div class="li_categoria">
    <span class="li_price">
      <ul>
        <li>
          <span class="right">1 199 €</span>
          <div class="clear"></div>
        </li>
        <li class="excep"> </li>
      </ul>
    </span>
  </div>
  <div class="clear"></div>
</div>

1 Answer

User

#1 · Posted 2024-04-16 23:46:09

I don't know how to do this with BeautifulSoup, since it doesn't support XPath, but here is how you can do it nicely with lxml:

import urllib2
from lxml import etree
from lxml.cssselect import CSSSelector

url =  "http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)

my_products = []
# Here, we harvest all the results into a list of dictionaries containing the items we want.
for product_result in CSSSelector(u'div.lista')(tree):
    # Now, we can select the child elements of each div.lista.
    this_product = {
        u'name': product_result.xpath('div[2]/a/h2'),  # first h2 of the second child div
        u'category': product_result.xpath('div[2]/span[1]'),  # first span of the second child div
        u'price': product_result.xpath('div[3]/span/ul/li[1]/span'),  # Third div, span, ul, first li, span tag.
    }
    print this_product.get(u'name')[0].text
    my_products.append(this_product)

# Let's inspect a product result now:
for product in my_products:
    print u'Product Name: "{0}", costs: "{1}"'.format(
        product.get(u'name')[0].text.replace(u'Procura:', u'').strip() if product.get(u'name') else 'NONAME!',
        product.get(u'price')[0].text.strip() if product.get(u'price') else u'NO PRICE!',
    )

The second loop prints one line per listing, in the form: Product Name: "NAME", costs: "PRICE"

Some items do not include a price, so you need to check each result before printing it.
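That check can be pulled into a small helper. What follows is only a minimal sketch, not part of the original answer. It assumes Python 3 and uses the standard library's xml.etree.ElementTree instead of lxml, so it runs without a network connection. The SAMPLE markup is a hypothetical, well-formed stand-in for the live page, and first_text is a made-up helper name. The relative paths are the same ones used above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample standing in for the live page.
# The second listing deliberately has no price element.
SAMPLE = """
<div class="listofads">
  <div class="lista">
    <div class="li_image"/>
    <div class="li_desc">
      <a href="#"><h2> Macbook pro 15 </h2></a>
    </div>
    <div class="li_categoria">
      <span class="li_price"><ul><li><span class="right">1 199 €</span></li></ul></span>
    </div>
  </div>
  <div class="lista">
    <div class="li_image"/>
    <div class="li_desc">
      <a href="#"><h2> Macbook air 13 </h2></a>
    </div>
    <div class="li_categoria">
      <span class="li_price"><ul><li class="excep"/></ul></span>
    </div>
  </div>
</div>
"""


def first_text(nodes, default):
    """Return the stripped text of the first node, or default if it is missing or empty."""
    text = (nodes[0].text or "").strip() if nodes else ""
    return text or default


root = ET.fromstring(SAMPLE)
products = []
for ad in root.findall("div[@class='lista']"):
    products.append({
        # h2 inside the link of the second child div
        "name": first_text(ad.findall("div[2]/a/h2"), "NONAME!"),
        # third child div, span, ul, first li, span
        "price": first_text(ad.findall("div[3]/span/ul/li[1]/span"), "NO PRICE!"),
    })

for product in products:
    print('Product Name: "{0}", costs: "{1}"'.format(product["name"], product["price"]))
```

On this sample the second listing has no price span, so its line falls back to NO PRICE! instead of raising an IndexError. The same helper should work on the lists returned by lxml's xpath(), since those elements also expose a .text attribute.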
