Parsing HTML elements in Python while scraping

Posted 2024-04-16 23:46:09


I have this website:

http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1

I want to collect the name and the value of every ad into an array. This is what I have so far:

import urllib2
from BeautifulSoup import BeautifulSoup
import re


listofads = []

page = urllib2.urlopen("http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1").read()
soup = BeautifulSoup(page)
for a in soup.findAll("div", {"class":re.compile("lista")}):
    for i in a:
        c = soup.findAll('h2')
        y = soup.findAll("span", {"class":re.compile("right")})
        listofads.append(c)
        listofads.append(y)


print listofads

What I get looks like this:

                      </h2>, <h2>
                          Procura:  Macbook Pro i7, 15'

                      </h2>], [<span class="right">50  &euro;</span>

That looks terrible. What I want is just the title of each ad together with its price, and so on for every ad.

The site's HTML looks like this:

<div class="listofads">
<div class="lista " style="cursor: pointer;">
<div class="lista " style="cursor: pointer;">
<div class="li_image">
<div class="li_desc">
<a href="http://www.custojusto.pt/Lisboa/Laptops/Macbook+pro+15-11018054.htm?xtcr=2&" name="11018054">
<h2> Macbook pro 15 </h2>
</a>
<div class="clear"></div>
<span class="li_date largedate listline"> Informática & Acessórios - Loures </span>
<span class="li_date largedate listline">
</div>
<div class="li_categoria">
<span class="li_price">
<ul>
<li>
<span class="right">1 199 €</span>
<div class="clear"></div>
</li>
<li class="excep"> </li>
</ul>
</span>
</div>
<div class="clear"></div>
</div>

As you can see, I only need the H2 value (its text) inside the div with class "li_desc" and the price from the span with class "right".


1 answer
User
#1 · Posted 2024-04-16 23:46:09

I don't know how to do this with BeautifulSoup, because it does not support XPath, but here is how to do it cleanly with lxml:

import urllib2
from lxml import etree
from lxml.cssselect import CSSSelector

url =  "http://www.custojusto.pt/Lisboa?ca=14_s&th=1&q=macbook&cg=0&w=1"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)

my_products = []
# Here, we harvest all the results into a list of dictionaries, containing the items we want.
for product_result in CSSSelector(u'div.lista')(tree):
    # Now, we can select the children element of each div.lista.
    this_product = {
        u'name': product_result.xpath('div[2]/a/h2'),  # first h2 of the second child div
        u'category': product_result.xpath('div[2]/span[1]'),  # first span of the second child div
        u'price': product_result.xpath('div[3]/span/ul/li[1]/span'),  # Third div, span, ul, first li, span tag.
    }
    print this_product.get(u'name')[0].text
    my_products.append(this_product)

# Let's inspect a product result now:
for product in my_products:
    print u'Product Name: "{0}", costs: "{1}"'.format(
        product.get(u'name')[0].text.replace(u'Procura:', u'').strip() if product.get(u'name') else 'NONAME!',
        product.get(u'price')[0].text.strip() if product.get(u'price') else u'NO PRICE!',
    )

The second loop prints one line per product, giving its name and its price.

Some items do not include a price, so you need to check the result before printing each item.
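
The check mentioned above can be factored into a small helper. The following is only a sketch, written for Python 3. `first_text` is a hypothetical helper name, and the sample markup is made up: it is modelled on the HTML in the question, and the second ad is invented to show a missing price. It uses the standard library's `xml.etree.ElementTree` on a well-formed fragment so that it runs without network access. The lists returned by lxml's `xpath()` in the code above can be passed to the helper in the same way, since those elements also expose `.text`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modelled on the listing markup in the question.
# The second ad is invented and deliberately has no price span.
SAMPLE = """
<div class="listofads">
  <div class="lista">
    <div class="li_desc"><a><h2> Macbook pro 15 </h2></a></div>
    <div class="li_categoria"><span class="right">1 199 €</span></div>
  </div>
  <div class="lista">
    <div class="li_desc"><a><h2> Macbook Air 13 </h2></a></div>
    <div class="li_categoria"></div>
  </div>
</div>
""".strip()


def first_text(nodes, default):
    """Return the stripped text of the first node, or `default` if there is none."""
    if nodes and nodes[0].text and nodes[0].text.strip():
        return nodes[0].text.strip()
    return default


root = ET.fromstring(SAMPLE)
rows = []
for ad in root.findall("div[@class='lista']"):
    # Search relative to the current ad, not the whole document,
    # so each name stays paired with its own price.
    name = first_text(ad.findall(".//h2"), "NONAME!")
    price = first_text(ad.findall(".//span[@class='right']"), "NO PRICE!")
    rows.append((name, price))

for name, price in rows:
    print('Product Name: "{0}", costs: "{1}"'.format(name, price))
```

Keeping the fallback in one place means the lookup code does not have to repeat the `if ... else` test for every field.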
