BeautifulSoup: extracting text from an anchor tag

<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1343628292&sr=1-1&keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>

for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']

3 answers

User

Answer 1 · edited 2024-06-03 10:59:33

I'd suggest going the lxml route and using XPath.

from lxml import etree
# data is the variable containing the html
data = etree.HTML(data)
anchor = data.xpath('//a[@class="title"]/text()')
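
To make the snippet above easier to try out, here is a minimal self-contained sketch. The sample markup, the example.com URLs, and the variable names `html`, `tree`, `titles`, and `links` are assumptions for illustration, not part of the original answer. Note that `@class="title"` only matches when the class attribute is exactly `title`.

```python
from lxml import etree

# Hypothetical sample markup modelled on the anchor in the question
# (title and URLs shortened; they are placeholders).
html = '''
<div class="data">
  <a class="title" href="http://www.example.com/item1">Nikon COOLPIX L26 (Red)</a>
  <a class="other" href="http://www.example.com/item2">Not a title link</a>
</div>
'''

tree = etree.HTML(html)

# text() yields a list with one string per matching text node.
titles = tree.xpath('//a[@class="title"]/text()')
# The same predicate followed by /@href returns the link targets instead.
links = tree.xpath('//a[@class="title"]/@href')

print(titles)
print(links)
```

Both XPath calls return lists, so take `titles[0]` (after checking the list is non-empty) when only the first match is needed.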

User

Answer 2 · edited 2024-06-03 10:59:33

This should help:

from bs4 import BeautifulSoup

data = '''<div class="image">
        <a href="http://www.example.com/eg1">Content1<img  
        src="http://image.example.com/img1.jpg" /></a>
        </div>
        <div class="image">
        <a href="http://www.example.com/eg2">Content2<img  
        src="http://image.example.com/img2.jpg" /> </a>
        </div>'''

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, 'html.parser')

for div in soup.findAll('div', attrs={'class':'image'}):
    print(div.find('a')['href'])
    print(div.find('a').contents[0])
    print(div.find('img')['src'])
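
The same approach might be adapted to the layout the question's loop seems to expect. The sketch below assumes each `div.image` is followed by a sibling `div.data` containing the `a.title` link. That structure is inferred from the question's code, and the sample markup, URLs, and the `results` list are hypothetical.

```python
from bs4 import BeautifulSoup

# Assumed layout (inferred from the question's loop); all values are placeholders.
html = '''
<div class="image"><img src="http://image.example.com/img1.jpg" /></div>
<div class="data"><a class="title" href="http://www.example.com/eg1">Camera One</a></div>
<div class="image"><img src="http://image.example.com/img2.jpg" /></div>
<div class="data"><a class="title" href="http://www.example.com/eg2">Camera Two</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')

results = []
for div in soup.find_all('div', class_='image'):
    # find_next_sibling returns a single Tag (or None), so search inside it
    # directly rather than iterating over it.
    data = div.find_next_sibling('div', class_='data')
    link = data.find('a', class_='title') if data else None
    img = div.find('img')
    results.append((
        link.get_text(strip=True) if link else None,
        img['src'] if img else None,
    ))

print(results)
```

Iterating over the tag returned by `findNextSibling` walks its children, which can include plain text nodes. Calling `find` on the tag itself avoids that.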

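If you are working with Amazon products, you should use the official API. There is at least one Python package that will ease your scraping troubles and keep your activity within the terms of use.

User

Answer 3 · edited 2024-06-03 10:59:33

In my case, this worked: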

# Python 2 with the legacy BeautifulSoup 3 package
import urllib

from BeautifulSoup import BeautifulSoup as bs

url = "http://blabla.com"

soup = bs(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.string
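
For readers on Python 3 who would rather avoid third-party packages, roughly the same extraction can be sketched with the standard library's `html.parser`. This is a different technique from the BeautifulSoup code above, and the `TitleLinkParser` class, the `extract_titles` helper, and the sample markup are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect the text of every <a> element whose class list contains 'title'."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # text chunks gathered while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'a' and 'title' in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._buffer is not None:
            self.titles.append(''.join(self._buffer).strip())
            self._buffer = None


def extract_titles(html):
    parser = TitleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Placeholder markup; the nested <b> shows that child text is included.
sample = (
    '<div class="data">'
    '<a class="title" href="http://www.example.com/item1">Nikon COOLPIX L26 <b>(Red)</b></a>'
    '<a href="http://www.example.com/item2">Other</a>'
    '</div>'
)
print(extract_titles(sample))
```

Unlike `link.string`, which is `None` when an anchor has more than one child, this sketch concatenates all text found inside the anchor.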

Hope this helps!
