scraperwiki + lxml: how do I get the href attribute of the child of an element with a class?

Published 2024-06-07 14:59:15


The page has a number of links (hrefs) whose URL contains "alpha", and I want to collect these hrefs from 20 different pages and append each one onto the end of the base URL (second-to-last line of the code below). The hrefs sit in a table cell whose td class is `mys-elastic mys-left`, and `a` is evidently the element carrying the href attribute. Any help would be greatly appreciated, as I have been working on this for about a week.

import scraperwiki
import lxml.html

# The HTML scraper for the 20 pages that list all the exhibitors
for i in range(1, 21):
    url = ('http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm'
           '?alpha=%40&type=alpha&page=' + str(i) + '#GotoResults')
    print(url)
    list_html = scraperwiki.scrape(url)

    # Convert HTML to an lxml object
    root = lxml.html.fromstring(list_html)
    # Both classes chained with dots; a space would mean a descendant tag
    href_elements = root.cssselect('td.mys-elastic.mys-left a')

    for element in href_elements:
        # Read the attribute from each element, not from the list
        href = element.get('href')
        print(href)

        page_html = scraperwiki.scrape('http://ahr13.mapyourshow.com' + href)
        print(page_html)
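A side note on the selector in the question: in CSS, `td.mys-elastic mys-left a` asks for a `<mys-left>` tag inside the td, which matches nothing; the two classes must be chained as `td.mys-elastic.mys-left`. Chained class selectors match the whitespace-separated class tokens in any order. A minimal sketch of that matching rule (the helper `has_classes` is hypothetical, written here only to illustrate the semantics):

```python
def has_classes(class_attr, required):
    # CSS .a.b semantics: every required class must appear among the
    # whitespace-separated tokens of the class attribute, in any order
    tokens = set(class_attr.split())
    return all(cls in tokens for cls in required)

print(has_classes('mys-elastic mys-left', ['mys-elastic', 'mys-left']))  # True
print(has_classes('mys-left mys-elastic', ['mys-elastic', 'mys-left']))  # True
print(has_classes('mys-elastic', ['mys-elastic', 'mys-left']))           # False
```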

2 Answers
import lxml.html as lh
from itertools import chain

URL = 'http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page='
BASE = 'http://ahr13.mapyourshow.com'
path = '//table[2]//td[@class="mys-elastic mys-left"]//@href'

results = []
for i in range(1, 21):
    doc = lh.parse(URL + str(i))
    # Use a distinct loop variable so it does not shadow the page counter i
    results.append(BASE + href for href in doc.xpath(path))

print(list(chain(*results)))
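One caveat about gluing `BASE` onto each href by string concatenation: it silently produces broken URLs if a href is ever already absolute or the base gains a trailing slash. A safer sketch using the standard library's `urljoin` (the hrefs below are made-up placeholders, not real exhibitor IDs):

```python
from urllib.parse import urljoin

BASE = 'http://ahr13.mapyourshow.com'
# Hypothetical hrefs, as the XPath above might return them
hrefs = ['/5_0/exhibitor_details.cfm?exhid=123',
         'http://ahr13.mapyourshow.com/5_0/exhibitor_details.cfm?exhid=456']

# urljoin resolves each href against the base instead of blindly concatenating,
# so relative and absolute hrefs both come out as valid full URLs
full_urls = [urljoin(BASE, h) for h in hrefs]
print(full_urls)
```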

No need to waste time on the javascript - it is all in the html:

import scraperwiki
import lxml.html

html = scraperwiki.scrape('http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page=1')

root = lxml.html.fromstring(html)
# get the links
hrefs = root.xpath('//td[@class="mys-elastic mys-left"]/a')

for href in hrefs:
    print('http://ahr13.mapyourshow.com' + href.attrib['href'])
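Worth noting: in XPath, `[@class="mys-elastic mys-left"]` is an exact string comparison against the whole attribute value, so it only matches when the classes appear in exactly that order and spelling, unlike a CSS class selector. A small sketch of that behaviour using the standard library's ElementTree on a made-up, well-formed fragment (the hrefs and company names are placeholders):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking the exhibitor table markup
snippet = '''<table>
  <tr>
    <td class="mys-elastic mys-left">
      <a href="/5_0/exhibitor_details.cfm?exhid=123">Acme Corp</a>
    </td>
    <td class="mys-left mys-elastic">
      <a href="/5_0/exhibitor_details.cfm?exhid=456">Other Corp</a>
    </td>
  </tr>
</table>'''

root = ET.fromstring(snippet)

# The predicate compares the whole class string, so the second <td>,
# whose classes appear in the opposite order, is skipped
links = root.findall(".//td[@class='mys-elastic mys-left']/a")
hrefs = [a.get('href') for a in links]
print(hrefs)  # ['/5_0/exhibitor_details.cfm?exhid=123']
```

If the page ever reorders or adds classes, a `contains(@class, ...)` predicate or a CSS selector is more robust than the exact-match form.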
