How to extract certain parts of a web page with Python

User
Reply #1 · edited 2024-04-29 05:41:46

"Beau--ootiful Soo--oop!
Beau--ootiful Soo--oop!
Soo--oop of the e--e--evening,
Beautiful, beauti--FUL SOUP!"
-- Lewis Carroll, Alice's Adventures in Wonderland
I think this is exactly what he's looking for!
The Mock Turtle would probably do it something like this:
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> url = 'http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm'
>>> page = urllib2.urlopen(url)
>>> soup = BeautifulSoup(page)
>>> for row in soup.html.body.findAll('tr'):
...     data = row.findAll('td')
...     if data and 'subclass 885online' in data[0].text:
...         print data[4].text
...
15 May 2011
But I'm not sure how much that will help, since that date has already passed!
Good luck with your application!
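
For anyone on Python 3, where `urllib2` and the old `BeautifulSoup` module are not available, the same row scan might look roughly like the sketch below. It assumes the `beautifulsoup4` package is installed. `SAMPLE_HTML`, `find_dates`, and the five-column layout are made up for illustration, since the real page's markup isn't shown in this thread.

```python
# Python 3 sketch; requires the third-party package: pip install beautifulsoup4
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the live page; the real markup
# and column layout are assumptions, not taken from the actual site.
SAMPLE_HTML = """
<html><body><table>
  <tr><th>Visa</th><th>A</th><th>B</th><th>C</th><th>Date</th></tr>
  <tr><td>subclass 885 online</td><td>-</td><td>-</td><td>-</td><td>15 May 2011</td></tr>
  <tr><td>subclass 886 online</td><td>-</td><td>-</td><td>-</td><td>1 June 2011</td></tr>
</table></body></html>
"""


def find_dates(html, label, column=4):
    """Return the text of cell `column` for each row whose first cell contains `label`."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")  # header rows have no <td>, so cells is empty
        # Skip rows that are too short instead of raising IndexError.
        if len(cells) > column and label in cells[0].get_text():
            results.append(cells[column].get_text(strip=True))
    return results


print(find_dates(SAMPLE_HTML, "subclass 885"))
```

To run it against a live page, the HTML could be fetched with `urllib.request.urlopen(url).read()`, which is the Python 3 counterpart of `urllib2.urlopen`, and passed in as `html`.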

User
Reply #2 · edited 2024-04-29 05:41:46

There is a library called Beautiful Soup that does the job you're asking for: http://www.crummy.com/software/BeautifulSoup/

User
Reply #3 · edited 2024-04-29 05:41:46

You may want to use this as a starting point:

Python 2.6.7 (r267:88850, Jun 13 2011, 22:03:32) 
[GCC 4.6.1 20110608 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2, re
>>> from BeautifulSoup import BeautifulSoup
>>> urllib2.urlopen('http://www.immi.gov.au/skilled/general-skilled-migration/estimated-allocation-times.htm')
<addinfourl at 139158380 whose fp = <socket._fileobject object at 0x84aa2ac>>
>>> html = _.read()
>>> soup = BeautifulSoup(html)
>>> soup.find(text = re.compile('\\bsubclass 885\\b')).parent.parent.find('td', text = re.compile(' [0-9]{4}$'))
u'15 May 2011'
