Getting the contents of <a> tags with Python
Suppose I have already read the HTML content into my program, like this:
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T & P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html">Optical Sales Position</a> - <font size="-1"> (New Westminster)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817709780.html">Sales Clerk</a> - <font size="-1"> (Kits)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817676850.html">MARINE SALES</a> - <font size="-1"> (VANCOUVER ( KITS ))</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817608506.html">Retail Sales Associate</a> - <font size="-1"> (Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817573985.html">Retail with small parts appliance background</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817540938.html">Manager *Enjoyable work atmosphere</a> - <font size="-1"> (Langley Centre)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html">Team Member - Retail Store - FT</a> - <font size="-1"> (Burnaby South)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817459155.html">STORE MANAGER-SHOE WAREHOUSE</a> - <font size="-1"> (South Surrey-Semiahmoo)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/pml/ret/1817448777.html">Retail Sales</a> - <font size="-1"> (Coquitlam)</font></p>
How do I get the content of the text nodes? I would like to print something like the following to the terminal:
http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - TRAVEL AGENT
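So far I have the following code, which extracts the href links just fine, but I am not sure how to extract the data itself. I am thinking of overriding the handle_data(self, data) method from the sgmllib.py module, but so far I have not figured out how to do it.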
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == "href"]
        if href:
            self.urls.extend(href)
Thanks!
5 Answers
2
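SGMLParser is deprecated as of Python 2.6 and will be removed in 3.0. You probably want to use the HTMLParser module instead. I had never used that module before (I usually reach for BeautifulSoup for this kind of thing), so I wanted to learn how it works. Here is a sample script I put together that should do what you need.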
#!/usr/bin/env python
from HTMLParser import HTMLParser

class URLParser(HTMLParser):
    def __init__(self):
        self.in_link = False
        self.links = []
        self.current_link = ''
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.current_link = self.get_href_from_attrs(attrs)
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.links.append(self.current_link)
            self.in_link = False

    def handle_data(self, data):
        # Only text seen between <a> and </a> is appended to the current link.
        if self.in_link:
            self.current_link = '%s - %s' % (self.current_link, data)

    def get_href_from_attrs(self, attrs):
        # attrs is a list of (name, value) tuples like:
        # [('href', 'www.google.com'), ('class', 'some-class')]
        for prop, val in attrs:
            if prop == 'href':
                return val
        return ''

if __name__ == '__main__':
    the_html = '''
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T & P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html">Optical Sales Position</a> - <font size="-1"> (New Westminster)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817709780.html">Sales Clerk</a> - <font size="-1"> (Kits)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817676850.html">MARINE SALES</a> - <font size="-1"> (VANCOUVER ( KITS ))</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817608506.html">Retail Sales Associate</a> - <font size="-1"> (Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817573985.html">Retail with small parts appliance background</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817540938.html">Manager *Enjoyable work atmosphere</a> - <font size="-1"> (Langley Centre)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html">Team Member - Retail Store - FT</a> - <font size="-1"> (Burnaby South)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817459155.html">STORE MANAGER-SHOE WAREHOUSE</a> - <font size="-1"> (South Surrey-Semiahmoo)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/pml/ret/1817448777.html">Retail Sales</a> - <font size="-1"> (Coquitlam)</font></p>
'''
    url_parser = URLParser()
    url_parser.feed(the_html)
    print '\n'.join(url_parser.links)
Output:
http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - F/T - P/T Sales Associate - Caliente Fashions
http://vancouver.en.craigslist.ca/van/ret/1817804151.html - IMMEDIATE EMPLOYMENT WANTED!
http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html - TRAVEL AGENT
http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html - Optical Sales Position
http://vancouver.en.craigslist.ca/van/ret/1817709780.html - Sales Clerk
http://vancouver.en.craigslist.ca/van/ret/1817676850.html - MARINE SALES
http://vancouver.en.craigslist.ca/van/ret/1817608506.html - Retail Sales Associate
http://vancouver.en.craigslist.ca/van/ret/1817573985.html - Retail with small parts appliance background
http://vancouver.en.craigslist.ca/rds/ret/1817540938.html - Manager *Enjoyable work atmosphere
http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html - Team Member - Retail Store - FT
http://vancouver.en.craigslist.ca/rds/ret/1817459155.html - STORE MANAGER-SHOE WAREHOUSE
http://vancouver.en.craigslist.ca/pml/ret/1817448777.html - Retail Sales
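Update: after trying this little exercise, I found the interface awkward to use, so I decided to stick with the cleaner BeautifulSoup library. See Alex's example for how to do it.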
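For readers on Python 3, where the module is named html.parser, the same idea might be sketched roughly as follows. This is only an illustration and not part of the original script: the class name LinkTextParser, the sample string, and the example.com URLs are made up, and it collects (href, text) tuples instead of pre-formatted strings.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for every <a> element fed to the parser."""

    def __init__(self):
        # convert_charrefs=True (the default since Python 3.5) means
        # handle_data receives entities such as &amp; already decoded.
        super().__init__(convert_charrefs=True)
        self.links = []      # finished (href, text) tuples
        self._href = None    # href of the <a> we are currently inside, if any
        self._chunks = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples
            self._href = dict(attrs).get("href", "")
            self._chunks = []

    def handle_data(self, data):
        # Ignore text that is not inside an <a> element.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._chunks).strip()))
            self._href = None


# Hypothetical sample input, shortened from the markup in the question.
sample = (
    '<p><a href="http://example.com/1.html">F/T &amp; P/T Sales</a> - '
    '<font size="-1"> (North Vancouver)</font></p>\n'
    '<p><a href="http://example.com/2.html">TRAVEL AGENT</a> - </p>\n'
)

parser = LinkTextParser()
parser.feed(sample)
parser.close()
for href, text in parser.links:
    print("%s - %s" % (href, text))
```

Gathering the fragments in a list and joining them at the closing tag keeps the link text in one piece even if the parser delivers it in several handle_data calls.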
4
Personally, I recommend the lxml library. Once it is installed, getting what you need is very simple:
from lxml import html
tree = html.fromstring(open("data.html").read())
print [e.text_content() for e in tree.xpath("//a")]
8
The simplest option is probably BeautifulSoup (be sure to use 3.0.8 or a later 3.0.* release, not 3.1.*, unless you are on Python 3 -- see here for the specific problems!).
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup(thehtmlstring)
for anchor in soup.findAll('a'):
    print anchor['href'], anchor.string
BeautifulSoup produces unicode strings; if that is a problem for you, remember to encode them the way you want in order to get the byte strings you need!
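As a rough illustration of that encoding step, the sketch below uses only the built-in string type and no BeautifulSoup at all. The variable title is a made-up stand-in for a value such as anchor.string, and the choice of UTF-8 and of the "replace" error policy is an assumption rather than a recommendation from the answer.

```python
# Stand-in for a unicode value such as anchor.string; the text is invented.
title = u"Caf\u00e9 Sales Associate"

# Encode explicitly when a byte string is needed, e.g. for a binary file.
utf8_bytes = title.encode("utf-8")

# For an ASCII-only destination, choose an error policy up front instead of
# letting encode() raise UnicodeEncodeError on the non-ASCII character.
ascii_bytes = title.encode("ascii", "replace")

print(repr(utf8_bytes))
print(repr(ascii_bytes))
```

Which encoding is appropriate depends on where the bytes are going (terminal, file, database), so it is worth deciding that once, at the output boundary, rather than encoding strings piecemeal.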