Reading a web page with Python

Asked 2025-04-16 02:32

I'm trying to read and process a web page with Python. The page contains markup like this:

              <div class="or_q_tagcloud" id="tag1611"></div></td></tr><tr><td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td><td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td><td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td><td class="or_q_tags_td">

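I'm only interested in the artist name (e.g. AC/DC) and the album name (e.g. Live). I can read and print them with libxml2dom, but I can't work out how to tell the links apart, because every link's node value is None.

One obvious approach is to read the input line by line, but is there a smarter way to process this HTML file, so that I end up with two separate lists whose indices correspond, or a struct that holds this information?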

import urllib
import libxml2dom

def collect_text(node):
    "Collect the text inside 'node' recursively and return it."
    s = ""
    for child_node in node.childNodes:
        if child_node.nodeType == child_node.TEXT_NODE:
            s += child_node.nodeValue
        else:
            s += collect_text(child_node)
    return s

# Python 2 code: urllib.urlopen also accepts a local file path.
f = urllib.urlopen("/home/x/Documents/rym_list.html")
s = f.read()

doc = libxml2dom.parseString(s, html=1)

links = doc.getElementsByTagName("a")
for link in links:
    print "--\nNode ", link.childNodes
    # localName is the tag name ("a"), so this test never matches.
    if link.localName == "artist":
        print "artist"
    print collect_text(link).encode('utf-8')

f.close()

2 Answers

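See whether you can solve the problem in JavaScript with jQuery-style DOM/CSS selectors, which make it easy to get at the elements or text you want.
If you can, get a copy of the BeautifulSoup library for Python and you should be up and running in a few minutes.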
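
To make the selector idea concrete, here is a minimal sketch, assuming Python 3 and the bs4 package (BeautifulSoup 4). The function name extract_artists_and_albums and the surrounding <table> wrapper are my own additions for illustration; the row itself is the one from the question. It has not been tried on the full page.

```python
from bs4 import BeautifulSoup

# One row copied from the question, wrapped in a <table> for well-formedness.
HTML = """
<table>
<tr>
<td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
<td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
<td class="or_q_rating" id="rating374717">4.0</td>
<td class="or_q_ownership" id="ownership374717">CD</td>
</tr>
</table>
"""

def extract_artists_and_albums(html):
    """Return two lists (artists, albums) selected by the links' CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    # "a.artist" and "a.album" are CSS selectors: <a> tags with that class.
    artists = [a.get_text(strip=True) for a in soup.select("a.artist")]
    albums = [a.get_text(strip=True) for a in soup.select("a.album")]
    return artists, albums

artists, albums = extract_artists_and_albums(HTML)
print(artists)  # ['AC/DC']
print(albums)   # ['Live']
```

The two lists only line up by index if every row has exactly one artist link and one album link, so check that assumption against the real page.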

Answered 2025-04-16 by Python大师

Given only this small snippet of HTML, I can't say whether this will work on the whole page, but I can show you how to pull out 'AC/DC' and 'Live' with lxml.etree and XPath.

>>> from lxml import etree
>>> doc = etree.HTML("""<html>
... <head></head>
... <body>
... <tr>
... <td class="or_q_artist"><a title="[Artist916]" href="http://rateyourmusic.com/artist/ac_dc" class="artist">AC/DC</a></td>
... <td class="or_q_album"><a title="[Album374717]" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/" class="album">Live</a></td>
... <td class="or_q_rating" id="rating374717">4.0</td><td class="or_q_ownership" id="ownership374717">CD</td>
... <td class="or_q_tags_td">
... </tr>
... </body>
... </html>
... """)
>>> doc.xpath('//td[@class="or_q_artist"]/a/text()|//td[@class="or_q_album"]/a/text()')
['AC/DC', 'Live']
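
The single XPath above returns one flat list, so artist and album are not tied together. If you want one record per table row instead, something like the following might work. It is a sketch under assumptions: Python 3, every row holding one or_q_artist cell and one or_q_album cell, and names of my own choosing (Release, extract_releases). The second row in the sample is made up purely to show the pairing.

```python
from collections import namedtuple
from lxml import etree

# Hypothetical record type for illustration; not part of lxml.
Release = namedtuple("Release", ["artist", "album"])

HTML = """
<table>
<tr>
<td class="or_q_artist"><a class="artist" href="http://rateyourmusic.com/artist/ac_dc">AC/DC</a></td>
<td class="or_q_album"><a class="album" href="http://rateyourmusic.com/release/album/ac_dc/live_f5/">Live</a></td>
</tr>
<tr>
<!-- made-up placeholder row, not from the original page -->
<td class="or_q_artist"><a class="artist" href="#">Example Artist</a></td>
<td class="or_q_album"><a class="album" href="#">Example Album</a></td>
</tr>
</table>
"""

def extract_releases(html):
    """Return one Release per <tr> that contains an artist cell."""
    doc = etree.HTML(html)
    releases = []
    for row in doc.xpath('//tr[td[@class="or_q_artist"]]'):
        # string(...) yields the text of the first matching node, or "".
        artist = row.xpath('string(td[@class="or_q_artist"]/a)').strip()
        album = row.xpath('string(td[@class="or_q_album"]/a)').strip()
        releases.append(Release(artist, album))
    return releases

releases = extract_releases(HTML)
for release in releases:
    print(release.artist, "-", release.album)
```

Because each record is built from a single row, a row with a missing album yields an empty string rather than shifting every later album up by one index.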

Answered 2025-04-16 by Python大师
