How do I extract text using lxml?

1 Answer

User

#1 · Posted 2024-06-10 20:35:40

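In general, to solve problems like this you must first download the page of interest as text (use urllib.urlopen, which is urllib.request.urlopen in Python 3, or anything else, even external utilities such as curl or wget, but not a browser, since you want to see how the page looks before any Javascript has had a chance to run) and study it to understand its structure. In this case, after some study, you'll find the relevant parts are (snipping some irrelevant parts in head and breaking lines up for readability):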

<body onload=nx_init();>
 <dl>
 <dt>
<a href="http://news.naver.com/main/read.nhn?mode=LSD&mid=sec&sid1=&oid=091&aid=0002497340"
 [[snipping other attributes of this tag]]>
JAPAN TOKYO INTERNATIONAL FILM FESTIVAL</a>
</dt>
 <dd class="txt_inline">
EPA¿¬ÇÕ´º½º ¼¼°è <span class="bar">
|</span>
 2009.10.25 (ÀÏ) ¿ÀÈÄ 7:21</dd>
 <dd class="sh_news_passage">
 Japan, 25 October 2009. Gayet won the Best Actress Award for her role in the film 'Eight <b>
Times</b>
 Up' directed by French filmmaker Xabi Molia. EPA/DAI KUROKAWA</dd>

and so forth. So you want as "subject" the content of an <a> tag within a <dt>, and as "content" the content of the <dd> tags that follow it (in the same <dl>).

The headers you get contain:

[response header listing not preserved in the source]

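So you must also find a way to interpret that encoding as Unicode. I believe the encoding is also known as 'euc_kr', and my Python installation seems to come with a codec for it, but you should check yours as well.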
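As a rough sketch of the download-and-decode step in Python 3: the helper below reads the charset named in the Content-Type header and falls back to 'euc_kr' when none is given. The names fetch_text and fallback_charset are made up for illustration, and the data: URL is only a stand-in for the real page so that the example runs without network access; its payload is the same byte sequence that appears garbled in the excerpt above.

```python
from urllib.request import urlopen


def fetch_text(url, fallback_charset="euc_kr"):
    """Download url and decode the body with the charset from Content-Type.

    fallback_charset is an assumption used only when the server names none.
    """
    with urlopen(url) as response:
        raw = response.read()
        charset = response.headers.get_content_charset() or fallback_charset
    return raw.decode(charset, errors="replace"), charset


# Stand-in for the real page: a data: URL carrying eight EUC-KR bytes,
# the ones that show up as "¿¬ÇÕ´º½º" when read as Latin-1.
demo_url = "data:text/html;charset=euc-kr,%BF%AC%C7%D5%B4%BA%BD%BA"
text, charset = fetch_text(demo_url)
print(charset, text)

# Re-encoding and decoding with the wrong codec reproduces the garbled form.
print(text.encode("euc_kr").decode("latin-1"))
```

Python's codec lookup ignores case and treats hyphens and underscores alike, so a header value such as euc-kr resolves to the same codec as 'euc_kr'.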

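Once you have determined all of these aspects, you try to lxml.etree.parse the URL and, as with so many other web pages, it does not parse: the page is not really well-formed HTML (try W3C's validators on it to see some of the ways in which it is broken).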

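Because badly formed HTML is so common on the web, there exist "tolerant parsers" that try to compensate for common errors. The most popular in Python is BeautifulSoup, and in fact lxml supports it: with lxml 2.0.3 or later you can use BeautifulSoup as the underlying parser and then proceed "just as if" the document had parsed correctly. I find it simpler to use BeautifulSoup directly, though.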
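The difference between a strict parse and a tolerant one can be seen without any third-party package. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for a strict parser and html.parser as a stand-in for a tolerant one; neither is lxml or BeautifulSoup, and the shortened SNIPPET and the TagCollector class are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Shortened version of the page excerpt: unquoted attribute, unclosed <body>.
SNIPPET = ('<body onload=nx_init();><dl><dt><a href="#">TITLE</a></dt>'
           '<dd>text <b>more</b></dd></dl>')

# A strict XML parser rejects the markup outright.
try:
    ET.fromstring(SNIPPET)
    strict_ok = True
except ET.ParseError as exc:
    strict_ok = False
    print("strict parse failed:", exc)


class TagCollector(HTMLParser):
    """Record every start tag the tolerant parser manages to recognise."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# A tolerant HTML parser walks through the same text without complaint.
collector = TagCollector()
collector.feed(SNIPPET)
collector.close()
print("tolerant parser saw:", collector.tags)
```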

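For example, here is a script that emits the first few subject/content pairs at that URL (they have changed by now; originally they were the same as the ones you give ;-). You need a terminal that supports Unicode output (for example, I run this without problems on a Mac with Terminal.app set to utf-8). Of course, instead of printing you can collect the Unicode fragments (e.g. append them to a list and ''.join them once you have all the pieces you need), encode them as you wish, and so on.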

from urllib.request import urlopen

from bs4 import BeautifulSoup  # the beautifulsoup4 package


def getit(pagetext, howmany=0):
    """Print up to `howmany` subject/content pairs (0 means no limit)."""
    soup = BeautifulSoup(pagetext, 'html.parser')
    for adl in soup.find_all('dl'):
        thedt = adl.dt
        while thedt:
            # The subject is the text of the <a> inside this <dt>.
            thea = thedt.a
            if thea:
                print('SUBJECT:', thea.string)
            # The content is every text fragment in the <dd> siblings that follow.
            thedd = thedt.find_next_sibling('dd')
            if thedd:
                print('CONTENT:', end=' ')
                while thedd:
                    for x in thedd.find_all(string=True):
                        print(x, end=' ')
                    thedd = thedd.find_next_sibling('dd')
                print()
            howmany -= 1
            if not howmany:
                return
            print()
            thedt = thedt.find_next_sibling('dt')


theurl = ('http://news.search.naver.com/search.naver?'
          'sm=tab%5Fhty&where=news&query=times&x=0&y=0')
thepage = urlopen(theurl).read()
getit(thepage, 3)
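If you would rather collect the fragments than print them, something along these lines should do. It is only a sketch: it assumes the bs4 package (beautifulsoup4) is installed, SAMPLE is a trimmed copy of the excerpt above rather than the live page, and collect_content is a name made up for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the page excerpt, not the live page.
SAMPLE = """<dl><dt><a href="#">JAPAN TOKYO INTERNATIONAL FILM FESTIVAL</a></dt>
<dd class="sh_news_passage"> Gayet won the Best Actress Award for her role in
the film 'Eight <b>
Times</b>
 Up' directed by French filmmaker Xabi Molia.</dd></dl>"""


def collect_content(dd):
    """Gather every text fragment under one <dd> and join them into a string."""
    fragments = []
    for piece in dd.find_all(string=True):
        fragments.append(piece)
    # Join first, then collapse the line breaks and runs of spaces.
    return ' '.join(''.join(fragments).split())


soup = BeautifulSoup(SAMPLE, "html.parser")
content = collect_content(soup.dl.dd)
encoded = content.encode("utf-8")  # pick whatever output encoding you need
print(content)
```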

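The logic in lxml, or in "BeautifulSoup in lxml clothing", is not very different; only the spelling and capitalization of the various navigation operations change a bit.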
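For comparison, here is roughly how the same walk might look with lxml's own tolerant HTML parser. Treat it as a sketch under assumptions: it needs lxml installed, SAMPLE is an abbreviated, made-up variant of the excerpt (the example.invalid links and the second headline are placeholders), and get_pairs is a hypothetical name. Unlike the script above, it stops collecting <dd> text at the next <dt> and returns the pairs instead of printing them.

```python
import lxml.html

# Abbreviated, partly invented markup in the shape of the page excerpt.
SAMPLE = """
<dl>
 <dt><a href="http://example.invalid/1">JAPAN TOKYO INTERNATIONAL FILM FESTIVAL</a></dt>
 <dd class="txt_inline">EPA <span class="bar">|</span> 2009.10.25</dd>
 <dd class="sh_news_passage">Gayet won the Best Actress Award for her role in
 the film 'Eight <b>Times</b> Up'.</dd>
 <dt><a href="http://example.invalid/2">SECOND HEADLINE</a></dt>
 <dd class="sh_news_passage">Second passage.</dd>
</dl>
"""


def get_pairs(pagetext, howmany=0):
    """Return up to `howmany` (subject, content) tuples; 0 means no limit."""
    root = lxml.html.fromstring(pagetext)
    pairs = []
    for dt in root.iter('dt'):
        link = dt.find('a')
        subject = link.text_content().strip() if link is not None else None
        parts = []
        # Walk the following siblings until the next <dt> starts a new item.
        for sib in dt.itersiblings():
            if sib.tag == 'dt':
                break
            if sib.tag == 'dd':
                parts.append(' '.join(sib.text_content().split()))
        pairs.append((subject, ' '.join(parts)))
        if howmany and len(pairs) >= howmany:
            break
    return pairs


for subject, content in get_pairs(SAMPLE):
    print('SUBJECT:', subject)
    print('CONTENT:', content)
```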
