Web scraping with lxml: malformed HTML

Posted 2024-04-19 10:17:51

I am trying to extract the article text from this page, http://sana.sy/eng/21/2013/01/07/pr-460536.htm, but its HTML is malformed. Can anyone tell me how to do this correctly?

This is the code:
import urllib2
from lxml import etree
import StringIO

speachesurls = ["http://sana.sy/eng/21/2013/01/07/pr-460536.htm", "http://sana.sy/eng/21/2012/06/04/pr-423234.htm", "http://sana.sy/eng/21/2012/01/12/pr-393338.htm"]


# scrape the speeches

for url in speachesurls:
    result = urllib2.urlopen(url)
    html = result.read()
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO.StringIO(html), parser)
    xpath = "//html/body/table[3]/tbody/tr[3]/td[4]/table[2]/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr[2]/td/div/table/tbody/tr[2]/td/div/p"
    a = tree.find(xpath)
    print a.text_content()

1 answer

User

#1 · Posted 2024-04-19 10:17:51

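There is nothing wrong with lxml or with the malformed HTML; lxml's HTML parser can handle it.

The code runs fine up to the lookup. The XPath expression simply does not match anything, so a will be None, and the call to a.text_content() then fails.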
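
A minimal sketch of this point, written for Python 3 with lxml.html (the question's code targets Python 2). The SAMPLE markup and the extract_paragraphs helper are hypothetical stand-ins, since the real page's structure is not known here. One common reason a path copied from browser developer tools fails is that browsers insert tbody elements that are absent from the page source, and lxml's HTML parser does not add them; whether that is the cause for this particular page is an assumption.

```python
from lxml import html as lxml_html

# Hypothetical, deliberately malformed markup (unclosed <p>, <td>, <tr>).
# It stands in for the real page, whose structure is not known here.
SAMPLE = """
<html><body>
<table>
  <tr><td><div><p>First paragraph.
  <p>Second paragraph.</div>
</table>
</body></html>
"""


def extract_paragraphs(markup, xpath):
    """Return the text of every element matched by xpath (may be empty)."""
    root = lxml_html.fromstring(markup)
    # xpath() returns a list, so "no match" is an empty list, not None.
    return [el.text_content().strip() for el in root.xpath(xpath)]


# Browser-style path: the parser does not add <tbody>, so nothing matches.
strict = extract_paragraphs(SAMPLE, "/html/body/table/tbody/tr/td/div/p")
# Looser path that does not depend on the exact nesting.
loose = extract_paragraphs(SAMPLE, "//td//p")

print(strict)
print(loose)
```

Using xpath() instead of find() also means a miss yields an empty list rather than None, which is easier to check before calling text_content().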
