Replacing BeautifulSoup with another (standard) HTML parsing module in this Python script

0 votes

3 answers

685 views

Asked 2025-04-17 01:17

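I wrote a script that uses BeautifulSoup. It works well, and the code is easy to read. However, I'd like to share this script at some point, and BeautifulSoup is an external dependency that I would rather avoid, especially on Windows.

Here is the code. It fetches the link to every map belonging to a given Google Maps user. The lines marked with #### are the ones that use BeautifulSoup: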

# coding: utf-8

import urllib, re
from BeautifulSoup import BeautifulSoup as bs

uid = '200931058040775970557'
start = 0
shown = 1

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    soup = bs(source)  ####
    maptables = soup.findAll(id=re.compile('^map[0-9]+$'))  #################
    for table in maptables:
        for line in table.findAll('a', 'maptitle'):  ################
            mapid = re.search(uid+'\.([^"]*)', str(line)).group(1)
            mapname = re.search('>(.*)</a>', str(line)).group(1).strip()[:-3]
            print shown, mapid, '\t', mapname
            shown += 1

            urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) +
                               '&msa=0&output=kml', mapname + '.kml')


    if '<span>Next</span>' in str(source):
        start += 5
    else:
        break

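As you can see, only three lines use BeautifulSoup, but I'm not a programmer, and I ran into a lot of trouble when I tried the other standard HTML and XML parsing tools, probably because I was going about it the wrong way.

Edit: this question is mainly about replacing those three lines in the script, not about finding a solution to HTML parsing in general.

Any help is greatly appreciated. Thanks for reading!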


3 Answers

-1

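I tried the code below, and it prints a list of links. I don't have BeautifulSoup installed and don't want to install it, so I can't easily compare my output with what your code produces. The "pure" Python version, with no "soup" at all, is even shorter and more readable. Anyway, here is my code. Tell me what you think! Regards, Louis.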

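# coding: utf-8

import urllib, re

uid = '200931058040775970557'
start = 0
shown = 1

# Matches every <a ...maptitle...>...</a> element in the raw page source.
# Unlike the BeautifulSoup version, it does not first narrow the search
# to elements whose id matches map[0-9]+.
link_re = re.compile(r'<a\b[^>]*\bmaptitle\b[^>]*>.*?</a>', re.S)

while True:
    url = 'http://maps.google.com/maps/user?uid='+uid+'&ptab=2&start='+str(start)
    source = urllib.urlopen(url).read()
    # Handle each map link on this page once, then go on to the next page.
    for link in link_re.findall(source):
        mapid = re.search(re.escape(uid)+r'\.([^"]*)', link).group(1)
        mapname = re.search(r'>(.*)</a>', link, re.S).group(1).strip()[:-3]
        print shown, mapid, '\t', mapname
        shown += 1
        urllib.urlretrieve('http://maps.google.com.br/maps/ms?msid=' + uid + '.' + str(mapid) + '&msa=0&output=kml', mapname + '.kml')

    if '<span>Next</span>' in str(source):
        start += 5
    else:
        break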

Answered 2025-04-17 by Python大师


To parse HTML, I see three options:

Simple string searches (.find() and the like), which are fast.
Regular expressions (regex).
An HTML parser (HTMLParser).
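
The third option can be sketched with the standard library's HTMLParser class. This is only a minimal sketch, not something checked against the live page: it assumes the elements whose id matches map[0-9]+ are <table> tags and that each map link is an <a> whose class list contains maptitle. The class name MapLinkParser and the sample markup are made up for illustration. It is written for Python 3, where the module is html.parser; in Python 2 the same class lives in the HTMLParser module.

```python
import re
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

MAP_ID = re.compile(r'^map[0-9]+$')


class MapLinkParser(HTMLParser):
    """Collect (href, text) for <a class="maptitle"> inside <table id="mapN">.

    Assumption: the id="mapN" elements are tables, as the variable name
    'maptables' in the question suggests.
    """

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []        # finished (href, text) pairs
        self.table_depth = 0   # > 0 while inside a matching table
        self.href = None       # href of the anchor being read, if any
        self.text = []         # text pieces of the anchor being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'table':
            # Enter a matching table, or track a table nested inside one.
            if self.table_depth or MAP_ID.match(attrs.get('id') or ''):
                self.table_depth += 1
        elif tag == 'a' and self.table_depth:
            if 'maptitle' in (attrs.get('class') or '').split():
                self.href = attrs.get('href') or ''
                self.text = []

    def handle_endtag(self, tag):
        if tag == 'table' and self.table_depth:
            self.table_depth -= 1
        elif tag == 'a' and self.href is not None:
            self.links.append((self.href, ''.join(self.text).strip()))
            self.href = None

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)


# Hypothetical markup, only to show the parser at work.
sample = '''
<table id="map1"><tr><td>
<a class="maptitle" href="/maps/ms?msid=200931058040775970557.0004abc">My trip</a>
</td></tr></table>
<a class="maptitle" href="/elsewhere">ignored</a>
'''

parser = MapLinkParser()
parser.feed(sample)
parser.close()
for href, name in parser.links:
    print(href, name)
```

In the script from the question, feeding the downloaded page (as a decoded string) to such a parser would roughly stand in for the three BeautifulSoup lines, and the map id could then be pulled out of each href with the same uid-based regex.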

Answered 2025-04-17 by Python大师


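Unfortunately, Python's standard library has no good HTML parsing tool, so the sensible way to parse HTML is to use a third-party module such as lxml.html or BeautifulSoup. That doesn't mean you have to carry an external dependency. These modules are free software, so if you don't want an external dependency you can bundle them with your code, and then they are no more of a dependency than the code you wrote yourself.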
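
One way to do that bundling, sketched under assumptions: ship a copy of the module in a folder next to the script and put that folder on the import path before importing. The folder name vendor, the script name fetch_maps.py, and the helper add_vendor_dir are hypothetical, chosen only for this example.

```python
import os
import sys


def add_vendor_dir(script_path, dirname='vendor'):
    """Put <script folder>/<dirname> at the front of the import search path."""
    folder = os.path.dirname(os.path.abspath(script_path))
    vendor = os.path.join(folder, dirname)
    if vendor not in sys.path:
        sys.path.insert(0, vendor)
    return vendor


# In the real script you would pass __file__; a literal (hypothetical) name
# is used here so the snippet also runs in an interactive session.
vendor_path = add_vendor_dir('fetch_maps.py')

try:
    # Found only if a copy is bundled in the vendor folder or installed.
    from BeautifulSoup import BeautifulSoup as bs
except ImportError:
    bs = None  # not available: fall back to a standard-library approach
```

If the bundled file sits directly beside the script, even this is unnecessary, since Python already searches the script's own folder when importing.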

Answered 2025-04-17 by Python大师
