BeautifulSoup: scraping improperly nested <ul> lists

3 votes

1 answer

571 views

Asked 2025-04-17 07:36

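I'm new to BeautifulSoup and have spent the last three days trying to get the list of churches from this link: http://www.ucanews.com/diocesan-directory/html/ordinary-of-philippine-cagayandeoro-parishes.html.

The data does not seem to be properly structured; the tags are there only for presentation. In theory it should have a hierarchy: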

Parishes
    District
    (data)
        Vicariate
        (data)
            Church
            (data)

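What I actually see, however, is that each church starts with a bullet and each entry is separated by two line breaks. The field names I'm after are italicized and separated from the actual data by ":". Each unit entry (District | Vicariate | Parish) may have one or more data fields.

So far I can extract some of the data, but I can't get the name of the entity to show up.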

soup=BeautifulSoup(page)
for e in soup.table.tr.findAll('i'):
    print e.string, e.nextSibling

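Ultimately, I'd like to turn the data into columns: District, Vicariate, Parish, Address, Telephone, Titular, Parish Priest, <field8>, <field9>, <field99>

I'd appreciate any good advice that points me in the right direction.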

1 Answer

Unfortunately, this is somewhat complicated, because the format does not explicitly mark some of the data you need.

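Data model

Also, your understanding of the nesting is not quite right. The actual structure of the Catholic Church (not of this document) is more like: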

District (also called deanery or vicariate. In this case they all seem to be Vicariates Forane.)
    Cathedral, Parish, Oratory

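Note that a parish does not have to belong to a district or deanery, although it usually does. I think the document means that everything listed after a district belongs to that district, but you can't be completely sure.

There is also one entry that is not a church but a community (San Lorenzo Filipino-Chinese Community). Such a community has no separate identity or governance within the church (that is, it is not a building). Rather, it is a non-territorial group of people that a priest is specifically assigned to care for.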

Parsing

I think you should take a step-by-step approach:

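Find all the li elements; each one is an "item".
The name of the item is its first text node.
Find all the i elements: these are the keys (attribute names, column headings, and so on).
All text up to the next i (separated by br) is the value for that key.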
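As a rough illustration of these four steps, here is a minimal sketch that does not use BeautifulSoup at all. It swaps in Python 3's standard-library html.parser, which reports tags as a flat stream of events, so the question of how the bare li elements nest does not come up. The sample HTML, the ItemParser class, and the parse_items helper are all invented for this illustration; the real page may differ in ways the sketch does not handle. Unlike the answer's code, it strips the trailing ":" from each key.

```python
from html.parser import HTMLParser

# Invented sample that imitates the page's shape: bare <li> items with no
# enclosing <ul>, italic field names, and <br> between values.
SAMPLE_HTML = """
<table><tr><td>
<li>Example District<br>
<i>Address:</i> 1 Example St.<br><br>
<li>St. Example Parish<br>
<i>Address:</i> 2 Example Rd.<br>
<i>Telephone:</i> 000-0000<br>000-0001<br><br>
</td></tr></table>
"""


class ItemParser(HTMLParser):
    """Collect one list of (key, value) tuples per <li> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.items = []       # one list of tuples per <li>
        self.in_key = False   # True while inside an <i> element
        self.key = None       # most recent key seen in the current item

    def handle_starttag(self, tag, attrs):
        if tag == "li":       # step 1: every <li> starts a new item
            self.items.append([])
            self.key = None
        elif tag == "i":      # step 3: <i> holds a key
            self.in_key = True

    def handle_endtag(self, tag):
        if tag == "i":
            self.in_key = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.items:
            return            # skip whitespace and text before the first <li>
        item = self.items[-1]
        if self.in_key:
            self.key = text.rstrip(":")
        elif self.key is None:
            item.append(("_name", text))    # step 2: text before any key
        else:
            item.append((self.key, text))   # step 4: text after a key


def parse_items(html):
    parser = ItemParser()
    parser.feed(html)
    parser.close()
    return parser.items


for item in parse_items(SAMPLE_HTML):
    print(item)
```

One limitation to keep in mind: any text that appears before the first i of an item is recorded as "_name", so a name that spans several lines would produce several "_name" tuples.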

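One particular problem with this page is that its HTML is so bad that you need MinimalSoup to parse it correctly. In particular, BeautifulSoup thinks the li elements are nested inside one another, because the document contains no ol or ul at all!

The code below gives you a list of lists of tuples. Each inner list is one item, and each tuple in it is a ('key', 'value') pair for that item.

Once you have this data structure, you can normalize, transform, or nest it however you need, without having to think about the HTML any more.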

# Note: this is Python 2 code written against BeautifulSoup 3.
from BeautifulSoup import MinimalSoup
import urllib

fp = urllib.urlopen("http://www.ucanews.com/diocesan-directory/html/ordinary-of-philippine-cagayandeoro-parishes.html")
html = fp.read()
fp.close()

soup = MinimalSoup(html)

root = soup.table.tr.td

items = []
currentdistrict = None
# this loops through each "item"
for li in root.findAll(lambda tag: tag.name=='li' and len(tag.attrs)==0):
    attributes = []
    parishordistrict = li.next.strip()
    # look for string "District" to determine if district; otherwise it's something else under the district
    if parishordistrict.endswith(' District'):
        currentdistrict = parishordistrict
        attributes.append(('_isDistrict',True))
    else:
        attributes.append(('_isDistrict',False))

    attributes.append(('_name',parishordistrict))
    attributes.append(('_district',currentdistrict))

    # now loop through all attributes of this thing
    attributekeys = li.findAll('i')

    for i in attributekeys:
        key = i.string # normalize as needed. Will be 'Address:', 'Parochial Vicar:', etc
        # now continue among the siblings until we reach an <i> again.
        # these are "values" of this key
        # if you want a nested key:[values] structure, you can use a dict,
        # but beware of multiple <i> with the same name in your logic
        # named "node" rather than "next" to avoid shadowing the builtin
        node = i.nextSibling
        while node is not None and getattr(node, 'name', None) != 'i':
            if not hasattr(node, 'name') and getattr(node, 'string', None):
                value = node.string.strip()
                if value:
                    attributes.append((key, value))
            node = node.nextSibling
    items.append(attributes)

from pprint import pprint
pprint(items)
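
To show what "normalize and transform" could look like for the column layout asked about in the question, here is a small Python 3 sketch that flattens a list shaped like items into one row per non-district entry and writes it out as CSV. The sample data, the items_to_rows and rows_to_csv helpers, and the column names "District" and "Name" are assumptions made for this illustration. Joining repeated keys with "; " and dropping the district entries' own fields are design choices you may want to change.

```python
import csv
import io

# Invented sample in the same shape as the `items` list built above.
sample_items = [
    [("_isDistrict", True), ("_name", "North District"),
     ("_district", "North District")],
    [("_isDistrict", False), ("_name", "St. Example Parish"),
     ("_district", "North District"),
     ("Address:", "1 Example St."),
     ("Telephone:", "000-0000"), ("Telephone:", "000-0001")],
]


def items_to_rows(items):
    """Turn each non-district item into a flat dict keyed by column name."""
    rows = []
    for attributes in items:
        meta, fields = {}, {}
        for key, value in attributes:
            if not key:
                continue  # skip entries whose key could not be read
            if key.startswith("_"):
                meta[key] = value
            else:
                # "Address:" -> "Address"; repeated keys are kept in order
                fields.setdefault(key.strip().rstrip(":"), []).append(value)
        if meta.get("_isDistrict"):
            continue  # districts become a column on later rows, not rows
        row = {"District": meta.get("_district"), "Name": meta.get("_name")}
        for name, values in fields.items():
            row[name] = "; ".join(values)
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; columns appear in first-seen order."""
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = items_to_rows(sample_items)
print(rows_to_csv(rows))
```

Rows that lack a given field get an empty cell for it, so entries with different sets of fields can share one table.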

Answered 2025-04-17 by Python大师
