Using BeautifulSoup to find all links and group them by link target (href)

Posted 2024-05-15 13:16:06

I am using the BeautifulSoup package to parse an HTML body and find every <a> tag. What I want to do is collect all the links and group them by the target (href) of the <a>.

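For example: if http://www.google.com appears twice in the HTML body, I need those links grouped together, with the data-name attribute of each <a> listed. (data-name is something my editor adds when a user names a link.)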

import re

from bs4 import BeautifulSoup


def extract_links_from_mailing(mailing):
    # Search the HTML body and the plain-text version together
    content = "%s %s" % (mailing.html_body, mailing.plaintext)
    pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    links = []

    soup = BeautifulSoup(content, "html5lib")

    for link in soup.find_all('a'):
        if not link.get('no_track'):
            target = link.get('href')
            name = link.get('data-name')

            # Skip missing, internal, mailto, templated and in-page links.
            # "not target" is checked first so a missing href never reaches
            # the string tests below.
            if not target or any([
                'example.net' in target,
                target.startswith('mailto'),
                '{' in target,
                target.startswith('#')
            ]):
                continue

            target = pattern.search(target)

            # found a target and the target isn't already part of the list
            if target and not any(l['target'] == target.group() for l in links):
                links.append({
                    'name': name,
                    'target': target.group()
                })

    return links

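The code above returns a flat list of dictionaries, each holding a single name and a target. Because a target that is already in the list is skipped, only the first name for each target is kept.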

What I am aiming for is:

[
    {
        "target": "https://www.google.com",
        "names": ["Goog 1", "Goog 2"]
    },
    {
        "target": "http://www.yahoo.com",
        "names": ["Yahoo!"]
    },
]


1 answer

Answer 1 · posted 2024-05-15 13:16:06

You can use collections.defaultdict to group by target:

from collections import defaultdict

# Map each target URL to the set of names used for it
links = defaultdict(set)
for link in soup.find_all('a'):
    ...

    if target:
        links[target.group()].add(name)

As a result, links will be a dictionary whose keys are the targets and whose values are sets of names.
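
The question asks for a list of dictionaries with the names in order, whereas a set has no guaranteed order. One possible way to get that shape is sketched below. It assumes the (target, name) pairs have already been pulled out of the <a> tags by a loop like the one above; the function name group_links and the pairs argument are made up for this illustration and are not part of BeautifulSoup.

```python
from collections import defaultdict


def group_links(pairs):
    """Group (target, name) pairs into [{"target": ..., "names": [...]}, ...].

    `pairs` is a hypothetical input: any iterable of (target, name) tuples,
    e.g. collected while looping over soup.find_all('a').
    """
    # A list (rather than a set) keeps names in the order they were seen.
    grouped = defaultdict(list)
    for target, name in pairs:
        names = grouped[target]  # creates the entry the first time a target is seen
        # Ignore links without a data-name and avoid repeating a name
        if name is not None and name not in names:
            names.append(name)
    # Regular dicts preserve insertion order on Python 3.7+, so targets come
    # out in the order they first appeared.
    return [{"target": target, "names": names} for target, names in grouped.items()]


sample = [
    ("https://www.google.com", "Goog 1"),
    ("http://www.yahoo.com", "Yahoo!"),
    ("https://www.google.com", "Goog 2"),
]
result = group_links(sample)
```

With the sample pairs, result holds one entry per target, and the Google entry carries both names. If the order of names does not matter, defaultdict(set) as in the answer is simpler and removes duplicates on its own.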
