Skipping the contents of a specific element when running findAll() in Beautiful Soup

<div class='article'>
  <span class='author'>John doe</span>
  <h3>title</h3>
  (...)
  <div class='comments'>
    <div class='row'>
      <span class='author'>Whining anon</span>
      <div class='content'> (...) </div>
    </div>
  </div>
</div>

2 answers

User

Answer 1 · edited 2024-04-19 00:28:56

def AuthorNotInComments(tag):
    # Match tags with the 'author' class that have no ancestor with class 'comments'.
    c = tag.get('class')
    if not c:
        return False
    if 'author' in c:
        if tag.findParents(class_='comments'):
            return False
        return True
    return False

soup.findAll(AuthorNotInComments)

Or a case-insensitive "contains" version:

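import re

def AuthorNotInComments(tag):
    c = tag.get('class')
    if not c:
        return False
    # search() finds 'author' anywhere in the class string; match() only tests the start.
    p = re.compile('author', re.IGNORECASE)
    classes = " ".join(c)
    # re.IGNORECASE belongs inside re.compile(), not in the findParents() call.
    if p.search(classes) and not tag.findParents(class_=re.compile('comments', re.IGNORECASE)):
        return True
    return False

soup.findAll(AuthorNotInComments)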

Any suggestions for cleaning up the code are welcome. It would be great if someone could work out how to make it reusable, for example findAll(class_="test", not_under="junk").
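One way the reusable form asked for above might look is a small wrapper function. This is only a sketch: find_all_not_under and its not_under parameter are hypothetical names, not part of the Beautiful Soup API, and the sample markup is a trimmed copy of the HTML in the question. It relies on find_parent(), which returns None when no ancestor matches.

```python
from bs4 import BeautifulSoup


def find_all_not_under(soup, not_under, **kwargs):
    """Hypothetical helper: behaves like find_all(**kwargs), but drops any
    match that has an ancestor carrying the CSS class `not_under`."""
    return [
        tag for tag in soup.find_all(**kwargs)
        if tag.find_parent(class_=not_under) is None
    ]


# Trimmed copy of the markup from the question.
html = """
<div class="article">
  <span class="author">John doe</span>
  <h3>title</h3>
  <div class="comments">
    <div class="row">
      <span class="author">Whining anon</span>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
authors = find_all_not_under(soup, not_under="comments", class_="author")
print([tag.get_text(strip=True) for tag in authors])  # ['John doe']
```

Because the extra keyword arguments are passed straight through to find_all(), the same helper should work with name=, id=, or a filter function, not just class_.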

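User

Answer 2 · edited 2024-04-19 00:28:56

I think one way is to filter on .parent with a for loop and an if statement. This could be tidied up for your needs, but it works by using item.parent['class'] to get the class of the containing div for comparison. Note that it only inspects the immediate parent.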

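from bs4 import BeautifulSoup

# someHTML is the HTML string to parse
soup = BeautifulSoup(someHTML, 'html.parser')

results = soup.findAll(class_="author")

for item in results:
    # .get() avoids a KeyError when the parent has no class attribute
    if 'comments' not in item.parent.get('class', []):
        print(item)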

Or as a list comprehension:

clean_results = [item for item in results if 'comments' not in item.parent.get('class', [])]
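Since .parent looks only one level up, a nested author such as the one inside div.row in the question's markup would slip through this filter. The sketch below compares the immediate-parent check with an ancestor check based on find_parent(); the sample HTML is a trimmed copy of the question's markup, and the variable names parent_only and any_ancestor are just illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = """
<div class="article">
  <span class="author">John doe</span>
  <div class="comments">
    <div class="row">
      <span class="author">Whining anon</span>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
results = soup.find_all(class_="author")

# Immediate-parent check: the comment author's parent is div.row,
# not div.comments, so that span is still kept.
parent_only = [item for item in results
               if "comments" not in item.parent.get("class", [])]

# Ancestor check: find_parent() searches every enclosing tag and
# returns None when none of them has the 'comments' class.
any_ancestor = [item for item in results
                if item.find_parent(class_="comments") is None]

print(len(parent_only), len(any_ancestor))  # 2 1
```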
