Excluding hidden tags when scraping with BeautifulSoup (bs4)

1 answer

User

#1 · Posted 2024-05-15 23:21:48

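Using selenium would make this task easier, because it drives a real browser and therefore knows which elements are hidden and which are not.

Still, here is some basic code that you will probably need to refine. The idea is to parse the style tag to get a list of classes to exclude, keep a list of tag names to exclude, and check the style attribute of the parent element of every text node in the tr: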

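import re
from bs4 import BeautifulSoup

data = """ your html here """

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, 'html.parser')
tr = soup.tr

# get classes to exclude: every rule of the form .name{display:none}
classes_to_exclude = []
for line in tr.style.text.split():
    match = re.match(r'^\.(.*?)\{display:none\}', line)
    if match:
        classes_to_exclude.append(match.group(1))

# text inside these tags is never visible
tags_to_exclude = ['style', 'script']

texts = []
for item in tr.find_all(string=True):
    # skip the contents of <style> and <script>
    if item.parent.name in tags_to_exclude:
        continue

    # skip elements whose first class is hidden by the stylesheet
    class_ = item.parent.get('class')
    if class_ and class_[0] in classes_to_exclude:
        continue

    # skip elements hidden by an inline style
    if item.parent.get('style') == 'display:none':
        continue

    texts.append(item)

# join first, then strip: a list has no strip() method
print(''.join(texts).strip())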

This prints the visible text of the row, with the hidden elements left out.
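For something to run end to end, here is a minimal self-contained sketch of the same idea. The markup in SAMPLE_HTML and the helper name visible_text are made up for illustration and are not from the original question. Compared with the code above, it also tolerates whitespace inside the CSS rule and the inline style, and checks every class on the element instead of only the first one. It still looks only at the direct parent of each text node, so text nested deeper inside a hidden element would slip through.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this example only.
SAMPLE_HTML = """
<table><tr><td>
<style>
.hidden1{display:none}
.shown1{display:inline}
</style>
<span class="hidden1">12</span>
<span class="shown1">192</span>
<span style="display:none">99</span>
<span>.168</span>
<div style="display: none">7</div>
<span class="shown1">.0.1</span>
</td></tr></table>
"""

# Matches rules such as ".name{display:none}", allowing optional whitespace.
HIDDEN_RULE = re.compile(r"\.([\w-]+)\s*\{\s*display\s*:\s*none\s*;?\s*\}")


def visible_text(html):
    """Return the text whose direct parent is not hidden (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Collect the class names that the embedded stylesheets hide.
    hidden_classes = set()
    for style in soup.find_all("style"):
        hidden_classes.update(HIDDEN_RULE.findall(style.string or ""))

    parts = []
    for node in soup.find_all(string=True):
        parent = node.parent
        # Stylesheet and script bodies are never visible text.
        if parent.name in ("style", "script"):
            continue
        # Hidden through a class defined in a <style> block.
        if hidden_classes.intersection(parent.get("class") or []):
            continue
        # Hidden through an inline style, ignoring spaces and case.
        inline = (parent.get("style") or "").replace(" ", "").lower()
        if "display:none" in inline:
            continue
        parts.append(node.strip())
    return "".join(parts)


print(visible_text(SAMPLE_HTML))  # prints 192.168.0.1
```

Anything hidden by an external stylesheet or by JavaScript is invisible to this kind of static parsing, which is where a browser-driven tool such as selenium helps.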

See also:

BeautifulSoup Grab Visible Webpage Text
