How to remove table markup from a string in Python

1 answer

User

Reply #1 · Posted 2024-04-20 13:59:25

You can use HTMLParser, as shown below:

from html.parser import HTMLParser  # Python 3 (the module was named HTMLParser in Python 2)

s = \
"""
<html>
<p>Hi Team from the following Server :</p>

<table border="0" cellpadding="0" cellspacing="0" style="width:203pt">
    <tbody>
        <tr>
            <td style="height:15.0pt; width:203pt">ratsuite.sby.ibm.com</td>
        </tr>
    </tbody>
</table>

<p>&nbsp;</p>

<p>Please archive the following Project Areas :</p>

<table border="0" cellpadding="0" cellspacing="0" style="width:1436pt">
    <tbody>
        <tr>
            <td style="height:15.0pt; width:505pt">UNIT TEST - IBM OPAL 3.3 RC3</td>
            <td style="width:328pt">https://ratsuite.sby.ibm.com:9460/ccm</td>
            <td style="width:603pt">https://ratsuite.sby.ibm.com:9460/ccm/process/project-areas/_ckR-QJiUEeOXmZKjKhPE4Q</td>
        </tr>
    </tbody>
</table>
</html>
"""

# Create a subclass and override the handler methods.
class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self._last_tag = ''  # name of the most recently opened tag

    def handle_starttag(self, tag, attrs):
        # print("Encountered a start tag:", tag)
        self._last_tag = tag

    def handle_endtag(self, tag):
        # print("Encountered an end tag:", tag)
        self._last_tag = ''

    def handle_data(self, data):
        # print("Encountered some data:", data)
        # Print only text directly inside <p>; skip whitespace-only data
        # (Python 3 decodes &nbsp; to a non-breaking space and passes it here).
        if self._last_tag == 'p' and data.strip():
            print("<%s> tag data: %s" % (self._last_tag, data))

# Instantiate the parser and feed it some HTML.
parser = MyHTMLParser()
parser.feed(s)

Output:

<p> tag data: Hi Team from the following Server :
<p> tag data: Please archive the following Project Areas :
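
The parser above prints only the `<p>` text. If the goal is to drop the tables and keep whatever text sits outside them, a variant along the following lines might work. This is a minimal sketch for Python 3: the class name `TableStripper`, its `chunks` attribute, and the sample string are made up for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class TableStripper(HTMLParser):
    """Collect text that appears outside <table> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._table_depth = 0  # greater than 0 while inside one or more <table> tags
        self.chunks = []       # non-blank text found outside tables

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not nested inside a table.
        if self._table_depth == 0 and data.strip():
            self.chunks.append(data.strip())


# Assumed sample input, shortened from the HTML in the question.
sample = (
    "<p>Hi Team</p>"
    "<table><tr><td>host.example</td></tr></table>"
    "<p>&nbsp;</p>"
    "<p>Please archive</p>"
)

stripper = TableStripper()
stripper.feed(sample)
stripper.close()  # flush any buffered text
text_outside_tables = stripper.chunks
print(text_outside_tables)
```

Tracking a depth counter rather than a single flag keeps the sketch correct for nested tables as well.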

User

Reply #2 · Posted 2024-04-20 13:59:25

If you don't want to use an external library, you can remove the tables with the re module:

import re

output = re.sub(r'<table.+?</table>', '', text, flags=re.DOTALL)

Printed output:

<p>Hi Team from the following Server :</p>



<p>&nbsp;</p>

<p>Please archive the following Project Areas :</p>

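(There are also two blank lines that are not visible here.)

About the pattern: note that the + is immediately followed by ?, which makes the match non-greedy. Without it, the pattern would remove everything between the start of the first table and the end of the last table. re.DOTALL is required because the matched substring contains newline characters (\n), which . does not match by default.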
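
To make the difference concrete, the sketch below runs the same substitution three ways on a small made-up string (the variable names and the sample HTML are assumptions for illustration only): non-greedy with re.DOTALL, greedy with re.DOTALL, and non-greedy without re.DOTALL.

```python
import re

# Hypothetical sample: two tables with a paragraph between them.
html = (
    "<p>first</p>\n"
    "<table>\n<tr><td>a</td></tr>\n</table>\n"
    "<p>middle</p>\n"
    "<table>\n<tr><td>b</td></tr>\n</table>\n"
    "<p>last</p>"
)

# Non-greedy: each table is matched and removed on its own.
non_greedy = re.sub(r"<table.+?</table>", "", html, flags=re.DOTALL)

# Greedy: one match spans from the first <table to the last </table>,
# so the paragraph between the tables is removed too.
greedy = re.sub(r"<table.+</table>", "", html, flags=re.DOTALL)

# Without re.DOTALL, "." stops at newlines, so these multi-line tables never match.
no_dotall = re.sub(r"<table.+?</table>", "", html)

print("<p>middle</p>" in non_greedy)
print("<p>middle</p>" in greedy)
print(no_dotall == html)
```

Keep in mind that a regular expression like this is only a reasonable fit for simple, well-formed input. Nested tables or unusual markup are better handled by a real HTML parser.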

User

Reply #3 · Posted 2024-04-20 13:59:25

Use BeautifulSoup to parse the HTML.

For example:

from bs4 import BeautifulSoup

text="""<p>Hi Team from the following Server :</p>

<table border="0" cellpadding="0" cellspacing="0" style="width:203pt">
    <tbody>
        <tr>
            <td style="height:15.0pt; width:203pt">ratsuite.sby.ibm.com</td>
        </tr>
    </tbody>
</table>

<p>&nbsp;</p>

<p>Please archive the following Project Areas :</p>

<table border="0" cellpadding="0" cellspacing="0" style="width:1436pt">
    <tbody>
        <tr>
            <td style="height:15.0pt; width:505pt">UNIT TEST - IBM OPAL 3.3 RC3</td>
            <td style="width:328pt">https://ratsuite.sby.ibm.com:9460/ccm</td>
            <td style="width:603pt">https://ratsuite.sby.ibm.com:9460/ccm/process/project-areas/_ckR-QJiUEeOXmZKjKhPE4Q</td>
        </tr>
    </tbody>
</table>"""

soup = BeautifulSoup(text, "html.parser")
for p in soup.find_all("p"):
    print(p.text)

Output:

Hi Team from the following Server :

Please archive the following Project Areas :
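
The snippet above extracts the paragraph text. If you would rather remove the tables and keep the remaining markup, one possible approach is to call `decompose()` on each `<table>` tag. This is a rough sketch that assumes the `bs4` package is installed; the sample string and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Assumed sample input, shortened from the HTML in the question.
html = (
    "<p>Server:</p>"
    "<table><tbody><tr><td>host.example</td></tr></tbody></table>"
    "<p>Project areas:</p>"
)

soup = BeautifulSoup(html, "html.parser")

# Remove every <table> element, including everything nested inside it.
for table in soup.find_all("table"):
    table.decompose()

cleaned = str(soup)
print(cleaned)
```

Because the tables are removed from the parsed tree instead of by pattern matching, this keeps working when the markup spans several lines or when tables are nested.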
