How to remove table tags from a string in Python

Published 2024-04-20 13:59:25

The input is as follows:

text="""Hi Team from the following Server :

<table border="0" cellpadding="0" cellspacing="0" style="width:203pt">
    <tbody>
        <tr>
            <td style="height:15.0pt; width:203pt">ratsuite.sby.ibm.com</td>
        </tr>
    </tbody>
</table>

<p>&nbsp;</p>

<p>Please archive the following Project Areas :</p>

<table border="0" cellpadding="0" cellspacing="0" style="width:1436pt">
    <tbody>
        <tr>
            <td style="height:15.0pt; width:505pt">UNIT TEST - IBM OPAL 3.3 RC3</td>
            <td style="width:328pt">https://ratsuite.sby.ibm.com:9460/ccm</td>
            <td style="width:603pt">https://ratsuite.sby.ibm.com:9460/ccm/process/project-areas/_ckR-QJiUEeOXmZKjKhPE4Q</td>
        </tr>
    </tbody>
</table>"""

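In the output I only need the following two lines. I want to remove the table tags, together with the data inside them, in Python:

Hi Team from the following Server :

Please archive the following Project Areas :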


3 answers
User
Answer #1 · Posted 2024-04-20 13:59:25

You can use HTMLParser from the standard library, as shown below:

from html.parser import HTMLParser  # Python 3; on Python 2 use: from HTMLParser import HTMLParser

s = \
"""
<html>
<p>Hi Team from the following Server :</p>

<table border="0" cellpadding="0" cellspacing="0" style="width:203pt">
    <tbody>
        <tr>
            <td style="height:15.0pt; width:203pt">ratsuite.sby.ibm.com</td>
        </tr>
    </tbody>
</table>

<p>&nbsp;</p>

<p>Please archive the following Project Areas :</p>

<table border="0" cellpadding="0" cellspacing="0" style="width:1436pt">
    <tbody>
        <tr>
            <td style="height:15.0pt; width:505pt">UNIT TEST - IBM OPAL 3.3 RC3</td>
            <td style="width:328pt">https://ratsuite.sby.ibm.com:9460/ccm</td>
            <td style="width:603pt">https://ratsuite.sby.ibm.com:9460/ccm/process/project-areas/_ckR-QJiUEeOXmZKjKhPE4Q</td>
        </tr>
    </tbody>
</table>
</html>
"""

# Create a subclass and override the handler methods.
class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        # Name of the most recently opened tag; '' once any tag has closed.
        self._last_tag = ''

    def handle_starttag(self, tag, attrs):
        self._last_tag = tag

    def handle_endtag(self, tag):
        self._last_tag = ''

    def handle_data(self, data):
        # Print only non-blank text directly inside a <p>. The strip() check
        # skips the <p>&nbsp;</p> paragraph, which Python 3 delivers here as '\xa0'.
        if self._last_tag == 'p' and data.strip():
            print("<%s> tag data: %s" % (self._last_tag, data))

# Instantiate the parser and feed it some HTML.
parser = MyHTMLParser()
parser.feed(s)

Output:

<p> tag data: Hi Team from the following Server :
<p> tag data: Please archive the following Project Areas :
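
The parser above depends on the first line being wrapped in <p>, which the input in the question does not do. As a rough sketch of one way around that, assuming Python 3, the variant below ignores everything inside a <table> and keeps all other non-blank text. The names TextOutsideTables and text_outside_tables are made up for this illustration, and the sample string is a shortened version of the question's input.

```python
from html.parser import HTMLParser


class TextOutsideTables(HTMLParser):
    """Collect text that is not nested inside any <table> element."""

    def __init__(self):
        super().__init__()
        self._table_depth = 0  # greater than 0 while inside a <table>
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1

    def handle_data(self, data):
        # strip() also removes the non-breaking space that &nbsp; decodes to.
        text = data.strip()
        if text and self._table_depth == 0:
            self.lines.append(text)


def text_outside_tables(html):
    """Return the non-blank text chunks found outside all tables."""
    parser = TextOutsideTables()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.lines


# Shortened version of the input from the question (first line has no <p>).
sample = """Hi Team from the following Server :

<table border="0"><tbody><tr><td>ratsuite.sby.ibm.com</td></tr></tbody></table>

<p>&nbsp;</p>

<p>Please archive the following Project Areas :</p>

<table><tbody><tr><td>UNIT TEST - IBM OPAL 3.3 RC3</td></tr></tbody></table>"""

for line in text_outside_tables(sample):
    print(line)
```

Counting table depth instead of remembering only the last tag means nested tables are skipped as well, and text that sits outside any tag is still collected.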
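User
Answer #2 · Posted 2024-04-20 13:59:25

If you don't want to use an external library, you can remove the tables with the re module:

import re

output = re.sub(r'<table.+?</table>', '', text, flags=re.DOTALL)

Printing output gives: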

Hi Team from the following Server :



<p>&nbsp;</p>

<p>Please archive the following Project Areas :</p>

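(plus two blank lines at the end that are not visible here).

About the pattern: note that the + is immediately followed by ?, which makes the match non-greedy. Otherwise it would remove everything between the start of the first table and the end of the last table. re.DOTALL is required because the substring being matched contains newline characters (\n), which . does not match by default.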
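
To make the effect of the ? and of re.DOTALL concrete, here is a small sketch that runs three variants of the pattern on a toy string. The html string and the variable names are invented for this illustration and are not part of the question's input.

```python
import re

# Toy input: two tables separated by a line that should survive.
html = (
    "first line\n"
    "<table>\n<tr><td>a</td></tr>\n</table>\n"
    "middle line\n"
    "<table>\n<tr><td>b</td></tr>\n</table>\n"
    "last line\n"
)

# Non-greedy with DOTALL: each table is removed on its own.
non_greedy = re.sub(r"<table.+?</table>", "", html, flags=re.DOTALL)

# Greedy with DOTALL: one match runs from the first <table to the last </table>.
greedy = re.sub(r"<table.+</table>", "", html, flags=re.DOTALL)

# Non-greedy without DOTALL: "." cannot cross a newline, so nothing matches.
no_dotall = re.sub(r"<table.+?</table>", "", html)

print(repr(non_greedy))
print(repr(greedy))
print(repr(no_dotall))
```

Only the first variant keeps "middle line" while dropping both tables. The greedy variant swallows it, and the variant without re.DOTALL returns the string unchanged.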

User
Answer #3 · Posted 2024-04-20 13:59:25

Use BeautifulSoup to parse the HTML.

For example:

from bs4 import BeautifulSoup

text="""<p>Hi Team from the following Server :</p>

<table border="0" cellpadding="0" cellspacing="0" style="width:203pt">
    <tbody>
        <tr>
            <td style="height:15.0pt; width:203pt">ratsuite.sby.ibm.com</td>
        </tr>
    </tbody>
</table>

<p>&nbsp;</p>

<p>Please archive the following Project Areas :</p>

<table border="0" cellpadding="0" cellspacing="0" style="width:1436pt">
    <tbody>
        <tr>
            <td style="height:15.0pt; width:505pt">UNIT TEST - IBM OPAL 3.3 RC3</td>
            <td style="width:328pt">https://ratsuite.sby.ibm.com:9460/ccm</td>
            <td style="width:603pt">https://ratsuite.sby.ibm.com:9460/ccm/process/project-areas/_ckR-QJiUEeOXmZKjKhPE4Q</td>
        </tr>
    </tbody>
</table>"""

soup = BeautifulSoup(text, "html.parser")
for p in soup.find_all("p"):
    print(p.text)

Output:

Hi Team from the following Server :

Please archive the following Project Areas :
