How can I remove the <table> structure from this example with Python?

paragraph = ''' <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br /><br /> <table> <tr> <td> text title </td> <td> text title 2 </td> </tr> </table> <p> lorem ipsum</p> '''

3 Answers

User

#1 · edited 2024-06-16 13:45:12

You can also try this basic string slicing approach:

paragraph = (paragraph[:paragraph.find('<table>')] +   # Everything before the first '<table>'
             paragraph[paragraph.find('</table>') +    # Index where '</table>' starts
             len('</table>'):])                        # Skip past the closing tag itself

print(paragraph)

The same approach can also be used for basic text extraction.
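As a rough illustration of the text extraction mentioned above, the same find-and-slice idea can pull out the text between two markers. The helper name text_between is hypothetical, and this is only a sketch that assumes the markers appear literally in the string (no attributes, no case differences).

```python
def text_between(s, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Returns None if either marker is missing.
    """
    start = s.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)          # move past the opening marker
    end = s.find(end_marker, start)     # search only after the opening marker
    if end == -1:
        return None
    return s[start:end]


print(text_between("<td> text title </td>", "<td>", "</td>"))
```

Returning None instead of slicing with -1 avoids silently producing a wrong substring when a marker is absent.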

User

#2 · edited 2024-06-16 13:45:12

Using a regex for this gets complicated, so here is a simple, brute-force way I would suggest:

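def remove_table(s):
    # Recursively cut out every '<table>...</table>' block
    left_index = s.find('<table>')
    if left_index == -1:
        return s
    right_index = s.find('</table>', left_index)
    if right_index == -1:
        return s  # No closing tag: leave the rest untouched
    return s[:left_index] + remove_table(s[right_index + len('</table>'):])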

The result may contain some blank lines.
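For comparison, here is what a regex version might look like, including a pass that collapses the leftover blank lines. This is a sketch, not a robust HTML parser: the non-greedy match stops at the first closing tag, so nested tables are not handled correctly. The names TABLE_RE and remove_tables_regex are made up for this example.

```python
import re

# Match an opening <table ...> tag, anything after it (non-greedy), and the
# first closing </table>. DOTALL lets '.' span newlines.
TABLE_RE = re.compile(r"<table\b[^>]*>.*?</table\s*>", re.IGNORECASE | re.DOTALL)


def remove_tables_regex(html):
    without_tables = TABLE_RE.sub("", html)
    # Collapse runs of blank (or whitespace-only) lines into one newline
    return re.sub(r"\n\s*\n", "\n", without_tables)


sample = "<p>a</p>\n<table class='x'>\n<tr><td>t</td></tr>\n</TABLE>\n<p>b</p>"
print(remove_tables_regex(sample))
```

Unlike the find-based version, this also tolerates attributes on the opening tag and upper-case tag names.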

User

#3 · edited 2024-06-16 13:45:12

You can use BeautifulSoup, in particular its extract() method:

In [16]: from bs4 import BeautifulSoup

In [17]: soup = BeautifulSoup("""<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br /><br />
   ....: <table>
   ....: <tr>
   ....: <td>
   ....:     text title or some
   ....: </td>
   ....: </tr>
   ....: </table>
   ....: <p> lorem ipsum</p>""")

In [18]: _ = soup.table.extract()

In [19]: soup
Out[19]: 
<html><body><p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br/><br/>
</p>
<p> lorem ipsum</p></body></html>
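Note that soup.table only refers to the first table in the document. If installing bs4 is not an option, a similar result can be sketched with the standard library's html.parser. The class TableStripper and the function strip_tables are hypothetical names for this example, and unlike BeautifulSoup it does not repair broken markup; it only re-emits what it sees, minus the tables.

```python
from html.parser import HTMLParser


class TableStripper(HTMLParser):
    """Re-emit parsed markup, skipping everything inside <table>...</table>."""

    def __init__(self):
        # convert_charrefs=False so entities reach the handlers below unchanged
        super().__init__(convert_charrefs=False)
        self.depth = 0   # how many <table> elements we are currently inside
        self.parts = []  # markup fragments to keep

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif self.depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br />
        if self.depth == 0 and tag != "table":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "table":
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.parts.append(f"</{tag}>")  # tag names arrive lower-cased

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.parts.append(f"&#{name};")


def strip_tables(html):
    parser = TableStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_tables("<p>intro<br /></p> <table><tr><td>cell</td></tr></table> <p>outro</p>"))
```

Tracking a nesting depth means nested tables are removed together with their parent, which the find-based and regex approaches do not handle. Comments and doctype declarations are dropped by this sketch because no handlers are defined for them.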
