Beautiful Soup ignores multiple spaces between tags

Posted 2022-12-01 04:33:08


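When I use BeautifulSoup to get the text from HTML, I find that it ignores multiple spaces. In the example below there are 2 spaces between </seg> and <seg>, but the output contains only one. No matter how many spaces there are, it outputs a single space.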

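import bs4

# Two spaces sit between the two <seg> elements.
text = "<line><seg>aaa</seg>  <seg>bbb</seg></line>"
# No parser is named, so Beautiful Soup chooses one from those installed.
soup = bs4.BeautifulSoup(text)
print(soup.text)                 # all text joined into one string
print(soup.find_all(text=True))  # the individual text nodes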

The output is:

aaa bbb
['aaa', ' ', 'bbb']

But what I actually want is:

aaa  bbb
['aaa', '  ', 'bbb']

Any ideas?

Is there an equivalent in JavaScript, i.e. a way of getting the text that also ignores multiple spaces outside tags?


1 answer
User
Answer #1 · Posted 2022-12-01 04:33:08

This is normal behavior for an HTML parser.

See:

https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Whitespace

Quoting some relevant parts:

HTML largely ignores whitespace?

In the case of HTML, whitespace is largely ignored — whitespace in between words is treated as a single character, and whitespace at the start and end of elements and outside elements is ignored.

Creating space around and inside elements is the job of CSS.

What does happen to whitespace?

They don't just disappear, however.

Any whitespace characters that are outside of HTML elements in the original document are represented in the DOM. This is needed internally so that the editor can preserve formatting of documents. This means that:

There will be some text nodes that contain only whitespace, and some text nodes will have whitespace at the beginning or end.
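To make the point about whitespace-only text nodes concrete, here is a minimal sketch that uses only Python's standard-library `html.parser` rather than Beautiful Soup. `TextNodeCollector` and `text_nodes` are hypothetical names introduced for illustration. Whether Beautiful Soup reports the same nodes likely depends on which underlying parser it ends up using, which this sketch does not test.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Hypothetical helper: record every text node exactly as the parser sees it."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        # Called with the raw text found between tags, whitespace included.
        self.nodes.append(data)


def text_nodes(markup):
    """Return the list of text nodes found in `markup`."""
    collector = TextNodeCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered at the end of input
    return collector.nodes


nodes = text_nodes("<line><seg>aaa</seg>  <seg>bbb</seg></line>")
print(nodes)           # the two-space node is kept as its own entry
print("".join(nodes))  # joining the nodes keeps both spaces
```

With this standard-library parser the whitespace between the two <seg> elements arrives as a separate text node, so any collapsing would have to happen in a later step.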

How does CSS process whitespace?

Most whitespace characters are ignored, not all of them are. [...] There are rules in the browser engine that decide which whitespace characters are useful and which aren’t — these are specified at least in part in CSS Text Module Level 3, and especially the parts about the CSS white-space property and whitespace processing details.
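The collapsing described in the quoted passages can be approximated in a few lines. This is only a rough sketch: `collapse_whitespace` is a hypothetical helper, and the set of characters it treats as whitespace (space, tab, newline, carriage return, form feed) is an assumption. The actual CSS rules cover more cases, such as whitespace at the start and end of lines.

```python
import re

# Assumption: these five characters count as collapsible whitespace.
_WHITESPACE_RUN = re.compile(r"[ \t\n\r\f]+")


def collapse_whitespace(text):
    """Replace every run of whitespace characters with a single space."""
    return _WHITESPACE_RUN.sub(" ", text)


print(collapse_whitespace("aaa  bbb"))
```

Applied to the text from the question, this turns the two spaces between aaa and bbb into one, which matches the output the asker observed.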