Java: avoid removing spaces and newlines when parsing HTML with jsoup


I have the sample code below:

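import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// A Java string literal cannot span lines, so the lines are joined with \n.
String sample = "<html>\n"
        + "<head>\n"
        + "</head>\n"
        + "<body>\n"
        + "This is a sample on              parsing html body using jsoup\n"
        + "This is a sample on              parsing html body using jsoup\n"
        + "</body>\n"
        + "</html>";

Document doc = Jsoup.parse(sample);
String output = doc.body().text();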

The output I get is:

This is a sample on parsing html body using jsoup This is a sample on parsing html body using jsoup

But I want the output to be:

This is a sample on              parsing html body using jsoup
This is a sample on              parsing html body using jsoup

How can I parse it to get this output? Or is there another way to do this in Java?


2 answers

# Answer 1

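You can disable pretty printing on the document to get the desired output. You also have to change .text() to .html(), which returns the body's inner HTML rather than its normalized text: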

Document doc = Jsoup.parse(sample);
doc.outputSettings(new Document.OutputSettings().prettyPrint(false));
String output = doc.body().html();

# Answer 2

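The HTML specification requires runs of multiple whitespace characters to be collapsed into a single space. So when your sample is parsed, the parser correctly eliminates the extra whitespace characters.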

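I don't think you can change how the parser works. You could add a preprocessing step that replaces runs of multiple spaces with non-breaking spaces (&nbsp;), which are not collapsed. The side effect, of course, is that those spaces are non-breaking (which doesn't matter if you really only want the rendered text, as in doc.body().text()).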
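The preprocessing step described in this answer could look roughly like the sketch below. It uses only the Java standard library; the class name NbspPreprocessor and the method preserveSpaceRuns are made up for illustration and are not part of jsoup. It assumes that only runs of two or more plain spaces need protecting, and it does not distinguish text from markup, so a run of spaces inside a tag or attribute value would be rewritten too.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: protects runs of spaces
// before the HTML is handed to a parser.
public class NbspPreprocessor {
    // Matches runs of two or more plain space characters.
    private static final Pattern SPACE_RUN = Pattern.compile(" {2,}");

    /** Replaces every space in a run of 2+ spaces with the &nbsp; entity. */
    public static String preserveSpaceRuns(String html) {
        Matcher m = SPACE_RUN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Build one &nbsp; per space so the run keeps its width.
            StringBuilder entities = new StringBuilder();
            for (int i = 0; i < m.group().length(); i++) {
                entities.append("&nbsp;");
            }
            m.appendReplacement(out, Matcher.quoteReplacement(entities.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<body>This is a sample on   parsing</body>";
        // The result would then be passed to Jsoup.parse(...) as in the question.
        System.out.println(preserveSpaceRuns(sample));
    }
}
```

Single spaces are left alone, so ordinary word spacing is unchanged. Whether .text() then returns the non-breaking spaces untouched may depend on the jsoup version, so it is worth checking the output against the version you use. Newlines are not handled by this sketch.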
