html: How can the meaning of tags (such as <br>, <ul>, <li>, <p>) be preserved when reading HTML with the JSoup library in Java?


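I am writing a program that extracts certain information from a local HTML file. The information is then displayed in a Java JFrame and exported to an Excel file. (I am using the JSoup 1.9.2 library for HTML parsing.)

The problem I have run into is that whenever I extract anything from the HTML file, JSoup does not take HTML tags such as line-break tags, list tags, and so on into account. As a result, all the information comes out as one big block of data, without any proper line breaks or formatting.

For example, if this is the data I want to read: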

Title

Line 1

Line 2

element 1
element 2

the returned data looks like this:

Title Line 1 Line 2 Unordered List element 1 element 2 (i.e. all the HTML tags are ignored)

This is the code I use to read it:

private String getTitle(Document doc) { // doc is the local HTML file
    Elements title = doc.select(".title");
    for (Element id : title) {
        return id.text();
    }
    return "No Title Available ";
}

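Can anyone suggest a way to preserve the meaning behind the HTML tags? That way I could both display the data in the JFrame and export it to Excel in a more readable format.

Thanks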

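# Answer 1

To give everyone an update: I found a solution (more of a workaround) for the formatting problem. What I do now is extract the full HTML with id.html() and store it in a String object. I then use the String method replaceAll() with a regular expression to strip all the HTML tags without collapsing everything onto one line. The call looks like replaceAll("\\<[^>]*>",""). My whole processHTML() function looks like: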

private String processHTML(String initial) { // initial is the String with all the HTML tags
        String modified = initial;
        modified = modified.replaceAll("\\<[^>]*>", ""); // regular expression that removes every tag
        modified = modified.trim(); // get rid of any unwanted space before and after the needed data
        // The replace() calls below decode HTML entities that might be left in the data extracted from the HTML
        modified = modified.replace("&nbsp;", " ");
        modified = modified.replace("&lt;", "<");
        modified = modified.replace("&gt;", ">");
        modified = modified.replace("&quot;", "\"");
        modified = modified.replace("&apos;", "'");
        modified = modified.replace("&cent;", "¢");
        modified = modified.replace("&copy;", "©");
        modified = modified.replace("&reg;", "®");
        // Decode &amp; last so that text such as "&amp;quot;" is not decoded twice
        modified = modified.replace("&amp;", "&");
        return modified;
    }
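
The workaround above keeps only the line breaks that are already present in the string returned by id.html(); the tags themselves are simply dropped. If the goal is to carry the meaning of tags such as <br>, <p>, <ul> and <li> over into plain text, one option is to turn those tags into line breaks and bullet markers before stripping the rest. The following is a minimal sketch under stated assumptions, not JSoup functionality: the class name `HtmlToText`, the method `toPlainText`, the choice of "- " as a list marker, and the set of tags treated as block-level are all hypothetical choices made for illustration. It uses only the JDK and, like any regex-based approach, is only suitable for small, well-formed fragments.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of JSoup): converts a simple HTML fragment,
// e.g. the string returned by id.html(), into plain text with line breaks.
final class HtmlToText {
    private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");
    private static final Pattern LI_OPEN = Pattern.compile("(?i)<li(\\s[^>]*)?>");
    private static final Pattern BLOCK_CLOSE =
            Pattern.compile("(?i)</(p|div|li|ul|ol|h[1-6])\\s*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String toPlainText(String html) {
        // Whitespace in the HTML source carries no layout meaning, so collapse it first.
        String s = html.replaceAll("\\s+", " ");
        s = BR.matcher(s).replaceAll("\n");          // <br> becomes a line break
        s = LI_OPEN.matcher(s).replaceAll("- ");     // <li> becomes a bullet marker (assumed style)
        s = BLOCK_CLOSE.matcher(s).replaceAll("\n"); // end of a block element ends the line
        s = ANY_TAG.matcher(s).replaceAll("");       // drop every remaining tag
        // Decode a few common entities after the tags are gone; &amp; goes last
        // so that "&amp;lt;" ends up as "&lt;" rather than "<".
        s = s.replace("&nbsp;", " ").replace("&lt;", "<").replace("&gt;", ">")
             .replace("&quot;", "\"").replace("&amp;", "&");
        // Trim spaces around line breaks and collapse runs of empty lines.
        s = s.replaceAll(" *\n *", "\n").replaceAll("\n{2,}", "\n");
        return s.trim();
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1><p>Line 1</p><p>Line 2</p>"
                + "<ul><li>element 1</li><li>element 2</li></ul>";
        System.out.println(toPlainText(html));
    }
}
```

For the sample fragment in main, the title, the two lines, and the two list items each end up on their own line, with the list items prefixed by "- ". The resulting string can be passed on as-is to a multi-line Swing component or split on "\n" before being written to spreadsheet cells.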

Thanks again for helping me with this.

Cheers
