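<p>After a lot of thought, I am posting my solution here. It works very well on the various examples I have tried. The <code>BeautifulSoup</code> approach would work if I knew in advance which tags the text had to be extracted from (I could then apply <code>soup.findAll(specific_tag)</code>), but that is not my situation. There can also be several different tags from which I have to extract text. For example:</p>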
<pre><code><p>Science</p><div> Biology </div><div>Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. <span style="text-decoration: underline;">Nature Methods</span> 2017,</div>
</code></pre>
<p>In the example above, I want to extract the text from both the <code><p></code> tag and the <code><div></code> tags.</p>
<p>I modified the code above to handle this case:</p>
^{pr2}$
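<p>For readers who want something runnable, here is a minimal sketch of what such a parser could look like, built on the standard library's <code>html.parser.HTMLParser</code>. It is an assumption, not the original listing: the grouping rule (a start tag opens a new group unless text has already been collected, and a group closes when its opening tag is closed), the <code>VOID_TAGS</code> set, and the underscore-prefixed helper names are all hypothetical. Unlike the output shown in this post, it also records inline tags such as <code>font</code> in the tree.</p>

```python
from html.parser import HTMLParser

# Assumed set of tags that never get a closing tag; they are skipped
# so that the nesting depth stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HTMLStripper(HTMLParser):
    """Hypothetical sketch: collect text per outermost text-bearing tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tree = []     # finished groups of tag names
        self.data = []     # one joined text string per group that held text
        self._group = []   # tag names of the group being built
        self._chunks = []  # stripped text pieces of the group being built
        self._depth = 0    # open tags inside the current group

    def _flush(self):
        # Store the current group and its text, then start afresh.
        if self._group:
            self.tree.append(self._group)
        if self._chunks:
            self.data.append(" ".join(self._chunks))
        self._group, self._chunks, self._depth = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # No text collected yet: this tag opens a new group.
        # Otherwise it is treated as inline markup of the current group.
        if not self._chunks:
            self._flush()
        self._group.append(tag)
        self._depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self._group.append(tag)
        self._depth -= 1
        # The group is complete once its opening tag has been closed.
        if self._depth <= 0:
            self._flush()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._chunks.append(text)

    def get_tree(self):
        self.close()   # make the parser process any buffered input
        self._flush()
        return self.tree

    def get_data(self):
        self.close()
        self._flush()
        return self.data


if __name__ == "__main__":
    sample = "<p>Science</p><div> Biology </div>"
    parser = HTMLStripper()
    parser.feed(sample)
    print(parser.get_tree())
    print(parser.get_data())
```

<p>The sketch needs no list of tags up front, which is the point of the approach; the price is that text outside any tag and unbalanced markup are handled only crudely.</p>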
<p>Running the code on the example above:</p>
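<pre><code># mystr holds the HTML string to parse
parser = HTMLStripper()
parser.feed(mystr)
l1 = parser.get_tree()    # nested list of the tags that were encountered
feed = parser.get_data()  # list of extracted text blocks
print(l1)
print("\n", mystr)
print("\n", feed)
print("\n\n")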
</code></pre>
<p>The output, first for a list-based HTML string:</p>
<pre><code>[['ul'], ['li', 'li'], ['li', 'li'], ['li', 'li'], ['li', 'li'], ['ul']]
<ul><li>Registered Nurse in <font>Missouri</font>, License number <font>xxxxxxxx</font>, <font>2017</font></li><li>AHA Advanced Cardiac Life Support (ACLS) Certification <font>2016-2018</font></li><li>AHA PALS - Pediatric Advanced Life Support 2017-2019</li><li>AHA Basic Life Support 2016-2018</li></ul>
['Registered Nurse in Missouri , License number xxxxxxxx , 2017', 'AHA Advanced Cardiac Life Support (ACLS) Certification 2016-2018', 'AHA PALS - Pediatric Advanced Life Support 2017-2019', 'AHA Basic Life Support 2016-2018']
</code></pre>
<p>It also works for an HTML string with mixed tags:</p>
<pre><code>[['p', 'p'], ['div', 'div'], ['div', 'span', 'span', 'div']]
<p>Science</p><div> Biology </div><div>Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. <span style="text-decoration: underline;">Nature Methods</span> 2017,</div>
['Science', 'Biology', 'Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. Nature Methods 2017,']
</code></pre>
<p>I would love to see a corner case so that I can improve the text-extraction logic.</p>