Stripping HTML tags from free-flowing text to form separate sentences


I want to extract sentences from a large chunk of text. My text looks something like this:

<pre><code><ul><li>Registered Nurse in Missouri, License number xxxxxxxx, 2017</li><li>AHA Advanced Cardiac Life Support (ACLS) Certification 2016-2018</li><li>AHA PALS - Pediatric Advanced Life Support 2017-2019</li><li>AHA Basic Life Support 2016-2018</li></ul>
</code></pre>

I want to extract proper sentences from the text above, so the expected output would be a list with one string per sentence.

I am using Python's built-in <code>HTMLParser</code> module to strip the HTML from the text. Here is my code:

<pre><code>from html.parser import HTMLParser
import re


class HTMLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.fed = []

    def handle_data(self, chunk):
        # Called once per run of text between two tags
        self.fed.append(chunk.strip())

    def get_data(self):
        # Drop chunks that are empty after stripping
        return [x for x in self.fed if x]


def strip_html_tags(html):
    try:
        s = HTMLStripper()
        s.feed(html)
        return s.get_data()
    except Exception:
        # Fall back to removing anything that looks like a tag
        p = re.compile(r'<.*?>')
        return p.sub('', html)
</code></pre>

Calling <code>strip_html_tags</code> on the text above gives the following result (which is in fact what the current implementation is supposed to produce):

<pre><code>['Registered Nurse in', 'Missouri', ', License number', 'xxxxxxx', ',', '2017', 'AHA Advanced Cardiac Life Support (ACLS) Certification', '2016-2018', 'AHA PALS - Pediatric Advanced Life Support 2017-2019', 'AHA Basic Life Support 2016-2018']
</code></pre>

I cannot strictly check for <code><ul> or <li> tags</code>, because different texts may contain different HTML tags. Is there a way to split the text only on the outer <code>html-tags</code>, as shown above, rather than on every <code>html-tag</code> encountered?

Thanks in advance.
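A note on why the result is fragmented: `HTMLParser` calls `handle_data` once for each run of text between two tags, so any inline tag inside a sentence produces a separate chunk. The snippet below is only an illustration of that behaviour. The `ChunkLogger` class and the `<b>` tag in the sample string are made up for the demo and are not taken from the question.

```python
from html.parser import HTMLParser


class ChunkLogger(HTMLParser):
    """Hypothetical helper: records every chunk passed to handle_data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # One call per text run; a tag on either side ends the run.
        self.chunks.append(data)


logger = ChunkLogger()
# The inline <b> tag is an assumption, added to show the split.
logger.feed("<li>Registered Nurse in <b>Missouri</b>, 2017</li>")
logger.close()
print(logger.chunks)
# ['Registered Nurse in ', 'Missouri', ', 2017']
```

Collecting one list entry per `handle_data` call therefore cannot keep a sentence together; the chunks have to be grouped by some notion of an enclosing element.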


1 Answer


After giving it some thought, I am posting my solution here. It works quite well for the various examples I have tried.

If I knew in advance which tags I had to extract text from, the <code>BeautifulSoup</code> approach would work (I could simply apply <code>soup.findAll(specific_tag)</code>), but that is not my case. There can also be several different tags that I have to extract text from. For example:

<pre><code>Science<div> Biology </div><div>Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. Nature Methods 2017,</div>
</code></pre>

In the example above, I want to extract text from both the <code><p></code> tag and the <code><div></code> tag.

I modified the code above to handle this case. Running the code on the example above:

<pre><code>parser = HTMLStripper()
parser.feed(mystr)
l1 = parser.get_tree()
feed = parser.get_data()
print(l1)
print("\n", mystr)
print("\n", feed)
print("\n\n")
</code></pre>

And the output:

<pre><code>[['ul'], ['li', 'li'], ['li', 'li'], ['li', 'li'], ['li', 'li'], ['ul']]

<ul><li>Registered Nurse in Missouri, License number xxxxxxxx, 2017</li><li>AHA Advanced Cardiac Life Support (ACLS) Certification 2016-2018</li><li>AHA PALS - Pediatric Advanced Life Support 2017-2019</li><li>AHA Basic Life Support 2016-2018</li></ul>

['Registered Nurse in Missouri , License number xxxxxxxx , 2017', 'AHA Advanced Cardiac Life Support (ACLS) Certification 2016-2018', 'AHA PALS - Pediatric Advanced Life Support 2017-2019', 'AHA Basic Life Support 2016-2018']
</code></pre>

It also works for HTML strings with mixed tags:

<pre><code>[['p', 'p'], ['div', 'div'], ['div', 'span', 'span', 'div']]

Science<div> Biology </div><div>Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. Nature Methods 2017,</div>

['Science', 'Biology', 'Generation of mature T cells from human hematopoietic stem and progenitor cells in artificial thymic organoids. Nature Methods 2017,']
</code></pre>

I would love to see a corner case so that I can improve the text extraction logic.
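The modified `HTMLStripper` (the version with `get_tree`) is not shown in this answer, so what follows is only a rough sketch of one tag-agnostic way to get similar results. It is not the answerer's code. The idea, which is an assumption on my part, is that the innermost element in which text first appears "owns" the sentence, and everything up to that element's closing tag is joined into one string. The names `SentenceExtractor`, `extract_sentences` and `VOID_TAGS` are invented for illustration, and the `<span>` tags in the sample are assumed.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class SentenceExtractor(HTMLParser):
    """Sketch: group text by the element in which it first appears."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0           # number of currently open elements
        self.owner_depth = None  # depth of the element owning the sentence
        self.parts = []          # raw text chunks of the current sentence
        self.sentences = []      # finished sentences

    def _flush(self):
        # Concatenate the chunks and normalise whitespace.
        text = " ".join("".join(self.parts).split())
        if text:
            self.sentences.append(text)
        self.parts = []
        self.owner_depth = None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self.owner_depth is not None:
                self.parts.append(" ")  # treat e.g. <br> as whitespace
            return
        # Text outside any element ends when the next element opens.
        if self.owner_depth == 0:
            self._flush()
        self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        self.depth = max(0, self.depth - 1)
        # The owning element has closed, so the sentence is complete.
        if self.owner_depth is not None and self.depth < self.owner_depth:
            self._flush()

    def handle_data(self, data):
        if self.owner_depth is None:
            if not data.strip():
                return  # ignore whitespace between sentences
            self.owner_depth = self.depth
        self.parts.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text


def extract_sentences(html):
    parser = SentenceExtractor()
    parser.feed(html)
    parser.close()
    return parser.sentences


sample = ("<ul><li>Registered Nurse in <span>Missouri</span>, License number "
          "<span>xxxxxxxx</span>, <span>2017</span></li>"
          "<li>AHA Basic Life Support 2016-2018</li></ul>")
print(extract_sentences(sample))
```

Two corner cases this sketch does not handle: tags that are left unclosed (for example `<li>a<li>b`) throw the depth counter off, and a block nested inside an element that already has text of its own (for example `<div>Intro <ul><li>a</li></ul></div>`) is merged into a single sentence.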
