python - How do I convert HTML files into readable txt files?

3 votes
3 answers
2308 views
Asked 2025-04-17 00:09

I have many HTML files like this:

<font face="Garmond,Helvetica,Times" size="2" color="#330066">
<b>
 Summary:
</b>
&nbsp;According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.
<br />
<br />
On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations.  On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed.  The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.
<br />
<hr width="50%" align="left" />
INDUSTRY CLASSIFICATION:
<br />
<b>
 SIC Code:
</b>
0000
<br />
<b>
 Sector:
</b>
N/A
<br />
<b>
 Industry:
</b>
N/A
<br />
</font>

What I want is to extract the text in the middle of the file and convert it to an easy-to-read format. For this example, that would be:

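According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.

On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations.  On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed.  The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.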

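I know I need to do three things:

  1. Extract the text in the middle of the file
  2. Replace "<br />" with "\n"
  3. Replace "&nbsp;" with " " (a single space)

I know the last two are simple, just a matter of using the replace method in Python, but I don't know how to achieve the first.

I know a little about regular expressions and BeautifulSoup, but I don't know how to apply them to this problem.

Can anyone help me?

Thanks, and sorry for my poor English.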

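@Paul: I only want the Summary section. My teacher (who doesn't know much about computers) gave me a lot of HTML files and asked me to convert them into a format suitable for data mining (my teacher wants to use SAS for this). I don't know SAS, but I think it may be used to process lots of txt files, so I want to convert these HTML files into plain txt files.

@Owen: I need to process a lot of HTML files, and I don't think the problem is too hard, so I want to solve it directly in Python.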
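Since many files need converting, the per-file extraction can be wrapped in a small batch loop. The following is only a minimal sketch using the standard library: the function name `convert_directory`, the `*.html` file pattern, and the UTF-8 encoding are all assumptions that are not stated in the question, and `extract` stands for whatever function turns one file's HTML into the wanted text.

```python
from pathlib import Path
from typing import Callable


def convert_directory(src_dir: str, dst_dir: str, extract: Callable[[str], str]) -> int:
    """Write one .txt file per .html file in src_dir and return how many were written.

    `extract` is a caller-supplied function (hypothetical here) that maps the
    HTML source of one file to the plain text that should be kept.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the output folder if needed

    count = 0
    for html_path in sorted(src.glob("*.html")):
        # Assumption: files are UTF-8; undecodable bytes are replaced, not fatal.
        source = html_path.read_text(encoding="utf-8", errors="replace")
        text = extract(source)
        # Same base name, .txt extension, e.g. case1.html -> case1.txt
        (dst / (html_path.stem + ".txt")).write_text(text + "\n", encoding="utf-8")
        count += 1
    return count
```

Passing the extraction step in as a function keeps the loop independent of how the Summary text is located, so the loop does not change if the parsing approach does.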

3 Answers

1

The closest approach is to convert the HTML to reStructuredText. You can try it online with a converter, which outputs the following.

 **Summary:** According to the complaint filed January 04, 2011, over a
six-week period in December 2007 and January 2008, six healthcare
related hedge funds managed by Defendant FrontPoint Partners LLC
(“FrontPoint”) sold more than six million shares of Human Genome
Sciences, Inc. (“HGSI”) common stock while their portfolio manager
possessed material negative non-public information concerning the HGSI’s
clinical trial for the drug Albumin Interferon Alfa 2-a.
 On March 2, 2011, the plaintiffs filed a First Amended Class Action
Complaint, amending the named defendants and securities violations. On
March 22, 2011, a motion for appointment as lead plaintiff and for
approval of selection of lead counsel was filed. The defendants
responded to the First Amended Complaint by filing a motion to dismiss
on March 28, 2011.

--------------

INDUSTRY CLASSIFICATION:
 **SIC Code:** 0000
 **Sector:** N/A
 **Industry:** N/A
3

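You can use Scrapely.

Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data you want to extract from them, it automatically builds a parser that can handle all similar pages.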

http://github.com/scrapy/scrapely

2

To do this, you can use a Python library called lxml.

  • First, download and install lxml

Now try running the following code:

from lxml.html import fromstring

html = '''
<font face="Garmond,Helvetica,Times" size="2" color="#330066">
<b>
 Summary:
</b>
&nbsp;According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.
<br />
<br />
On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations.  On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed.  The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.
<br />
<hr width="50%" align="left" />
INDUSTRY CLASSIFICATION:
<br />
<b>
 SIC Code:
</b>
0000
<br />
<b>
 Sector:
</b>
N/A
<br />
<b>
 Industry:
</b>
N/A
<br />
</font>
'''

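# Parse the fragment and flatten it to plain text (entities such as &nbsp; are decoded)
htmlElement = fromstring(html)
textContent = htmlElement.text_content()
# Keep only the text between the "Summary:" label and the "INDUSTRY CLASSIFICATION:" heading
result = textContent.split('\n\n Summary:\n\n')[1].split('\n\nINDUSTRY CLASSIFICATION:\n\n')[0]

print(result)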

This code works as long as '\n\n Summary:\n\n' appears before the text you want and '\n\nINDUSTRY CLASSIFICATION:\n\n' appears after it.
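Splitting on exact newline sequences is sensitive to small whitespace differences between files. As an alternative, here is a rough sketch that covers the three steps from the question with only the standard library (`re` and `html`), using regular expressions instead of lxml. The function name `extract_summary` and the shortened sample text are made up for illustration, and the sketch assumes every file keeps the "Summary:" label inside `<b>` and ends the summary with an `<hr>` tag, as in the example file.

```python
import html
import re


def extract_summary(source: str) -> str:
    """Return the text between the bold 'Summary:' label and the first <hr> tag."""
    # Step 1: capture everything after "Summary:</b>" up to the <hr> rule.
    match = re.search(r"Summary:\s*</b>(.*?)<hr\b", source,
                      flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        raise ValueError("no Summary section found")
    body = match.group(1)

    # Line breaks in the HTML source are not meaningful; only <br /> is.
    body = body.replace("\r", "").replace("\n", " ")
    # Step 2: turn each <br /> into a newline, then drop any other tags.
    body = re.sub(r"<br\s*/?>", "\n", body, flags=re.IGNORECASE)
    body = re.sub(r"<[^>]+>", "", body)
    # Step 3: decode entities; &nbsp; becomes U+00A0, which is mapped to a space.
    body = html.unescape(body).replace("\xa0", " ")

    # Trim stray spaces around each line and blank lines at both ends.
    lines = [line.strip() for line in body.split("\n")]
    return "\n".join(lines).strip()


# Shortened, made-up sample with the same structure as the files in the question
sample = (
    "<font>\n<b>\n Summary:\n</b>\n&nbsp;First paragraph.\n<br />\n<br />\n"
    "Second paragraph.\n<br />\n<hr width=\"50%\" align=\"left\" />\n"
    "INDUSTRY CLASSIFICATION:\n<br />\n</font>"
)
print(extract_summary(sample))
```

Regular expressions are fragile on arbitrary HTML, so this is only reasonable when the files share one fixed template. For less regular input, a real parser such as lxml or BeautifulSoup is the safer choice.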
