Python regex: how to extract paragraphs from an HTML earnings report document?

Posted 2024-06-16 08:37:54

How can I extract paragraphs from a page like this one? https://www.sec.gov/Archives/edgar/data/81318/000165495416004006/yuma_10q.htm

I tried to get the text like this:

import requests
from bs4 import BeautifulSoup

link = 'https://www.sec.gov/Archives/edgar/data/81318/000165495416004006/yuma_10q.htm'
# Download the filing and parse it with the built-in HTML parser
html = BeautifulSoup(requests.get(link).content, 'html.parser')
# Join every text node, skipping nodes whose parent is a non-content tag
skip = ('style', 'script', 'head', 'title', 'meta', '[document]')
text = ' '.join(s for s in html.strings if s.parent.name not in skip)
print(text)

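However, the output is very messy: different paragraphs are joined together, with no consistent pattern showing where they should be split. Is there a cleaner way to extract the text from this page in an organized form?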
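
One possible direction, offered as a minimal sketch rather than a definitive answer: instead of joining every text node with a space, collect text per block-level element, so each paragraph, table cell, or heading becomes its own list entry. The function name `extract_paragraphs` and the `BLOCK_TAGS` list are assumptions made for illustration, and the set of tags that matter for a given filing may differ. The sketch keeps only "leaf" blocks (blocks that contain no other block), so text sitting directly inside a container next to nested blocks is not captured.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical list of tags treated as paragraph-like blocks; adjust as needed.
BLOCK_TAGS = ["p", "div", "td", "th", "li", "h1", "h2", "h3", "h4", "h5", "h6"]


def extract_paragraphs(html):
    """Return a list of text blocks, one per leaf block-level element."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements whose text is never visible content.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    # Treat <br> as whitespace so adjacent lines do not run together.
    for br in soup.find_all("br"):
        br.replace_with("\n")

    paragraphs = []
    for block in soup.find_all(BLOCK_TAGS):
        # Skip containers; their nested blocks are visited separately.
        if block.find(BLOCK_TAGS) is not None:
            continue
        # Collapse runs of whitespace (including non-breaking spaces).
        text = re.sub(r"\s+", " ", block.get_text()).strip()
        if text:
            paragraphs.append(text)
    return paragraphs
```

Under these assumptions, the downloaded page content from the snippet above could be passed to `extract_paragraphs`, and the resulting list printed one entry per line or joined with blank lines. Regular expressions are used here only to normalize whitespace; splitting on the document's tag structure tends to be more reliable than trying to find paragraph boundaries in the flattened text.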

0 answers

No answers yet.
