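I scraped a website that has hundreds of pages of poorly organized HTML. I used BeautifulSoup to capture the full contents of a div on each page. An excerpt of the resulting list is below: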
mylist = [['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/30/2019<br/>09:00:00 AM<br/>12/31/2019<br/>09:00:00 AM<br/>92112<br/>Initiate<br/>Capacity Constraint<br/>12/29/2019<br/>03:02:38 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/30/2019<br/></div>'],
['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/29/2019<br/>09:00:00 AM<br/>12/30/2019<br/>09:00:00 AM<br/>92086<br/>Initiate<br/>Capacity Constraint<br/>12/28/2019<br/>02:55:39 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/29/2019<br/></div>'],
['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/28/2019<br/>09:00:00 AM<br/>12/29/2019<br/>09:00:00 AM<br/>92074<br/>Initiate<br/>Capacity Constraint<br/>12/27/2019<br/>03:14:16 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/28/2019<br/></div>']]
How can I capture the content between the <br/> tags, including the blanks where there is nothing between them?
I should add that the output should be a list of lists, where each piece of text delimited by a <br/> tag becomes one item in the inner list. For example:
[['"006951446", "Algonquin Gas Transmission, LLC", "Critical notice", "12/30/2019", "09:00:00 AM", "12/31/2019", "09:00:00 AM", "92112", "Initiate", "Capacity Constraint", "12/29/2019", "03:02:38 PM", "No response required", "AGT Pipeline Conditions for 12/30/2019"'],
['"006951446", "Algonquin Gas Transmission, LLC", "Critical notice", "12/29/2019", "09:00:00 AM", "12/30/2019", "09:00:00 AM", "92086", "Initiate", "Capacity Constraint", "12/28/2019", "02:55:39 PM", "No response required", "AGT Pipeline Conditions for 12/29/2019"'],
['"006951446", "Algonquin Gas Transmission, LLC", "Critical notice", "12/28/2019", "09:00:00 AM", "12/29/2019", "09:00:00 AM", "92074", "Initiate", "Capacity Constraint", "12/27/2019", "03:14:16 PM", "No response required", "AGT Pipeline Conditions for 12/28/2019"']]
Normally, when you call select on a BeautifulSoup object, you get back a list of Tag objects. You can then call select or getText again on each Tag.
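The idea of selecting the div and then walking its contents can be sketched as follows. This is a minimal illustration, not the original answer's code: the HTML string is a shortened, made-up version of the entries in mylist, and split_on_br is a hypothetical helper name. It keeps an empty string wherever two <br/> tags have nothing (or only whitespace) between them.

```python
from bs4 import BeautifulSoup, Tag

# Shortened, illustrative version of one entry from mylist (assumption).
html = (
    '<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>'
    'Critical notice<br/> <br/><br/>No response required<br/></div>'
)


def split_on_br(div):
    """Return the text found before each <br/> in div, keeping empty fields."""
    fields, current = [], []
    for child in div.children:
        if isinstance(child, Tag) and child.name == "br":
            # A <br/> closes the current field, even if it is empty.
            fields.append("".join(current).strip())
            current = []
        elif isinstance(child, Tag):
            # Any other nested tag contributes its text to the current field.
            current.append(child.get_text())
        else:
            # Plain text node between tags.
            current.append(str(child))
    trailing = "".join(current).strip()
    if trailing:
        # Text after the last <br/>, if there is any.
        fields.append(trailing)
    return fields


soup = BeautifulSoup(html, "html.parser")
rows = [split_on_br(div) for div in soup.select("div#headingData")]

# Drop the blanks if only the populated fields are wanted.
non_empty_rows = [[field for field in row if field] for row in rows]
```

Here rows keeps the blank fields in position, which is useful when the position of each value matters; non_empty_rows matches the shape of the desired output shown in the question.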
A solution using the SimplifiedDoc library:
Result:
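Without seeing the rest of your code it is hard to give a precise answer, but BeautifulSoup is a good fit for this. You should be able to keep using the bs4 package and work through the HTML with a combination of BeautifulSoup methods (for example find, find_all, and select). See this answer for help on br tags.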
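As one possible combination of those methods, the sketch below uses find to locate the div and stripped_strings to collect its text. It assumes the same list-of-single-item-lists layout as mylist in the question, with shortened, made-up HTML. Note that stripped_strings skips whitespace-only text, so this variant drops the blank fields rather than preserving them.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative stand-in for mylist from the question (assumption).
mylist = [
    ['<div id="headingData">006951446<br/>Critical notice<br/> <br/><br/>'
     'No response required<br/></div>'],
    ['<div id="headingData">006951446<br/>Critical notice<br/> <br/><br/>'
     'AGT Pipeline Conditions for 12/29/2019<br/></div>'],
]

result = []
for (fragment,) in mylist:  # each inner list holds exactly one HTML string
    div = BeautifulSoup(fragment, "html.parser").find("div", id="headingData")
    # stripped_strings yields each text node with surrounding whitespace
    # removed and omits nodes that contain only whitespace.
    result.append(list(div.stripped_strings))
```

Each inner list in result then holds one string per populated field, in document order.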