Finding and retrieving content from HTML text with BeautifulSoup

Posted 2024-05-16 19:18:16


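I have the HTML below (or at least I think it is HTML), and I am working with it in Python using BeautifulSoup.

I have already parsed the HTML correctly with BeautifulSoup. What I want to do next is retrieve the content associated with the 'div' that carries a specific data-label (for example, at the bottom of the code, data-label="Relation"). Specifically, I want a dictionary whose key is the text of the data-label, "Relation" in my example, and whose value is the content of the same 'div', which in my example is the href "http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010"

I have tried several approaches, but as far as I can tell data-label does not seem to be a valid attribute, so I don't know how to handle this.

(Note that this is just one example; I will have to do the same thing for thousands, if not millions, of pages with a similar structure.)

Thank you for your help.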

<div id="directs"> <label class="c1"><a data-comment="A human-readable name for the subject." data-label="label" href="http://www.w3.org/2000/01/rdf-schema#label"> rdfs:<span>label</span> </a></label> <div class="c2 value "> <div class="toMultiLine "> <div class="fixed"> <span class="dType">xsd:string</span> intervento di Fabrizio CICCHITTO </div> </div> </div> <label class="c1"><a data-comment="A name given to the resource." data-label="Title" href="http://purl.org/dc/elements/1.1/title"> dc:<span>title</span> </a></label> <div class="c2 value "> <div class="toMultiLine "> <div class="fixed"> intervento di Fabrizio CICCHITTO </div> </div> </div> <label class="c1"><a data-comment="" data-label="" href="http://lod.xdams.org/ontologies/ods/modified"> ods:<span>modified</span> </a></label> <div class="c2 value "> <div class="toMultiLine "> <div class="fixed"> <span class="dType">xsd:dateTime</span> 2016-07-05T12:26:02Z </div> </div> </div> <label class="c1"><a data-comment="The subject is an instance of a class." data-label="type" href="http://www.w3.org/1999/02/22-rdf-syntax-ns#type"> rdf:<span>type</span> </a></label> <div class="c2 value"> <div class="toOneLine"> <a class=" isLocal" href="http://dati.camera.it/ocd/intervento" title="&lt;http://dati.camera.it/ocd/intervento&gt;"> ocd:intervento </a> </div> </div> <label class="c1"><a data-comment="propriet generica utilizzata per puntare alla risorsa deputato in vari punti dell'ontologia" data-label="rierimento a deputato" href="http://dati.camera.it/ocd/rif_deputato"> ocd:<span>rif_deputato</span> </a></label> <div class="c2 value"> <div class="toOneLine"> <a class=" isLocal" href="http://dati.camera.it/ocd/deputato.rdf/d15080_17" title="&lt;http://dati.camera.it/ocd/deputato.rdf/d15080_17&gt;"> http://dati.camera.it/ocd/deputato.rdf/d15080_17 </a> </div> </div> <label class="c1"><a data-comment="A related resource." 
data-label="Relation" href="http://purl.org/dc/elements/1.1/relation"> dc:<span>relation</span> </a></label> <div class="c2 value"> <div class="toOneLine"> <a class=" " href="http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010" target="_blank" title="&lt;http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010&gt;"> http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010 </a> </div> </div> </div>

1 Answer
User
#1 · Posted 2024-05-16 19:18:16

You can collect the data-label values in one pass and the div contents in another, then zip the two lists together to build the dictionary:

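from bs4 import BeautifulSoup as soup
import re
# `content` is the HTML string shown in the question
d = soup(content, 'html.parser').find('div', {'id':'directs'})
# data-label of the <a> inside each <label>, in document order
_labels = [i.a['data-label'] for i in d.find_all('label')]
# text of each value <div>; the raw string keeps \s from being read as an invalid escape
_content = [i.text for i in d.find_all('div', {'class':re.compile(r'c2 value\s*')})]
# pair labels with contents by position
result = dict(zip(_labels, _content))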

Output:

{'label': '\n\n\nxsd:string \n        intervento di Fabrizio CICCHITTO\n      \n\n', 
 'Title': '\n\n\n        intervento di Fabrizio CICCHITTO\n      \n\n', 
 '': '\n\n\nxsd:dateTime \n        2016-07-05T12:26:02Z\n      \n\n', 
 'type': '\n\n\n      ocd:intervento\n      \n\n', 
 'rierimento a deputato': '\n\n\n      http://dati.camera.it/ocd/deputato.rdf/d15080_17\n      \n\n', 
  'Relation': '\n\n\n         http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010\n        \n\n'}
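
The values above are raw text with surrounding whitespace, while the question asks for the href. As a minimal sketch of one possible variant (not part of the original answer), the snippet below walks from each `<a data-label=...>` to the `<div>` that follows its `<label>`, and stores the link target when the div contains one, or the whitespace-collapsed text otherwise. The `html` string is a shortened stand-in for the question's markup, the `example.com` URL is a placeholder, and the helper name `extract_pairs` is hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the HTML in the question.
html = """
<div id="directs">
  <label class="c1"><a data-label="Title" href="http://purl.org/dc/elements/1.1/title"> dc:<span>title</span> </a></label>
  <div class="c2 value "> <div class="toMultiLine "> <div class="fixed"> intervento di Fabrizio CICCHITTO </div> </div> </div>
  <label class="c1"><a data-label="Relation" href="http://purl.org/dc/elements/1.1/relation"> dc:<span>relation</span> </a></label>
  <div class="c2 value"> <div class="toOneLine"> <a class=" " href="http://example.com/doc#int00010" target="_blank"> http://example.com/doc#int00010 </a> </div> </div>
</div>
"""

def extract_pairs(markup):
    """Map each data-label to the href (or, failing that, the text) of its value div."""
    root = BeautifulSoup(markup, "html.parser").find("div", id="directs")
    pairs = {}
    # attrs={"data-label": True} matches any <a> that carries the attribute
    for a in root.find_all("a", attrs={"data-label": True}):
        label = a.find_parent("label")
        # The value <div> is assumed to be the next <div> sibling of the <label>
        value_div = label.find_next_sibling("div") if label else None
        if value_div is None:
            continue
        link = value_div.find("a", href=True)
        if link:
            # Prefer the link target when the value div wraps a link
            pairs[a["data-label"]] = link["href"]
        else:
            # Otherwise collapse runs of whitespace in the div's text
            pairs[a["data-label"]] = " ".join(value_div.get_text().split())
    return pairs

result = extract_pairs(html)
print(result)
```

Pairing each label with its own sibling avoids relying on the two lists lining up by position. Passing the attribute through `attrs` is also how a single entry can be looked up, for example `root.find("a", attrs={"data-label": "Relation"})`. Labels that repeat within a page would overwrite each other in the dictionary, so a list of tuples may be a safer container if that can happen.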
