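I'm trying to scrape a website, and one section has me stumped. There is an unordered list of locations served by an organization, and I can't seem to parse the whole list.
Here is a sample of what the HTML looks like: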
<div id="current_tab">
<p class="view_label_type_geoserved" id="view_label_field_geoserved">Geographies Served</p>
<ul>
<li class="view_type_geoserved" id="view_field_geoserved">
<p style="font-weight: bold; border-bottom: 1px dotted #CCC; font-size: .9em;">North Carolina (NC)<span style="float: right; font-size: 0.8em;">North Carolina (NC)</span></p>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Durham (serves entire county)<span style="float: right; font-size: 0.8em;">Durham</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Franklin (serves entire county)<span style="float: right; font-size: 0.8em;">Franklin</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Granville (serves entire county)<span style="float: right; font-size: 0.8em;">Granville</span>
</p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Orange (serves entire county)<span style="float: right; font-size: 0.8em;">Orange</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Person (serves entire county)<span style="float: right; font-size: 0.8em;">Person</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Vance (serves entire county)<span style="float: right; font-size: 0.8em;">Vance</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Wake (serves entire county)<span style="float: right; font-size: 0.8em;">Wake</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Warren (serves entire county)<span style="float: right; font-size: 0.8em;">Warren</span></p>
</li>
</ul>
</div>
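Here is what I'm using to parse the element:
[code snippet missing from the source]
Below is the result I get; note that this is only the beginning of the list: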
<p class="view_label_type_geoserved" id="view_label_field_geoserved">Geographies Served</p>
<p style="font-weight: bold; border-bottom: 1px dotted #CCC; font-size: .9em;">North Carolina (NC)<span style="float: right; font-size: 0.8em;">North Carolina (NC)</span></p>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Durham (serves entire county)<span style="float: right; font-size: 0.8em;">Durham</span></p>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Franklin (serves entire county)<span style="float: right; font-size: 0.8em;">Franklin</span></p>
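Once I have the HTML, I have some functions that use regex to strip out the text and then join the pieces into a single string, but suggestions on that part would also be appreciated.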
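The problem is that the HTML you are dealing with is malformed (note the stray </li> closing tags), so it needs a lenient parser.
Using lxml or html5lib works for me, and it prints:
[output missing from the source]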
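Since the original snippets did not survive, here is a minimal sketch of the idea. It assumes the page is parsed with Beautiful Soup (`bs4`), which the text above does not name explicitly, and that at least one of `lxml` or `html5lib` is installed. The markup is a trimmed copy of the sample from the question (style attributes dropped, fewer counties), and the helper names `make_soup` and `served_places` are hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Trimmed copy of the markup from the question, stray </li> tags included.
HTML = """
<div id="current_tab">
<p class="view_label_type_geoserved" id="view_label_field_geoserved">Geographies Served</p>
<ul>
<li class="view_type_geoserved" id="view_field_geoserved">
<p>North Carolina (NC)<span>North Carolina (NC)</span></p>
<p>Durham (serves entire county)<span>Durham</span></p>
</li>
<p>Franklin (serves entire county)<span>Franklin</span></p>
</li>
<p>Granville (serves entire county)<span>Granville</span>
</p>
</li>
<p>Orange (serves entire county)<span>Orange</span></p>
</li>
</ul>
</div>
"""


def make_soup(markup):
    """Parse with the first lenient parser that is installed (hypothetical helper)."""
    for parser in ("lxml", "html5lib"):
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("Install lxml or html5lib to parse this markup.")


def served_places(markup):
    """Return the leading text of every unlabelled <p> inside div#current_tab."""
    div = make_soup(markup).find("div", id="current_tab")
    # The label paragraph carries a class attribute, so skip it. For each
    # remaining <p>, take only the first text node and ignore the duplicate
    # text inside the <span>.
    return [
        next(p.stripped_strings)
        for p in div.find_all("p")
        if not p.has_attr("class")
    ]


places = served_places(HTML)
print(", ".join(places))
```

Reading the text through `stripped_strings` also avoids the regex step mentioned in the question, since the parser hands back the text nodes directly and they can be joined with `", ".join(...)`.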