<p>My preferred solution for parsing HTML or XML is <code>lxml</code> together with <code>xpath</code>.</p>
<p>A quick and dirty example of how to use <code>xpath</code>:</p>
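<pre><code>from lxml import etree

# Read the saved page and parse it with lxml's lenient HTML parser
with open('result.html', 'r') as f:
    doc = etree.HTML(f.read())

# Visit every row with class "trmenu1" and print the text of its cells
for tr in doc.xpath('//table/tr[@class="trmenu1"]'):
    print(tr.xpath('./td/text()'))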
</code></pre>
<p>This yields:</p>
<pre><code>['Registration Number: ', ' CS 2047103']
['Name of the Candidate: ', 'PATIL SANTOSH KUMARRAO ']
['Examination Paper: ', 'CS - Computer Science and Information Technology']
['Marks Obtained: ', '75.67 Out of 100']
['GATE Score: ', '911']
['All India Rank: ', '34']
['No of Candidates Appeared in CS: ', '156780']
['Qualifying Marks for CS: ', '\r\n\t\t\t\t\t']
['General', 'OBC ', '(Non-Creamy)', 'SC / ST / PD ']
['31.54', '28.39', '21.03 ']
</code></pre>
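<p>This code parses the HTML data into an <code>ElementTree</code>-style element tree (<code>etree.HTML</code> returns its root element). Using <code>xpath</code>, it selects all <code>&lt;tr&gt;</code> elements that have a <code>class="trmenu1"</code> attribute. Then, for each <code>&lt;tr&gt;</code>, it selects and prints the text of any <code>&lt;td&gt;</code> children.</p>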
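<p>If you want the values as data rather than printed lists, the same two <code>xpath</code> expressions can feed a dictionary. The following is only a minimal sketch: the <code>SAMPLE_HTML</code> string and the <code>extract_fields</code> helper are hypothetical names made up for illustration, and the sketch assumes each row of interest has exactly one label cell and one value cell, which is not true of every row in the output above.</p>

```python
from lxml import etree

# Hypothetical stand-in for result.html; the real page has more rows.
SAMPLE_HTML = """
<table>
  <tr class="trmenu1"><td>GATE Score: </td><td>911</td></tr>
  <tr class="trmenu1"><td>All India Rank: </td><td>34</td></tr>
  <tr class="other"><td>Ignored: </td><td>x</td></tr>
</table>
"""

def extract_fields(html):
    """Map label -> value for rows with class "trmenu1" (hypothetical helper)."""
    doc = etree.HTML(html)
    fields = {}
    for tr in doc.xpath('//table/tr[@class="trmenu1"]'):
        # Strip surrounding whitespace from each cell's text
        cells = [text.strip() for text in tr.xpath('./td/text()')]
        # Assumption: only simple label/value rows are kept; others are skipped
        if len(cells) == 2:
            label, value = cells
            fields[label.rstrip(':')] = value
    return fields

fields = extract_fields(SAMPLE_HTML)
print(fields)
```

<p>With the sample markup above, <code>fields</code> holds <code>'GATE Score'</code> mapped to <code>'911'</code> and <code>'All India Rank'</code> mapped to <code>'34'</code>. Rows that span more than two cells, such as the qualifying marks rows, would need their own handling.</p>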