<p>Here is a solution that uses <a href="https://pypi.python.org/pypi/py-wikimarkup/" rel="noreferrer">py-wikimarkup</a> and <a href="https://pypi.python.org/pypi/pyquery/" rel="noreferrer">PyQuery</a> to extract every table in a wikimarkup string as a pandas DataFrame, ignoring all non-table content.</p>
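<pre><code>import wikimarkup
import pandas as pd
from pyquery import PyQuery


def get_tables(wiki):
    # Render the wiki markup to HTML, then wrap it for CSS-style queries.
    html = PyQuery(wikimarkup.parse(wiki))
    frames = []
    for table in html('table'):
        # One list of stripped cell texts per table row;
        # an empty cell has text None, so fall back to ''.
        data = [[(x.text or '').strip() for x in row]
                for row in table.getchildren()]
        # The first row supplies the column names.
        df = pd.DataFrame(data[1:], columns=data[0])
        frames.append(df)
    return frames
</code></pre>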
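<p>To make the row-extraction step easier to follow, here is a minimal sketch of the same idea using only the standard library's <code>html.parser</code> instead of PyQuery. It assumes the wiki markup has already been rendered to HTML (for example by <code>wikimarkup.parse</code>) and that tables are not nested. The names <code>TableRows</code> and <code>html_table_rows</code> are hypothetical helpers invented for this illustration, not part of either library.</p>

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect cell texts grouped by row and by table (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table>
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (i.e. non-table content) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


def html_table_rows(html):
    """Return a list of tables, each a list of rows of stripped cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.tables
```

<p>Each table returned this way is a list of rows, so, as in <code>get_tables</code>, the first row can serve as the column names and the remaining rows as the data passed to <code>pd.DataFrame</code>.</p>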
<p>Given the following input:</p>
^{pr2}$
<p><code>get_tables</code> returns the following DataFrames:</p>
<pre><code>       Model  Mhash/s  Mhash/J  Watts  Clock  SP                                     Comment
0        ION      1.8    0.067     27         16         poclbm; power consumption incl. CPU
1  8200 mGPU      1.2                   1200  16  128 MB shared memory, "poclbm -w 128 -f 0"
2    8400 GS      2.3                                                        "poclbm -w 128"
</code></pre>
<pre><code>   A  B  C
0  0  1  2
1  3  4  5
</code></pre>