Numeric values are displayed incorrectly when reading a Wikipedia table in pandas

                        0                            1                     2  \
0  City/Metropolitan area                      Country  Geographical zone[1]
1                Aberdeen               United Kingdom       Northern Europe
2                 Abidjan  Côte d'Ivoire (Ivory Coast)                Africa
3               Abu Dhabi         United Arab Emirates          Western Asia
4             Addis Ababa                     Ethiopia                Africa

                                     3  \
0      Official est. Nominal GDP ($BN)
1  7001113000000000000♠11.3 (2008)[5]
2                                  NaN
3          7002119000000000000♠119 [6]
4                                  NaN

                                                   4  \
0  Brookings Institution[2] 2014 est. PPP-adjuste...
1                                                NaN
2                                                NaN
3                          7002178300000000000♠178.3
4                                                NaN

                                         5  \
0  PwC[3] 2008 est. PPP-adjusted GDP ($BN)
1                                      NaN
2                   7001130000000000000♠13
3                                      NaN
4                   7001120000000000000♠12

                                         6                             7
0  McKinsey[4] 2010 est. Nominal GDP ($BN)  Other est. Nominal GDP ($BN)
1                                      NaN                           NaN
2                                      NaN                           NaN
3                 7001671009999900000♠67.1                           NaN
4                                      NaN                           NaN

2 answers

User

Reply 1 · edited 2024-04-18 23:55:32

This is caused by "sort key" elements that are invisible in the browser:

<td style="background:#79ff76;">
    <span style="display:none" class="sortkey">7001130000000000000♠</span> 
    13
</td>

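There may be a better way to clean this up, but here is a working solution. The idea is to find these "sort key" elements in the table and remove them, then let pandas parse the table HTML: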

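from io import StringIO

import requests
from bs4 import BeautifulSoup
import pandas as pd


response = requests.get("https://en.wikipedia.org/wiki/List_of_cities_by_GDP")
soup = BeautifulSoup(response.content, "html.parser")

# Take the first wikitable and drop every hidden sort-key span from it.
table = soup.select_one("table.wikitable")
for span in table.select("span.sortkey"):
    span.decompose()

# Wrap the markup in StringIO: passing a literal HTML string to
# read_html is deprecated since pandas 2.1.
df = pd.read_html(StringIO(str(table)))[0]
print(df)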
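
To see the effect without fetching the live page, here is a minimal offline sketch that runs the same removal step on the cell quoted above. The surrounding table, the "City" and "GDP" headers, and the "Abidjan" label are assumptions added only to make the snippet self-contained; they are not copied from the real page.

```python
from bs4 import BeautifulSoup

# Inline copy of the cell shown above, wrapped in a minimal table
# (the wrapper rows are made up for illustration).
HTML = """
<table class="wikitable">
  <tr><th>City</th><th>GDP</th></tr>
  <tr>
    <td>Abidjan</td>
    <td style="background:#79ff76;">
      <span style="display:none" class="sortkey">7001130000000000000♠</span>
      13
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
cell = soup.select("td")[1]

# Before cleanup, the hidden sort key is part of the cell text.
before = cell.get_text(strip=True)

# Same removal step as in the answer: delete every sort-key span.
for span in soup.select("span.sortkey"):
    span.decompose()

after = cell.get_text(strip=True)

print(repr(before))  # '7001130000000000000♠13'
print(repr(after))   # '13'
```

The span is hidden only by CSS, so an HTML parser still sees its text; that is why the key ends up glued to the visible number.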

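User

Reply 2 · edited 2024-04-18 23:55:32

If you look at the page's HTML source, you'll see that many cells contain a hidden <span> holding a "sortkey". Those are the strange numbers you're seeing.

If you look at the documentation for read_html, you'll see: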

Expect to do some cleanup after you call this function. [...] We try to assume as little as possible about the structure of the table and push the idiosyncrasies of the HTML contained in the table to the user.

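Put the two together and you have your answer: garbage in, garbage out. The table you are reading contains junk data, and you have to work out how to handle it yourself.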
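
One possible way to do that cleanup after parsing, sketched with only the standard library: strip the sort-key prefix from each cell value. The helper name `strip_sortkey` and the pattern (a run of digits followed by "♠" at the start of the cell) are assumptions based on the output in the question, not something pandas or Wikipedia define.

```python
import re

# Assumed shape of the junk: digits followed by the "♠" marker
# at the start of the cell text.
SORTKEY_PREFIX = re.compile(r"^\s*\d+♠\s*")


def strip_sortkey(value):
    """Remove a leading sort-key prefix from strings; pass other values through."""
    if isinstance(value, str):
        return SORTKEY_PREFIX.sub("", value)
    return value


# Sample cell values taken from the output in the question, plus a NaN.
cells = [
    "7001113000000000000♠11.3 (2008)[5]",
    "7002178300000000000♠178.3",
    "Aberdeen",
    float("nan"),
]
cleaned = [strip_sortkey(c) for c in cells]
print(cleaned[:3])  # ['11.3 (2008)[5]', '178.3', 'Aberdeen']
```

On a DataFrame the same helper could be applied element-wise with DataFrame.applymap (or DataFrame.map in pandas 2.1 and later). Footnote markers such as "[5]" would still need their own cleanup.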
