Numeric values come out wrong when reading a Wikipedia table with pandas

Posted 2024-04-18 23:55:32

I am trying to read the contents of a Wikipedia table into a DataFrame.

In [110]: import pandas as pd

In [111]: df = pd.read_html("https://en.wikipedia.org/wiki/List_of_cities_by_GDP")[0]

However, the resulting DataFrame contains garbled values in some columns:

                        0                            1                     2  \
0  City/Metropolitan area                      Country  Geographical zone[1]   
1                Aberdeen               United Kingdom       Northern Europe   
2                 Abidjan  Côte d'Ivoire (Ivory Coast)                Africa   
3               Abu Dhabi         United Arab Emirates          Western Asia   
4             Addis Ababa                     Ethiopia                Africa   

                                    3  \
0     Official est. Nominal GDP ($BN)   
1  7001113000000000000♠11.3 (2008)[5]   
2                                 NaN   
3         7002119000000000000♠119 [6]   
4                                 NaN   

                                                   4  \
0  Brookings Institution[2] 2014 est. PPP-adjuste...   
1                                                NaN   
2                                                NaN   
3                          7002178300000000000♠178.3   
4                                                NaN   

                                         5  \
0  PwC[3] 2008 est. PPP-adjusted GDP ($BN)   
1                                      NaN   
2                   7001130000000000000♠13   
3                                      NaN   
4                   7001120000000000000♠12   

                                         6                             7  
0  McKinsey[4] 2010 est. Nominal GDP ($BN)  Other est. Nominal GDP ($BN)  
1                                      NaN                           NaN  
2                                      NaN                           NaN  
3                 7001671009999900000♠67.1                           NaN  
4                                      NaN                           NaN 

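For example, in the DataFrame above, the first entry in the Official est. Nominal GDP column should be 11.3 (2008), but it is preceded by a long run of digits. I assumed this was an encoding problem, so I tried passing an explicit encoding such as ASCII: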

In [113]: df = pd.read_html("https://en.wikipedia.org/wiki/List_of_cities_by_GDP", encoding = 'ASCII')[0]

However, even that does not fix the problem. Any ideas?


2 answers

This happens because of "sort key" elements that are invisible in the browser:

<td style="background:#79ff76;">
    <span style="display:none" class="sortkey">7001130000000000000♠</span> 
    13
</td>
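
To see why the digits end up in the cell, here is a minimal sketch that uses only the standard library. It assumes the cell markup shown above and a plain HTML parser that, like the one behind read_html, collects text nodes without applying CSS such as display:none. The names CELL_HTML and TextCollector are made up for this illustration.

```python
from html.parser import HTMLParser

# The table cell from the snippet above, copied as a string.
CELL_HTML = """<td style="background:#79ff76;">
    <span style="display:none" class="sortkey">7001130000000000000♠</span>
    13
</td>"""


class TextCollector(HTMLParser):
    """Collect every text node; CSS visibility is never consulted."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
collector.feed(CELL_HTML)
collector.close()

# Join the chunks and collapse runs of whitespace into single spaces.
cell_text = " ".join("".join(collector.chunks).split())
print(cell_text)  # 7001130000000000000♠ 13
```

The hidden text is ordinary cell content as far as the parser is concerned, so changing the encoding argument cannot remove it.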

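There may be a better way to clean this up, but here is a working solution. The idea is to find these "sort key" elements in the table and remove them, then let pandas parse the table HTML: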

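from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Download the page and parse it so the hidden elements can be removed.
response = requests.get("https://en.wikipedia.org/wiki/List_of_cities_by_GDP")
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")

# Take the first wikitable and drop every hidden sort-key span from it.
table = soup.select_one("table.wikitable")
for span in table.select("span.sortkey"):
    span.decompose()

# Wrap the HTML in StringIO: recent pandas versions deprecate literal HTML strings.
df = pd.read_html(StringIO(str(table)))[0]
print(df)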
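
The removal step can be checked offline before running it against the live page. The following is a small sketch that assumes Beautiful Soup is installed. SAMPLE_HTML is a made-up two-row table modeled on the cell shown earlier, and cell_texts is a hypothetical helper, not part of any library.

```python
from bs4 import BeautifulSoup

# Made-up sample table with one hidden sort-key span, for illustration only.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>City</th><th>GDP ($BN)</th></tr>
  <tr>
    <td>Abidjan</td>
    <td><span style="display:none" class="sortkey">7001130000000000000♠</span>13</td>
  </tr>
</table>
"""


def cell_texts(table):
    """Return the stripped text of every <td> in the table."""
    return [cell.get_text(strip=True) for cell in table.select("td")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.select_one("table.wikitable")

before = cell_texts(table)
for span in table.select("span.sortkey"):
    span.decompose()  # removes the span and its text from the tree
after = cell_texts(table)

print(before)  # ['Abidjan', '7001130000000000000♠13']
print(after)   # ['Abidjan', '13']
```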

If you look at the HTML source of the page, you will see that many cells contain a hidden <span> holding a "sortkey". Those are the strange numbers you are seeing.

If you look at the documentation for read_html, you will see:

Expect to do some cleanup after you call this function. [...] We try to assume as little as possible about the structure of the table and push the idiosyncrasies of the HTML contained in the table to the user.

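Put the two together and you have your answer: garbage in, garbage out. The table you are reading contains junk data, and you have to work out for yourself how to handle it.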
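
One possible way to do that cleanup after read_html has already run is sketched below. It assumes every sort key has the form seen in the question's output, a run of digits followed by the ♠ character at the start of the cell. The names strip_sort_key and to_number are hypothetical helpers written for this example.

```python
import re

# Assumed pattern: digits followed by the spade character at the start of the cell.
SORT_KEY_PREFIX = re.compile(r"^\d+♠\s*")
# First decimal number in the remaining text, e.g. "11.3" in "11.3 (2008)[5]".
FIRST_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def strip_sort_key(value):
    """Remove the sort-key prefix from a string; pass other values through."""
    if not isinstance(value, str):
        return value  # e.g. NaN for empty cells
    return SORT_KEY_PREFIX.sub("", value)


def to_number(value):
    """Return the first number in a cleaned cell as a float, or None."""
    cleaned = strip_sort_key(value)
    if not isinstance(cleaned, str):
        return None
    match = FIRST_NUMBER.search(cleaned)
    return float(match.group()) if match else None


samples = [
    "7001113000000000000♠11.3 (2008)[5]",
    "7002119000000000000♠119 [6]",
    float("nan"),
]
for raw in samples:
    print(repr(strip_sort_key(raw)), to_number(raw))
```

On the DataFrame from the question, something like df[3].map(strip_sort_key) would apply the first helper to one column. The header row should be dropped before calling to_number, because header text such as "PwC[3] 2008 est." also contains digits.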
