Extract country names from text in a column to create another column

Posted 2024-06-09 22:59:52


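I have tried different combinations to extract the country name from a column and create a new column containing only the country. I can do this for a selected row (e.g. df.address[9998]), but not for the whole column.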

import pycountry
Cntr = []
for country in pycountry.countries:
    for country.name in df.address:
        Cntr.append(country.name)

Any idea what is going wrong here?

Edit:

address is an object column in df, and

df.address[:10] looks like this:

       Address
0    Turin, Italy        
1    NaN                 
2    Zurich, Switzerland 
3    NaN                 
4    Glyfada, Greece     
5    Frosinone, Italy    
6    Dublin, Ireland     
7    NaN                 
8   ...                  
9    Kristiansand, Norway
Name: address, Length: 10, dtype: object

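When I run a single query, following Petar's answer, I get the correct country, but when I try to create a column with all the countries (or a range such as df.address[:5]), I get an empty Cntr.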

    import pycountry
    Cntr = []
    for country in pycountry.countries:
        if country.name in df['address'][1]:
            Cntr.append(country.name)
Cntr
Returns
[Italy]

and df.address[2] returns [ ] 
etc.

I also ran df['address'] = df['address'].astype('str')

to make sure the column contains no floats or ints.


3 answers

You are very close. We cannot loop with for country.name in df.address like that. Instead:

import pycountry
Cntr = []
for country in pycountry.countries:
    if country.name in df.address:
        Cntr.append(country.name)

If this does not work, please provide more information, because I am not sure what df.address looks like.
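One likely reason the whole-column version comes back empty: the `in` operator on a pandas Series tests the index labels, not the values, so `country.name in df.address` is False for every country. A minimal sketch of a row-by-row check follows. The `find_country` helper and the short `COUNTRY_NAMES` list are hypothetical, added only so the example runs without pycountry; in practice the list could be built as `[c.name for c in pycountry.countries]`.

```python
import numpy as np
import pandas as pd

# Stand-in for [c.name for c in pycountry.countries]; a short hand-written
# list (an assumption) so the sketch runs without pycountry installed.
COUNTRY_NAMES = ["Greece", "Ireland", "Italy", "Norway", "Switzerland"]


def find_country(address, names=COUNTRY_NAMES):
    """Return the first country name found in one address, or NaN."""
    if not isinstance(address, str):  # NaN is a float, so skip it
        return np.nan
    lowered = address.lower()
    for name in names:
        if name.lower() in lowered:  # case-insensitive substring match
            return name
    return np.nan


df = pd.DataFrame(
    {"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", np.nan, "Glyfada, Greece"]}
)

# `"Italy" in df.address` would test the index labels (0, 1, 2, ...), not the
# values, so the check is applied to each row instead.
df["country"] = df["address"].apply(find_country)
```

Plain substring matching can misfire when one name appears inside another word or another country name, so the result is worth spot-checking on real data.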

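Sample DataFrame:

import numpy as np
import pandas as pd
df = pd.DataFrame({'address': ['Turin, Italy', np.nan, 'Zurich, Switzerland', np.nan, 'Glyfada, greece']})

# n=1 splits at the first comma only, so there are never more than two columns
df[['city', 'country']] = df['address'].str.split(',', expand=True, n=1)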

               address     city       country
0         Turin, Italy    Turin         Italy
1                  NaN      NaN           NaN
2  Zurich, Switzerland   Zurich   Switzerland
3                  NaN      NaN           NaN
4      Glyfada, greece  Glyfada        greece
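Splitting on the comma leaves the leading space in the country part, and it relies on the country being the second field. A possible variant, assuming the country is always whatever follows the last comma, takes the last piece and strips the whitespace:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", np.nan, "Glyfada, greece"]}
)

# Split once from the right, keep the last piece, and trim surrounding spaces.
# NaN rows stay NaN at every step.
df["country"] = df["address"].str.rsplit(",", n=1).str[-1].str.strip()
```

An address with no comma at all is returned unchanged, so this only helps when the addresses follow the "city, country" pattern.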

You can use the function clean_country from the DataPrep library. Install it with pip install dataprep.

import numpy as np
import pandas as pd
from dataprep.clean import clean_country
df = pd.DataFrame({"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", np.nan, "Glyfada, Greece"]})
df2 = clean_country(df, "address")
df2
               address address_clean
0         Turin, Italy         Italy
1                  NaN           NaN
2  Zurich, Switzerland   Switzerland
3                  NaN           NaN
4      Glyfada, Greece        Greece
