Create a new column from the suffix of each item (DataFrame)

Posted 2024-04-18 11:23:20


I built the following dataset using pandas:

    URLS  \
0

1                   www.gene.eu   
2  www.cittametropolitana.me.it   
3     www.regione.basilicata.it   
4    www.bbc.co.uk   

                                               Paths  
0                                                     
1            /news-room/q-a-detail/ 
2                     /emergenza-sanitari/  
3                     /giunta/site/giunta/detail.jsp  
4  /focus/  

I would like to check the suffix of each URL (eu, it, co.uk, ...) in order to assign one of the following values:

suffix=['.it','.uk','.eu'] # this should be used as set which includes all the suffix that I want to check
country=['Italy','United Kingdom','Europe'] # values to assign based on the suffix

zipped = list(zip(suffix, country)) # create a connection between suffix and country

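I have tried several approaches to add this new column with the suffix information to my example DataFrame (thanks also to the users who helped me with this), but without success (a related question with another example: Adding new column with condition):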

country = {k.lower() : v for (k,v) in zipped}
og = {k : v for (k,v) in suffix}
country.update(og)
# (1)
df['value'] = df['URLS'].str.split(".", expand=True).stack().reset_index(1).query(
    "level_1 == level_1.max()"
)[0].map(country)

# (2)
original_domain = {x: y for x, y  in zipped}

df['value'] = df['URLS'].apply(lambda sen : original_domain.get( sen[-1], 'Unknown') ) )

# (3)
df['value']=df['URLS'].map(lambda x: x[-3:] in zipped) 

#(4)
df['value'] = np.where(df['URLS'].str.endswith(suffix), pd.to_datetime(df['value'])) # it returns me errors and t needs another step to assign country

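But none of this code works. URLS is a column derived by parsing links. I think the problem may lie in defining the value column from computed items without creating a list, so I need to build it from the URLs. So I would like to ask how to add this new column by looking at the suffix each URL ends with and assigning the corresponding value (Italy, United Kingdom, and so on).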
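A plain-Python sketch of why attempts (2) and (3) cannot match, using the `suffix`, `country` and `zipped` names from the question. The variable names `last_char`, `in_pairs` and `by_suffix` are illustrative only, and the fixed three-character slice is an assumption that holds for these suffixes, not a general solution.

```python
suffix = ['.it', '.uk', '.eu']
country = ['Italy', 'United Kingdom', 'Europe']
zipped = list(zip(suffix, country))

url = 'www.bbc.co.uk'

# Attempt (2): url[-1] is the last character ('k'), never a key such as '.uk'.
last_char = url[-1]

# Attempt (3): zipped holds tuples like ('.uk', 'United Kingdom'),
# so a bare string is never "in" it and the result is always False.
in_pairs = url[-3:] in zipped

# Looking the slice up in a dict built from the pairs does find the country.
by_suffix = dict(zipped).get(url[-3:], 'Unknown')

print(last_char, in_pairs, by_suffix)
```

Separately, `og = {k : v for (k,v) in suffix}` tries to unpack each three-character string into two names, which raises a `ValueError` before any of the numbered attempts runs.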

I hope you can help me.

Thanks.

Edit:

df is defined as follows:

df=pd.read_csv('path/text.csv', sep=';', engine='python')

I think this may be what causes the error when I try to apply the code suggested by sK500.
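One possible cause, offered as an assumption since the error message is not shown: the blank row 0 in the printed frame suggests the CSV produced a missing value in `URLS`, and a lambda calling `x.split('.')` fails on a missing value because it is not a string. The sketch below builds a small stand-in frame by hand (the real file is not available) and uses the `.str` accessor, which passes missing values through instead of raising.

```python
import pandas as pd

# Stand-in for the frame read from the CSV; the None mimics the blank row 0.
df = pd.DataFrame({'URLS': [None, 'www.gene.eu', 'www.bbc.co.uk']})

# Split once from the right and keep the last piece; missing values stay missing.
last_label = df['URLS'].str.rsplit('.', n=1).str[-1]

print(last_label.tolist())
```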


1 answer

Answer 1 · posted 2024-04-18 11:23:20

If I understand you correctly, this is what you are after:

import pandas as pd


suffix = ['it', 'uk', 'eu']
country = ['Italy', 'United Kingdom', 'Europe']
mapping = dict(zip(suffix, country))
urls = ['www.gene.eu', 'www.cittametropolitana.me.it', 'www.regione.basilicata.it', 'www.bbc.co.uk']
paths = ['/news-room/q-a-detail/', '/emergenza-sanitari/', '/giunta/site/giunta/detail.jsp', '/focus/']
frame = pd.DataFrame(zip(urls, paths), columns=['urls', 'paths'])
for ext in mapping:
    frame.loc[frame['urls'].apply(lambda x: x.split('.')[-1]) == ext, 'Country'] = mapping[ext]
print(frame)

Output:

                           urls                           paths         Country
0                   www.gene.eu          /news-room/q-a-detail/          Europe
1  www.cittametropolitana.me.it            /emergenza-sanitari/           Italy
2     www.regione.basilicata.it  /giunta/site/giunta/detail.jsp           Italy
3                 www.bbc.co.uk                         /focus/  United Kingdom

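Note that for this to work, you need to add every extension you want to cover to the mapping beforehand, and the data must be uniform: every URL has to contain a . and end with an extension that is in the mapping, otherwise you will get unwanted NaN values.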
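If the data is not uniform, a vectorized variant can label the leftovers explicitly. This is a minimal sketch rather than part of the original answer: the extra rows (`www.example.com` and a missing URL) and the `'Unknown'` label are assumptions added for illustration.

```python
import pandas as pd

mapping = {'it': 'Italy', 'uk': 'United Kingdom', 'eu': 'Europe'}

# Two of these rows are hypothetical: one unmapped extension, one missing URL.
frame = pd.DataFrame({'urls': ['www.gene.eu', 'www.bbc.co.uk', 'www.example.com', None]})

# Take the text after the last dot, lower-cased so 'EU' and 'eu' map alike.
tld = frame['urls'].str.rsplit('.', n=1).str[-1].str.lower()

# Unmapped or missing extensions come back as NaN from map; label them instead.
frame['Country'] = tld.map(mapping).fillna('Unknown')

print(frame)
```

Compared with the loop over `mapping`, this does a single pass over the column, and rows that match nothing get a visible label instead of NaN.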
