How do I clean the data so that each city pair gets the correct arrival code?

Posted 2024-05-23 19:21:16



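As the picture shows, the CSV is laid out as follows: column 1 is the city pair (departure - arrival), column 2 is the departure code, and column 3 is the arrival code.

As you can see in row 319, column 1 is "London - Paris", but column 2 is CDG and column 3 is LHR, when it should be the other way around.

Likewise, in row 324, Dusseldorf - Paris has CDG in column 2 and DUS in column 3, when it should be the reverse.

Most of the CSV is correct, but there are many errors like these.

Can anyone help me put all the rows in order? Perhaps by going through every row in column 1 and making sure the text before and after the "-" matches the codes in columns 2 and 3.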

Columns example in CSV

    old_index   airports                                arr dep
0   319         London, United Kingdom - Paris, France  CDG LHR
1   320         London, United Kingdom - Paris, France  CDG LHR
2   321         London, United Kingdom - Paris, France  CDG LHR
3   322         London, United Kingdom - Paris, France  CDG LHR
4   323         London, United Kingdom - Paris, France  CDG LHR
5   324         Dusseldorf, Germany - Paris, France     CDG DUS
6   325         Amsterdam, Netherlands - Paris, France  CDG AMS
7   326         Amsterdam, Netherlands - Paris, France  CDG AMS
8   327         Amsterdam, Netherlands - Paris, France  CDG AMS
9   328         Amsterdam, Netherlands - Paris, France  CDG AMS
10  329         Amsterdam, Netherlands - Paris, France  CDG AMS
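
The check described above (split the city pair on " - " and compare each side with the two code columns) can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the column names `city_pair`, `dep_code` and `arr_code` and the `CITY_TO_CODE` lookup are hypothetical, and the sketch assumes one airport code per city, which will not hold for cities served by several airports.

```python
import pandas as pd

# Hypothetical city -> IATA code lookup; a real one must cover every city in the CSV
CITY_TO_CODE = {
    "London, United Kingdom": "LHR",
    "Paris, France": "CDG",
    "Dusseldorf, Germany": "DUS",
    "Amsterdam, Netherlands": "AMS",
}


def fix_swapped_codes(df, pair_col="city_pair", dep_col="dep_code", arr_col="arr_code"):
    """Return a copy of df in which reversed departure/arrival codes are swapped back."""
    out = df.copy()
    # Split "departure city - arrival city" into two columns
    cities = out[pair_col].str.split(" - ", n=1, expand=True)
    expected_dep = cities[0].str.strip().map(CITY_TO_CODE)
    expected_arr = cities[1].str.strip().map(CITY_TO_CODE)
    # A row counts as swapped only when each code matches the other city's expected code
    swapped = (out[dep_col] == expected_arr) & (out[arr_col] == expected_dep)
    out.loc[swapped, [dep_col, arr_col]] = out.loc[swapped, [arr_col, dep_col]].to_numpy()
    return out


sample = pd.DataFrame({
    "city_pair": [
        "London, United Kingdom - Paris, France",
        "Dusseldorf, Germany - Paris, France",
        "Amsterdam, Netherlands - Paris, France",
    ],
    "dep_code": ["CDG", "CDG", "AMS"],
    "arr_code": ["LHR", "DUS", "CDG"],
})
fixed = fix_swapped_codes(sample)
print(fixed)
```

Rows whose codes match neither orientation are left untouched, so they can be inspected by hand instead of being silently rewritten.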

1 answer
User
#1 · Posted 2024-05-23 19:21:16

For example, taking a cleaner look at the data from your other question (shown as an image in the original post), try:

import pandas as pd
from math import sin, cos, sqrt, atan2, radians

def get_distance(in_lat1, in_lon1, in_lat2, in_lon2):
    """Great-circle (haversine) distance in km between two points given in decimal degrees."""
    # approximate radius of earth in km
    R = 6373.0

    lat1 = radians(in_lat1)
    lon1 = radians(in_lon1)
    lat2 = radians(in_lat2)
    lon2 = radians(in_lon2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c

    return distance

df = pd.DataFrame({'Normalised City Pair': {0: 'London, United Kingdom - New York, United States',
  1: 'Johannesburg, South Africa - London, United Kingdom',
  2: 'London, United Kingdom - New York, United States',
  3: 'Johannesburg, South Africa - London, United Kingdom',
  4: 'London, United Kingdom - Singapore, Singapore'},
 'Departure Code': {0: 'LHR', 1: 'JNB', 2: 'LHR', 3: 'JNB', 4: 'SIN'},
 'Arrival Code': {0: 'JFK', 1: 'LHR', 2: 'JFK', 3: 'LHR', 4: 'LHR'},
 'Departure_lat': {0: 51.5, 1: -26.1, 2: 51.5, 3: -26.1, 4: 1.3},
 'Departure_lon': {0: -0.45, 1: 28.23, 2: -0.45, 3: 28.23, 4: 103.98},
 'Arrival_lat': {0: 40.64, 1: 51.47, 2: 40.64, 3: 51.47, 4: 51.47},
 'Arrival_lon': {0: -73.79, 1: -0.45, 2: -73.79, 3: -0.45, 4: -0.45}})

# Load the public OurAirports table and keep only airports that have an IATA code
df_airports = pd.read_csv('https://ourairports.com/data/airports.csv')
df_airports = df_airports[['name', 'iata_code']].copy()
df_airports = df_airports[df_airports['iata_code'].notna()].reset_index(drop=True)
# df_airports.query('iata_code == "CDG" | iata_code == "LHR"')

df['Distance'] = df.apply(lambda x: get_distance(x['Departure_lat'], x['Departure_lon'], x['Arrival_lat'], x['Arrival_lon']), axis=1)

#df[['ap_dep', 'ap_arr']] = df['Normalised City Pair'].str.split(' - ', expand=True)

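# Keep one row per IATA code (the first by airport name) so the lookup index is unique
df_airports = df_airports.sort_values('name')
df_airports = df_airports.drop_duplicates(subset='iata_code', keep='first')

# Build the IATA code -> airport name lookup once and reuse it for both columns
iata_to_name = df_airports.set_index('iata_code')['name']
df['dep_ap_name'] = df['Departure Code'].map(iata_to_name)
df['arr_ap_name'] = df['Arrival Code'].map(iata_to_name)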

Output: the frame gains Distance, dep_ap_name and arr_ap_name columns (shown as an image in the original post).
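
The Distance column above is computed one row at a time with apply. On a large CSV, a vectorized variant of the same haversine formula may be faster. The sketch below is an illustration rather than part of the original answer; the function name `haversine_km` and its `radius_km` parameter are made up for this example, and the radius keeps the answer's approximate value of 6373 km.

```python
import numpy as np
import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6373.0):
    """Vectorized great-circle distance in km; inputs are decimal degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return radius_km * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


# Two rows using coordinates from the answer's sample frame
df = pd.DataFrame({
    "Departure_lat": [51.5, -26.1],
    "Departure_lon": [-0.45, 28.23],
    "Arrival_lat": [40.64, 51.47],
    "Arrival_lon": [-73.79, -0.45],
})
df["Distance"] = haversine_km(
    df["Departure_lat"], df["Departure_lon"], df["Arrival_lat"], df["Arrival_lon"]
)
print(df)
```

Passing whole columns lets NumPy do the arithmetic in one pass instead of calling a Python function once per row.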

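If the frame now has too many columns and you want something cleaner with the columns reordered, try final_df = df[['a', 'b', 'c', 'd']], where a, b, c and d are the columns you want, in the order you want them.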
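
As a small illustration of that column selection, here is a sketch on a cut-down frame. The choice and order of columns in `wanted` is only an example, and the one-row frame stands in for the full one built above.

```python
import pandas as pd

# One-row stand-in for the full frame, using column names from the answer
df = pd.DataFrame({
    "Normalised City Pair": ["London, United Kingdom - New York, United States"],
    "Departure Code": ["LHR"],
    "Arrival Code": ["JFK"],
    "Departure_lat": [51.5],
    "Departure_lon": [-0.45],
})

# Indexing with a list of names both selects the columns and fixes their order
wanted = ["Departure Code", "Arrival Code", "Normalised City Pair"]
final_df = df[wanted].copy()
print(final_df.columns.tolist())
```

The .copy() is optional; it makes final_df independent of df so that later edits to one do not raise chained-assignment warnings.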
