Error on empty result when writing a conditional search function for a DataFrame

Asked 2025-04-12 12:59

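I wrote a function that searches a DataFrame on multiple conditions using df.loc and then extracts data from the resulting subset. It works well, just as I expected.

However, if df.loc finds no rows (usually because the data does not start as early as the year the function requests), I get an error.

My function looks roughly like this: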

#import numpy as np
import pandas as pd

mm_df = pd.read_csv("mm_copy.csv", index_col=False, low_memory=False) # NB "index_col = False" is necessary because otherwise Python tries to use first column as index, \
                                                       #and throws everything out by one column

countries_df = pd.read_csv("countries.csv") 
south_america_df = countries_df.loc[countries_df.Continent == "South America"]
south_american_countries_list = south_america_df.Entity.tolist() # Creates complete list of South American countries, using local definitions

population_df = pd.read_csv("population.csv", low_memory=False)
population_df.set_index("Location")


def calculate_crude_maternal_mortality(year):
    
    # Establishes total WHO female population of South America
    who_female_pop_slice = population_df.loc[(population_df.Location == "South America") & (population_df.Time == year)]
    who_female_pop = who_female_pop_slice["TPopulationFemale1July"].values[0]*1000


    #Calc total South American Female population in relevant year
    south_american_female_pop_list = []
    south_american_mm_deaths_list = []


    mm_dict_for_year = {}

    for country in south_american_countries_list: # iterates through list of South American countries

        mm_dict_value_list = []

        if mm_df["Country_Name"].str.contains(country).any(): # checks country is in population dataframe
            mm_df_slice = mm_df.loc[(mm_df.Country_Name == country) & (mm_df.Year == year) & (mm_df.Sex == "Female") & (mm_df.Age_group_code=="Age_all")] # locates appropriate row which has country and year
            if mm_df_slice is not None:   #int(mm_df_slice.Number.values[0]) >= 0:
                country_mm_deaths = int(mm_df_slice.Number.values[0]) # finds cell with female population
                mm_dict_value_list.append(country_mm_deaths) # adds this population to the totals list
        else:
            mm_dict_value_list.append("No data")
        
        if population_df["Location"].str.contains(country).any(): # checks country is in population dataframe
            population_df_slice = population_df.loc[(population_df.Location == country) & (population_df.Time == year)] # locates appropriate row which has  country and year
            country_female_pop = int(population_df_slice["TPopulationFemale1July"].values[0] * 1000) # finds cell with female population, multiplies by 1000, and casts as int
            mm_dict_value_list.append(country_female_pop) # adds this population to the totals list
        else:
            mm_dict_value_list.append("No data")

        mm_dict_for_year[country] = mm_dict_value_list

    
    #print(mm_dict_for_year) #IMPORTANT TRACER

    # Create dictionary of actual mortalities and populations, country by country, for the requested year

    for country in mm_dict_for_year:
        if mm_dict_for_year[country][0] != "No data":
            if mm_dict_for_year[country][1] != "No data":
                south_american_mm_deaths_list.append(mm_dict_for_year[country][0])
                south_american_female_pop_list.append(mm_dict_for_year[country][1])

    data_coverage_percentage = round((sum(south_american_female_pop_list)/who_female_pop*100), 2)
    crude_maternal_mortality_rate = round((sum(south_american_mm_deaths_list)/sum(south_american_female_pop_list)*100000), 2)

    print("Data coverage = " + str(data_coverage_percentage * 100))

    if data_coverage_percentage < 80:
        print("Coverage below threshold of 80%")
    else:
        print(str(year) + ": Crude maternal mortality rate for all of South America was " + str(crude_maternal_mortality_rate) + " per 100000 females, and data coverage was " + str(data_coverage_percentage) + "%.")

 
    
calculate_crude_maternal_mortality(1993)

And my data CSV file (a few sample rows shown here) looks like this:

Region_Code,Region_Name,Country_Code,Country_Name,Year,Sex,Age_group_code,Age_Group,Number,Percentage_cause-specific_deaths_of_total_deaths,Age-standardized_death_rate_per_100k_standard_pop,Death_rate_per_100k_pop
CSA,Central and South America,PRY,Paraguay,2014,Female,Age70_74,[70-74],0.00000000,0.00000000,,0.00000000,
CSA,Central and South America,PRY,Paraguay,2014,Female,Age75_79,[75-79],0.00000000,0.00000000,,0.00000000,
CSA,Central and South America,PRY,Paraguay,2014,Female,Age80_84,[80-84],0.00000000,0.00000000,,0.00000000,
CSA,Central and South America,PRY,Paraguay,2014,Female,Age85_over,[85+],0.00000000,0.00000000,,0.00000000,
CSA,Central and South America,PRY,Paraguay,2014,Female,Age_unknown,[Unknown],0.00000000,0.00000000,,,
CSA,Central and South America,PER,Peru,1995,All,Age_all,[All],301.00000000,0.32901569,1.27784199,1.23872595,
CSA,Central and South America,PER,Peru,1995,All,Age00,[0],0.00000000,0.00000000,,0.00000000,
CSA,Central and South America,PER,Peru,1995,All,Age01_04,[1-4],0.00000000,0.00000000,,0.00000000,
CSA,Central and South America,PER,Peru,1995,All,Age05_09,[5-9],0.00000000,0.00000000,,0.00000000,
CSA,Central and South America,PER,Peru,1995,All,Age10_14,[10-14],2.00000000,0.17346054,,0.07080373,
CSA,Central and South America,PER,Peru,1995,All,Age15_19,[15-19],32.00000000,1.63432074,,1.24696102,
CSA,Central and South America,PER,Peru,1995,All,Age20_24,[20-24],63.00000000,2.43055556,,2.72412878,
CSA,Central and South America,PER,Peru,1995,All,Age25_29,[25-29],52.00000000,2.07750699,,2.57749333,

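For example, when my loop looks up data for 1993 and there is none, the result of df.loc is empty, and that raises an error.

Is there any way to avoid this? I thought about first finding the minimum value in the Year column and returning "No data" if the requested year is below it, but I don't know how to implement that, especially with multiple conditions (e.g. Sex = Female, Year = 1993).

Any help is much appreciated.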

1 Answer


The problem appears to be in the line containing:

if mm_df_slice is not None:

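It will never be None, because .loc returns a subset of the DataFrame; that subset can be empty, but it is never None. So when no rows match, the check still passes, and mm_df_slice.Number.values[0] then fails on the empty array.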
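To see this in isolation, here is a minimal sketch on a made-up three-row frame. The data is hypothetical; only the column names follow the question. It shows that a mask matching nothing yields an empty DataFrame rather than None, and that indexing `.values[0]` on it is what raises:

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the question.
df = pd.DataFrame({
    "Country_Name": ["Peru", "Peru", "Paraguay"],
    "Year": [1995, 1996, 2014],
    "Number": [301.0, 280.0, 0.0],
})

# No row matches 1993, so .loc returns an empty DataFrame, not None.
empty_slice = df.loc[(df.Country_Name == "Peru") & (df.Year == 1993)]

slice_is_none = empty_slice is None   # the `is not None` check passes regardless
slice_is_empty = empty_slice.empty    # this is the check that detects "no rows"

# Taking the first element of an empty array is what actually fails.
try:
    empty_slice.Number.values[0]
    raised_index_error = False
except IndexError:
    raised_index_error = True
```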

Instead of using the following logic:

if df.str.contains(x):
    if subset is not None:
        ...

use a more "Pythonic" approach:

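# Select the rows directly, without the separate str.contains check
mm_df_slice = mm_df.loc[(mm_df.Country_Name == country) & (mm_df.Year == year) & (mm_df.Sex == "Female") & (mm_df.Age_group_code == "Age_all")]
# Then check whether any rows were found
if not mm_df_slice.empty:
    # Same extraction as in the question: first matching value, cast to int
    country_mm_deaths = int(mm_df_slice.Number.values[0])
    mm_dict_value_list.append(country_mm_deaths)
else:
    mm_dict_value_list.append('No data')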
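The same check could also be wrapped in a small helper so that both lookups in the question share it. This is only a sketch: `first_value_or_none` and `deaths_for` are hypothetical names, and the toy frame stands in for the real CSV.

```python
import pandas as pd

def first_value_or_none(df, mask, column):
    """Return the first value of `column` among rows where `mask` is True,
    or None when no row matches."""
    matches = df.loc[mask, column]
    if matches.empty:
        return None
    return matches.iloc[0]

# Hypothetical toy data; only the column names mirror the question.
mm_df = pd.DataFrame({
    "Country_Name": ["Peru", "Peru"],
    "Year": [1995, 1995],
    "Sex": ["Female", "All"],
    "Age_group_code": ["Age_all", "Age_all"],
    "Number": [150.0, 301.0],
})

def deaths_for(country, year):
    # All conditions are combined into one boolean mask.
    mask = (
        (mm_df.Country_Name == country)
        & (mm_df.Year == year)
        & (mm_df.Sex == "Female")
        & (mm_df.Age_group_code == "Age_all")
    )
    value = first_value_or_none(mm_df, mask, "Number")
    return "No data" if value is None else int(value)

found = deaths_for("Peru", 1995)    # a matching row exists
missing = deaths_for("Peru", 1993)  # no row for 1993
```

Because the mask already combines every condition, there is no need to compare the requested year against the minimum of the Year column: a year with no data simply produces an empty selection.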
