Cleaning up inconsistent data categories in Python

1 answer

User

#1 · Posted 2024-05-14 00:20:20

Here are my comments on your question:

The first block gives me a DataFrame similar to what I imagine yours looks like:

import pandas as pd
import numpy as np

your_list = np.array(['Sc.B','S.B.','Dual Degree','BEng. 2master','M.Phil.','Masters degree'])
names = np.array([f"person_{ii}" for ii in range(len(your_list))])

df = pd.DataFrame({"names": names, "degree_title": your_list})
print(df)

Now we can loop over the degree titles with a for loop. A first guess at what the degree titles look like gives the following:

new_classifications = [] # Make an empty list so we can keep track of what we classify the new degree as.

for degree in df["degree_title"]:
    if "bachelor" in degree.lower(): # lower() as we don't care if it's "Bachelor" or "bachelor"
        new_classifications.append("bachelor") # Anything here is good enough to be called "bachelor"
    elif "master" in degree.lower():
        new_classifications.append("master")
    elif "doctorate" in degree.lower():
        new_classifications.append("phd")
    else:
        new_classifications.append("unclassified")
        print(f"no classification found for {degree}")

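This tells us that we are missing a lot of results like B.Sc, so we can add checks for those in a second attempt. Note the additions to the "bachelor" and "master" lines.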

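Note that one row is an "edge case": from the title alone I can't guess that "Specialisation" is a master-level qualification, so we have to handle it "manually".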

new_classifications = [] 

for degree in df["degree_title"]:
    if "bachelor" in degree.lower() or degree.lower().startswith("b") or "b." in degree.lower():
        new_classifications.append("bachelor")
    elif "B" in degree and degree.isupper(): # Also require the whole title to be uppercase 
        new_classifications.append("bachelor")
    elif "master" in degree.lower() or degree.lower().startswith("m") or "m." in degree.lower():
        new_classifications.append("master")
    elif "M" in degree and degree.isupper():
        new_classifications.append("master")
    elif "doctorate" in degree.lower():
        new_classifications.append("phd")
    elif degree in ["Diplom", "Fellowship", "CPA", "Specialisation", "Graduate Diploma"]:
        new_classifications.append("some_classification_that_you_write_for_these_edge_cases")
    else:
        new_classifications.append("unclassified")
        print(f"no classification found for {degree}")
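The same rules can also be packaged as a function, which makes them easier to test and reuse. This is only a sketch: `classify_degree` is a hypothetical name, the rule order simply mirrors the loop above, and the hand-written edge-case list is left out for brevity.

```python
def classify_degree(title: str) -> str:
    """Map a free-text degree title to a coarse category (hypothetical helper)."""
    lowered = title.lower()
    # Same checks as the loop above, in the same order.
    if "bachelor" in lowered or lowered.startswith("b") or "b." in lowered:
        return "bachelor"
    if "B" in title and title.isupper():
        return "bachelor"
    if "master" in lowered or lowered.startswith("m") or "m." in lowered:
        return "master"
    if "M" in title and title.isupper():
        return "master"
    if "doctorate" in lowered:
        return "phd"
    return "unclassified"


titles = ["Sc.B", "S.B.", "Dual Degree", "BEng. 2master", "M.Phil.", "Masters degree"]
labels = [classify_degree(t) for t in titles]
print(labels)
```

With the DataFrame from the first block, the equivalent call would be `df["degree_title"].apply(classify_degree)`.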

Once we're happy, we can add the cleaned-up classification to the DataFrame:

df["new_classification"] = new_classifications
print(df)
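Before trusting the new column, it can help to count how many titles landed in each class and to list what is still unclassified. The sketch below uses only the standard library; the two lists are hypothetical stand-ins for the `degree_title` and `new_classification` columns.

```python
from collections import Counter

# Hypothetical values, shaped like the two DataFrame columns above.
degree_titles = ["Sc.B", "S.B.", "Dual Degree", "BEng. 2master", "M.Phil.", "Masters degree"]
classifications = ["unclassified", "bachelor", "unclassified", "bachelor", "master", "master"]

# How many titles fell into each class.
counts = Counter(classifications)

# Distinct titles that still need a rule or a manual decision.
leftovers = sorted(
    {title for title, label in zip(degree_titles, classifications) if label == "unclassified"}
)

print(counts)
print(leftovers)
```

On the DataFrame itself, `df["new_classification"].value_counts()` gives the same counts.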

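This is a very "brute force" way of solving the problem, but since many degree titles follow similar patterns, it is a very simple way to get started. It removes a lot of the work and leaves far fewer titles to classify by hand.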
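For the titles that remain, one option is a small hand-written lookup that overrides the automatic label. This is a sketch under assumptions: `manual_overrides` and `apply_overrides` are hypothetical names, and the entries in the dictionary are example decisions that you would replace with your own.

```python
# Hypothetical hand-written decisions for titles the rules cannot guess.
manual_overrides = {
    "Sc.B": "bachelor",          # assumed here to be a bachelor-level degree
    "Specialisation": "master",  # the master-level edge case mentioned above
}


def apply_overrides(titles, labels, overrides):
    """Return labels where any title found in `overrides` takes the manual value."""
    return [overrides.get(title, label) for title, label in zip(titles, labels)]


titles = ["Sc.B", "Dual Degree", "Specialisation", "M.Phil."]
labels = ["unclassified", "unclassified", "unclassified", "master"]

final_labels = apply_overrides(titles, labels, manual_overrides)
print(final_labels)
```

Keeping the manual decisions in one dictionary makes them easy to review separately from the pattern rules.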
