Data analysis: mapping values in a column and correcting typos
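I've been exploring a project's data in Jupyter and want to classify some records into specific company categories.
I ended up with one big chain of if statements, but the problem is that I can't use it to process each individual cell of a column, so I'm wondering whether there is a better way to do this. My first version of the code didn't really work either, so I tried to improve it with my limited Python knowledge.
I want to take a value from the SicCodes column, compare it against the mapping, and get a name as output. My initial idea was to parse the data with simple if statements and refine it from there. In practice, though, I can't pass the DataFrame into my little to_code_range function, so I considered using a for loop, which hasn't worked so far either.
Does anyone have suggestions for improving this?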
mappings = [
    (1000, 9990, 'Agriculture'),
    (10000, 14990, 'Mining'),
    (15000, 17990, 'Construction'),
    (18000, 19990, 'not used'),
    (20000, 39990, 'Manufacturing'),
    (40000, 49990, 'Utility Services'),
    (50000, 51990, 'Wholesale Trade'),
    (52000, 59990, 'Retail Trade'),
    (60000, 69200, 'Financials'),
    (70000, 90040, 'Services'),
    (91000, 97290, 'Public Administration'),
    (98000, 99990, 'Nonclassifiable'),
]
"""errors = set()
def to_code_range(i):
    if type(i) != int:
        print("Pas un int")
    if i=="None Supplied":
        return np.nan
    code = int(i)
    for code_from, code_to, name in mappings:
        if (code<=code_to)&(code>=code_from):
            return name
    errors.add(code)
    return np.nan"""
def to_code_range(valeur):
    if type(valeur) != int: print("Pas un int")
    code = int(valeur)
    if (code<1000): return np.nan
    if (code>=1000)&(code<=9990): return "Agriculture"
    if (code>=10000)&(code<=14990): return "Mining"
    if (code>=15000)&(code<=17990): return "Construction"
    if (code>=18000)&(code<=19990): return "not used"
    if (code>=20000)&(code<=39990): return "Manufacturing"
    if (code>=40000)&(code<=49990): return "Utility Services"
    if (code>=50000)&(code<=51990): return "Wholesale Trade"
    if (code>=52000)&(code<=59990): return "Retail Trade"
    if (code>=60000)&(code<=69200): return "Financials"
    if (code>=70000)&(code<=90040): return "Services"
    if (code>=91000)&(code<=97290): return "Public Administration"
    if (code>=98000)&(code<=99990): return "Nonclassifiable"
    else: return np.nan
#report['SICCode.SicText_1'] = to_code_range(report["SicCodes"])
for i in report['SicCodes']: report['SICCode.SicText_1'][i] = to_code_range(i)
I'm using a chain of if statements and a for loop, but I get an error when I run it.
1 Answer
I would do it like this:
import pandas as pd
import numpy as np
mappings = [
    (1000, 9990, 'Agriculture'),
    (10000, 14990, 'Mining'),
    (15000, 17990, 'Construction'),
    (18000, 19990, 'not used'),
    (20000, 39990, 'Manufacturing'),
    (40000, 49990, 'Utility Services'),
    (50000, 51990, 'Wholesale Trade'),
    (52000, 59990, 'Retail Trade'),
    (60000, 69200, 'Financials'),
    (70000, 90040, 'Services'),
    (91000, 97290, 'Public Administration'),
    (98000, 99990, 'Nonclassifiable'),
]
def to_code_range(valeur):
    if type(valeur) != int:
        print("Pas un int")
        return np.nan
    for code_from, code_to, name in mappings:
        if code_from <= valeur <= code_to:
            return name
    return np.nan
# Assuming 'report' is a DataFrame with a column 'SicCodes'
report = pd.DataFrame({
    'SicCodes': [1000, 15000, 20000, 40000, 50000, 60000, 70000, 91000, 98000]
})
report['SICCode.SicText_1'] = report['SicCodes'].apply(to_code_range)
print(report)
Interpreter output:
   SicCodes      SICCode.SicText_1
0      1000            Agriculture
1     15000           Construction
2     20000          Manufacturing
3     40000       Utility Services
4     50000        Wholesale Trade
5     60000             Financials
6     70000               Services
7     91000  Public Administration
8     98000        Nonclassifiable
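If the real SicCodes column also holds strings (the question mentions the value "None Supplied"), the per-row type check above would turn those rows into NaN even when the string is a valid code such as "15000". One possible way around that is to coerce the column to numbers first and then label whole ranges at once. The following is only a sketch: the helper name map_sic_codes and the sample values are hypothetical, and it assumes every non-numeric entry should simply end up as NaN.

```python
import numpy as np
import pandas as pd

mappings = [
    (1000, 9990, 'Agriculture'),
    (10000, 14990, 'Mining'),
    (15000, 17990, 'Construction'),
    (18000, 19990, 'not used'),
    (20000, 39990, 'Manufacturing'),
    (40000, 49990, 'Utility Services'),
    (50000, 51990, 'Wholesale Trade'),
    (52000, 59990, 'Retail Trade'),
    (60000, 69200, 'Financials'),
    (70000, 90040, 'Services'),
    (91000, 97290, 'Public Administration'),
    (98000, 99990, 'Nonclassifiable'),
]

def map_sic_codes(codes: pd.Series) -> pd.Series:
    """Hypothetical helper: label a whole column of SIC codes at once."""
    # Anything that cannot be parsed as a number (e.g. "None Supplied")
    # becomes NaN instead of raising an exception.
    numeric = pd.to_numeric(codes, errors="coerce")
    # Start with NaN everywhere, then fill in one range at a time.
    result = pd.Series(np.nan, index=codes.index, dtype=object)
    for code_from, code_to, name in mappings:
        # Series.between includes both endpoints by default.
        result[numeric.between(code_from, code_to)] = name
    return result

# Made-up sample data mixing ints, numeric strings and a placeholder string.
report = pd.DataFrame({'SicCodes': [1000, "15000", "None Supplied", 500, 99990]})
report['SICCode.SicText_1'] = map_sic_codes(report['SicCodes'])
print(report)
```

Codes that fall in the gaps between ranges (for example anything below 1000) stay NaN, which matches what the apply-based version returns for them.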