SAS programming: how can I use one column to fill in missing values in several other columns?

Posted 2024-05-16 01:45:01

Background

I have a large dataset in SAS with 17 variables: 4 numeric and 13 character/string. The original dataset I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data

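cylinders
condition
drive
paint_color
type
manufacturer
title_status
model
fuel
transmission
description
region
state
price (num)
posting_date (num)
odometer (num)
year (num)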

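After applying specific filters to the numeric columns, none of the numeric variables has missing values. The remaining 13 character/string variables, however, each have anywhere from a few thousand to several hundred thousand missing values.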

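Request

Similar to the data science blog post here (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regression-1b12386c69c8), specifically under its Feature Engineering section, how can I write equivalent SAS code that uses regular expressions on the description column to fill in the missing values of the categorical columns (cylinders, condition, drive, paint_color, and so on)?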

Here is the Python code from the blog post:

import re

manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'

keys =    ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer,   condition,   fuel,  title_status, transmission ,drive, size, type_, paint_color,   cylinders]

for i,column in zip(keys,columns):
    database[i] = database[i].fillna(
      database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()

database.drop('description', axis=1, inplace= True)
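
To make the per-row behaviour of that loop explicit before porting it, here is a minimal standard-library sketch for a single column. The helper `fill_missing`, the pattern name `FUEL`, and the sample rows are hypothetical and not from the blog post. One deliberate difference: the blog's patterns put literal spaces inside the alternatives (for example `gas ` with a trailing space), so the captured text can carry that whitespace; this sketch uses `\b` word boundaries instead.

```python
import re

# Hypothetical pattern for one column; \b replaces the literal spaces
# that the blog post's patterns use as crude word delimiters.
FUEL = re.compile(r"\b(gas|hybrid|diesel|electric)\b", re.IGNORECASE)


def fill_missing(value, description, pattern):
    """Keep a non-missing value; otherwise return the first match
    found in the description, lowercased, or None if nothing matches."""
    if value not in (None, ""):
        return value
    match = pattern.search(description or "")
    return match.group(1).lower() if match else None


# Made-up rows, for illustration only.
rows = [
    {"fuel": None, "description": "Clean title, DIESEL engine, runs great"},
    {"fuel": "gas", "description": "hybrid badge but listed as gas"},
    {"fuel": None, "description": "no fuel information here"},
]

for row in rows:
    row["fuel"] = fill_missing(row["fuel"], row["description"], FUEL)
```

The two properties a SAS port would need to preserve are visible here: existing values are never overwritten, and when several keywords appear, the one that occurs first in the text is used.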

What is the equivalent SAS code for the Python code shown above?


1 answer

User
#1 · Posted 2024-05-16 01:45:01

The Python code is basically just doing a few word searches.
A simplified example in SAS:
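data want;
  set have;
  /* a _temporary_ array needs an explicit size; (*) is not allowed with it */
  array _fuel(4) $ 8 _temporary_ ("gas", "hybrid", "diesel", "electric");
  /* only fill rows where fuel is missing, as the question asks */
  if missing(fuel) then do i = 1 to dim(_fuel);
    /* 'i' ignores case, 't' trims trailing blanks from both arguments */
    if find(description, _fuel(i), 'it') > 0 then fuel = _fuel(i);
    * does not deal with multiple finds, so the last one found will be kept;
  end;
  drop i;
run;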
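You can extend this by creating one array per variable and looping over each list in the same way. I think the loop could also be replaced with SAS's regular expression functions, but regex takes too much thought, so someone else will have to provide that answer.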
