Background
I have a large dataset in SAS with 17 variables, of which 4 are numeric and 13 are character/string. The raw dataset I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data
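After applying specific filters to the numeric columns, none of the numeric variables has missing values. However, each of the remaining 13 character/string variables has anywhere from thousands to hundreds of thousands of missing values.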
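Request
Similar to the data science blog post shown here (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regression-1b12386c69c8), specifically under the Feature Engineering section, how can I write the equivalent SAS code that uses regular expressions on the description column to fill in the missing values with categorical values (such as cylinders, condition, drive, paint_color, and so on)?
Here is the Python code from the blog post: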
import re
manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'
keys = ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer, condition, fuel, title_status, transmission ,drive, size, type_, paint_color, cylinders]
for i,column in zip(keys,columns):
    database[i] = database[i].fillna(
        database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()
database.drop('description', axis=1, inplace= True)
What is the equivalent SAS code for the Python code shown above?
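To make clear what any SAS port has to reproduce, here is a minimal sketch of what the loop does to a single row, written with only the standard `re` module instead of pandas. The function name `fill_from_description` and the sample strings are hypothetical. The pattern is the blog post's `fuel` pattern copied as-is, including the spaces around `|`, which are part of each alternative.

```python
import re

# The blog post's fuel pattern, unchanged. Note that "gas " and " diesel "
# carry their surrounding spaces as part of the alternative.
FUEL = r'(gas | hybrid | diesel |electric)'


def fill_from_description(value, description, pattern):
    """Mimic fillna(...extract(...)).str.lower() for one row (hypothetical helper)."""
    if value is None:
        # Only missing values are filled; the first match in the text wins.
        match = re.search(pattern, description, flags=re.IGNORECASE)
        value = match.group(1) if match else None
    # Lowercasing applies to existing and newly filled values alike.
    return value.lower() if value is not None else None


print(repr(fill_from_description(None, 'Clean HYBRID sedan', FUEL)))
print(repr(fill_from_description('Gas', 'diesel truck', FUEL)))
print(repr(fill_from_description(None, 'runs on diesel', FUEL)))
```

Two details are worth carrying over deliberately rather than by accident: the filled value keeps the matched spaces, and a keyword at the very end of the description is missed because the pattern demands a trailing space. A port may want to trim the result and match on word boundaries instead.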
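It is basically just doing some word searches.
A simplified example in SAS:
You can extend this by creating an array for each variable and then looping through the list. I think the loop could also be replaced with a REGEX command in SAS, but REGEX requires too much thinking, so someone else will have to provide that answer.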
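The array-and-loop idea is independent of the language, so here is a rough sketch of that logic in Python (not SAS code), to match the question's own code. The names `WORD_LISTS` and `fill_missing` and the shortened word lists are assumptions for illustration. In SAS the lists would be arrays and the loops would be DO loops.

```python
# Hypothetical, shortened word lists: one list per variable to fill.
WORD_LISTS = {
    'fuel': ['gas', 'hybrid', 'diesel', 'electric'],
    'transmission': ['automatic', 'manual'],
}


def fill_missing(row):
    """Return a copy of row with empty fields filled from its description."""
    # Split the description into lowercase words, dropping edge punctuation.
    words = {w.strip('.,;:!?()') for w in row['description'].lower().split()}
    filled = dict(row)
    for column, candidates in WORD_LISTS.items():
        if filled.get(column):
            continue  # keep values that are already present
        for candidate in candidates:
            if candidate in words:
                filled[column] = candidate  # first listed word found wins
                break
    return filled


example = {'fuel': None, 'transmission': 'manual',
           'description': 'Great car, Diesel, automatic.'}
print(fill_missing(example))
```

This word-by-word search does not handle multi-word values such as "like new" or "parts only". Those would need a substring or regular-expression search instead.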