A wrapper toolbox that provides a compatibility layer between TPOT, auto-sklearn and OpenML

Detailed description of the arbok Python project


arbok (AutoML wrapper toolbox for OpenML compatibility) provides wrappers for TPOT and auto-sklearn that act as a compatibility layer between these tools and OpenML.

The wrappers extend sklearn's BaseSearchCV and expose the internal attributes that OpenML requires, such as cv_results_, best_index_, best_params_, best_score_ and classes_.
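
As a rough illustration (not part of the original README), the sketch below fits a wrapper on a small synthetic dataset and reads those attributes back; the toy data and the small TPOT budget are assumptions made purely for this example.

from sklearn.datasets import make_classification
from arbok import TPOTWrapper

# Hypothetical toy data, only for illustration
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

clf = TPOTWrapper(generations=5, population_size=10, verbosity=0)
clf.fit(X, y)

print(clf.best_params_)  # parameters of the best pipeline found
print(clf.best_score_)   # its internal cross-validation score
print(clf.classes_)      # class labels seen during fitting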

Installation

pip install arbok

Simple example

import openml
from arbok import AutoSklearnWrapper, TPOTWrapper

task = openml.tasks.get_task(31)
dataset = task.get_dataset()

# Get the AutoSklearn wrapper and pass parameters like you would to AutoSklearn
clf = AutoSklearnWrapper(time_left_for_this_task=3600, per_run_time_limit=360)

# Or get the TPOT wrapper and pass parameters like you would to TPOT
clf = TPOTWrapper(generations=100, population_size=100, verbosity=2)

# Execute the task
run = openml.runs.run_model_on_task(task, clf)
run.publish()
print('URL for run: %s/run/%d' % (openml.config.server, run.run_id))

Preprocessing the data

To make the wrappers more robust, we need to preprocess the data: we can impute missing values and one-hot encode the categorical features.

First, we get a mask that tells us whether or not each feature is categorical.

dataset = task.get_dataset()
_, categorical = dataset.get_data(return_categorical_indicator=True)
categorical = categorical[:-1]  # Remove last index (which is the class)

Next, we set up a preprocessing pipeline. We use the ConditionalImputer, an imputer that can apply different strategies to categorical (nominal) and numerical data.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from arbok import ConditionalImputer

preprocessor = make_pipeline(
    ConditionalImputer(
        categorical_features=categorical,
        strategy="mean",
        strategy_nominal="most_frequent"
    ),
    OneHotEncoder(
        categorical_features=categorical,
        handle_unknown="ignore",
        sparse=False
    )
)

Finally, we pass everything to the wrapper.

clf = AutoSklearnWrapper(
    preprocessor=preprocessor,
    time_left_for_this_task=3600,
    per_run_time_limit=360
)

Limitations

  • Currently only classifiers are implemented, so regression is not possible.
  • For TPOT, the config_dict variable cannot be set, because it causes problems in the API.

Benchmarking

Installing the arbok package also installs the arbench CLI tool. We can generate a JSON config file like this:

from arbok.bench import Benchmark

bench = Benchmark()

config_file = bench.create_config_file(
    # Wrapper parameters
    wrapper={"refit": True, "verbose": False, "retry_on_error": True},

    # TPOT parameters
    tpot={
        "max_time_mins": 6,        # Max total time in minutes
        "max_eval_time_mins": 1    # Max time per candidate in minutes
    },

    # Autosklearn parameters
    autosklearn={
        "time_left_for_this_task": 360,  # Max total time in seconds
        "per_run_time_limit": 60         # Max time per candidate in seconds
    }
)

Then, we can call arbench like this:

arbench --classifier autosklearn --task-id 31 --config config.json

Or call arbok as a Python module:

python -m arbok --classifier autosklearn --task-id 31 --config config.json

Running benchmarks on a batch system

To run a large-scale benchmark, we can create a config file, generate jobs and submit them to a batch system, as shown below.

# We create a benchmark setup where we specify the headers, the interpreter we
# want to use, the directory to where we store the jobs (.sh-files), and we give
# it the config-file we created earlier.
bench = Benchmark(
    headers="#PBS -lnodes=1:cpu3\n#PBS -lwalltime=1:30:00",
    python_interpreter="python3",  # Path to interpreter
    root="/path/to/project/",
    jobs_dir="jobs",
    config_file="config.json",
    log_file="log.json"
)

# Create the config file like we did in the section above
config_file = bench.create_config_file(
    # Wrapper parameters
    wrapper={"refit": True, "verbose": False, "retry_on_error": True},

    # TPOT parameters
    tpot={
        "max_time_mins": 6,        # Max total time in minutes
        "max_eval_time_mins": 1    # Max time per candidate in minutes
    },

    # Autosklearn parameters
    autosklearn={
        "time_left_for_this_task": 360,  # Max total time in seconds
        "per_run_time_limit": 60         # Max time per candidate in seconds
    }
)

# Next, we load the tasks we want to benchmark on from OpenML.
# In this case, we load a list of task id's from study 99.
tasks = openml.study.get_study(99).tasks

# Next, we create jobs for both tpot and autosklearn.
bench.create_jobs(tasks, classifiers=["tpot", "autosklearn"])

# And finally, we submit the jobs using qsub
bench.submit_jobs()

Preprocessing parameters

from arbok import ParamPreprocessor
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline

X = np.array([
    [1, 2, True, "foo", "one"],
    [1, 3, False, "bar", "two"],
    [np.nan, "bar", None, None, "three"],
    [1, 7, 0, "zip", "four"],
    [1, 9, 1, "foo", "five"],
    [1, 10, 0.1, "zip", "six"]
], dtype=object)

# Manually specify types, or use types="detect" to automatically detect types
types = ["numeric", "mixed", "bool", "nominal", "nominal"]

pipeline = make_pipeline(ParamPreprocessor(types="detect"), VarianceThreshold())
pipeline.fit_transform(X)

Output:

[[-0.4472136  -0.4472136   1.41421356 -0.70710678 -0.4472136  -0.4472136
   2.23606798 -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136
  -0.85226648  1.        ]
 [-0.4472136   2.23606798 -0.70710678 -0.70710678 -0.4472136  -0.4472136
  -0.4472136  -0.4472136  -0.4472136   2.23606798  0.4472136  -0.4472136
  -0.5831297  -1.        ]
 [ 2.23606798 -0.4472136  -0.70710678 -0.70710678 -0.4472136  -0.4472136
  -0.4472136  -0.4472136   2.23606798 -0.4472136  -2.23606798  2.23606798
  -1.39054004 -1.        ]
 [-0.4472136  -0.4472136  -0.70710678  1.41421356 -0.4472136   2.23606798
  -0.4472136  -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136
   0.49341743 -1.        ]
 [-0.4472136  -0.4472136   1.41421356 -0.70710678  2.23606798 -0.4472136
  -0.4472136  -0.4472136  -0.4472136  -0.4472136   0.4472136  -0.4472136
   1.031691    1.        ]
 [-0.4472136  -0.4472136  -0.70710678  1.41421356 -0.4472136  -0.4472136
  -0.4472136   2.23606798 -0.4472136  -0.4472136   0.4472136  -0.4472136
   1.30082778  1.        ]]
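
Since the AutoSklearnWrapper accepts an sklearn-compatible transformer through its preprocessor argument, a ParamPreprocessor could presumably be plugged in the same way. The combination below is a sketch based on that assumption, not an example from the original README.

from arbok import AutoSklearnWrapper, ParamPreprocessor

# Assumed combination: reuse the type-detecting preprocessor inside the wrapper
clf = AutoSklearnWrapper(
    preprocessor=ParamPreprocessor(types="detect"),
    time_left_for_this_task=3600,
    per_run_time_limit=360
)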
