datarobot-batch-scoring

A script to score CSV files via DataRobot's prediction API.

https://coveralls.io/repos/github/datarobot/batch-scoring/badge.svg?branch=master
https://travis-ci.com/datarobot/batch-scoring.svg?branch=master
https://badge.fury.io/py/datarobot-batch-scoring.svg

Version compatibility

We aim to support as many DataRobot versions as possible with each batch_scoring release, but occasionally backend changes introduce incompatibilities. The chart below tracks compatibility between versions of this tool and DataRobot versions. If you are not sure which version of DataRobot you are using, contact DataRobot Support for assistance.

batch_scoring version    DataRobot version
<=1.10                   2.7, 2.8, 2.9
>=1.11, <1.13            3.0, 3.1+
>=1.13                   2.7, 2.8, 2.9, 3.0, 3.1+

The batch_scoring_deployment_aware command works only with newer DataRobot versions.

batch_scoring_deployment_aware version    DataRobot version
>=1.14                                    4.4+

How to install

Install or upgrade to the latest version:

$ pip install -U datarobot_batch_scoring

To install a specific version:

$ pip install datarobot_batch_scoring==x.y.z

Alternative installs

We publish two alternative installation methods on the releases page, for environments where internet access is restricted or Python is unavailable.

offlinebundle:

For performing installations in environments where Python 2.7 or Python 3+ is available but there is no internet access. Does not require administrative privileges or pip. Works on Linux, OSX, or Windows.

These files have “offlinebundle” in their name on the release page.

PyInstaller:

Using PyInstaller, we build a single-file executable that does not depend on Python. It depends only on libc and can be installed without administrative privileges. Right now we publish builds that work for most Linux distros made since CentOS 5. OSX and Windows are also supported.

These files have “executables” in their name on the release page.

Features

  • Concurrent requests (--n_concurrent)
  • Pause/resume
  • gzip support
  • Custom delimiters
  • Parallel processing
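As an illustration of the gzip support listed above, both the input dataset and the output file may be gzip-compressed (extension .gz). A minimal Python sketch of writing and reading such a file (the file name is hypothetical; batch_scoring handles this for you):

```python
import csv
import gzip
import os
import tempfile

# Hypothetical gzipped predictions file, as batch_scoring can produce with --out.
path = os.path.join(tempfile.mkdtemp(), "pred.csv.gz")

# Write a small gzipped CSV.
rows = [["row_id", "prediction"], ["0", "0.42"], ["1", "0.87"]]
with gzip.open(path, "wt", newline="") as f:
    csv.writer(f).writerows(rows)

# Read it back; gzip.open decompresses transparently in text mode.
with gzip.open(path, "rt", newline="") as f:
    data = list(csv.reader(f))

print(data[1])  # ['0', '0.42']
```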

Running batch_scoring, batch_scoring_sse, or batch_scoring_deployment_aware

You can execute the batch_scoring, batch_scoring_sse, and batch_scoring_deployment_aware commands with the relevant arguments from the command line, or you can pass arguments to the scripts from an .ini file. Put the .ini file in your home directory or in the directory from which you run the batch_scoring, batch_scoring_sse, or batch_scoring_deployment_aware command. Use the syntax and arguments below to define the parameters. Note that if you run the script and pass arguments on the command line, the command-line arguments take precedence.

The following table describes the syntax conventions; the syntax for running the scripts follows the table. DataRobot provides three scripts, each for a different application. Use:

  • batch_scoring to score on dedicated prediction instances.
  • batch_scoring_sse to score on standalone prediction instances. If you are unsure of your instance type, contact DataRobot Support.
  • batch_scoring_deployment_aware to score on dedicated prediction instances using a deployment_id instead of a project_id and model_id pair.
Convention    Meaning
[ ]           Optional argument
< >           User-supplied value
{ | }         Required, mutually exclusive

Required arguments:

batch_scoring --host=<host> --user=<user> <project_id> <model_id> <dataset_filepath> --datarobot_key=<datarobot_key> {--password=<pwd> | --api_token=<api_token>}

batch_scoring_deployment_aware --host=<host> --user=<user> <deployment_id> <dataset_filepath> --datarobot_key=<datarobot_key> {--password=<pwd> | --api_token=<api_token>}

batch_scoring_sse --host=<host> <import_id> <dataset_filepath>

Additional recommended arguments:

[--verbose] [--keep_cols=<keep_cols>] [--n_concurrent=<n_concurrent>]

Additional optional arguments:

[--out=<filepath>] [--api_version=<api_version>] [--pred_name=<string>] [--timeout=<timeout>] [--create_api_token] [--n_retry=<n_retry>] [--delimiter=<delimiter>] [--resume] [--no-resume] [--skip_row_id] [--output_delimiter=<delimiter>]

Argument descriptions: the following describes each argument:

Each argument below is marked with the prediction instance types it applies to: standalone, dedicated, or both.

host=<host> (standalone, dedicated)
    Specifies the hostname of the prediction API endpoint (the location of the data to use for predictions).
user=<user> (dedicated)
    Specifies the username used to acquire the API token. Use quotes if the name contains spaces.
<import_id> (standalone)
    Specifies the unique ID for the imported model. If unknown, ask your prediction administrator (the person responsible for the import procedure).
<project_id> (dedicated)
    Specifies the project identification string. You can find the ID embedded in the URL that displays when you are in the Leaderboard (for example, https://<host>/projects/<project_id>/models). Alternatively, when the prediction API is enabled, the project ID displays in the example shown when you click Deploy Model for a specific model in the Leaderboard.
<model_id> (dedicated)
    Specifies the model identification string. You can find the ID embedded in the URL that displays when you are in the Leaderboard and have selected a model (for example, https://<host>/projects/<project_id>/models/<model_id>). Alternatively, when the prediction API is enabled, the model ID displays in the example shown when you click Deploy Model for a specific model in the Leaderboard.
<deployment_id> (dedicated)
    Specifies the unique ID of the deployed model; can be used instead of the <project_id> and <model_id> pair.
<dataset_filepath> (standalone, dedicated)
    Specifies the .csv input file that the script scores. DataRobot scores models by submitting prediction requests against <dataset_filepath> using project <project_id> and model <model_id>.
datarobot_key=<datarobot_key> (dedicated)
    An additional datarobot_key for dedicated prediction instances. This argument is required when using on-demand workers on the Cloud platform, but not for Enterprise users.
password=<pwd> (dedicated)
    Specifies the password used to acquire the API token. Use quotes if the password contains spaces. You must specify either the password or the API token argument. To avoid entering your password each time you run the script, use the api_token argument instead.
api_token=<api_token> (dedicated)
    Specifies the API token for requests; if you do not have a token, you must specify the password argument. You can retrieve your token from your profile on the My Account page.
api_version=<api_version> (standalone, dedicated)
    Specifies the API version for requests. If omitted, defaults to the current latest. Override this if your DataRobot distribution does not support the latest API version. Valid options are predApi/1.0 and api/v1; predApi/1.0 is the default.
out=<filepath> (standalone, dedicated)
    Specifies the file name, and optionally path, to which the results are written. If not specified, the default file name is out.csv, written to the directory containing the script. The output file must be a single .csv file, which may be gzipped (extension .gz).
verbose (standalone, dedicated)
    Provides status updates while the script is running. It is recommended that you include this argument to track script execution progress. Silent mode (non-verbose), the default, displays very little output.
keep_cols=<keep_cols> (standalone, dedicated)
    Specifies the column names to append to the predictions. Enter as a comma-separated list.
max_prediction_explanations=<num> (standalone, dedicated)
    Specifies the number of top prediction explanations to generate for each prediction. If not specified, the default is 0 (no explanations). Compatible only with api_version predApi/1.0.
n_samples=<n_samples> (standalone, dedicated)
    Specifies the number of samples (rows) to use per batch. If not defined, the auto_sample option is used.
n_concurrent=<n_concurrent> (standalone, dedicated)
    Specifies the number of concurrent requests to submit. By default, the script submits four concurrent requests. Set n_concurrent to match the number of cores in the prediction API endpoint.
create_api_token (standalone, dedicated)
    Requests a new API token. To use this option, you must specify the password argument for this request (not the api_token argument). Specifying this argument invalidates your existing API token and creates and stores a new token for future prediction requests.
n_retry=<n_retry> (standalone, dedicated)
    Specifies the number of times DataRobot will retry if a request fails. A value of -1, the default, specifies an infinite number of retries.
pred_name=<pred_name> (standalone, dedicated)
    Applies a name to the prediction column of the output file. If you do not supply the argument, the column name is blank. For binary predictions, only positive class columns are included in the output. The last class (in lexical order) is used as the name of the prediction column.
skip_row_id (standalone, dedicated)
    Skips the row_id column in the output.
output_delimiter=<delimiter> (standalone, dedicated)
    Specifies the delimiter for the output CSV file. The special keyword "tab" can be used to indicate a tab-delimited CSV.
timeout=<timeout> (standalone, dedicated)
    The time, in seconds, that DataRobot tries to make a connection to satisfy a prediction request. When the timeout expires, the client (the batch_scoring or batch_scoring_sse command) closes the connection and retries, up to the number of times defined by the value of n_retry. The default value is 30 seconds.
delimiter=<delimiter> (standalone, dedicated)
    Specifies the delimiter to recognize in the input .csv file (e.g., --delimiter=","). If not specified, the script tries to automatically determine the delimiter. The special keyword "tab" can be used to indicate a tab-delimited CSV.
resume (standalone, dedicated)
    Starts the prediction from the point at which it was halted. If the prediction stopped, for example due to an error or network connection issue, you can run the same command with all the same arguments plus this resume argument. If you do not include this argument, and the script detects that a previous script was interrupted mid-execution, DataRobot prompts whether to resume. When resuming a script, you cannot change the dataset, project_id, model_id, n_samples, or out arguments.
no-resume (standalone, dedicated)
    Starts the prediction from scratch, disregarding any previous run.
help (standalone, dedicated)
    Shows usage help for the command.
fast (standalone, dedicated)
    Experimental: enables a faster .csv processor. Note that this method does not support multiline CSV files.
stdout (standalone, dedicated)
    Sends all log messages to stdout. If not specified, the command sends log messages to the datarobot_batch_scoring_main.log file.
auto_sample (standalone, dedicated)
    Overrides the n_samples value and instead uses chunks of roughly 2.5 MB to improve throughput. Enabled by default.
encoding (standalone, dedicated)
    Specifies the dataset encoding. If not provided, the batch_scoring or batch_scoring_sse script attempts to detect the encoding (e.g., "utf-8", "latin-1", or "iso2022_jp"). See the Python standard encodings for a list of valid values.
skip_dialect (standalone, dedicated)
    Specifies that the script skip CSV dialect detection and use the default "excel" dialect for CSV parsing. By default, the scripts detect the CSV dialect for proper batch generation on the client side.
ca_bundle=<ca_bundle> (standalone, dedicated)
    Specifies the path to a CA_BUNDLE file or directory with certificates of trusted Certificate Authorities (CAs) to be used for SSL verification. Note: if passed a path to a directory, the directory must have been processed using the c_rehash utility supplied with OpenSSL.
no_verify_ssl (standalone, dedicated)
    Disables SSL verification.
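The interplay of n_samples, n_concurrent, and n_retry described above can be pictured with a minimal sketch. This is an illustration of the batching idea only, not the tool's actual implementation; score_batch is a hypothetical stand-in for a real prediction API request:

```python
import concurrent.futures

def score_batch(batch, n_retry=3):
    """Hypothetical stand-in for one prediction request.
    Retries up to n_retry times if the request raises."""
    for attempt in range(n_retry):
        try:
            # A real client would POST `batch` to the prediction endpoint;
            # here we just return dummy "scores".
            return [len(row) for row in batch]
        except Exception:
            if attempt == n_retry - 1:
                raise

def score_rows(rows, n_samples=2, n_concurrent=4):
    """Split `rows` into batches of n_samples rows and score them
    with up to n_concurrent requests in flight at once."""
    batches = [rows[i:i + n_samples] for i in range(0, len(rows), n_samples)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        results = list(pool.map(score_batch, batches))
    # Flatten per-batch results back into row order.
    return [score for batch in results for score in batch]

scores = score_rows([["a"], ["b", "c"], ["d"]], n_samples=2, n_concurrent=2)
print(scores)  # [1, 2, 1]
```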

Examples:

batch_scoring --host=https://mycorp.orm.datarobot.com/ --user="greg@mycorp.com" --out=pred.csv 5545eb20b4912911244d4835 5545eb71b4912911244d4847 /home/greg/Downloads/diabetes_test.csv
batch_scoring_sse --host=https://mycorp.orm.datarobot.com/ --out=pred.csv 0ec5bcea7f0f45918fa88257bfe42c09 /home/greg/Downloads/diabetes_test.csv
batch_scoring_deployment_aware --host=https://mycorp.orm.datarobot.com/ --user="greg@mycorp.com" --out=pred.csv 5545eb71b4912911244d4848 /home/greg/Downloads/diabetes_test.csv

Using a configuration file

When invoked, the batch_scoring command checks whether a batch_scoring.ini file exists in the directory from which the script is run (the working directory) and, if not found there, in $HOME/batch_scoring.ini (your home directory). If this file exists, the command uses the same parameters described above from it. If the file does not exist, the command runs normally using command-line arguments. Command-line arguments take precedence over file parameters (i.e., file parameters can be overridden from the command line).

The format of the batch_scoring configuration file is as follows:

[batch_scoring]
host=file_host
project_id=file_project_id
model_id=file_model_id
user=file_username
password=file_password
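The lookup order described above (working directory first, then the home directory) can be sketched with Python's standard configparser module. This is an illustrative sketch of the behavior, not the tool's own code:

```python
import configparser
import os
import tempfile

def load_batch_scoring_config(workdir, homedir):
    """Return the [batch_scoring] options from the first
    batch_scoring.ini found: working directory, then home."""
    for directory in (workdir, homedir):
        path = os.path.join(directory, "batch_scoring.ini")
        if os.path.exists(path):
            parser = configparser.ConfigParser()
            parser.read(path)
            return dict(parser["batch_scoring"])
    return {}  # no config file: rely on command-line arguments only

# Demo: an ini file only in the (temporary) home directory.
workdir = tempfile.mkdtemp()
homedir = tempfile.mkdtemp()
with open(os.path.join(homedir, "batch_scoring.ini"), "w") as f:
    f.write("[batch_scoring]\nhost=file_host\nuser=file_username\n")

config = load_batch_scoring_config(workdir, homedir)
print(config["host"])  # file_host
```

A file in the working directory would win over the home-directory one, mirroring the precedence the text describes.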

Usage notes

  • If the script detects that a previous script was interrupted mid-execution, it prompts whether to resume that execution.
  • If no interrupted script is detected, or if you indicate not to resume the previous execution, the script checks whether the specified output file exists. If it does, the script prompts for confirmation before overwriting the file.
  • The logs from each batch_scoring or batch_scoring_sse run are stored in the current working directory. All users see a datarobot_batch_scoring_main.log log file. Windows users see two additional log files, datarobot_batch_scoring_batcher.log and datarobot_batch_scoring_writer.log.
  • Scoring some datasets can fail because of limitations in the standard Python CSV parser. To work around this, add an index column to the dataset; the column is ignored during scoring but helps the parser.
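The resume behavior in the notes above can be pictured as skipping input rows that were already scored before the interruption. A minimal sketch of the idea (not the tool's actual checkpoint format; `score` is a hypothetical per-row scoring function):

```python
def resume_scoring(input_rows, partial_output, score):
    """Score only the rows not yet present in the partial output.
    Rows are matched by position, so the input dataset must be
    unchanged between runs (as the resume argument requires)."""
    done = len(partial_output)
    return partial_output + [score(row) for row in input_rows[done:]]

# Three input rows; one was already scored before the interruption.
rows = [[1], [2], [3]]
partial = [10]
result = resume_scoring(rows, partial, score=lambda r: r[0] * 10)
print(result)  # [10, 20, 30]
```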

Supported platforms

datarobot_batch_scoring is tested on Linux, Windows, and OS X. Python 2.7.x and Python 3.x are supported.

Proxy support

The batch scoring scripts honor the standard HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables:

export HTTP_PROXY=http://192.168.1.3:3128
export HTTPS_PROXY=http://192.168.1.3:3128
export NO_PROXY=noproxy.domain.com

