Configuration

Files

To use the Python API or the command line interface for a specific PREDIT pipeline, configure the project first.

Configuration consists of general options that apply to all providers and provider-specific options, which live in separate YAML files.

All Providers

  • config.yaml

name:
providers:
use_auger_cloud:
source:
exclude:
target:
model_type:
experiment:
  metric:
  cross_validation_folds:
  max_total_time:
  max_eval_time:
  max_n_trials:
  use_ensemble:
  validation_source:

review:
  metric:

  roi:
    filter:
    revenue:
    investment:

  alert:
    active: True
    type: model_accuracy
    threshold: 0.7
    sensitivity: 72
    action: retrain_deploy
    notification: user

Attributes

  • name The project name.

  • providers List of providers: auger, google, azure.

  • use_auger_cloud Use Auger Cloud for all providers: true | false.

  • source Local file name, remote URL of the data source file, or Postgres URL. Postgres URL example: jdbc:postgresql://user:pwd@ec2-54-204-21-226.compute-1.amazonaws.com:5432/dbname?tablename=table1&offset=0&limit=100. Postgres URL parameters: dbname, tablename, offset (optional), limit (optional).

  • exclude List of columns to be excluded from the training data.

  • target Target column name.

  • model_type Model type: classification|regression.

  • experiment.metric Score used to optimize ML model.

    • Classification accuracy, precision_weighted, AUC_weighted, norm_macro_recall, average_precision_score_weighted

    • Auger only: Classification f1_macro, f1_micro, f1_weighted, neg_log_loss, precision_macro, precision_micro, recall_macro, recall_micro, recall_weighted

    • Auger only: Binary Classification average_precision, f1, f1_macro, f1_micro, f1_weighted, neg_log_loss, precision, precision_macro, precision_micro, recall, recall_macro, recall_micro, recall_weighted, roc_auc, cohen_kappa_score, matthews_corrcoef

    • Regression and/or Time Series spearman_correlation, r2, normalized_mean_absolute_error, normalized_root_mean_squared_error

    • Auger only: Regression and/or Time Series explained_variance, neg_median_absolute_error, neg_mean_absolute_error, neg_mean_squared_error, neg_mean_squared_log_error, neg_rmsle, neg_mase, mda, neg_rmse

  • experiment.cross_validation_folds Number of folds used for k-fold cross-validation of each trial.

  • experiment.max_total_time Maximum time to run experiment in minutes.

  • experiment.max_eval_time Maximum time to run individual trial in minutes.

  • experiment.max_n_trials Maximum number of trials to run before the experiment completes.

  • experiment.use_ensemble Try to improve model performance by creating ensembles from the trial models: true | false.

  • experiment.validation_source Path to the validation dataset. If not set, the source dataset is split to create a validation set.

  • review.metric Optional metric used for MLRAM review; can be any experiment metric, or roi. Defaults to the experiment metric.

  • review.roi.filter Filter that selects the records used to calculate ROI. See ROI formulas language for the syntax.

  • review.roi.revenue Formula for calculating revenue based on fields from the actuals. See ROI formulas language for the syntax.

  • review.roi.investment Formula for calculating investment based on fields from the actuals. See ROI formulas language for the syntax.

    • Special fields available in filters and formulas

    • P predicted value

    • A actual value

  • review.alert.active Activate/Deactivate Review Alert (True/False)

  • review.alert.type

    • Supported Review Alert types

    • model_accuracy Decrease in Model Accuracy: the lowest model accuracy allowed before the alert is triggered. Default threshold: 0.7. Default sensitivity: 72

    • feature_average_range Feature Average Out-Of-Range: Trigger an alert if the average value of a feature during the time period falls outside the standard deviation range calculated during the training period by the specified number of times or more. Default threshold: 1. Default sensitivity: 168

    • runtime_errors_burst Burst Of Runtime Errors: Trigger an alert if runtime error count exceeds threshold. Default threshold: 5. Default sensitivity: 1

  • review.alert.threshold Float. Threshold value for the selected alert type.

  • review.alert.sensitivity The amount of time (in hours) this metric must be at or below the threshold to trigger the alert.

  • review.alert.action

    • Supported Review Alert actions

    • no No action is executed.

    • retrain Use new predictions and actuals as test set to retrain the model.

    • retrain_deploy Deploy retrained model and make it active model of this endpoint.

  • review.alert.notification Send a message via the selected notification channel (no/user/organization).
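
As an illustration, a filled-in config.yaml for a classification project might look like the sketch below. The keys come from the attribute list above; every value (project name, file path, column names, time limits, the placeholder Postgres host) is made up for the example, and the list syntax used for providers and exclude is an assumption about how the file is parsed.

```yaml
# config.yaml: illustrative values only, not defaults
name: churn_demo                  # hypothetical project name
providers: [auger]                # assumed list syntax; any of auger, google, azure
use_auger_cloud: true
source: ./data/customers.csv      # hypothetical local file
# A Postgres source would instead follow the URL pattern described above:
# source: jdbc:postgresql://user:pwd@host:5432/dbname?tablename=table1&offset=0&limit=100
exclude: [customer_id]            # hypothetical column excluded from training
target: churned                   # hypothetical target column
model_type: classification
experiment:
  metric: accuracy
  cross_validation_folds: 5
  max_total_time: 60              # minutes for the whole experiment
  max_eval_time: 6                # minutes per trial
  max_n_trials: 100
  use_ensemble: true
  validation_source:              # left empty: the source dataset is split instead

review:
  metric: accuracy
  alert:
    active: True
    type: model_accuracy
    threshold: 0.7                # documented default for model_accuracy
    sensitivity: 72               # hours; documented default for model_accuracy
    action: retrain_deploy
    notification: user
```

To watch for feature drift instead, the alert type could be set to feature_average_range, whose documented defaults are a threshold of 1 and a sensitivity of 168.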

Provider Specific

Currently a2ml supports Auger, Azure, Google and External providers.

Auger

  • auger.yaml

dataset:
experiment:
  name:
  experiment_session_id:
  time_series:
  label_encoded: []
  blocked_models: []
  allowed_models: []
  estimate_trial_time: False
  trials_per_worker: 2
  class_weight:
  score_top_count:
  oversampling:
    name:
    params:
      sampling_strategy:
      k_neighbors:

Attributes

  • dataset Name of the DataSet on Auger Cloud.

  • experiment.name Latest experiment name.

  • experiment.experiment_session_id Latest experiment session.

  • experiment.time_series Time series feature. If the data source contains more than one DATETIME feature, you must explicitly specify the feature to use as the time series.

  • experiment.label_encoded List of columns which should be used as label encoded features.

  • experiment.blocked_models A list of model names to ignore for an experiment.

  • experiment.allowed_models A list of model names to search for an experiment. If not specified, all models supported for the task are used, minus any specified in blocked_models.

    • Supported models

    • Classification XGBClassifier, LGBMClassifier, SVC, SGDClassifier, AdaBoostClassifier, DecisionTreeClassifier, ExtraTreesClassifier, RandomForestClassifier, GradientBoostingClassifier, CatBoostClassifier

    • Regression SVR, XGBRegressor, LGBMRegressor, ElasticNet, SGDRegressor, AdaBoostRegressor, DecisionTreeRegressor, ExtraTreesRegressor, RandomForestRegressor, GradientBoostingRegressor, CatBoostRegressor

  • experiment.estimate_trial_time Use this if many trials time out. Setting it to True predicts the training time of each individual model to avoid timeouts. Default is False.

  • experiment.trials_per_worker Use this if many trials fail. Set it to a value below 8 to give each trial's fit process more memory. Default is None.

  • experiment.class_weight Balanced | Balanced Subsample. Class Weights associated with classes. If None, all classes are supposed to have weight one. The Balanced mode automatically adjusts weights inversely proportional to class frequencies in the input data. The Balanced Subsample mode is the same as Balanced except that weights are computed based on the bootstrap sample for every tree grown.

  • experiment.score_top_count Number of top N values (sorted in descending order) used to calculate metrics during training. For regression only.

  • experiment.oversampling.name SMOTE, RandomOverSampler, ADASYN, SMOTEENN, SMOTETomek. Oversampling method used to adjust the class distribution of a data set.

  • experiment.oversampling.params.sampling_strategy auto, minority, majority, not minority, not majority, all

  • experiment.oversampling.params.k_neighbors Integer value of k_neighbors

Note

For more information on oversampling
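
A sketch of an auger.yaml that enables oversampling for an imbalanced classification problem is shown below. The dataset name, the blocked model, and the k_neighbors value are hypothetical choices; the other values are taken from the options listed above.

```yaml
# auger.yaml: illustrative values only
dataset: customers.csv            # hypothetical DataSet name on Auger Cloud
experiment:
  name:                           # latest experiment name, left empty here
  experiment_session_id:          # latest experiment session, left empty here
  time_series:
  label_encoded: []
  blocked_models: [SVC]           # hypothetical choice from the supported classifiers
  allowed_models: []              # empty: all supported models except blocked ones
  estimate_trial_time: True       # predict per-model training time to avoid timeouts
  trials_per_worker: 2
  class_weight:
  score_top_count:
  oversampling:
    name: SMOTE
    params:
      sampling_strategy: auto
      k_neighbors: 5              # hypothetical value
```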

Azure

  • azure.yaml

dataset:
experiment:
  name:
  run_id:
  blocked_models: []
  allowed_models: []

cluster:
  region:
  min_nodes:
  max_nodes:
  type:
  name:

Attributes

  • dataset Name of the DataSet on Azure Cloud.

  • experiment.name Latest experiment name.

  • experiment.run_id Latest experiment run.

  • experiment.blocked_models A list of model names to ignore for an experiment.

  • experiment.allowed_models A list of model names to search for an experiment. If not specified, all models supported for the task are used, minus any specified in blocked_models.

    • Supported models

    • Classification AveragedPerceptronClassifier, BernoulliNaiveBayes, DecisionTree, ExtremeRandomTrees, GradientBoosting, KNN, LightGBM, LinearSVM, LogisticRegression, MultinomialNaiveBayes, SGD, RandomForest, SVM, XGBoostClassifier

    • Regression DecisionTree, ElasticNet, ExtremeRandomTrees, FastLinearRegressor, GradientBoosting, KNN, LassoLars, LightGBM, OnlineGradientDescentRegressor, RandomForest, SGD, XGBoostRegressor

  • cluster.region Name of cluster region. For example: eastus2

  • cluster.min_nodes Minimum number of nodes allocated for cluster. Minimum is 0.

  • cluster.max_nodes Maximum number of nodes allocated for cluster.

  • cluster.type Cluster node type. For example: STANDARD_D2_V2. Please read Azure documentation for available options and prices.

  • cluster.name Name of existing cluster or new one to create.
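
For example, an azure.yaml for a small cluster that scales down to zero nodes when idle might look like the following sketch. The region and node type are the examples given above; the dataset name, cluster name, node counts, and blocked models are hypothetical.

```yaml
# azure.yaml: illustrative values only
dataset: customers                # hypothetical DataSet name on Azure Cloud
experiment:
  name:                           # latest experiment name, left empty here
  run_id:                         # latest experiment run, left empty here
  blocked_models: [KNN, LinearSVM]  # hypothetical choices from the supported list
  allowed_models: []              # empty: all supported models except blocked ones

cluster:
  region: eastus2
  min_nodes: 0                    # documented minimum
  max_nodes: 2                    # hypothetical upper bound
  type: STANDARD_D2_V2
  name: a2ml-demo                 # hypothetical; existing cluster or new one to create
```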

Google

  • google.yaml

project:
experiment:
  metric:
cluster:
  region:
gsbucket:

Attributes

  • project Name of the Project on Google Cloud.

  • experiment.metric Metric used to build Model

  • cluster.region

  • gsbucket

External

No provider-specific YAML file is required. You can pass this provider to model deploy and actuals calls.

Architecture

Auger Cloud

Create one account in the Auger Cloud and let the cloud manage all the provider connections.

A2ML Local

Direct Provider Connection

Directly configure the provider(s) and connect to them from the a2ml client.

Server Provider Connection

Host a server which manages provider connections. The a2ml client would then point to the server.