
NoventisAutoML

The journey from a prepared dataset to a high-performing, deployable machine learning model involves numerous steps: model selection, hyperparameter tuning, rigorous evaluation, and comparison. NoventisAutoML is an all-in-one solution designed to automate this entire workflow. It acts as your personal automated data scientist, exploring various models, optimizing their performance within a set budget, and delivering a comprehensive, interactive report with actionable insights.

Powered by the robust FLAML library, it can find the best model through an efficient AutoML search, train a specific list of models you define, or do both and compare them head-to-head to find the undisputed champion for your dataset.

PYTHON
from noventis.predictor import NoventisAutoML
Key Features
  • Hybrid Modeling Approach

    Seamlessly run a state-of-the-art AutoML search, train a specific list of manual models (such as xgboost or random_forest), or do both simultaneously and compare them to find the best performer.

  • Fully Automated Workflow

    Handles data loading (from CSV or DataFrame), automatic task detection (classification/regression), and stratified train-test splitting.

  • Rich Explainability & Visualization

    When explain=True, automatically generates a suite of insightful plots including feature importance, confusion matrices, ROC/AUC and Precision-Recall curves, residual plots, and more.

  • Interactive HTML Reporting

    Produces a stunning, self-contained HTML dashboard that consolidates all results, performance metrics, model comparisons, and plots into a single, user-friendly, and shareable file.

  • Automatic Model Saving

    Automatically saves the best-performing model as a .pkl file, ready for easy loading and deployment for future predictions.

Parameters
data
Type
Union[str, pd.DataFrame]
Default
None
The input data. This can be either a pandas DataFrame or a string containing the file path to a CSV file.
target
Type
str
Default
None
The name of the target variable (the column you want to predict).
task
Type
Optional[str]
Default
None
The type of machine learning task. Can be 'classification' or 'regression'. If None, the task will be automatically inferred from the target column's data type and distribution.
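To make the idea of inferring the task from the target column concrete, here is a minimal sketch. The exact rule NoventisAutoML applies is not described on this page, so the function name infer_task, the max_classes threshold, and the heuristic itself are assumptions for illustration only.

```python
from numbers import Real


def infer_task(target_values, max_classes=20):
    """Guess the task type from target values (hypothetical heuristic)."""
    values = [v for v in target_values if v is not None]
    if not values:
        raise ValueError("target column has no non-missing values")
    # Strings, booleans, and other non-numeric labels suggest classification.
    if any(isinstance(v, bool) or not isinstance(v, Real) for v in values):
        return "classification"
    unique = set(values)
    # A small set of whole-number values looks like class labels.
    if len(unique) <= max_classes and all(float(v).is_integer() for v in unique):
        return "classification"
    # Anything else is treated as a continuous target.
    return "regression"


print(infer_task(["yes", "no", "yes"]))
print(infer_task([0, 1, 1, 0, 1]))
print(infer_task([4.526, 3.585, 3.521]))
```

A heuristic like this can misjudge targets such as integer counts with few distinct values, which is one reason to pass task explicitly when you already know it.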
models
Type
List[str]
Default
None

A list of model names to train manually and compare. If None and compare=True, a default list of common models will be used. If None and compare=False, only the AutoML engine runs and no manual models are trained.

• Classification examples: 'logistic_regression', 'random_forest', 'xgboost', 'lightgbm'.
• Regression examples: 'linear_regression', 'random_forest', 'xgboost'.
explain
Type
bool
Default
True
If True, generates all performance visualizations and saves them to the output_dir.
compare
Type
bool
Default
True

Controls the operating mode.

• If True (default), the tool will run the AutoML engine and train the models specified in models, then compare all of them to find the best one.
• If False, it will only run one of the two modes: either AutoML (if models is None) or the manual list of models (if models is provided).
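The interaction between compare and models described above can be summarised as a small decision function. This is a sketch of the documented behaviour, not the library's code; the function name select_run_mode and the returned mode strings are hypothetical.

```python
def select_run_mode(compare=True, models=None):
    """Return which engines would run for a given compare/models combination."""
    if compare:
        # AutoML search plus manual models (a default list if models is None).
        return "automl_and_manual"
    if models is None:
        # compare=False with no model list: AutoML search only.
        return "automl_only"
    # compare=False with a model list: manual training only.
    return "manual_only"


for compare, models in [(True, None), (False, None), (False, ["xgboost"])]:
    print(compare, models, "->", select_run_mode(compare, models))
```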
metrics
Type
str
Default
None

The primary metric to use for optimization and model ranking. If None, it defaults to 'macro_f1' for classification and 'r2' for regression.

• Classification examples: 'accuracy', 'precision', 'recall', 'f1_score'.
• Regression examples: 'r2_score', 'mae', 'mse'.
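For readers unfamiliar with the two default metrics, the sketch below shows how the fallback could be resolved and what macro F1 and R² measure. The helper names resolve_metric, macro_f1, and r2 are illustrative assumptions, and the metric functions are plain-Python reference implementations of the standard definitions rather than the library's own code.

```python
def resolve_metric(task, metrics=None):
    """Fall back to the documented defaults when no metric is given."""
    if metrics is not None:
        return metrics
    return "macro_f1" if task == "classification" else "r2"


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in set(y_true) | set(y_pred):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for a constant target")
    return 1.0 - ss_res / ss_tot


print(resolve_metric("classification"), resolve_metric("regression"))
print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
print(r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

Macro averaging gives every class equal weight, so it is less flattering than accuracy on imbalanced targets.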
time_budget
Type
int
Default
60
The total time in seconds allocated to the AutoML engine for its search process. A larger budget allows for a more thorough search.
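The effect of a time budget can be illustrated with a toy search loop that keeps trying configurations until a deadline passes. This is only a sketch: it uses plain random search as a stand-in, whereas FLAML's actual search strategy is more sophisticated, and the names budgeted_search, evaluate, and sample_config are hypothetical.

```python
import random
import time


def budgeted_search(evaluate, sample_config, time_budget, random_state=42):
    """Try random configurations until the time budget (in seconds) is spent."""
    rng = random.Random(random_state)
    deadline = time.monotonic() + time_budget
    best_config, best_score, trials = None, float("-inf"), 0
    while time.monotonic() < deadline:
        config = sample_config(rng)
        score = evaluate(config)  # a running trial may finish slightly past the deadline
        trials += 1
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score, trials


def toy_score(x):
    # Toy objective with its maximum at x = 3.
    return -(x - 3.0) ** 2


def toy_sample(rng):
    return rng.uniform(0.0, 10.0)


best_config, best_score, trials = budgeted_search(toy_score, toy_sample, time_budget=0.05)
print(f"trials={trials} best_config={best_config:.3f} best_score={best_score:.5f}")
```

With a larger budget the loop completes more trials, which is the sense in which more time allows a more thorough search.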
output_dir
Type
str
Default
'Noventis_Results'
The directory where all outputs (saved models, plots, reports) will be stored.
test_size
Type
float
Default
0.2
The proportion of the dataset to allocate to the test set.
random_state
Type
int
Default
42
The random seed for ensuring reproducibility in data splitting and model training.
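How test_size and random_state work together in a stratified split can be sketched in plain Python. This is an illustration under assumptions, not NoventisAutoML's implementation; the function stratified_split is hypothetical, and in practice scikit-learn's train_test_split with its stratify argument provides the same idea.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, random_state=42):
    """Split row indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(random_state)  # a fixed seed makes the shuffle repeatable
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ["a"] * 80 + ["b"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] == "b" for i in test_idx))
```

Calling the function again with the same random_state returns the same indices, which is what makes a run reproducible.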
Main Workflow Method
  • .fit(time_budget=60, metric=None) → dict

    This is the primary method to execute the entire AutoML pipeline. It orchestrates data splitting, model training (AutoML and/or manual), evaluation, comparison, and saving the best model. It returns a dictionary containing all detailed results from the run. The time_budget and metric parameters can be used here to override the values set during initialization.

Reporting & Analysis Methods
  • .generate_html_report() → HTML

    Generates the comprehensive, interactive HTML report of the entire process. In Jupyter environments, this report is often displayed automatically after .fit() completes.

  • .get_model_info() → dict

    Returns a dictionary with details about the best-found model, including the final estimator, its configuration, and feature names.

  • .export_results_to_csv()

    Saves key results—including predictions on the test set, performance metrics, and feature importances—to CSV files in the output_dir for external analysis.

Utility Methods
  • .predict(X_new, model_path=None)

    Makes predictions on new, unseen data (X_new). It can either use the model trained in the current session or load a previously saved model from a file specified by model_path.

  • .load_model(model_path) → object

    A utility function to load a saved .pkl model from the specified path.
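As a rough picture of what saving and reloading a .pkl file involves, the sketch below round-trips an object with Python's pickle module. It assumes the saved model is a standard pickle file, which this page does not confirm; the helper names and the dictionary standing in for a trained model are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize a model object to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously saved model object. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A plain dictionary stands in for a trained estimator in this sketch.
stand_in_model = {"estimator": "random_forest", "params": {"n_estimators": 100}}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_model.pkl"
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored == stand_in_model)
```

Pickle files are tied to the library versions used to create them, so loading a model is most reliable in an environment that matches the one used for training.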

Model Usage Examples
Prepare Dataset

Classification

PYTHON
import pandas as pd
import seaborn as sns
from noventis.predictor import NoventisAutoML

# Load the Titanic dataset
df_titanic = sns.load_dataset('titanic')
# Drop 'deck' (mostly missing), 'embark_town' (duplicates 'embarked'),
# and 'alive' (duplicates the target 'survived')
df_titanic_clean = df_titanic.drop(columns=['deck', 'embark_town', 'alive'])
# Remove the remaining rows with missing values
df_titanic_clean = df_titanic_clean.dropna()

Regression

PYTHON
import pandas as pd
from sklearn.datasets import fetch_california_housing
from noventis.predictor import NoventisAutoML

# Load the California housing data and attach the target as a column
housing = fetch_california_housing()
df_housing = pd.DataFrame(housing.data, columns=housing.feature_names)
df_housing['MedHouseVal'] = housing.target
01
The Full Experience (Default)

Run AutoML, compare it against a default list of common models, and generate a full report.

Classification

PYTHON
automl = NoventisAutoML(
    data=df_titanic_clean,
    target='survived',
    task='classification',
    time_budget=30,
)
results = automl.fit()
automl.generate_html_report()
RESULT
Manual Predictor Analysis Report (Classification)
RESULT
Noventis AutoML Report (Classification)

Regression

PYTHON
automl = NoventisAutoML(
    data=df_housing,
    target='MedHouseVal',
    task='regression',
    time_budget=30,
)
results = automl.fit()
automl.generate_html_report()
RESULT
Manual Predictor Analysis Report (Regression)
RESULT
Noventis AutoML Report (Regression)
02
Pure AutoML Search

Focus exclusively on finding the best possible model using the AutoML engine within a 30-second budget.

Classification

PYTHON
automl_pure = NoventisAutoML(
    data=df_titanic_clean,
    target='survived',
    compare=False,   # run a single mode only
    models=None,     # no manual models, so only the AutoML engine runs
    task='classification',
    time_budget=30,
    metrics='accuracy',
)
results = automl_pure.fit()
automl_pure.generate_html_report()
RESULT
Noventis AutoML Report (Classification)

Regression

PYTHON
automl = NoventisAutoML(
    data=df_housing,
    target='MedHouseVal',
    compare=False,   # run a single mode only
    models=None,     # no manual models, so only the AutoML engine runs
    task='regression',
    time_budget=30,
    metrics='mae',
)
results = automl.fit()
automl.generate_html_report()
RESULT
Noventis AutoML Report (Regression)
03
Manual Model Training & Comparison

Train only a specific set of models you want to evaluate, without running the AutoML search.

Classification

PYTHON
automl_manual = NoventisAutoML(
    data=df_titanic_clean,
    target='survived',
    compare=False,   # with a model list, only the manual models are trained
    models=['random_forest', 'lightgbm', 'logistic_regression'],
    task='classification',
)
results = automl_manual.fit()
automl_manual.generate_html_report()
RESULT
Noventis AutoML Report (Classification)

Regression

PYTHON
automl = NoventisAutoML(
    data=df_housing,
    target='MedHouseVal',
    compare=False,   # with a model list, only the manual models are trained
    models=['linear_regression', 'random_forest', 'xgboost'],
    task='regression',
)
results = automl.fit()
automl.generate_html_report()
RESULT
Noventis AutoML Report (Regression)
04
Loading a Saved Model and Predicting

Train a model, then load the saved best model and use it to predict on new data.

PYTHON
import pandas as pd
from noventis.predictor import NoventisAutoML

# First, run the training process
automl = NoventisAutoML(data='path/to/train_data.csv', target='YourTargetColumn')
automl.fit()

# Now, load new data for prediction
new_data = pd.read_csv('path/to/new_unseen_data.csv')

# Predict with the saved best model, loaded from model_path
predictions = automl.predict(X_new=new_data, model_path='Noventis_Results/best_model.pkl')
print(predictions)