Noventis Documentation

NoventisAutoML

The journey from a prepared dataset to a high-performing, deployable machine learning model involves numerous steps: model selection, hyperparameter tuning, rigorous evaluation, and comparison. NoventisAutoML is an all-in-one solution designed to automate this entire workflow. It acts as your personal automated data scientist, exploring various models, optimizing their performance within a set time budget, and delivering a comprehensive, interactive report with actionable insights.

Powered by the robust FLAML library, it can find the best model through an efficient AutoML search, train a specific list of models you define, or do both and compare them head-to-head to find the undisputed champion for your dataset.

BASH

from noventis.predictor import NoventisAutoML

Key Features

Hybrid Modeling Approach
Seamlessly run a state-of-the-art AutoML search, train a specific list of manual models (such as xgboost and random_forest), or do both and compare them to find the best performer.
Fully Automated Workflow
Handles data loading (from CSV or DataFrame), automatic task detection (classification/regression), and stratified train-test splitting.
Rich Explainability & Visualization
When explain=True, automatically generates a suite of plots including feature importance, confusion matrices, ROC/AUC and Precision-Recall curves, residual plots, and more.
Interactive HTML Reporting
Produces a stunning, self-contained HTML dashboard that consolidates all results, performance metrics, model comparisons, and plots into a single, user-friendly, and shareable file.
Automatic Model Saving
Automatically saves the best-performing model as a .pkl file, ready for easy loading and deployment for future predictions.
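
The stratified train-test splitting mentioned under Fully Automated Workflow can be illustrated with a minimal, standard-library sketch. This is not the library's implementation; `stratified_split` is a hypothetical helper, and its defaults simply mirror the documented `test_size=0.2` and `random_state=42`. The idea is that each class is shuffled and split separately, so the test set keeps roughly the same class proportions as the full dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, random_state=42):
    """Hypothetical helper: return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(random_state)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class, key=str):
        indices = by_class[label]
        rng.shuffle(indices)  # seeded shuffle keeps the split reproducible
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy target column: 30% "yes", 70% "no".
labels = ["yes"] * 30 + ["no"] * 70
train_idx, test_idx = stratified_split(labels)
test_labels = [labels[i] for i in test_idx]

print(len(train_idx), len(test_idx))                      # 80 20
print(test_labels.count("yes"), test_labels.count("no"))  # 6 14
```

The test set keeps the 30/70 class ratio of the full data, which a plain random split does not guarantee on small or imbalanced datasets.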

Parameters

Parameter	Type	Default	Description
data	Union[str, pd.DataFrame]	`None`	The input data. This can be either a pandas DataFrame or a string containing the file path to a CSV file.
target	str	`None`	The name of the target variable (the column you want to predict).
task	Optional[str]	`None`	The type of machine learning task. Can be `'classification'` or `'regression'`. If `None`, the task will be automatically inferred from the target column's data type and distribution.
models	List[str]	`None`	A list of model names to train manually and compare. If `None` and `compare=True`, a default list of common models will be used. This parameter is ignored if `compare=False` and you are only running the AutoML engine. • Classification examples: `'logistic_regression'`, `'random_forest'`, `'xgboost'`, `'lightgbm'`. • Regression examples: `'linear_regression'`, `'random_forest'`, `'xgboost'`.
explain	bool	`True`	If `True`, generates all performance visualizations and saves them to the `output_dir`.
compare	bool	`True`	Controls the operating mode. • If `True` (default), the tool will run the AutoML engine and train the models listed in `models`, then compare all of them to find the best one. • If `False`, it will only run one of the two modes: either AutoML (if `models` is `None`) or the manual list of models (if `models` is provided).
metrics	str	`None`	The primary metric to use for optimization and model ranking. If `None`, it defaults to `'macro_f1'` for classification and `'r2'` for regression. • Classification examples: `'accuracy'`, `'precision'`, `'recall'`, `'f1_score'`. • Regression examples: `'r2_score'`, `'mae'`, `'mse'`.
time_budget	int	`60`	The total time in seconds allocated to the AutoML engine for its search process. A larger budget allows for a more thorough search.
output_dir	str	`'Noventis_Results'`	The directory where all outputs (saved models, plots, reports) will be stored.
test_size	float	`0.2`	The proportion of the dataset to allocate to the test set.
random_state	int	`42`	The random seed for ensuring reproducibility in data splitting and model training.
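
The interaction between `compare` and `models` described in the table can be summarized as a small decision function. The sketch below is only an illustration of the documented rules; `resolve_mode` and the keys of the dictionary it returns are hypothetical names, not part of the NoventisAutoML API.

```python
from typing import List, Optional


def resolve_mode(compare: bool, models: Optional[List[str]]) -> dict:
    """Hypothetical helper mirroring the documented `compare` / `models` rules."""
    if compare:
        # AutoML and manual models both run; a default model list is used
        # when no explicit list is given.
        return {"automl": True, "manual": True, "uses_default_models": models is None}
    if models is None:
        # compare=False and no model list: AutoML search only.
        return {"automl": True, "manual": False, "uses_default_models": False}
    # compare=False with a model list: manual training only.
    return {"automl": False, "manual": True, "uses_default_models": False}


print(resolve_mode(True, None))
print(resolve_mode(False, None))
print(resolve_mode(False, ["random_forest", "xgboost"]))
```

Read this way, `compare=False` means "run exactly one engine", and the presence or absence of `models` decides which one.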

Main Workflow Method

.fit(time_budget=60, metric=None) → dict
This is the primary method to execute the entire AutoML pipeline. It orchestrates data splitting, model training (AutoML and/or manual), evaluation, comparison, and saving the best model. It returns a dictionary containing all detailed results from the run. The time_budget and metric parameters can be used here to override the values set during initialization.
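
The metric chosen here determines how models are ranked, so it helps to see what the default classification metric measures. The sketch below computes macro-averaged F1 by its standard definition using only the standard library; whether NoventisAutoML computes it in exactly this way is an assumption, and `macro_f1` is a hypothetical helper name.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (standard macro-F1 definition)."""
    classes = set(y_true) | set(y_pred)
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3))        # 0.667
print(macro_f1(y_true, y_pred))  # 0.625
```

Class 0 scores 0.75 and class 1 scores 0.5, so the macro average is 0.625. Because every class counts equally, a weak minority class pulls the score down more than it would with accuracy.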

Reporting & Analysis Methods

.generate_html_report() → HTML
Generates the comprehensive, interactive HTML report of the entire process. In Jupyter environments, this report is often displayed automatically after .fit() completes.
.get_model_info() → dict
Returns a dictionary with details about the best-found model, including the final estimator, its configuration, and feature names.
.export_results_to_csv()
Saves key results—including predictions on the test set, performance metrics, and feature importances—to CSV files in the output_dir for external analysis.

Utility Methods

.predict(X_new, model_path=None)
Makes predictions on new, unseen data (X_new). It can either use the model trained in the current session or load a previously saved model from a file specified by model_path.
.load_model(model_path) → object
A utility function to load a saved .pkl model from the specified path.
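
A minimal sketch of the save-and-reload round trip behind a `.pkl` file, assuming the file is a standard Python pickle (the source only states the `.pkl` extension, so the serialization format is an assumption). The "model" here is a stand-in dictionary of linear coefficients and `predict_rows` is a hypothetical helper; neither is a Noventis object.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model: a plain dict of linear coefficients.
model = {"intercept": 1.0, "coefficients": [2.0, 0.5]}


def predict_rows(fitted, rows):
    """Hypothetical helper: apply the stand-in linear model to each row."""
    return [
        fitted["intercept"] + sum(c * x for c, x in zip(fitted["coefficients"], row))
        for row in rows
    ]


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_model.pkl"

    # Save the model to disk.
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)

    # Load it back, as a later session would.
    with open(model_path, "rb") as fh:
        loaded = pickle.load(fh)

print(predict_rows(loaded, [[1.0, 2.0], [0.0, 4.0]]))  # [4.0, 3.0]
```

Unpickling can execute arbitrary code, so only load `.pkl` files that come from a source you trust.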

Model Usage Examples

Prepare Dataset

Classification

BASH

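import pandas as pd
import seaborn as sns
from noventis.predictor import NoventisAutoML

# Load the Titanic dataset bundled with seaborn
df_titanic = sns.load_dataset('titanic')

# Drop the sparse 'deck' column and two columns that duplicate other fields
# ('embark_town' mirrors 'embarked'; 'alive' mirrors the target 'survived')
df_titanic_clean = df_titanic.drop(columns=['deck', 'embark_town', 'alive'])
# Remove rows that still contain missing values
df_titanic_clean = df_titanic_clean.dropna()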

Regression

BASH

import pandas as pd
from sklearn.datasets import fetch_california_housing
from noventis.predictor import NoventisAutoML

# Load the California housing dataset from scikit-learn
housing = fetch_california_housing()

# Build a DataFrame of the features and append the target column
df_housing = pd.DataFrame(housing.data, columns=housing.feature_names)
df_housing['MedHouseVal'] = housing.target

The Full Experience (Default)

Run AutoML, compare it against a default list of common models, and generate a full report.

Classification

BASH

automl = NoventisAutoML(data=df_titanic_clean, target='survived', task='classification', time_budget=30)
results = automl.fit()
automl.generate_html_report()

RESULT

Manual Predictor Analysis Report (Classification)

Regression

BASH

automl = NoventisAutoML(data=df_housing, target='MedHouseVal', task='regression', time_budget=30)
results = automl.fit()
automl.generate_html_report()

RESULT

Manual Predictor Analysis Report (Regression)

Pure AutoML Search

Focus exclusively on finding the best possible model using the AutoML engine within a fixed time budget (30 seconds in the examples below).

Classification

BASH

automl_pure = NoventisAutoML(
    data=df_titanic_clean,
    target='survived',
    compare=False,
    models=None,
    task='classification',
    time_budget=30,
    metrics='accuracy'
)
results = automl_pure.fit()
automl_pure.generate_html_report()

Regression

BASH

automl = NoventisAutoML(
    data=df_housing,
    target='MedHouseVal',
    compare=False,
    models=None,
    task='regression',
    time_budget=30,
    metrics='mae'
)
results = automl.fit()
automl.generate_html_report()
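
To make the role of `time_budget` concrete, here is a minimal sketch of a time-bounded search loop. It uses plain random search over a toy objective, which is a deliberate simplification and not FLAML's actual search strategy; `budgeted_search`, `sample_config`, and `evaluate` are hypothetical names introduced for illustration.

```python
import random
import time


def budgeted_search(evaluate, sample_config, time_budget, random_state=42):
    """Try random configurations until `time_budget` seconds have elapsed."""
    rng = random.Random(random_state)
    deadline = time.monotonic() + time_budget
    best_config, best_score, trials = None, float("-inf"), 0

    while True:
        config = sample_config(rng)
        score = evaluate(config)
        trials += 1
        if score > best_score:
            best_config, best_score = config, score
        # Check the clock after each trial so at least one trial always runs.
        if time.monotonic() >= deadline:
            break
    return best_config, best_score, trials


def sample_config(rng):
    # Toy search space: a single hyperparameter x in [0, 10].
    return {"x": rng.uniform(0.0, 10.0)}


def evaluate(config):
    # Toy objective with its maximum (0.0) at x = 3.
    return -(config["x"] - 3.0) ** 2


best_config, best_score, trials = budgeted_search(evaluate, sample_config, time_budget=0.05)
print(f"{trials} trials, best x = {best_config['x']:.3f}, score = {best_score:.5f}")
```

A larger budget allows more trials before the deadline, which is why a longer `time_budget` generally gives the search a better chance of finding a strong configuration.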

Manual Model Training & Comparison

Train only a specific set of models you want to evaluate, without running the AutoML search.

Classification

BASH

automl_manual = NoventisAutoML(
    data=df_titanic_clean,
    target='survived',
    compare=False,
    models=['random_forest', 'lightgbm',
            'logistic_regression'],
    task='classification'
)
results = automl_manual.fit()
automl_manual.generate_html_report()

Regression

BASH

automl = NoventisAutoML(
    data=df_housing,
    target='MedHouseVal',
    compare=False,
    models=['linear_regression', 'random_forest', 
            'xgboost'],
    task='regression'
)
results = automl.fit()
automl.generate_html_report()

Loading a Saved Model and Predicting

Train a model, then load the saved best model and use it to predict on new data.

BASH

from noventis.predictor import NoventisAutoML
import pandas as pd

# First, run the training process
automl = NoventisAutoML(data='path/to/train_data.csv', target='YourTargetColumn')
automl.fit()

# Now, load new data for prediction
new_data = pd.read_csv('path/to/new_unseen_data.csv')

# Predict with the saved best model, loaded from the path given in model_path
predictions = automl.predict(X_new=new_data, model_path='Noventis_Results/best_model.pkl')

print(predictions)