NoventisManualML

While AutoML provides a powerful, hands-off approach, expert users often require granular control over model selection, hyperparameter tuning, and in-depth analysis. The NoventisManualML is designed precisely for this purpose. It serves as a comprehensive toolkit for building, tuning, comparing, and explaining a user-defined set of machine learning models.

Leveraging advanced libraries like Optuna for hyperparameter optimization and SHAP for explainability, it provides a structured and powerful environment for deliberate and insightful machine learning experimentation. It is the ideal tool when you want to compare specific algorithms or dive deep into the behavior of a single, highly-tuned model.

PYTHON
from noventis.predictor import NoventisManualML
Key Features
  • Custom Model Suite

    Train and compare one or more specific models from a comprehensive list, including LogisticRegression, RandomForest, XGBoost, LightGBM, and more.

  • Advanced Hyperparameter Tuning

    Integrates Optuna to perform sophisticated, state-of-the-art hyperparameter optimization, helping you squeeze maximum performance out of each model.

  • Deep Model Explainability

    Incorporates SHAP to provide deep, model-agnostic insights into how your model makes predictions, generating summary, beeswarm, and dependence plots.

  • Flexible Preprocessing

    Includes a robust internal preprocessor for handling missing values and categorical features, and can optionally be chained with a pre-configured NoventisDataCleaner instance for more complex cleaning pipelines.

  • Comprehensive Reporting

    Generates a detailed, interactive HTML report that consolidates performance metrics, model comparisons, evaluation plots, and feature importance into a single, easy-to-navigate dashboard.

Parameters
model_name
Type
Union[str, List[str]]
Default
None
The core parameter defining the experiment. Provide a single model name as a string (e.g., 'xgboost') or a list of names to train and compare (e.g., ['random_forest', 'lightgbm']).
task
Type
str
Default
None
The machine learning task. Must be either 'classification' or 'regression'.
tune_hyperparameters
Type
bool
Default
False
If True, enables hyperparameter optimization for each model using Optuna. If False, models are trained with their default parameters.
n_trials
Type
int
Default
50
The number of optimization trials to run per model when tune_hyperparameters is True.
data_cleaner
Type
Optional[NoventisDataCleaner]
Default
None
An optional, pre-configured NoventisDataCleaner instance. If provided, its cleaning pipeline will be applied to the data before training.
cv_folds
Type
int
Default
3
The number of cross-validation folds to use during the hyperparameter tuning process.
cv_strategy
Type
str
Default
'repeated'
The cross-validation strategy for tuning classification models. Can be 'repeated' (uses RepeatedStratifiedKFold) or another value (uses StratifiedKFold).
show_tuning_plots
Type
bool
Default
False
If True and tune_hyperparameters is enabled, displays Optuna's optimization history and parameter importance plots during the run.
output_dir
Type
Optional[str]
Default
None
A directory path where all artifacts (saved models, plots, reports) will be stored. If provided, a unique sub-folder is created for each run.
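
Taken together, cv_folds and cv_strategy determine which splitter is used during tuning. A minimal sketch of that selection logic, using scikit-learn directly (the helper name and the repeat count are illustrative assumptions, not Noventis internals):

```python
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold

def make_cv_splitter(cv_strategy: str, cv_folds: int, random_state: int = 42):
    """Pick a classification CV splitter the way cv_strategy is documented:
    'repeated' -> RepeatedStratifiedKFold, anything else -> StratifiedKFold."""
    if cv_strategy == 'repeated':
        # n_repeats=3 is an illustrative choice, not a documented default
        return RepeatedStratifiedKFold(n_splits=cv_folds, n_repeats=3,
                                       random_state=random_state)
    return StratifiedKFold(n_splits=cv_folds, shuffle=True,
                           random_state=random_state)

cv = make_cv_splitter('repeated', cv_folds=3)
print(type(cv).__name__)  # RepeatedStratifiedKFold
```

Repeated stratified splitting trades extra fit time for a lower-variance estimate of each trial's score, which is why it is the default during tuning.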
Main Workflow Method
  • .fit(df, target_column, test_size=0.2, compare=False, explain=False, display_report=True)

    This is the primary method for executing the entire workflow. It orchestrates data splitting, preprocessing, model training (and optional tuning), evaluation, and reporting, and is the main entry point for using NoventisManualML.

    • df (pd.DataFrame): The full dataset including the target column.
    • target_column (str): The name of the column to be predicted.
    • test_size (float): The proportion of data to hold out for testing.
    • compare (bool): If True, prints a summary table comparing all trained models.
    • explain (bool): If True, generates a bar plot comparing model performance.
    • display_report (bool): If True, automatically displays the final HTML report in the output cell (in Jupyter environments).
Reporting & Analysis Methods
  • .generate_html_report(filepath=None) → str

    Creates the comprehensive HTML report, which includes an execution summary, a detailed model comparison table, and all generated visualizations. The report can be saved to a file if a filepath is provided.

  • .display_report()

    A convenience method to display the generated HTML report directly in a Jupyter or Google Colab output cell.

  • .explain_model(plot_type='summary', feature=None)

    Provides deep model explainability for the best-performing model using SHAP. It can generate different visualizations to understand feature impacts on the model's predictions.

    • plot_type: 'summary' (default), 'beeswarm', or 'dependence'.
    • feature: The name of a feature is required for the 'dependence' plot.
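
    The plot_type/feature contract above can be made concrete with a small validation sketch. This mirrors the documented rules only; it is not the library's actual implementation:

```python
from typing import Optional

VALID_PLOT_TYPES = {'summary', 'beeswarm', 'dependence'}

def validate_explain_args(plot_type: str = 'summary',
                          feature: Optional[str] = None):
    """Check explain_model-style arguments against the documented contract:
    plot_type must name one of the three SHAP plots, and the 'dependence'
    plot additionally requires a feature name."""
    if plot_type not in VALID_PLOT_TYPES:
        raise ValueError(f"plot_type must be one of {sorted(VALID_PLOT_TYPES)}")
    if plot_type == 'dependence' and feature is None:
        raise ValueError("'dependence' plots require a feature name")
    return plot_type, feature
```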
  • .get_results_dataframe() → pd.DataFrame

    Returns a clean pandas DataFrame containing the performance metrics for all successfully trained models, sorted by the primary evaluation metric.
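The shape of the returned DataFrame can be illustrated with a hand-built equivalent; the metric column names here are assumptions for illustration, not the library's exact schema:

```python
import pandas as pd

# Hypothetical per-model metrics, as a comparison run might produce them
results = pd.DataFrame([
    {'model': 'random_forest', 'f1_score': 0.81, 'accuracy': 0.83},
    {'model': 'logistic_regression', 'f1_score': 0.76, 'accuracy': 0.79},
    {'model': 'lightgbm', 'f1_score': 0.84, 'accuracy': 0.85},
])

# Sorted by the primary metric, best model first, as get_results_dataframe()
# is documented to do
results = results.sort_values('f1_score', ascending=False).reset_index(drop=True)
print(results.loc[0, 'model'])  # lightgbm
```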

Utility Methods
  • .save_model(filepath=None)

    Saves the best-performing model from the pipeline run to a .pkl file for later use. If filepath is not provided, it saves to the output_dir.

  • .load_model(filepath) → object

    A utility function to load a saved .pkl model from the specified path.

  • .predict(X_new, model_path=None)

    Makes predictions on new data using either the best model from the session or a loaded model from a file.
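
    These utilities follow the standard pickle round-trip pattern for fitted estimators. A minimal stand-in using scikit-learn and the standard library (not Noventis internals):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a small model to stand in for the pipeline's best estimator
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# save_model-style step: serialize the fitted estimator to a .pkl file
path = os.path.join(tempfile.gettempdir(), 'best_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

# load_model-style step: restore it and predict on new data
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded.predict([[2.5]]))  # same output as the original model
```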

Model Usage Examples
Prepare Dataset

Classification

PYTHON
import pandas as pd
import seaborn as sns

from noventis.predictor import NoventisManualML

# Load the Titanic dataset and drop columns with heavy missingness or target leakage
df_titanic = sns.load_dataset('titanic')
df_titanic_clean = df_titanic.drop(columns=['deck', 'embark_town', 'alive'])
df_titanic_clean = df_titanic_clean.dropna()

Regression

PYTHON
import pandas as pd
from sklearn.datasets import fetch_california_housing

from noventis.predictor import NoventisManualML

# Load the California housing dataset into a DataFrame
housing = fetch_california_housing()
df_housing = pd.DataFrame(housing.data, columns=housing.feature_names)
df_housing['MedHouseVal'] = housing.target
Example 1: The Full Experience (Default)

Demonstrates how to train and compare a specified list of models using their default parameters, leveraging the integrated data cleaner for preprocessing.

Classification

PYTHON
manualml = NoventisManualML(
    model_name=['logistic_regression', 'random_forest', 'lightgbm'],
    task='classification',
)
results = manualml.fit(
    df=df_titanic_clean,
    target_column='survived',
    use_data_cleaner=True
)
RESULT
Noventis classification manual report 01

Regression

PYTHON
manualml = NoventisManualML(
    model_name=['linear_regression', 'random_forest', 'lightgbm'],
    task='regression',
)
results = manualml.fit(
    df=df_housing,
    target_column='MedHouseVal',
    use_data_cleaner=True
)
RESULT
Noventis regression manual 01
Example 2: ManualML with Hyperparameter Tuning

Showcases how to enable Optuna-based hyperparameter tuning for a single model to find its optimal configuration, along with displaying tuning plots.

Classification

PYTHON
manualml_tuning = NoventisManualML(
    model_name='xgboost',
    task='classification',
    tune_hyperparameters=True,
    n_trials=50,
    cv_folds=5,
    show_tuning_plots=True,
    random_state=42
)
results = manualml_tuning.fit(
    df=df_titanic_clean,
    target_column='survived',
    display_report=True,
    compare=True,
    explain=True,
    use_data_cleaner=True
)
RESULT
Noventis classification manual report 02

Regression

PYTHON
manualml = NoventisManualML(
    model_name='xgboost',
    task='regression',
    tune_hyperparameters=True,
    n_trials=50,
    cv_folds=5,
    show_tuning_plots=True,
    random_state=42
)
results = manualml.fit(
    df=df_housing,
    target_column='MedHouseVal',
    display_report=True,
    compare=True,
    explain=True,
    use_data_cleaner=True
)
RESULT
Noventis regression manual 02
Example 3: Save and use your model

This demonstrates the practical workflow of saving the best model found during the pipeline run and then loading it back for future use, simulating deployment.

PYTHON
# Save the best model from the run, then load it back for reuse
manualml.save_model(filepath='best_xgboost_model.pkl')
loaded_model = NoventisManualML.load_model(filepath='best_xgboost_model.pkl')
print(f"Model {type(loaded_model)} successfully loaded.")