DATA_CLEANER

NoventisDataCleaner (The Pipeline Orchestrator)

The NoventisDataCleaner class is the central orchestrator of the Noventis preprocessing suite. It lets you design, configure, and execute a sequential data cleaning pipeline that chains together the modules for imputation, outlier handling, encoding, and scaling. Its primary purpose is to provide a unified interface for the complete workflow, from initial data to a model-ready dataset, and to generate comprehensive reports on the entire process.

Import
PYTHON
from noventis import data_cleaner
Parameters
pipeline_steps
Type
list
Default
['impute', 'outlier', 'encode', 'scale']

A list of strings that defines the sequence of cleaning operations. You can customize the order or omit steps as needed.

Available steps:
  • 'impute'
  • 'outlier'
  • 'encode'
  • 'scale'
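
The effect of reordering or omitting steps can be illustrated with a minimal sketch. This is not the library's implementation: STEP_REGISTRY, make_step, and run_pipeline are hypothetical names, and each stand-in step only records its own name so that the execution order is visible.

```python
from typing import Callable, Dict, List

STEP_NAMES = ("impute", "outlier", "encode", "scale")


def make_step(name: str) -> Callable[[List[str]], List[str]]:
    """Build a stand-in step that appends its name to an execution log."""
    def step(log: List[str]) -> List[str]:
        return log + [name]
    return step


# Hypothetical registry mapping each documented step name to a callable.
STEP_REGISTRY: Dict[str, Callable[[List[str]], List[str]]] = {
    name: make_step(name) for name in STEP_NAMES
}


def run_pipeline(pipeline_steps: List[str]) -> List[str]:
    """Run the named steps in the given order and return the execution log."""
    unknown = [s for s in pipeline_steps if s not in STEP_REGISTRY]
    if unknown:
        raise ValueError(f"Unknown pipeline steps: {unknown}")
    log: List[str] = []
    for name in pipeline_steps:
        log = STEP_REGISTRY[name](log)
    return log


# Skip outlier handling and scale before encoding.
custom_order = run_pipeline(["impute", "scale", "encode"])
print(custom_order)
```

The steps run strictly in list order, so a step that is left out of the list is never executed, and an unrecognized name fails early instead of being silently ignored.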
imputer_params
Type
dict
Default
None

A dictionary of parameters passed directly to the NoventisImputer class. Refer to the NoventisImputer documentation for all available options.

Example:
{'method': 'knn', 'n_neighbors': 5}
outlier_params
Type
dict
Default
None

A dictionary of parameters passed directly to the NoventisOutlierHandler class. Refer to the NoventisOutlierHandler documentation for available options.

Example:
{'default_method': 'winsorize', 'quantile_range': (0.01, 0.99)}
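
To make the quantile_range idea concrete, here is a minimal sketch of winsorization in plain Python. It assumes that winsorizing means clipping values to the lower and upper quantile bounds and that quantiles use linear interpolation; how NoventisOutlierHandler computes them internally is not stated here, and the quantile and winsorize helpers are hypothetical. The input is assumed to be non-empty.

```python
from typing import List, Sequence, Tuple


def quantile(sorted_values: Sequence[float], q: float) -> float:
    """Quantile by linear interpolation between the two nearest ranks."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (
        sorted_values[upper] - sorted_values[lower]
    )


def winsorize(
    values: Sequence[float],
    quantile_range: Tuple[float, float] = (0.01, 0.99),
) -> List[float]:
    """Clip values to the bounds given by the lower and upper quantiles."""
    ordered = sorted(values)
    low = quantile(ordered, quantile_range[0])
    high = quantile(ordered, quantile_range[1])
    return [min(max(v, low), high) for v in values]


# Made-up data with one extreme value; a wide range is used so that the
# clipping is visible on such a small sample.
incomes = [1, 2, 3, 4, 1000]
clipped = winsorize(incomes, quantile_range=(0.25, 0.75))
print(clipped)
```

Unlike trimming, this keeps every row: extreme values are pulled back to the boundary rather than dropped.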
encoder_params
Type
dict
Default
None

A dictionary of parameters passed directly to the NoventisEncoder class. Refer to the NoventisEncoder documentation for available options.

Example:
{'method': 'auto', 'target_column': 'YourTarget'}
scaler_params
Type
dict
Default
None

A dictionary of parameters passed directly to the NoventisScaler class. Refer to the NoventisScaler documentation for available options.

Example:
{'method': 'robust'}
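
As a rough illustration of what a robust scaling method usually means, the sketch below centers values on the median and divides by the interquartile range. This is the common textbook definition, assumed here for illustration; whether NoventisScaler's 'robust' option matches it exactly should be checked in the NoventisScaler documentation. The robust_scale helper is hypothetical.

```python
import statistics
from typing import List, Sequence


def robust_scale(values: Sequence[float]) -> List[float]:
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # Degenerate spread: only center, to avoid dividing by zero.
        return [v - median for v in values]
    return [(v - median) / iqr for v in values]


scaled = robust_scale([1, 2, 3, 4, 5])
with_outlier = robust_scale([1, 2, 3, 4, 1000])
print(scaled)
print(with_outlier)
```

Because the median and IQR barely move when one value is extreme, the non-outlier values are scaled the same way in both calls.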
verbose
Type
bool
Default
False
If True, prints real-time progress updates to the console as the pipeline executes each step.
Methods
  • fit_transform(X, y=None) → pd.DataFrame

    The main method that executes the entire cleaning pipeline. It takes a DataFrame X (and an optional target Series y for target-dependent steps) and runs it through the sequence of operations defined in pipeline_steps. It returns the fully cleaned and processed DataFrame.

  • display_summary_report()

    Prints a concise, text-based summary of the entire pipeline run to the console, including a final data quality score and key metrics from each step.

  • generate_html_report() → HTML

    Generates a rich, interactive HTML report of the entire cleaning process. The report includes an overview with a final quality score, as well as dedicated tabs for each step with detailed summaries and before-and-after visualizations.

data_cleaner (The Simplified Helper Function)

For rapid and straightforward data cleaning tasks, the data_cleaner function provides a high-level, simplified interface to the NoventisDataCleaner pipeline. With a single function call, you can execute a standard cleaning sequence using the most common settings, making it ideal for initial data exploration and preparing baseline models.

Import
PYTHON
from noventis.data_cleaner import data_cleaner
Parameters
data
Type
Union[str, pd.DataFrame]
Default
Required (no default)
The input data. This can be either a pandas DataFrame or a string containing the file path to a CSV file.
target_column
Type
Optional[str]
Default
None

The name of the target column in the input data. It is used by target-dependent steps such as target encoding.

Example:
'SalePrice'
null_handling
Type
str
Default
'auto'
A simplified way to specify the imputation method (e.g., 'auto', 'median', 'knn', 'drop').
outlier_handling
Type
str
Default
None
A simplified way to specify the outlier handling method (e.g., 'auto', 'iqr_trim', 'winsorize').
encoding
Type
str
Default
'auto'
A simplified way to specify the encoding method (e.g., 'auto', 'ohe', 'target').
scaling
Type
str
Default
'auto'
A simplified way to specify the scaling method (e.g., 'auto', 'minmax', 'standard').
verbose
Type
bool
Default
True
If True, displays detailed reports and progress during the process.
return_instance
Type
bool
Default
False
Determines the function's output.
  • If False (default), only the cleaned pandas DataFrame is returned.
  • If True, the function returns a tuple: (cleaned_DataFrame, cleaner_instance). The instance can be used to generate reports or for further analysis.
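
One way to picture the relationship between these simplified keywords and the per-step dictionaries accepted by NoventisDataCleaner is the sketch below. The build_step_params helper is hypothetical, and the mapping is an assumption made for illustration; only the key names ('method', 'default_method', 'target_column') come from the parameter examples in this document.

```python
from typing import Any, Dict, Optional


def build_step_params(
    *,
    null_handling: str,
    outlier_handling: str,
    encoding: str,
    scaling: str,
    target_column: Optional[str] = None,
) -> Dict[str, Dict[str, Any]]:
    """Translate the simplified keywords into per-step parameter dicts.

    This is an assumed mapping, not the library's actual internals.
    """
    return {
        "imputer_params": {"method": null_handling},
        "outlier_params": {"default_method": outlier_handling},
        "encoder_params": {"method": encoding, "target_column": target_column},
        "scaler_params": {"method": scaling},
    }


params = build_step_params(
    null_handling="median",
    outlier_handling="winsorize",
    encoding="auto",
    scaling="standard",
    target_column="SalePrice",
)
for name, config in params.items():
    print(name, config)
```

If finer control is needed than a single method name per step, the NoventisDataCleaner class accepts the full dictionaries directly.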
Model Usage Examples
Example 1: Using the NoventisDataCleaner Class for Full Control

This example shows how to build a custom pipeline with specific parameters for each step.

PYTHON
import pandas as pd
from noventis.data_cleaner import NoventisDataCleaner

# Assume 'dummy_classification_churn.csv' is in your working folder
df = pd.read_csv('dummy_classification_churn.csv')
X = df.drop(columns=['churn'])
y = df['churn']

# 1. Define custom configurations for each step
imputer_config = {'method': 'median'}
outlier_config = {'default_method': 'winsorize', 'quantile_range': (0.01, 0.99)}
encoder_config = {'method': 'auto', 'target_column': 'churn'}
scaler_config = {'method': 'robust'}

# 2. Initialize the cleaner with the custom configurations
cleaner = NoventisDataCleaner(
    pipeline_steps=['impute', 'outlier', 'encode', 'scale'],
    imputer_params=imputer_config,
    outlier_params=outlier_config,
    encoder_params=encoder_config,
    scaler_params=scaler_config,
    verbose=False
)

# 3. Run the entire pipeline
cleaned_df = cleaner.fit_transform(X, y)

# 4. Generate the interactive HTML report
cleaner.generate_html_report()
RESULT
noventisd-data-cleaner-01
Example 2: Using the Function and Getting the Report

This shows how to use the simple function but still get the full NoventisDataCleaner instance back to generate the detailed HTML report.

PYTHON
import pandas as pd
from noventis.data_cleaner import data_cleaner

# Assume 'AmesHousing.csv' is available at this relative path
df_2 = pd.read_csv('../dataset_for_examples/AmesHousing.csv')

# Run the cleaner and ask for the instance to be returned
df_cleaned, cleaner_instance = data_cleaner(
    data=df_2,
    return_instance=True,
    target_column='SalePrice'
)

# Now, generate the rich HTML report from the returned instance
cleaner_instance.generate_html_report()
RESULT
noventisd-data-cleaner-02