NoventisDataCleaner (The Pipeline Orchestrator)
The NoventisDataCleaner class acts as the central conductor for the entire Noventis preprocessing suite. It allows you to design, configure, and execute a sequential data cleaning pipeline, chaining together the modules for imputation, outlier handling, encoding, and scaling. Its primary purpose is to provide a unified interface to manage the complete workflow—from initial data to a model-ready dataset—and to generate comprehensive reports on the entire process.
Import
from noventis import data_cleaner
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| pipeline_steps | list | ['impute', 'outlier', 'encode', 'scale'] | A list of strings that defines the sequence of cleaning operations. You can customize the order or omit steps as needed. Available steps: 'impute', 'outlier', 'encode', 'scale'. |
| imputer_params | dict | None | A dictionary of parameters passed directly to the NoventisImputer class. Refer to the NoventisImputer documentation for all available options. Example: {'method': 'knn', 'n_neighbors': 5} |
| outlier_params | dict | None | A dictionary of parameters passed directly to the NoventisOutlierHandler class. Refer to the NoventisOutlierHandler documentation for available options. Example: {'default_method': 'winsorize', 'quantile_range': (0.01, 0.99)} |
| encoder_params | dict | None | A dictionary of parameters passed directly to the NoventisEncoder class. Refer to the NoventisEncoder documentation for available options. Example: {'method': 'auto', 'target_column': 'YourTarget'} |
| scaler_params | dict | None | A dictionary of parameters passed directly to the NoventisScaler class. Refer to the NoventisScaler documentation for available options. Example: {'method': 'robust'} |
| verbose | bool | False | If True, prints real-time progress updates to the console as the pipeline executes each step. |
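To make the outlier_params example more concrete, here is a minimal sketch of what winsorizing with a quantile_range of (0.01, 0.99) usually means, written in plain pandas. It is an illustration only and is not taken from NoventisOutlierHandler; the helper name winsorize_series and the sample data are hypothetical.

```python
import pandas as pd


def winsorize_series(values: pd.Series, quantile_range=(0.01, 0.99)) -> pd.Series:
    """Clip values to the given lower and upper quantiles (hypothetical helper)."""
    lower = values.quantile(quantile_range[0])
    upper = values.quantile(quantile_range[1])
    # Values outside [lower, upper] are pulled back to the nearest bound;
    # no rows are removed.
    return values.clip(lower=lower, upper=upper)


# Ninety-nine ordinary values plus one extreme value (made-up data).
income = pd.Series(list(range(1, 100)) + [10_000], dtype="float64")
capped = winsorize_series(income, quantile_range=(0.01, 0.99))

print(income.max(), capped.max())
```

Unlike trimming, this approach keeps every row and only caps the extremes, so the row count of the dataset stays the same.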
Methods
fit_transform(X, y=None) → pd.DataFrame
The main method that executes the entire cleaning pipeline. It takes a DataFrame X (and an optional target Series y for target-dependent steps) and runs it through the sequence of operations defined in pipeline_steps. It returns the fully cleaned and processed DataFrame.
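The sequential behaviour described above can be sketched as a simple loop over named steps. This is a conceptual illustration under stated assumptions, not the NoventisDataCleaner implementation: the functions impute_median and scale_minmax, the STEP_REGISTRY mapping, and run_pipeline are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

import pandas as pd


# Hypothetical stand-ins for real cleaning modules.
# Each step takes a DataFrame and returns a new DataFrame.
def impute_median(df: pd.DataFrame) -> pd.DataFrame:
    numeric = df.select_dtypes("number").columns
    out = df.copy()
    out[numeric] = out[numeric].fillna(out[numeric].median())
    return out


def scale_minmax(df: pd.DataFrame) -> pd.DataFrame:
    numeric = df.select_dtypes("number").columns
    out = df.copy()
    span = out[numeric].max() - out[numeric].min()
    # Replace a zero span with 1 to avoid dividing by zero on constant columns.
    out[numeric] = (out[numeric] - out[numeric].min()) / span.replace(0, 1)
    return out


STEP_REGISTRY: Dict[str, Callable[[pd.DataFrame], pd.DataFrame]] = {
    "impute": impute_median,
    "scale": scale_minmax,
}


def run_pipeline(df: pd.DataFrame, pipeline_steps: List[str]) -> pd.DataFrame:
    """Apply each named step in order, feeding each output into the next step."""
    result = df
    for name in pipeline_steps:
        if name not in STEP_REGISTRY:
            raise ValueError(f"Unknown pipeline step: {name!r}")
        result = STEP_REGISTRY[name](result)
    return result


raw = pd.DataFrame({"age": [20.0, None, 40.0], "city": ["A", "B", "A"]})
cleaned = run_pipeline(raw, ["impute", "scale"])
print(cleaned)
```

Because each step consumes the previous step's output, the order in pipeline_steps matters: imputing before scaling means the scaler never sees missing values.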
display_summary_report()
Prints a concise, text-based summary of the entire pipeline run to the console, including a final data quality score and key metrics from each step.
generate_html_report() → HTML
Generates a rich, interactive HTML report of the entire cleaning process. The report includes an overview with a final quality score, as well as dedicated tabs for each step with detailed summaries and before-and-after visualizations.
data_cleaner (The Simplified Helper Function)
For rapid and straightforward data cleaning tasks, the data_cleaner function provides a high-level, simplified interface to the NoventisDataCleaner pipeline. With a single function call, you can execute a standard cleaning sequence using the most common settings, making it ideal for initial data exploration and preparing baseline models.
Import
from noventis.data_cleaner import data_cleaner
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | Union[str, pd.DataFrame] | | The input data. This can be either a pandas DataFrame or a string containing the file path to a CSV file. |
| target_column | Optional[str] | None | The name of the target column in the data, used by target-dependent steps such as target encoding. |
| null_handling | str | 'auto' | A simplified way to specify the imputation method (e.g., 'auto', 'median', 'knn', 'drop'). |
| outlier_handling | dict | None | A simplified way to specify the outlier handling method (e.g., 'auto', 'iqr_trim', 'winsorize'). |
| encoding | str | 'auto' | A simplified way to specify the encoding method (e.g., 'auto', 'ohe', 'target'). |
| scaling | str | 'auto' | A simplified way to specify the scaling method (e.g., 'auto', 'minmax', 'standard'). |
| verbose | bool | True | If True, displays detailed reports and progress during the process. |
| return_instance | bool | False | Determines the function's output. If False (default), only the cleaned pandas DataFrame is returned. If True, the function returns a tuple: (cleaned_DataFrame, cleaner_instance). The instance can be used to generate reports or for further analysis. |
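The two input forms for data and the two return shapes controlled by return_instance can be sketched as follows. This is a hedged illustration of the pattern, not the data_cleaner source: TinyCleaner and clean are hypothetical names, and the only "cleaning" performed is dropping rows with missing values.

```python
from typing import Tuple, Union

import pandas as pd


class TinyCleaner:
    """Hypothetical stand-in for a fitted cleaner; it only drops incomplete rows."""

    def __init__(self) -> None:
        self.rows_dropped = 0

    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        cleaned = df.dropna().reset_index(drop=True)
        self.rows_dropped = len(df) - len(cleaned)
        return cleaned


def clean(
    data: Union[str, pd.DataFrame], return_instance: bool = False
) -> Union[pd.DataFrame, Tuple[pd.DataFrame, TinyCleaner]]:
    # Accept either a CSV file path or an in-memory DataFrame.
    df = pd.read_csv(data) if isinstance(data, str) else data
    cleaner = TinyCleaner()
    cleaned = cleaner.fit_transform(df)
    # Return the instance alongside the data only when the caller asks for it.
    return (cleaned, cleaner) if return_instance else cleaned


frame = pd.DataFrame({"x": [1.0, None, 3.0]})
only_df = clean(frame)
df_and_cleaner = clean(frame, return_instance=True)
cleaned_df, cleaner = df_and_cleaner
print(len(only_df), cleaner.rows_dropped)
```

Returning the instance on request keeps the default call simple while still giving access to whatever the cleaner recorded during the run.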
Model Usage Examples
Example 1: Using the NoventisDataCleaner Class for Full Control
This example shows how to build a custom pipeline with specific parameters for each step.
import pandas as pd
from noventis.data_cleaner import NoventisDataCleaner
# Assume 'dummy_classification_churn.csv' is in your working directory
df = pd.read_csv('dummy_classification_churn.csv')
X = df.drop(columns=['churn'])
y = df['churn']
# 1. Define custom configurations for each step
imputer_config = {'method': 'median'}
outlier_config = {'default_method': 'winsorize', 'quantile_range': (0.01, 0.99)}
encoder_config = {'method': 'auto', 'target_column': 'churn'}
scaler_config = {'method': 'robust'}
# 2. Initialize the cleaner with the custom configurations
cleaner = NoventisDataCleaner(
    pipeline_steps=['impute', 'outlier', 'encode', 'scale'],
    imputer_params=imputer_config,
    outlier_params=outlier_config,
    encoder_params=encoder_config,
    scaler_params=scaler_config,
    verbose=False
)
# 3. Run the entire pipeline
cleaned_df = cleaner.fit_transform(X, y)
# 4. Generate the interactive HTML report
cleaner.generate_html_report()
Example 2: Using the Function and Getting the Report
This shows how to use the simple function but still get the full NoventisDataCleaner instance back to generate the detailed HTML report.
import pandas as pd
from noventis.data_cleaner import data_cleaner
# Assume 'AmesHousing.csv' is available at the path below
df_2 = pd.read_csv('../dataset_for_examples/AmesHousing.csv')
# Run the cleaner and ask for the instance to be returned
df_cleaned, cleaner_instance = data_cleaner(
    data=df_2,
    return_instance=True,
    target_column='SalePrice'
)
# Now, generate the rich HTML report from the returned instance
cleaner_instance.generate_html_report()