DATA_CLEANER

NoventisDataCleaner (The Pipeline Orchestrator)

The NoventisDataCleaner class orchestrates the Noventis preprocessing suite. It lets you configure and run a sequential data cleaning pipeline that chains the imputation, outlier handling, encoding, and scaling modules. It provides a single interface for the whole workflow, from raw data to a model-ready dataset, and generates reports on the entire process.

Import

PYTHON

from noventis.data_cleaner import NoventisDataCleaner

Parameters

Parameter	Type	Default	Description
pipeline_steps	list	`['impute', 'outlier', 'encode', 'scale']`	A list of strings that defines the sequence of cleaning operations. You can reorder or omit steps as needed. Available steps: `impute`, `outlier`, `encode`, `scale`.
imputer_params	dict	`None`	A dictionary of parameters passed directly to the NoventisImputer class. Refer to the NoventisImputer documentation for all available options. Example: {'method': 'knn', 'n_neighbors': 5}
outlier_params	dict	`None`	A dictionary of parameters passed directly to the NoventisOutlierHandler class. Refer to the NoventisOutlierHandler documentation for available options. Example: {'default_method': 'winsorize', 'quantile_range': (0.01, 0.99)}
encoder_params	dict	`None`	A dictionary of parameters passed directly to the NoventisEncoder class. Refer to the NoventisEncoder documentation for available options. Example: {'method': 'auto', 'target_column': 'yourTarget'}
scaler_params	dict	`None`	A dictionary of parameters passed directly to the NoventisScaler class. Refer to the NoventisScaler documentation for available options. Example: {'method': 'robust'}
verbose	bool	`False`	If `True`, prints real-time progress updates to the console as the pipeline executes each step.
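
To make the role of `pipeline_steps` and the per-step parameter dictionaries concrete, here is a minimal pure-Python sketch of a sequential dispatcher. It is not the NoventisDataCleaner implementation: `run_pipeline`, the toy `impute` and `scale` functions, and their arguments are hypothetical names chosen for illustration only.

```python
def run_pipeline(data, pipeline_steps, step_functions, step_params=None):
    """Apply the named steps to `data` in the order given (illustrative only)."""
    step_params = step_params or {}
    for name in pipeline_steps:
        if name not in step_functions:
            raise ValueError(f"Unknown pipeline step: {name!r}")
        # A missing or None params entry means "use the step's defaults".
        params = step_params.get(name) or {}
        data = step_functions[name](data, **params)
    return data


# Toy stand-ins for two steps, operating on a plain list of numbers.
def impute(values, fill_value=0):
    return [fill_value if v is None else v for v in values]


def scale(values, factor=1.0):
    return [v * factor for v in values]


steps = {"impute": impute, "scale": scale}
raw = [1.0, None, 3.0]

# Both steps, each with its own parameter dictionary.
full = run_pipeline(
    raw,
    ["impute", "scale"],
    steps,
    {"impute": {"fill_value": 2.0}, "scale": {"factor": 10.0}},
)

# Omitting a step from the list skips it; omitted params fall back to defaults.
only_impute = run_pipeline(raw, ["impute"], steps)

print(full)         # [10.0, 20.0, 30.0]
print(only_impute)  # [1.0, 0, 3.0]
```

The same pattern explains the table above: the order of the list controls the order of execution, and each `*_params` dictionary reaches only its own step.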

Methods

fit(X)
Analyzes the input DataFrame X and learns the settings that the configured pipeline steps need, such as the imputation strategy.
transform(X)
Returns pd.DataFrame. Applies the learned steps to the DataFrame X and returns the transformed data.
fit_transform(X)
Returns pd.DataFrame. Convenience method that runs fit and transform in a single step.
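
The split between fit and transform matters because values learned from one dataset can be reused on another. The sketch below shows that contract with a toy mean filler written in plain Python. `MeanFiller` is a hypothetical class for illustration, it uses `None` rather than NaN to mark missing values, and it does not reflect how the Noventis classes are implemented.

```python
from statistics import mean


class MeanFiller:
    """Toy transformer: learns a mean in fit, applies it in transform."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.mean_ = mean(observed)  # the learned state
        return self

    def transform(self, values):
        return [self.mean_ if v is None else v for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


train = [22, 38, 26, 35, None, 28, 50, None]
new = [None, 41]

filler = MeanFiller()
train_filled = filler.fit_transform(train)  # learn and apply in one call
new_filled = filler.transform(new)          # reuses the mean learned from `train`
```

Calling `transform` on `new` fills its gap with the mean of `train`, not the mean of `new`, which is the behavior you want when applying a fitted pipeline to unseen data.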

Model Usage Examples

The examples below use NoventisImputer, the class that receives imputer_params, on its own. First, create some sample data with missing values.

PYTHON

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Age': [22, 38, 26, 35, np.nan, 28, 50, np.nan],
    'Salary': [72000, 48000, 54000, 61000, 75000, np.nan, 83000, 45000],
    'City': ['London', 'Paris', 'New York', np.nan, 'Tokyo', 'London', 'Paris', 'New York'],
    'Experience': [1, 10, 3, 8, 5, 4, 20, np.nan]  # Integer-valued, but stored as float64 because of the NaN
})

Example 1: Automatic Imputation (Default)

This is the simplest use case. The imputer will automatically use the mean for numeric columns (Age, Salary, Experience) and the mode for the categorical column (City).

PYTHON

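# Assumption: NoventisImputer is importable from the same module as NoventisDataCleaner
from noventis.data_cleaner import NoventisImputer

# Initialize the imputer without a method so it runs in auto mode
imputer = NoventisImputer(verbose=True)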

# Fit and transform the data
df_imputed = imputer.fit_transform(df)
print(df_imputed)

Example 2: Using a Global Method (KNN)

Here, we apply the K-Nearest Neighbors algorithm to all numeric columns. For categorical columns, where KNN is not applicable, the imputer falls back to the mode.

PYTHON

# Initialize with method='knn'
imputer_knn = NoventisImputer(method='knn', n_neighbors=3, verbose=True)

# Fit and transform
df_knn_imputed = imputer_knn.fit_transform(df)
print(df_knn_imputed)
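
For intuition about what KNN imputation does, the following pure-Python sketch fills a missing value with the average of that column over the k nearest rows, measuring distance only on columns that both rows have observed. This is a simplified illustration, not the algorithm NoventisImputer uses internally. `knn_impute` is a hypothetical helper, `None` stands in for NaN, and only the Age and Experience columns of the sample data are included so that no feature scaling is needed.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of that column over the k nearest rows."""

    def distance(a, b):
        # Compare only the columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        # Root mean squared difference, so the number of shared columns does not bias the result.
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows with column j observed and a finite distance.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# (Age, Experience) pairs from the sample data, with None for missing values.
rows = [(22, 1), (38, 10), (26, 3), (35, 8), (None, 5), (28, 4), (50, 20), (None, None)]
filled = knn_impute(rows, k=3)

print(filled[4])  # Age filled from the three rows closest in Experience
print(filled[7])  # no observed columns to compare on, so it stays unfilled
```

The last row shows why a fallback strategy is useful: a row with no observed numeric values has no neighbors to borrow from, so this sketch leaves it untouched.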

Example 3: Per-Column Custom Strategy

This example demonstrates the highest level of control, where we define a specific imputation method for each column.

PYTHON

# Define a dictionary with specific methods for each column
custom_methods = {
    'Age': 'median',
    'Salary': 'mean',
    'City': 'mode',
    'Experience': 'constant'
}

# Initialize the imputer with the custom dictionary and a fill_value for 'constant'
imputer_custom = NoventisImputer(method=custom_methods, fill_value=0, verbose=True)

# Fit and transform
df_custom_imputed = imputer_custom.fit_transform(df)
print(df_custom_imputed)
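
To see what each strategy in `custom_methods` computes on the sample data, here is a standard-library sketch that applies the same per-column choices by hand. `impute_column` is a hypothetical helper, `None` stands in for NaN, and the alphabetical tie-break for the mode is an assumption made for this sketch. NoventisImputer may resolve ties differently.

```python
from statistics import mean, median, multimode


def impute_column(values, method, fill_value=None):
    """Fill None entries in one column using the named strategy."""
    observed = [v for v in values if v is not None]
    if method == "mean":
        fill = mean(observed)
    elif method == "median":
        fill = median(observed)
    elif method == "mode":
        # Assumption: ties between equally frequent values are broken alphabetically.
        fill = sorted(multimode(observed))[0]
    elif method == "constant":
        fill = fill_value
    else:
        raise ValueError(f"Unknown method: {method!r}")
    return [fill if v is None else v for v in values]


columns = {
    "Age": [22, 38, 26, 35, None, 28, 50, None],
    "Salary": [72000, 48000, 54000, 61000, 75000, None, 83000, 45000],
    "City": ["London", "Paris", "New York", None, "Tokyo", "London", "Paris", "New York"],
    "Experience": [1, 10, 3, 8, 5, 4, 20, None],
}
custom_methods = {"Age": "median", "Salary": "mean", "City": "mode", "Experience": "constant"}

filled = {
    name: impute_column(values, custom_methods[name], fill_value=0)
    for name, values in columns.items()
}

print(filled["Age"][4])         # 31.5, the median of the six observed ages
print(filled["City"][3])        # London
print(filled["Experience"][7])  # 0
```

Note that City has a three-way tie (London, Paris, and New York each appear twice), so the filled value depends entirely on the tie-breaking rule.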