DATA_CLEANER

NoventisOutlierHandler

Outliers, or extreme values, can significantly skew statistical analyses and degrade the performance of machine learning models. Handling them correctly is a crucial step in data preprocessing. The NoventisOutlierHandler provides a systematic and flexible framework for identifying and managing outliers in your dataset.

This tool allows you to choose between two primary strategies: removing outlier rows entirely (trimming) or capping their values to a reasonable range (winsorizing). It features an intelligent 'auto' mode to select an appropriate strategy based on your data's characteristics, but also offers fine-grained control to apply specific methods to different columns.

Import

PYTHON

from noventis.data_cleaner import NoventisOutlierHandler

Parameters

feature_method_map

Type: Optional[Dict[str, str]]
Default: None

A dictionary to specify a unique outlier handling method for each column. Any column not in this map will use the default_method.

Example: { "Salary": "winsorize", "Age": "iqr_trim" }
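The fallback rule can be pictured as a dictionary lookup with a default. The following is a minimal sketch, not the library's implementation: the helper name `resolve_methods` is hypothetical, and skipping non-numeric columns is an assumption based on the statement that the default applies to numeric columns.

```python
import pandas as pd

def resolve_methods(df, feature_method_map=None, default_method="auto"):
    """Hypothetical helper: pick the handling method for each numeric column."""
    feature_method_map = feature_method_map or {}
    # Assumption: only numeric columns are considered for outlier handling.
    numeric_columns = df.select_dtypes(include="number").columns
    # Columns named in the map use their own method; the rest use the default.
    return {col: feature_method_map.get(col, default_method) for col in numeric_columns}

df = pd.DataFrame({
    "Salary": [1000.0, 2000.0],
    "Age": [30, 40],
    "Bonus": [10.0, 20.0],
    "Name": ["a", "b"],
})

plan = resolve_methods(df, {"Salary": "winsorize", "Age": "iqr_trim"})
print(plan)
```

Here "Bonus" is not in the map, so it falls back to the default method, and "Name" is left out because it is not numeric.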

default_method

Type: str
Default: 'auto'

The default method applied to all numeric columns not specified in feature_method_map.

auto: Selects a method for each column from its sample size and skewness (see the logic below).
quantile_trim: Removes rows where values fall outside the defined quantile_range.
iqr_trim: Removes rows where values fall outside the IQR range defined by iqr_multiplier.
winsorize: Caps values at boundaries defined by quantile_range instead of removing rows.
none: Skips outlier handling for the column.

How does default_method='auto' work?

When 'auto' is selected, the handler chooses a method for each column based on the following logic:

Small Dataset?: If a column has fewer data points than min_data_threshold, it uses iqr_trim (robust for small samples).
Skewed Data?: If absolute skewness > skew_threshold, it uses winsorize (cap outliers without losing data).
Otherwise: For larger, non-skewed datasets, it uses quantile_trim.
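The three rules above can be sketched as a small function. This is a hypothetical re-creation rather than the library's code: the helper name `choose_method`, the use of pandas' `Series.skew()` as the skewness measure, and counting only non-null values are all assumptions.

```python
import pandas as pd

def choose_method(column: pd.Series,
                  min_data_threshold: int = 100,
                  skew_threshold: float = 0.5) -> str:
    """Hypothetical sketch of the 'auto' decision rules."""
    # Assumption: missing values do not count as data points.
    values = column.dropna()

    # Rule 1: small samples use IQR trimming.
    if len(values) < min_data_threshold:
        return "iqr_trim"

    # Rule 2: skewed columns are capped instead of trimmed.
    # Assumption: skewness is measured with pandas' Series.skew().
    if abs(values.skew()) > skew_threshold:
        return "winsorize"

    # Rule 3: larger, roughly symmetric columns use quantile trimming.
    return "quantile_trim"

small = pd.Series(range(50))                   # 50 points, below the threshold
symmetric = pd.Series(range(200))              # 200 evenly spaced points
skewed = pd.Series([1] * 190 + [1000] * 10)    # 200 points with a long right tail

print(choose_method(small))       # iqr_trim
print(choose_method(symmetric))   # quantile_trim
print(choose_method(skewed))      # winsorize
```

The checks run in the order listed, so a small column is assigned `iqr_trim` even when it is also skewed.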

iqr_multiplier

Type: float
Default: 1.5

The multiplier applied to the Interquartile Range (IQR) to determine the outlier boundaries when using the 'iqr_trim' method. The boundaries are calculated as Q1 - multiplier·IQR and Q3 + multiplier·IQR, where IQR = Q3 - Q1.
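As a worked illustration of the boundary formula, the sketch below computes the IQR fences with pandas and keeps only the rows inside them. The helper name `iqr_bounds` is hypothetical, and both the quantile interpolation (pandas' default, linear) and the inclusive comparison are assumptions about how the handler behaves.

```python
import pandas as pd

def iqr_bounds(column: pd.Series, iqr_multiplier: float = 1.5):
    """Hypothetical helper: return (lower, upper) IQR fences for a column."""
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    return q1 - iqr_multiplier * iqr, q3 + iqr_multiplier * iqr

values = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 26, 300])

lower, upper = iqr_bounds(values)
# Assumption: values exactly on a boundary are kept.
kept = values[(values >= lower) & (values <= upper)]

print(f"Bounds: [{lower}, {upper}]")
print(f"Rows kept: {len(kept)} of {len(values)}")
```

For this series Q1 is 14.5 and Q3 is 23.5, so the IQR is 9 and the fences sit at 1.0 and 37.0; only the value 300 falls outside. A larger `iqr_multiplier` widens the fences and removes fewer rows.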

quantile_range

Type: Tuple[float, float]
Default: (0.05, 0.95)

A tuple specifying the lower and upper quantile boundaries, each a fraction between 0 and 1. It is used by the quantile_trim method for trimming and by the winsorize method for capping.
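The difference between the two methods that share this parameter can be shown on a toy column. This is a sketch with plain pandas calls, not the handler's own code; the column name `value` is made up, and the inclusive boundary comparison is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 50, 51, 52, 53, 54, 55, 56, 57, 58, 200]})

# Boundaries taken from the default quantile_range of (0.05, 0.95).
low_q, high_q = 0.05, 0.95
lower = df["value"].quantile(low_q)
upper = df["value"].quantile(high_q)

# Trimming: drop rows outside the boundaries (assumed inclusive).
trimmed = df[df["value"].between(lower, upper)]

# Winsorizing: keep every row, but cap values at the boundaries.
winsorized = df.assign(value=df["value"].clip(lower=lower, upper=upper))

print(f"Original rows:   {len(df)}")
print(f"Trimmed rows:    {len(trimmed)}")
print(f"Winsorized rows: {len(winsorized)}")
print(winsorized["value"].agg(["min", "max"]))
```

Trimming removes the two extreme rows (1 and 200), while winsorizing keeps all eleven rows and pulls those two values in to the boundaries.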

min_data_threshold

Type: int
Default: 100

The number of data points a column must have to be treated as a large sample. Below this count, the 'auto' mode prefers 'iqr_trim'.

skew_threshold

Type: float
Default: 0.5

Absolute skewness threshold above which the 'auto' mode will prefer 'winsorize'.

verbose

Type: bool
Default: False

If True, a summary of the outlier handling process will be printed after fitting.

Methods

fit(X)
Analyzes the input DataFrame X and learns the outlier handling strategy for each column.
transform(X)
Returns pd.DataFrame. Applies the learned outlier handling to the DataFrame X and returns the transformed data.
fit_transform(X)
Returns pd.DataFrame. A convenience method that performs the fit and transform operations in a single step.

Model Usage Examples

First, let's create sample data with some obvious outliers.

PYTHON

import pandas as pd
import numpy as np

# Seed the random generator so the example is reproducible
np.random.seed(42)

# Create a base normal distribution
base_data = np.random.normal(loc=100, scale=20, size=500)

# Add some extreme outliers
outliers = np.array([5, 10, 250, 300, 320])

df = pd.DataFrame({
    'Feature_A': np.concatenate([base_data, outliers]),
    # 502 normal values + 3 outliers = 505 rows, matching the length of Feature_A
    'Feature_B': np.concatenate([np.random.normal(50, 10, 502), np.array([-50, 150, 160])])
})

Example 1: Automatic Handling

This is the simplest approach. The handler selects a method for each feature based on its sample size and skewness, and with verbose=True it prints a summary of what was done.

PYTHON

# Initialize the handler with default 'auto' mode
handler = NoventisOutlierHandler(verbose=True)

# Fit and transform the data
df_cleaned = handler.fit_transform(df)

print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {df_cleaned.shape}")

Example 2: Global Method (Winsorizing)

This example applies a single strategy to all columns. We will use 'winsorize' to cap extreme values at the boundaries defined by the 1st and 99th percentiles instead of removing them.

PYTHON

# Initialize with a global method and a specific quantile range
handler_winsorize = NoventisOutlierHandler(
    default_method='winsorize',
    quantile_range=(0.01, 0.99),
    verbose=True
)

# Fit and transform
df_winsorized = handler_winsorize.fit_transform(df)

print(f"Original shape: {df.shape}")
print(f"Winsorized shape: {df_winsorized.shape}")

print("\nMin/Max values before:\n", df.agg(['min', 'max']))
print("\nMin/Max values after:\n", df_winsorized.agg(['min', 'max']))

Example 3: Per-Column Custom Strategy

This example demonstrates how to apply different outlier handling rules for each feature, providing fine-grained control.

PYTHON

# Define a dictionary with a specific method for each feature
method_map = {
    'Feature_A': 'iqr_trim',   # Use robust IQR trimming for Feature_A
    'Feature_B': 'winsorize'   # Cap extreme values for Feature_B
}

# Initialize the handler with the custom map
handler_custom = NoventisOutlierHandler(feature_method_map=method_map, verbose=True)

# Fit and transform
df_custom = handler_custom.fit_transform(df)

print(f"Original shape: {df.shape}")
print(f"Custom handled shape: {df_custom.shape}")