DATA_CLEANER

NoventisOutlierHandler

Outliers, or extreme values, can significantly skew statistical analyses and degrade the performance of machine learning models. Handling them correctly is a crucial step in data preprocessing. The NoventisOutlierHandler provides a systematic and flexible framework for identifying and managing outliers in your dataset.

This tool allows you to choose between two primary strategies: removing outlier rows entirely (trimming) or capping their values to a reasonable range (winsorizing). It features an intelligent 'auto' mode to select an appropriate strategy based on your data's characteristics, but also offers fine-grained control to apply specific methods to different columns.

Import
PYTHON
from noventis.data_cleaner import NoventisOutlierHandler
Parameters
feature_method_map
Type
Optional[Dict[str, str]]
Default
None
A dictionary to specify a unique outlier handling method for each column. Any column not in this map will use the default_method.
Example: { "Salary": "winsorize", "Age": "iqr_trim" }
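The fallback rule described above can be pictured as a dictionary lookup with a default. This is only a sketch of the documented behavior, not the handler's internal code; the column name "Tenure" and the variable `resolved` are made up for illustration.

```python
# Per-column overrides, as in the example above
feature_method_map = {"Salary": "winsorize", "Age": "iqr_trim"}
default_method = "auto"

# "Tenure" is a hypothetical numeric column that is not in the map
columns = ["Salary", "Age", "Tenure"]

# Columns missing from the map fall back to default_method
resolved = {col: feature_method_map.get(col, default_method) for col in columns}
print(resolved)
# {'Salary': 'winsorize', 'Age': 'iqr_trim', 'Tenure': 'auto'}
```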
default_method
Type
str
Default
'auto'
The default method applied to all numeric columns not specified in feature_method_map.
  1. auto: Intelligently selects a method based on data properties.
  2. quantile_trim: Removes rows where values fall outside the defined quantile_range.
  3. iqr_trim: Removes rows where values fall outside the IQR range defined by iqr_multiplier.
  4. winsorize: Caps values at boundaries defined by quantile_range instead of removing rows.
  5. none: Skips outlier handling for the column.
How does default_method='auto' work?

When 'auto' is selected, the handler chooses a method for each column based on the following logic:

  1. Small dataset: If a column has fewer data points than min_data_threshold, it uses iqr_trim (robust for small samples).
  2. Skewed data: If the column's absolute skewness exceeds skew_threshold, it uses winsorize (caps outliers without losing data).
  3. Otherwise: For larger, non-skewed columns, it uses quantile_trim.
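The three rules can be sketched in plain pandas as follows. The function name `choose_auto_method` is hypothetical. Dropping missing values before counting, and using pandas' `Series.skew()` as the skewness measure, are assumptions rather than documented behavior of the handler.

```python
import pandas as pd

def choose_auto_method(series: pd.Series,
                       min_data_threshold: int = 100,
                       skew_threshold: float = 0.5) -> str:
    """Hypothetical helper mirroring the three documented 'auto' rules."""
    values = series.dropna()  # assumption: missing values are not counted
    # Rule 1: small samples use IQR trimming
    if len(values) < min_data_threshold:
        return "iqr_trim"
    # Rule 2: skewed columns are capped instead of trimmed
    if abs(values.skew()) > skew_threshold:
        return "winsorize"
    # Rule 3: larger, roughly symmetric columns are trimmed by quantile
    return "quantile_trim"

small = pd.Series(range(20), dtype=float)            # only 20 points
skewed = pd.Series([1.0] * 190 + [50.0, 80.0, 120.0, 200.0, 400.0] * 2)
symmetric = pd.Series(range(200), dtype=float)       # evenly spread, skew of 0

print(choose_auto_method(small))      # iqr_trim
print(choose_auto_method(skewed))     # winsorize
print(choose_auto_method(symmetric))  # quantile_trim
```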
iqr_multiplier
Type
float
Default
1.5
The multiplier applied to the interquartile range (IQR) to determine the outlier boundaries when using the 'iqr_trim' method. The boundaries are calculated as Q1 - multiplier·IQR and Q3 + multiplier·IQR.
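As a worked illustration of the boundary formula, the sketch below computes the IQR limits with pandas and drops the values outside them. The helper `iqr_bounds` is hypothetical. Treating values exactly on a boundary as inliers is an assumption, since the documentation does not say whether the comparison is inclusive.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, iqr_multiplier: float = 1.5):
    """Hypothetical helper: return (lower, upper) from the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - iqr_multiplier * iqr, q3 + iqr_multiplier * iqr

values = pd.Series([10, 12, 13, 14, 15, 16, 18, 95], dtype=float)
lower, upper = iqr_bounds(values)

# Assumption: boundary values are kept (inclusive comparison)
kept = values[(values >= lower) & (values <= upper)]

print(lower, upper)  # 7.125 22.125
print(len(kept))     # 7 (the value 95 is removed)
```

A larger iqr_multiplier widens the band and removes fewer rows.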
quantile_range
Type
Tuple[float, float]
Default
(0.05, 0.95)
A tuple specifying the lower and upper quantile boundaries. It is used by the quantile_trim method for trimming and by the winsorize method for capping.
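The difference between the two methods that share quantile_range can be seen in a small pandas sketch. This only illustrates trimming versus capping under the default range. `quantile_bounds` is a hypothetical helper, and the inclusive comparison is an assumption.

```python
import pandas as pd

def quantile_bounds(series: pd.Series, quantile_range=(0.05, 0.95)):
    """Hypothetical helper: return the values at the lower and upper quantiles."""
    lower_q, upper_q = quantile_range
    return series.quantile(lower_q), series.quantile(upper_q)

# 99 ordinary values plus one extreme value
values = pd.Series(list(range(1, 100)) + [1000], dtype=float)
lower, upper = quantile_bounds(values)

# quantile_trim idea: drop values outside the range (fewer rows)
trimmed = values[values.between(lower, upper)]

# winsorize idea: cap values at the boundaries (same number of rows)
winsorized = values.clip(lower=lower, upper=upper)

print(len(values), len(trimmed), len(winsorized))  # 100 90 100
print(values.max(), winsorized.max())
```

Trimming shrinks the data, while capping keeps every row but replaces the extremes with the boundary values.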
min_data_threshold
Type
int
Default
100
Minimum number of data points below which the 'auto' mode will prefer 'iqr_trim'.
skew_threshold
Type
float
Default
0.5
Absolute skewness threshold above which the 'auto' mode will prefer 'winsorize'.
verbose
Type
bool
Default
False
If True, a summary of the outlier handling process will be printed after fitting.
Methods
  • fit(X)

    Analyzes the data and learns the outlier handling strategy from the input DataFrame X.

  • transform(X)

    Returns pd.DataFrame. Applies the learned outlier handling to the DataFrame X and returns the transformed data.

  • fit_transform(X)

    Returns pd.DataFrame. A convenience method that performs the fit and transform operations in a single step.
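To show why fit and transform are separate steps, here is a minimal stand-in that learns capping boundaries from one DataFrame and applies them to another. `SimpleWinsorizer` is a hypothetical class written for this illustration. It is not the NoventisOutlierHandler implementation and supports capping only.

```python
import pandas as pd

class SimpleWinsorizer:
    """Hypothetical stand-in illustrating the fit/transform pattern."""

    def __init__(self, quantile_range=(0.05, 0.95)):
        self.quantile_range = quantile_range
        self.bounds_ = {}

    def fit(self, X: pd.DataFrame):
        # Learn per-column boundaries from the data passed to fit
        lower_q, upper_q = self.quantile_range
        for col in X.select_dtypes(include="number").columns:
            self.bounds_[col] = (X[col].quantile(lower_q), X[col].quantile(upper_q))
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Apply the learned boundaries without recomputing them
        X_out = X.copy()
        for col, (lower, upper) in self.bounds_.items():
            X_out[col] = X_out[col].clip(lower=lower, upper=upper)
        return X_out

    def fit_transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return self.fit(X).transform(X)

train = pd.DataFrame({"x": [float(v) for v in range(1, 101)]})
new = pd.DataFrame({"x": [-500.0, 50.0, 900.0]})

capper = SimpleWinsorizer().fit(train)
capped = capper.transform(new)
print(capped)
```

The boundaries come from `train` only, so the extreme values in `new` are capped to limits learned earlier rather than influencing them.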

Model Usage Examples
First, let's create sample data with some obvious outliers.
PYTHON
import numpy as np
import pandas as pd

# Create a base normal distribution
base_data = np.random.normal(loc=100, scale=20, size=500)

# Add some extreme outliers
outliers = np.array([5, 10, 250, 300, 320])

# Both columns must have the same length (505 rows), so Feature_B
# combines 502 normal values with its 3 outliers
df = pd.DataFrame({
    'Feature_A': np.concatenate([base_data, outliers]),
    'Feature_B': np.concatenate([np.random.normal(50, 10, 502), np.array([-50, 150, 160])])
})
Example 1: Automatic Handling

This is the simplest approach. The handler will automatically decide the best method for each feature based on its statistical properties.

PYTHON
# Initialize the handler with the default 'auto' mode
handler = NoventisOutlierHandler(verbose=True)

# Fit and transform the data
df_cleaned = handler.fit_transform(df)

print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {df_cleaned.shape}")
Example 2: Global Method (Winsorizing)

This example applies a single strategy to all columns. We will use 'winsorize' to cap extreme values at the boundaries defined by the 1st and 99th percentiles instead of removing them.

PYTHON
# Initialize with a global method and a specific quantile range
handler_winsorize = NoventisOutlierHandler(
    default_method='winsorize',
    quantile_range=(0.01, 0.99),
    verbose=True
)

# Fit and transform
df_winsorized = handler_winsorize.fit_transform(df)

print(f"Original shape: {df.shape}")
print(f"Winsorized shape: {df_winsorized.shape}")
print("\nMin/Max values before:\n", df.agg(['min', 'max']))
print("\nMin/Max values after:\n", df_winsorized.agg(['min', 'max']))
Example 3: Per-Column Custom Strategy

This example demonstrates how to apply different outlier handling rules for each feature, providing fine-grained control.

PYTHON
# Define a dictionary with a specific method for each feature
method_map = {
    'Feature_A': 'iqr_trim',   # Use robust IQR trimming for Feature_A
    'Feature_B': 'winsorize'   # Cap extreme values for Feature_B
}

# Initialize the handler with the custom map
handler_custom = NoventisOutlierHandler(feature_method_map=method_map, verbose=True)

# Fit and transform
df_custom = handler_custom.fit_transform(df)

print(f"Original shape: {df.shape}")
print(f"Custom handled shape: {df_custom.shape}")