DATA_CLEANER

NoventisOutlierHandler

Outliers, or extreme values, can significantly skew statistical analyses and degrade the performance of machine learning models. Handling them correctly is a crucial step in data preprocessing. The NoventisOutlierHandler provides a systematic and flexible framework for identifying and managing outliers in your dataset.

This tool allows you to choose between two primary strategies: removing outlier rows entirely (trimming) or capping their values to a reasonable range (winsorizing). It features an intelligent 'auto' mode to select an appropriate strategy based on your data's characteristics, but also offers fine-grained control to apply specific methods to different columns.

Import
PYTHON
from noventis.data_cleaner import NoventisOutlierHandler
Parameters
feature_method_map
Type: Optional[Dict[str, str]]
Default: None
A dictionary that assigns a specific outlier handling method to individual columns. Any numeric column not in this map uses default_method.
Example: { 'Salary': 'winsorize', 'Age': 'iqr_trim' }
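The fallback behaviour described above amounts to a dictionary lookup with a default. The following is a minimal sketch of that idea, not the library's own code; `resolve_methods` is a hypothetical helper introduced only for illustration.

```python
from typing import Dict, List, Optional


def resolve_methods(
    columns: List[str],
    feature_method_map: Optional[Dict[str, str]] = None,
    default_method: str = "auto",
) -> Dict[str, str]:
    """Hypothetical helper: pick the method name that applies to each column."""
    feature_method_map = feature_method_map or {}
    # Columns listed in the map keep their own method; all others fall back
    return {col: feature_method_map.get(col, default_method) for col in columns}


resolved = resolve_methods(
    ["Salary", "Age", "Tenure"],
    feature_method_map={"Salary": "winsorize", "Age": "iqr_trim"},
)
print(resolved)
```

Here 'Tenure' is not in the map, so it resolves to the default method.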
default_method
Type: str
Default: 'auto'
The default method applied to all numeric columns not specified in feature_method_map.
  1. 'auto': Intelligently selects a method based on data properties.
  2. 'quantile_trim': Removes rows where values fall outside the defined quantile_range.
  3. 'iqr_trim': Removes rows where values fall outside the IQR range defined by iqr_multiplier.
  4. 'winsorize': Caps values at boundaries defined by quantile_range instead of removing rows.
  5. 'none': Skips outlier handling for the column.
How does default_method='auto' work?

When 'auto' is selected, the handler chooses a method for each column based on the following logic:

  1. Small dataset: If a column has fewer data points than min_data_threshold, it uses iqr_trim (robust for small samples).
  2. Skewed data: If the column's absolute skewness is greater than skew_threshold, it uses winsorize (caps outliers without losing data).
  3. Otherwise: For larger, non-skewed columns, it uses quantile_trim.
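The three rules above can be sketched as a small decision function. This is an illustrative re-implementation under stated assumptions, not the handler's actual code: `choose_method` is a hypothetical name, the rules are assumed to be checked in the listed order, and skewness is assumed to be the sample skewness returned by pandas' `Series.skew()`.

```python
import numpy as np
import pandas as pd


def choose_method(
    values: pd.Series,
    min_data_threshold: int = 100,
    skew_threshold: float = 0.5,
) -> str:
    """Hypothetical re-implementation of the three documented 'auto' rules."""
    values = values.dropna()
    # Rule 1: small samples use IQR-based trimming
    if len(values) < min_data_threshold:
        return "iqr_trim"
    # Rule 2: strongly skewed columns are capped rather than trimmed
    if abs(values.skew()) > skew_threshold:
        return "winsorize"
    # Rule 3: larger, roughly symmetric columns use quantile trimming
    return "quantile_trim"


small = pd.Series(np.linspace(0.0, 1.0, 50))          # 50 points, below the threshold
skewed = pd.Series(np.linspace(0.0, 5.0, 1000) ** 4)  # long right tail
symmetric = pd.Series(np.linspace(-1.0, 1.0, 1000))   # evenly spaced, skewness near 0

for name, col in [("small", small), ("skewed", skewed), ("symmetric", symmetric)]:
    print(name, "->", choose_method(col))
```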
iqr_multiplier
Type: float
Default: 1.5
The multiplier applied to the interquartile range (IQR) to determine the outlier boundaries when using the 'iqr_trim' method. The boundaries are calculated as Q1 - multiplier·IQR and Q3 + multiplier·IQR.
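As a worked illustration of the boundary formula, the sketch below computes the bounds with pandas and drops the rows outside them. `iqr_bounds` is a hypothetical helper, and two details are assumptions rather than documented behaviour: quartiles use pandas' default linear interpolation, and values exactly on a boundary are kept.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, multiplier: float = 1.5):
    """Return (lower, upper) bounds: Q1 - multiplier*IQR and Q3 + multiplier*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


df_demo = pd.DataFrame({"value": [10, 11, 11, 12, 12, 13, 100]})
lower, upper = iqr_bounds(df_demo["value"])

# Keep only rows inside the bounds; this drops whole rows, as trimming does
trimmed = df_demo[df_demo["value"].between(lower, upper)]
print(lower, upper, len(trimmed))
```

For this data Q1 = 11 and Q3 = 12.5, so the IQR is 1.5 and the bounds are 8.75 and 14.75. The value 100 falls outside them and its row is removed.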
quantile_range
Type: Tuple[float, float]
Default: (0.05, 0.95)
A tuple specifying the lower and upper quantile boundaries. It is used by the 'quantile_trim' method for trimming and by the 'winsorize' method for capping.
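To make the difference between the two uses of quantile_range concrete, here is a minimal sketch using plain pandas. It is not the handler's implementation; it assumes the bounds come from `Series.quantile` and that capping behaves like `Series.clip`.

```python
import numpy as np
import pandas as pd

# 99 ordinary values plus two extreme ones
values = pd.Series(np.concatenate([np.arange(1, 100), [-1000, 1000]]), dtype=float)

quantile_range = (0.05, 0.95)
lower = values.quantile(quantile_range[0])
upper = values.quantile(quantile_range[1])

# quantile_trim-style: drop every value outside [lower, upper]
trimmed = values[values.between(lower, upper)]

# winsorize-style: keep every value but cap it at the bounds
winsorized = values.clip(lower=lower, upper=upper)

print(len(values), len(trimmed), len(winsorized))
print(winsorized.min(), winsorized.max())
```

Trimming at (0.05, 0.95) discards roughly the outermost 10% of values whether or not they are extreme, while capping keeps the row count unchanged. This is one reason to widen the range, for example to (0.01, 0.99), when data loss matters.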
min_data_threshold
Type: int
Default: 100
Minimum number of data points below which the 'auto' mode will prefer 'iqr_trim'.
skew_threshold
Type: float
Default: 0.5
Absolute skewness threshold above which the 'auto' mode will prefer 'winsorize'.
verbose
Type: bool
Default: False
If True, a summary of the outlier handling process will be printed after fitting.
Methods
  • fit(X)

    Analyzes the data and learns the outlier handling configuration from the input DataFrame X.

  • transform(X)

    Applies the learned strategy to X and returns the transformed pd.DataFrame.

  • fit_transform(X)

    Convenience method that performs fit and transform in one step.
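The fit/transform split follows a familiar pattern: learn parameters once, then apply them to any DataFrame. The sketch below illustrates that pattern with a hypothetical stand-in class, `SimpleCapper`, which is not part of noventis and only caps values. Whether NoventisOutlierHandler stores and reuses its bounds in exactly this way is an assumption.

```python
import pandas as pd


class SimpleCapper:
    """Hypothetical stand-in, not part of noventis: learns caps in fit, applies them in transform."""

    def __init__(self, quantile_range=(0.05, 0.95)):
        self.quantile_range = quantile_range
        self.bounds_ = {}

    def fit(self, X: pd.DataFrame) -> "SimpleCapper":
        low_q, high_q = self.quantile_range
        # Learn one (lower, upper) pair per numeric column
        for col in X.select_dtypes(include="number").columns:
            self.bounds_[col] = (X[col].quantile(low_q), X[col].quantile(high_q))
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        # Reuse the bounds learned in fit; nothing is re-estimated here
        for col, (lower, upper) in self.bounds_.items():
            X[col] = X[col].clip(lower=lower, upper=upper)
        return X

    def fit_transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return self.fit(X).transform(X)


train = pd.DataFrame({"x": [float(v) for v in range(1, 101)]})
new_data = pd.DataFrame({"x": [-500.0, 50.0, 500.0]})

capper = SimpleCapper().fit(train)
capped = capper.transform(new_data)
print(capped["x"].tolist())
```

The extreme values in new_data are capped at bounds learned from train, while the ordinary value 50.0 passes through unchanged.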

Model Usage Examples
First, let's create sample data with some obvious outliers.
PYTHON
import pandas as pd
import numpy as np

# Fix the seed so the example is reproducible
np.random.seed(42)

# Create a base normal distribution
base_data = np.random.normal(loc=100, scale=20, size=500)

# Add some extreme outliers
outliers = np.array([5, 10, 250, 300, 320])

# Both columns must have the same length (505 rows), so Feature_B
# uses 502 normal values plus its 3 outliers
df = pd.DataFrame({
    'Feature_A': np.concatenate([base_data, outliers]),
    'Feature_B': np.concatenate([np.random.normal(50, 10, 502), np.array([-50, 150, 160])])
})
Example 1: Automatic Handling

This is the simplest approach. The handler will automatically decide the best method for each feature based on its statistical properties.

PYTHON
# Initialize the handler with default 'auto' mode
handler = NoventisOutlierHandler(verbose=True)

# Fit and transform the data
df_cleaned = handler.fit_transform(df)

print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {df_cleaned.shape}")
Example 2: Global Method (Winsorizing)

This example applies a single strategy to all columns. We will use 'winsorize' to cap extreme values at the boundaries defined by the 1st and 99th percentiles instead of removing them, which is useful when you want to preserve all your data rows.

PYTHON
# Initialize with a global method and a specific quantile range
handler_winsorize = NoventisOutlierHandler(
    default_method='winsorize',
    quantile_range=(0.01, 0.99),
    verbose=True
)

# Fit and transform
df_winsorized = handler_winsorize.fit_transform(df)

print(f"Original shape: {df.shape}")
print(f"Winsorized shape: {df_winsorized.shape}")
print("\nMin/Max values before:\n", df.agg(['min', 'max']))
print("\nMin/Max values after:\n", df_winsorized.agg(['min', 'max']))
Example 3: Per-Column Custom Strategy

This shows how to apply different rules to different columns, providing maximum control over the process.

PYTHON
# Define a dictionary with a specific method for each feature
method_map = {
    'Feature_A': 'iqr_trim',   # Use robust IQR trimming for Feature_A
    'Feature_B': 'winsorize'   # Cap extreme values for Feature_B
}

# Initialize the handler with the custom map
handler_custom = NoventisOutlierHandler(
    feature_method_map=method_map,
    verbose=True
)

# Fit and transform
df_custom = handler_custom.fit_transform(df)

print(f"Original shape: {df.shape}")
print(f"Custom handled shape: {df_custom.shape}")