NoventisOutlierHandler
Outliers, or extreme values, can significantly skew statistical analyses and degrade the performance of machine learning models. Handling them correctly is a crucial step in data preprocessing. The NoventisOutlierHandler provides a systematic and flexible framework for identifying and managing outliers in your dataset.
This tool allows you to choose between two primary strategies: removing outlier rows entirely (trimming) or capping their values to a reasonable range (winsorizing). It features an intelligent 'auto' mode to select an appropriate strategy based on your data's characteristics, but also offers fine-grained control to apply specific methods to different columns.
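To make the difference between the two strategies concrete, here is a minimal pandas-only sketch that does not use NoventisOutlierHandler. The column name `value`, the sample data, and the 5th/95th percentile bounds are illustrative assumptions; the handler's internal implementation may differ.

```python
import pandas as pd

# Illustrative data: eight ordinary values plus two extremes.
df = pd.DataFrame(
    {"value": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, -100.0, 500.0]}
)

# Quantile-based bounds (5th and 95th percentiles).
lower, upper = df["value"].quantile([0.05, 0.95])

# Trimming: drop rows whose value falls outside the bounds.
trimmed = df[df["value"].between(lower, upper)]

# Winsorizing: keep every row but cap values at the bounds.
winsorized = df.assign(value=df["value"].clip(lower=lower, upper=upper))

print(f"original rows:   {len(df)}")
print(f"trimmed rows:    {len(trimmed)}")
print(f"winsorized rows: {len(winsorized)}")
```

Trimming shrinks the DataFrame, while winsorizing preserves the row count and only changes the extreme values. That trade-off is the main thing to weigh when choosing between the two.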
Import
from noventis.data_cleaner import NoventisOutlierHandler
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
feature_method_map | Optional[Dict[str, str]] | None | A dictionary to specify a unique outlier handling method for each column. Any column not in this map will use the default_method. Example: { "Salary": "winsorize", "Age": "iqr_trim" } |
default_method | str | 'auto' | The default method applied to all numeric columns not specified in feature_method_map. One of 'auto', 'quantile_trim', 'iqr_trim', 'winsorize', or 'none'. |
iqr_multiplier | float | 1.5 | The multiplier for the Interquartile Range (IQR) used to determine the outlier boundaries for the 'iqr_trim' method. The boundaries are calculated as Q1 - multiplier·IQR and Q3 + multiplier·IQR. |
quantile_range | Tuple[float, float] | (0.05, 0.95) | A tuple specifying the lower and upper quantile boundaries. Used by the 'quantile_trim' method for trimming and by the 'winsorize' method for capping. |
min_data_threshold | int | 100 | Minimum number of data points below which the 'auto' mode will prefer 'iqr_trim'. |
skew_threshold | float | 0.5 | Absolute skewness threshold above which the 'auto' mode will prefer 'winsorize'. |
verbose | bool | False | If True, a summary of the outlier handling process will be printed after fitting. |

Available methods:
- auto: Intelligently selects a method based on data properties.
- quantile_trim: Removes rows where values fall outside the defined quantile_range.
- iqr_trim: Removes rows where values fall outside the IQR range defined by iqr_multiplier.
- winsorize: Caps values at boundaries defined by quantile_range instead of removing rows.
- none: Skips outlier handling for the column.
When 'auto' is selected, the handler chooses a method for each column based on the following logic:
- Small Dataset?: If a column has fewer data points than min_data_threshold, it uses iqr_trim (robust for small samples).
- Skewed Data?: If absolute skewness > skew_threshold, it uses winsorize (caps outliers without losing data).
- Otherwise: For larger, non-skewed datasets, it uses quantile_trim.
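The 'auto' rules can be sketched as a small decision function. This is an illustration of the documented logic, not the library's source: the helper name `choose_method` is hypothetical, and it assumes that the rules are checked in the order listed, that missing values are ignored when counting data points, and that skewness is measured with pandas' `Series.skew()`.

```python
import pandas as pd


def choose_method(series: pd.Series,
                  min_data_threshold: int = 100,
                  skew_threshold: float = 0.5) -> str:
    """Hypothetical helper mirroring the documented 'auto' rules."""
    values = series.dropna()  # assumption: missing values are not counted
    # Rule 1: small samples use IQR trimming.
    if len(values) < min_data_threshold:
        return "iqr_trim"
    # Rule 2: skewed columns are winsorized so no rows are lost.
    if abs(values.skew()) > skew_threshold:
        return "winsorize"
    # Rule 3: larger, roughly symmetric columns use quantile trimming.
    return "quantile_trim"


small = pd.Series(range(50), dtype=float)                    # fewer than 100 points
symmetric = pd.Series(range(200), dtype=float)               # evenly spread, no skew
skewed = pd.Series([x ** 3 for x in range(200)], dtype=float)  # long right tail

for name, col in [("small", small), ("symmetric", symmetric), ("skewed", skewed)]:
    print(name, "->", choose_method(col))
```

Because the size check comes first in this sketch, a small but heavily skewed column would still be routed to iqr_trim.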
Methods
fit(X)
Analyzes the data and learns the outlier handling strategy from the input DataFrame X.
transform(X)
Returns pd.DataFrame. Applies the learned outlier handling to the DataFrame X and returns the transformed data.
fit_transform(X)
Returns pd.DataFrame. A convenience method that performs the fit and transform operations in a single step.
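The fit/transform split means boundaries are learned once and can then be applied to other data. The class below is a hypothetical stand-in that illustrates the pattern with the IQR rule (Q1 - multiplier·IQR and Q3 + multiplier·IQR). The name `IqrTrimSketch`, the `bounds_` attribute, and the choice to return `self` from `fit` are assumptions made for this sketch, not the library's actual code.

```python
import pandas as pd


class IqrTrimSketch:
    """Hypothetical stand-in illustrating fit/transform; not the library's code."""

    def __init__(self, iqr_multiplier: float = 1.5):
        self.iqr_multiplier = iqr_multiplier
        self.bounds_ = {}

    def fit(self, X: pd.DataFrame) -> "IqrTrimSketch":
        # Learn Q1 - k*IQR and Q3 + k*IQR for every numeric column.
        for col in X.select_dtypes(include="number").columns:
            q1, q3 = X[col].quantile([0.25, 0.75])
            iqr = q3 - q1
            self.bounds_[col] = (q1 - self.iqr_multiplier * iqr,
                                 q3 + self.iqr_multiplier * iqr)
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Keep only rows inside the learned bounds in every fitted column.
        # Note: NaN counts as outside, so this sketch also drops missing values.
        mask = pd.Series(True, index=X.index)
        for col, (low, high) in self.bounds_.items():
            mask &= X[col].between(low, high)
        return X[mask]

    def fit_transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return self.fit(X).transform(X)


train = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]})
new = pd.DataFrame({"x": [4.0, 50.0]})

trimmer = IqrTrimSketch().fit(train)
print(trimmer.bounds_)
print(trimmer.transform(train))
print(trimmer.transform(new))  # bounds come from `train`, not from `new`
```

Applying boundaries learned on training data to new data keeps the treatment consistent across splits, which is the usual reason to call fit and transform separately rather than fit_transform.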
Model Usage Examples
import pandas as pd
import numpy as np
# Create a base normal distribution
base_data = np.random.normal(loc=100, scale=20, size=500)
# Add some extreme outliers
outliers = np.array([5, 10, 250, 300, 320])
# Both columns must have the same length: 500 + 5 = 502 + 3 = 505 rows
df = pd.DataFrame({
    'Feature_A': np.concatenate([base_data, outliers]),
    'Feature_B': np.concatenate([np.random.normal(50, 10, 502), np.array([-50, 150, 160])])
})
Example 1: Automatic Handling
This is the simplest approach. The handler selects a method for each feature based on its sample size and skewness.
# Initialize the handler with default 'auto' mode
handler = NoventisOutlierHandler(verbose=True)
# Fit and transform the data
df_cleaned = handler.fit_transform(df)
print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {df_cleaned.shape}")
Example 2: Global Method (Winsorizing)
This example applies a single strategy to all columns. We will use 'winsorize' to cap extreme values at the boundaries defined by the 1st and 99th percentiles instead of removing them.
# Initialize with a global method and a specific quantile range
handler_winsorize = NoventisOutlierHandler(
    default_method='winsorize',
    quantile_range=(0.01, 0.99),
    verbose=True
)
# Fit and transform
df_winsorized = handler_winsorize.fit_transform(df)
print(f"Original shape: {df.shape}")
print(f"Winsorized shape: {df_winsorized.shape}")
print("\nMin/Max values before:\n", df.agg(['min', 'max']))
print("\nMin/Max values after:\n", df_winsorized.agg(['min', 'max']))
Example 3: Per-Column Custom Strategy
This example demonstrates how to apply different outlier handling rules for each feature, providing fine-grained control.
# Define a dictionary with a specific method for each feature
method_map = {
    'Feature_A': 'iqr_trim',  # Use robust IQR trimming for Feature_A
    'Feature_B': 'winsorize'  # Cap extreme values for Feature_B
}
# Initialize the handler with the custom map
handler_custom = NoventisOutlierHandler(feature_method_map=method_map, verbose=True)
# Fit and transform
df_custom = handler_custom.fit_transform(df)
print(f"Original shape: {df.shape}")
print(f"Custom handled shape: {df_custom.shape}")