NoventisOutlierHandler
Outliers, or extreme values, can significantly skew statistical analyses and degrade the performance of machine learning models. Handling them correctly is a crucial step in data preprocessing. The NoventisOutlierHandler provides a systematic and flexible framework for identifying and managing outliers in your dataset.
This tool allows you to choose between two primary strategies: removing outlier rows entirely (trimming) or capping their values to a reasonable range (winsorizing). It features an intelligent 'auto' mode to select an appropriate strategy based on your data's characteristics, but also offers fine-grained control to apply specific methods to different columns.
Import
from noventis.data_cleaner import NoventisOutlierHandler
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| feature_method_map | Optional[Dict[str, str]] | None | A dictionary to specify a unique outlier handling method for each column. Any column not in this map will use the default_method. Example: { 'Salary': 'winsorize', 'Age': 'iqr_trim' } |
| default_method | str | 'auto' | The default method applied to all numeric columns not specified in feature_method_map. One of 'auto', 'quantile_trim', 'iqr_trim', 'winsorize', or 'none'. With 'auto', the handler chooses a method for each column from its sample size and skewness (see min_data_threshold and skew_threshold). |
| iqr_multiplier | float | 1.5 | The multiplier applied to the interquartile range (IQR) to set the outlier boundaries for the 'iqr_trim' method. The boundaries are calculated as Q1 - multiplier·IQR and Q3 + multiplier·IQR. |
| quantile_range | Tuple[float, float] | (0.05, 0.95) | A tuple specifying the lower and upper quantile boundaries. It is used by the 'quantile_trim' method for trimming and by the 'winsorize' method for capping. |
| min_data_threshold | int | 100 | Minimum number of data points below which the 'auto' mode will prefer 'iqr_trim'. |
| skew_threshold | float | 0.5 | Absolute skewness threshold above which the 'auto' mode will prefer 'winsorize'. |
| verbose | bool | False | If True, a summary of the outlier handling process is printed after fitting. |
Supported values for default_method:
- 'auto': Intelligently selects a method based on data properties.
- 'quantile_trim': Removes rows where values fall outside the defined quantile_range.
- 'iqr_trim': Removes rows where values fall outside the IQR range defined by iqr_multiplier.
- 'winsorize': Caps values at boundaries defined by quantile_range instead of removing rows.
- 'none': Skips outlier handling for the column.
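To make the difference between trimming and capping concrete, here is a minimal pandas sketch of the three techniques named above. The helper names (iqr_bounds, iqr_trim, quantile_trim, winsorize) are illustrative and are not the library's internals. How NoventisOutlierHandler treats missing values or boundary values is not stated here, so this sketch assumes inclusive bounds and drops NaN rows when trimming.

```python
from typing import Tuple

import pandas as pd


def iqr_bounds(s: pd.Series, multiplier: float = 1.5) -> Tuple[float, float]:
    """Return (Q1 - multiplier*IQR, Q3 + multiplier*IQR) for a numeric series."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


def iqr_trim(df: pd.DataFrame, column: str, multiplier: float = 1.5) -> pd.DataFrame:
    """Drop rows whose value in `column` lies outside the IQR bounds."""
    lower, upper = iqr_bounds(df[column], multiplier)
    # between() is inclusive and returns False for NaN (assumption of this sketch)
    return df[df[column].between(lower, upper)]


def quantile_trim(df: pd.DataFrame, column: str,
                  quantile_range: Tuple[float, float] = (0.05, 0.95)) -> pd.DataFrame:
    """Drop rows whose value in `column` lies outside the given quantiles."""
    lower = df[column].quantile(quantile_range[0])
    upper = df[column].quantile(quantile_range[1])
    return df[df[column].between(lower, upper)]


def winsorize(df: pd.DataFrame, column: str,
              quantile_range: Tuple[float, float] = (0.05, 0.95)) -> pd.DataFrame:
    """Cap values in `column` at the given quantiles; no rows are removed."""
    lower = df[column].quantile(quantile_range[0])
    upper = df[column].quantile(quantile_range[1])
    out = df.copy()
    out[column] = out[column].clip(lower=lower, upper=upper)
    return out


# Nine ordinary values and one extreme value
demo = pd.DataFrame({"value": [10, 11, 12, 13, 14, 15, 16, 17, 18, 500]})
trimmed = iqr_trim(demo, "value")        # removes the row containing 500
q_trimmed = quantile_trim(demo, "value")  # removes the smallest and largest rows
capped = winsorize(demo, "value")        # keeps all rows, lowers the maximum
```

Trimming changes the row count, while winsorizing keeps every row and only alters the extreme values, which is the trade-off to weigh when picking a method per column.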
When 'auto' is selected, the handler chooses a method for each column based on the following logic:
- Small dataset: If a column has fewer data points than min_data_threshold, it uses iqr_trim (robust for small samples).
- Skewed data: If absolute skewness > skew_threshold, it uses winsorize (caps outliers without losing data).
- Otherwise: For larger, non-skewed datasets, it uses quantile_trim.
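The decision rule above can be sketched as a small function. This is an illustration of the described logic, not the library's code: the function name choose_auto_method is hypothetical, and it assumes that missing values are excluded from the count and that skewness is the sample skewness returned by pandas.

```python
import numpy as np
import pandas as pd


def choose_auto_method(s: pd.Series,
                       min_data_threshold: int = 100,
                       skew_threshold: float = 0.5) -> str:
    """Pick a method name following the three-step 'auto' rule (sketch)."""
    values = s.dropna()  # assumption: NaN values do not count as data points
    if len(values) < min_data_threshold:
        return "iqr_trim"        # small sample
    if abs(values.skew()) > skew_threshold:
        return "winsorize"       # skewed distribution
    return "quantile_trim"       # larger, roughly symmetric data


# Deterministic example columns
small = pd.Series(np.linspace(-1.0, 1.0, 50))       # fewer than 100 points
skewed = pd.Series(np.linspace(0.0, 1.0, 1000) ** 3)  # right-skewed
symmetric = pd.Series(np.linspace(-1.0, 1.0, 1000))   # skewness near zero

choices = {
    "small": choose_auto_method(small),
    "skewed": choose_auto_method(skewed),
    "symmetric": choose_auto_method(symmetric),
}
print(choices)
```

Note that the size check comes first, so a small column is routed to iqr_trim even when it is also skewed.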
Methods
fit(X)
Analyzes the data and learns the outlier handling configuration from the input DataFrame X.
transform(X)
Applies the learned strategy to X and returns the transformed pd.DataFrame.
fit_transform(X)
Convenience method that performs fit and transform in one step.
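The value of keeping fit and transform separate is that settings learned on one DataFrame can be applied unchanged to another. The toy class below illustrates that pattern for quantile capping only. QuantileCapper is a hypothetical stand-in written for this example, not the NoventisOutlierHandler implementation, and whether the real handler stores boundaries in exactly this way is an assumption.

```python
from typing import Dict, Tuple

import pandas as pd


class QuantileCapper:
    """Hypothetical fit/transform sketch: learn quantile bounds, then cap."""

    def __init__(self, quantile_range: Tuple[float, float] = (0.05, 0.95)):
        self.quantile_range = quantile_range
        self.bounds_: Dict[str, Tuple[float, float]] = {}

    def fit(self, X: pd.DataFrame) -> "QuantileCapper":
        low_q, high_q = self.quantile_range
        # Learn one (lower, upper) pair per numeric column
        for col in X.select_dtypes(include="number").columns:
            self.bounds_[col] = (X[col].quantile(low_q), X[col].quantile(high_q))
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        out = X.copy()
        # Reuse the bounds learned in fit; nothing is recomputed from X
        for col, (lower, upper) in self.bounds_.items():
            out[col] = out[col].clip(lower=lower, upper=upper)
        return out

    def fit_transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return self.fit(X).transform(X)


train = pd.DataFrame({"x": [float(v) for v in range(1, 101)]})
new = pd.DataFrame({"x": [-1000.0, 50.0, 1000.0]})

capper = QuantileCapper().fit(train)
capped_new = capper.transform(new)  # extremes are capped at the training bounds
```

Fitting on training data and transforming held-out data with the same object keeps the boundaries consistent across both sets.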
Model Usage Examples
import pandas as pd
import numpy as np
# Create a base normal distribution
base_data = np.random.normal(loc=100, scale=20, size=500)
# Add some extreme outliers
outliers = np.array([5, 10, 250, 300, 320])
# Both columns must have the same length (505 rows each)
df = pd.DataFrame({
    'Feature_A': np.concatenate([base_data, outliers]),
    'Feature_B': np.concatenate([np.random.normal(50, 10, 502), np.array([-50, 150, 160])])
})
Example 1: Automatic Handling
This is the simplest approach. The handler selects a method for each feature based on its statistical properties (sample size and skewness).
# Initialize the handler with default 'auto' mode
handler = NoventisOutlierHandler(verbose=True)
# Fit and transform the data
df_cleaned = handler.fit_transform(df)
print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {df_cleaned.shape}")
Example 2: Global Method (Winsorizing)
This example applies a single strategy to all columns. We will use 'winsorize' to cap extreme values at the boundaries defined by the 1st and 99th percentiles instead of removing them, which is useful when you want to preserve all your data rows.
# Initialize with a global method and a specific quantile range
handler_winsorize = NoventisOutlierHandler(
    default_method='winsorize',
    quantile_range=(0.01, 0.99),
    verbose=True
)
# Fit and transform
df_winsorized = handler_winsorize.fit_transform(df)
print(f"Original shape: {df.shape}")
print(f"Winsorized shape: {df_winsorized.shape}")
print("\nMin/Max values before:\n", df.agg(['min', 'max']))
print("\nMin/Max values after:\n", df_winsorized.agg(['min', 'max']))
Example 3: Per-Column Custom Strategy
This shows how to apply different rules to different columns, providing maximum control over the process.
# Define a dictionary with a specific method for each feature
method_map = {
    'Feature_A': 'iqr_trim',  # Use robust IQR trimming for Feature_A
    'Feature_B': 'winsorize'  # Cap extreme values for Feature_B
}
# Initialize the handler with the custom map
handler_custom = NoventisOutlierHandler(
    feature_method_map=method_map,
    verbose=True
)
# Fit and transform
df_custom = handler_custom.fit_transform(df)
print(f"Original shape: {df.shape}")
print(f"Custom handled shape: {df_custom.shape}")