NoventisScaler
Feature scaling is a crucial preprocessing step that puts all numerical features on a comparable scale. This can substantially improve models that are sensitive to feature magnitude, such as distance-based and gradient-based learners. However, choosing the right scaler (StandardScaler for normal data, RobustScaler for data with outliers, or PowerTransformer for skewed data) is often a tedious manual process.
NoventisScaler automates this choice. It analyzes each numerical column in your dataset and applies the scaling strategy best suited to that column's statistical properties.
from noventis.data_cleaner import NoventisScaler
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| method | {'auto', 'standard', 'minmax', 'robust', 'power'} | 'auto' | The scaling algorithm to be used. |
| optimize | bool | True | If True, the scaler's internal parameters will be fine-tuned. |
| custom_params | Optional[dict] | None | Allows you to override the default or optimized parameters for specific scaling methods. |
| skew_threshold | float | 2.0 | Absolute skewness above which a column is considered "highly skewed". |
| outlier_threshold | float | 0.01 | The proportion of data points that must be outliers for a column to be categorized as "having outliers". |
| normality_alpha | float | 0.05 | The significance level (alpha) used in the statistical test for normality. |
| verbose | bool | False | If True, a summary of the scaling process will be printed after fitting. |
Supported values for method:
- 'auto' (default): Automatically selects the best scaling strategy for each column based on its statistical properties (skewness, outliers, etc.).
- 'standard': Uses StandardScaler. Best for data that is already normally distributed (or close to it). Scales data to have a mean of 0 and a standard deviation of 1.
- 'minmax': Uses MinMaxScaler. Scales data to a fixed range, typically [0, 1]. Useful for algorithms that require feature values in a specific range, like neural networks.
- 'robust': Uses RobustScaler. Suited to datasets with significant outliers, as it scales data based on the median and interquartile range (IQR).
- 'power': Uses PowerTransformer. Applies a power transform that makes skewed data more Gaussian (normal-like).
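To make the arithmetic behind 'standard', 'minmax', and 'robust' concrete, here is a minimal standard-library sketch of the three formulas. It illustrates the math only and is not NoventisScaler's implementation: the helper names are hypothetical, and the quartile method (linear interpolation between data points) is an assumption.

```python
import statistics


def standard_scale(values):
    """(x - mean) / std, using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]


def minmax_scale(values, low=0.0, high=1.0):
    """Map the column linearly so that min -> low and max -> high."""
    lo, hi = min(values), max(values)
    return [low + (x - lo) * (high - low) / (hi - lo) for x in values]


def robust_scale(values):
    """(x - median) / IQR, so extreme values barely affect the scale."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in values]


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # four typical points and one extreme value

print("standard:", standard_scale(data))
print("minmax:  ", minmax_scale(data))
print("robust:  ", robust_scale(data))
```

With the extreme value present, min-max scaling squeezes the four typical points into roughly the bottom 3% of the [0, 1] range, while robust scaling keeps them spread between -1 and 0.5.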
How does method='auto' work?
When method is set to 'auto', NoventisScaler evaluates each column using a prioritized decision hierarchy to select the most appropriate scaler:
- Forced for KNN?: If is_for_knn=True is passed to the .fit() method, MinMaxScaler is used.
- Highly Skewed?: If the column's absolute skewness is greater than skew_threshold (default: 2.0), PowerTransformer is used to make the data more Gaussian-like.
- Significant Outliers?: If the ratio of outliers exceeds outlier_threshold (default: 0.01), the outlier-resistant RobustScaler is chosen.
- Normally Distributed?: If the data passes a normality test (using normality_alpha as the significance level), StandardScaler is applied.
- Default Fallback: If none of the conditions above is met, StandardScaler is applied.
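The priority order above can be sketched as a single function that checks each condition in turn and stops at the first match. This is a standard-library illustration under stated assumptions, not the library's code: the 1.5 × IQR outlier rule, the Jarque-Bera normality test, and every function name here are hypothetical choices, since the text does not say which outlier rule or normality test NoventisScaler uses.

```python
import math
import statistics


def _central_moment(values, mean, k):
    return sum((x - mean) ** k for x in values) / len(values)


def skew_and_excess_kurtosis(values):
    mean = statistics.fmean(values)
    m2 = _central_moment(values, mean, 2)
    if m2 == 0:  # constant column: treat as perfectly symmetric
        return 0.0, 0.0
    skew = _central_moment(values, mean, 3) / m2 ** 1.5
    kurt = _central_moment(values, mean, 4) / m2 ** 2 - 3.0
    return skew, kurt


def outlier_ratio(values):
    """Share of points outside the 1.5 * IQR fences (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for x in values if x < low or x > high) / len(values)


def normality_pvalue(values):
    """Jarque-Bera p-value (assumed test); chi-square with 2 degrees of freedom."""
    skew, kurt = skew_and_excess_kurtosis(values)
    jb = len(values) / 6.0 * (skew ** 2 + kurt ** 2 / 4.0)
    return math.exp(-jb / 2.0)


def choose_method(values, is_for_knn=False, skew_threshold=2.0,
                  outlier_threshold=0.01, normality_alpha=0.05):
    """Return (method, reason), checking the conditions in priority order."""
    if is_for_knn:
        return "minmax", "forced for KNN"
    skew, _ = skew_and_excess_kurtosis(values)
    if abs(skew) > skew_threshold:
        return "power", "highly skewed"
    if outlier_ratio(values) > outlier_threshold:
        return "robust", "significant outliers"
    if normality_pvalue(values) > normality_alpha:
        return "standard", "passed normality test"
    return "standard", "default fallback"


bimodal = [0.0] * 50 + [10.0] * 50
print(choose_method(bimodal))
print(choose_method(bimodal, is_for_knn=True))
```

Because the checks run in a fixed order, a column that is both highly skewed and full of outliers is assigned the power transform, and is_for_knn=True overrides every statistical check.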
Methods
fit(X, is_for_knn=False) → Returns: self
Analyze data and fit scalers for each column.
Parameters:
- X (pd.DataFrame): Input dataframe.
- is_for_knn (bool): Force MinMax scaling for KNN algorithms.
transform(X) → Returns: pd.DataFrame (scaled data)
Apply fitted scalers to transform data.
Parameters:
- X (pd.DataFrame): Input dataframe.
fit_transform(X, is_for_knn=False) → Returns: pd.DataFrame
Fit and transform in one step.
Parameters:
- X (pd.DataFrame): Input dataframe.
- is_for_knn (bool): Force MinMax scaling for KNN algorithms.
inverse_transform(X) → Returns: pd.DataFrame
Reverse transformation to the original scale.
Parameters:
- X (pd.DataFrame): Scaled dataframe.
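The fit / transform / inverse_transform contract described above can be illustrated with a toy per-column scaler. This sketch assumes plain dictionaries of lists instead of DataFrames and only implements standard scaling; the class name ColumnwiseStandardScaler is hypothetical and is not part of the library.

```python
import math
import statistics


class ColumnwiseStandardScaler:
    """Toy scaler that learns one (mean, std) pair per column."""

    def fit(self, X):
        # Store the statistics learned from the training data.
        # A zero std (constant column) is replaced by 1.0 to avoid division by zero.
        self.params_ = {
            col: (statistics.fmean(vals), statistics.pstdev(vals) or 1.0)
            for col, vals in X.items()
        }
        return self  # returning self allows chaining: scaler.fit(X).transform(X)

    def transform(self, X):
        return {
            col: [(x - self.params_[col][0]) / self.params_[col][1] for x in vals]
            for col, vals in X.items()
        }

    def fit_transform(self, X):
        return self.fit(X).transform(X)

    def inverse_transform(self, X):
        # Undo transform: multiply by std, then add the mean back.
        return {
            col: [x * self.params_[col][1] + self.params_[col][0] for x in vals]
            for col, vals in X.items()
        }


data = {"height": [150.0, 160.0, 170.0, 180.0], "weight": [50.0, 60.0, 80.0, 90.0]}
scaler = ColumnwiseStandardScaler()
scaled = scaler.fit_transform(data)
restored = scaler.inverse_transform(scaled)

# The round trip recovers the original values up to floating-point error.
print(all(math.isclose(a, b) for a, b in zip(restored["height"], data["height"])))
```

The key design point is that fit stores everything needed to reverse the transformation, so inverse_transform works on any data scaled by the same fitted instance.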
Model Usage Examples
Example 1: Automatic Scaling
This is the most powerful feature of NoventisScaler. We'll let it analyze each column of our diverse dataset and apply the most appropriate scaling strategy. Using verbose=True will show us the decisions it made.
import pandas as pd
import numpy as np
from noventis.data_cleaner import NoventisScaler
# Create a diverse sample dataset
df = pd.DataFrame({
    'normal_data': np.random.normal(loc=100, scale=15, size=500),
    'skewed_data': np.random.gamma(shape=1, scale=50, size=500)**2,
    'data_with_outliers': np.concatenate([
        np.random.normal(loc=0, scale=5, size=496),
        np.array([-50, 50, -60, 60]),
    ]),
    'bimodal_data': np.concatenate([
        np.random.normal(loc=20, scale=5, size=250),
        np.random.normal(loc=80, scale=7, size=250),
    ]),
})
# Initialize in 'auto' mode to let the scaler decide
scaler = NoventisScaler(method='auto', verbose=True)
# Fit and transform the data
df_scaled = scaler.fit_transform(df)
# Check which method was chosen for each column
print("\nScaler chosen for each column:")
print(scaler.fitted_methods_)
Example 2: Force Specific Method
Sometimes, you might want to apply a single scaling strategy to all columns, overriding the automatic selection. Here, we'll force every column to use RobustScaler.
# Initialize with the 'robust' method
scaler_robust = NoventisScaler(method='robust')
# Fit and transform the data
df_robust_scaled = scaler_robust.fit_transform(df)
print("\nDescription of data after forcing RobustScaler on all columns:")
print(df_robust_scaled.describe())
Example 3: Advanced Usage with custom_params
You can get even more granular control by passing custom parameters to the underlying scalers. Here, we will use PowerTransformer on all columns but disable its default behavior of standardizing the output (setting mean=0, std=1).
# Define a custom parameter to override the default
# We want the PowerTransformer to transform the data but not standardize it
custom_config = {'power': {'standardize': False}}
# Initialize with the 'power' method and our custom parameters
scaler_custom = NoventisScaler(method='power', custom_params=custom_config)
# Fit and transform
df_custom_scaled = scaler_custom.fit_transform(df)
print("\nDescription of data after custom PowerTransformer:")
print(df_custom_scaled.describe())
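One way to picture how a custom_params dictionary such as {'power': {'standardize': False}} could take effect is a per-method merge of overrides on top of defaults. The sketch below is an assumption about the general pattern, not NoventisScaler's internals: DEFAULT_PARAMS and resolve_params are hypothetical names, and the default values shown are the scikit-learn defaults for the corresponding scalers.

```python
# Hypothetical per-method defaults (these mirror scikit-learn's own defaults).
DEFAULT_PARAMS = {
    "standard": {"with_mean": True, "with_std": True},
    "minmax": {"feature_range": (0, 1)},
    "robust": {"quantile_range": (25.0, 75.0)},
    "power": {"method": "yeo-johnson", "standardize": True},
}


def resolve_params(method, custom_params=None):
    """Return the defaults for `method`, with any user overrides applied on top."""
    params = dict(DEFAULT_PARAMS.get(method, {}))  # copy so defaults stay untouched
    overrides = (custom_params or {}).get(method, {})
    params.update(overrides)
    return params


custom_config = {"power": {"standardize": False}}

print(resolve_params("power", custom_config))
print(resolve_params("robust", custom_config))
```

Keying the overrides by method name means a single custom_params dictionary can carry settings for several scalers at once, and entries for methods that are not selected are simply ignored.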