DATA_CLEANER
Scaling
The Scaling module scales the numerical features in your dataset. It handles common data issues such as skewness and outliers, which can improve the performance of many machine learning models.
PYTHON
from noventis.data_cleaner import NoventisScaler
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
method | {'auto', 'standard', 'minmax', 'robust', 'power'} | "auto" | The scaling algorithm to be used. |
optimize | bool | True | If True, the scaler's internal parameters will be fine-tuned. |
custom_params | Optional[dict] | None | Allows you to override the default or optimized parameters for specific scaling methods. |
skew_threshold | float | 2.0 | Threshold of absolute skewness to consider a column as "highly skewed". |
outlier_threshold | float | 0.01 | Proportion of data points that must be outliers for a column to be categorized as "having outliers." |
normality_alpha | float | 0.05 | The significance level (alpha) used in the statistical test for normality. |
verbose | bool | True | If True, a summary of the scaling process will be printed after fitting. |
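The skew_threshold and outlier_threshold parameters describe per-column statistics, so a small sketch of how such statistics could be computed may help when tuning them. This is only an illustration under stated assumptions: the documentation does not say how NoventisScaler defines an outlier or how it combines the checks, so the 1.5 * IQR rule and the `suggest_method` helper below are hypothetical. The normality test behind normality_alpha is left out because the page does not name the test used.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    centered = x - x.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

def outlier_proportion(x):
    """Share of points outside the 1.5 * IQR fences (an assumed outlier rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    outside = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
    return outside.mean()

def suggest_method(x, skew_threshold=2.0, outlier_threshold=0.01):
    """Hypothetical decision rule; not taken from the NoventisScaler source."""
    if abs(skewness(x)) > skew_threshold:
        return "power"
    if outlier_proportion(x) > outlier_threshold:
        return "robust"
    return "standard"

# Deterministic toy columns (illustrative data only)
core = np.linspace(0.0, 1.0, 950)
symmetric = np.linspace(-1.0, 1.0, 1001)
two_sided = np.concatenate([core, np.full(25, -9.5), np.full(25, 10.5)])
one_sided = np.concatenate([core, np.full(50, 10.0)])

for name, col in [("symmetric", symmetric),
                  ("two_sided", two_sided),
                  ("one_sided", one_sided)]:
    print(name, round(skewness(col), 2),
          round(outlier_proportion(col), 3), suggest_method(col))
```

Raising skew_threshold in this sketch makes fewer columns count as highly skewed, and raising outlier_threshold makes fewer columns count as having outliers. The real scaler may order or weight these checks differently.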
method
The scaling algorithm to be used. Accepted values:
- auto (default): Automatically selects a scaling strategy for each column based on its statistical properties (e.g., skewness, outliers).
- standard: Uses StandardScaler. Scales each column to mean 0 and standard deviation 1. Best when the data is already close to normal.
- minmax: Uses MinMaxScaler. Scales to a fixed range (typically [0, 1]). Good for models that expect bounded features (e.g., many neural nets).
- robust: Uses RobustScaler. Centers on the median and scales by the IQR, which makes it resistant to outliers. Suitable when outliers are present.
- power: Uses PowerTransformer. Transforms skewed data to be closer to Gaussian, which helps models that assume normality.
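To make the differences between the standard, minmax, and robust options concrete, here is a minimal NumPy sketch of the textbook formulas behind them. It is not the NoventisScaler implementation; the helper names are made up for illustration, and the power option is left out because it requires estimating a transformation parameter from the data.

```python
import numpy as np

def standard_scale(x):
    """Center to mean 0 and scale to standard deviation 1."""
    return (x - x.mean()) / x.std()

def minmax_scale(x):
    """Map the observed minimum to 0 and the observed maximum to 1."""
    return (x - x.min()) / (x.max() - x.min())

def robust_scale(x):
    """Center on the median and scale by the interquartile range (IQR)."""
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

rng = np.random.default_rng(0)
# 950 typical values plus 50 extreme ones, similar to 'with_outliers' in the usage examples
x = np.concatenate([rng.normal(50, 10, 950), rng.normal(200, 10, 50)])

for name, fn in [("standard", standard_scale),
                 ("minmax", minmax_scale),
                 ("robust", robust_scale)]:
    z = fn(x)
    print(f"{name:>8}: min={z.min():.2f} median={np.median(z):.2f} max={z.max():.2f}")
```

With outliers present, the mean, standard deviation, minimum, and maximum are all pulled by the extreme values, while the median and IQR depend mostly on the bulk of the data. That is why the robust option is the one suggested for columns with outliers.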
Model Usage Examples
01
Automatic Scaling
PYTHON
import numpy as np
import pandas as pd
from noventis.data_cleaner import NoventisScaler

# Sample data with different distributions
df = pd.DataFrame({
    'normal_data': np.random.normal(100, 15, 1000),
    'skewed_data': np.random.exponential(2, 1000),
    'with_outliers': np.concatenate([np.random.normal(50, 10, 950),
                                     np.random.normal(200, 10, 50)])
})

# Initialize the scaler, then fit and transform in one step
scaler = NoventisScaler(method='auto')
df_scaled = scaler.fit_transform(df)

# Inspect the method chosen for each column
print(scaler.fitted_methods_)
02
Force Specific Method
PYTHON
import numpy as np
import pandas as pd
from noventis.data_cleaner import NoventisScaler

# Same sample data as in the automatic example
df = pd.DataFrame({
    'normal_data': np.random.normal(100, 15, 1000),
    'skewed_data': np.random.exponential(2, 1000),
    'with_outliers': np.concatenate([np.random.normal(50, 10, 950),
                                     np.random.normal(200, 10, 50)])
})

# Apply the robust method to every column instead of auto-selecting
scaler = NoventisScaler(method='robust')
df_scaled = scaler.fit_transform(df)