NoventisEncoder
Encoding categorical features is a critical and often complex step in preparing data for machine learning. The choice of encoding strategy (for example One-Hot, Target, or Label encoding) can dramatically affect model performance. A poor choice can bloat the dataset with extra columns (the curse of dimensionality) or mislead the model by implying an ordinal relationship that does not exist.
The NoventisEncoder is designed to address this problem. It provides a comprehensive suite of encoding methods together with an 'auto' mode. This mode analyzes each categorical column's characteristics (such as its number of unique values, its relationship with the target variable, and its potential memory impact) to recommend and apply an encoding strategy automatically.
Import
from noventis.data_cleaner import NoventisEncoder
Parameters
Parameter | Type | Default | Description |
---|---|---|---|
method | str | 'auto' | The encoding method to apply. One of 'auto', 'label', 'ohe', 'target', 'ordinal', 'binary', or 'hashing'. |
target_column | Optional[str] | None | The name of the target variable (label) column. This is required when method is set to 'auto' or 'target'. |
columns_to_encode | Optional[List[str]] | None | A list of specific column names to encode. If None, all categorical columns in the DataFrame will be processed. |
category_mapping | Optional[Dict[str, Dict]] | None | A dictionary defining the explicit order for ordinal features (e.g., { "Size": { "order": ["Small", "Medium", "Large"] } }). This is required when method='ordinal'. |
cv | Union[float, str] | 'auto' | The smoothing parameter for TargetEncoder, which helps regularize the encoding for categories with few samples. |
target_type | str | 'auto' | The type of target variable ('binary' or 'continuous'). Used by TargetEncoder. If 'auto', the type is inferred from target_column. |
verbose | bool | False | If True, prints a detailed analysis and summary of the encoding process. |
Supported values for method:
- auto: (Recommended) Automatically selects the best encoding method for each column based on its statistical properties. Requires target_column to be set.
- label: Converts categories into integers (0, 1, 2, …). Best for binary features or ordinal features where the default integer assignment is acceptable.
- ohe (One-Hot Encoding): Creates a new binary (0/1) column for each category. Best for low-cardinality nominal features (e.g., ≤ 15 categories).
- target: Replaces each category with the mean of the target variable for that category. Powerful for features with a strong relationship to the target. It is prone to overfitting, which cross-validation and smoothing mitigate.
- ordinal: Converts categories to integers based on a user-defined order. Requires category_mapping. Best for features with a clear inherent order (e.g., “Low”, “Medium”, “High”).
- binary: Converts categories into binary code and creates a column for each bit. A memory-efficient alternative to OHE for medium-cardinality features (e.g., 15–50 categories).
- hashing: Uses a hashing function to convert categories into a fixed number of features. Memory-efficient for very high-cardinality features, but can result in collisions (different categories mapped to the same hash).
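To make the first few options concrete, the following is a minimal sketch of what label, one-hot, target, and ordinal encoding produce, written with plain pandas. It does not call NoventisEncoder and is not its implementation; the toy data and variable names are invented for illustration, and the target version here is the naive one, without smoothing or cross-validation.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "Size": ["Small", "Large", "Medium", "Small"],
    "Target": [0, 1, 1, 1],
})

# label: each category becomes an integer code
# (pandas assigns codes in sorted category order: Large, Medium, Small)
label_codes = df["Size"].astype("category").cat.codes

# ohe: one 0/1 column per category
one_hot = pd.get_dummies(df["Size"], prefix="Size", dtype=int)

# target: each category is replaced by the mean target of its rows
# (naive version: no smoothing, no cross-validation)
target_means = df.groupby("Size")["Target"].mean()
target_encoded = df["Size"].map(target_means)

# ordinal: integers follow a user-supplied order instead of the default one
order = {"Small": 0, "Medium": 1, "Large": 2}
ordinal_codes = df["Size"].map(order)

print(pd.DataFrame({
    "Size": df["Size"],
    "label": label_codes,
    "target": target_encoded,
    "ordinal": ordinal_codes,
}).join(one_hot))
```

Note how the label codes follow alphabetical order (Large before Small), which is exactly the kind of accidental ordering that the ordinal option with an explicit mapping avoids.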
How does method='auto' work?
The auto mode uses a rule-based system to choose an encoder for each column:
- Binary features (only 2 unique values): use label encoding.
- High cardinality (>50): use target if the feature is strongly correlated with the target; otherwise fall back to memory-efficient hashing.
- Medium cardinality (16–50): prefer target if correlated; otherwise use binary to balance performance and memory.
- Low cardinality (3–15): if correlation is very high and order is meaningful, use ordinal (requires mapping). Otherwise default to ohe.
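The rules above can be read as a small decision function. The sketch below is one possible reading, not the library's code: the function name choose_encoder and its arguments are hypothetical, and because the correlation thresholds are not stated here, "strongly correlated" and "very high correlation" are collapsed into a single boolean supplied by the caller.

```python
def choose_encoder(n_unique: int,
                   strongly_correlated: bool,
                   has_order_mapping: bool = False) -> str:
    """Return the encoder name the documented rules would pick for one column.

    Hypothetical helper for illustration; thresholds follow the rule list,
    the correlation test itself is left to the caller.
    """
    # Binary feature (assumption: constant columns are treated the same way)
    if n_unique <= 2:
        return "label"
    # High cardinality: more than 50 unique values
    if n_unique > 50:
        return "target" if strongly_correlated else "hashing"
    # Medium cardinality: 16 to 50 unique values
    if n_unique >= 16:
        return "target" if strongly_correlated else "binary"
    # Low cardinality: 3 to 15 unique values
    if strongly_correlated and has_order_mapping:
        return "ordinal"
    return "ohe"


for n, corr in [(2, False), (8, False), (30, False), (30, True), (500, False)]:
    print(n, corr, "->", choose_encoder(n, corr))
```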
Methods
fit(X, y)
Analyzes the dataset and fits the appropriate encoder for each categorical column. The target series y is required for the 'auto' and 'target' methods.
transform(X) → pd.DataFrame
Applies the learned encoding to the input DataFrame X and returns the transformed data as a new pd.DataFrame.
fit_transform(X, y) → pd.DataFrame
A convenient shortcut that performs both fit and transform in a single step, returning the encoded DataFrame.
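The point of separating fit from transform is that the mapping is learned once, typically on training data, and then reused unchanged on new data. The sketch below illustrates that contract with a deliberately tiny stand-in class; TinyLabelEncoder is hypothetical and is not part of noventis, and it only performs label encoding on the columns it is given.

```python
import pandas as pd


class TinyLabelEncoder:
    """Hypothetical stand-in (not NoventisEncoder) showing the fit/transform contract."""

    def __init__(self, columns):
        self.columns = list(columns)

    def fit(self, X: pd.DataFrame, y=None):
        # Learn one category -> integer mapping per column from the fitting data
        self.mappings_ = {
            col: {cat: i for i, cat in enumerate(sorted(X[col].unique()))}
            for col in self.columns
        }
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Work on a copy so the caller's DataFrame is left untouched
        out = X.copy()
        for col, mapping in self.mappings_.items():
            # Categories never seen during fit become NaN in this sketch
            out[col] = out[col].map(mapping)
        return out

    def fit_transform(self, X: pd.DataFrame, y=None) -> pd.DataFrame:
        return self.fit(X, y).transform(X)


train = pd.DataFrame({"Has_Pet": ["Yes", "No", "Yes"]})
new = pd.DataFrame({"Has_Pet": ["No", "Yes"]})

enc = TinyLabelEncoder(columns=["Has_Pet"]).fit(train)
encoded_new = enc.transform(new)  # reuses the mapping learned from train
print(encoded_new)
```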
Model Usage Examples
import pandas as pd
import numpy as np
data = {
    'Country': ['USA', 'UK', 'Canada', 'USA', 'Germany', 'UK', 'USA', 'France', 'Canada', 'Germany'],
    'Education': ['Bachelors', 'Masters', 'PhD', 'Bachelors', 'Masters', 'Bachelors', 'PhD', 'Masters', 'Masters', 'Bachelors'],
    'Size': ['Medium', 'Small', 'Large', 'Medium', 'Large', 'Small', 'Medium', 'Large', 'Medium', 'Small'],
    'Has_Pet': ['Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
    'Target': [1, 0, 0, 1, 0, 0, 1, 0, 1, 1]
}
df = pd.DataFrame(data)
y = df['Target']
X = df.drop('Target', axis=1)
Example 1: Automatic Encoding (Recommended)
In 'auto' mode the encoder analyzes each column and applies the strategy its rules select. Setting verbose=True is recommended so that you can see which decision was made for each column.
# Initialize in 'auto' mode, providing the target column name
encoder_auto = NoventisEncoder(method='auto', target_column='Target', verbose=True)
# Fit and transform the data
df_encoded_auto = encoder_auto.fit_transform(X, y)
print("\nTransformed DataFrame Head:")
print(df_encoded_auto.head())
Example 2: Manual Ordinal Encoding
This is used when a feature has a clear, inherent order. You must provide the mapping.
# Define the explicit order for the 'Size' column
size_mapping = {
'Size': {'Small': 1, 'Medium': 2, 'Large': 3}
}
# Initialize in 'ordinal' mode with the mapping
encoder_ordinal = NoventisEncoder(method='ordinal',
                                  columns_to_encode=['Size'],
                                  category_mapping=size_mapping)
df_encoded_ordinal = encoder_ordinal.fit_transform(X)
print(df_encoded_ordinal[['Size_ordinal_encoded']].head())
Example 3: Manual Target Encoding
This example applies Target Encoding to the 'Country' column, using cross-validation to ensure the encoding is robust.
# Initialize in 'target' mode for a specific column
encoder_target = NoventisEncoder(method='target',
                                 columns_to_encode=['Country'],
                                 target_column='Target',
                                 cv=3)  # Use 3 folds for this small dataset
df_encoded_target = encoder_target.fit_transform(X, y)
print(df_encoded_target[['Country_target_encoded']].head())
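For intuition about why cross-validation makes target encoding more robust, here is a minimal sketch of out-of-fold target encoding in plain pandas and NumPy. It is an illustration of the general technique, not NoventisEncoder's implementation: the function out_of_fold_target_encode, the round-robin fold assignment, and the fallback to the training-fold mean for unseen categories are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def out_of_fold_target_encode(categories: pd.Series,
                              target: pd.Series,
                              n_folds: int = 3) -> pd.Series:
    """Encode each row using category means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    # Assign rows to folds round-robin by position (no shuffling, for reproducibility)
    fold_ids = np.arange(len(categories)) % n_folds
    for fold in range(n_folds):
        held_out = fold_ids == fold
        train_target = target[~held_out]
        # Category means from rows outside the held-out fold
        means = train_target.groupby(categories[~held_out]).mean()
        # Categories absent from the training folds fall back to the training mean
        fallback = train_target.mean()
        encoded[held_out] = categories[held_out].map(means).fillna(fallback)
    return encoded


countries = pd.Series(['USA', 'UK', 'Canada', 'USA', 'Germany',
                       'UK', 'USA', 'France', 'Canada', 'Germany'])
labels = pd.Series([1, 0, 0, 1, 0, 0, 1, 0, 1, 1])

encoded = out_of_fold_target_encode(countries, labels, n_folds=3)
print(pd.DataFrame({'Country': countries, 'Target': labels, 'encoded': encoded}))
```

Because a row's own target value never contributes to its encoding, the encoded feature cannot simply memorize the label, which is the leakage that naive per-category means suffer from on small categories.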