DATA_CLEANER

NoventisEncoder

Encoding categorical features is a critical and often complex step in preparing data for machine learning. The choice of encoding strategy (One-Hot, Target, or Label encoding) can dramatically affect model performance. A poor choice can lead to bloated datasets (the curse of dimensionality) or mislead the model by creating false ordinal relationships.

The NoventisEncoder is an advanced tool designed to solve this problem. It not only provides a comprehensive suite of encoding methods but also features an intelligent 'auto' mode. This mode analyzes each categorical column's characteristics (such as its number of unique values, its relationship with the target variable, and its potential memory impact) to recommend and apply the most effective encoding strategy automatically.

Import
PYTHON
from noventis.data_cleaner import NoventisEncoder
Parameters
method
Type
str
Default
'auto'
  • 'auto': (Recommended) Automatically selects the best encoding method for each column based on its statistical properties. Requires target_column to be set.
  • 'label': Converts categories into integers (0, 1, 2, …). Best for binary features or ordinal features where the default integer assignment is acceptable.
  • 'ohe' (One-Hot Encoding): Creates a new binary (0/1) column for each category. Best for low-cardinality nominal features (e.g., ≤ 15 categories).
  • 'target': Replaces each category with the mean of the target variable for that category. Powerful for features with a strong relationship to the target; prone to overfitting, though cross-validation and smoothing mitigate this.
  • 'ordinal': Converts categories to integers based on a user-defined order. Requires category_mapping. Best for features with a clear inherent order (e.g., “Low”, “Medium”, “High”).
  • 'binary': Converts categories into binary code and creates a column for each bit. A memory-efficient alternative to OHE for medium-cardinality features (e.g., 15–50 categories).
  • 'hashing': Uses a hashing function to convert categories into a fixed number of features. Memory-efficient for very high-cardinality features, but can result in collisions (different categories mapped to the same hash); see the column-count sketch after this list.
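
To make these memory trade-offs concrete, the standalone sketch below (plain Python, not NoventisEncoder) compares how many columns each family of encoders produces for a feature with k unique categories. The fixed hashing width of 16 is an arbitrary illustrative choice.

PYTHON
# Rough column counts for a feature with k unique categories: one-hot grows
# linearly, binary grows logarithmically, hashing stays at a fixed width you
# choose up front (here 16) at the cost of possible collisions.
import math

for k in (4, 15, 50, 1000):
    ohe_cols = k
    binary_cols = math.ceil(math.log2(k + 1))
    hashing_cols = 16
    print(f"k={k:>4}  one-hot={ohe_cols:<4}  binary={binary_cols:<2}  hashing={hashing_cols}")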
How does method='auto' work?

The auto mode uses a rule-based system to choose an optimal encoder for each column (a code sketch of this logic follows the list):

  1. Binary features (only 2 unique values): use 'label' encoding.
  2. High cardinality (>50 unique values): use 'target' if the feature is strongly correlated with the target; otherwise fall back to memory-efficient 'hashing'.
  3. Medium cardinality (16–50): prefer 'target' if correlated; otherwise use 'binary' to balance performance and memory.
  4. Low cardinality (3–15): if the correlation is very high and the order is meaningful, use 'ordinal' (requires a mapping). Otherwise default to 'ohe'.
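
A minimal sketch of this decision logic, assuming the column's cardinality and a target-correlation score have already been computed. The function name and the 0.3 / 0.7 correlation thresholds are illustrative assumptions, not NoventisEncoder's actual internals.

PYTHON
# Sketch of the 'auto' rules above. The thresholds (0.3, 0.7) and the helper
# signature are illustrative assumptions, not the library's real implementation.
def choose_encoder(n_unique: int, target_corr: float, has_known_order: bool) -> str:
    if n_unique == 2:                          # rule 1: binary feature
        return 'label'
    if n_unique > 50:                          # rule 2: high cardinality
        return 'target' if target_corr > 0.3 else 'hashing'
    if n_unique >= 16:                         # rule 3: medium cardinality
        return 'target' if target_corr > 0.3 else 'binary'
    if target_corr > 0.7 and has_known_order:  # rule 4: low cardinality with clear order
        return 'ordinal'
    return 'ohe'                               # rule 4 default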
target_column
Type
Optional[str]
Default
None
The name of the target variable (label) column. This is required when method is set to 'auto' or 'target'.
columns_to_encode
Type
Optional[List[str]]
Default
None
A list of specific column names to encode. If None, all categorical columns in the DataFrame will be processed.
category_mapping
Type
Optional[Dict[str, Dict]]
Default
None
A dictionary defining the explicit integer order for ordinal features (e.g., { 'Size': {'Small': 1, 'Medium': 2, 'Large': 3} }, as used in Example 2 below). This is required when method='ordinal'.
cv
Type
Union[float, str]
Default
'auto'
The cross-validation/smoothing setting for the underlying TargetEncoder, which regularizes the encoding for categories with few samples (in Example 3 below, cv=3 selects 3 folds for a small dataset).
target_type
Type
str
Default
'auto'
The type of target variable ('binary' or 'continuous'). Used by TargetEncoder. If 'auto', the type is inferred from target_column.
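
As a rough illustration, the 'auto' inference could work along these lines; the helper below is an assumption for explanatory purposes, not the library's actual heuristic.

PYTHON
# Illustrative guess at how the target type could be inferred when
# target_type='auto'; NoventisEncoder's real heuristic may differ.
import pandas as pd

def infer_target_type(y: pd.Series) -> str:
    return 'binary' if y.nunique() == 2 else 'continuous'

print(infer_target_type(pd.Series([0, 1, 1, 0])))     # binary
print(infer_target_type(pd.Series([2.5, 7.1, 3.3])))  # continuous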
verbose
Type
bool
Default
False
If True, prints a detailed analysis and summary of the encoding process.
Methods
  • fit(X, y)

    Analyzes the dataset and fits the appropriate encoder for each categorical column. The target series y is required for 'auto' and 'target' methods.

  • transform(X) → pd.DataFrame

    Applies the learned encoding to the input DataFrame X and returns the transformed data as a new pd.DataFrame.

  • fit_transform(X, y) → pd.DataFrame

    A convenient shortcut that performs both fit and transform operations in a single step, returning the encoded DataFrame.
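
In practice, the fit/transform split matters when you work with separate training and test data: fit the encoder on the training portion only and reuse it on held-out rows so no target information leaks from the test set. The sketch below uses the documented fit_transform and transform calls; the toy data and the train/test split are illustrative.

PYTHON
# Minimal sketch: learn the encodings from the training split only, then apply
# the already-fitted encoder to held-out data. The toy data here is illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from noventis.data_cleaner import NoventisEncoder

X = pd.DataFrame({'City': ['Oslo', 'Lima', 'Oslo', 'Kyoto', 'Lima', 'Kyoto'] * 5})
y = pd.Series([1, 0, 1, 0, 0, 1] * 5, name='Target')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

encoder = NoventisEncoder(method='target', columns_to_encode=['City'], target_column='Target')
X_train_enc = encoder.fit_transform(X_train, y_train)  # encodings learned from training rows only
X_test_enc = encoder.transform(X_test)                 # the same learned encodings, no refitting
print(X_test_enc.head())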

Model Usage Examples
First, let's create a small sample dataset with several categorical features and a binary target.
PYTHON
import pandas as pd
import numpy as np

data = {
    'Country': ['USA', 'UK', 'Canada', 'USA', 'Germany', 'UK', 'USA', 'France', 'Canada', 'Germany'],
    'Education': ['Bachelors', 'Masters', 'PhD', 'Bachelors', 'Masters', 'Bachelors', 'PhD', 'Masters', 'Masters', 'Bachelors'],
    'Size': ['Medium', 'Small', 'Large', 'Medium', 'Large', 'Small', 'Medium', 'Large', 'Medium', 'Small'],
    'Has_Pet': ['Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
    'Target': [1, 0, 0, 1, 0, 0, 1, 0, 1, 1]
}

df = pd.DataFrame(data)
y = df['Target']
X = df.drop('Target', axis=1)
Example 1: Automatic Encoding (Recommended)

This is the most powerful feature. The encoder will analyze each column and apply the best strategy. verbose=True is highly recommended to understand the decisions made.

PYTHON
# Initialize in 'auto' mode, providing the target column name
encoder_auto = NoventisEncoder(method='auto', target_column='Target', verbose=True)

# Fit and transform the data
df_encoded_auto = encoder_auto.fit_transform(X, y)

print("Transformed DataFrame Head:")
print(df_encoded_auto.head())
Example 2: Manual Ordinal Encoding

This is used when a feature has a clear, inherent order. You must provide the mapping.

PYTHON
# Define the explicit order for the 'Size' column
size_mapping = {
    'Size': {'Small': 1, 'Medium': 2, 'Large': 3}
}

# Initialize in 'ordinal' mode with the mapping
encoder_ordinal = NoventisEncoder(
    method='ordinal',
    columns_to_encode=['Size'],
    category_mapping=size_mapping
)

df_encoded_ordinal = encoder_ordinal.fit_transform(X)
print(df_encoded_ordinal[['Size_ordinal_encoded']].head())
Example 3: Manual Target Encoding

This example applies Target Encoding to the 'Country' column, using cross-validation to ensure the encoding is robust.

PYTHON
# Initialize in 'target' mode for a specific column
encoder_target = NoventisEncoder(
    method='target',
    columns_to_encode=['Country'],
    target_column='Target',
    cv=3  # use 3 folds for this small dataset
)

df_encoded_target = encoder_target.fit_transform(X, y)
print(df_encoded_target[['Country_target_encoded']].head())
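
To see what these encoded values are based on, you can compute the raw per-category target means yourself; the values NoventisEncoder produces will differ somewhat because of the cross-validation folds and any smoothing applied.

PYTHON
# Raw (unsmoothed) per-category target means, the quantity target encoding is built on.
print(df.groupby('Country')['Target'].mean())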