DATA_CLEANER

NoventisEncoder

Encoding categorical features is a critical and often complex step in preparing data for machine learning. The choice of encoding strategy (One-Hot, Target, or Label encoding) can dramatically affect model performance. A poor choice can lead to bloated datasets (the curse of dimensionality) or mislead the model by creating false ordinal relationships.

The NoventisEncoder is an advanced tool designed to solve this problem. It not only provides a comprehensive suite of encoding methods but also features an intelligent 'auto' mode. This mode analyzes each categorical column's characteristics (such as its number of unique values, its relationship with the target variable, and its potential memory impact) to recommend and apply the most effective encoding strategy automatically.

Import
PYTHON
from noventis.data_cleaner import NoventisEncoder
Parameters
method
Type
str
Default
'auto'
  • 'auto': (Recommended) Automatically selects the best encoding method for each column based on its statistical properties. Requires target_column to be set.
  • 'label': Converts categories into integers (0, 1, 2, …). Best for binary features or ordinal features where the default integer assignment is acceptable.
  • 'ohe' (One-Hot Encoding): Creates a new binary (0/1) column for each category. Best for low-cardinality nominal features (e.g., ≤ 15 categories).
  • 'target': Replaces each category with the mean of the target variable for that category. Powerful for features with a strong relationship to the target; prone to overfitting, though cross-validation and smoothing mitigate this.
  • 'ordinal': Converts categories to integers based on a user-defined order. Requires category_mapping. Best for features with a clear inherent order (e.g., “Low”, “Medium”, “High”).
  • 'binary': Converts categories into binary code and creates a column for each bit. A memory-efficient alternative to OHE for medium-cardinality features (e.g., 15–50 categories).
  • 'hashing': Uses a hashing function to convert categories into a fixed number of features. Memory-efficient for very high-cardinality features, but can result in collisions (different categories mapped to the same hash); see the column-count sketch after this list.
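
To make these memory trade-offs concrete, the standalone sketch below (plain Python, not NoventisEncoder) compares how many columns each family of encoders produces for a feature with k unique categories. The fixed hashing width of 16 is an arbitrary illustrative choice.

PYTHON
# Rough column counts for a feature with k unique categories: one-hot grows
# linearly, binary grows logarithmically, hashing stays at a fixed width you
# choose up front (here 16) at the cost of possible collisions.
import math

for k in (4, 15, 50, 1000):
    ohe_cols = k
    binary_cols = math.ceil(math.log2(k + 1))
    hashing_cols = 16
    print(f"k={k:>4}  one-hot={ohe_cols:<4}  binary={binary_cols:<2}  hashing={hashing_cols}")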
How does method='auto' work?

The auto mode uses a rule-based system to choose an optimal encoder for each column (a code sketch of this logic follows the list):

  1. Binary features (only 2 unique values): use 'label' encoding.
  2. High cardinality (>50 unique values): use 'target' if the feature is strongly correlated with the target; otherwise fall back to memory-efficient 'hashing'.
  3. Medium cardinality (16–50): prefer 'target' if correlated; otherwise use 'binary' to balance performance and memory.
  4. Low cardinality (3–15): if the correlation is very high and the order is meaningful, use 'ordinal' (requires a mapping). Otherwise default to 'ohe'.
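
A minimal sketch of this decision logic, assuming the column's cardinality and a target-correlation score have already been computed. The function name and the 0.3 / 0.7 correlation thresholds are illustrative assumptions, not NoventisEncoder's actual internals.

PYTHON
# Sketch of the 'auto' rules above. The thresholds (0.3, 0.7) and the helper
# signature are illustrative assumptions, not the library's real implementation.
def choose_encoder(n_unique: int, target_corr: float, has_known_order: bool) -> str:
    if n_unique == 2:                          # rule 1: binary feature
        return 'label'
    if n_unique > 50:                          # rule 2: high cardinality
        return 'target' if target_corr > 0.3 else 'hashing'
    if n_unique >= 16:                         # rule 3: medium cardinality
        return 'target' if target_corr > 0.3 else 'binary'
    if target_corr > 0.7 and has_known_order:  # rule 4: low cardinality with clear order
        return 'ordinal'
    return 'ohe'                               # rule 4 default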
target_column
Type
Optional[str]
Default
None
The name of the target variable (label) column. This is required when method is set to 'auto' or 'target'.
columns_to_encode
Type
Optional[List[str]]
Default
None
A list of specific column names to encode. If None, all categorical columns in the DataFrame will be processed.
category_mapping
Type
Optional[Dict[str, Dict]]
Default
None
A dictionary defining the explicit integer order for ordinal features (e.g., { 'Size': {'Small': 1, 'Medium': 2, 'Large': 3} }, as used in Example 2 below). This is required when method='ordinal'.
cv
Type
Union[float, str]
Default
'auto'
The cross-validation/smoothing setting for the underlying TargetEncoder, which regularizes the encoding for categories with few samples (in Example 3 below, cv=3 selects 3 folds for a small dataset).
target_type
Type
str
Default
'auto'
The type of target variable ('binary' or 'continuous'). Used by TargetEncoder. If 'auto', the type is inferred from target_column.
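
As a rough illustration, the 'auto' inference could work along these lines; the helper below is an assumption for explanatory purposes, not the library's actual heuristic.

PYTHON
# Illustrative guess at how the target type could be inferred when
# target_type='auto'; NoventisEncoder's real heuristic may differ.
import pandas as pd

def infer_target_type(y: pd.Series) -> str:
    return 'binary' if y.nunique() == 2 else 'continuous'

print(infer_target_type(pd.Series([0, 1, 1, 0])))     # binary
print(infer_target_type(pd.Series([2.5, 7.1, 3.3])))  # continuous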
verbose
Type
bool
Default
False
If True, prints a detailed analysis and summary of the encoding process.
Methods
  • fit(X, y)

    Analyzes the dataset and fits the appropriate encoder for each categorical column. The target series y is required for 'auto' and 'target' methods.

  • transform(X) → pd.DataFrame

    Applies the learned encoding to the input DataFrame X and returns the transformed data as a new pd.DataFrame.

  • fit_transform(X, y) → pd.DataFrame

    A convenient shortcut that performs both fit and transform operations in a single step, returning the encoded DataFrame.
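
In practice, the fit/transform split matters when you work with separate training and test data: fit the encoder on the training portion only and reuse it on held-out rows so no target information leaks from the test set. The sketch below uses the documented fit_transform and transform calls; the toy data and the train/test split are illustrative.

PYTHON
# Minimal sketch: learn the encodings from the training split only, then apply
# the already-fitted encoder to held-out data. The toy data here is illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split
from noventis.data_cleaner import NoventisEncoder

X = pd.DataFrame({'City': ['Oslo', 'Lima', 'Oslo', 'Kyoto', 'Lima', 'Kyoto'] * 5})
y = pd.Series([1, 0, 1, 0, 0, 1] * 5, name='Target')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

encoder = NoventisEncoder(method='target', columns_to_encode=['City'], target_column='Target')
X_train_enc = encoder.fit_transform(X_train, y_train)  # encodings learned from training rows only
X_test_enc = encoder.transform(X_test)                 # the same learned encodings, no refitting
print(X_test_enc.head())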

Model Usage Examples
First, let's create a small sample dataset with several categorical features and a binary target.
PYTHON
import pandas as pd
import numpy as np

data = {
    'Country': ['USA', 'UK', 'Canada', 'USA', 'Germany', 'UK', 'USA', 'France', 'Canada', 'Germany'],
    'Education': ['Bachelors', 'Masters', 'PhD', 'Bachelors', 'Masters', 'Bachelors', 'PhD', 'Masters', 'Masters', 'Bachelors'],
    'Size': ['Medium', 'Small', 'Large', 'Medium', 'Large', 'Small', 'Medium', 'Large', 'Medium', 'Small'],
    'Has_Pet': ['Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
    'Target': [1, 0, 0, 1, 0, 0, 1, 0, 1, 1]
}

df = pd.DataFrame(data)
y = df['Target']
X = df.drop('Target', axis=1)
Example 1: Automatic Encoding (Recommended)

This is the most powerful feature. The encoder will analyze each column and apply the best strategy. verbose=True is highly recommended to understand the decisions made.

PYTHON
# Initialize in 'auto' mode, providing the target column name
encoder_auto = NoventisEncoder(method='auto', target_column='Target', verbose=True)

# Fit and transform the data
df_encoded_auto = encoder_auto.fit_transform(X, y)

print("Transformed DataFrame Head:")
print(df_encoded_auto.head())
Example 2: Manual Ordinal Encoding

This is used when a feature has a clear, inherent order. You must provide the mapping.

PYTHON
# Define the explicit order for the 'Size' column
size_mapping = {
    'Size': {'Small': 1, 'Medium': 2, 'Large': 3}
}

# Initialize in 'ordinal' mode with the mapping
encoder_ordinal = NoventisEncoder(
    method='ordinal',
    columns_to_encode=['Size'],
    category_mapping=size_mapping
)

df_encoded_ordinal = encoder_ordinal.fit_transform(X)
print(df_encoded_ordinal[['Size_ordinal_encoded']].head())
Example 3: Manual Target Encoding

This example applies Target Encoding to the 'Country' column, using cross-validation to ensure the encoding is robust.

PYTHON
# Initialize in 'target' mode for a specific column
encoder_target = NoventisEncoder(
    method='target',
    columns_to_encode=['Country'],
    target_column='Target',
    cv=3  # use 3 folds for this small dataset
)

df_encoded_target = encoder_target.fit_transform(X, y)
print(df_encoded_target[['Country_target_encoded']].head())
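
To see what these encoded values are based on, you can compute the raw per-category target means yourself; the values NoventisEncoder produces will differ somewhat because of the cross-validation folds and any smoothing applied.

PYTHON
# Raw (unsmoothed) per-category target means, the quantity target encoding is built on.
print(df.groupby('Country')['Target'].mean())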