DATA_CLEANER

NoventisImputer

Handling missing data (NaNs) is a critical preprocessing step that can significantly impact model performance. Manually filling these values for each column can be tedious and error-prone. The NoventisImputer provides an intelligent and flexible solution to automate this process.

It automatically detects column types (numeric, categorical) and applies an appropriate imputation strategy to each. Whether you need a simple automatic fix, a global method such as KNN, or a specific strategy for each column, NoventisImputer handles the workflow through a scikit-learn-compatible interface.

Import

PYTHON

from noventis.data_cleaner import NoventisImputer

Parameters

Parameter	Type	Default	Description
method	str, dict, or None	`None`	Controls the imputation strategy. None (Auto Mode, the default): selects a simple strategy for each column: 'mean' for numeric (float) columns, 'mode' for categorical (object) columns. str (Global Method): applies a single method to all columns with missing values. Available options: mean (fills with the column mean, for numeric columns); median (fills with the column median, for numeric columns); mode (fills with the most frequent value); knn (uses K-Nearest Neighbors to impute values based on the nearest data points); constant (fills with a fixed value defined by `fill_value`); ffill (forward-fills the last valid observation); bfill (backward-fills with the next valid observation); drop (drops rows containing missing values in the processed columns). dict (Per-Column Method): provides fine-grained control by specifying a method for each column, e.g. { "Age": "median", "Salary": "knn", "Embarked": "mode" }.
columns	Optional[List[str]]	`None`	A list of column names to apply the imputation to. If `None`, the imputer will automatically find and process all columns in the DataFrame that have missing values.
fill_value	Any	`None`	The constant value to use for imputation when `method="constant"`.
n_neighbors	int	`5`	The number of neighboring samples to use for imputation when `method="knn"`.
verbose	bool	`False`	If `True`, a summary of the imputation process will be printed after fitting.
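
The positional options (forward fill, backward fill, and drop) are easiest to tell apart on a small series. The sketch below uses plain pandas calls (`Series.ffill`, `Series.bfill`, `Series.dropna`) to illustrate the semantics described in the table. It is an illustration only and makes no claim about how NoventisImputer implements these options internally.

```python
import numpy as np
import pandas as pd

# A small series with gaps in the middle and at the end
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()    # carry the last valid value forward
backward = s.bfill()   # pull the next valid value backward
dropped = s.dropna()   # remove the missing entries entirely

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0, nan]
print(dropped.tolist())   # [1.0, 4.0]
```

Note that a backward fill cannot repair a trailing gap (and a forward fill cannot repair a leading one), because there is no valid observation on that side to copy from.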

Methods

fit(X)
Analyzes the input DataFrame X and learns the imputation strategy from it.
transform(X)
Returns pd.DataFrame. Applies the learned imputation to the DataFrame X and returns the transformed data.
fit_transform(X)
Returns pd.DataFrame. A convenience method that performs the fit and transform operations in a single step.
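
To make the fit/transform split concrete, here is a minimal sketch of the contract using a hypothetical stand-in class, `SimpleAutoImputer`, written in plain pandas. The class name, the `fill_values_` attribute, and the choice to store one fill value per column are all assumptions made for illustration; NoventisImputer's actual internals may differ. The point is that statistics are learned once in `fit` and then reused by `transform`, so new data is filled with values learned from the training data.

```python
import numpy as np
import pandas as pd


class SimpleAutoImputer:
    """Hypothetical stand-in: mean for numeric columns, mode for the rest."""

    def fit(self, X: pd.DataFrame) -> "SimpleAutoImputer":
        # Learn one fill value per column that has missing entries
        self.fill_values_ = {}
        for col in X.columns:
            if not X[col].isna().any():
                continue
            if pd.api.types.is_numeric_dtype(X[col]):
                self.fill_values_[col] = X[col].mean()
            else:
                self.fill_values_[col] = X[col].mode().iloc[0]
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        # Apply the learned values; returns a new DataFrame
        return X.fillna(self.fill_values_)

    def fit_transform(self, X: pd.DataFrame) -> pd.DataFrame:
        return self.fit(X).transform(X)


train = pd.DataFrame({"Age": [20.0, 30.0, np.nan], "City": ["Paris", "Paris", None]})
test = pd.DataFrame({"Age": [np.nan, 50.0], "City": [None, "Tokyo"]})

imputer = SimpleAutoImputer().fit(train)
out = imputer.transform(test)
print(out)
```

Here the missing `Age` in `test` is filled with 25.0, the mean learned from `train`, rather than with a statistic computed from `test` itself.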

Model Usage Examples

First, let's create some sample data with missing values.

PYTHON

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Age': [22, 38, 26, 35, np.nan, 28, 50, np.nan],
    'Salary': [72000, 48000, 54000, 61000, 75000, np.nan, 83000, 45000],
    'City': ['London', 'Paris', 'New York', np.nan, 'Tokyo', 'London', 'Paris', 'New York'],
    'Experience': [1, 10, 3, 8, 5, 4, 20, np.nan]  # Integer-valued, but stored as float64 because of the NaN
})

Example 1: Automatic Imputation (Default)

This is the simplest use case. The imputer will automatically use the mean for numeric columns (Age, Salary, Experience) and the mode for the categorical column (City).

PYTHON

# Initialize the imputer with no parameters for auto mode
imputer = NoventisImputer(verbose=True)

# Fit and transform the data
df_imputed = imputer.fit_transform(df)
print(df_imputed)
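
As a cross-check, the same mean/mode rule can be worked out by hand with plain pandas. This is a sketch of what the rule implies for the sample data, not the imputer's own code; the variable names `expected` and `city_modes` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [22, 38, 26, 35, np.nan, 28, 50, np.nan],
    'Salary': [72000, 48000, 54000, 61000, 75000, np.nan, 83000, 45000],
    'City': ['London', 'Paris', 'New York', np.nan, 'Tokyo', 'London', 'Paris', 'New York'],
    'Experience': [1, 10, 3, 8, 5, 4, 20, np.nan],
})

expected = df.copy()

# Numeric columns: fill with the column mean (NaNs are skipped by default)
for col in ['Age', 'Salary', 'Experience']:
    expected[col] = expected[col].fillna(expected[col].mean())

# Categorical column: fill with a most frequent value
city_modes = df['City'].mode()
expected['City'] = expected['City'].fillna(city_modes.iloc[0])

print(city_modes.tolist())
print(expected)
```

In this sample, London, Paris, and New York each appear twice, so the mode of City is a three-way tie. The value a mode-based imputer picks for the missing city therefore depends on its tie-breaking rule; the sketch simply takes the first entry that `Series.mode` returns.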

Example 2: Using a Global Method (KNN)

Here, we apply the K-Nearest Neighbors algorithm to all numeric columns. KNN is not applicable to categorical columns, so the imputer falls back to the mode for those.

PYTHON

# Initialize with method='knn'
imputer_knn = NoventisImputer(method='knn', n_neighbors=3, verbose=True)

# Fit and transform
df_knn_imputed = imputer_knn.fit_transform(df)
print(df_knn_imputed)
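
For comparison, the same idea (KNN on numeric columns, mode on the categorical one) can be sketched directly with scikit-learn's `KNNImputer`. Whether NoventisImputer uses `KNNImputer` internally, or scales the features first, is not stated here, so treat this as an assumed equivalent rather than the library's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'Age': [22, 38, 26, 35, np.nan, 28, 50, np.nan],
    'Salary': [72000, 48000, 54000, 61000, 75000, np.nan, 83000, 45000],
    'City': ['London', 'Paris', 'New York', np.nan, 'Tokyo', 'London', 'Paris', 'New York'],
    'Experience': [1, 10, 3, 8, 5, 4, 20, np.nan],
})

result = df.copy()

# KNN works on numeric features only, so select those columns
numeric_cols = df.select_dtypes(include='number').columns
knn = KNNImputer(n_neighbors=3)
result[numeric_cols] = knn.fit_transform(df[numeric_cols])

# Fall back to the most frequent value for the categorical column
result['City'] = result['City'].fillna(df['City'].mode().iloc[0])

print(result)
```

One design point to keep in mind: KNN is distance-based, so on unscaled data a large-magnitude column such as Salary dominates the choice of neighbors. Scaling the numeric columns before imputation is a common way to balance their influence.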

Example 3: Per-Column Custom Strategy

This example demonstrates the highest level of control, where we define a specific imputation method for each column.

PYTHON

# Define a dictionary with specific methods for each column
custom_methods = {
    'Age': 'median',
    'Salary': 'mean',
    'City': 'mode',
    'Experience': 'constant'
}

# Initialize the imputer with the custom dictionary and a fill_value for 'constant'
imputer_custom = NoventisImputer(method=custom_methods, fill_value=0, verbose=True)

# Fit and transform
df_custom_imputed = imputer_custom.fit_transform(df)
print(df_custom_imputed)
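
The per-column dictionary amounts to a dispatch from column name to fill rule. The sketch below shows that dispatch with a hypothetical helper, `impute_per_column`, written in plain pandas; the helper name and its supported subset of methods are assumptions for illustration and are not part of the NoventisImputer API.

```python
import numpy as np
import pandas as pd


def impute_per_column(df, methods, fill_value=None):
    """Hypothetical helper: apply one simple fill rule per column."""
    out = df.copy()
    for col, method in methods.items():
        if method == 'mean':
            value = out[col].mean()
        elif method == 'median':
            value = out[col].median()
        elif method == 'mode':
            value = out[col].mode().iloc[0]
        elif method == 'constant':
            value = fill_value
        else:
            raise ValueError(f"Unsupported method in this sketch: {method!r}")
        out[col] = out[col].fillna(value)
    return out


df = pd.DataFrame({
    'Age': [22, 38, 26, 35, np.nan, 28, 50, np.nan],
    'Salary': [72000, 48000, 54000, 61000, 75000, np.nan, 83000, 45000],
    'City': ['London', 'Paris', 'New York', np.nan, 'Tokyo', 'London', 'Paris', 'New York'],
    'Experience': [1, 10, 3, 8, 5, 4, 20, np.nan],
})

custom_methods = {'Age': 'median', 'Salary': 'mean', 'City': 'mode', 'Experience': 'constant'}
df_custom = impute_per_column(df, custom_methods, fill_value=0)
print(df_custom)
```

With this data, the two missing Age values become 31.5 (the median of the six observed ages) and the missing Experience value becomes 0.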