Quickstart Guide: From Raw Data to a Model in 5 Minutes

Welcome to Noventis! In this guide, we'll walk through a complete machine learning workflow using just a few lines of code. We'll take a "dirty" dataset, automatically analyze it, clean it, and then train and compare multiple models to find the best one.

Let's get started!

Step 1: Setup & Load Sample Data

First, let's import all the tools we'll need from the Noventis library. We'll also create a sample DataFrame that has several common issues: missing data (NaN), categorical features, a potential outlier, and a binary target to predict.

PYTHON

import pandas as pd
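import numpy as np
from IPython.display import display  # built into Jupyter; imported so display() also works in plain scripts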
from noventis_eda import NoventisAutoEDA
from noventis_datacleaner import data_cleaner
from noventis_automl import NoventisAutoML

# Create a "dirty" sample DataFrame
data = {
    'Age': [22, 38, 26, 35, np.nan, 28, 50, 45],
    'City': ['London', 'Paris', 'New York', 'Tokyo', 'London', 'Paris', np.nan, 'New York'],
    'Experience': [1, 10, 3, 8, 5, 4, 20, 15],
    'Salary': [72000, 48000, 54000, 250000, 75000, np.nan, 83000, 45000], # 250000 is an outlier
    'Purchased': [0, 1, 0, 1, 1, 0, 1, 0] # Our target
}
df = pd.DataFrame(data)

print("Initial Data:")
display(df)

Step 2: Automated Exploratory Data Analysis (AutoEDA)

Before we clean the data, it's a good idea to "peek" inside to understand its issues. Let's use NoventisAutoEDA to automatically generate an interactive report.

PYTHON

# Initialize AutoEDA with our data and target
eda = NoventisAutoEDA(data=df, target='Purchased')

# Run the analysis and display the report
eda.run()

Calling eda.run() generates a complete HTML dashboard showing data distributions, missing values, correlations, and more. From this, we can confirm that we have missing data in the Age, City, and Salary columns.
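If you'd like to double-check those findings without the dashboard, plain pandas can do it. The snippet below is a small cross-check and not part of Noventis: it rebuilds the sample DataFrame from Step 1, counts missing values per column, and flags Salary outliers with the common 1.5 * IQR rule. The IQR rule is our choice for illustration; the report may use a different method.

```python
import numpy as np
import pandas as pd

# Same sample data as in Step 1
df = pd.DataFrame({
    'Age': [22, 38, 26, 35, np.nan, 28, 50, 45],
    'City': ['London', 'Paris', 'New York', 'Tokyo', 'London', 'Paris', np.nan, 'New York'],
    'Experience': [1, 10, 3, 8, 5, 4, 20, 15],
    'Salary': [72000, 48000, 54000, 250000, 75000, np.nan, 83000, 45000],
    'Purchased': [0, 1, 0, 1, 1, 0, 1, 0],
})

# Count missing values per column and keep only the affected columns
missing_counts = df.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(columns_with_missing)  # ['Age', 'City', 'Salary']

# Flag Salary values outside the 1.5 * IQR fences (NaN values are ignored)
q1, q3 = df['Salary'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df['Salary'] < lower) | (df['Salary'] > upper)
salary_outliers = df.loc[is_outlier, 'Salary'].tolist()
print(salary_outliers)  # [250000.0]
```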

Step 3: Automated Data Cleaning (Just One Line!)

Now that we know the problems, let's fix them with a single line of code using data_cleaner. Using smart defaults, this function handles missing values and outliers, encodes categorical features, and scales the numeric ones.

PYTHON

# Run the automated data cleaner
cleaned_df = data_cleaner(data=df, target_column='Purchased')

print("\nData After Cleaning:")
display(cleaned_df.head())

Notice how the City column has been transformed into several numeric columns (via encoding), and all NaN values have been filled. Our data is now clean, fully numeric, and ready for machine learning!
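To build intuition for what kind of work an automated cleaner does, here's a hand-written pandas sketch of similar steps. It's an illustration only: manual_clean is a hypothetical helper, and the specific choices (median and mode imputation, 1.5 * IQR clipping, one-hot encoding, standardization) are our assumptions, not necessarily the defaults that data_cleaner uses.

```python
import numpy as np
import pandas as pd

# Same sample data as in Step 1
df = pd.DataFrame({
    'Age': [22, 38, 26, 35, np.nan, 28, 50, 45],
    'City': ['London', 'Paris', 'New York', 'Tokyo', 'London', 'Paris', np.nan, 'New York'],
    'Experience': [1, 10, 3, 8, 5, 4, 20, 15],
    'Salary': [72000, 48000, 54000, 250000, 75000, np.nan, 83000, 45000],
    'Purchased': [0, 1, 0, 1, 1, 0, 1, 0],
})


def manual_clean(frame, target_column):
    """Hypothetical stand-in showing typical cleaning steps (not Noventis code)."""
    out = frame.copy()
    features = [c for c in out.columns if c != target_column]
    numeric = [c for c in features if pd.api.types.is_numeric_dtype(out[c])]
    categorical = [c for c in features if c not in numeric]

    # 1. Impute: median for numeric columns, most frequent value for categorical
    for col in numeric:
        out[col] = out[col].fillna(out[col].median())
    for col in categorical:
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    # 2. Outliers: clip numeric values to the 1.5 * IQR fences
    for col in numeric:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # 3. Encode: one-hot encode categorical columns into 0/1 indicator columns
    out = pd.get_dummies(out, columns=categorical, dtype=int)

    # 4. Scale: standardize numeric features to zero mean and unit variance
    for col in numeric:
        std = out[col].std()
        if std > 0:
            out[col] = (out[col] - out[col].mean()) / std
    return out


manual_cleaned = manual_clean(df, 'Purchased')
print(manual_cleaned.columns.tolist())
print(manual_cleaned.isna().sum().sum())  # 0
```

The target column is left untouched on purpose, so the labels keep their original 0/1 values.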

Step 4: Automated Machine Learning (AutoML)

With our data now clean, it's time to train a model. NoventisAutoML will take over, automatically detecting the task (classification), training various models, and comparing them to find the best one within our specified time budget.

PYTHON

# Initialize AutoML with the cleaned data
# We'll give it a 60-second time budget to find the best model
automl = NoventisAutoML(
    data=cleaned_df,
    target='Purchased',
    time_budget=60,
)

# Start the training and evaluation process
results = automl.fit()

This process displays a log as each model is tested. Once it finishes, the best model is saved and the results are ready to be displayed.
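If you're curious what "train several models and compare them within a time budget" can look like in general, here's a rough sketch using scikit-learn. It assumes scikit-learn is installed and uses synthetic data. The candidate models, the 5-fold cross-validation, and the simple budget check are our own choices for illustration, not a description of how NoventisAutoML works internally.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (a stand-in for a cleaned dataset)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Candidate models to compare (an arbitrary selection for this sketch)
candidates = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTree': DecisionTreeClassifier(random_state=0),
    'RandomForest': RandomForestClassifier(n_estimators=50, random_state=0),
}

time_budget = 60  # seconds
start = time.monotonic()
scores = {}
for name, model in candidates.items():
    # Stop starting new candidates once the budget is used up
    if time.monotonic() - start > time_budget:
        break
    # Mean accuracy over 5 cross-validation folds
    scores[name] = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()

# Rank the evaluated models from best to worst
leaderboard = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in leaderboard:
    print(f"{name}: {score:.3f}")
best_name = leaderboard[0][0]
```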

Step 5: See the Results!

The process is complete! NoventisAutoML has found the best model and generated an interactive report that will appear directly in your output cell (if you're using a Jupyter Notebook).

This report contains everything you need:

Model comparison rankings.
Detailed performance metrics of the best model (Accuracy, F1-Score, etc.).
Visualizations like a Confusion Matrix and Feature Importance.
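To make the metrics in the report easier to read, here's a tiny worked example that computes a confusion matrix, accuracy, and F1-score by hand with NumPy. The true labels are the Purchased column from Step 1; the predictions are hypothetical, made up purely for illustration, and are not output from any Noventis model.

```python
import numpy as np

y_true = np.array([0, 1, 0, 1, 1, 0, 1, 0])  # the Purchased column from Step 1
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 1])  # hypothetical predictions (illustration only)

# Count the four outcome types for the positive class (1)
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

# Rows are actual classes, columns are predicted classes
confusion = np.array([[tn, fp],
                      [fn, tp]])

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(confusion.tolist())  # [[3, 1], [1, 3]]
print(accuracy, f1)        # 0.75 0.75
```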

Conclusion

Congratulations! In just a few minutes and with only a handful of lines of code, you have successfully:

Analyzed the quality of a dataset automatically.
Cleaned the data of various common issues.
Trained, evaluated, and compared multiple ML models.
Found the best-performing model and saved it.
Generated a comprehensive, interactive report.

You are now ready to explore the more in-depth features of each Noventis component!