Quickstart Guide: From Raw Data to a Model in 5 Minutes
Welcome to Noventis! In this guide, we'll walk through a complete machine learning workflow using just a few lines of code. We'll take a "dirty" dataset, automatically analyze it, clean it, and then train and compare multiple models to find the best one.
Let's get started!
Step 1: Setup & Load Sample Data
First, let's import all the tools we'll need from the Noventis library. We'll also create a sample DataFrame that has several common issues: missing data (NaN), categorical features, a potential outlier, and a binary target to predict.
import pandas as pd
import numpy as np
from noventis_eda import NoventisAutoEDA
from noventis_datacleaner import data_cleaner
from noventis_automl import NoventisAutoML
# Create a "dirty" sample DataFrame
data = {
    'Age': [22, 38, 26, 35, np.nan, 28, 50, 45],
    'City': ['London', 'Paris', 'New York', 'Tokyo', 'London', 'Paris', np.nan, 'New York'],
    'Experience': [1, 10, 3, 8, 5, 4, 20, 15],
    'Salary': [72000, 48000, 54000, 250000, 75000, np.nan, 83000, 45000],  # 250000 is an outlier
    'Purchased': [0, 1, 0, 1, 1, 0, 1, 0]  # Our target
}
df = pd.DataFrame(data)
print("Initial Data:")
display(df)  # display() is built into Jupyter/IPython; in a plain script, use print(df)
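If you want to sanity-check the outlier noted in the Salary comment, the common 1.5 × IQR rule is one way to do it. This is a plain-pandas sketch that rebuilds the Salary column so it runs on its own; the IQR rule and the 1.5 multiplier are our assumption for illustration, not necessarily how Noventis detects outliers.

```python
import numpy as np
import pandas as pd

# Salary column copied from the sample data above
salary = pd.Series(
    [72000, 48000, 54000, 250000, 75000, np.nan, 83000, 45000],
    name='Salary',
)

# Quartiles (NaN values are skipped by default)
q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Values outside these fences are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = salary[(salary < lower_fence) | (salary > upper_fence)]

print(f"Fences: [{lower_fence}, {upper_fence}]")
print(outliers)
```

On this sample, only the 250000 entry falls outside the fences.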
Step 2: Automated Exploratory Data Analysis (AutoEDA)
# Initialize AutoEDA with our data and target
eda = NoventisAutoEDA(data=df, target='Purchased')
# Run the analysis and display the report
eda.run()
Calling eda.run() generates a complete HTML dashboard showing data distributions, missing values, correlations, and more. From the dashboard, we can confirm that the Age, City, and Salary columns each contain missing data.
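To cross-check the missing-value finding without the dashboard, a few lines of plain pandas are enough. This sketch is independent of Noventis and rebuilds the sample DataFrame so it can run on its own.

```python
import numpy as np
import pandas as pd

# Same sample data as in Step 1
df = pd.DataFrame({
    'Age': [22, 38, 26, 35, np.nan, 28, 50, 45],
    'City': ['London', 'Paris', 'New York', 'Tokyo', 'London', 'Paris', np.nan, 'New York'],
    'Experience': [1, 10, 3, 8, 5, 4, 20, 15],
    'Salary': [72000, 48000, 54000, 250000, 75000, np.nan, 83000, 45000],
    'Purchased': [0, 1, 0, 1, 1, 0, 1, 0],
})

# Count NaN values per column, then keep only the columns that have any
missing_counts = df.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(missing_counts)
print("Columns with missing data:", columns_with_missing)
```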
Step 3: Automated Data Cleaning (Just One Line!)
# Run the automated data cleaner
cleaned_df = data_cleaner(data=df, target_column='Purchased')
print("\nData After Cleaning:")
display(cleaned_df.head())
Notice how the City column has been transformed into several numeric columns (via encoding), and all NaN values have been filled. Our data is now clean, fully numeric, and ready for machine learning!
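For intuition about what this kind of cleaning involves, here is a manual sketch in plain pandas. The specific choices (median for numeric gaps, most frequent value for City, one-hot encoding) are our assumptions for illustration; data_cleaner may use different strategies and produce different column names.

```python
import numpy as np
import pandas as pd

# Same sample data as in Step 1
df = pd.DataFrame({
    'Age': [22, 38, 26, 35, np.nan, 28, 50, 45],
    'City': ['London', 'Paris', 'New York', 'Tokyo', 'London', 'Paris', np.nan, 'New York'],
    'Experience': [1, 10, 3, 8, 5, 4, 20, 15],
    'Salary': [72000, 48000, 54000, 250000, 75000, np.nan, 83000, 45000],
    'Purchased': [0, 1, 0, 1, 1, 0, 1, 0],
})

manual_clean = df.copy()

# Numeric gaps: fill with the median, which is not skewed by the Salary outlier
for col in ['Age', 'Salary']:
    manual_clean[col] = manual_clean[col].fillna(manual_clean[col].median())

# Categorical gap: fill with the most frequent value (ties resolve alphabetically)
most_common_city = manual_clean['City'].mode().iloc[0]
manual_clean['City'] = manual_clean['City'].fillna(most_common_city)

# One-hot encode City into one 0/1 column per city
manual_clean = pd.get_dummies(manual_clean, columns=['City'], dtype=int)

print(manual_clean)
```

The result has no NaN values, and City is replaced by columns such as City_London and City_Paris.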
Step 4: Automated Machine Learning (AutoML)
# Initialize AutoML with the cleaned data
# We'll give it a 60-second time budget to find the best model
automl = NoventisAutoML(data=cleaned_df,
                        target='Purchased',
                        time_budget=60)
# Start the training and evaluation process
results = automl.fit()
Training prints a log as each candidate model is tested. When it finishes, the best model is saved and the results are ready to display.
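To see what "train and compare multiple models" means in practice, here is a small hand-rolled comparison with scikit-learn. It is only a sketch under our own assumptions: the three candidate models, the 4-fold cross-validation, and the accuracy metric are chosen for illustration and do not reflect what NoventisAutoML does internally. The numeric columns are median-filled inline so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Numeric columns from the sample data; a stand-in for cleaned_df
df = pd.DataFrame({
    'Age': [22, 38, 26, 35, np.nan, 28, 50, 45],
    'Experience': [1, 10, 3, 8, 5, 4, 20, 15],
    'Salary': [72000, 48000, 54000, 250000, 75000, np.nan, 83000, 45000],
    'Purchased': [0, 1, 0, 1, 1, 0, 1, 0],
})
df = df.fillna(df.median())

X = df.drop(columns='Purchased')
y = df['Purchased']

# Hypothetical candidate list, including a trivial baseline for reference
candidates = {
    'baseline': DummyClassifier(strategy='most_frequent'),
    'logistic_regression': make_pipeline(StandardScaler(), LogisticRegression()),
    'decision_tree': DecisionTreeClassifier(max_depth=2, random_state=0),
}

# Stratified folds keep both classes present in every train/test split
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring='accuracy').mean()
    for name, model in candidates.items()
}

# Rank candidates from best to worst mean accuracy
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: mean accuracy = {score:.2f}")
```

With only eight rows, the scores are far too noisy to mean much; the point is the shape of the loop, not the numbers.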
Step 5: See the Results!
The results report generated by AutoML contains everything you need:
- Model comparison rankings.
- Detailed performance metrics of the best model (Accuracy, F1-Score, etc.).
- Visualizations like a Confusion Matrix and Feature Importance.
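If the metrics in the report are new to you, this short worked example shows how Accuracy, F1-Score, and the Confusion Matrix relate. The true labels are the Purchased column from the sample data; the predictions are hypothetical values made up for illustration, not output from any Noventis model.

```python
# True labels: the Purchased column from the sample data
y_true = [0, 1, 0, 1, 1, 0, 1, 0]
# Hypothetical predictions, invented for this example
y_pred = [0, 1, 0, 1, 0, 0, 1, 1]

pairs = list(zip(y_true, y_pred))

# The four cells of a binary confusion matrix
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("Confusion matrix (rows = actual, columns = predicted):")
print([[tn, fp], [fn, tp]])
print(f"Accuracy: {accuracy:.2f}, F1-Score: {f1:.2f}")
```

Here six of the eight predictions are correct, with one false positive and one false negative, so Accuracy and F1-Score both come out to 0.75.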
Conclusion
Congratulations! In just a few minutes and with only a handful of lines of code, you have successfully:
- Analyzed the quality of a dataset automatically.
- Cleaned the data of various common issues.
- Trained, evaluated, and compared multiple ML models.
- Found the best-performing model and saved it.
- Generated a comprehensive, interactive report.
You are now ready to explore the more in-depth features of each Noventis component!