Data Preprocessing for Machine Learning
The essential guide to cleaning, transforming, and preparing data for AI models.
Introduction
Machine learning performance depends on model architecture — but even more on the quality of data you feed into it.
"Better data beats better models."
Even the most advanced neural network will fail if your data is messy, inconsistent, incomplete, or poorly scaled.
In real businesses — finance, healthcare, e-commerce, marketing, gaming — practitioners often estimate that around 80% of machine learning work is preprocessing. This guide teaches you everything you need to properly prepare datasets before training.
1. Why Data Preprocessing Matters
Raw data is rarely usable. It often contains:
- Missing values
- Duplicates
- Incorrect formatting
- Extreme outliers
- Inconsistent categories
- Unequal scales
- Noise
If you train a model on this:
- ❌ Accuracy drops
- ❌ Overfitting rises
- ❌ Predictions become unreliable
- ❌ Model fails on real-world data
Good preprocessing ensures:
- ✔ Clean, consistent datasets
- ✔ Better accuracy
- ✔ Faster training times
- ✔ More stable predictions
- ✔ Lower computational cost
2. Step 1 — Data Cleaning
Cleaning is the foundation of preprocessing.
2.1 Handling Missing Values
Approaches:
A) Remove rows
Useful when missing values are rare.
```python
df = df.dropna()
```
B) Fill with statistical values
- Mean/median for numeric columns
- Mode for categorical columns
```python
df["age"] = df["age"].fillna(df["age"].median())
```
C) Forward/Backward fill
```python
df = df.ffill()  # forward fill (fillna(method=...) is deprecated in recent pandas)
df = df.bfill()  # backward fill
```
D) Predict missing values (advanced)
Use models to impute values (KNN, regression).
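A minimal sketch of model-based imputation with scikit-learn's KNNImputer (the column names and values here are hypothetical):

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age": [25, None, 31, 22],
    "income": [40_000, 52_000, None, 38_000],
})

# Each missing value is estimated from the k nearest rows,
# measured on the non-missing features.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

KNN imputation tends to preserve relationships between columns better than a flat mean fill, at the cost of more computation.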
2.2 Handling Duplicates
```python
df.drop_duplicates(inplace=True)
```
Duplicates distort distributions and correlations.
2.3 Handling Inconsistent Formats
Examples:
- Upper/lower-case mismatch
- Date formats
- Numeric strings
- Currency symbols
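Two of these fixes can be sketched like this (column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "signup": ["2024-01-05", "2024-02-10"],
    "price": ["$1,200", "$950"],
})

# Parse date strings into proper datetimes.
df["signup"] = pd.to_datetime(df["signup"])

# Strip currency symbols and thousands separators, then cast to numeric.
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)
```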
```python
df["city"] = df["city"].str.lower()
```
3. Step 2 — Data Transformation
After cleaning, you need to transform values into formats ML can understand.
3.1 Normalization vs Standardization
Many ML models (SVM, KNN, Neural Networks) require scaled data.
Normalization (0–1 Range)
Good for neural networks.
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df)
```
Standardization (Mean 0, Std 1)
Ideal for linear models and SVM.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
```
3.2 Encoding Categorical Data
Models cannot process text categories.
A) One-Hot Encoding
```python
df = pd.get_dummies(df, columns=["gender"])
```
B) Label Encoding
```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
df["color"] = encoder.fit_transform(df["color"])
```
C) Target Encoding
Useful for high-cardinality columns in large datasets.
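A minimal hand-rolled sketch of target encoding, replacing each category with the mean of the target (column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b"],
    "churned": [1, 0, 1, 1, 0],
})

# Replace each category with the mean target value observed for it.
means = df.groupby("city")["churned"].mean()
df["city_encoded"] = df["city"].map(means)
```

In practice, compute the means on training folds only; encoding with statistics that include the test rows leaks the target.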
4. Step 3 — Feature Engineering
Feature engineering transforms raw data into meaningful features.
Common Techniques:
4.1 Creating new features
Examples:
- BMI from height & weight
- Total revenue from quantity × price
- Age from date of birth
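The first two examples above can be sketched in a few lines of pandas (column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "height_m": [1.75, 1.60],
    "weight_kg": [70.0, 55.0],
    "quantity": [3, 2],
    "unit_price": [10.0, 4.5],
})

# BMI = weight / height^2
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Total revenue = quantity x unit price
df["revenue"] = df["quantity"] * df["unit_price"]
```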
4.2 Feature Extraction
Reduce dimensionality using:
- PCA (Principal Component Analysis)
- t-SNE
- Autoencoders
Example PCA:
```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
reduced = pca.fit_transform(df)
```
4.3 Feature Selection
Select the most important columns.
Techniques:
- Correlation analysis
- Mutual information
- Chi-square test
- Recursive Feature Elimination (RFE)
```python
from sklearn.feature_selection import RFE
```
5. Step 4 — Handling Outliers
Outliers distort distributions and confuse models.
Approaches:
5.1 Z-Score Method
```python
import numpy as np
from scipy import stats

df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
```
5.2 IQR Method
```python
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
```
5.3 Capping (Winsorization)
Replace extreme values with thresholds.
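A minimal capping sketch using pandas' `clip`, with percentile thresholds (the 5th/95th cutoffs are an assumption; pick what fits your data):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 100])

# Cap values at the 5th and 95th percentiles.
lower, upper = s.quantile(0.05), s.quantile(0.95)
capped = s.clip(lower=lower, upper=upper)
```

Unlike removal, capping keeps every row, which matters when outliers carry signal or the dataset is small.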
6. Step 5 — Splitting Data
Always split before training.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Avoid leakage: fit scalers and encoders on the training set only, then apply them to the test set.
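A minimal sketch of leakage-free scaling (the data here is synthetic):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on training data only; reuse the fitted statistics for the test set.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
```

Calling `fit_transform` on the full dataset before splitting would let test-set statistics influence the training features.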
7. Step 6 — Balancing the Dataset
If one class dominates, models become biased.
Solutions:
A) Oversampling
```python
from imblearn.over_sampling import SMOTE
```
B) Undersampling
Remove instances from majority class.
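Random undersampling can be sketched in plain pandas, no extra library needed (column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"label": [0] * 8 + [1] * 2, "x": range(10)})

# Downsample the majority class to the minority class size.
minority = df[df["label"] == 1]
majority = df[df["label"] == 0].sample(n=len(minority), random_state=42)
balanced = pd.concat([majority, minority])
```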
C) Class weights
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight="balanced")
```
8. Step 7 — Noise Reduction
Remove random errors in the data.
Methods:
- Smoothing
- Rolling averages
- Removing irrelevant features
- Filtering sparse text
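A rolling average, the simplest smoothing method above, can be sketched as (values are synthetic):

```python
import pandas as pd

s = pd.Series([1.0, 9.0, 2.0, 8.0, 3.0])

# A 3-point centered rolling mean smooths out high-frequency noise.
smoothed = s.rolling(window=3, center=True).mean()
```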
For images:
- Gaussian blur
- Median filtering
9. Step 8 — Data Augmentation
Used mainly in:
- Computer vision
- NLP
- Audio processing
Examples:
- Rotate/flip images
- Synonym replacement for text
- Pitch shifting for audio
Augmentation increases dataset size and prevents overfitting.
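For images stored as NumPy arrays, the simplest augmentations above can be sketched without any image library:

```python
import numpy as np

img = np.arange(12).reshape(3, 4)  # stand-in for a grayscale image

flipped_h = np.fliplr(img)  # horizontal flip
flipped_v = np.flipud(img)  # vertical flip
rotated = np.rot90(img)     # 90-degree rotation
```

Dedicated libraries (e.g. torchvision or albumentations) add random crops, color jitter, and composable pipelines on top of these basics.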
10. Putting It All Together (Master Workflow)
Your full preprocessing pipeline typically becomes:
- Load raw dataset
- Clean data (missing values, duplicates)
- Fix formats (dates, text, numerics)
- Encode categories
- Scale numeric values
- Engineer new features
- Handle outliers
- Reduce dimensionality
- Split dataset
- Balance classes
- Augment if needed
- Train model
This pipeline works for almost all ML tasks — classification, regression, clustering, NLP, image classification, and more.
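The workflow above maps naturally onto scikit-learn's Pipeline and ColumnTransformer, which bundle imputation, scaling, and encoding so they are fitted on training data only (the columns and data here are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, None, 31, 22, 40, 35],
    "city": ["a", "b", "a", "b", "a", "b"],
    "label": [0, 1, 0, 1, 0, 1],
})
X, y = df.drop(columns="label"), df["label"]

# Numeric columns: impute then scale; categorical columns: one-hot encode.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```

Because the whole pipeline is fitted in one `fit` call on training data, the leakage pitfalls from Step 5 are handled automatically.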
11. Summary
By now, you've mastered:
- ✔ Data cleaning
- ✔ Handling missing values
- ✔ Scaling & normalization
- ✔ Encoding categorical variables
- ✔ Feature engineering
- ✔ Outlier detection
- ✔ Dataset splitting & balancing
- ✔ Noise reduction & augmentation
This is the real backbone of machine learning. Models rely on good preprocessing — it's where most accuracy improvements happen.