Machine Learning

From Zero to Hero: A Comprehensive Guide

Master Machine Learning from basics to advanced concepts with interactive examples and practical code snippets


Beginner Level

Start your journey into the world of Machine Learning with fundamental concepts and algorithms

1. Introduction to Machine Learning

What is Machine Learning?

Machine Learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing computer programs that can access data and use it to learn for themselves.

Why Machine Learning?

  • Automates decision-making processes
  • Handles large and complex datasets
  • Improves over time with more data
  • Solves problems where traditional programming fails

Example: Email Spam Detection

A machine learning model can learn to classify emails as spam or not spam by analyzing thousands of labeled email examples.

# Simple example using scikit-learn
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Sample data
emails = [
    "Win money now! Click here!",
    "Meeting at 3pm tomorrow",
    "Free gift! Claim now!",
    "Project deadline next week"
]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Vectorize text
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train model
model = MultinomialNB()
model.fit(X, labels)

# Predict
new_email = ["Congratulations! You won a prize!"]
X_new = vectorizer.transform(new_email)
prediction = model.predict(X_new)
print("Spam" if prediction[0] == 1 else "Not Spam")

2. Types of Machine Learning

Supervised Learning

Learning with labeled data. The model learns from input-output pairs.

  • Classification: Predicting categories (e.g., spam/not spam)
  • Regression: Predicting continuous values (e.g., house prices)

Example: House Price Prediction

from sklearn.linear_model import LinearRegression
import numpy as np

# Features: [size_in_sqft, bedrooms, age]
X = np.array([
    [1500, 3, 5],
    [2000, 4, 3],
    [1200, 2, 8],
    [2500, 5, 2]
])
# Target: price in thousands
y = np.array([300, 450, 250, 550])

model = LinearRegression()
model.fit(X, y)

# Predict price for new house
new_house = np.array([[1800, 3, 4]])
predicted_price = model.predict(new_house)
print(f"Predicted price: ${predicted_price[0]:.0f},000")

Unsupervised Learning

Learning patterns from unlabeled data without predefined outputs.

  • Clustering: Grouping similar data points
  • Dimensionality Reduction: Reducing features while preserving information

Example: Customer Segmentation

from sklearn.cluster import KMeans
import numpy as np

# Customer data: [annual_spending, visits_per_year]
customers = np.array([
    [1200, 8], [3500, 15], [800, 5],
    [2800, 12], [1500, 10], [4200, 20]
])

kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(customers)

print("Customer segments:", clusters)
print("Cluster centers:", kmeans.cluster_centers_)

Reinforcement Learning

Learning through trial and error using rewards and penalties.

  • Agent interacts with environment
  • Receives rewards for good actions
  • Learns optimal strategy over time

Example: Game Playing

# Minimal tabular Q-learning sketch (toy 1D corridor, an assumed environment)
# States 0-4; reaching state 4 wins. Actions: 0 = left, 1 = right.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))  # Q[state, action] = expected future reward
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate
rng = np.random.default_rng(42)

for episode in range(500):
    state = 0
    while state != 4:
        # Epsilon-greedy policy: explore sometimes, otherwise exploit
        action = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[state].argmax())
        next_state = min(state + 1, 4) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == 4 else 0.0  # +1 for reaching the goal
        # Q-learning update rule
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print("Learned policy (0=left, 1=right):", Q.argmax(axis=1))
# The same trial-and-error loop underlies agents for Chess, Go, video games, and robotics.

3. Data Preprocessing

Data preprocessing is crucial for building effective ML models. Raw data is often incomplete, inconsistent, or in a format that's difficult to use.

Key Steps:

  1. Handling Missing Values: Remove or impute missing data
  2. Encoding Categorical Variables: Convert text categories to numbers
  3. Feature Scaling: Normalize or standardize numerical features
  4. Train-Test Split: Separate data for training and testing

Complete Preprocessing Pipeline

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Load data
data = pd.DataFrame({
    'age': [25, 30, np.nan, 35, 40],
    'salary': [50000, 60000, 70000, np.nan, 90000],
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA'],
    'purchased': ['Yes', 'No', 'Yes', 'No', 'Yes']
})

# 1. Handle missing values
imputer = SimpleImputer(strategy='mean')
data[['age', 'salary']] = imputer.fit_transform(data[['age', 'salary']])

# 2. Encode categorical variables
# (LabelEncoder assigns arbitrary integer codes; for a nominal feature
# like 'city', one-hot encoding is usually the safer choice)
label_encoder = LabelEncoder()
data['city'] = label_encoder.fit_transform(data['city'])
data['purchased'] = label_encoder.fit_transform(data['purchased'])

# 3. Feature scaling
scaler = StandardScaler()
features = ['age', 'salary', 'city']
X = scaler.fit_transform(data[features])
y = data['purchased']

# 4. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Preprocessed data shape:", X_train.shape)

4. Linear Regression

Linear Regression, the simplest regression algorithm, models the relationship between a dependent variable and one or more independent variables with a straight line (or hyperplane).

Formula: y = mx + b (simple linear regression) or y = b₀ + b₁x₁ + b₂x₂ + ... (multiple linear regression)

Key Concepts:

  • Slope (m): Rate of change
  • Intercept (b): Value when x = 0
  • Cost Function: Mean Squared Error (MSE)
  • Optimization: Gradient Descent
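
The gradient-descent optimization listed above can be sketched from scratch for the simple y = mx + b case. This is a minimal illustration of the update rule (scikit-learn's LinearRegression solves the problem analytically rather than iterating like this):

```python
import numpy as np

# Toy data roughly following y = 2x + 1
rng = np.random.default_rng(0)
x = rng.random(50) * 10
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)

m, b = 0.0, 0.0   # start from zero
lr = 0.01         # learning rate
for _ in range(2000):
    y_hat = m * x + b
    # Gradients of MSE with respect to m and b
    grad_m = -2 * np.mean(x * (y - y_hat))
    grad_b = -2 * np.mean(y - y_hat)
    m -= lr * grad_m
    b -= lr * grad_b

print(f"m ≈ {m:.2f}, b ≈ {b:.2f}")  # should land near the true 2 and 1
```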

Implementation Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2.5 * X.flatten() + 1.5 + np.random.randn(100) * 2

# Create and train model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Evaluate
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

print(f"Coefficient: {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"MSE: {mse:.2f}")
print(f"R² Score: {r2:.2f}")

# Visualize
plt.scatter(X, y, alpha=0.5, label='Data')
plt.plot(X, y_pred, 'r-', label='Prediction')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.title('Linear Regression')
plt.show()

5. Logistic Regression

Logistic Regression is used for binary classification problems. It uses the sigmoid function to output probabilities between 0 and 1.

Sigmoid Function: σ(z) = 1 / (1 + e^(-z))

Key Concepts:

  • Outputs probability (0 to 1)
  • Uses logistic function (S-curve)
  • Decision boundary at 0.5
  • Cost function: Log Loss
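
The sigmoid and the 0.5 decision boundary are a few lines of NumPy. A hand-rolled sketch (for intuition only, not what LogisticRegression runs internally):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 3.0])      # raw model scores
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)  # decision boundary at 0.5
print(probs)   # sigmoid(0) is exactly 0.5
print(labels)  # [0 1 1]
```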

Classification Example

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Generate classification data
X, y = make_classification(
    n_samples=1000,
    n_features=4,
    n_classes=2,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Show probability for first prediction
print(f"\nFirst prediction probability: {y_proba[0]}")
print(f"Predicted class: {y_pred[0]}")

Intermediate Level

Explore more sophisticated algorithms and techniques for complex problems

6. Decision Trees

Decision Trees create a model that predicts the value of a target variable by learning simple decision rules inferred from data features.

Key Concepts:

  • Nodes: Decision points (features)
  • Branches: Outcomes (values)
  • Leaves: Final predictions
  • Splitting Criteria: Gini Impurity, Entropy, Information Gain
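
The splitting criteria above can be computed by hand from a node's class proportions. A small sketch of Gini impurity and entropy:

```python
import numpy as np

def gini(p):
    """Gini impurity for a class-probability vector p."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability classes."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]))     # 0.5  (most impure two-class node)
print(gini([1.0, 0.0]))     # 0.0  (pure node: nothing to split)
print(entropy([0.5, 0.5]))  # 1.0 bit
```

Information gain is then the parent's impurity minus the weighted impurity of its children; the tree greedily picks the split that maximizes it.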

Decision Tree Implementation

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Evaluate
accuracy = tree.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")

# Visualize tree
plt.figure(figsize=(15, 10))
plot_tree(tree, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True)
plt.title("Decision Tree Visualization")
plt.show()

# Feature importance
importances = tree.feature_importances_
for feature, importance in zip(iris.feature_names, importances):
    print(f"{feature}: {importance:.3f}")

7. Random Forest

Random Forest is an ensemble method that combines multiple decision trees to create a more robust and accurate model.

Key Concepts:

  • Ensemble Learning: Combining multiple models
  • Bootstrap Aggregating (Bagging): Training on random subsets
  • Random Feature Selection: Using random features at each split
  • Voting: Final prediction from majority vote
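
Bagging and voting can be illustrated without any trees at all: draw bootstrap samples and take a majority vote over per-tree predictions (the votes below are hard-coded for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(10)  # indices of 10 training rows

# Bagging: each "tree" trains on a bootstrap sample (drawn with replacement),
# so each tree sees some rows repeated and some not at all
for t in range(3):
    sample = rng.choice(data, size=len(data), replace=True)
    print(f"tree {t} sees rows: {sorted(sample)}")

# Voting: majority class across the ensemble's predictions for one test point
votes = np.array([1, 0, 1, 1, 0])  # five trees' predicted classes (made up)
majority = np.bincount(votes).argmax()
print("ensemble prediction:", majority)  # 1 (three of five trees voted 1)
```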

Random Forest Example

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Generate data
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_classes=3,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Random Forest
rf = RandomForestClassifier(
    n_estimators=100,  # Number of trees
    max_depth=10,
    random_state=42
)
rf.fit(X_train, y_train)

# Predictions
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2%}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Feature importance
importances = rf.feature_importances_
print(f"\nTop 5 Most Important Features:")
indices = importances.argsort()[-5:][::-1]
for i in indices:
    print(f"Feature {i}: {importances[i]:.3f}")

8. Support Vector Machines (SVM)

SVM finds the optimal hyperplane that separates classes with maximum margin.

Key Concepts:

  • Support Vectors: Data points closest to hyperplane
  • Margin: Distance between hyperplane and nearest points
  • Kernel Trick: Transforming data to higher dimensions
  • Kernels: Linear, Polynomial, RBF (Radial Basis Function)
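
The RBF kernel is just a similarity score, K(x, z) = exp(-γ·‖x − z‖²): 1 for identical points, decaying toward 0 as they move apart. A minimal sketch of what SVC evaluates between pairs of points:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-gamma * np.dot(diff, diff))

a = np.array([0.0, 0.0])
b = np.array([0.0, 1.0])
print(rbf_kernel(a, a))  # 1.0  (identical points: maximum similarity)
print(rbf_kernel(a, b))  # exp(-1) ≈ 0.368
```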

SVM with Different Kernels

from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# Generate non-linearly separable data
X, y = make_circles(n_samples=500, noise=0.1, factor=0.2, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# SVM with RBF kernel (handles non-linear data)
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale')
svm_rbf.fit(X_train, y_train)

# SVM with polynomial kernel
svm_poly = SVC(kernel='poly', degree=3, C=1.0)
svm_poly.fit(X_train, y_train)

# Evaluate
rbf_acc = accuracy_score(y_test, svm_rbf.predict(X_test))
poly_acc = accuracy_score(y_test, svm_poly.predict(X_test))

print(f"RBF Kernel Accuracy: {rbf_acc:.2%}")
print(f"Polynomial Kernel Accuracy: {poly_acc:.2%}")

# Show support vectors
print(f"\nNumber of support vectors (RBF): {len(svm_rbf.support_vectors_)}")

9. K-Means Clustering

K-Means is an unsupervised clustering algorithm that partitions data into k clusters based on similarity.

Algorithm Steps:

  1. Initialize k centroids randomly
  2. Assign each data point to nearest centroid
  3. Update centroids to mean of assigned points
  4. Repeat until convergence
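
The four steps above can be written directly in NumPy. A teaching sketch of the assign/update loop on hand-picked toy points (not a substitute for sklearn's KMeans):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])  # step 1: initial guesses

for _ in range(5):  # steps 2-4, repeated (here a fixed few iterations)
    # step 2: assign each point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # step 3: move each centroid to the mean of its assigned points
    centroids = np.array([X[assign == k].mean(axis=0) for k in range(2)])

print("assignments:", assign)  # [0 0 1 1]
print("centroids:\n", centroids)
```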

K-Means Clustering Example

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
X, y_true = make_blobs(
    n_samples=300,
    centers=4,
    n_features=2,
    random_state=42
)

# Apply K-Means
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)

# Visualize
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis')
plt.title('True Clusters')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=y_pred, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], 
            kmeans.cluster_centers_[:, 1],
            c='red', marker='x', s=200, linewidths=3)
plt.title('K-Means Clusters')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.tight_layout()
plt.show()

# Find optimal k using Elbow Method
inertias = []
K_range = range(1, 10)
for k in K_range:
    kmeans_temp = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans_temp.fit(X)
    inertias.append(kmeans_temp.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()

10. Neural Networks Basics

Neural Networks are computing systems inspired by biological neural networks. They consist of interconnected nodes (neurons) organized in layers.

Architecture:

  • Input Layer: Receives features
  • Hidden Layers: Process information
  • Output Layer: Produces predictions
  • Weights & Biases: Learnable parameters
  • Activation Functions: ReLU, Sigmoid, Tanh
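
A forward pass through a tiny network makes the weights, biases, and activation concrete. The numbers below are arbitrary illustrative values:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

x = np.array([1.0, 2.0])          # input layer: 2 features
W1 = np.array([[0.5, -1.0],
               [1.0,  1.0]])      # weights: 2 inputs -> 2 hidden units
b1 = np.array([0.0, -0.5])
W2 = np.array([1.0, 2.0])         # weights: hidden -> single output
b2 = 0.1

hidden = relu(x @ W1 + b1)        # hidden layer: weighted sum + activation
output = hidden @ W2 + b2         # output layer (no activation: regression)
print(hidden)  # [2.5 0.5]
print(output)  # 3.6
```

Training adjusts W1, b1, W2, b2 so that outputs like this match the targets.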

Simple Neural Network

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Generate data
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_classes=3,
    random_state=42
)

# Preprocess
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Create Neural Network
mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),  # Two hidden layers
    activation='relu',
    solver='adam',
    alpha=0.01,  # L2 penalty
    learning_rate='adaptive',
    max_iter=500,
    random_state=42
)

# Train
mlp.fit(X_train, y_train)

# Predict
y_pred = mlp.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2%}")
print(f"Number of layers: {mlp.n_layers_}")
print(f"Number of iterations: {mlp.n_iter_}")
print(f"Loss: {mlp.loss_:.4f}")

Advanced Level

Master cutting-edge techniques and deep learning architectures

11. Deep Learning Fundamentals

Deep Learning uses neural networks with multiple hidden layers to learn complex patterns in data.

Key Concepts:

  • Backpropagation: Learning algorithm for training
  • Gradient Descent: Optimization method
  • Batch Normalization: Stabilizing training
  • Dropout: Preventing overfitting

Deep Neural Network with TensorFlow/Keras

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

# Load and preprocess data
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Normalize pixel values
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Flatten images
x_train = x_train.reshape((60000, 28 * 28))
x_test = x_test.reshape((10000, 28 * 28))

# Build model
model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=(28 * 28,)),
    layers.Dropout(0.2),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(10, activation='softmax')
])

# Compile
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train
history = model.fit(
    x_train, y_train,
    epochs=5,
    batch_size=128,
    validation_split=0.2
)

# Evaluate
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_acc:.2%}")

12. Convolutional Neural Networks (CNN)

CNNs are specialized neural networks for processing grid-like data such as images. They use convolutional layers to detect patterns.

Architecture Components:

  • Convolutional Layers: Feature detection
  • Pooling Layers: Dimensionality reduction
  • Fully Connected Layers: Classification
  • Filters/Kernels: Learnable feature detectors
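
A convolutional layer slides a small filter over the image and sums elementwise products at each position. A from-scratch sketch of the "valid" cross-correlation a Conv2D filter computes before bias and activation (toy image and filter made up for illustration):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' cross-correlation of a 2D image with a 2D filter."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Elementwise product of the filter with the current image patch
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [1, 1, 0, 0]], dtype=float)
edge_filter = np.array([[1.0, -1.0]])  # responds where brightness drops left-to-right

result = conv2d(image, edge_filter)
print(result)  # nonzero (1.0) only at the vertical edge
```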

CNN for Image Classification

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Load CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

# Normalize
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# Build CNN
model = keras.Sequential([
    # Convolutional base
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    
    # Dense classifier
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train
model.fit(
    x_train, y_train,
    epochs=10,
    batch_size=64,
    validation_data=(x_test, y_test)
)

# Evaluate
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_acc:.2%}")

13. Recurrent Neural Networks (RNN & LSTM)

RNNs are designed for sequence data. LSTMs (Long Short-Term Memory) are a special type that can learn long-term dependencies.

Applications:

  • Natural Language Processing
  • Time Series Prediction
  • Speech Recognition
  • Machine Translation

LSTM for Time Series Prediction

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Generate time series data
def generate_sequence(length):
    time = np.arange(0, length, 0.1)
    data = np.sin(time) + np.random.normal(0, 0.1, time.size)
    return data

# Prepare sequences
def create_sequences(data, seq_length):
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:i+seq_length])
        y.append(data[i+seq_length])
    return np.array(X), np.array(y)

# Generate data
data = generate_sequence(10)
seq_length = 20
X, y = create_sequences(data, seq_length)

# Reshape for LSTM: [samples, time_steps, features]
X = X.reshape((X.shape[0], X.shape[1], 1))

# Split
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Build LSTM model
model = keras.Sequential([
    layers.LSTM(50, activation='relu', return_sequences=True, input_shape=(seq_length, 1)),
    layers.LSTM(50, activation='relu'),
    layers.Dense(1)
])

model.compile(optimizer='adam', loss='mse')

# Train
model.fit(X_train, y_train, epochs=20, batch_size=32, verbose=0)

# Evaluate
train_loss = model.evaluate(X_train, y_train, verbose=0)
test_loss = model.evaluate(X_test, y_test, verbose=0)
print(f"Train Loss: {train_loss:.4f}")
print(f"Test Loss: {test_loss:.4f}")

14. Natural Language Processing (NLP)

NLP enables machines to understand, interpret, and generate human language.

Key Techniques:

  • Tokenization: Breaking text into words/tokens
  • Word Embeddings: Word2Vec, GloVe, BERT
  • TF-IDF: Term frequency-inverse document frequency
  • Sentiment Analysis: Classifying emotions
  • Text Classification: Categorizing documents
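
TF-IDF can be computed by hand for a toy corpus. The sketch below uses the plain textbook formula tf × log(N/df); scikit-learn's TfidfVectorizer uses a smoothed variant plus normalization, so its numbers differ slightly:

```python
import math

docs = [["cat", "sat", "mat"],
        ["cat", "cat", "hat"],
        ["dog", "sat"]]

def tf(term, doc):
    """Term frequency: share of the document made up by this term."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: rare-across-corpus terms score higher."""
    n_containing = sum(term in d for d in docs)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "cat" is frequent in doc 1 but appears in 2 of 3 docs -> moderate weight
print(tfidf("cat", docs[1], docs))  # (2/3) * log(3/2)
# "hat" is rare across the corpus -> higher idf
print(tfidf("hat", docs[1], docs))  # (1/3) * log(3)
```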

Text Classification with NLP

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Sample text data
texts = [
    "I love this product! It's amazing.",
    "Terrible service, very disappointed.",
    "Great quality, highly recommend!",
    "Poor quality, waste of money.",
    "Excellent customer service and fast delivery.",
    "Not worth the price, cheaply made."
]
labels = ['positive', 'negative', 'positive', 'negative', 'positive', 'negative']

# Convert labels to numbers
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(labels)

# Vectorize text
vectorizer = TfidfVectorizer(max_features=100, stop_words='english')
X = vectorizer.fit_transform(texts)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2%}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_))

# Predict new text
new_text = ["This is a fantastic product!"]
new_vector = vectorizer.transform(new_text)
prediction = model.predict(new_vector)
print(f"\nPrediction: {le.inverse_transform(prediction)[0]}")

15. Model Evaluation & Metrics

Proper evaluation is crucial for understanding model performance and making improvements.

Classification Metrics:

  • Accuracy: Overall correctness
  • Precision: True positives / (True positives + False positives)
  • Recall: True positives / (True positives + False negatives)
  • F1-Score: Harmonic mean of precision and recall
  • ROC-AUC: Area under ROC curve

Regression Metrics:

  • MAE: Mean Absolute Error
  • MSE: Mean Squared Error
  • RMSE: Root Mean Squared Error
  • R² Score: Coefficient of determination
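
Applying the classification formulas above to concrete (made-up) confusion-matrix counts:

```python
# Confusion-matrix counts for a binary classifier (toy numbers)
tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)                          # of predicted positives, how many were right
recall    = tp / (tp + fn)                          # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2f}")   # 0.85
print(f"Precision: {precision:.2f}")  # 0.80
print(f"Recall:    {recall:.2f}")     # 0.89
print(f"F1-Score:  {f1:.2f}")         # 0.84
```

Note how precision and recall pull in opposite directions: lowering the decision threshold catches more positives (recall up) at the cost of more false alarms (precision down).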

Comprehensive Model Evaluation

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve
)
import matplotlib.pyplot as plt

# Generate data
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_classes=2,
    random_state=42
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_proba)

print("Model Evaluation Metrics:")
print(f"Accuracy:  {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall:    {recall:.2%}")
print(f"F1-Score:  {f1:.2%}")
print(f"ROC-AUC:   {roc_auc:.2%}")

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Cross-validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"\nCross-Validation Accuracy: {cv_scores.mean():.2%} (+/- {cv_scores.std() * 2:.2%})")

16. Hyperparameter Tuning

Hyperparameter tuning optimizes model performance by finding the best configuration.

Methods:

  • Grid Search: Exhaustive search over parameter grid
  • Random Search: Random sampling of parameters
  • Bayesian Optimization: Efficient search using probability
  • Early Stopping: Stopping training when validation loss stops improving

Hyperparameter Tuning Example

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate data
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_classes=2,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Base model
rf = RandomForestClassifier(random_state=42)

# Grid Search
print("Performing Grid Search...")
grid_search = GridSearchCV(
    rf, param_grid, cv=5,
    scoring='accuracy', n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.2%}")

# Test on test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2%}")

# Random Search (faster for large parameter spaces)
print("\nPerforming Random Search...")
random_search = RandomizedSearchCV(
    rf, param_grid, n_iter=20, cv=5,
    scoring='accuracy', n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_:.2%}")