Kaggle’s Brain Cancer MRI Dataset Classification Solution


The provided Python script offers a step-by-step solution to the Kaggle “Brain Cancer – MRI Dataset” challenge, which involves classifying brain MRI images into two categories: those with brain tumors (“yes”) and those without (“no”). This explanation breaks down the code into its key components, detailing the methodology, rationale, and implementation choices.

Overview

The solution leverages a Convolutional Neural Network (CNN) implemented using TensorFlow/Keras to perform binary image classification. The dataset, sourced from Kaggle, contains MRI images, and the task is to train a model to accurately predict the presence of a tumor. The process includes data preparation, model training, evaluation, and visualization of results.

Step-by-Step Explanation

Step 1: Set Up Environment and Load Data

  • Purpose: Initializes the environment and downloads the dataset from Kaggle.
  • Implementation: The script uses Google Colab to upload a kaggle.json file for authentication, configuring the Kaggle API. The “brain-mri-images-for-brain-tumor-detection” dataset is then downloaded and unzipped into a local directory (/content/brain_mri).
  • Rationale: Colab provides free GPU access, which is essential for training deep learning models in reasonable time. The Kaggle API ensures reproducible access to the dataset, which contains 253 images split between tumor and non-tumor cases.

Step 2: Prepare Data

  • Purpose: Loads and preprocesses the MRI images into a format suitable for ML.
  • Implementation:
    • Iterates over the yes and no folders, reading images with cv2.imread.
    • Resizes all images to 128×128 pixels for uniformity using cv2.resize.
    • Stores images in X (features) and labels in y (1 for tumor, 0 for no tumor), converting to NumPy arrays.
  • Rationale: Resizing ensures consistent input dimensions for the CNN. The binary labeling matches the classification task, and NumPy arrays provide the contiguous numeric format TensorFlow expects for tensor operations. (A common companion step, pixel normalization, is sketched below.)
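
A common companion to resizing, not included in the solution script below, is scaling the 0–255 pixel intensities into the [0, 1] range before training, which usually stabilizes optimization. The following is purely an illustrative sketch of that optional step, reusing the X array assembled in this step:

import numpy as np

# Illustrative only: scale 8-bit pixel intensities to [0, 1] floats.
# X is the (num_images, 128, 128, 3) array built in Step 2 of the solution.
def normalize_images(X):
    return np.asarray(X, dtype='float32') / 255.0

# Example usage: X = normalize_images(X)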

Step 3: Split and Augment Data

  • Purpose: Divides data into training, validation, and test sets, and augments training data to prevent overfitting.
  • Implementation:
    • Splits data using train_test_split (80% train, 20% test), then holds out a further 20% of the training portion as validation (roughly 64% train, 16% validation, and 20% test overall).
    • Applies ImageDataGenerator with transformations (rotation, shifts, flips) to artificially expand the training set.
  • Rationale: The small dataset size (253 images) risks overfitting. Data augmentation increases diversity, while the split ensures unbiased evaluation. Validation helps tune the model during training.

Step 4: Build and Train CNN Model

  • Purpose: Constructs and trains a CNN to learn tumor features from MRI images.
  • Implementation:
    • Defines a Sequential model with:
      • Three Conv2D layers (32, 64, 128 filters) for feature extraction.
      • Corresponding MaxPooling2D layers to reduce spatial dimensions (a quick size check follows this list).
      • Flatten to transition to dense layers, followed by a Dense layer (128 units), Dropout (0.5) for regularization, and a final Dense layer (1 unit) with sigmoid activation for binary output.
    • Compiles with adam optimizer and binary_crossentropy loss, training for 20 epochs with a batch size of 32.
  • Rationale: CNNs excel at image classification by learning spatial hierarchies. The architecture balances depth and complexity for the dataset size, while Dropout prevents overfitting. Sigmoid activation suits binary classification.
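
To make the size-reduction point concrete, a quick back-of-the-envelope check (assuming the default "valid" padding used in the solution script below) shows how the feature maps shrink from 128×128 down to 14×14 before flattening; the helper function here is illustrative, not part of the solution:

# Spatial size after one 3x3 valid-padding convolution followed by 2x2 max-pooling
def out_size(size, kernel=3, pool=2):
    return (size - kernel + 1) // pool

size = 128
for filters in (32, 64, 128):
    size = out_size(size)
    print(f'After the {filters}-filter block: {size}x{size}')
# Prints 63x63, 30x30, 14x14, so Flatten() yields 14 * 14 * 128 = 25,088 features.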

Step 5: Evaluate the Model

  • Purpose: Assesses the model’s performance on unseen test data.
  • Implementation: Uses model.evaluate to compute test loss and accuracy, printing the accuracy.
  • Rationale: Evaluation on the test set provides an unbiased measure of generalization. Accuracy is a suitable headline metric for binary classification when the classes are reasonably balanced, though additional metrics (e.g., precision, recall) would enhance the analysis; a sketch of computing them follows below.
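
As one way to go beyond accuracy, the sketch below (an illustrative addition, reusing the model, X_test, and y_test names from the solution script) computes a confusion matrix and per-class precision and recall with scikit-learn:

from sklearn.metrics import classification_report, confusion_matrix

# Assumes `model`, `X_test`, and `y_test` from the solution script below.
probs = model.predict(X_test)
preds = (probs > 0.5).astype(int).ravel()

print(confusion_matrix(y_test, preds))  # rows = actual, columns = predicted
print(classification_report(y_test, preds, target_names=['no tumor', 'tumor']))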

Step 6: Visualize Training Results

  • Purpose: Provides insights into model training progress.
  • Implementation: Plots training and validation accuracy/loss over epochs using matplotlib, with subplots for clarity.
  • Rationale: Visualizations help identify overfitting (e.g., if training accuracy exceeds validation) or underfitting (e.g., low accuracy). This aids in adjusting epochs or model complexity.

Step 7: Make Predictions

  • Purpose: Demonstrates the model’s practical application.
  • Implementation: Uses model.predict to generate probabilities, converting to binary classes (> 0.5 threshold), and prints the first five actual vs. predicted labels.
  • Rationale: This step validates the model’s output and allows manual inspection of errors, guiding further improvements.

Technical Considerations

  • Hardware: The script assumes GPU support (via Colab), critical for CNN training. Without it, training time would increase significantly.
  • Limitations: The small dataset (253 images) limits model robustness. Augmentation mitigates this, but more data or transfer learning (e.g., pre-trained models like ResNet) could enhance performance; a minimal transfer-learning sketch follows this list.
  • Extensions: Adding data from other sources, implementing cross-validation, or using advanced architectures (e.g., Inception) could improve results.
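
As a rough illustration of the transfer-learning route mentioned above (not part of the original solution; the ResNet50 backbone and layer sizes are illustrative assumptions), a pre-trained ImageNet model can replace the from-scratch convolutional stack as a frozen feature extractor:

from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

# Frozen ImageNet backbone used as a feature extractor (illustrative choice).
base = ResNet50(weights='imagenet', include_top=False, input_shape=(128, 128, 3))
base.trainable = False

transfer_model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid')  # binary tumor / no-tumor output
])
transfer_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Inputs should first go through tensorflow.keras.applications.resnet50.preprocess_input;
# transfer_model.fit(...) would then replace the Step 4 training call.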

This solution provides a foundational approach to classifying brain cancer MRI images using a CNN. The process, from data loading to prediction, demonstrates a practical ML workflow. With a test accuracy typically in the 85–95% range (depending on initialization and the random split), the model is a reasonable starting point. Future enhancements could involve fine-tuning hyperparameters, expanding the dataset, or integrating clinical validation to transition from research to real-world application.

Solution:

import numpy as np
import pandas as pd
import os
import cv2
from tqdm import tqdm
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Step 1: Set up environment and load data from Kaggle

from google.colab import files
files.upload() # Upload kaggle.json
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d navoneel/brain-mri-images-for-brain-tumor-detection
!unzip brain-mri-images-for-brain-tumor-detection.zip -d /content/brain_mri

# Step 2: Prepare data

data_dir = '/content/brain_mri/brain_tumor_dataset'
X = []
y = []
for folder in ['yes', 'no']:
    folder_path = os.path.join(data_dir, folder)
    for img_name in tqdm(os.listdir(folder_path)):
        img_path = os.path.join(folder_path, img_name)
        img = cv2.imread(img_path)
        if img is None:  # Skip files OpenCV cannot read
            continue
        img = cv2.resize(img, (128, 128))  # Resize to 128x128 for uniform input size
        X.append(img)
        y.append(1 if folder == 'yes' else 0)  # 1 = tumor, 0 = no tumor

X = np.array(X)
y = np.array(y)

# Step 3: Split and augment data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Data augmentation

datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)
datagen.fit(X_train)

# Step 4: Build and train CNN model

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')  # Sigmoid output for binary classification
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(datagen.flow(X_train, y_train, batch_size=32),
                    validation_data=(X_val, y_val),
                    epochs=20,
                    verbose=1)

# Step 5: Evaluate the model

test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {test_accuracy:.4f}")

# Step 6: Visualize training results

plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Step 7: Make predictions

predictions = model.predict(X_test)
predicted_classes = (predictions > 0.5).astype(int)
for i in range(5):
    print(f"Actual: {y_test[i]}, Predicted: {predicted_classes[i][0]}")
