Exam Hints from Prof. Vu
- "I will only ask you the content that is on the slide." No transfer questions.
- Coding questions: explain what a line does, find bugs — NOT write from scratch.
- T/F: Each statement = 1 point. If unsure, write brief reasoning keywords for partial credit.
- "In the exam, I showed you an example — if I ask you the same example, you should be able to do it."
- Keep responses short; there is no need to be wordy.
- Only pen and paper allowed. 90 minutes. 77 points total.
Study Principles
- Verbal First: Say the answer out loud before writing it down. The verbal step is where learning happens.
- Active Recall: Click a question, try to answer it from memory, THEN check.
- Breadth First: Cover every topic at surface level before going deep on any one.
- Calculation Muscle Memory: Knowing the formula conceptually ≠ executing it under time pressure. Drill the computations.
- Priority Order: R1 Mock → R2 Quizzes → R3 Exercises → R4 Memory Protocols.
Linear Algebra & Calculus (supports Tasks 2–5)
13 questions
Matrix-vector multiplication practice. Row 1: 1·2 + 2·0 + 0·(-2) + 1·1 = 3. Row 2: (-3)·2 + 2·0 + 1·(-2) + 0·1 = -8. Result: [3, -8]ᵀ
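A quick NumPy check of this worked example (matrix and vector reconstructed from the row computations above):

```python
import numpy as np

# Matrix rows and input vector reconstructed from the row-by-row computation above
A = np.array([[ 1, 2, 0, 1],
              [-3, 2, 1, 0]])
x = np.array([2, 0, -2, 1])

result = A @ x  # each output entry is one row of A dotted with x
```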
3×3 Jacobian matrix. The Jacobian of g: ℝⁿ→ℝᵐ has dimensions m×n. Here m=3, n=3 → 3×3.
The Jacobian is a 3×2 matrix where entry (i,j) = ∂gᵢ/∂xⱼ. Compute each partial derivative and arrange in the matrix. Practice with specific functions like g(x₁,x₂) = [x₁², x₁x₂, x₂³].
Favorable outcomes for sum=7: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1) = 6 outcomes. Total outcomes: 6×6 = 36. P = 6/36 = 1/6.
Doubles: (1,1), (2,2), (3,3), (4,4), (5,5), (6,6) = 6 outcomes. Total: 36. P = 6/36 = 1/6.
No. Adding two one-hot vectors gives [2,0,0...] or similar — violates the 0/1 constraint. One-hot vectors don't form a vector space because the space isn't closed under addition.
Cosine similarity = (a·b)/(||a|| · ||b||). If result = 0, vectors are orthogonal (perpendicular). If = 1, parallel and same direction. If = -1, parallel and opposite direction. Practice computing dot products and norms for specific vectors.
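A minimal NumPy sketch for drilling this:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(a, b) = (a·b) / (||a|| · ||b||)"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```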
Derivative: For f: ℝ→ℝ, a single number f'(x). Gradient: For f: ℝⁿ→ℝ, a vector of partial derivatives ∇f. Jacobian: For f: ℝⁿ→ℝᵐ, an m×n matrix of all partial derivatives. The Jacobian generalizes the derivative to vector-valued functions.
Practice: dot products, matrix multiplication, transpose, determinant, inverse, norm, cross product, eigenvalues. These are foundational operations used in every forward/backward pass calculation.
Practice: power rule, chain rule, product rule. Key for backprop: d/dx(eˣ) = eˣ, d/dx(ln x) = 1/x, d/dx(σ(x)) = σ(x)(1-σ(x)), d/dx(tanh x) = 1-tanh²(x), d/dx(ReLU) = 1 if x>0 else 0.
Find eigenvalues by solving det(A - λI) = 0. For 2×2 matrix [[a,b],[c,d]]: λ² - (a+d)λ + (ad-bc) = 0. Use quadratic formula. Then find eigenvectors by solving (A - λI)v = 0 for each eigenvalue.
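The 2×2 case above, coded up for practice (assumes real eigenvalues, so the discriminant is non-negative):

```python
import math

def eigvals_2x2(a, b, c, d):
    """Roots of det(A - lI) = l^2 - (a+d)l + (ad - bc) = 0 via the quadratic formula."""
    trace, det = a + d, a * d - b * c
    disc = math.sqrt(trace ** 2 - 4 * det)  # assumes real eigenvalues
    return (trace + disc) / 2, (trace - disc) / 2
```

For [[2,1],[1,2]] this gives the eigenvalues 3 and 1.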
E[X] = Σ xᵢ · P(xᵢ). Var(X) = E[X²] - (E[X])². For continuous: use integrals. Key distributions: Bernoulli, Gaussian (normal), Uniform. Practice computing these for discrete probability tables.
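A drill on a hypothetical discrete probability table (values and probabilities made up):

```python
# Hypothetical discrete distribution: drill E[X] and Var(X)
xs = [0, 1, 2]
ps = [0.25, 0.5, 0.25]

ex  = sum(x * p for x, p in zip(xs, ps))        # E[X] = sum of x_i * P(x_i)
ex2 = sum(x ** 2 * p for x, p in zip(xs, ps))   # E[X^2]
var = ex2 - ex ** 2                             # Var(X) = E[X^2] - (E[X])^2
```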
Update rule: w_{t+1} = w_t - η · ∂L/∂w. Practice: (1) Compute the gradient of a loss function, (2) Apply the update rule for 2-3 steps with a given learning rate. Watch how the parameter moves toward the minimum.
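The update rule applied for three steps on a hypothetical loss L(w) = (w - 3)², whose gradient is 2(w - 3):

```python
# Gradient descent drill on a made-up convex loss L(w) = (w - 3)^2
def grad(w):
    return 2 * (w - 3)

w, eta = 0.0, 0.25
history = [w]
for _ in range(3):
    w = w - eta * grad(w)   # update rule: w <- w - eta * dL/dw
    history.append(w)
# w moves toward the minimum at w = 3
```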
Conceptual
12 questions
Verbatim from mock exam.
Key distinction: Regression predicts a continuous value. Classification predicts a discrete category/label.
NLP examples — Regression: predicting a sentiment score (1.0–5.0). Classification: spam detection, language identification, next-word prediction.
Classification. Words are discrete categories, not continuous numbers. The output is a probability distribution over the vocabulary.
Underfitting. The model can't even fit the training data well. Overfitting would show LOW training error but HIGH test error.
Average loss on the TRAINING data. Not the dev set. Empirical risk = (1/N) Σ L(f(xᵢ), yᵢ) over training samples. The "empirical" part means we use the actual observed data, not the true distribution.
No. The curse of dimensionality: more features means the data becomes sparser in high-dimensional space. Need exponentially more data to maintain the same density. Can lead to overfitting. Feature selection/reduction (PCA, etc.) can help.
K-means stops when: (1) No data points change cluster assignment, (2) Centroids don't move significantly, (3) Maximum number of iterations reached, or (4) Average distance to centroids falls below a threshold.
Feature engineering is the process of creating/selecting/transforming input features to improve model performance. Includes normalization, encoding categorical variables, creating interaction terms. Good features can make simple models perform well; bad features make even complex models fail.
Training curves plot loss/accuracy vs. epochs. Key patterns: (1) Both curves converging = good fit, (2) Training low but val high = overfitting, (3) Both high = underfitting, (4) Gap between curves = generalization gap. Use to decide: more data, more complexity, or regularization.
Kernel: A function that computes dot products in a higher-dimensional space without explicitly mapping to it (kernel trick). Support vectors: The data points closest to the decision boundary — they "support" it. Maximal margin: The largest possible gap between the decision boundary and the nearest data points of each class.
SVMs are natively binary classifiers. For multiclass: (1) One-vs-All (OvA): Train K classifiers, each separating one class from all others. (2) One-vs-One (OvO): Train K(K-1)/2 classifiers for each pair. Use voting for final decision.
Decision trees produce human-readable if/then rules. You can trace exactly why a prediction was made by following the path from root to leaf. Neural networks are "black boxes" — millions of parameters interact in non-linear ways, making it nearly impossible to explain individual predictions.
Feedforward NN with bag-of-words input, or CNN/RNN for sequence. Different random initializations lead to different local minima during training, so models may learn different features. Common practice: train multiple and ensemble, or pick best on validation.
Calculation
5 questions
Verbatim from mock exam.
Step 1 — predictions: \(f(x_1)=1\cdot1+0+1\cdot0+0=1\), \(f(x_2)=0+0+0+0=0\), \(f(x_3)=0+0+1+0=1\)
Step 2 — squared errors: \((2-1)^2=1\), \((1-0)^2=1\), \((3-1)^2=4\)
Step 3 — MSE: \(\frac{1}{3}(1+1+4) = \frac{6}{3} = 2\)
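A one-liner check of the three steps above, using the predictions and targets from the worked example:

```python
# MSE check: predictions [1, 0, 1] vs. targets [2, 1, 3] from the steps above
predictions = [1, 0, 1]
targets = [2, 1, 3]

mse = sum((y - f) ** 2 for y, f in zip(targets, predictions)) / len(targets)
```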
K-means algorithm: (1) Initialize K centroids randomly, (2) Assign each point to nearest centroid (Euclidean distance), (3) Recompute centroids as mean of assigned points, (4) Repeat until convergence. Practice the distance calculations and centroid updates by hand.
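Steps (2) and (3) in one iteration, sketched in NumPy (the example points in the test are made up):

```python
import numpy as np

def kmeans_step(X, centroids):
    """One k-means iteration: assign points to nearest centroid, then recompute means."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)   # step (2): nearest centroid per point
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])  # step (3): mean of cluster
    return labels, new_centroids
```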
Logistic regression: \(\hat{y} = \sigma(\mathbf{w}^T\mathbf{x} + b)\) where \(\sigma(z) = \frac{1}{1+e^{-z}}\). Output is a probability ∈ (0,1). Decision boundary at 0.5. Practice computing the sigmoid for specific values.
Same MSE formula as mock exam: \(\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2\). Practice computing predictions from weight vectors, then computing the loss.
Given weights w, bias b, and input x: compute z = wᵀx + b, then ŷ = σ(z) = 1/(1+e^(-z)). Remember: σ(0) = 0.5, σ is monotonic, σ(-z) = 1 - σ(z).
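A minimal sketch of that computation (the weight/input values in the test are made up):

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)), a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict(w, x, b):
    """Logistic regression: y_hat = sigma(w.x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)
```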
True/False
14 questions
| Statement | T | F |
|---|---|---|
| The goal of machine learning is to overfit the training data. | | ✓ |
| Regularization is used to prevent the model from overfitting. | ✓ | |
| Support Vector Machines (SVM) can only be used for binary classification. | | ✓ |
| K-means clustering is a partitional clustering algorithm. | ✓ | |
| Hyperparameters should be tuned on the test set. | | ✓ |
Verbatim from mock exam. Key: goal is generalization not overfitting; SVMs can do multiclass via OvA/OvO; hyperparams tuned on validation set, NOT test set.
TRUE — but nuanced. Parameters (weights) are learned during training and are model-dependent. Hyperparameters (learning rate, etc.) are set before training. Most hyperparameters ARE model-dependent (e.g., number of layers), but learning rate is a general hyperparameter not tied to a specific model architecture.
FALSE. k is a hyperparameter — it's set before training and not learned during the algorithm. The trainable "parameters" are the centroid positions.
FALSE. More data doesn't always help. Noisy data can hurt. Also, SVM performance depends on finding good support vectors — beyond a point, additional data points far from the decision boundary don't change the model.
TRUE. The entire point of regularization, early stopping, dropout, etc. is to ensure the model performs well on data it hasn't seen during training.
FALSE. SVMs can be extended to multiclass via One-vs-All or One-vs-One strategies. But natively, a single SVM is binary.
FALSE. KNN is non-parametric — it doesn't learn fixed parameters. k is a hyperparameter (set before "training"). KNN stores all training data and computes distances at prediction time.
TRUE. This is a valid convergence criterion. When no point is farther than a threshold from its centroid, the clustering is stable enough.
TRUE. The decision boundary depends only on the support vectors — the points closest to it. All other points could be removed without changing the boundary.
FALSE (in general). Gradient descent can get stuck in local minima for non-convex loss functions (like neural networks). For convex functions (like linear regression), it does find the global minimum.
TRUE. Linear Discriminant Analysis is supervised — it uses class labels to find a linear transformation that maximizes class separability (between-class variance relative to within-class variance).
| Statement | T | F |
|---|---|---|
| K-nearest neighbors algorithms is a parametric method. | | ✓ |
| Hyperparameters are trained to optimize the results on the training set. | | ✓ |
| Generalisation is a central problem of machine learning. | ✓ | |
| An overfitted model perfectly matches the evaluation data. | | ✓ |
| Regularization typically increases the error on the training set. | ✓ | |
| Logistic regression is typically used to predict real values. | | ✓ |
| Empirical risk is an average of a loss function on a finite development set. | | ✓ |
| Support vector machine can be used only for binary classification tasks. | | ✓ |
| Each data item is assigned to exactly one cluster with k-means clustering. | ✓ | |
| In active learning, systems actively select queries and request feedback from human. | ✓ | |
From 2024 Mock Exam. Key traps:
• Overfitted model ≠ matches evaluation data. Overfitting means matching TRAINING data too well, performing POORLY on evaluation/test data.
• Empirical risk uses TRAINING set, not development set — this is the same trick from the in-class quiz.
• 10 statements with minus scoring (wrong answers subtract points) — our existing mock had only 5. Budget more time for this.
Practice T/F statements from exercise sheets. These cover similar ground to the mock exam: overfitting, regularization, model selection, bias-variance tradeoff.
Conceptual
7 questions
Yes, somewhat. A single neuron computes a weighted sum + bias + activation — you can look at the weights to see which inputs matter most. But as networks get deeper, individual neurons become less interpretable because they represent increasingly abstract features.
MSE: regression tasks. Cross-entropy (CE): classification tasks. Binary CE: two-class problems. Categorical CE: multi-class (with softmax). CE is preferred for classification because it penalizes confident wrong predictions more heavily and has nicer gradients with softmax.
Sigmoid: σ(z) = 1/(1+e^(-z)), output (0,1), vanishing gradient for large |z|. Tanh: output (-1,1), zero-centered, still vanishing gradient. ReLU: max(0,z), no vanishing gradient for z>0, but "dying ReLU" for z<0. ReLU is most common in hidden layers; sigmoid/softmax for output layers.
The backward pass computes gradients of the loss w.r.t. each parameter using the chain rule. Starting from the output layer: (1) Compute error signal δ at output, (2) Propagate δ backward through each layer, (3) At each layer, compute ∂L/∂W = δ · aᵀ (activation from previous layer), (4) Update δ for next layer back using weights and activation derivative.
Without non-linear activations, stacking multiple linear layers collapses to a single linear transformation: W₂(W₁x + b₁) + b₂ = W'x + b'. The network couldn't learn any non-linear patterns regardless of depth. Non-linear activations let the network approximate arbitrary functions.
Deeper networks can represent increasingly abstract features hierarchically. Early layers learn simple patterns (edges), later layers combine them into complex concepts (faces). A single wide layer would need exponentially more neurons to represent the same functions. Depth = compositional power.
(1) Number of layers (depth), (2) Number of neurons per layer (width), (3) Learning rate. Others: activation function, batch size, number of epochs, optimizer choice, regularization strength.
Calculation
11 questions
Verbatim from mock exam.
Step 1 — BoW: x = [1,1,1,1]ᵀ (each word appears once)
Step 2 — Hidden: h = ReLU(W¹x) = ReLU([0.5, 0.5, 0.5]ᵀ) = [0.5, 0.5, 0.5]ᵀ
Step 3 — Output: z = W²h + b² = [0.25, 0.75]ᵀ
Step 4 — Softmax: \(\hat{y} = [\frac{e^{0.25}}{e^{0.25}+e^{0.75}}, \frac{e^{0.75}}{e^{0.25}+e^{0.75}}]\)
Step 5 — CE Loss: \(L = -\log(\hat{y}_{\text{class 1}})\) where class 1 = positive
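Steps 4–5 can be checked numerically from the logits z = [0.25, 0.75] (natural log assumed for the CE here):

```python
import math

# Softmax + CE for the logits from steps 3-5 above
z = [0.25, 0.75]
exps = [math.exp(v) for v in z]
y_hat = [e / sum(exps) for e in exps]   # softmax: probabilities summing to 1
ce_loss = -math.log(y_hat[0])           # CE loss when class 1 (index 0) is the target
```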
Verbatim from mock exam.
Step 1 — Error signal: δ² = ŷ - y = [0.88, 0.12] - [0, 1] = [0.88, -0.88]
Step 2 — Weight gradient: \(\nabla_{W^2} = \delta^2 \cdot (a^1)^T\) where a¹ is the hidden layer output from forward pass
Step 3 — Bias gradient: \(\nabla_{b^2} = \delta^2 = [0.88, -0.88]\)
Key insight: For softmax + CE, the error signal simplifies to ŷ - y (prediction minus target).
Same process as mock 2a but with different dimensions. (1) Compute z¹ = W¹x + b¹, (2) Apply ReLU: a¹ = max(0, z¹), (3) Compute z² = W²a¹ + b², (4) Apply softmax: ŷ = softmax(z²). Practice this until automatic.
For a layer mapping from n inputs to m outputs: Weights: m × n parameters, Biases: m parameters. Total per layer: m(n+1). For the whole network, sum across all layers. Example: 4→3→2 network = 3×4 + 3 + 2×3 + 2 = 12+3+6+2 = 23 parameters.
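The counting rule above as a one-line helper:

```python
def count_params(layer_sizes):
    """m*n weights + m biases per layer; e.g. [4, 3, 2] for a 4->3->2 network."""
    return sum(m * n + m for n, m in zip(layer_sizes, layer_sizes[1:]))
```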
Same procedure as mock exam. Practice with different weight matrices and activation functions. Key steps: (1) Linear: z = Wx + b, (2) Activation: a = f(z), (3) Repeat for each layer, (4) Final output with appropriate activation (softmax for classification).
Full backprop: (1) Compute δ at output layer (depends on loss function), (2) For each layer going backward: ∂L/∂W = δ · aᵀ (previous activation), ∂L/∂b = δ, then propagate: δ_prev = (Wᵀδ) ⊙ f'(z). The ⊙ is element-wise multiplication with the activation derivative.
Variant of mock exam 2a but with f(x) = x² as activation instead of ReLU. Same steps: encode input → matrix multiply → apply activation → matrix multiply → softmax → CE loss. Be careful: x² activation means f'(x) = 2x for the backward pass.
The key formulas: (1) Output error: δᴸ = ŷ - y (for softmax+CE), (2) Hidden error: δˡ = (Wˡ⁺¹)ᵀδˡ⁺¹ ⊙ f'(zˡ), (3) Weight gradient: ∂L/∂Wˡ = δˡ(aˡ⁻¹)ᵀ, (4) Bias gradient: ∂L/∂bˡ = δˡ. Make sure you can fill these in from memory.
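The four formulas applied once, using the mock's δ² numbers; the hidden activations, pre-activations, and W² are made up to give concrete shapes:

```python
import numpy as np

y_hat = np.array([0.88, 0.12])       # softmax output (from the mock)
y     = np.array([0.0, 1.0])         # one-hot target
a1    = np.array([0.5, 0.5, 0.5])    # hidden activations (made up)
z1    = np.array([0.5, -0.2, 0.5])   # hidden pre-activations (made up)
W2    = np.array([[1.0, 0.0, 1.0],   # output-layer weights (made up)
                  [0.0, 1.0, 1.0]])

delta_L = y_hat - y                        # (1) output error for softmax+CE
delta_1 = (W2.T @ delta_L) * (z1 > 0)      # (2) hidden error, ReLU derivative
grad_W2 = np.outer(delta_L, a1)            # (3) dL/dW2 = delta_L (a1)^T
grad_b2 = delta_L                          # (4) dL/db2 = delta_L
```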
From 2024 Mock Exam. Key differences from our existing mock:
• Same f=x² activation trick, but different weight matrices:
\(W^1 = \begin{bmatrix} 0 & 1 & -1 & 0 \\ 1 & -1 & 0 & 1 \end{bmatrix}\), \(W^2 = \begin{bmatrix} -1 & 1 \\ 1 & 0 \end{bmatrix}\), \(W^3 = \begin{bmatrix} 0 & 1 \\ 1 & -1 \\ -1 & 1 \end{bmatrix}\)
• 3-class output (positive/negative/neutral) with softmax
• Uses log₁₀ for cross-entropy (not ln) — watch the base!
• Round to 1 decimal point
Procedure: (1) Encode tweet as BoW → x = [1,0,0,1]ᵀ, (2) z¹ = W¹x, a¹ = (z¹)², (3) z² = W²a¹, a² = (z²)², (4) z³ = W³a², y = softmax(z³), (5) CE = -log₁₀(ŷ_true_class)
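A NumPy run of this procedure with the weights above; the true class isn't restated here, so only the intermediate values and logits are checked:

```python
import numpy as np

W1 = np.array([[0, 1, -1, 0],
               [1, -1, 0, 1]])
W2 = np.array([[-1, 1],
               [1, 0]])
W3 = np.array([[0, 1],
               [1, -1],
               [-1, 1]])
x = np.array([1, 0, 0, 1])        # BoW encoding of the tweet

a1 = (W1 @ x) ** 2                # f = x^2, element-wise
a2 = (W2 @ a1) ** 2
z3 = W3 @ a2                      # 3-class logits
exps = np.exp(z3 - z3.max())      # numerically stable softmax
y = exps / exps.sum()
# CE would then be -log10(y[true_class])
```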
From 2024 Mock Exam. Fill-in-the-blank format.
Box 1 (Forward pass):
\(z_i^l = \sum_j w_{ij}^l \cdot a_j^{l-1}\) (for l > 1), or \(z_i^l = \sum_j w_{ij}^l \cdot x_j\) (for l = 1)
Box 2 (Backward pass / δ):
\(\delta_i^l = \frac{\partial C}{\partial z_i^l}\) — the error signal at neuron i in layer l
This is a diagram-based question testing whether you understand how the chain rule decomposes into a forward pass component and a backward pass component.
From 2024 Mock Exam.
Why initialization matters: Neural networks use gradient descent to find local minima. Different initializations start the optimization at different points in the loss landscape, leading to different local minima → different final results. Bad initialization can cause vanishing/exploding gradients from the very start.
Xavier/Glorot initialization: \(W \sim \mathcal{N}(0, \frac{2}{n_{in} + n_{out}})\) — keeps variance of activations stable across layers. Good for sigmoid/tanh.
He initialization: \(W \sim \mathcal{N}(0, \frac{2}{n_{in}})\) — designed for ReLU activations.
Key point: we NEVER initialize all weights to the same value (e.g., all zeros) because then all neurons compute the same thing → symmetry problem → network can't learn different features.
Code / Bug-Finding
1 question
Verbatim from mock exam.
```python
for epoch in range(num_epochs):
    for batch_x, batch_y in train_loader:
        output = model(batch_x)
        loss = criterion(output, batch_y)
        loss.backward()

    print(f'Epoch {epoch}, Loss: {loss.item()}')
```
loss.backward() computes gradients of the loss w.r.t. all parameters via backpropagation (fills .grad attributes).
Missing operations:
1. optimizer.zero_grad() — must clear old gradients before backward() (otherwise they accumulate)
2. optimizer.step() — must update parameters using the computed gradients (after backward())
True/False
4 questions
| Statement | T | F |
|---|---|---|
| A neural network layer can be described as a linear transformation followed by a nonlinear activation. | ✓ | |
| Cross-entropy is often used as loss function for multi-class, multi-label classification. | ✓ | |
| In the backward pass, we start computing from δ¹. | | ✓ |
| A neural network typically consists of multiple layers. | ✓ | |
| ReLU activation is defined as min(0, z). | | ✓ |
Verbatim from mock exam. Key: backward starts from OUTPUT (δᴸ), not δ¹. ReLU = max(0,z) not min.
TRUE — this is tricky. Binary cross-entropy can be applied per-label independently for multi-label problems. Each label gets its own sigmoid output and BCE loss. The losses are then summed.
TRUE. By definition, a neural network has at least an input layer, one or more hidden layers, and an output layer. Even the simplest useful NN has multiple layers.
Conceptual
7 questions
Verbatim from mock exam.
RNNs have recurrent connections — they maintain a hidden state that is updated at each timestep, creating a "memory" of previous inputs. FFNNs process each input independently with no memory.
RNNs handle sequential/variable-length input data (text, speech, time series) where order matters. FFNNs require fixed-size input.
Verbatim from mock exam.
\(a_t = f(W_i x_t + W_h a_{t-1} + b)\)
Where: \(a_t\) = hidden state at time t, \(x_t\) = input at time t, \(a_{t-1}\) = previous hidden state, \(W_i\) = input-to-hidden weights, \(W_h\) = hidden-to-hidden (recurrent) weights, \(b\) = bias, \(f\) = activation function (e.g., tanh, ReLU).
LSTM uses gates to control information flow: (1) Forget gate: decides what to discard from cell state (sigmoid → 0=forget, 1=keep), (2) Input gate: decides what new info to store (sigmoid × tanh candidate), (3) Output gate: decides what to output based on cell state. The cell state acts as a "conveyor belt" — information can flow through unchanged, solving vanishing gradients.
A simple RNN has one set of weight matrices (Wᵢ, Wₕ, b). An LSTM has 4 sets — one for each gate: forget gate (Wf), input gate (Wi), candidate cell state (Wc), and output gate (Wo). Each gate has its own input weights, recurrent weights, and bias. So ~4× the parameters.
Elman RNN: Hidden state feeds back to itself. \(a_t = f(W_i x_t + W_h a_{t-1})\). Jordan RNN: Output feeds back to hidden layer. \(a_t = f(W_i x_t + W_h y_{t-1})\). Elman is more common in practice. Key distinction: what gets fed back — hidden state (Elman) vs. output (Jordan).
Forget gate: fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) — controls what to forget from cell state. Input gate: iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi) — controls what new info to add. Output gate: oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo) — controls what to output. All gates use sigmoid (0-1 range) to act as "valves".
No. The correct term is Backpropagation Through Time (BPTT). We "unroll" the RNN across timesteps and backpropagate through the unrolled graph. "Through space" is not a real concept — it's a trick question.
Calculation
8 questions
Verbatim from mock exam.
Step 1 — t=1: \(a_1 = \text{ReLU}(W_i x_1 + W_h a_0) = \text{ReLU}([1,0]^T + [0,0]^T) = [1,0]^T\)
Step 2 — t=2: \(a_2 = \text{ReLU}(W_i x_2 + W_h a_1) = \text{ReLU}([0,1]^T + [0,1]^T) = [0,2]^T\)
Step 3 — output: \(y_2 = W_o a_2 = [1,1] \cdot [0,2]^T = 2\)
For an RNN with input size n and hidden size h: Wᵢ has h×n params, Wₕ has h×h params, bias has h params. Total: h(n+h+1). For LSTM: multiply by 4 (four gates). Plus output layer Wₒ with o×h params. Practice counting for specific dimensions.
Same process as mock exam 3c. Given specific weight matrices and inputs, compute hidden states step by step: aₜ = f(Wᵢxₜ + Wₕaₜ₋₁ + b). Then compute outputs yₜ = Wₒaₜ. Practice with different sizes and activations.
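A generic unrolling sketch of that recurrence (the test weights are made up; a linear activation is passed in so the values stay easy to follow):

```python
import numpy as np

def rnn_forward(xs, Wi, Wh, b, f=np.tanh):
    """Unroll a_t = f(Wi x_t + Wh a_{t-1} + b) over a sequence, starting from a_0 = 0."""
    a = np.zeros(Wh.shape[0])
    states = []
    for x in xs:
        a = f(Wi @ x + Wh @ a + b)    # one recurrence step
        states.append(a)
    return states
```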
For one LSTM timestep: (1) fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf), (2) iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi), (3) c̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc), (4) cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ, (5) oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo), (6) hₜ = oₜ ⊙ tanh(cₜ). Practice computing each gate value for given inputs.
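Steps (1)–(6) above as a single NumPy function (the zero weights in the test are made up so every gate is exactly 0.5):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM timestep following steps (1)-(6) above; [h, x] is concatenation."""
    hx = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ hx + bf)          # (1) forget gate
    i = sigmoid(Wi @ hx + bi)          # (2) input gate
    c_tilde = np.tanh(Wc @ hx + bc)    # (3) candidate cell state
    c = f * c_prev + i * c_tilde       # (4) new cell state
    o = sigmoid(Wo @ hx + bo)          # (5) output gate
    h = o * np.tanh(c)                 # (6) new hidden state
    return h, c
```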
Similar to mock exam but with RNN architecture and x² activation. One-hot encode each word → feed through RNN timesteps → apply f(z)=z² → output with softmax. Remember: f'(z) = 2z for the backward pass variant.
Common errors in LSTM equations: (1) Wrong activation (using ReLU instead of sigmoid for gates), (2) Missing element-wise multiplication ⊙, (3) Wrong concatenation in gate inputs, (4) Forget gate applied to wrong thing, (5) Output gate formula errors. Check each gate formula carefully against the standard LSTM.
From 2024 Mock Exam.
Weights: \(W_i = \begin{bmatrix} 1 & -1 & 0 \\ -1 & 0 & 1 \end{bmatrix}\), \(W_h = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}\), \(W_{out} = \begin{bmatrix} -1 & 1 \\ 0 & -1 \\ 1 & 0 \end{bmatrix}\)
Important: The activation is f=x² (element-wise square), NOT sigmoid! This simplification is repeated from the other mock exam — expect this on the real exam.
Procedure: For each word in sequence: (1) encode as one-hot, (2) compute \(z_t = W_i x_t + W_h a_{t-1}\), (3) apply \(a_t = z_t^2\), (4) after last word: \(y = \text{softmax}(W_{out} \cdot a_{last})\)
From 2024 Mock Exam. Given equations:
\(f_t = \text{sigmoid}(W_f[h_{t-1}, x_{t-1}] + b_f)\) — Eq.2
\(i_t = \text{sigmoid}(W_i[h_{t-1}, x_{t-1}] + b_i)\) — Eq.3
\(\tilde{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C)\) — Eq.4
\(C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\) — Eq.5
\(o_t = \tanh(W_o[h_{t-1}, x_t] + b_o)\) — Eq.6
\(h_t = o_t \odot \tanh(C_t)\) — Eq.7
Three errors:
1. Eq.2: should be \(x_t\), not \(x_{t-1}\) — LSTM gates use the CURRENT input
2. Eq.3: same error — should be \(x_t\), not \(x_{t-1}\)
3. Eq.6: the output gate uses sigmoid, not tanh — gates are always sigmoid (values 0–1)
This is a tricky "spot the bug" question. Know the correct LSTM equations cold.
True/False
3 questions
| Statement | T | F |
|---|---|---|
| Recurrent neural networks are often trained using Backpropagation Through Time (BPTT). | ✓ | |
| LSTMs completely solve the vanishing gradient problem. | | ✓ |
| The last hidden state of an RNN always captures the information of the whole input sequence. | | ✓ |
| In practice, we always need to pad the input for an RNN to work. | | ✓ |
Verbatim from mock exam. Key: LSTMs mitigate but don't completely solve vanishing gradients. Last hidden state may lose early info. Padding is practical necessity for batching, not theoretical requirement.
No. In theory, RNNs can process variable-length sequences one at a time. Padding is a practical requirement for batched processing — you need uniform tensor dimensions within a batch. A single sequence needs no padding.
Conceptual
10 questions
Verbatim from mock exam.
(1) Parameter sharing / reduction: Same filter applied across entire input → far fewer parameters than fully connected. (2) Translation invariance: Features detected regardless of position (a cat is a cat whether left or right in the image). Also: local connectivity captures spatial patterns.
Verbatim from mock exam.
Speech: e.g., speech recognition or speaker identification. Input = spectrogram (2D: time × frequency). Text: e.g., sentiment analysis or text classification. Input = word embeddings stacked as a matrix (2D: sequence length × embedding dimension).
Verbatim from mock exam.
Stack word embeddings into a matrix (rows = words, columns = embedding dimensions). Apply 1D filters that span the full embedding width but vary in height (n-gram size). A filter of height 2 captures bigram patterns, height 3 captures trigrams, etc. This lets CNNs capture local n-gram features without explicit n-gram engineering.
Verbatim from mock exam.
The gradient only flows through the position that was selected as maximum. For the max element: gradient passes through unchanged (derivative = 1). For all non-max elements: gradient = 0. In practice, you store the indices of max elements during forward pass ("switches") and route gradients back through those positions.
A fully connected layer would need m×n parameters for every element of the matrix to every neuron. For a 100×100 image → 10,000 inputs → a hidden layer of 1000 neurons needs 10 million weights! This motivates CNNs: parameter sharing through filters dramatically reduces parameters while capturing spatial structure.
If an object shifts position in the input, a fully connected network treats it as completely different input (different neurons activate). CNNs are translation-invariant — the same filter detects the same feature regardless of position. This is why CNNs are essential for vision and signal processing.
Hyperparameters: (1) Number of filters (channels), (2) Filter/kernel size, (3) Stride, (4) Padding (same/valid), (5) Pooling size and type (max/average). Note: filter size is a hyperparameter, but filter weights are learned parameters.
Output size = \(\lfloor\frac{n - k + 2p}{s}\rfloor + 1\) where n = input size, k = kernel size, p = padding, s = stride. Same formula for both conv and pooling layers. For 2D: apply independently to height and width.
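The formula as a drill helper:

```python
def conv_output_size(n, k, p=0, s=1):
    """floor((n - k + 2p) / s) + 1, for both conv and pooling layers."""
    return (n - k + 2 * p) // s + 1
```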
For variable-length inputs: (1) Padding: pad shorter sequences to max length, (2) Global pooling: apply global max/average pooling over the sequence dimension to get fixed-size output regardless of input length, (3) 1-max pooling: take maximum value from each filter's output across all positions.
From 2024 Mock Exam.
Intuition: In images, CNNs detect local spatial patterns (edges, textures). In language, local patterns are n-grams — groups of adjacent words that carry meaning. A CNN filter sliding over a sentence detects local word patterns just like it detects visual patterns in images.
Input representation: Each word is represented as a dense vector (word embedding). The sentence becomes a 2D matrix: rows = words in sequence, columns = embedding dimensions. So a sentence of length L with embedding dimension D gives an L × D input matrix.
Filters then slide vertically (across words) with full width (across all embedding dimensions), detecting n-gram patterns of different sizes.
Calculation
3 questions
Verbatim from mock exam.
Step 1 — Conv output size: height = (3-2)/1 + 1 = 2, width = (4-2)/2 + 1 = 2 → 2×2 output
Step 2 — Conv computation (stride 1 vertically, 2 horizontally):
Position (0,0): 0·1 + 0·0 + 1·0 + 1·1 = 1
Position (0,1): 1·1 + 2·0 + 2·0 + 1·1 = 2
Position (1,0): 1·1 + 1·0 + 1·0 + 0·1 = 1
Position (1,1): 2·1 + 1·0 + 1·0 + 0·1 = 2
Conv output: \(\begin{bmatrix} 1 & 2 \\ 1 & 2 \end{bmatrix}\)
Step 3 — Max pool (1×2): pool each row → [2, 2] → Result: \(\begin{bmatrix} 2 \\ 2 \end{bmatrix}\)
Practice the same convolution process with different inputs and filters. For each position: element-wise multiply filter with input patch, sum all products. Move filter by stride amount. Remember: output dimensions use the formula ⌊(n-k+2p)/s⌋ + 1.
From 2024 Mock Exam.
Input: \(\begin{pmatrix} 2 & 2 & 1 & 0 \\ -1 & 2 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 0 & 1 & 2 & -2 \end{pmatrix}\), Filter: \(\begin{pmatrix} 1 & -1 \\ 0 & 2 \end{pmatrix}\), Pooling: 2×2 max
Key differences from our existing mock: stride=2 for conv (not 1), activation f=x², pooling stride=1
Procedure:
1. Slide 2×2 filter with stride 2 → output size = ⌊(4-2)/2⌋+1 = 2 → 2×2 conv output
2. Apply activation f=x² element-wise
3. Apply 2×2 max pooling with stride 1 → output size = ⌊(2-2)/1⌋+1 = 1 → 1×1 output
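A NumPy check of this procedure with the input and filter given above, assuming the stated strides:

```python
import numpy as np

X = np.array([[ 2,  2, 1,  0],
              [-1,  2, 1,  1],
              [ 1, -1, 1, -1],
              [ 0,  1, 2, -2]])
F = np.array([[1, -1],
              [0,  2]])

# 2x2 filter with stride 2: patch top-left corners at rows/cols 0 and 2
conv = np.array([[(X[i:i+2, j:j+2] * F).sum() for j in (0, 2)] for i in (0, 2)])
act = conv ** 2        # activation f = x^2, element-wise
pooled = act.max()     # 2x2 max pool, stride 1, on a 2x2 map: a single value
```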
True/False
3 questions
| Statement | T | F |
|---|---|---|
| The number of filters is a hyperparameter. | ✓ | |
| The filter size determines the parameters of the model. | | ✓ |
| A CNN is a special case of a feedforward neural network. | ✓ | |
| The weights of the filter are the parameters of the model. | ✓ | |
Verbatim from mock exam. Key: filter SIZE is a hyperparameter (you choose it), but filter WEIGHTS are the learned parameters. The tricky one is "filter size determines parameters" — FALSE, because while filter size affects the count of parameters, it doesn't determine which values the parameters take.
TRUE. You (the designer) choose how many filters to use — it's not learned from data. This was an in-class quiz question testing the same concept as Mock 4f statement 1.
| Statement | T | F |
|---|---|---|
| The filter weights are trainable parameters of a CNN. | ✓ | |
| A convolutional layer with 10 filters (with bias) of size 3×3 has 100 trainable parameters. | | ✓ |
| Zero padding is only necessary when processing several input matrices at a time (in a batch or mini-batch). | | ✓ |
| A typical convolutional layer for language spans the whole sentence. | | ✓ |
| The average pooling layer can be used to downsample a matrix. | ✓ | |
From 2024 Mock Exam. Key traps:
• 10 filters of 3×3 with bias ≠ 100 params. It's 10 × (3×3×C + 1) where C = number of input channels. With C=1: 10×(9+1) = 100 would be TRUE, but the question doesn't specify single-channel. In general CNN context (e.g., RGB), C>1 so it's FALSE.
• Zero padding is used to control output spatial dimensions, NOT just for batching.
• Conv for language spans the full embedding width but NOT the whole sentence — filters slide across words.
Conceptual
18 questions

Verbatim from mock exam.
Property: Variable-length input maps to variable-length output (lengths can differ). Example: Machine translation ("Ich bin müde" → "I am tired"), text summarization, speech recognition (audio → text).
Verbatim from mock exam.
Scoring methods: (1) Dot product: score = hₑᵀ · h_d, (2) Additive (Bahdanau): score = vᵀ · tanh(W₁hₑ + W₂h_d). Other option: (3) Scaled dot product: score = (hₑᵀ · h_d) / √d.
How encoder states are used: Compute attention scores between decoder hidden state and ALL encoder hidden states → softmax → weighted sum of encoder states → context vector fed to decoder.
Verbatim from mock exam.
Q = XW_Q, K = XW_K, V = XW_V where X is the input matrix and W_Q, W_K, W_V are separate learnable weight matrices. Each input token gets its own query, key, and value by linear projection. The three weight matrices W_Q, W_K, W_V are the learnable parameters.
Verbatim from mock exam (4 pts). Based on the jalammar.github.io illustration:
Step 1: Compute Q, K, V by multiplying input embeddings with weight matrices W_Q, W_K, W_V.
Step 2 (purpose): Compute attention scores by taking dot product of query with all keys: score = Q · Kᵀ. Then scale by √d_k to prevent softmax saturation. This determines how much each token should "attend to" every other token.
Step 3: Apply softmax to the scaled scores to get attention weights (probabilities summing to 1).
Step 4: Multiply attention weights by V to get weighted sum — the output representation for each token.
Full formula: \(\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)
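Steps 1–4 and the formula map directly onto a short NumPy sketch (toy sizes assumed; the W_Q/W_K/W_V projections are folded into random Q, K, V matrices for brevity):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q,K,V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # step 2: score every query against every key
    weights = softmax(scores)          # step 3: each row is a probability distribution
    return weights @ V, weights        # step 4: weighted sum of value vectors

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                 # assumed: 4 tokens, 8-dim queries/keys/values
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)                      # (4, 8): one output vector per token
```

Note how every attention-weight row sums to 1 — that is the softmax in step 3.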
2: the weight matrix W and the bias vector b. nn.Linear(in, out) stores W of shape (out, in) and b of shape (out). If bias=False, then just 1 tensor (W only).
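A pure-Python mirror of the shapes nn.Linear stores, so the count can be re-derived on paper (the helper names here are made up for illustration):

```python
def linear_param_shapes(in_features, out_features, bias=True):
    """Tensors a Linear(in, out) layer stores: W is (out, in), b is (out,)."""
    shapes = [(out_features, in_features)]
    if bias:
        shapes.append((out_features,))
    return shapes

def linear_param_count(in_features, out_features, bias=True):
    return in_features * out_features + (out_features if bias else 0)

print(linear_param_shapes(3, 5))             # [(5, 3), (5,)] -> 2 tensors
print(linear_param_count(3, 5))              # 5*3 + 5 = 20 scalar parameters
print(linear_param_shapes(3, 5, bias=False)) # [(5, 3)] -> 1 tensor (W only)
```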
Seq2seq uses two RNNs: an encoder reads the input sequence and compresses it into a context vector (final hidden state), and a decoder generates the output sequence one token at a time, conditioned on the context vector. Problem: fixed-size context vector is a bottleneck for long sequences → solved by attention.
Self-attention lets each token attend to all other tokens in the same sequence. For each token: (1) Create Q, K, V vectors via learned projections, (2) Score = dot product of Q with all K's, (3) Softmax to get weights, (4) Weighted sum of V's = output. Captures long-range dependencies without recurrence. O(n²) complexity in sequence length.
Parameters (learned): W_Q, W_K, W_V weight matrices, output projection weights. Hyperparameters (chosen): d_model (model dimension), d_k (key/query dimension), d_v (value dimension), number of attention heads, number of layers.
Transformers & BERT
The transformer is based solely on attention mechanisms — no recurrence, no convolution. Advantages over RNNs: (1) fully parallelizable (no sequential dependency), (2) O(1) maximum path length for long-range dependencies vs O(n). Advantage over CNNs: O(1) path vs O(log_k(n)). The trade-off: self-attention has O(n²·d) complexity per layer, which is expensive for very long sequences.
1. Encoder self-attention: Each encoder position attends to all positions in the encoder input. Captures relationships within the source sequence.
2. Masked decoder self-attention: Each decoder position attends only to previous decoder positions (future tokens are masked with −∞ before softmax). Ensures autoregressive generation.
3. Encoder-decoder (cross) attention: Queries come from the decoder, keys and values come from the encoder output. This is how the decoder "reads" the source — analogous to attention in seq2seq.
Self-attention treats the input as a set, not a sequence — it has no built-in notion of word order. Without positional encoding, "the cat sat on the mat" and "the mat sat on the cat" produce identical representations. Positional encodings are added to the input embeddings to inject position information. The original paper uses sinusoidal functions so the model can generalize to unseen sequence lengths.
\(PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)\)
\(PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)\)
Where pos = position in the sequence, i = dimension index, d_model = model dimension. Each dimension uses a different frequency. For any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos), letting the model learn to attend to relative positions.
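The two formulas translate into a compact NumPy function (the max_len and d_model values below are assumed toy sizes; d_model must be even in this sketch):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal PE: even dims get sin, odd dims get cos, one frequency per pair."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2) pair indices
    angles = pos / (10000 ** (2 * i / d_model))  # frequency falls as i grows
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                 # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)        # (50, 16)
print(pe[0, :4])       # pos=0 row starts [sin 0, cos 0, ...] = [0, 1, 0, 1]
```

The matrix is simply added to the token embeddings before the first layer.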
Instead of one attention function with d_model-dimensional keys/values/queries, multi-head attention runs h parallel attention heads, each with reduced dimension d_k = d_v = d_model / h.
Formula: MultiHead(Q, K, V) = Concat(head₁, …, headₕ) · W^O, where each headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V).
Why multiple heads: Different heads can learn to attend to different types of relationships (e.g., syntactic vs. semantic). The output projection W^O (d_model × d_model) combines information from all heads.
From "Attention Is All You Need" (Table 1 on the slides):
| Layer Type | Complexity/Layer | Sequential Ops | Max Path Length |
|---|---|---|---|
| Self-Attention | O(n²·d) | O(1) | O(1) |
| Recurrent | O(n·d²) | O(n) | O(n) |
| Convolutional | O(k·n·d²) | O(1) | O(log_k(n)) |
n = sequence length, d = representation dimension, k = kernel size. Self-attention wins on parallelization and long-range paths, but costs more per layer for long sequences (n > d).
Problem: Static embeddings assign one fixed vector per word regardless of context. "Bank" gets the same vector in "river bank" and "bank account" — polysemy is lost.
Solution: Contextual embeddings generate a different vector for each word depending on its context. Instead of a lookup table, a model (ELMo, BERT) dynamically computes the embedding from the full sentence.
ELMo (Embeddings from Language Models, Peters et al. 2018) uses a bidirectional LSTM (BiLSTM): a forward LSTM reads left-to-right, a backward LSTM reads right-to-left. The contextualized embedding for each word is the concatenation of the forward and backward hidden states. Because each direction sees different context, the combined representation captures the full surrounding context.
BERT (Devlin et al. 2019) replaces ELMo's BiLSTM with a Transformer encoder. Two pre-training objectives:
1. Masked Language Model (MLM): Randomly mask 15% of input tokens; train the model to predict them. Unlike left-to-right LMs, this allows true bidirectional context.
2. Next Sentence Prediction (NSP): Given sentences A and B, predict whether B follows A. Trains inter-sentence understanding.
Pre-trained BERT is a powerful feature extractor that can be fine-tuned with relatively little task-specific data.
Add a small task-specific output layer on top of pre-trained BERT and train end-to-end. Four task types:
1. Sentence pair classification (MNLI, QQP): [CLS] + Sentence A + [SEP] + Sentence B → class label from [CLS].
2. Single sentence classification (SST-2, CoLA): [CLS] + Sentence → class label from [CLS].
3. Question answering (SQuAD): Question + [SEP] + Paragraph → predict start/end span.
4. Token-level tagging (CoNLL NER): [CLS] + Tokens → per-token labels (B-PER, O, etc.).
Calculation
3 questions

Encoder RNN: h(n+h+1) params (n = source embedding size). Decoder RNN: h(m+h+1) params (m = target embedding size). Attention: depends on type — additive attention has W₁ (h×h), W₂ (h×h), and v (h-vector) → 2h² + h params; plain dot-product attention adds none. Output projection: vocab_size × h. Don't forget embeddings: vocab × d_embed for both source and target.
Given encoder hidden states h₁, h₂, h₃ and decoder state sₜ: (1) Compute scores: eᵢ = score(sₜ, hᵢ) using dot product or additive method, (2) Softmax: αᵢ = exp(eᵢ)/Σexp(eⱼ), (3) Context: cₜ = Σ αᵢhᵢ. Practice the full computation with specific numbers.
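A worked numeric run of steps (1)–(3), with assumed toy values (2-dim states, dot-product scoring) to build calculation muscle memory:

```python
import numpy as np

# Assumed toy values: three 2-dim encoder states and one decoder state
h = np.array([[1.0, 0.0],    # h1
              [0.0, 1.0],    # h2
              [1.0, 1.0]])   # h3
s_t = np.array([1.0, 1.0])   # decoder state s_t

e = h @ s_t                            # (1) dot-product scores: [1, 1, 2]
alpha = np.exp(e) / np.exp(e).sum()    # (2) softmax attention weights
c = alpha @ h                          # (3) context vector = sum_i alpha_i * h_i

print(e)       # [1. 1. 2.]
print(alpha)   # ~[0.2119, 0.2119, 0.5761] -- h3 gets the most weight
print(c)       # ~[0.788, 0.788]
```

Drill this by hand first (exp, sum, divide), then verify against the printout.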
d_k = d_model / h = 512 / 8 = 64 per head.
Parameters: Each of W_Q, W_K, W_V projects from d_model → d_model (all heads combined): 512 × 512 = 262,144 params each. W^O projects concatenated output back: 512 × 512 = 262,144 params. Total: 4 × 262,144 = 1,048,576 parameters for one multi-head attention sub-layer (excluding biases).
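The same count as a tiny helper, so the arithmetic can be re-derived quickly (the bias option is an addition for completeness; the count above excludes biases):

```python
def mha_param_count(d_model, include_bias=False):
    """W_Q, W_K, W_V, W_O: each is d_model x d_model (all heads combined)."""
    per_matrix = d_model * d_model + (d_model if include_bias else 0)
    return 4 * per_matrix

print(mha_param_count(512))                     # 4 * 512*512 = 1,048,576
print(mha_param_count(512, include_bias=True))  # 4 * (512*512 + 512) = 1,050,624
```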
Code / Bug-Finding
5 questions

Bug: Using one matrix W for all three: Q=XW, K=XW, V=XW. Fix: Use separate matrices: Q=XW_Q, K=XW_K, V=XW_V. With a shared matrix, Q and K are identical, so each token's query matches its own key best — tokens mostly attend to themselves, and the model cannot learn separate "what I'm looking for" (query) vs. "what I offer" (key) roles.
Bug: Dividing by √(seq_length) instead of √(d_k). Fix: Scale by √d_k (dimension of the key vectors). The scaling prevents dot products from growing too large with higher dimensions, which would push softmax into regions with tiny gradients.
Bug: score = QVᵀ instead of QKᵀ. Fix: Attention scores are computed between queries and keys: score = QKᵀ/√d_k. V is only used after softmax to compute the weighted output. Q-K matching determines "where to look"; V provides "what to read".
Bug: Using raw scores to weight V: output = scores · V. Fix: Must apply softmax first: output = softmax(scores) · V. Without softmax, the weights aren't normalized to sum to 1, and the attention mechanism doesn't produce proper probability-weighted combinations.
Bug: Decoder runs independently without receiving encoder information. Fix: Decoder must be initialized with the encoder's final hidden state (basic seq2seq) or receive attention-weighted encoder states at each timestep (attention mechanism). Without this connection, the decoder has no knowledge of the input.
True/False
3 questions

False. Self-attention has O(n²·d) complexity; recurrent has O(n·d²). When sequence length n exceeds representation dimension d, self-attention is more expensive. Self-attention wins on parallelization (O(1) sequential ops) and path length, but not always on raw computation cost.
True. The decoder uses masked self-attention: future positions are set to −∞ before softmax, zeroing their attention weights. This preserves the autoregressive property — the prediction for position t depends only on positions < t (no peeking at future tokens during generation).
False. BERT uses only the transformer encoder (no decoder). It is designed for understanding tasks (classification, NER, QA), not autoregressive generation. The bidirectional context in BERT — seeing both left and right via masked language modeling — is possible precisely because it doesn't generate tokens left-to-right like a decoder.
Conceptual
20 questions

Verbatim from mock exam.
SGD: Updates parameters using gradient from ONE sample at a time. Noisy but fast updates. Mini-batch GD: Updates using gradient averaged over a small batch (e.g., 32 samples). Less noisy than SGD, more efficient than full batch. Full batch GD: uses ALL training data per update — stable but slow.
Verbatim from mock exam.
Why: Bad initialization → vanishing/exploding gradients, slow convergence, or getting stuck. All-zeros = all neurons learn the same thing (symmetry problem).
Xavier/Glorot: \(W \sim \mathcal{N}(0, \frac{2}{n_{in}+n_{out}})\) — designed for sigmoid/tanh. Keeps variance stable across layers.
He/Kaiming: \(W \sim \mathcal{N}(0, \frac{2}{n_{in}})\) — designed for ReLU. Accounts for ReLU killing half the values.
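Both schemes boil down to picking a standard deviation; a NumPy sketch (the layer sizes are assumed) that also checks the resulting spread:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Glorot normal: variance 2/(n_in + n_out), for sigmoid/tanh layers."""
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_init(n_in, n_out):
    """He/Kaiming normal: variance 2/n_in, for ReLU layers."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W_xavier = xavier_init(256, 256)   # target std = sqrt(2/512) = 0.0625
W_he = he_init(256, 256)           # target std = sqrt(2/256) ~ 0.0884
print(round(W_xavier.std(), 4), round(W_he.std(), 4))
```

Note the second argument of N(0, ·) in the formulas is the variance; the code passes its square root (the std) to the sampler.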
Verbatim from mock exam.
Weight decay adds a penalty proportional to the squared magnitude of weights to the loss function: \(L_{total} = L_{original} + \frac{\lambda}{2}||w||^2\). This encourages smaller weights, preventing any single weight from dominating. Purpose: Regularization — prevents overfitting by penalizing model complexity.
Verbatim from mock exam.
If dropout rate = p (e.g., 0.5) and we DON'T scale during training, then at test time we must multiply all weights by (1-p). Why? During training, on average only (1-p) fraction of neurons are active. At test time all neurons are active, so outputs would be too large by factor 1/(1-p). Multiplying by (1-p) compensates.
Alternative (inverted dropout): scale by 1/(1-p) during TRAINING, then no scaling needed at test time.
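Inverted dropout in a few lines of NumPy (a sketch; real frameworks do this inside their dropout layers):

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(x, p, train=True):
    """Drop units with prob p and scale survivors by 1/(1-p) during TRAINING,
    so test time needs no rescaling at all."""
    if not train:
        return x                         # test time: identity, all units active
    mask = rng.random(x.shape) >= p      # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

x = np.ones(10_000)
y = inverted_dropout(x, p=0.5)           # survivors become 1/(1-0.5) = 2.0
print(round(y.mean(), 3))                # ~1.0: expected activation preserved
print(inverted_dropout(x, p=0.5, train=False).mean())  # exactly 1.0 at test time
```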
Why ReLU: (1) Simple to compute: max(0,z), (2) No vanishing gradient for z>0 (gradient = 1), (3) Sparse activation (many zeros = efficient). Why variations: ReLU has the "dying ReLU" problem — neurons with z<0 always output 0 and stop learning. Variations like Leaky ReLU (small slope for z<0) and ELU address this.
Leaky ReLU: f(z) = max(αz, z) with small α like 0.01. Parametric ReLU (PReLU): same but α is learned. ELU: f(z) = z for z>0, α(eᶻ-1) for z≤0 — smooth and can output negative values. SELU: self-normalizing, maintains mean/variance across layers. GELU: f(z) = z·Φ(z) (Φ = standard normal CDF) — a smooth, probabilistic variant of ReLU used in transformers.
Architecture choices are hyperparameters selected via experimentation: (1) Depth: deeper = more abstract features but harder to train, (2) Width: wider layers = more capacity per layer, (3) Activation: ReLU for hidden layers (default), softmax for multi-class output, sigmoid for binary output. Use validation performance to guide choices.
Patience in early stopping = the number of epochs to wait after the last improvement in validation loss before stopping training. E.g., patience=5 means: if validation loss doesn't improve for 5 consecutive epochs, stop. Prevents overfitting by stopping before the model starts memorizing training data.
\(y = f(x) + x\) — the skip/residual connection. The network learns the residual f(x) = y - x, which is easier to optimize. Helps with vanishing gradients in deep networks because gradients can flow directly through the skip connection.
During training: randomly set neurons to 0 with probability p. Each forward pass uses a different "thinned" network. Effect: (1) Prevents co-adaptation — neurons can't rely on specific other neurons, (2) Implicit ensemble — averaging over 2ⁿ possible sub-networks, (3) Adds noise → regularization. At test time: use all neurons (with appropriate scaling).
Kaiming/He initialization: \(W \sim \mathcal{N}(0, \frac{2}{n_{in}})\) — variance 2/n_in, i.e. std √(2/n_in), where n_in is the number of input connections. Designed specifically for ReLU activations. The factor of 2 compensates for ReLU zeroing out roughly half the values, which would otherwise halve the variance at each layer.
AdaGrad adapts the learning rate per parameter. It divides the learning rate by the square root of the sum of all past squared gradients: \(w_t = w_{t-1} - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot g_t\). Parameters with large past gradients get smaller updates. Problem: learning rate monotonically decreases → can stop learning too early. RMSProp and Adam fix this.
When gradients become very large (exploding gradients), clip them to a maximum norm. If ||g|| > threshold, scale g down to g · (threshold / ||g||). Prevents unstable training, especially in RNNs where BPTT can cause gradient magnitudes to grow exponentially across timesteps.
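The clipping rule in code (a NumPy sketch; in PyTorch one would reach for torch.nn.utils.clip_grad_norm_ instead):

```python
import numpy as np

def clip_by_norm(g, threshold):
    """If ||g|| > threshold, rescale g to norm = threshold; direction is kept."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([30.0, 40.0])                # ||g|| = 50 -> "exploding"
clipped = clip_by_norm(g, threshold=5.0)
print(clipped)                            # [3. 4.]: same direction, norm 5
print(clip_by_norm(np.array([0.3, 0.4]), 5.0))  # small gradient: unchanged
```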
Artificially increase training data size by applying transformations (rotation, flipping, noise, cropping for images; synonym replacement, back-translation for text). Helps prevent overfitting by exposing the model to more variation. Particularly important when training data is limited.
Add skip connections that bypass one or more layers: y = F(x) + x. Benefits: (1) Gradients flow directly through skip connections → no vanishing gradient even in very deep networks (100+ layers), (2) Network only needs to learn the residual F(x) = y - x, which is often easier, (3) Worst case: if F(x) ≈ 0, the layer becomes identity → doesn't hurt performance.
Motivation: Internal covariate shift — each layer's input distribution changes as previous layers update, slowing training. How: For each mini-batch, normalize activations: \(\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\), then scale and shift: y = γx̂ + β (γ, β are learned). At test time: use running averages of μ and σ² from training.
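The training-time normalization step as a NumPy sketch (the batch size, feature count, and γ=1, β=0 choice are assumptions; the test-time running averages are omitted):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # ~zero mean, ~unit variance
    return gamma * x_hat + beta              # learned scale/shift restore capacity

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(32, 4))       # batch of 32, 4 shifted features
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(np.round(y.mean(axis=0), 6))           # ~0 per feature
print(np.round(y.std(axis=0), 4))            # ~1 per feature
```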
Curves: Training loss decreases, validation loss decreases then increases — the gap is overfitting. Strategies: (1) More training data, (2) Regularization (L1/L2/weight decay), (3) Dropout, (4) Early stopping, (5) Data augmentation, (6) Reduce model capacity, (7) Batch normalization.
Sigmoid saturates at 0 and 1, where gradient ≈ 0. In a deep network, gradients multiply through layers: ∂L/∂w₁ = ∂L/∂yₙ · σ'(zₙ) · ... · σ'(z₁). Since σ'(z) ≤ 0.25, the gradient shrinks exponentially with depth. Solutions: Use ReLU activation, residual connections, proper initialization (Xavier/He), or LSTM for recurrent networks.
From 2024 Mock Exam.
Behavior: This shows classic overfitting. Training loss keeps decreasing but dev loss starts increasing after a point — the model memorizes training data instead of learning generalizable patterns.
Two strategies:
1. Early stopping: Stop training at the point where dev loss is minimal (before it starts rising). Monitor validation loss and save the best checkpoint.
2. Regularization: Add L2 weight decay, dropout, or data augmentation to constrain the model and reduce overfitting.
Other valid answers: reduce model complexity, get more training data, batch normalization.
From 2024 Mock Exam.
Problem: Vanishing gradients. Sigmoid's derivative has max value 0.25. With 20 layers, gradients get multiplied ~20 times by values ≤ 0.25 → gradients shrink to near zero → early layers stop learning.
Detection: (1) Monitor gradient magnitudes per layer — if early layers have near-zero gradients, that's vanishing. (2) Training loss plateaus very early. (3) Early layer weights barely change across epochs.
Solutions: (1) Replace sigmoid with ReLU (gradient = 1 for positive inputs). (2) Use residual connections (skip connections bypass the gradient bottleneck). (3) Use He initialization designed for ReLU. (4) Use batch normalization.
Code / Bug-Finding
1 question

Verbatim from mock exam.

1   no_improvement = 0
2   patience = 10
3   best_loss = 0
4
5   for epoch in range(1000):
6       model.train()
7       train(model, train_loader)
8
9       model.eval()
10      val_loss = evaluate(model, val_loader)
11
12      if val_loss < best_loss:
13          best_loss = val_loss
14          no_improvement = 0
15          save_model(model, 'best.pt')
16      else:
17          no_improvement += 1
18
19      if no_improvement >= patience:
20          break
21
22  save_model(model, 'final.pt')
Find 4 conceptual mistakes. NOTE: Conceptual errors are logical errors of the algorithm, NOT syntax mistakes, typos, or runtime bugs.
Bugs:
1. Line 3: best_loss = 0 should be best_loss = float('inf') — losses are non-negative, so val_loss < best_loss (= 0) is never true and no checkpoint ever gets saved.
2. Line 22: Saves the LAST model (possibly overfit) instead of loading the BEST model. Should load 'best.pt' at the end.
3. Missing: No torch.no_grad() context during evaluation — wastes memory computing gradients during validation.
4. Missing: After break, should load best model before any final evaluation or saving.
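A runnable sketch of the corrected logic. The validation losses are dummy numbers standing in for real train/evaluate calls, and the checkpoint tuple stands in for save_model/load; the torch.no_grad() fix lives inside the real evaluate function:

```python
import math

# dummy per-epoch validation losses (stand-in for train + evaluate)
val_losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.7, 0.71, 0.72, 0.73, 0.74]

best_loss = math.inf          # fix 1: start at +inf, not 0
best_state = None
patience, no_improvement = 3, 0

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss = val_loss
        no_improvement = 0
        best_state = ("checkpoint", epoch)  # stand-in for save_model(..., 'best.pt')
    else:
        no_improvement += 1
    if no_improvement >= patience:
        break                               # stop once patience is exhausted

# fixes 2 & 4: restore/keep the BEST checkpoint, not the last (overfit) one
print(best_loss, best_state)                # 0.6 ('checkpoint', 2)
```

Here the loss bottoms out at epoch 2, patience runs out three epochs later, and the epoch-2 checkpoint is what survives.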
True/False
1 question

| Statement | T | F |
|---|---|---|
| Gradient clipping helps in the case of exploding gradients. | ✓ | |
| Given a training set with 100 examples, stochastic gradient descent performs one update step per epoch. | | ✓ |
| Dropout is a regularization technique. | ✓ | |
| A fully-connected layer with ReLU activation and residual connection has the form relu(Wx + b) + x. | ✓ | |
| Highway Network automatically removes unnecessary layers. | | ✓ |
From 2024 Mock Exam. Key traps:
• SGD with 100 examples: SGD processes ONE example at a time → 100 updates per epoch, not 1. (Batch GD would do 1 update per epoch.)
• ResNet formula: relu(Wx + b) + x is the correct residual form — TRUE.
• Highway Network: does NOT remove layers. It learns gating functions that control how much information flows through vs. bypasses each layer. The layers are still there; the network learns to route around them.