Exam Hints from Prof. Vu
- "I will only ask you the content that is on the slide." No transfer questions.
- Coding questions: explain what a line does, find bugs — NOT write from scratch.
- T/F: Each statement = 1 point. If unsure, write brief reasoning keywords for partial credit.
- "In the exam, I showed you an example — if I ask you the same example, you should be able to do it."
- Keep responses short; there is no need to be wordy.
- Only pen and paper allowed. 90 minutes. 77 points total.
Study Principles
- Verbal First: Say the answer out loud before writing it down. The verbal step is where learning happens.
- Active Recall: Click a question, try to answer it from memory, THEN check.
- Breadth First: Cover every topic at surface level before going deep on any one.
- Calculation Muscle Memory: Knowing the formula conceptually ≠ executing it under time pressure. Drill the computations.
- Priority Order: R1 Mock → R2 Quizzes → R3 Exercises → R4 Memory Protocols.
Linear Algebra & Calculus (supports Tasks 2–5)
13 questions
Matrix-vector multiplication practice. Row 1: 1·2 + 2·0 + 0·(-2) + 1·1 = 3. Row 2: (-3)·2 + 2·0 + 1·(-2) + 0·1 = -8. Result: [3, -8]ᵀ
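A quick NumPy check of this worked example (matrix and vector reconstructed from the row computations above):

```python
import numpy as np

# Matrix rows and input vector reconstructed from the row-by-row computation above
A = np.array([[ 1, 2, 0, 1],
              [-3, 2, 1, 0]])
x = np.array([2, 0, -2, 1])

result = A @ x  # each output entry is one row of A dotted with x
```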
3×3 Jacobian matrix. The Jacobian of g: ℝⁿ→ℝᵐ has dimensions m×n. Here m=3, n=3 → 3×3.
The Jacobian is a 3×2 matrix where entry (i,j) = ∂gᵢ/∂xⱼ. Compute each partial derivative and arrange in the matrix. Practice with specific functions like g(x₁,x₂) = [x₁², x₁x₂, x₂³].
Favorable outcomes for sum=7: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1) = 6 outcomes. Total outcomes: 6×6 = 36. P = 6/36 = 1/6.
Doubles: (1,1), (2,2), (3,3), (4,4), (5,5), (6,6) = 6 outcomes. Total: 36. P = 6/36 = 1/6.
No. Adding two one-hot vectors gives [2,0,0...] or similar — violates the 0/1 constraint. One-hot vectors don't form a vector space because the space isn't closed under addition.
Cosine similarity = (a·b)/(||a|| · ||b||). If result = 0, vectors are orthogonal (perpendicular). If = 1, parallel and same direction. If = -1, parallel and opposite direction. Practice computing dot products and norms for specific vectors.
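A minimal NumPy sketch for drilling this:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(a, b) = (a·b) / (||a|| · ||b||)"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```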
Derivative: For f: ℝ→ℝ, a single number f'(x). Gradient: For f: ℝⁿ→ℝ, a vector of partial derivatives ∇f. Jacobian: For f: ℝⁿ→ℝᵐ, an m×n matrix of all partial derivatives. The Jacobian generalizes the derivative to vector-valued functions.
Practice: dot products, matrix multiplication, transpose, determinant, inverse, norm, cross product, eigenvalues. These are foundational operations used in every forward/backward pass calculation.
Practice: power rule, chain rule, product rule. Key for backprop: d/dx(eˣ) = eˣ, d/dx(ln x) = 1/x, d/dx(σ(x)) = σ(x)(1-σ(x)), d/dx(tanh x) = 1-tanh²(x), d/dx(ReLU) = 1 if x>0 else 0.
Find eigenvalues by solving det(A - λI) = 0. For 2×2 matrix [[a,b],[c,d]]: λ² - (a+d)λ + (ad-bc) = 0. Use quadratic formula. Then find eigenvectors by solving (A - λI)v = 0 for each eigenvalue.
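The 2×2 case above, coded up for practice (assumes real eigenvalues, so the discriminant is non-negative):

```python
import math

def eigvals_2x2(a, b, c, d):
    """Roots of det(A - lI) = l^2 - (a+d)l + (ad - bc) = 0 via the quadratic formula."""
    trace, det = a + d, a * d - b * c
    disc = math.sqrt(trace ** 2 - 4 * det)  # assumes real eigenvalues
    return (trace + disc) / 2, (trace - disc) / 2
```

For [[2,1],[1,2]] this gives the eigenvalues 3 and 1.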
E[X] = Σ xᵢ · P(xᵢ). Var(X) = E[X²] - (E[X])². For continuous: use integrals. Key distributions: Bernoulli, Gaussian (normal), Uniform. Practice computing these for discrete probability tables.
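A drill on a hypothetical discrete probability table (values and probabilities made up):

```python
# Hypothetical discrete distribution: drill E[X] and Var(X)
xs = [0, 1, 2]
ps = [0.25, 0.5, 0.25]

ex  = sum(x * p for x, p in zip(xs, ps))        # E[X] = sum of x_i * P(x_i)
ex2 = sum(x ** 2 * p for x, p in zip(xs, ps))   # E[X^2]
var = ex2 - ex ** 2                             # Var(X) = E[X^2] - (E[X])^2
```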
Update rule: w_{t+1} = w_t - η · ∂L/∂w. Practice: (1) Compute the gradient of a loss function, (2) Apply the update rule for 2-3 steps with a given learning rate. Watch how the parameter moves toward the minimum.
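The update rule applied for three steps on a hypothetical loss L(w) = (w - 3)², whose gradient is 2(w - 3):

```python
# Gradient descent drill on a made-up convex loss L(w) = (w - 3)^2
def grad(w):
    return 2 * (w - 3)

w, eta = 0.0, 0.25
history = [w]
for _ in range(3):
    w = w - eta * grad(w)   # update rule: w <- w - eta * dL/dw
    history.append(w)
# w moves toward the minimum at w = 3
```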
Conceptual
12 questions
Verbatim from mock exam.
Key distinction: Regression predicts a continuous value. Classification predicts a discrete category/label.
NLP examples — Regression: predicting a sentiment score (1.0–5.0). Classification: spam detection, language identification, next-word prediction.
Classification. Words are discrete categories, not continuous numbers. The output is a probability distribution over the vocabulary.
Underfitting. The model can't even fit the training data well. Overfitting would show LOW training error but HIGH test error.
Average loss on the TRAINING data. Not the dev set. Empirical risk = (1/N) Σ L(f(xᵢ), yᵢ) over training samples. The "empirical" part means we use the actual observed data, not the true distribution.
No. The curse of dimensionality: more features means the data becomes sparser in high-dimensional space. Need exponentially more data to maintain the same density. Can lead to overfitting. Feature selection/reduction (PCA, etc.) can help.
K-means stops when: (1) No data points change cluster assignment, (2) Centroids don't move significantly, (3) Maximum number of iterations reached, or (4) Average distance to centroids falls below a threshold.
Feature engineering is the process of creating/selecting/transforming input features to improve model performance. Includes normalization, encoding categorical variables, creating interaction terms. Good features can make simple models perform well; bad features make even complex models fail.
Training curves plot loss/accuracy vs. epochs. Key patterns: (1) Both curves converging = good fit, (2) Training low but val high = overfitting, (3) Both high = underfitting, (4) Gap between curves = generalization gap. Use to decide: more data, more complexity, or regularization.
Kernel: A function that computes dot products in a higher-dimensional space without explicitly mapping to it (kernel trick). Support vectors: The data points closest to the decision boundary — they "support" it. Maximal margin: The largest possible gap between the decision boundary and the nearest data points of each class.
SVMs are natively binary classifiers. For multiclass: (1) One-vs-All (OvA): Train K classifiers, each separating one class from all others. (2) One-vs-One (OvO): Train K(K-1)/2 classifiers for each pair. Use voting for final decision.
Decision trees produce human-readable if/then rules. You can trace exactly why a prediction was made by following the path from root to leaf. Neural networks are "black boxes" — millions of parameters interact in non-linear ways, making it nearly impossible to explain individual predictions.
Feedforward NN with bag-of-words input, or CNN/RNN for sequence. Different random initializations lead to different local minima during training, so models may learn different features. Common practice: train multiple and ensemble, or pick best on validation.
Calculation
5 questions
Verbatim from mock exam.
Step 1 — predictions: \(f(x_1)=1\cdot1+0+1\cdot0+0=1\), \(f(x_2)=0+0+0+0=0\), \(f(x_3)=0+0+1+0=1\)
Step 2 — squared errors: \((2-1)^2=1\), \((1-0)^2=1\), \((3-1)^2=4\)
Step 3 — MSE: \(\frac{1}{3}(1+1+4) = \frac{6}{3} = 2\)
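A one-liner check of the three steps above, using the predictions and targets from the worked example:

```python
# MSE check: predictions [1, 0, 1] vs. targets [2, 1, 3] from the steps above
predictions = [1, 0, 1]
targets = [2, 1, 3]

mse = sum((y - f) ** 2 for y, f in zip(targets, predictions)) / len(targets)
```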
K-means algorithm: (1) Initialize K centroids randomly, (2) Assign each point to nearest centroid (Euclidean distance), (3) Recompute centroids as mean of assigned points, (4) Repeat until convergence. Practice the distance calculations and centroid updates by hand.
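Steps (2) and (3) in one iteration, sketched in NumPy (the example points in the test are made up):

```python
import numpy as np

def kmeans_step(X, centroids):
    """One k-means iteration: assign points to nearest centroid, then recompute means."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)   # step (2): nearest centroid per point
    new_centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])  # step (3): mean of cluster
    return labels, new_centroids
```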
Logistic regression: \(\hat{y} = \sigma(\mathbf{w}^T\mathbf{x} + b)\) where \(\sigma(z) = \frac{1}{1+e^{-z}}\). Output is a probability ∈ (0,1). Decision boundary at 0.5. Practice computing the sigmoid for specific values.
Same MSE formula as mock exam: \(\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2\). Practice computing predictions from weight vectors, then computing the loss.
Given weights w, bias b, and input x: compute z = wᵀx + b, then ŷ = σ(z) = 1/(1+e^(-z)). Remember: σ(0) = 0.5, σ is monotonic, σ(-z) = 1 - σ(z).
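A minimal sketch of that computation (the weight/input values in the test are made up):

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)), a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict(w, x, b):
    """Logistic regression: y_hat = sigma(w.x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)
```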
True/False
14 questions
| Statement | T | F |
|---|---|---|
| The goal of machine learning is to overfit the training data. | | ✓ |
| Regularization is used to prevent the model from overfitting. | ✓ | |
| Support Vector Machines (SVM) can only be used for binary classification. | | ✓ |
| K-means clustering is a partitional clustering algorithm. | ✓ | |
| Hyperparameters should be tuned on the test set. | | ✓ |
Verbatim from mock exam. Key: goal is generalization not overfitting; SVMs can do multiclass via OvA/OvO; hyperparams tuned on validation set, NOT test set.
TRUE — but nuanced. Parameters (weights) are learned during training and are model-dependent. Hyperparameters (learning rate, etc.) are set before training. Most hyperparameters ARE model-dependent (e.g., number of layers), but learning rate is a general hyperparameter not tied to a specific model architecture.
FALSE. k is a hyperparameter — it's set before training and not learned during the algorithm. The trainable "parameters" are the centroid positions.
FALSE. More data doesn't always help. Noisy data can hurt. Also, SVM performance depends on finding good support vectors — beyond a point, additional data points far from the decision boundary don't change the model.
TRUE. The entire point of regularization, early stopping, dropout, etc. is to ensure the model performs well on data it hasn't seen during training.
FALSE. SVMs can be extended to multiclass via One-vs-All or One-vs-One strategies. But natively, a single SVM is binary.
FALSE. KNN is non-parametric — it doesn't learn fixed parameters. k is a hyperparameter (set before "training"). KNN stores all training data and computes distances at prediction time.
TRUE. This is a valid convergence criterion. When no point is farther than a threshold from its centroid, the clustering is stable enough.
TRUE. The decision boundary depends only on the support vectors — the points closest to it. All other points could be removed without changing the boundary.
FALSE (in general). Gradient descent can get stuck in local minima for non-convex loss functions (like neural networks). For convex functions (like linear regression), it does find the global minimum.
TRUE. Linear Discriminant Analysis is supervised — it uses class labels to find a linear transformation that maximizes class separability (between-class variance relative to within-class variance).
| Statement | T | F |
|---|---|---|
| K-nearest neighbors algorithms is a parametric method. | | ✓ |
| Hyperparameters are trained to optimize the results on the training set. | | ✓ |
| Generalisation is a central problem of machine learning. | ✓ | |
| An overfitted model perfectly matches the evaluation data. | | ✓ |
| Regularization typically increases the error on the training set. | ✓ | |
| Logistic regression is typically used to predict real values. | | ✓ |
| Empirical risk is an average of a loss function on a finite development set. | | ✓ |
| Support vector machine can be used only for binary classification tasks. | | ✓ |
| Each data item is assigned to exactly one cluster with k-means clustering. | ✓ | |
| In active learning, systems actively select queries and request feedback from human. | ✓ | |
From 2024 Mock Exam. Key traps:
• Overfitted model ≠ matches evaluation data. Overfitting means matching TRAINING data too well, performing POORLY on evaluation/test data.
• Empirical risk uses TRAINING set, not development set — this is the same trick from the in-class quiz.
• 10 statements with minus scoring (wrong answers subtract points) — our existing mock had only 5. Budget more time for this.
Practice T/F statements from exercise sheets. These cover similar ground to the mock exam: overfitting, regularization, model selection, bias-variance tradeoff.
Conceptual
7 questions
Yes, somewhat. A single neuron computes a weighted sum + bias + activation — you can look at the weights to see which inputs matter most. But as networks get deeper, individual neurons become less interpretable because they represent increasingly abstract features.
MSE: regression tasks. Cross-entropy (CE): classification tasks. Binary CE: two-class problems. Categorical CE: multi-class (with softmax). CE is preferred for classification because it penalizes confident wrong predictions more heavily and has nicer gradients with softmax.
Sigmoid: σ(z) = 1/(1+e^(-z)), output (0,1), vanishing gradient for large |z|. Tanh: output (-1,1), zero-centered, still vanishing gradient. ReLU: max(0,z), no vanishing gradient for z>0, but "dying ReLU" for z<0. ReLU is most common in hidden layers; sigmoid/softmax for output layers.
The backward pass computes gradients of the loss w.r.t. each parameter using the chain rule. Starting from the output layer: (1) Compute error signal δ at output, (2) Propagate δ backward through each layer, (3) At each layer, compute ∂L/∂W = δ · aᵀ (activation from previous layer), (4) Update δ for next layer back using weights and activation derivative.
Without non-linear activations, stacking multiple linear layers collapses to a single linear transformation: W₂(W₁x + b₁) + b₂ = W'x + b'. The network couldn't learn any non-linear patterns regardless of depth. Non-linear activations let the network approximate arbitrary functions.
Deeper networks can represent increasingly abstract features hierarchically. Early layers learn simple patterns (edges), later layers combine them into complex concepts (faces). A single wide layer would need exponentially more neurons to represent the same functions. Depth = compositional power.
(1) Number of layers (depth), (2) Number of neurons per layer (width), (3) Learning rate. Others: activation function, batch size, number of epochs, optimizer choice, regularization strength.
Calculation
11 questions
Verbatim from mock exam.
Step 1 — BoW: x = [1,1,1,1]ᵀ (each word appears once)
Step 2 — Hidden: h = ReLU(W¹x) = ReLU([0.5, 0.5, 0.5]ᵀ) = [0.5, 0.5, 0.5]ᵀ
Step 3 — Output: z = W²h + b² = [0.25, 0.75]ᵀ
Step 4 — Softmax: \(\hat{y} = [\frac{e^{0.25}}{e^{0.25}+e^{0.75}}, \frac{e^{0.75}}{e^{0.25}+e^{0.75}}]\)
Step 5 — CE Loss: \(L = -\log(\hat{y}_{\text{class 1}})\) where class 1 = positive
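Steps 4–5 can be checked numerically from the logits z = [0.25, 0.75] (natural log assumed for the CE here):

```python
import math

# Softmax + CE for the logits from steps 3-5 above
z = [0.25, 0.75]
exps = [math.exp(v) for v in z]
y_hat = [e / sum(exps) for e in exps]   # softmax: probabilities summing to 1
ce_loss = -math.log(y_hat[0])           # CE loss when class 1 (index 0) is the target
```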
Verbatim from mock exam.
Step 1 — Error signal: δ² = ŷ - y = [0.88, 0.12] - [0, 1] = [0.88, -0.88]
Step 2 — Weight gradient: \(\nabla_{W^2} = \delta^2 \cdot (a^1)^T\) where a¹ is the hidden layer output from forward pass
Step 3 — Bias gradient: \(\nabla_{b^2} = \delta^2 = [0.88, -0.88]\)
Key insight: For softmax + CE, the error signal simplifies to ŷ - y (prediction minus target).
Same process as mock 2a but with different dimensions. (1) Compute z¹ = W¹x + b¹, (2) Apply ReLU: a¹ = max(0, z¹), (3) Compute z² = W²a¹ + b², (4) Apply softmax: ŷ = softmax(z²). Practice this until automatic.
For a layer mapping from n inputs to m outputs: Weights: m × n parameters, Biases: m parameters. Total per layer: m(n+1). For the whole network, sum across all layers. Example: 4→3→2 network = 3×4 + 3 + 2×3 + 2 = 12+3+6+2 = 23 parameters.
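The counting rule above as a one-line helper:

```python
def count_params(layer_sizes):
    """m*n weights + m biases per layer; e.g. [4, 3, 2] for a 4->3->2 network."""
    return sum(m * n + m for n, m in zip(layer_sizes, layer_sizes[1:]))
```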
Same procedure as mock exam. Practice with different weight matrices and activation functions. Key steps: (1) Linear: z = Wx + b, (2) Activation: a = f(z), (3) Repeat for each layer, (4) Final output with appropriate activation (softmax for classification).
Full backprop: (1) Compute δ at output layer (depends on loss function), (2) For each layer going backward: ∂L/∂W = δ · aᵀ (previous activation), ∂L/∂b = δ, then propagate: δ_prev = (Wᵀδ) ⊙ f'(z). The ⊙ is element-wise multiplication with the activation derivative.
Variant of mock exam 2a but with f(x) = x² as activation instead of ReLU. Same steps: encode input → matrix multiply → apply activation → matrix multiply → softmax → CE loss. Be careful: x² activation means f'(x) = 2x for the backward pass.
The key formulas: (1) Output error: δᴸ = ŷ - y (for softmax+CE), (2) Hidden error: δˡ = (Wˡ⁺¹)ᵀδˡ⁺¹ ⊙ f'(zˡ), (3) Weight gradient: ∂L/∂Wˡ = δˡ(aˡ⁻¹)ᵀ, (4) Bias gradient: ∂L/∂bˡ = δˡ. Make sure you can fill these in from memory.
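The four formulas applied once, using the mock's δ² numbers; the hidden activations, pre-activations, and W² are made up to give concrete shapes:

```python
import numpy as np

y_hat = np.array([0.88, 0.12])       # softmax output (from the mock)
y     = np.array([0.0, 1.0])         # one-hot target
a1    = np.array([0.5, 0.5, 0.5])    # hidden activations (made up)
z1    = np.array([0.5, -0.2, 0.5])   # hidden pre-activations (made up)
W2    = np.array([[1.0, 0.0, 1.0],   # output-layer weights (made up)
                  [0.0, 1.0, 1.0]])

delta_L = y_hat - y                        # (1) output error for softmax+CE
delta_1 = (W2.T @ delta_L) * (z1 > 0)      # (2) hidden error, ReLU derivative
grad_W2 = np.outer(delta_L, a1)            # (3) dL/dW2 = delta_L (a1)^T
grad_b2 = delta_L                          # (4) dL/db2 = delta_L
```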
From 2024 Mock Exam. Key differences from our existing mock:
• Same f=x² activation trick, but different weight matrices:
\(W^1 = \begin{bmatrix} 0 & 1 & -1 & 0 \\ 1 & -1 & 0 & 1 \end{bmatrix}\), \(W^2 = \begin{bmatrix} -1 & 1 \\ 1 & 0 \end{bmatrix}\), \(W^3 = \begin{bmatrix} 0 & 1 \\ 1 & -1 \\ -1 & 1 \end{bmatrix}\)
• 3-class output (positive/negative/neutral) with softmax
• Uses log₁₀ for cross-entropy (not ln) — watch the base!
• Round to 1 decimal point
Procedure: (1) Encode tweet as BoW → x = [1,0,0,1]ᵀ, (2) z¹ = W¹x, a¹ = (z¹)², (3) z² = W²a¹, a² = (z²)², (4) z³ = W³a², y = softmax(z³), (5) CE = -log₁₀(ŷ_true_class)
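A NumPy run of this procedure with the weights above; the true class isn't restated here, so only the intermediate values and logits are checked:

```python
import numpy as np

W1 = np.array([[0, 1, -1, 0],
               [1, -1, 0, 1]])
W2 = np.array([[-1, 1],
               [1, 0]])
W3 = np.array([[0, 1],
               [1, -1],
               [-1, 1]])
x = np.array([1, 0, 0, 1])        # BoW encoding of the tweet

a1 = (W1 @ x) ** 2                # f = x^2, element-wise
a2 = (W2 @ a1) ** 2
z3 = W3 @ a2                      # 3-class logits
exps = np.exp(z3 - z3.max())      # numerically stable softmax
y = exps / exps.sum()
# CE would then be -log10(y[true_class])
```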
From 2024 Mock Exam. Fill-in-the-blank format.
Box 1 (Forward pass):
\(z_i^l = \sum_j w_{ij}^l \cdot a_j^{l-1}\) (for l > 1), or \(z_i^l = \sum_j w_{ij}^l \cdot x_j\) (for l = 1)
Box 2 (Backward pass / δ):
\(\delta_i^l = \frac{\partial C}{\partial z_i^l}\) — the error signal at neuron i in layer l
This is a diagram-based question testing whether you understand how the chain rule decomposes into a forward pass component and a backward pass component.
From 2024 Mock Exam.
Why initialization matters: Neural networks use gradient descent to find local minima. Different initializations start the optimization at different points in the loss landscape, leading to different local minima → different final results. Bad initialization can cause vanishing/exploding gradients from the very start.
Xavier/Glorot initialization: \(W \sim \mathcal{N}(0, \frac{2}{n_{in} + n_{out}})\) — keeps variance of activations stable across layers. Good for sigmoid/tanh.
He initialization: \(W \sim \mathcal{N}(0, \frac{2}{n_{in}})\) — designed for ReLU activations.
Key point: we NEVER initialize all weights to the same value (e.g., all zeros) because then all neurons compute the same thing → symmetry problem → network can't learn different features.
Code / Bug-Finding
1 question
Verbatim from mock exam.
```python
for epoch in range(num_epochs):
    for batch_x, batch_y in train_loader:
        output = model(batch_x)
        loss = criterion(output, batch_y)
        loss.backward()

    print(f'Epoch {epoch}, Loss: {loss.item()}')
```
loss.backward() computes gradients of the loss w.r.t. all parameters via backpropagation (fills .grad attributes).
Missing operations:
1. optimizer.zero_grad() — must clear old gradients before backward() (otherwise they accumulate)
2. optimizer.step() — must update parameters using the computed gradients (after backward())
True/False
4 questions
| Statement | T | F |
|---|---|---|
| A neural network layer can be described as a linear transformation followed by a nonlinear activation. | ✓ | |
| Cross-entropy is often used as loss function for multi-class, multi-label classification. | ✓ | |
| In the backward pass, we start computing from δ¹. | | ✓ |
| A neural network typically consists of multiple layers. | ✓ | |
| ReLU activation is defined as min(0, z). | | ✓ |
Verbatim from mock exam. Key: backward starts from OUTPUT (δᴸ), not δ¹. ReLU = max(0,z) not min.
TRUE — this is tricky. Binary cross-entropy can be applied per-label independently for multi-label problems. Each label gets its own sigmoid output and BCE loss. The losses are then summed.
TRUE. By definition, a neural network has at least an input layer, one or more hidden layers, and an output layer. Even the simplest useful NN has multiple layers.
Conceptual
7 questions
Verbatim from mock exam.
RNNs have recurrent connections — they maintain a hidden state that is updated at each timestep, creating a "memory" of previous inputs. FFNNs process each input independently with no memory.
RNNs handle sequential/variable-length input data (text, speech, time series) where order matters. FFNNs require fixed-size input.
Verbatim from mock exam.
\(a_t = f(W_i x_t + W_h a_{t-1} + b)\)
Where: \(a_t\) = hidden state at time t, \(x_t\) = input at time t, \(a_{t-1}\) = previous hidden state, \(W_i\) = input-to-hidden weights, \(W_h\) = hidden-to-hidden (recurrent) weights, \(b\) = bias, \(f\) = activation function (e.g., tanh, ReLU).
LSTM uses gates to control information flow: (1) Forget gate: decides what to discard from cell state (sigmoid → 0=forget, 1=keep), (2) Input gate: decides what new info to store (sigmoid × tanh candidate), (3) Output gate: decides what to output based on cell state. The cell state acts as a "conveyor belt" — information can flow through unchanged, solving vanishing gradients.
A simple RNN has one set of weight matrices (Wᵢ, Wₕ, b). An LSTM has 4 sets — one for each gate: forget gate (Wf), input gate (Wi), candidate cell state (Wc), and output gate (Wo). Each gate has its own input weights, recurrent weights, and bias. So ~4× the parameters.
Elman RNN: Hidden state feeds back to itself. \(a_t = f(W_i x_t + W_h a_{t-1})\). Jordan RNN: Output feeds back to hidden layer. \(a_t = f(W_i x_t + W_h y_{t-1})\). Elman is more common in practice. Key distinction: what gets fed back — hidden state (Elman) vs. output (Jordan).
Forget gate: fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf) — controls what to forget from cell state. Input gate: iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi) — controls what new info to add. Output gate: oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo) — controls what to output. All gates use sigmoid (0-1 range) to act as "valves".
No. The correct term is Backpropagation Through Time (BPTT). We "unroll" the RNN across timesteps and backpropagate through the unrolled graph. "Through space" is not a real concept — it's a trick question.
Calculation
8 questions
Verbatim from mock exam.
Step 1 — t=1: \(a_1 = \text{ReLU}(W_i x_1 + W_h a_0) = \text{ReLU}([1,0]^T + [0,0]^T) = [1,0]^T\)
Step 2 — t=2: \(a_2 = \text{ReLU}(W_i x_2 + W_h a_1) = \text{ReLU}([0,1]^T + [0,1]^T) = [0,2]^T\)
Step 3 — output: \(y_2 = W_o a_2 = [1,1] \cdot [0,2]^T = 2\)
For an RNN with input size n and hidden size h: Wᵢ has h×n params, Wₕ has h×h params, bias has h params. Total: h(n+h+1). For LSTM: multiply by 4 (four gates). Plus output layer Wₒ with o×h params. Practice counting for specific dimensions.
Same process as mock exam 3c. Given specific weight matrices and inputs, compute hidden states step by step: aₜ = f(Wᵢxₜ + Wₕaₜ₋₁ + b). Then compute outputs yₜ = Wₒaₜ. Practice with different sizes and activations.
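A generic unrolling sketch of that recurrence (the test weights are made up; a linear activation is passed in so the values stay easy to follow):

```python
import numpy as np

def rnn_forward(xs, Wi, Wh, b, f=np.tanh):
    """Unroll a_t = f(Wi x_t + Wh a_{t-1} + b) over a sequence, starting from a_0 = 0."""
    a = np.zeros(Wh.shape[0])
    states = []
    for x in xs:
        a = f(Wi @ x + Wh @ a + b)    # one recurrence step
        states.append(a)
    return states
```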
For one LSTM timestep: (1) fₜ = σ(Wf·[hₜ₋₁, xₜ] + bf), (2) iₜ = σ(Wi·[hₜ₋₁, xₜ] + bi), (3) c̃ₜ = tanh(Wc·[hₜ₋₁, xₜ] + bc), (4) cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ, (5) oₜ = σ(Wo·[hₜ₋₁, xₜ] + bo), (6) hₜ = oₜ ⊙ tanh(cₜ). Practice computing each gate value for given inputs.
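Steps (1)–(6) above as a single NumPy function (the zero weights in the test are made up so every gate is exactly 0.5):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM timestep following steps (1)-(6) above; [h, x] is concatenation."""
    hx = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ hx + bf)          # (1) forget gate
    i = sigmoid(Wi @ hx + bi)          # (2) input gate
    c_tilde = np.tanh(Wc @ hx + bc)    # (3) candidate cell state
    c = f * c_prev + i * c_tilde       # (4) new cell state
    o = sigmoid(Wo @ hx + bo)          # (5) output gate
    h = o * np.tanh(c)                 # (6) new hidden state
    return h, c
```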
Similar to mock exam but with RNN architecture and x² activation. One-hot encode each word → feed through RNN timesteps → apply f(z)=z² → output with softmax. Remember: f'(z) = 2z for the backward pass variant.
Common errors in LSTM equations: (1) Wrong activation (using ReLU instead of sigmoid for gates), (2) Missing element-wise multiplication ⊙, (3) Wrong concatenation in gate inputs, (4) Forget gate applied to wrong thing, (5) Output gate formula errors. Check each gate formula carefully against the standard LSTM.
From 2024 Mock Exam.
Weights: \(W_i = \begin{bmatrix} 1 & -1 & 0 \\ -1 & 0 & 1 \end{bmatrix}\), \(W_h = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}\), \(W_{out} = \begin{bmatrix} -1 & 1 \\ 0 & -1 \\ 1 & 0 \end{bmatrix}\)
Important: The activation is f=x² (element-wise square), NOT sigmoid! This simplification is repeated from the other mock exam — expect this on the real exam.
Procedure: For each word in sequence: (1) encode as one-hot, (2) compute \(z_t = W_i x_t + W_h a_{t-1}\), (3) apply \(a_t = z_t^2\), (4) after last word: \(y = \text{softmax}(W_{out} \cdot a_{last})\)
From 2024 Mock Exam. Given equations:
\(f_t = \text{sigmoid}(W_f[h_{t-1}, x_{t-1}] + b_f)\) — Eq.2
\(i_t = \text{sigmoid}(W_i[h_{t-1}, x_{t-1}] + b_i)\) — Eq.3
\(\tilde{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C)\) — Eq.4
\(C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\) — Eq.5
\(o_t = \tanh(W_o[h_{t-1}, x_t] + b_o)\) — Eq.6
\(h_t = o_t \odot \tanh(C_t)\) — Eq.7
Three errors:
1. Eq.2: should be \(x_t\), not \(x_{t-1}\) — LSTM gates use the CURRENT input
2. Eq.3: same error — should be \(x_t\), not \(x_{t-1}\)
3. Eq.6: the output gate uses sigmoid, not tanh — gates are always sigmoid (values 0–1)
This is a tricky "spot the bug" question. Know the correct LSTM equations cold.
True/False
3 questions
| Statement | T | F |
|---|---|---|
| Recurrent neural networks are often trained using Backpropagation Through Time (BPTT). | ✓ | |
| LSTMs completely solve the vanishing gradient problem. | | ✓ |
| The last hidden state of an RNN always captures the information of the whole input sequence. | | ✓ |
| In practice, we always need to pad the input for an RNN to work. | | ✓ |
Verbatim from mock exam. Key: LSTMs mitigate but don't completely solve vanishing gradients. Last hidden state may lose early info. Padding is practical necessity for batching, not theoretical requirement.
No. In theory, RNNs can process variable-length sequences one at a time. Padding is a practical requirement for batched processing — you need uniform tensor dimensions within a batch. A single sequence needs no padding.
Conceptual
10 questions
Verbatim from mock exam.
(1) Parameter sharing / reduction: Same filter applied across entire input → far fewer parameters than fully connected. (2) Translation invariance: Features detected regardless of position (a cat is a cat whether left or right in the image). Also: local connectivity captures spatial patterns.
Verbatim from mock exam.
Speech: e.g., speech recognition or speaker identification. Input = spectrogram (2D: time × frequency). Text: e.g., sentiment analysis or text classification. Input = word embeddings stacked as a matrix (2D: sequence length × embedding dimension).
Verbatim from mock exam.
Stack word embeddings into a matrix (rows = words, columns = embedding dimensions). Apply 1D filters that span the full embedding width but vary in height (n-gram size). A filter of height 2 captures bigram patterns, height 3 captures trigrams, etc. This lets CNNs capture local n-gram features without explicit n-gram engineering.
Verbatim from mock exam.
The gradient only flows through the position that was selected as maximum. For the max element: gradient passes through unchanged (derivative = 1). For all non-max elements: gradient = 0. In practice, you store the indices of max elements during forward pass ("switches") and route gradients back through those positions.
A fully connected layer would need m×n parameters for every element of the matrix to every neuron. For a 100×100 image → 10,000 inputs → a hidden layer of 1000 neurons needs 10 million weights! This motivates CNNs: parameter sharing through filters dramatically reduces parameters while capturing spatial structure.
If an object shifts position in the input, a fully connected network treats it as completely different input (different neurons activate). CNNs are translation-invariant — the same filter detects the same feature regardless of position. This is why CNNs are essential for vision and signal processing.
Hyperparameters: (1) Number of filters (channels), (2) Filter/kernel size, (3) Stride, (4) Padding (same/valid), (5) Pooling size and type (max/average). Note: filter size is a hyperparameter, but filter weights are learned parameters.
Output size = \(\lfloor\frac{n - k + 2p}{s}\rfloor + 1\) where n = input size, k = kernel size, p = padding, s = stride. Same formula for both conv and pooling layers. For 2D: apply independently to height and width.
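The formula as a drill helper:

```python
def conv_output_size(n, k, p=0, s=1):
    """floor((n - k + 2p) / s) + 1, for both conv and pooling layers."""
    return (n - k + 2 * p) // s + 1
```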
For variable-length inputs: (1) Padding: pad shorter sequences to max length, (2) Global pooling: apply global max/average pooling over the sequence dimension to get fixed-size output regardless of input length, (3) 1-max pooling: take maximum value from each filter's output across all positions.
From 2024 Mock Exam.
Intuition: In images, CNNs detect local spatial patterns (edges, textures). In language, local patterns are n-grams — groups of adjacent words that carry meaning. A CNN filter sliding over a sentence detects local word patterns just like it detects visual patterns in images.
Input representation: Each word is represented as a dense vector (word embedding). The sentence becomes a 2D matrix: rows = words in sequence, columns = embedding dimensions. So a sentence of length L with embedding dimension D gives an L × D input matrix.
Filters then slide vertically (across words) with full width (across all embedding dimensions), detecting n-gram patterns of different sizes.
Calculation
3 questions
Verbatim from mock exam.
Step 1 — Conv output size: height = (3-2)/1 + 1 = 2, width = (4-2)/2 + 1 = 2 → 2×2 output
Step 2 — Conv computation (stride 1 vertically, 2 horizontally):
Position (0,0): 0·1 + 0·0 + 1·0 + 1·1 = 1
Position (0,1): 1·1 + 2·0 + 2·0 + 1·1 = 2
Position (1,0): 1·1 + 1·0 + 1·0 + 0·1 = 1
Position (1,1): 2·1 + 1·0 + 1·0 + 0·1 = 2
Conv output: \(\begin{bmatrix} 1 & 2 \\ 1 & 2 \end{bmatrix}\)
Step 3 — Max pool (1×2): pool each row → [2, 2] → Result: \(\begin{bmatrix} 2 \\ 2 \end{bmatrix}\)
Practice the same convolution process with different inputs and filters. For each position: element-wise multiply filter with input patch, sum all products. Move filter by stride amount. Remember: output dimensions use the formula ⌊(n-k+2p)/s⌋ + 1.
From 2024 Mock Exam.
Input: \(\begin{pmatrix} 2 & 2 & 1 & 0 \\ -1 & 2 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 0 & 1 & 2 & -2 \end{pmatrix}\), Filter: \(\begin{pmatrix} 1 & -1 \\ 0 & 2 \end{pmatrix}\), Pooling: 2×2 max
Key differences from our existing mock: stride=2 for conv (not 1), activation f=x², pooling stride=1
Procedure:
1. Slide 2×2 filter with stride 2 → output size = ⌊(4-2)/2⌋+1 = 2 → 2×2 conv output
2. Apply activation f=x² element-wise
3. Apply 2×2 max pooling with stride 1 → output size = ⌊(2-2)/1⌋+1 = 1 → 1×1 output
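A NumPy check of this procedure with the input and filter given above, assuming the stated strides:

```python
import numpy as np

X = np.array([[ 2,  2, 1,  0],
              [-1,  2, 1,  1],
              [ 1, -1, 1, -1],
              [ 0,  1, 2, -2]])
F = np.array([[1, -1],
              [0,  2]])

# 2x2 filter with stride 2: patch top-left corners at rows/cols 0 and 2
conv = np.array([[(X[i:i+2, j:j+2] * F).sum() for j in (0, 2)] for i in (0, 2)])
act = conv ** 2        # activation f = x^2, element-wise
pooled = act.max()     # 2x2 max pool, stride 1, on a 2x2 map: a single value
```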
True/False
3 questions
| Statement | T | F |
|---|---|---|
| The number of filters is a hyperparameter. | ✓ | |
| The filter size determines the parameters of the model. | | ✓ |
| A CNN is a special case of a feedforward neural network. | ✓ | |
| The weights of the filter are the parameters of the model. | ✓ | |
Verbatim from mock exam. Key: filter SIZE is a hyperparameter (you choose it), but filter WEIGHTS are the learned parameters. The tricky one is "filter size determines parameters" — FALSE, because while filter size affects the count of parameters, it doesn't determine which values the parameters take.
TRUE. You (the designer) choose how many filters to use — it's not learned from data. This was an in-class quiz question testing the same concept as Mock 4f statement 1.
| Statement | T | F |
|---|---|---|
| The filter weights are trainable parameters of a CNN. | ✓ | |
| A convolutional layer with 10 filters (with bias) of size 3×3 has 100 trainable parameters. | | ✓ |
| Zero padding is only necessary when processing several input matrices at a time (in a batch or mini-batch). | | ✓ |
| A typical convolutional layer for language spans the whole sentence. | | ✓ |
| The average pooling layer can be used to downsample a matrix. | ✓ | |
From 2024 Mock Exam. Key traps:
• 10 filters of 3×3 with bias ≠ 100 params. It's 10 × (3×3×C + 1) where C = number of input channels. With C=1: 10×(9+1) = 100 would be TRUE, but the question doesn't specify single-channel. In general CNN context (e.g., RGB), C>1 so it's FALSE.
• Zero padding is used to control output spatial dimensions, NOT just for batching.
• Conv for language spans the full embedding width but NOT the whole sentence — filters slide across words.
Conceptual
18 questions

Verbatim from mock exam.
Property: Variable-length input maps to variable-length output (lengths can differ). Example: Machine translation ("Ich bin müde" → "I am tired"), text summarization, speech recognition (audio → text).
Verbatim from mock exam.
Scoring methods: (1) Dot product: score = hₑᵀ · h_d, (2) Additive (Bahdanau): score = vᵀ · tanh(W₁hₑ + W₂h_d). Other option: (3) Scaled dot product: score = (hₑᵀ · h_d) / √d.
How encoder states are used: Compute attention scores between decoder hidden state and ALL encoder hidden states → softmax → weighted sum of encoder states → context vector fed to decoder.
Verbatim from mock exam.
Q = XW_Q, K = XW_K, V = XW_V where X is the input matrix and W_Q, W_K, W_V are separate learnable weight matrices. Each input token gets its own query, key, and value by linear projection. The three weight matrices W_Q, W_K, W_V are the learnable parameters.
Verbatim from mock exam (4 pts). Based on the jalammar.github.io illustration:
Step 1: Compute Q, K, V by multiplying input embeddings with weight matrices W_Q, W_K, W_V.
Step 2 (purpose): Compute attention scores by taking dot product of query with all keys: score = Q · Kᵀ. Then scale by √d_k to prevent softmax saturation. This determines how much each token should "attend to" every other token.
Step 3: Apply softmax to the scaled scores to get attention weights (probabilities summing to 1).
Step 4: Multiply attention weights by V to get weighted sum — the output representation for each token.
Full formula: \(\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\)
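Steps 1–4 and the formula map directly onto a short NumPy sketch (toy sizes assumed; the W_Q/W_K/W_V projections are folded into random Q, K, V matrices for brevity):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q,K,V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # step 2: score every query against every key
    weights = softmax(scores)          # step 3: each row is a probability distribution
    return weights @ V, weights        # step 4: weighted sum of value vectors

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                 # assumed: 4 tokens, 8-dim queries/keys/values
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)                      # (4, 8): one output vector per token
```

Note how every attention-weight row sums to 1 — that is the softmax in step 3.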
2: the weight matrix W and the bias vector b. nn.Linear(in, out) stores W of shape (out, in) and b of shape (out). If bias=False, then just 1 tensor (W only).
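A pure-Python mirror of the shapes nn.Linear stores, so the count can be re-derived on paper (the helper names here are made up for illustration):

```python
def linear_param_shapes(in_features, out_features, bias=True):
    """Tensors a Linear(in, out) layer stores: W is (out, in), b is (out,)."""
    shapes = [(out_features, in_features)]
    if bias:
        shapes.append((out_features,))
    return shapes

def linear_param_count(in_features, out_features, bias=True):
    return in_features * out_features + (out_features if bias else 0)

print(linear_param_shapes(3, 5))             # [(5, 3), (5,)] -> 2 tensors
print(linear_param_count(3, 5))              # 5*3 + 5 = 20 scalar parameters
print(linear_param_shapes(3, 5, bias=False)) # [(5, 3)] -> 1 tensor (W only)
```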
Seq2seq uses two RNNs: an encoder reads the input sequence and compresses it into a context vector (final hidden state), and a decoder generates the output sequence one token at a time, conditioned on the context vector. Problem: fixed-size context vector is a bottleneck for long sequences → solved by attention.
Self-attention lets each token attend to all other tokens in the same sequence. For each token: (1) Create Q, K, V vectors via learned projections, (2) Score = dot product of Q with all K's, (3) Softmax to get weights, (4) Weighted sum of V's = output. Captures long-range dependencies without recurrence. O(n²) complexity in sequence length.
Parameters (learned): W_Q, W_K, W_V weight matrices, output projection weights. Hyperparameters (chosen): d_model (model dimension), d_k (key/query dimension), d_v (value dimension), number of attention heads, number of layers.
Transformers & BERT
The transformer is based solely on attention mechanisms — no recurrence, no convolution. Advantages over RNNs: (1) fully parallelizable (no sequential dependency), (2) O(1) maximum path length for long-range dependencies vs O(n). Advantage over CNNs: O(1) path vs O(log_k(n)). The trade-off: self-attention has O(n²·d) complexity per layer, which is expensive for very long sequences.
1. Encoder self-attention: Each encoder position attends to all positions in the encoder input. Captures relationships within the source sequence.
2. Masked decoder self-attention: Each decoder position attends only to previous decoder positions (future tokens are masked with −∞ before softmax). Ensures autoregressive generation.
3. Encoder-decoder (cross) attention: Queries come from the decoder, keys and values come from the encoder output. This is how the decoder "reads" the source — analogous to attention in seq2seq.
Self-attention treats the input as a set, not a sequence — it has no built-in notion of word order. Without positional encoding, "the cat sat on the mat" and "the mat sat on the cat" produce identical representations. Positional encodings are added to the input embeddings to inject position information. The original paper uses sinusoidal functions so the model can generalize to unseen sequence lengths.
\(PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)\)
\(PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)\)
Where pos = position in the sequence, i = dimension index, d_model = model dimension. Each dimension uses a different frequency. For any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos), letting the model learn to attend to relative positions.
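The two formulas translate into a compact NumPy function (the max_len and d_model values below are assumed toy sizes; d_model must be even in this sketch):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal PE: even dims get sin, odd dims get cos, one frequency per pair."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2) pair indices
    angles = pos / (10000 ** (2 * i / d_model))  # frequency falls as i grows
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                 # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)        # (50, 16)
print(pe[0, :4])       # pos=0 row starts [sin 0, cos 0, ...] = [0, 1, 0, 1]
```

The matrix is simply added to the token embeddings before the first layer.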
Instead of one attention function with d_model-dimensional keys/values/queries, multi-head attention runs h parallel attention heads, each with reduced dimension d_k = d_v = d_model / h.
Formula: MultiHead(Q, K, V) = Concat(head₁, …, headₕ) · W^O, where each headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V).
Why multiple heads: Different heads can learn to attend to different types of relationships (e.g., syntactic vs. semantic). The output projection W^O (d_model × d_model) combines information from all heads.
From "Attention Is All You Need" (Table 1 on the slides):
| Layer Type | Complexity/Layer | Sequential Ops | Max Path Length |
|---|---|---|---|
| Self-Attention | O(n²·d) | O(1) | O(1) |
| Recurrent | O(n·d²) | O(n) | O(n) |
| Convolutional | O(k·n·d²) | O(1) | O(log_k(n)) |
n = sequence length, d = representation dimension, k = kernel size. Self-attention wins on parallelization and long-range paths, but costs more per layer for long sequences (n > d).
Problem: Static embeddings assign one fixed vector per word regardless of context. "Bank" gets the same vector in "river bank" and "bank account" — polysemy is lost.
Solution: Contextual embeddings generate a different vector for each word depending on its context. Instead of a lookup table, a model (ELMo, BERT) dynamically computes the embedding from the full sentence.
ELMo (Embeddings from Language Models, Peters et al. 2018) uses a bidirectional LSTM (BiLSTM): a forward LSTM reads left-to-right, a backward LSTM reads right-to-left. The contextualized embedding for each word is the concatenation of the forward and backward hidden states. Because each direction sees different context, the combined representation captures the full surrounding context.
BERT (Devlin et al. 2019) replaces ELMo's BiLSTM with a Transformer encoder. Two pre-training objectives:
1. Masked Language Model (MLM): Randomly mask 15% of input tokens; train the model to predict them. Unlike left-to-right LMs, this allows true bidirectional context.
2. Next Sentence Prediction (NSP): Given sentences A and B, predict whether B follows A. Trains inter-sentence understanding.
Pre-trained BERT is a powerful feature extractor that can be fine-tuned with relatively little task-specific data.
Add a small task-specific output layer on top of pre-trained BERT and train end-to-end. Four task types:
1. Sentence pair classification (MNLI, QQP): [CLS] + Sentence A + [SEP] + Sentence B → class label from [CLS].
2. Single sentence classification (SST-2, CoLA): [CLS] + Sentence → class label from [CLS].
3. Question answering (SQuAD): Question + [SEP] + Paragraph → predict start/end span.
4. Token-level tagging (CoNLL NER): [CLS] + Tokens → per-token labels (B-PER, O, etc.).
Calculation
3 questions

Encoder RNN: h(n+h+1) params (n = source embedding size). Decoder RNN: h(m+h+1) params (m = target embedding size). Attention: depends on type — additive attention has W₁ (h×h), W₂ (h×h), and v (h-vector) → 2h² + h params; plain dot-product attention adds none. Output projection: vocab_size × h. Don't forget embeddings: vocab × d_embed for both source and target.
Given encoder hidden states h₁, h₂, h₃ and decoder state sₜ: (1) Compute scores: eᵢ = score(sₜ, hᵢ) using dot product or additive method, (2) Softmax: αᵢ = exp(eᵢ)/Σexp(eⱼ), (3) Context: cₜ = Σ αᵢhᵢ. Practice the full computation with specific numbers.
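A worked numeric run of steps (1)–(3), with assumed toy values (2-dim states, dot-product scoring) to build calculation muscle memory:

```python
import numpy as np

# Assumed toy values: three 2-dim encoder states and one decoder state
h = np.array([[1.0, 0.0],    # h1
              [0.0, 1.0],    # h2
              [1.0, 1.0]])   # h3
s_t = np.array([1.0, 1.0])   # decoder state s_t

e = h @ s_t                            # (1) dot-product scores: [1, 1, 2]
alpha = np.exp(e) / np.exp(e).sum()    # (2) softmax attention weights
c = alpha @ h                          # (3) context vector = sum_i alpha_i * h_i

print(e)       # [1. 1. 2.]
print(alpha)   # ~[0.2119, 0.2119, 0.5761] -- h3 gets the most weight
print(c)       # ~[0.788, 0.788]
```

Drill this by hand first (exp, sum, divide), then verify against the printout.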
d_k = d_model / h = 512 / 8 = 64 per head.
Parameters: Each of W_Q, W_K, W_V projects from d_model → d_model (all heads combined): 512 × 512 = 262,144 params each. W^O projects concatenated output back: 512 × 512 = 262,144 params. Total: 4 × 262,144 = 1,048,576 parameters for one multi-head attention sub-layer (excluding biases).
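The same count as a tiny helper, so the arithmetic can be re-derived quickly (the bias option is an addition for completeness; the count above excludes biases):

```python
def mha_param_count(d_model, include_bias=False):
    """W_Q, W_K, W_V, W_O: each is d_model x d_model (all heads combined)."""
    per_matrix = d_model * d_model + (d_model if include_bias else 0)
    return 4 * per_matrix

print(mha_param_count(512))                     # 4 * 512*512 = 1,048,576
print(mha_param_count(512, include_bias=True))  # 4 * (512*512 + 512) = 1,050,624
```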
Code / Bug-Finding
5 questions

Bug: Using one matrix W for all three: Q=XW, K=XW, V=XW. Fix: Use separate matrices: Q=XW_Q, K=XW_K, V=XW_V. With a shared matrix, Q and K are identical, so each token's query matches its own key best — tokens mostly attend to themselves, and the model cannot learn separate "what I'm looking for" (query) vs. "what I offer" (key) roles.
Bug: Dividing by √(seq_length) instead of √(d_k). Fix: Scale by √d_k (dimension of the key vectors). The scaling prevents dot products from growing too large with higher dimensions, which would push softmax into regions with tiny gradients.
Bug: score = QVᵀ instead of QKᵀ. Fix: Attention scores are computed between queries and keys: score = QKᵀ/√d_k. V is only used after softmax to compute the weighted output. Q-K matching determines "where to look"; V provides "what to read".
Bug: Using raw scores to weight V: output = scores · V. Fix: Must apply softmax first: output = softmax(scores) · V. Without softmax, the weights aren't normalized to sum to 1, and the attention mechanism doesn't produce proper probability-weighted combinations.
Bug: Decoder runs independently without receiving encoder information. Fix: Decoder must be initialized with the encoder's final hidden state (basic seq2seq) or receive attention-weighted encoder states at each timestep (attention mechanism). Without this connection, the decoder has no knowledge of the input.
True/False
3 questions

False. Self-attention has O(n²·d) complexity; recurrent has O(n·d²). When sequence length n exceeds representation dimension d, self-attention is more expensive. Self-attention wins on parallelization (O(1) sequential ops) and path length, but not always on raw computation cost.
True. The decoder uses masked self-attention: future positions are set to −∞ before softmax, zeroing their attention weights. This preserves the autoregressive property — the prediction for position t depends only on positions < t (no peeking at future tokens during generation).
False. BERT uses only the transformer encoder (no decoder). It is designed for understanding tasks (classification, NER, QA), not autoregressive generation. The bidirectional context in BERT — seeing both left and right via masked language modeling — is possible precisely because it doesn't generate tokens left-to-right like a decoder.
Conceptual
20 questions

Verbatim from mock exam.
SGD: Updates parameters using gradient from ONE sample at a time. Noisy but fast updates. Mini-batch GD: Updates using gradient averaged over a small batch (e.g., 32 samples). Less noisy than SGD, more efficient than full batch. Full batch GD: uses ALL training data per update — stable but slow.
Verbatim from mock exam.
Why: Bad initialization → vanishing/exploding gradients, slow convergence, or getting stuck. All-zeros = all neurons learn the same thing (symmetry problem).
Xavier/Glorot: \(W \sim \mathcal{N}(0, \frac{2}{n_{in}+n_{out}})\) — designed for sigmoid/tanh. Keeps variance stable across layers.
He/Kaiming: \(W \sim \mathcal{N}(0, \frac{2}{n_{in}})\) — designed for ReLU. Accounts for ReLU killing half the values.
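Both schemes boil down to picking a standard deviation; a NumPy sketch (the layer sizes are assumed) that also checks the resulting spread:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Glorot normal: variance 2/(n_in + n_out), for sigmoid/tanh layers."""
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_init(n_in, n_out):
    """He/Kaiming normal: variance 2/n_in, for ReLU layers."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W_xavier = xavier_init(256, 256)   # target std = sqrt(2/512) = 0.0625
W_he = he_init(256, 256)           # target std = sqrt(2/256) ~ 0.0884
print(round(W_xavier.std(), 4), round(W_he.std(), 4))
```

Note the second argument of N(0, ·) in the formulas is the variance; the code passes its square root (the std) to the sampler.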
Verbatim from mock exam.
Weight decay adds a penalty proportional to the squared magnitude of weights to the loss function: \(L_{total} = L_{original} + \frac{\lambda}{2}||w||^2\). This encourages smaller weights, preventing any single weight from dominating. Purpose: Regularization — prevents overfitting by penalizing model complexity.
Verbatim from mock exam.
If dropout rate = p (e.g., 0.5) and we DON'T scale during training, then at test time we must multiply all weights by (1-p). Why? During training, on average only (1-p) fraction of neurons are active. At test time all neurons are active, so outputs would be too large by factor 1/(1-p). Multiplying by (1-p) compensates.
Alternative (inverted dropout): scale by 1/(1-p) during TRAINING, then no scaling needed at test time.
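Inverted dropout in a few lines of NumPy (a sketch; real frameworks do this inside their dropout layers):

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(x, p, train=True):
    """Drop units with prob p and scale survivors by 1/(1-p) during TRAINING,
    so test time needs no rescaling at all."""
    if not train:
        return x                         # test time: identity, all units active
    mask = rng.random(x.shape) >= p      # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

x = np.ones(10_000)
y = inverted_dropout(x, p=0.5)           # survivors become 1/(1-0.5) = 2.0
print(round(y.mean(), 3))                # ~1.0: expected activation preserved
print(inverted_dropout(x, p=0.5, train=False).mean())  # exactly 1.0 at test time
```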
Why ReLU: (1) Simple to compute: max(0,z), (2) No vanishing gradient for z>0 (gradient = 1), (3) Sparse activation (many zeros = efficient). Why variations: ReLU has the "dying ReLU" problem — neurons with z<0 always output 0 and stop learning. Variations like Leaky ReLU (small slope for z<0) and ELU address this.
Leaky ReLU: f(z) = max(αz, z) with small α like 0.01. Parametric ReLU (PReLU): same but α is learned. ELU: f(z) = z for z>0, α(eᶻ-1) for z≤0 — smooth and can output negative values. SELU: self-normalizing, maintains mean/variance across layers. GELU: f(z) = z·Φ(z) (Φ = standard normal CDF) — a smooth, probabilistic variant of ReLU used in transformers.
Architecture choices are hyperparameters selected via experimentation: (1) Depth: deeper = more abstract features but harder to train, (2) Width: wider layers = more capacity per layer, (3) Activation: ReLU for hidden layers (default), softmax for multi-class output, sigmoid for binary output. Use validation performance to guide choices.
Patience in early stopping = the number of epochs to wait after the last improvement in validation loss before stopping training. E.g., patience=5 means: if validation loss doesn't improve for 5 consecutive epochs, stop. Prevents overfitting by stopping before the model starts memorizing training data.
\(y = f(x) + x\) — the skip/residual connection. The network learns the residual f(x) = y - x, which is easier to optimize. Helps with vanishing gradients in deep networks because gradients can flow directly through the skip connection.
During training: randomly set neurons to 0 with probability p. Each forward pass uses a different "thinned" network. Effect: (1) Prevents co-adaptation — neurons can't rely on specific other neurons, (2) Implicit ensemble — averaging over 2ⁿ possible sub-networks, (3) Adds noise → regularization. At test time: use all neurons (with appropriate scaling).
Kaiming/He initialization: \(W \sim \mathcal{N}(0, \frac{2}{n_{in}})\) — variance 2/n_in, i.e. std √(2/n_in), where n_in is the number of input connections. Designed specifically for ReLU activations. The factor of 2 compensates for ReLU zeroing out roughly half the values, which would otherwise halve the variance at each layer.
AdaGrad adapts the learning rate per parameter. It divides the learning rate by the square root of the sum of all past squared gradients: \(w_t = w_{t-1} - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot g_t\). Parameters with large past gradients get smaller updates. Problem: learning rate monotonically decreases → can stop learning too early. RMSProp and Adam fix this.
When gradients become very large (exploding gradients), clip them to a maximum norm. If ||g|| > threshold, scale g down to g · (threshold / ||g||). Prevents unstable training, especially in RNNs where BPTT can cause gradient magnitudes to grow exponentially across timesteps.
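The clipping rule in code (a NumPy sketch; in PyTorch one would reach for torch.nn.utils.clip_grad_norm_ instead):

```python
import numpy as np

def clip_by_norm(g, threshold):
    """If ||g|| > threshold, rescale g to norm = threshold; direction is kept."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([30.0, 40.0])                # ||g|| = 50 -> "exploding"
clipped = clip_by_norm(g, threshold=5.0)
print(clipped)                            # [3. 4.]: same direction, norm 5
print(clip_by_norm(np.array([0.3, 0.4]), 5.0))  # small gradient: unchanged
```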
Artificially increase training data size by applying transformations (rotation, flipping, noise, cropping for images; synonym replacement, back-translation for text). Helps prevent overfitting by exposing the model to more variation. Particularly important when training data is limited.
Add skip connections that bypass one or more layers: y = F(x) + x. Benefits: (1) Gradients flow directly through skip connections → no vanishing gradient even in very deep networks (100+ layers), (2) Network only needs to learn the residual F(x) = y - x, which is often easier, (3) Worst case: if F(x) ≈ 0, the layer becomes identity → doesn't hurt performance.
Motivation: Internal covariate shift — each layer's input distribution changes as previous layers update, slowing training. How: For each mini-batch, normalize activations: \(\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}\), then scale and shift: y = γx̂ + β (γ, β are learned). At test time: use running averages of μ and σ² from training.
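The training-time normalization step as a NumPy sketch (the batch size, feature count, and γ=1, β=0 choice are assumptions; the test-time running averages are omitted):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # ~zero mean, ~unit variance
    return gamma * x_hat + beta              # learned scale/shift restore capacity

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(32, 4))       # batch of 32, 4 shifted features
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(np.round(y.mean(axis=0), 6))           # ~0 per feature
print(np.round(y.std(axis=0), 4))            # ~1 per feature
```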
Curves: Training loss decreases, validation loss decreases then increases — the gap is overfitting. Strategies: (1) More training data, (2) Regularization (L1/L2/weight decay), (3) Dropout, (4) Early stopping, (5) Data augmentation, (6) Reduce model capacity, (7) Batch normalization.
Sigmoid saturates at 0 and 1, where gradient ≈ 0. In a deep network, gradients multiply through layers: ∂L/∂w₁ = ∂L/∂yₙ · σ'(zₙ) · ... · σ'(z₁). Since σ'(z) ≤ 0.25, the gradient shrinks exponentially with depth. Solutions: Use ReLU activation, residual connections, proper initialization (Xavier/He), or LSTM for recurrent networks.
From 2024 Mock Exam.
Behavior: This shows classic overfitting. Training loss keeps decreasing but dev loss starts increasing after a point — the model memorizes training data instead of learning generalizable patterns.
Two strategies:
1. Early stopping: Stop training at the point where dev loss is minimal (before it starts rising). Monitor validation loss and save the best checkpoint.
2. Regularization: Add L2 weight decay, dropout, or data augmentation to constrain the model and reduce overfitting.
Other valid answers: reduce model complexity, get more training data, batch normalization.
From 2024 Mock Exam.
Problem: Vanishing gradients. Sigmoid's derivative has max value 0.25. With 20 layers, gradients get multiplied ~20 times by values ≤ 0.25 → gradients shrink to near zero → early layers stop learning.
Detection: (1) Monitor gradient magnitudes per layer — if early layers have near-zero gradients, that's vanishing. (2) Training loss plateaus very early. (3) Early layer weights barely change across epochs.
Solutions: (1) Replace sigmoid with ReLU (gradient = 1 for positive inputs). (2) Use residual connections (skip connections bypass the gradient bottleneck). (3) Use He initialization designed for ReLU. (4) Use batch normalization.
Code / Bug-Finding
1 question

Verbatim from mock exam.

1   no_improvement = 0
2   patience = 10
3   best_loss = 0
4
5   for epoch in range(1000):
6       model.train()
7       train(model, train_loader)
8
9       model.eval()
10      val_loss = evaluate(model, val_loader)
11
12      if val_loss < best_loss:
13          best_loss = val_loss
14          no_improvement = 0
15          save_model(model, 'best.pt')
16      else:
17          no_improvement += 1
18
19      if no_improvement >= patience:
20          break
21
22  save_model(model, 'final.pt')
Find 4 conceptual mistakes. NOTE: Conceptual errors are logical errors of the algorithm, NOT syntax mistakes, typos, or runtime bugs.
Bugs:
1. Line 3: best_loss = 0 should be best_loss = float('inf') — losses are non-negative, so val_loss < best_loss (= 0) is never true and no checkpoint ever gets saved.
2. Line 22: Saves the LAST model (possibly overfit) instead of loading the BEST model. Should load 'best.pt' at the end.
3. Missing: No torch.no_grad() context during evaluation — wastes memory computing gradients during validation.
4. Missing: After break, should load best model before any final evaluation or saving.
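A runnable sketch of the corrected logic. The validation losses are dummy numbers standing in for real train/evaluate calls, and the checkpoint tuple stands in for save_model/load; the torch.no_grad() fix lives inside the real evaluate function:

```python
import math

# dummy per-epoch validation losses (stand-in for train + evaluate)
val_losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.7, 0.71, 0.72, 0.73, 0.74]

best_loss = math.inf          # fix 1: start at +inf, not 0
best_state = None
patience, no_improvement = 3, 0

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss = val_loss
        no_improvement = 0
        best_state = ("checkpoint", epoch)  # stand-in for save_model(..., 'best.pt')
    else:
        no_improvement += 1
    if no_improvement >= patience:
        break                               # stop once patience is exhausted

# fixes 2 & 4: restore/keep the BEST checkpoint, not the last (overfit) one
print(best_loss, best_state)                # 0.6 ('checkpoint', 2)
```

Here the loss bottoms out at epoch 2, patience runs out three epochs later, and the epoch-2 checkpoint is what survives.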
True/False
1 question

| Statement | T | F |
|---|---|---|
| Gradient clipping helps in the case of exploding gradients. | ✓ | |
| Given a training set with 100 examples, stochastic gradient descent performs one update step per epoch. | | ✓ |
| Dropout is a regularization technique. | ✓ | |
| A fully-connected layer with ReLU activation and residual connection has the form relu(Wx + b) + x. | ✓ | |
| Highway Network automatically removes unnecessary layers. | | ✓ |
From 2024 Mock Exam. Key traps:
• SGD with 100 examples: SGD processes ONE example at a time → 100 updates per epoch, not 1. (Batch GD would do 1 update per epoch.)
• ResNet formula: relu(Wx + b) + x is the correct residual form — TRUE.
• Highway Network: does NOT remove layers. It learns gating functions that control how much information flows through vs. bypasses each layer. The layers are still there; the network learns to route around them.