TIL: An MLP is but one kind of feedforward network, specifically the kind with fully connected layers. Other kinds of feedforward networks include CNNs.
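For concreteness, a minimal PyTorch sketch (the layer sizes are arbitrary placeholders):

```python
import torch.nn as nn

# An MLP is a feedforward network built only from fully connected
# (Linear) layers with nonlinearities in between.
mlp = nn.Sequential(
    nn.Linear(784, 128),  # fully connected hidden layer
    nn.ReLU(),            # nonlinearity
    nn.Linear(128, 10),   # fully connected output layer
)
```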
TIL: The policy gradient used to update the policy takes the general form of an expected weighted sum over the trajectory. The main summation term is the gradient of the log-likelihood of the policy's actions; the summation weights depend on the policy optimization approach.
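In symbols, the general form (standard notation, with a generic per-step weight):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T} \Psi_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
    \right]
```

Here the weight is the whole-trajectory return in REINFORCE, the reward-to-go, or an advantage estimate, depending on the approach.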
TIL: In on-policy learning, the action used to update the target policy is the next action actually taken (target policy = behavior policy). In off-policy learning, the action used in the update is not necessarily the next action taken, since experience comes from a separate behavior policy.
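The textbook contrast is SARSA vs. Q-learning; the only difference is which action the bootstrap target uses:

```latex
\text{SARSA (on-policy):}\quad
  Q(s_t, a_t) \leftarrow Q(s_t, a_t)
    + \alpha\,\bigl[r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\bigr]

\text{Q-learning (off-policy):}\quad
  Q(s_t, a_t) \leftarrow Q(s_t, a_t)
    + \alpha\,\bigl[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\bigr]
```

SARSA bootstraps from the action actually taken next; Q-learning bootstraps from the greedy action, which need not be the one the behavior policy takes.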
TIL: Attention mechanisms were initially developed to augment the RNN EncDec architecture, addressing the limitation that the original RNN decoder could only access the encoder's final hidden state, not its previous hidden states over the input sequence.
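A sketch of the Bahdanau-style fix: at each decoder step t, a context vector c_t is built as a softmax-weighted sum over all encoder hidden states h_i (score is whatever alignment function the variant uses):

```latex
\alpha_{t,i} = \mathrm{softmax}_i\bigl(\mathrm{score}(s_{t-1}, h_i)\bigr),
\qquad
c_t = \sum_i \alpha_{t,i}\, h_i
```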
TIL: While the OG transformer model used a pre-defined positional encoding that remained fixed during training, early OpenAI GPT models used absolute positional embeddings that were optimized during training.
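A minimal sketch of the two styles (dimensions are placeholders):

```python
import torch
import torch.nn as nn

d_model, max_len = 64, 128

# Fixed sinusoidal encoding (original transformer): computed once, never trained.
pos = torch.arange(max_len).unsqueeze(1)
div = torch.exp(torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(10000.0)) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(pos * div)
pe[:, 1::2] = torch.cos(pos * div)

# Learned absolute positional embeddings (early GPT): a trainable lookup table.
learned_pe = nn.Embedding(max_len, d_model)  # weights updated by the optimizer
```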
TIL: Raw text is first processed into word and special-character tokens. Then a tokenizer uses a vocabulary to map tokens to integer IDs and vice versa. Special context tokens (`<|endoftext|>`) are included in the vocabulary.
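A toy word-level sketch of that round trip (the regex and `<|endoftext|>` handling are illustrative, not any particular library's behavior):

```python
import re

text = "Hello, world. <|endoftext|> Goodbye."

# Split into words, punctuation, and the special context token.
tokens = [t for t in re.split(r"(<\|endoftext\|>|[,.\s])", text) if t.strip()]

# Vocabulary maps token -> integer ID; the inverse maps IDs back to tokens.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
inv_vocab = {i: tok for tok, i in vocab.items()}

ids = [vocab[t] for t in tokens]               # encode
decoded = " ".join(inv_vocab[i] for i in ids)  # decode
```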
TIL: "when we say language models "understand," we mean that they can process and generate text in ways that appear coherent and contextually relevant, not that they possess human-like consciousness or comprehension."
TIL: "when we say language models "understand," we mean that they can process and generate text in ways that appear coherent and contextually relevant, not that they possess human-like consciousness or comprehension."
Section: 04_mnist_basics
TIL: Up until the 1990s, ML research usually involved neural nets with only one nonlinear layer, varying the width rather than the depth. This may have been caused by a misunderstanding of the universal approximation theorem.
Section: 04_mnist_basics
TIL: In classification, accuracy is a poor choice of loss function: it only changes when a prediction flips class, so most weight updates leave it unchanged, yielding zero gradient and "no learning."
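A small PyTorch demonstration of the zero-gradient problem (the two-logit "model" and smooth-loss alternative are made up for illustration):

```python
import torch

logits = torch.tensor([2.0, -1.0], requires_grad=True)
target = torch.tensor([1.0, 0.0])

# Accuracy thresholds predictions, so it is a step function of the weights:
# the comparison ops break the graph and carry no gradient at all.
preds = (logits > 0).float()
acc = (preds == target).float().mean()  # calling acc.backward() would fail

# A smooth surrogate loss does carry gradient signal.
loss = ((torch.sigmoid(logits) - target) ** 2).mean()
loss.backward()
print(logits.grad)  # nonzero -> learning can proceed
```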
Section: 04_mnist_basics
TIL: "Gradient" in ML usually refers to the **computed value** of the function's derivative given input values, rather than the function's derivative expression per math/physics convention.
Section: 04_mnist_basics
TIL: Arthur Samuel describes machine learning as "a mechanism for altering the weight assignment so as to maximize the performance." Kinda cool it doesn't rely on very formal math language.
Section: 04_mnist_basics
TIL: After ensuring the differences between two tensors are between 0 and 1, the squared error "ups the contrast" of those differences relative to the absolute error. This has implications for choosing between the L1 and L2 norms.
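A quick illustration (the difference values are arbitrary):

```python
import torch

diff = torch.tensor([0.1, 0.5, 0.9])  # differences already in [0, 1]

l1 = diff.abs().mean()   # 0.5
l2 = (diff ** 2).mean()  # ~0.357: squaring shrinks small diffs and
                         # lets large ones dominate ("ups the contrast")
print(l1, l2)
```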
TIL: Best practices to consider for evaluating models: 1) test with unseen tasks rather than data instances (continual learning), 2) use metrics informed by domain, not just ML, 3) test on tasks downstream from basic predictions (esp LLMs).
Section: 02_production.ipynb
TIL: For an object detection model, training images can be resized via cropping, squishing, or padding, none of which is ideal. So the "best" solution is to crop **randomly** (keeping at least a minimum fraction of each image).
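In fastai this is `RandomResizedCrop`, typically passed as an `item_tfms`; the size and `min_scale` below are just example values:

```python
from fastai.vision.all import RandomResizedCrop

# Each epoch, crop a random region covering at least 30% of the image,
# then resize it to 128x128 -- the model sees a different crop of each
# image over time, which doubles as data augmentation.
item_tfms = RandomResizedCrop(128, min_scale=0.3)
```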
Section: 01_intro.ipynb
TIL: "Computers, as any programmer will tell you, are giant morons, not giant brains." - Arthur Samuel, "Artificial Intelligence: A Frontier of Automation" doi.org/10.1177/0002...
Section: Chapter 3 - Probability and Information Theory
TIL: While KL divergence is sometimes referred to as a "distance" between distributions P and Q, this is not the best mental model since KL divergence is asymmetric.
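Quick numeric check of the asymmetry (the two distributions are made up):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.4, 0.1])

def kl(a, b):
    """D_KL(a || b) in nats."""
    return np.sum(a * np.log(a / b))

print(kl(p, q))  # ~0.097
print(kl(q, p))  # ~0.109 -- unequal, so KL divergence is not a true distance
```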
Section: Chapter 3 - Probability and Information Theory
TIL: In information theory, cross-entropy quantifies the overall information needed to encode messages whose symbols are sampled from distribution P while wrongly assuming they follow distribution Q.
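In symbols, for discrete distributions:

```latex
H(P, Q) = -\sum_x P(x)\,\log Q(x) = H(P) + D_{\mathrm{KL}}(P \parallel Q)
```

The KL term is the extra message length paid for coding against the wrong distribution Q.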
Section: Chapter 3 - Probability and Information Theory
TIL: In mixture distribution models, the component identity variable c is a kind of latent variable!
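Ancestral sampling from a Gaussian mixture makes this concrete: the latent c is drawn first, then the observation conditioned on it (the mixing weights, means, and stds below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = [0.3, 0.7]                  # P(c): mixing coefficients
means, stds = [0.0, 5.0], [1.0, 0.5]  # per-component Gaussian parameters

c = rng.choice(2, p=weights)       # sample the latent component identity
x = rng.normal(means[c], stds[c])  # sample the observation given c
# Only x is observed; c stays hidden -- a latent variable.
```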
Section: Chapter 3 - Probability and Information Theory
TIL: Covariance measures a *linear* relationship between two variables. So Cov(x,y)=0 does not exclude the possibility of a non-linear relationship between x and y.
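e.g., y = x^2 on a symmetric interval is fully determined by x, yet has (near-)zero covariance:

```python
import numpy as np

x = np.linspace(-1, 1, 1001)
y = x ** 2  # perfect nonlinear dependence on x

cov = np.mean((x - x.mean()) * (y - y.mean()))
print(cov)  # ~0: covariance completely misses the quadratic relationship
```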
Section: Chapter 3 - Probability and Information Theory
TIL: Frequentist probability refers to the rate at which outcomes occur over infinitely repeatable events. Bayesian probability refers to a degree of belief about a future, unobserved outcome.