If anyone wants to delve into machine learning, one of the superb resources I have found is Stanford's "Probability for Computer Scientists" (https://www.youtube.com/watch?v=2MuDZIAzBMY&list=PLoROMvodv4...).
It delves into the theoretical underpinnings of probability theory and ML, IMO better than any other course I have seen. (Yeah, Andrew Ng is legendary, but his course demands some mathematical familiarity with linear algebra topics.)
And of course, for deep learning, 3b1b is great for getting some visual introduction (https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQ...).
I watched the 3b1b series on neural nets years ago, and it still accounts for 95% of my understanding of AI in general.
I’m not an ML person, but still. That guy has a serious gift for explaining stuff.
His video on the uncertainty principle explained stuff to me that my entire undergrad education failed to!
> That guy has a serious gift for explaining stuff
I'd like to challenge this idea.
I don't believe he's more gifted than other people. I strongly believe that the point is he spent a lot of time and effort to get better at explaining stuff.
He contemplated feedback and improved his explanations throughout the years.
His videos are excellent because he poured himself into making them excellent, not because he has a gift.
In my experience the professors who lack this ability do so because they don't put enough effort into it, not because they were born without it.
You're probably reading too much into previous poster's choice of the word "gift".
Most likely it is a slightly misused idiom rather than intending to convey that the teaching ability was obtained without effort.
It could be one or the other, or both; being gifted and spending time to get it right are not mutually exclusive.
To be very good at something it is necessary, but not sufficient, to have a talent for it. The other 85% is hard work. You aren't going to pull just anyone off the street and have the same level of instruction, no matter how motivated they are.
It helps that 3b1b doesn't start with a curriculum and then has to figure out how to teach it. Instead he can select topics to suit his style.
https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_6700...
From a comment I posted elsewhere, for written versions:
There is a course reader for CS109 [1]. You can download a PDF version of it.
There is also a book [2] for the excellent Caltech course [3].
[1] https://chrispiech.github.io/probabilityForComputerScientist...
[2] https://www.amazon.com/Learning-Data-Yaser-S-Abu-Mostafa/dp/...
[3] https://work.caltech.edu/telecourse
Your first two links don't work
That's because they posted them somewhere else (easy mistake to make.. HN doesn't show you the full link in a comment, so copy/paste just copies the ellipsis)
https://chrispiech.github.io/probabilityForComputerScientist...
https://www.amazon.com/Learning-Data-Yaser-S-Abu-Mostafa/dp/...
Thanks. Sorry for the oversight.
Apparently the word “delve” is the biggest indicator of the use of ChatGPT according to Paul Graham.
That seems utterly bizarre to me. I don't use "delve" frequently myself, but it is common enough that it doesn't jump out as an unusual word. Perhaps it is overused or used in a not-exactly-usual context that tips one off that it is LLM-generated, but by itself it signifies nothing to me.
It is a very common word in Nigerian-style English, and Nigeria was a very common place to outsource RLHF tasks to. A sibling comment has a link, but it is also easy to google.
As a non-native speaker, I didn't know the word "delve", but now I do. I think the internet community is learning from LLMs?
I’d love to see an article delve into why that is.
https://pshapira.net/2024/03/31/delving-into-delve/
Because it's common in Nigerian English, which is where they outsourced a lot of the RLHF conditioning work to.
Really!? Do you have a source for this? This would be really interesting if true.
https://www.theguardian.com/technology/2024/apr/16/techscape...
It does kind of go with "deep" though when Deep Learning is the topic. Delve into the depths.
Saying that kind of stuff is the biggest indicator of Paul Graham (pg) himself
Non native speaker here. Will remember this.
Hm... I saw that; I have used it multiple times in my comment. I was just trying to convey the meaning.
What is the right use of the word? What would be the right word to use here?
Native English speaker here. It was the right word. At the same time, while “delve” is common enough to be recognized, it’s not that commonly used in American English, so I also was wondering if this was AI generated.
So ChatGPT or Nigerians or me apparently... :`(
Absolutely, here’s why.
Nonsense. ChatGPT uses the word a lot precisely because people used it a lot.
Caltech's Learning from Data was really good too, if someone is looking for a theoretical understanding of ML topics.
https://work.caltech.edu/telecourse
I highly recommend the course you've mentioned (by Yaser Abu-Mostafa). In fact I still recommend it for picking up the basics; very good mix of math and intuition, Abu-Mostafa himself is a terrific teacher, and he is considerate and thoughtful in responding to questions at the end of his presentations. The last part is important if you're a beginner: it builds confidence in you that it's probably OK to ask what you might consider a simple question - it still deserves a good answer. The series is a bit dated now in terms of what it covers, but still solid as a foundational course.
Just watched the whole thing. Thanks! I can't get in to my Masters CS: AI program at UC Berkeley because I'm dumb, but seeing this 1st day of a Probability class kinda felt like I was beginning that program haha.
I will add a great find for starting one's AI journey https://www.youtube.com/watch?v=_xIwjmCH6D4 . Kind of needs one to know intermediate CS since 1st step is "learn Python".
and if anyone is interested in delving more deeply into the statistical concepts & results referenced in the paper of this post (e.g. VC-dimension, PAC-learning, etc), I can recommend this book: https://amzn.eu/d/7Zwe6jw
Yeah I took CS109 (through SCPD), it was a blast. But it took some serious time commitment.
Looks nice - are there written versions?
There is a course reader for CS109 [1]. You can download pdf version of this.
There is also book[2] for excellent caltech course[3].
[1] https://chrispiech.github.io/probabilityForComputerScientist...
[2] https://www.amazon.com/Learning-Data-Yaser-S-Abu-Mostafa/dp/...
[3] https://work.caltech.edu/telecourse
Great recommendations
Fully agree! 3blue1brown has single-handedly taught me the majority of what I've needed to know about it.
I actually started building my own neural network framework last week in C++! It's a great way to delve into the details of how they work. It currently supports only dense MLPs, but does so quite well, and work is underway for convolutional and pooling layers on a separate branch.
https://github.com/perkele1989/prkl-ann
Agreed, but PAC-Bayes or other descendants of VC theory is probably not the best explanation. The notion of algorithmic stability provides a (much) more compelling explanation. See [1] (particularly Sections 11 and 12)
[1] https://arxiv.org/abs/2203.10036
I'm a huge fan of HN just for replies such as this that smash the OP's post/product with something better. It's like at least half the reason I stick around here.
Thanks for the great read.
>smash with something better
Not a fan of the aggressive rhetoric here...
I too felt threatened
Yeah, and it's not "better", but actually less general, relying on optimization/GD, unlike OP.
Violent disagreement is violence.
Hard disagree. Your link relies on gradient descent as an explanation, whereas OP explains why optimization is not needed to understand DL generalization. PAC-Bayes, and the other different countable hypothesis bounds in OP also are quite divergent from VC dimension. The whole point of OP seems to be that these other frameworks, unlike VC dimension, can explain generalization with an arbitrarily flexible hypothesis space.
Statistical mechanics is the lens that makes most sense to me, and it's well studied.
Good read, thanks for sharing
Anyone who wants to demystify ML should read: The StatQuest Illustrated Guide to Machine Learning [0] By Josh Starmer.
To this day I haven't found a teacher who could express complex ideas as clearly and concisely as Starmer does. It's written in an almost children's book like format that is very easy to read and understand. He also just published a book on NN that is just as good. Highly recommend even if you are already an expert as it will give you great ways to teach and communicate complex ideas in ML.
[0]: https://www.goodreads.com/book/show/75622146-the-statquest-i...
I have followed a fair few StatQuest and other videos (treadmills with Youtube are great for fitness and learning in one)
I find that no single source seems to cover things in a way that I easily understand, but cumulatively they fill in the blanks of each other.
Serrano Academy has been a good source for me as well. https://www.youtube.com/@SerranoAcademy/videos
The best tutorials give you a clear sense that the teacher has a clear understanding of the underlying principles and how/why they are applied.
I have seen a fair few things that are effectively:
'To do X, you {math thing}', while also creating the impression that they don't understand why {math thing} is the right thing to do, just that {math thing} has a name and produces the result. Meticulously explaining the minutiae of {math thing} substitutes for an understanding of what it is doing.
It really stood out to me when looking at UMAP and seeing a bunch of things where they got into the weeds in the math without explaining why these were the particular weeds to be looking in.
Then I found a talk by Leland McInnes that had this format:
{math thing} is a tool to do {objective}. It works, there is a proof; you don't need to understand it to use the tool, but the info is over there if you want to take a look. These are our objectives, let's use these tools to achieve them.
The tools are neither magical black boxes, nor confused with the actual goal. It really showed the power of fully understanding the topic.
Double Bam
Would also like to add that he has a YouTube channel: https://youtube.com/@statquest
A decade ago the paper "Understanding deep learning requires rethinking generalization" [0] was published. The submission is a response to that paper and subsequent literature.
Deep neural nets are notable for their strong generalization performance: despite being highly overparametrized they do not seem to overfit the training data. They still perform well on hold-out data and very often on out of distribution data "in the wild". The paper [0] noted a particularly odd feature of neural net training: one can train neural nets on standard datasets to fit random labels. There does not seem to be an inductive bias strong enough to rule out bad overfitting. It is in principle possible to train a model which performs perfectly on the training data but gives nonsense on the test data. But this doesn't seem to happen in practice.
The submission argues that this is unsurprising, and fits within standard theoretical frameworks for machine learning. In section 4 it is claimed that this kind of thing ("benign overfitting") is common to any learning algorithm with "a flexible hypothesis space, combined with a loss function that demands we fit the data, and a simplicity bias: amongst solutions that are consistent with the data (i.e., fit the data perfectly), the simpler ones are preferred".
The fact that the third of these conditions is satisfied, however, is non-trivial, and in my opinion is still not well understood. The results of [0] are reproducible with a wide variety of architectures, with or without any form of explicit regularization. If there is an inductive bias toward "simpler solutions" in fitting deep neural nets it has to come either from SGD itself or from some bias which is very generic in architecture. It's not something like "CNNs generalize well on image data because of an inductive bias toward translation invariant features." While there is some work on implicit smoothing by SGD, for example, in my opinion this is not sufficient to explain the phenomena observed in [0]. What I would find satisfying is a reproducible ablation study of neural net training that removed benign overfitting (+), so that it was clear what exactly are the necessary and sufficient conditions for this behavior in the context of neural nets. As far as I know this still has never been done, because it is not known what this would even entail.
(+) To be clear, I think this would not look like "the fit model still generalizes, but we can no longer fit random labels" but rather "the fit model now gives nonsense on holdout data".
[0] https://arxiv.org/abs/1611.03530
Doesn't the simplicity bias come explicitly from regularization techniques, such as dropout or the L2 norm?
Those are not necessary to reproduce benign overfitting
> rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem.
How does deep learning do this? The last time I was deeply involved in machine learning, we used a penalized likelihood approach. To find a good model for data, you would optimize a cost function over model space, and the cost function was the sum of two terms: one quantifying the difference between model predictions and data, and the other quantifying the model's complexity. This framework encodes exactly a "soft preference for simpler solutions that are consistent with the data", but is that how deep learning works? I had the impression that the way complexity is penalized in deep learning was more complex, less straightforward.
You're correct, and the term you're looking for is "regularisation".
There are two common ways of doing this:
* L1 or L2 regularisation: penalises models whose weight matrices are complex (in the sense of having lots of large elements)
* Dropout: train on random subsets of the neurons to force the model to rely on simple representations that are distributed robustly across its weights
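A minimal Python sketch of both ideas (my own toy example, not code from the thread; all names and data are made up): an L2-penalised loss of the "data-fit plus complexity" form described above, and an inverted-dropout mask.

    import numpy as np

    rng = np.random.default_rng(0)

    # L2 regularisation: add lam * ||w||^2 to the data-fit term of the loss.
    def l2_penalized_loss(w, X, y, lam=1e-2):
        fit = np.mean((X @ w - y) ** 2)      # how well we fit the data
        return fit + lam * np.sum(w ** 2)    # plus a penalty on large weights

    # Dropout: during training, zero out a random subset of activations
    # (and rescale) so no single unit can be relied on too heavily.
    def dropout(activations, p=0.5):
        mask = rng.random(activations.shape) >= p
        return activations * mask / (1.0 - p)   # "inverted dropout" rescaling

    w = rng.normal(size=5); X = rng.normal(size=(30, 5)); y = rng.normal(size=30)
    print(l2_penalized_loss(w, X, y))
    print(dropout(np.ones((2, 4)), p=0.5))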
Dropout is roughly equivalent to layer-specific L2 regularization, and it's easy to see why: asymptotically, dropping out random neurons will achieve something similar to shrinking weights towards zero proportional to their (squared) magnitude.
Trevor Hastie's Elements of Statistical Learning has a nice proof that (for linear models) L2 regularization is also semi-equivalent to dimensionality reduction, which you could use to motivate a "simplicity prior" idea in deep learning.
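A numerical illustration of that result (my own sketch, for a toy linear model): the ridge (L2) solution can be written via the SVD of X, with each principal direction shrunk by a factor d^2 / (d^2 + lambda), so low-variance directions are suppressed hardest - a soft form of dimensionality reduction.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 5))
    y = rng.normal(size=50)
    lam = 10.0

    # Direct ridge solution: (X^T X + lam * I)^-1 X^T y
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

    # Same solution via the SVD of X: each direction is shrunk by d^2 / (d^2 + lam),
    # so directions with small singular values are almost removed.
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    beta_svd = Vt.T @ ((d / (d ** 2 + lam)) * (U.T @ y))

    print(np.allclose(beta_ridge, beta_svd))   # True
    print(d ** 2 / (d ** 2 + lam))             # per-direction shrinkage factors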
Yet another way of thinking about it, in the context of ReLU units, is that a layer of ReLUs forms a truncated hyper-plane basis (like splines but in higher dimensions) in feature space, and regularization induces smoothness in this N-dimensional basis by shrinking that basis towards being a flat hyper-plane
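To make that picture concrete, a toy 1-D sketch (my own, with arbitrary random weights): a single hidden layer of ReLUs is a piecewise-linear function with a kink wherever some unit's pre-activation crosses zero, and shrinking the weights flattens it towards a flat hyper-plane (here, a line).

    import numpy as np

    # One hidden layer of ReLUs in 1-D: a piecewise-linear "spline" in x.
    def relu_layer_fn(x, w1, b1, w2):
        h = np.maximum(0.0, np.outer(x, w1) + b1)   # hidden activations
        return h @ w2                                # linear readout

    rng = np.random.default_rng(0)
    w1, b1, w2 = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
    x = np.linspace(-3, 3, 7)
    print(relu_layer_fn(x, w1, b1, w2))              # kinky, piecewise-linear
    print(relu_layer_fn(x, 0.1 * w1, b1, 0.1 * w2))  # shrunk weights: much flatter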
Wow! I think I dimly intuited your first paragraph already; I directionally get why your second might be true (although I'd have thought L1 was even more so, since it encourages zeros which is kind of like choosing a subspace).
Your third paragraph took me ages to get an intuition for - is the idea that regularisation penalises having "sharp elbows" at the join points of your hyper-spline thing? That's mind blowing and such an interesting way to think about what a ReLU layer is doing.
Thanks so much for a thought provoking comment, that's incredibly cool.
The solution to the L1 regularization problem is actually a specific form of the classical ReLU nonlinearity used in deep learning. I’m not sure if similar results hold for other nonlinearities, but this gave me good intuition for what thresholding is doing mathematically!
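If it helps, here is the standard identity I believe is being referred to (my own sketch): the proximal operator of the L1 penalty is soft-thresholding, which can be written with two shifted ReLUs.

    import numpy as np

    relu = lambda z: np.maximum(0.0, z)

    # Soft-thresholding: the solution of min_x 0.5 * (x - v)^2 + lam * |x|
    def soft_threshold(v, lam):
        return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

    v = np.linspace(-3, 3, 13)
    # Identical to a pair of shifted ReLUs:
    print(np.allclose(soft_threshold(v, 1.0), relu(v - 1.0) - relu(-v - 1.0)))  # True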
Here is an example for data-efficient vision transformers: https://arxiv.org/abs/2401.12511
Vision transformers have a more flexible hypothesis space, but they tend to have worse sample complexity than convolutional networks which have a strong architectural inductive bias. A "soft inductive bias" would be something like what this paper does where they have a special scheme for initializing vision transformers. So schemes like initialization that encourage the model to find the right solution without excessively constraining it would be a soft preference for simpler solutions.
I'm not a guru myself, but I'm sure someone will correct me if I'm wrong. :-)
The usual approach to supervised ML is to "invent" the model (layers, their parameters) or more often copy one from a known good reference, then define the cost function and feed it data. "Deep" learning just means that instead of a few layers you use a big number of them.
What you describe sounds like an automated way of tweaking the architecture, IIUC? Never done that, usually the cost of a run was too high to let an algorithm do that for me. But I'm curious if this approach is being used?
Yeah, it's straightforward to reproduce the results of the paper whose conclusion they criticize, "Understanding deep learning requires rethinking generalization", without any (explicit) regularization or anything else that can be easily described as a "soft preference for simpler solutions".
the AdamW optimizer (basically the default in DL nowadays) is doing exactly that
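The relevant part of AdamW is its decoupled weight decay. A simplified sketch (my own, using plain SGD and omitting Adam's moment estimates): the decay term shrinks weights towards zero independently of the data-fit gradient, which is one concrete encoding of a soft preference for smaller-norm solutions.

    import numpy as np

    def sgd_step_decoupled(w, grad, lr=1e-3, weight_decay=1e-2):
        w = w - lr * grad               # follow the data-fit gradient
        w = w - lr * weight_decay * w   # then decay weights towards zero (decoupled)
        return w

    w = np.array([1.0, -2.0, 3.0])
    print(sgd_step_decoupled(w, grad=np.array([0.1, 0.0, -0.2])))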
Well it's not called Mysterious Learning or Different Learning for a reason.
In fact, with how many misnomers there are in the world, I think Deep Learning is actually a pretty great name, all things considered.
It properly communicates (imo) that the training data and resulting weights are complex enough that just looking at the learning/training process on its own is not sufficient to understand the resulting system (vs other "less deep" machine learning where it mostly is).
An interesting example in which "deep" networks are necessary is discussed in this fascinating and popular recent paper on RNNs [1]. Even though the minGRU and minLSTM models they propose don't explicitly model ordered state dependency, they can learn such dependencies as long as they are deep enough (depth >= 3):
> Instead of explicitly modelling dependencies on previous states to capture long-range dependencies, these kinds of recurrent models can learn them by stacking multiple layers.
[1] https://arxiv.org/abs/2410.01201
DNNs do not have special generalization powers. If anything, their generalization is likely weaker than more mathematically principled techniques like the SVM.
If you try to train a DNN to solve a classical ML problem like the "Wine Quality" dataset from the UCI Machine Learning repo [0], you will get abysmal results and overfitting.
The "magic" of LLMs comes from the training paradigm. Because the optimization objective is word prediction, you effectively have a data sample size equal to the number of words in the corpus - an inconceivably vast number. Because you are training against a vast dataset, you can use a proportionally immense model (e.g. 400B parameters) without overfitting. This vast (but justified) model complexity is what creates the amazing abilities of GPT/etc.
What wasn't obvious 10 years ago was the principle of "reusability" - the idea that the vastly complex model you trained using the LLM paradigm would have any practical value. Why is it useful to build an immensely sophisticated word prediction machine, who cares about predicting words? The reason is that all those concepts you learned from word-prediction can be reused for related NLP tasks.
[0] https://archive.ics.uci.edu/dataset/186/wine+quality
You may want to look at this: neural network models with enough capacity to memorize random labels are still capable of generalizing well when fed actual data.
Zhang et al. (2021), 'Understanding deep learning (still) requires rethinking generalization'
https://dl.acm.org/doi/10.1145/3446776
When I was first getting into Deep Learning, learning the proof of the universal approximation theorem helped a lot. Once you understand why neural networks are able to approximate functions, it makes everything built on top of them much easier to understand.
You can listen to an explanation of the paper here: https://www.pdftomp3.com/shared/67d8abf0ecf38326f8973e49
I created this tool last year to listen to a machine learning book; now I use it for ML papers. The explanations are still a bit repetitive, it's not perfect yet.
I wish I had the time to try this:
1.) Grab many GBs of text (books, etc).
2.) For each word, for each next $N words, store distance from current word, and increment count for word pair/distance.
3.) For each word, store most frequent word for each $N distance. [a]
4.) Create a prediction algorithm that determines the next word (or set of words) to output from any user input. Basically this would compare word pairs/distances and find the most probable next set of words (a rough sketch of steps 2-4 is below).
How close would this be to GPT 2?
[a] You could go one step further and store multiple words for each distance, ordered by frequency
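As mentioned in step 4, here is a rough Python sketch of the counting scheme above (my own toy interpretation of steps 2-4, at toy scale; names and corpus are made up):

    from collections import Counter, defaultdict

    N = 3
    counts = defaultdict(Counter)   # counts[(word, distance)][next_word] = frequency

    def train(tokens):
        for i, w in enumerate(tokens):
            for d in range(1, N + 1):
                if i + d < len(tokens):
                    counts[(w, d)][tokens[i + d]] += 1

    def predict_next(context):
        votes = Counter()
        # each of the last N context words "votes" for the next word,
        # weighted by how often it appeared at that distance in training
        for d, w in enumerate(reversed(context[-N:]), start=1):
            for cand, c in counts[(w, d)].items():
                votes[cand] += c
        return votes.most_common(1)[0][0] if votes else None

    train("the cat sat on the mat because the cat was tired".split())
    print(predict_next("because the cat".split()))   # -> 'was' on this toy corpus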
The scaling is brutal. If you have a 20k-word vocabulary and want to do 3-grams, you need a 20000^3 matrix of elements (8 trillion). Most of which is going to be empty.
GPT and friends cheat by not modeling each word separately, but as a large-dimensional “embedding” (just a vector, if you also find new vocabulary silly). The embedding represents similar words near each other in this space. The famous king-man-queen example. So even if your training set has never seen “The Queen ordered the traitor <blank>”, it might have previously seen “The King ordered the traitor beheaded”. The vector representation lets the model use words that represent similar concepts without concrete examples.
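The mechanics of that example, as a toy sketch (hand-made 3-d vectors purely for illustration; real models learn hundreds of dimensions from data, and these names and numbers are mine):

    import numpy as np

    # Made-up toy "embeddings"; dimensions roughly mean [royalty, male, female].
    emb = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "queen": np.array([0.9, 0.1, 0.8]),
        "man":   np.array([0.1, 0.9, 0.1]),
        "woman": np.array([0.1, 0.1, 0.9]),
    }

    def nearest(vec):
        cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        return max(emb, key=lambda w: cos(vec, emb[w]))

    # king - man + woman lands nearest to queen in this toy space
    print(nearest(emb["king"] - emb["man"] + emb["woman"]))   # 'queen'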
Importantly, though, LLMs do not take the embeddings as input during training; they take the tokens and learn the embeddings as part of the training.
Specifically all Transformer-based models; older models used things like word2vec or elmo, but all current LLMs train their embeddings from scratch.
And tokens are now going down to the byte level:
https://ai.meta.com/research/publications/byte-latent-transf...
You shouldn't need to allocate every possible combination !_! if you dynamically add new pairs/distances as you find them. I'm talking simple for loops.
you might enjoy this read, which is an up-to-date document from this year laying out what was the state of the art 20 years ago:
https://web.stanford.edu/~jurafsky/slp3/3.pdf
Essentially you just count every n-gram that's actually in the corpus, and "fill in the blanks" for all the 0s with some simple rules for smoothing out the probability.
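A hedged sketch of the simplest such smoothing rule (add-one/Laplace smoothing for a bigram model; my own toy example): unseen bigrams get a small nonzero probability instead of zero.

    from collections import Counter

    def bigram_prob(w_prev, w_next, tokens):
        vocab = set(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        unigrams = Counter(tokens)
        # add-one smoothing: (count + 1) / (context count + vocabulary size)
        return (bigrams[(w_prev, w_next)] + 1) / (unigrams[w_prev] + len(vocab))

    tokens = "the cat sat on the mat".split()
    print(bigram_prob("the", "cat", tokens))   # seen bigram
    print(bigram_prob("the", "sat", tokens))   # unseen bigram, still > 0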
Claude Shannon was interested in this kind of thing and had a paper on the entropy per letter or word of English. He also has a section in his famous "A Mathematical Theory of Communication" with experiments using the conditional probability of the next word based on the previous n=1,2 words from a few books. I wonder if the conditional entropy approaches zero as n increases, assuming ergodicity. But the number of entries in the conditional probability table blows up exponentially. The trick of combining multiple pairwise (n=1) statistics at different distances sounds interesting, and reminds me a bit of contrastive prediction ML methods.
Anyway the experiments in Shannon's paper sound similar to what you describe but with less data and distance, so it should give some idea of how it would look: From the text:
> 5. First-order word approximation. Rather than continue with tetragram, ..., n-gram structure it is easier and better to jump at this point to word units. Here words are chosen independently but with their appropriate frequencies.
> REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
> 6. Second-order word approximation. The word transition probabilities are correct but no further structure is included.
> THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
There is some recent work [0] that explores this idea, scaling up n-gram models substantially while using word2vec vectors to understand similarity. Used to compute something the authors call the Creativity Index [1].
[0]: https://infini-gram.io [1]: https://arxiv.org/abs/2410.04265v1
this is pretty close to how language models worked in the 90s-2000s. deep language models -- even GPT 2 -- are much much better. on the other hand, the n-gram language models are "surprisingly good" even for small n.
Pretty sure this wouldn't produce anything useful: it would generate incoherent gibberish that looks and sounds like English but makes no sense. This ignores perhaps the most important element of LLMs, the attention mechanism.
And, the attention mechanism scales quadratically with context length. This is where all of the insane memory bandwidth requirements come from.
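For reference, a minimal sketch of scaled dot-product attention for a single head (my own toy code, not from the thread): the intermediate score matrix is T x T, which is where the quadratic cost in context length comes from.

    import numpy as np

    def attention(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # shape (T, T): quadratic in T
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
        return weights @ V

    T, d = 6, 4
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
    print(attention(Q, K, V).shape)   # (6, 4), computed via a (6, 6) score matrix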
The problem is that for any reasonable value of N (>100) you will need prohibitive amounts of storage. And it will be extremely sparse. And you won’t capture any interactions between N-99 and N-98.
Transformers do that fairly well and are pretty efficient in training.
> How close would this be to GPT 2
Here's a post from 2015 doing something a bit like this [1]
[1] https://nbviewer.org/gist/yoavg/d76121dfde2618422139
I actually tried something like that with the Bible back in 2021. Scaling is a bitch; it's very difficult to train these types of models.
Markov chains are very very far off from gpt2.
Aren't they technically the same? GPT picks the next token given the state of current context, based on probabilities and a random factor. That is mathematically equivalent to a Markov chain, isn't it?
Markov chains don't account for the full history. While all LLMs do have a context length, this is more a practical limitation based on resources rather than anything implicit in the model.
Everything has meaning in precise relation to its frequency of co-occurrence with every other thing.
I, too, have been mulling this. Word to word, paragraph to paragraph. Even letter to letter.
Also what if you processed text in signal space? I keep wondering if that’s possible. Then you get it all at once rather than windows. Use a derivative of change for every page, so the phase space is the signal end to end.
I've seen the same patterns in neural networks that I've seen in simpler algorithms. It's less about mystery and more about complexity.
Correct me if I'm wrong, but an artificial neuron is just good old linear regression followed by an activation function to make it non linear. Make a network out of it and cool stuff happens.
In a sense; but linear regression can be computed exactly, so it refers to a specific technique for producing a linear model.
Most artificial neurons are trained stochastically rather than holistically, i.e. rather than looking at the entire training set and computing the gradient to minimize the squared loss or something similar, they look at each training example and compute the local gradient and make small changes in that direction.
In addition, the "activation function" almost universally used now is the rectified linear unit, which is linear for positive input and zero for negative input. This is non-decreasing as a function, but because it is flat at zero for all negative inputs, there is no additional loss accrued for overcorrecting in the negative direction.
Given this, using the term "linear regression" to describe the model of an artificial neuron is not really a useful heuristic.
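To illustrate the stochastic-training point above, a hedged sketch (my own toy example) of per-example updates for a single linear unit with squared loss, as opposed to solving the least-squares problem in closed form:

    import numpy as np

    def sgd_epoch(w, b, X, y, lr=0.01):
        for x_i, y_i in zip(X, y):        # one training example at a time
            err = (w @ x_i + b) - y_i     # prediction error on this example
            w -= lr * err * x_i           # local gradient of 0.5 * err^2 w.r.t. w
            b -= lr * err                 # ... and w.r.t. b
        return w, b

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.3
    w, b = np.zeros(3), 0.0
    for _ in range(50):
        w, b = sgd_epoch(w, b, X, y)
    print(w, b)   # approaches [1, -2, 0.5] and 0.3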
This is like saying "the human brain is just some chemistry." You have the general idea correct, but there's a lot more going on that just that, and the emergent system is so much more complex that it deserves its own separate field.
Although with extra irony. "linear regression followed by an activation function to make it non linear". So it isn't good old linear regression because it is explicitly delinearised.
MLPs are compositions of generalized linear models. That's not very enlightening though; the "mysterious" part is the macroscopics of the composition, which you can't really understand with the tools of statistics.
Yes. An artificial neuron, as a mathematical function f, is defined by f(x) = g(wx + b) where x is the input, w is the weight, b is the bias, and g is some non-linear activation function. Is that "good old linear regression followed by an activation function to make it non linear"? Yes, it is exactly that.
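That formula, written out as a hedged Python sketch (my own toy code, nothing from the thread): a single neuron g(w . x + b), plus the composition of such layers that the surrounding comments are discussing.

    import numpy as np

    relu = lambda z: np.maximum(0.0, z)

    def neuron(x, w, b, g=relu):
        return g(w @ x + b)               # literally g(wx + b)

    def mlp(x, layers, g=relu):
        # layers: list of (W, b) pairs; nonlinearity after every layer but the last
        for i, (W, b) in enumerate(layers):
            x = W @ x + b
            if i < len(layers) - 1:
                x = g(x)
        return x

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)
    layers = [(rng.normal(size=(8, 4)), rng.normal(size=8)),
              (rng.normal(size=(1, 8)), rng.normal(size=1))]
    print(neuron(x, rng.normal(size=4), 0.0))
    print(mlp(x, layers))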
So where is the line that something becomes ‘AI’ and is regulated?
The implication that any software is "mysterious" is problematic - there is no "woo" here - the exact state of the machine running the software may be determined at every cycle. The exact instruction and the data it executed with may be precisely determined, as can the next instruction. The entire mythos of any software being a "black box" is just so much advertising jargon, perpetuated by tech bros who want to believe they are part of some Mr. Robot self-styled priestly class.
You're misunderstanding. A level of abstraction is necessary for operation of modern systems. There is no human alive who, given an intermediate step in the middle of some running learning algorithm, is able to understand and mentally model the full system at full man-made resolution, that is, down to the transistor level, on a modern CPU. Someone wishing to understand a piece of software in 2025 is forced to, at some point, accept that something somewhere "does what it says on the tin" and model it thusly rather than having a full understanding.
It's not a misunderstanding at all - but your response is certainly an attempt to obfuscate the point being made. The moment you represent anything in code, you are abstracting a real thing into its digital representation. That digital representation is fully formed at every cycle of the digital system processing it, and the state of the system - all the way down to the transistor level - may be precisely determined. To say otherwise is to make the same error as those who claim that consciousness or understanding are indefinable "extra-ordinary" things that we have to just accept exist without any justification or evidence.
Okay, then, you're just using your own personal definition of "black box" instead of the one everyone else uses.
Something that's a black box is unknown to the speaker. It's not understood to be unknowable to anyone.
So your claim is that there are instructions, data, or both that are unable to be determined in what, is by definition, a fully deterministic machine?
By an individual person, yes. I claim that there exists no single human capable of fully understanding the totality of the software and hardware down to the individual transistor level.
I agree and never claimed that "a single person" could - but just because something is too complex for a single person to fully understand does not make it "mysterious" or a "black box". So what is the claim you are making? Anything beyond the complexity of a single person to understand = magic?
We're just using different definitions for "black box".
My definition is that it's something unknown, yours is that it's something unknowable.
The mystery was never in the "how do computers calculate the probabilities of next tokens" but rather in the "why is it able to work so well" and "what does this individual neuron contribute to the whole model"
I don't know any serious programmer who thinks that, just because each operation is simple, the operation of the whole thing can't be mysterious.
But the weights trained from machine learning are a black box, in the sense that no human designed e.g. the image processing kernels that those weights represent.
That is one reason people are skeptical of them: not only is training a large model at home expensive, and not only is the data too big to trivially store, but the weights are not trivial to debug either.
The mystery is in how the data is encoded in the parameters and why LLM performance scales so well with parameters. The key seems to be almost-orthogonal vectors, which allow neural networks to store so much data: roughly 2^(cn) such vectors can fit in an n-dimensional space, with c a constant. Since almost-orthogonal vectors have very small dot products, they interfere with each other minimally, allowing many concepts to coexist with limited cross-talk, which enables superposition.
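A quick numerical illustration of the "almost orthogonal" point (my own sketch): random unit vectors in high dimensions have pairwise dot products that concentrate near zero, so far more than n roughly independent directions can coexist in n dimensions.

    import numpy as np

    rng = np.random.default_rng(0)

    def max_abs_cosine(n_vectors, dim):
        v = rng.normal(size=(n_vectors, dim))
        v /= np.linalg.norm(v, axis=1, keepdims=True)   # unit vectors
        g = np.abs(v @ v.T)                             # pairwise |cosine similarity|
        np.fill_diagonal(g, 0.0)
        return g.max()

    for dim in (10, 100, 1000):
        print(dim, max_abs_cosine(200, dim))   # worst-case overlap shrinks with dim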