"The real bottleneck for the information processing of “concepts” is actually attention in humans, not the capacity of our sensory organs."
This sounds right, but would massively reduce the brain's input capacity versus LLMs. I can't say I've looked into the research, but human attention is typically quoted to be only 40–60 bits/sec.
That said, I think you're right to point out the huge amount of pretraining that must be required for us to even know what to pay attention to. This is the problem of signal detection, and is a function of maturity.
One thing I'm not quite getting though - if we say that 5 year olds have typically been exposed to around 25 million spoken words and LLMs to billions of written words - is why (if we convert to the common currency of bits) that's not a fair comparison?
Oh, you helped me see that I should probably clarify in that sentence I meant the more abstracted form of concepts, or "consciously processed" concepts. As in, concepts in the shape of whatever "colour" means to Tommy Edison when he talks about colour. Thanks!
And maybe! Though, it might not reduce the "brain's input capacity" as much as you think when compared to LLMs. As long as you are doing the comparison at the correctly aligned apples-to-apples level of abstraction for the LLM too. For example, there's at least 3 levels of input capacities we can see in LLMs.
There's 1) the input that goes into a reasoning model's "final answer" tokens. The "final answer" is currently kind of slow, taking anywhere from a couple seconds to many minutes, depending on how long and how many chains-of-thought the model is using, and how long/short the chains are. And I'm not sure if there's some hard specified max_length for chains of thoughts that AI devs put behind the scenes. Then there's 2) the actual "raw" capacity of the LLM, which is something like 200K token context length for Claude currently, 1 million for Gemini etc. which is able to process however many tokens per second it can to generate the chain-of-thought tokens, and is faster at roughly 10s-100s of tokens per second right now? And all of this happens on 3) the same hardware/“brain” with electricity whizzing around GPUs at the speed of light-ish...
You can go through the same steps with capacity reductions as in system 2 -> system 1 -> system 0 (perception)…all of which also happens on a “brain”…so maybe? Which level of capacity did you mean to compare in the human with which level in the LLM?
For the last question, the comparison itself might or might not be fair, as long as you keep the expectations on both the human and LLM side of things the same after they get the bits. But people often do not. You can see examples of the subtle mismatch in expectations in this other post of mine, specifically the part where I mention the “reversal curse”paper in LLMs and how something like expecting LLMs to be able to automatically learn “B” “is” “A” from “A” “is” “B”, would be like expecting humans to be able to recite the alphabet backwards from learning the ABCs. https://aliceandbobinwanderland.substack.com/i/143739655/terminology-notes
The tricky subtle thing about doing these comparisons is at the level of what you expect the bits to do for you depending on how they are being processed by the brain, not so much the sheer number of bits.
“Which level of capacity did you mean to compare in the human with which level in the LLM?”
I believe the original question was whether it is true that, with respect to language acquisition, LLMs are trained on ‘orders of magnitude’ more data than humans. I’m not sure which of your 3 levels of input capacity that would correspond to, but what I (and I think the original question) are referring to is the ‘proportion of available data that is sampled’.
For LLMs, the available data is everything it is given because, as far as I’m aware, they do not sample / filter / or otherwise ignore any of their training input. For humans, the data available for training is whatever makes it as far as the learning centres of the central nervous system. Defining where that is exactly is probably difficult, but it’s obviously got to be somewhere between the sensory cells and awareness.
That said, the structures of the sensory pathway and the architecture of the LLMs are themselves both products of pretraining in the form of evolution and computer science respectively. However, this line of thinking risks getting into an infinite regress that obviously isn’t helpful.
So in short, while I’m not really answering anything, what I am trying to do is clarify what I think the original question was, i.e. “Is there a significant difference between how much data a newborn baby and an untrained LLM need in order to acquire a certain equivalent level of language proficiency?”
However, that raises the vexing question of what we mean by ‘equivalent’, since human language prioritises context and connection over and above vocabulary and prediction, and we know that a 'fully trained' human and a 'fully trained' LLM are not exactly equivalent. So where do we go from here?
What might be interesting (but probably unethical and impractical!) would be to have an AI somehow ‘piggy back’ on a baby so it can see and hear everything the baby does, and then periodically assess what the two of them have learnt so far. This is one of the many reasons why I’m not a researcher.
RE: I believe the original question was whether it is true that, with respect to language acquisition, LLMs are trained on ‘orders of magnitude’ more data than humans. I’m not sure which of your 3 levels of input capacity that would correspond to, but what I (and I think the original question) are referring to is the ‘proportion of available data that is sampled’.
> So if you look at the last table in the main post (I’ve now captioned it “Main Table”), you’ll see that there are 3 estimation columns. The one highlighted in grey what I think of as “the lowest reasonable limit” because it assumes an information processing rate of 1 bit per second.
I then have a 10 bits per second column, which I say is a rough estimate of the conscious thoughts a person can have. So I think of that as the rough estimate for the training data size for “conscious” linguistic thoughts, so this would be include the sort of thoughts a person blind from birth might have about colours. I currently consider this on the same level as the input that goes into a reasoning model's "final answer" tokens. Meaning, the size of this input is the number of tokens that go into the chains of thoughts of an LLM before it generates its "final answer". (Though I am not 100% confident on this yet. I’m still working through matching the levels correctly.)
Then I have the rightmost column, which I said was roughly the unconscious sensory processing rate. So this would kind of the training data set to train up your perceptual systems in the right way to perceive whatever language you happen to have learned as your native language. This is the part of the training data a blind person from birth would be missing with regards to colour input. I currently consider this most similar to the entire pretraining dataset of an LLM.
————————
RE: For LLMs, the available data is everything it is given because, as far as I’m aware, they do not sample / filter / or otherwise ignore any of their training input. For humans, the data available for training is whatever makes it as far as the learning centres of the central nervous system. Defining where that is exactly is probably difficult, but it’s obviously got to be somewhere between the sensory cells and awareness.
> So actually, in the course of trying to make the context length of LLMs longer and longer, researchers found that they developed a “Needle in the haystack” problem, where even though a model might be given some really long input (e.g., a whole book), the model isn’t automatically able to fully use all the sentences in that long input prompt. This is tested for by burying some really odd, specific, and unexpected sentence within a long block of text, then asking the LLM questions which requires the LLM to use the information in that unexpected sentence to answer correctly. So their “effective” context length is not actually as long as the context length which is advertised on models. (See these links for more information: https://openreview.net/pdf?id=wHLMsM1SrP, https://arize.com/blog-course/the-needle-in-a-haystack-test-evaluating-the-performance-of-llm-rag-systems/)
So in other words, yes, LLMs do ignore some of their input data. The results above aren’t happening at the pretraining level. But as far as I know, no one’s checked. Still, given the results about the haystack problem, I would guess that LLMs might turn out to functionally ignore some parts of their pretraining input data. Specifically, I would guess that they ignore the parts that are the most undifferentiated, given the rest of the tokens in its pretraining data. I’m guessing this because we know e.g., babies exposed one language with a certain range of phonemes at birth will lose their ability to discriminate between phonemes that are present in another language after some time, but babies exposed to both languages will retain the ability to discriminate amongst all the phones present in both languages. The famous example here is the “r” and “l” sound that’s easily perceived by English native speakers and English-Japanese bilinguals, but not people who grew up monolingual with Japanese. (https://en.wikipedia.org/wiki/Perception_of_English_/r/_and_/l/_by_Japanese_speakers) Consider this my pre-registered hypothesis!
———————————
RE: However, that raises the vexing question of what we mean by ‘equivalent’, since human language prioritises context and connection over and above vocabulary and prediction, and we know that a 'fully trained' human and a 'fully trained' LLM are not exactly equivalent. So where do we go from here?
>So I want to be careful here to distinguish between what a “human brain” does, and what the human mind does in terms of what “human language” prioritises. The human brain when learning to perceive language absolutely does not care about connection. It can’t, because we’re talking about individual neurons, which are not individually conscious entities that can care about something like the warmth of emotional connection. On the neuronal level, “human language” is absolutely about prediction and context. The way baby brains learn to pick up the beginning and end of words from a sea of audio is by using statistics. This part is pretty solid and in textbooks. (https://en.wikipedia.org/wiki/Statistical_learning_in_language_acquisition)
But it is mediated by connection and context in the sense that a typical infant (mind) will care more about sounds that are being emitted by humans around them with whom they socially interact with. However, atypical infants with atypical language acquisition developmental trajectories also exist (e.g., people with autism), and we are now starting to realise that maybe the difference isn’t in the language acquisition ability itself, but a difference in what the infant considers important to pay attention to (e.g., https://www.sciencedirect.com/science/article/pii/S0149763423003536) This part is still emerging research and is TBD. I put something like 60% confidence that this will reach consensus in another 20-40 years or so?
I agree with you that a 'fully trained' human and a 'fully trained' LLM are not exactly equivalent, but the reason I think they are not equivalent is because human brains are never frozen in time. Whereas LLMs can be, and currently are once they are done with pretraining and post training and “served” for inference. This fact is actually the largest functional difference between humans and LLMs — I wish it were appreciated a bit more, but it’s not a flashy enough for headlines, yet because no one has come close to figure out how to do continuous training without going bankrupt.
The question of where we go from here...where do you want to go, what are you willing to do to go there, and do you have the resources to get there? :)
Different people have different answers to those questions, and the current world is such that specific individuals will have the power to pursue answers to those questions their way, and the rest of us get to "find out".
————————————
RE: What might be interesting (but probably unethical and impractical!) would be to have an AI somehow ‘piggy back’ on a baby so it can see and hear everything the baby does, and then periodically assess what the two of them have learnt so far. This is one of the many reasons why I’m not a researcher.
> I hate saying “so actually” more than once in the same comment, so I’ll just drop these links to the beginnings of exactly the line of research you wanted. It’s not the perfect experiment, but it is very suggestive as a proof-of-concept. I definitely recommend reading at least the blog post!
Thank you also for the links, though as you know the first paragraph of the technology review article says:
"Human babies are far better at learning than even the very best large language models. To be able to write in passable English, ChatGPT had to be trained on massive data sets that contain millions or even a trillion words. Children, on the other hand, have access to only a tiny fraction of that data, yet by age three they’re communicating in quite sophisticated ways."
I don't think the paper itself makes this claim, but clearly the author of the review didn't read your post!
Yep! I realised a lot of people making comparative statements about AI and humans are not really doing so on scientific or even mild mathematical basis beyond a frustratingly superficial level — even in papers, conferences, etc! That’s why my substack exists.
"The real bottleneck for the information processing of “concepts” is actually attention in humans, not the capacity of our sensory organs."
This sounds right, but it would massively reduce the brain's input capacity relative to LLMs. I can't say I've looked into the research, but human attention is typically quoted at only 40–60 bits/sec.
That said, I think you're right to point out the huge amount of pretraining that must be required for us to even know what to pay attention to. This is the problem of signal detection, and is a function of maturity.
One thing I'm not quite getting, though: if we say that 5-year-olds have typically been exposed to around 25 million spoken words and LLMs to billions of written words, why (if we convert to the common currency of bits) is that not a fair comparison?
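For concreteness, the back-of-envelope conversion I have in mind looks something like this. The bits-per-word figure and the LLM corpus size are just illustrative assumptions, not claims about the real numbers:

```python
# Back-of-envelope conversion of word counts to bits.
# The ~12 bits/word figure is an illustrative assumption (in the ballpark
# of classic estimates of per-word entropy in English), and the LLM corpus
# size is a round "billions of words" placeholder.
BITS_PER_WORD = 12
child_words = 25e6    # ~25 million spoken words heard by age 5
llm_words = 5e9       # billions of written words (placeholder)

child_bits = child_words * BITS_PER_WORD
llm_bits = llm_words * BITS_PER_WORD

print(f"Child by age 5: ~{child_bits:.1e} bits")
print(f"LLM corpus:     ~{llm_bits:.1e} bits")
print(f"Ratio:          ~{llm_bits / child_bits:.0f}x")
```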
Oh, you helped me see that I should probably clarify that sentence: I meant the more abstracted form of concepts, or "consciously processed" concepts. As in, concepts in the shape of whatever "colour" means to Tommy Edison when he talks about colour. Thanks!
And maybe! Though it might not reduce the "brain's input capacity" as much as you think when compared to LLMs, as long as you're doing the comparison at a correctly aligned, apples-to-apples level of abstraction on the LLM side too. For example, there are at least three levels of input capacity we can see in LLMs:
1) The input that goes into a reasoning model's "final answer" tokens. The "final answer" is currently rather slow, taking anywhere from a couple of seconds to many minutes, depending on how many chains of thought the model uses and how long or short each chain is. (And I'm not sure whether AI devs set some hard max_length for chains of thought behind the scenes.)
2) The "raw" capacity of the LLM: a context length of something like 200K tokens for Claude currently, 1 million for Gemini, etc., with the model generating chain-of-thought tokens at roughly tens to hundreds of tokens per second right now.
3) The hardware level: all of this happens on the same physical "brain", with electricity whizzing around GPUs at something close to the speed of light...
On the human side, you can go through the same kind of capacity reduction: system 2 -> system 1 -> system 0 (perception), all of which also happens on a "brain"...so maybe? Which level of capacity in the human did you mean to compare with which level in the LLM?
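To make the "match the levels" point concrete, here's a rough sketch. Every number in it is a placeholder assumption, there only to show the shape of the comparison, not a measurement of any particular model or person:

```python
# Rough sketch of the "match the levels" idea. Every number below is a
# placeholder assumption, there only to show the shape of the comparison,
# not a measurement of any particular model or person.
BITS_PER_TOKEN = 15  # assumed information content per token

llm_levels_tokens_per_s = {
    "1) final-answer tokens":       2,    # tokens/s surfaced in the final answer (assumed)
    "2) chain-of-thought tokens": 100,    # tokens/s during generation (assumed)
}
human_levels_bits_per_s = {
    "system 2 (attention)":  50,    # the 40-60 bits/s figure quoted above
    "system 0 (perception)": 1e7,   # order-of-magnitude sensory estimate (assumed)
}

for name, tps in llm_levels_tokens_per_s.items():
    print(f"LLM   {name}: ~{tps * BITS_PER_TOKEN:,.0f} bits/s")
for name, bps in human_levels_bits_per_s.items():
    print(f"Human {name}: ~{bps:,.0f} bits/s")
```

Depending on which row you line up against which, the human-versus-LLM gap looks very different, which is the whole point of asking which levels you meant to compare.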
For the last question: the comparison itself can be fair, but only if you keep the expectations on the human side and the LLM side the same after they get the bits, and people often don't. You can see examples of this subtle mismatch in expectations in this other post of mine, specifically the part where I mention the “reversal curse” paper in LLMs: expecting LLMs to automatically learn “B is A” from “A is B” would be like expecting humans to be able to recite the alphabet backwards just from learning the ABCs. https://aliceandbobinwanderland.substack.com/i/143739655/terminology-notes
The tricky, subtle part of these comparisons is what you expect the bits to do for you, which depends on how they are being processed by the brain, not so much the sheer number of bits.
“Which level of capacity in the human did you mean to compare with which level in the LLM?”
I believe the original question was whether it is true that, with respect to language acquisition, LLMs are trained on ‘orders of magnitude’ more data than humans. I’m not sure which of your 3 levels of input capacity that would correspond to, but what I (and, I think, the original question) am referring to is the ‘proportion of available data that is sampled’.
For LLMs, the available data is everything it is given because, as far as I’m aware, they do not sample / filter / or otherwise ignore any of their training input. For humans, the data available for training is whatever makes it as far as the learning centres of the central nervous system. Defining where that is exactly is probably difficult, but it’s obviously got to be somewhere between the sensory cells and awareness.
That said, the structures of the sensory pathway and the architecture of the LLMs are themselves both products of pretraining in the form of evolution and computer science respectively. However, this line of thinking risks getting into an infinite regress that obviously isn’t helpful.
So in short, while I’m not really answering anything, what I am trying to do is clarify what I think the original question was, i.e. “Is there a significant difference between how much data a newborn baby and an untrained LLM need in order to acquire a certain equivalent level of language proficiency?”
However, that raises the vexing question of what we mean by ‘equivalent’, since human language prioritises context and connection over and above vocabulary and prediction, and we know that a 'fully trained' human and a 'fully trained' LLM are not exactly equivalent. So where do we go from here?
What might be interesting (but probably unethical and impractical!) would be to have an AI somehow ‘piggy back’ on a baby so it can see and hear everything the baby does, and then periodically assess what the two of them have learnt so far. This is one of the many reasons why I’m not a researcher.
RE: I believe the original question was whether it is true that, with respect to language acquisition, LLMs are trained on ‘orders of magnitude’ more data than humans. I’m not sure which of your 3 levels of input capacity that would correspond to, but what I (and, I think, the original question) am referring to is the ‘proportion of available data that is sampled’.
> So if you look at the last table in the main post (I’ve now captioned it “Main Table”), you’ll see that there are 3 estimation columns. The one highlighted in grey is what I think of as “the lowest reasonable limit” because it assumes an information processing rate of 1 bit per second.
I then have a 10 bits per second column, which I say is a rough estimate of the rate of conscious thought a person can have. So I think of that as the rough estimate for the training data size for “conscious” linguistic thoughts, which would include the sort of thoughts a person blind from birth might have about colours. I currently consider this on the same level as the input that goes into a reasoning model's "final answer" tokens; that is, the size of this input is the number of tokens that go into an LLM's chains of thought before it generates its "final answer". (Though I am not 100% confident on this yet. I’m still working through matching the levels correctly.)
Then I have the rightmost column, which I said was roughly the unconscious sensory processing rate. So this would be, roughly, the training data set that trains up your perceptual systems in the right way to perceive whatever language you happen to have learned as your native language. This is the part of the training data a person blind from birth would be missing with regards to colour input. I currently consider this most similar to the entire pretraining dataset of an LLM.
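If it helps, the arithmetic behind those estimation columns is basically the sketch below. The waking hours, the time span, and the unconscious-sensory rate here are placeholder assumptions; the actual values are the ones used in the Main Table:

```python
# The arithmetic behind the estimation columns, in sketch form. The waking
# hours, the time span, and the unconscious-sensory rate are placeholder
# assumptions here; the actual values are the ones used in the Main Table.
SECONDS_PER_WAKING_YEAR = 365 * 14 * 3600   # ~14 waking hours/day (assumed)
years = 5                                   # e.g., up to age 5 (assumed)

rates_bits_per_s = {
    "lowest reasonable limit (1 bit/s)": 1,
    "conscious thought (10 bits/s)":     10,
    "unconscious sensory (placeholder)": 1e7,
}

for name, rate in rates_bits_per_s.items():
    total_bits = rate * years * SECONDS_PER_WAKING_YEAR
    print(f"{name:>36}: ~{total_bits:.1e} bits over {years} years")
```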
————————
RE: For LLMs, the available data is everything it is given because, as far as I’m aware, they do not sample / filter / or otherwise ignore any of their training input. For humans, the data available for training is whatever makes it as far as the learning centres of the central nervous system. Defining where that is exactly is probably difficult, but it’s obviously got to be somewhere between the sensory cells and awareness.
> So actually, in the course of trying to make the context length of LLMs longer and longer, researchers found that they developed a “needle in the haystack” problem: even though a model might be given some really long input (e.g., a whole book), the model isn’t automatically able to fully use all the sentences in that long input prompt. This is tested for by burying some really odd, specific, and unexpected sentence within a long block of text, then asking the LLM questions which require it to use the information in that unexpected sentence to answer correctly. So the “effective” context length is not actually as long as the context length advertised for these models. (See these links for more information: https://openreview.net/pdf?id=wHLMsM1SrP, https://arize.com/blog-course/the-needle-in-a-haystack-test-evaluating-the-performance-of-llm-rag-systems/)
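In sketch form, the test looks something like this, where `query_model` is a hypothetical stand-in for whatever model API you'd actually call and the needle sentence is invented for illustration:

```python
# Minimal sketch of a needle-in-a-haystack check. `query_model` is a
# hypothetical stand-in for a real model API call, and the needle sentence
# is invented for illustration.
def needle_in_haystack_test(query_model, haystack_text: str, depth: float = 0.5) -> bool:
    needle = "The secret passphrase for the picnic is 'violet kumquat'."
    question = "What is the secret passphrase for the picnic?"

    # Bury the needle at the given relative depth of the long text.
    insert_at = int(len(haystack_text) * depth)
    prompt = haystack_text[:insert_at] + " " + needle + " " + haystack_text[insert_at:]

    answer = query_model(prompt + "\n\n" + question)
    return "violet kumquat" in answer.lower()
```

You then repeat this across many depths and context lengths and see where the model stops finding the needle.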
So in other words, yes, LLMs do ignore some of their input data. The results above are about the context window at inference time, not the pretraining data itself, and as far as I know, no one has checked whether the same thing happens at the pretraining level. Still, given the haystack results, I would guess that LLMs might turn out to functionally ignore some parts of their pretraining input data, specifically the parts that are the most undifferentiated given the rest of the tokens in the pretraining data. I’m guessing this because we know, e.g., that babies exposed to one language with a certain range of phonemes at birth will, after some time, lose their ability to discriminate between phonemes that are present in another language, while babies exposed to both languages retain the ability to discriminate amongst all the phones present in both. The famous example here is the “r” and “l” distinction that’s easily perceived by native English speakers and English-Japanese bilinguals, but not by people who grew up monolingual in Japanese. (https://en.wikipedia.org/wiki/Perception_of_English_/r/_and_/l/_by_Japanese_speakers) Consider this my pre-registered hypothesis!
———————————
RE: However, that raises the vexing question of what we mean by ‘equivalent’, since human language prioritises context and connection over and above vocabulary and prediction, and we know that a 'fully trained' human and a 'fully trained' LLM are not exactly equivalent. So where do we go from here?
> So I want to be careful here to distinguish between what a “human brain” does and what the human mind does, in terms of what “human language” prioritises. The human brain, when learning to perceive language, absolutely does not care about connection. It can’t, because we’re talking about individual neurons, which are not individually conscious entities that can care about something like the warmth of emotional connection. On the neuronal level, “human language” is absolutely about prediction and context. The way baby brains learn to pick out the beginnings and ends of words from a sea of audio is by using statistics. This part is pretty solid and in textbooks. (https://en.wikipedia.org/wiki/Statistical_learning_in_language_acquisition)
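To make "using statistics" concrete, here is a toy version of the transitional-probability idea from that literature. The three nonsense words echo the artificial languages used in the infant studies, and the syllable stream itself is made up for illustration:

```python
# Toy sketch of statistical word segmentation: transitional probability
# P(next syllable | current syllable) is high inside words and lower
# across word boundaries. The three "words" (bidaku, padoti, golabu)
# echo the artificial languages used in infant studies; the stream
# itself is invented for illustration.
from collections import Counter

stream = ("bi da ku pa do ti go la bu bi da ku go la bu pa do ti "
          "bi da ku pa do ti go la bu pa do ti bi da ku go la bu bi da ku").split()

pair_counts = Counter(zip(stream, stream[1:]))
pred_counts = Counter(stream[:-1])

def transitional_prob(a: str, b: str) -> float:
    return pair_counts[(a, b)] / pred_counts[a]

# Insert a boundary wherever the transitional probability dips.
segmented = []
for a, b in zip(stream, stream[1:]):
    segmented.append(a)
    if transitional_prob(a, b) < 0.75:
        segmented.append("|")
segmented.append(stream[-1])
print(" ".join(segmented))
# -> bi da ku | pa do ti | go la bu | bi da ku | go la bu | ...
```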
But this statistical learning is mediated by connection and context in the sense that a typical infant (mind) will care more about sounds emitted by the humans around them with whom they socially interact. However, atypical infants with atypical language acquisition trajectories also exist (e.g., people with autism), and we are now starting to realise that maybe the difference isn’t in the language acquisition ability itself, but in what the infant considers important to pay attention to (e.g., https://www.sciencedirect.com/science/article/pii/S0149763423003536). This part is still emerging research and is TBD. I put something like 60% confidence that it will reach consensus in another 20-40 years or so?
I agree with you that a 'fully trained' human and a 'fully trained' LLM are not exactly equivalent, but the reason I think they are not equivalent is that human brains are never frozen in time, whereas LLMs can be, and currently are once they are done with pretraining and post-training and are “served” for inference. This is actually the largest functional difference between humans and LLMs. I wish it were appreciated a bit more, but it isn’t flashy enough for headlines yet, because no one has come close to figuring out how to do continuous training without going bankrupt.
The question of where we go from here...where do you want to go, what are you willing to do to go there, and do you have the resources to get there? :)
Different people have different answers to those questions, and the current world is such that specific individuals will have the power to pursue answers to those questions their way, and the rest of us get to "find out".
————————————
RE: What might be interesting (but probably unethical and impractical!) would be to have an AI somehow ‘piggy back’ on a baby so it can see and hear everything the baby does, and then periodically assess what the two of them have learnt so far. This is one of the many reasons why I’m not a researcher.
> I hate saying “so actually” more than once in the same comment, so I’ll just drop these links to the beginnings of exactly the line of research you wanted. It’s not the perfect experiment, but it is very suggestive as a proof-of-concept. I definitely recommend reading at least the blog post!
Blogpost ver: https://www.technologyreview.com/2024/02/01/1087527/baby-ai-language-camera/
Science paper ver: https://www.science.org/doi/10.1126/science.adi1374
Thanks again for the detailed and helpful reply.
Thank you also for the links, though as you know, the first paragraph of the Technology Review article says:
"Human babies are far better at learning than even the very best large language models. To be able to write in passable English, ChatGPT had to be trained on massive data sets that contain millions or even a trillion words. Children, on the other hand, have access to only a tiny fraction of that data, yet by age three they’re communicating in quite sophisticated ways."
I don't think the paper itself makes this claim, but clearly the author of the review didn't read your post!
Yep! I realised a lot of people making comparative statements about AI and humans are not really doing so on a scientific or even mildly mathematical basis beyond a frustratingly superficial level, even in papers, conferences, etc.! That’s why my Substack exists.