This is really amazing; my one gripe is that I'm skeptical of the informational density of human experience. You do make a good point that you are constantly experiencing things from internal sensors, from background information in the environment, and even from the absence of sensations that could be present, and those certainly count overall, but do they count specifically when comparing to LLMs? LLMs are pretty general as far as AI goes, but they don't have this kind of background noise artificially inflating the amount of data they are trained on. I don't know if this is enough to bring humans from a few orders of magnitude above LLMs to a few below, as my intuition says. What I would really like to know is how many bits specific to language a human takes in before becoming conversationally fluent. After that, or maybe even before, I think the comparison between people and AI is fundamentally unfair to both because of the differences in how they and we remember and access information. All that said, this is very interesting and thought provoking, and you have won me over to your side more than this comment may suggest.
Thanks!
I can’t speak for all LLMs because 1) different models have different pretraining data sources and 2) the tokenizers that LLMs use to convert word fragments to numbers can be trained on weird datasets…
But we definitely know that certain LLM datasets are no “cleaner” than the human ones, both because it would be really expensive (economically and/or computationally) to curate that sort of clean dataset, and because people have looked into it. Noisy datasets are why you can get shenanigans like the SolidGoldMagikarp incidents in GPT-2 (great showcase examples from the tweet and paper here
https://x.com/karpathy/status/1789590397749957117?s=46&t=X2oxo-1FTCynf0Xf2f7EZA)
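To make the tokenizer point concrete, here’s a minimal sketch (assuming the `tiktoken` package is installed; the exact token ids aren’t the point). Strings like “ SolidGoldMagikarp” showed up often enough in the noisy data used to build GPT-2’s BPE vocabulary that they ended up as single tokens:

```python
# Minimal sketch: see how the GPT-2 BPE tokenizer splits a "glitchy" string.
# Assumes `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the original GPT-2 byte-pair encoding

for text in [" SolidGoldMagikarp", " solid gold magikarp"]:
    ids = enc.encode(text)
    print(f"{text!r} -> {len(ids)} token(s): {ids}")
```

If the first string comes out as a single token while the lower-case version splits into several, that’s exactly the noisy-vocabulary artefact the tweet and paper above are about.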
The more curated you want your dataset to be, the smaller it will be. You can’t get 15 trillion tokens of nice, clean, “textbook quality” data. You can get roughly 7B tokens, like the people who trained phi-1 did (https://openreview.net/pdf?id=Fq8tKtjACC). And now you can get current models to generate cleaner synthetic data on whatever you ask them to, but when you use that to train new models, it’s no longer as closely tethered to “reality”. The human analogy would be a child whose experience comes mostly from words they hear from other people, or words they read in books, and who rarely looks at “reality” to double-check statements like…idk, all swans are white? (Which tbh is the modern human experience for some parts of the world, so 🤷♀️)
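For a rough sense of the size penalty that curation imposes, here’s the back-of-envelope arithmetic using the two figures above (15 trillion web-scale tokens vs ~7B “textbook quality” tokens for phi-1):

```python
# Back-of-envelope: how much smaller the curated corpus is.
web_scale_tokens = 15e12   # the 15-trillion-token figure mentioned above
textbook_tokens = 7e9      # roughly what phi-1 was trained on

print(f"curated set is ~{web_scale_tokens / textbook_tokens:,.0f}x smaller")  # ~2,143x
```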
And yeah, I too would love to know how many bits it takes for humans to become fluent, but…it would be hard to decide what counts as “conversationally fluent”, whether you would compare first- or second-language learners of some language X on the human side, what sort of social demographics of the human you’re looking at, etc.
And on the LLM side, you’d have to decide how big of a model you wanted to train, and whether you want to count models that were only given training input from clean conversational text in language X, or whether you’d want a broader mix, depending on how “curated” an environment you want to mimic from the human side etc.
But on a final note, I’m not really sure what you mean by “artificial noise” when we’re talking about the human sensory system. If the ultimate data-generating object we’re interested in is “reality” or “nature”, then in what sense can there be “artificial noise”? Like, if our eyes detect visible light waves but we don’t understand what they mean, the waves still had to come from somewhere…? (Maybe the somewhere was internally generated, like schizophrenic hallucinations, but that’s still somewhere. The somewhere just happens to be the part of reality sitting inside your skull.)
Thank you for a well thought out reply. When I say background noise artificially bumps the human numbers up, I mean there are tons of experiences humans have that either aren't relevant at all or don't affect our reasoning in a way we want to compare to AI. Getting sick enough that all you can focus on is the various ways you're in pain might have the same informational density in terms of number of bits, but it obviously doesn't have the same use toward reasoning as a week spent in school. Babies do need to learn to parse the sounds coming into their ears, but they also need to learn to stand upright, which LLMs don't need to do. Humans also intentionally or accidentally filter out a ton of sensory experience in their day-to-day life. You said the input to LLMs was messy, but is it as messy as listening to someone talk in a crowd? If you filter out all the noise except the conversation, does the background noise count toward your total "training data"? I could give a million little examples like this, but it's all speculation until I do a Fermi estimate like you did, so I'll stop. All this is just to say, I don't know how we could ever take out of the calculation all the information that is either irrelevant to a human's functioning or that serves a function humans have and the AI doesn't.
Ah perhaps the crux here is: are you talking about the “conscious” mind or are you thinking of the unconscious mind when you refer to clean/noisy input?
*Something* has to filter out sounds at a party so that you can follow whichever specific person you’re trying to (consciously, actively, purposefully) listen to…but the *something* doing that filtering also happens to be your brain. In my Fermi estimate, I mentioned that the rightmost column should be taken as the estimate for the latter case (everything the brain can sense, including everything you might consider messy or “noise”), whereas the second-from-the-right column (the 10 bits per second one) would better approximate the input to the “conscious” mind.
So feel free to pick the column that works for your question.
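And to make concrete how much the choice of column matters, here’s a rough sketch using the two rates above (10 bits/s for the “conscious” stream, ~10^9 bits/s for everything the brain can sense); the numbers are just the Fermi-estimate values from the post:

```python
# Rough comparison: one year of "conscious" input vs one year of full sensory input.
seconds_per_year = 365 * 24 * 3600            # ~3.15e7 s

conscious_bits = 10 * seconds_per_year        # ~3e8 bits/year
full_sensory_bits = 1e9 * seconds_per_year    # ~3e16 bits/year

print(f"conscious:    {conscious_bits:.1e} bits/year")
print(f"full sensory: {full_sensory_bits:.1e} bits/year")
# Picking one column over the other moves the human-side estimate by ~8 orders of magnitude.
```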
I liked this! Even though I couldn't follow everything and had to accept most conclusions blindly
Reminded me of this Dynomight article:
https://dynomight.substack.com/p/data-wall
Which parts were you not able to follow? If you can articulate it I can see if I can address the holes and add in more context/reasoning steps to the post.
Nice article. At first, I was confused by the section about the number of bits to *process* one piece of information. My understanding is that we want to compare the sizes of the training sets for LLMs and for humans. The amount of computation a human uses to process elements of the training set is seemingly unrelated to the size of the human training set. However, I see that the highlighted portion of the Zheng-Meister abstract clarifies that 10^9 is the size of the input. Is that a valid question, and is this the right resolution to it?
Thanks! And to answer your question: yes…and no. Your question is valid, but 10^9 is not the size of the raw input. The raw input is all of the reality *that our organs can sense*. So the input set *excludes* things like wifi and radio waves, because we have no receptors in our eyes that can detect waves in that frequency range. That’s why we can’t see or hear wifi with our naked eyes and ears. (And our neurons don’t send out spikes in response to radio waves.) The raw input is the slice of reality that you see/hear/smell/taste/feel per second. Your brain detects all that stuff to some degree, even if you aren’t actively paying attention to everything in that specific second. (E.g. you can talk to someone and drive at the same time, but you might miss parts of what they said when you have to change lanes or do something on the road that needs your full attention, even though your ears will still be picking up the sounds in your environment.)
But to convert slices of sensory “reality” into something useful for comparison, you can do it a couple of ways. Zheng-Meister did it with information theory. So 10^9 bits *per second* is an estimate of how much information your neurons can transmit every second, where “information” has the information-theory meaning. (I should clarify that in the post, now that I think about it. Thanks for the reminder!) I said “processed” in the post because, when it comes to the biological human brain, “processing” mostly (but not always) means neurons transmitting electrical signals. So you can say 10^9 bits is the amount of *informative reality* an adult brain gets *per second*?
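And just to make the 10^9 bits per second figure tangible, here’s a quick unit conversion (nothing beyond arithmetic on the Zheng-Meister estimate; the 16 waking hours is my own rough assumption):

```python
# Convert the estimated neural information rate into more familiar units.
rate_bits_per_s = 1e9

megabytes_per_s = rate_bits_per_s / 8 / 1e6          # ~125 MB/s
bits_per_waking_day = rate_bits_per_s * 16 * 3600    # assuming ~16 waking hours

print(f"~{megabytes_per_s:.0f} MB/s")
print(f"~{bits_per_waking_day:.1e} bits per waking day")
```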
And yes, you’re right that the amount of computation a brain uses to process its training data is unrelated to the size of its training data, but only if you can control the brain in certain ways. In LLMs, you absolutely can control it. If you want to waste lots of money for no reason, you can technically pretrain a really big LLM on only 200 tokens instead of 2 trillion. (The LLM just won’t be any good if you pretrained it on only 200 tokens, even if it had X billion parameters. But you could do it if you wanted to for some reason?)
But you kind of can’t control it in humans. Like, we can physically close our eyelids to shut off light input to the eyes when we don’t want to see something…but we can’t “turn off” our eyes to stop seeing something, the same way we can’t “turn off” our ears (unless you use some external thing like earphones to block sound waves from reaching your eardrum). The best you could do to shut off input to a human brain is go to sleep. (But even then, you can’t purposefully control whether you will lose consciousness or not without external aid like…anesthesia and similar drugs.)
Great article and I learned a ton. In particular: “Let’s not completely dismiss the years of growth, work, and education it takes a baby human brain to get to the state where it can casually ‘learn new tasks very quickly’.” I think this fact is so under-weighted in discussions of AI and philosophy.
If you’ve ever had kids, you’ve seen a baby “train”. I can recall my 2-month-old son just looking around, looking, looking, looking as he slowly moved his head. Then looking at his little hand, opening and closing his fist and staring at it. It’s training!…and it’s based on “stereo-cam video”, aka two eyes, not to mention audio and a zillion nerve endings to boot.
We have quite a head start on AI due to this kind of pre-K network training.
Thanks, I think that fact is underrated too! I’ve wondered if this is just a result of the people involved in such discussions being of a certain age and/or social group that doesn’t have much exposure to young children (for whatever reason).
And that’s cool! Yeah, most people who haven’t looked at the developmental literature won’t realise that babies don’t automatically see in colour or in 3D with both eyes (even though we were all there once). That stuff has to be learned over the first year of life (~30 million seconds!), like you observed.