Posts in this series:
Why you should think about the outcome of simulating algorithms? (Motivating the question) [0/5]
What do you get if you simulate an algorithm? (Philosophy) [1/5]
Do (modern) artificial brains implement algorithms? (AI) [2/5]
Do biological brains implement algorithms? (Neuroscience) [3/5]
The Implications of Simulating Algorithms: Is “simulated addition” really an oxymoron? (Philosophy) [4/5]
When science fiction becomes science fact (Major book spoilers!) [5/5]
The ideal philosophical argument starts with obviously true premises, does some reasoning, then arrives at unobvious conclusions.
What happens if the true premises are not obvious, and require a decent amount of study from at least 2 broadly different disciplines, and maybe 3-5 nicher subfields?1
In the past 3 posts, I’ve asked targeted questions and answered them as succinctly as I could, in order to establish these premises as true:
P1: When you simulate an algorithm you get another algorithm.
P1.5: A simulated algorithm can have the same consequence as the original algorithm, regardless of what material the simulation or the original thing was made of. But it can also not. (It depends on whether the consequence you’re interested in involves the material a recipe is written on. Like if you need a strawberry cake recipe that will be visible for more than 15 mins, maybe don’t write it in the sky.)
Premises 1 and 1.5 establish what is possible, in the same sense that it is possible that the 505 trillionth digit of pi is 9. (We currently only know up to ~202 trillion digits.) Whether the 505 trillionth digit of pi is or isn’t 9, however, already is or isn’t true in our world.
Premises 2 and 3 aim to establish what we now know to be true in our specific world, given our particular laws of physics/chemistry/biology etc., thanks to the process we call science.2
P2: Modern AIs implement algorithms (but not the traditional sort of GOFAI algorithms that use clean symbolic rules).
P3: Biological brains implement algorithms (including vector addition!)
I have not addressed each of the premises as extensively as I could have. I have tried to choose very specific pieces of evidence from the mechanistic interpretability and neuroscience literatures to answer the relevant scientific questions as succinctly as possible, so that I could get to their implications. Which are:
I1: We will not get A.I. values aligned with human values, assuming we keep using the same deep-learning models we’re currently using for A.I.
I2: There is no metaphysical barrier to A.I. becoming as good at “reasoning”, or as “sentient” or as “conscious” as biological organisms (including humans).
I3: Be precise about what the simuland, simulator, and simulatee are when you think about simulation of any kind. (It will save you a lot of mental confusion.)
I1: We will not get A.I. values aligned with human values, assuming we keep using the same deep-learning models we’re currently using for A.I.
This one is pretty easy now that we have the empirical evidence. Remember AliceTT from Nanda et al. (2023)? Here is a descriptive account of what happened:
Nanda et al. (2023) used some “original” algorithm A (from cpython or whichever library) to create the dataset for training their tiny transformer (which I named AliceTT) on modular addition problems.
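To make that concrete, here is a minimal sketch of what that “original” algorithm A and its dataset look like. This is my own reconstruction, not the paper’s actual data-generation code, assuming the p = 113 modular-addition setup reported in Nanda et al. (2023):

```python
# Sketch of the "original" algorithm A: plain modular addition, used only to
# label every (a, b) pair in the training set. P = 113 as in Nanda et al. (2023).
P = 113

def original_algorithm_A(a: int, b: int) -> int:
    """Ordinary integer addition followed by mod P -- nothing fancy."""
    return (a + b) % P

# Exhaustive dataset of all P * P input pairs with their "correct" labels.
dataset = [((a, b), original_algorithm_A(a, b)) for a in range(P) for b in range(P)]
```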
AliceTT learned to implement a different algorithm, the FMA, which still functionally generates the same outputs from matching inputs.
Zhong et al. (2023) replicated and extended these results. When they trained other tiny transformers that had very similar architecture to AliceTT, they found that some would indeed learn to implement the FMA (or the “Clock” algorithm, as Zhong et al. call the FMA), but other models would learn a different algorithm they call the “Pizza” algorithm.
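For intuition, here is a rough sketch of the idea behind the FMA/“Clock” algorithm, written by hand rather than extracted from any trained model. The trained networks spread this computation across several learned Fourier frequencies; I use a single arbitrary frequency k here because one already suffices mathematically:

```python
import numpy as np

P = 113
k = 5  # one illustrative Fourier frequency (the real models use several learned ones)

def clock_addition(a: int, b: int) -> int:
    """Add mod P by composing rotations: the "Clock" / Fourier-multiplication idea."""
    # Represent a and b as points on a circle and multiply their phases,
    # which rotates by the *sum* of their angles: exp(2j*pi*k*(a+b)/P).
    combined = np.exp(2j * np.pi * k * a / P) * np.exp(2j * np.pi * k * b / P)
    # Read out the residue c whose own rotation best matches the combined one
    # (a trained network does this readout with its learned unembedding).
    candidates = np.exp(2j * np.pi * k * np.arange(P) / P)
    return int(np.argmax((combined * np.conj(candidates)).real))

assert clock_addition(100, 50) == (100 + 50) % P  # sanity check: 37
```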
The main takeaway of all this is that we now have hard empirical evidence that models trained via deep-learning methods can and will learn to implement different algorithms B/C/D from whatever “original” algorithm A generated the dataset.
How does this translate to (current) AI alignment doom?
We have long known that it is possible to implement different algorithms that would take the same input and give you the same output (there are literally 40+ different human-designed sorting algorithms out there). But this was when humans were designing the algorithms.
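As a toy reminder of that long-standing fact, here are two deliberately different, human-designed sorting algorithms that agree on every input while doing very different things internally (an illustrative sketch, nothing to do with the papers above):

```python
def insertion_sort(xs: list) -> list:
    """Build the result by inserting each element into its place, one at a time."""
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

def merge_sort(xs: list) -> list:
    """Recursively split, sort the halves, then merge -- a very different strategy."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    merged = []
    while left and right:
        merged.append(left.pop(0) if left[0] <= right[0] else right.pop(0))
    return merged + left + right

# Same inputs, same outputs, different algorithms.
assert insertion_sort([3, 1, 2, 1]) == merge_sort([3, 1, 2, 1]) == [1, 1, 2, 3]
```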
The new thing here is: we now see that deep-learning optimizers will train models to implement algorithms that are different to the original generating algorithm, given the same exact training data.
How does this fact matter to humans?
Imagine this very unrealistic best case scenario, where we magically gather a complete, comprehensive, “perfect” dataset from Barbara, a human, whose brain contains the “original” algorithms that generate all the values/actions/feelings/thoughts/etc. Barbara has or does.
We now have empirical evidence that if you threw all that data into a transformer as input, the transformer might still implement different algorithms while generating the same outputs. This is the same way neither the “Pizza” nor the “Clock” algorithm looks anything like whichever algorithm generated the original dataset in Nanda et al. or Zhong et al.
Which means this Barbara-prediction-transformer is going to be able to perfectly predict the next value/action/feeling/thought the original Barbara will have. But it is likely to be doing the prediction using algorithms different from those implemented in Barbara’s original brain.
So then the next question is, is it going to be a problem that the transformer is implementing a different algorithm to the “original brain” if it generates the same outputs anyways? Does it matter that AliceTT implements modular addition using the FMA and not the Pizza algorithm or the “original” CPython one, when in the end, the model still generates the correct output when adding numbers?
Will a human(‘s brain) and their hypothetical transformer counterpart always prioritize things in the same way on edge-case, out-of-training-distribution hard decisions?
Maybe yes, but also, maybe no.
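Here is a toy illustration of how “same outputs so far” can still come apart out of distribution. These are purely made-up functions, nothing to do with any real model or with Barbara:

```python
# Two "priority functions" that agree on everything seen during training...
def original_priority(x: int) -> int:
    return x % 7

def learned_priority(x: int) -> int:
    # ...because for 0 <= x < 7 this reduces to the same thing,
    # but it generalises differently outside that range.
    return x if x < 7 else (x * 3) % 7

training_inputs = range(7)
assert all(original_priority(x) == learned_priority(x) for x in training_inputs)

# Out of the training distribution, they quietly disagree:
print(original_priority(10), learned_priority(10))  # 3 vs 2
```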
I’m not saying we can’t get lucky with AI, the same way I might have luckily guessed that the 505 trillionth digit of pi is actually 9.
I am simply saying that empirically, we now know, even if we did have an AI (transformer) model that perfectly predicted what Barbara would do in the next millisecond or time point, we have no guarantee that the AI model would generate its prediction using the same algorithm as the algorithm(s) in Barbara’s brain.
Even for something as relatively straightforward and unambiguous as modular addition, where there is such a thing as a consensus correct answer, deep-learning-trained neural networks have implemented at least 2 different algorithms to get the same correct output.
What are we supposed to do when we’re dealing with datasets where there aren’t obvious correct answers? Or where the answers are non-obvious and of unknown correctness?
And I’m not even going into the complication of trying to aggregate multiple humans’ values yet.
This is why I think we will not get A.I. values aligned with human values except by extremely happy accident, if we keep using the deep-learning models we’re currently using for A.I.
Whether this is acceptable to you will depend on your tolerance for extreme high-stakes gambling, I suppose.
Caveats: Why might I1 not be true? Let’s see.
Nanda et al. and Zhong et al. showed these results for transformer models, but maybe other deep-learning-based models, like ones trained with deep reinforcement learning (RL), will be different? This is possible. I have yet to really look at the RL-specific literature, so if you know any mechanistic interpretability work on models trained with deep RL, let me know in the comments. Especially if you know any mechanistic interpretability work on decision transformers! For deep RL, my gut says this uncertainty will be just as applicable to any RL that also uses deep-learning optimizers, because it doesn’t seem (at first glance) like the transformer architecture is any more constrained than the RL stack of algorithms. I simply doubt this result is specific to the transformer architecture. It’s probably a feature of anything trained via deep learning. But if that detail concerns you, feel free to wait for someone to do the mechanistic interpretability work on other, non-transformer architectures, and consider this post as me pre-registering my hypothesis/prediction.
I2: There is no metaphysical barrier to A.I. becoming as good at “reasoning”, or as “sentient” or as “conscious” as biological organisms (including humans).
There are only mundane physical barriers like data, time, energy, space, and maybe economics.
When we stumble onto a scalable neural architecture that can model its own intermediate values in a strange loop, and gather the training resources to implement it … there’s no telling when we might have an A.I. that is as conscious as a fly, cat, small human, or a large human.
To the average person, artificial and biological intelligence probably look and feel as different from each other as diamond does from graphite.
There must be some metaphysical difference between diamond and graphite, right? They are so different from each other. Diamond is transparent. Graphite is black and opaque. Diamond ranks as one of the hardest substances on earth; graphite ranks as one of the softest. Diamond doesn’t conduct electricity. Graphite does.
Humans use diamonds in jewelry, and graphite in pencils.
But make no mistake, the same carbon atoms that make up diamond also make up graphite. The difference between diamond and graphite is a matter of how the metaphysically identical carbon atoms are bonded in relation to each other. In other words, the difference between diamond and graphite is a matter of physical structural differences, rather than anything particularly metaphysical.
In the same vein, I’ve shown you how mechanistic interpretability and neuroscience tell us that A.I.s implement algorithms (P2), and biological brains implement algorithms (P3).
Don’t get distracted by the visible differences between a modern A.I. and a biological animal. It’s not really important that Claude can’t fly or that ChatGPT can’t draw letters perfectly. When you look deeper and interrogate the structure and mechanics of both types of intelligence, you’ll see that the visible differences are like the superficial differences between diamond and graphite.
When it comes to “reasoning”, “sentience”, “consciousness”, or whatever else biological brains are capable of, the important question is, which neural architectures combined with what training data/techniques will implement algorithms similar enough to the ones we recognise as “reasoning”, “sentience”, or “consciousness” in biological brains?
I do not know the answer to this yet, and in this specific scenario, I’d prefer not to find out by wild trial-and-error, if possible.
That said, what I do know is already enough to tell me that the answer is not going to be a “metaphysical” difference.
One reason is that, as I’ve already shown, the mere fact that X “is a simulation” of Y doesn’t mean much when it comes to talking about consequences, algorithms, and the consequences of algorithms (P1 and P1.5). When it comes to algorithms, simulating an algorithm usually just means you’re expending a stupid amount of energy to do the same thing, at a slower speed.
Which is entirely allowed, by the laws of physics.
The second reason is because no matter where you look in biology, from brains to bodies, and everything in between, there is nothing that tells us humans are metaphysically made of a special substance, compared to all the other things in this universe. The DNA base-pairs that we have are the same DNA base-pairs in something like bananas (!), much less flies. The blood, hormones, and proteins we have are roughly the same blood, proteins, and hormones you’d find in a cat.
The electricity that powers our biological intelligence is the same electricity that powers artificial intelligence.
Whatever barrier there might be between conscious and non-conscious beings — that barrier is not going to be one of metaphysics, as uncomfortable as that may be.
I3: Be precise about what the simuland, simulator, and simulatee are when you think about simulation of any kind.
If there is only one thing you take away from these blog posts, I would recommend that it be this compact conceptual table, to use whenever you think about brains, minds, simulation, virtualization, simulated brains (artificial or biological), simulated minds (artificial or biological), etc.
It should help you think about whether you can dismiss a simulation of X as “just” or “mere” or “will never be the same as Real X in the relevant dimensions” purely from the fact that it is a simulation of X.
Let me go through the 3 examples I mentioned in my previous post to demonstrate how I was using this conceptual framework.
Then I’ll go through 2 more examples to demonstrate how you use this to think specifically about brains and minds (regardless of whether the brain or mind is artificial or biological).
This is probably overkill and too many examples, but I really want to pin down this crucial difference because, selfishly, I am getting a little bored of seeing this specific conceptual mix-up derail conversations by making people talk past each other.
Example 1: The typical “simulated hurricane”
Simuland: a Real hurricane
Simulator: A computer
Simulatee: A computer program’s outputs (i.e., something like the simulated hurricane below)

Now onto the questions (in video, because substack doesn’t support adding tables yet).
Example 2: Another kind of “simulated hurricane”
Simuland: a Real hurricane (as in example 1)
Simulator: FIU’s Wall of Wind simulator
Simulatee: Environment inside the Wall of Wind, which contains a simulated hurricane
Bonus Example 2.5: I just found out that someone’s gone and made a model of (i.e., simulated) the Wall of Wind simulator itself.
For extra conceptual-clarity brownie points, what would the table look like for this model of the Wall of Wind? What’s the simuland, simulator, and simulatee?
What are the consequences of this simulation of a simulator of a Real hurricane? Can the simulation of a simulator of a Real hurricane still make things “wet”, “wind-blown”, or “destroyed”, like the Real hurricane?
Example 3: Simulating a recipe (for cake, hurricanes, light wave patterns, or anything else, really)
Simuland: A cake recipe
Simulator: A computer
Simulatee: A computer program’s outputs, which also happen to contain a functioning cake recipe
Now, let’s apply this framework to brains, of both the artificial and biological variety. I’ll use an image of a dress from a previous post as the simuland.
Example 4: Simulating an image of a dress with an artificial brain
Simuland: A picture of a dress
Simulator: A computer
Simulatee: A computer’s ANN’s outputs which contain some fuzzy amalgamation of algorithms
Here are the outputs I got from ChatGPT and Claude, in case anyone is interested.
Example 5: Simulating an image of a dress with a biological brain
Simuland: A picture of a dress
Simulator: A (living, electrically active) human brain
Simulatee: A perceived colour (of an image of a dress)
Feel free to leave comments on what you think would change (if anything does?) when the simuland is “the Real dress”.
Bonus Homework Example 6: Simulating electricity
What is the simuland, simulator, and simulatee when you use electricity to simulate electricity? If someone generates a simulation of electric fields, and the values or vectors used in the simulation are imprecise (maybe the simulation only goes up to 5 significant figures or something), does that imply anything about the underlying electricity (i.e., the simuland) that powers the simulation? If yes/no, why?
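If it helps to make the imprecision part of the scenario concrete, here is a tiny illustrative sketch (my own toy numbers and function name, not tied to any particular simulation package) of an electric-field value computed via Coulomb’s law and then kept to only 5 significant figures:

```python
# Toy sketch: a Coulomb's-law field magnitude, deliberately rounded to 5
# significant figures, as in the homework scenario above.
K = 8.9875517923e9  # Coulomb constant, N*m^2 / C^2

def field_magnitude(q_coulombs: float, r_metres: float, sig_figs: int = 5) -> float:
    e = K * q_coulombs / r_metres**2
    return float(f"{e:.{sig_figs - 1}e}")  # keep only 5 significant figures

print(field_magnitude(1e-6, 0.05))  # ~3595000.0 N/C, precise only to 5 sig figs
```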
Is “simulated addition” an oxymoron?
Bringing things back to Permutation City and the original question that sparked this series of blog posts:
Is “simulated addition” actually an oxymoron? I have a feeling that the implied answer from 30 years ago was supposed to be Yes.
As in: Yes, simulating addition is an oxymoron, because if you can simulate the addition algorithm you are basically doing the same addition as non-simulated Real Addition. I think this was because the most popular methods of simulation back then were mostly based on human-designed, symbolic, rule-based logic, the same way a person can simulate building an adder from redstone in Minecraft to add numbers in almost exactly the same way physical adder circuits would add. But the human still needs to design the circuit on purpose, bit by bit, in the Minecraft simulation.
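For comparison with the deep-learning case below, here is a sketch of that symbolic, rule-based style of simulated addition: a ripple-carry adder built explicitly from Boolean gates, the software equivalent of wiring up redstone one block at a time (my own illustrative code, not from any particular circuit library):

```python
def full_adder(a: int, b: int, carry_in: int):
    """One-bit full adder built from XOR / AND / OR 'gates'."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out

def add_8bit(x: int, y: int) -> int:
    """Add two 8-bit numbers by rippling the carry through 8 full adders."""
    result, carry = 0, 0
    for i in range(8):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result & 0xFF  # wrap around, like a real 8-bit circuit would

assert add_8bit(37, 58) == 95  # every gate above was placed there on purpose
```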
On the other hand, we now know deep-learning can successfully give us a different way of simulating things. 30 years ago, we suspected that deep-learning could maybe also be good at simulating things, but the networks were too small, so the simulations were simply not very good. But now we know.
When you train a transformer on pairs of numbers to predict the “correct next token” … that feels like a model that is “simulating addition” to me. Which is why such a model can initially give wrong answers or is “just memorising” examples during training, until it groks a generalising algorithm (FMA/Clock, Pizza, or otherwise) to do modular addition more generally.
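Concretely, the training examples in that setup look something like the sketch below. The token ids and helper name are mine, chosen for illustration; the general “a b = c” next-token framing is the one described in Nanda et al.:

```python
P = 113
EQUALS_TOKEN = P  # give "=" its own token id, just past the number tokens 0..P-1

def to_training_example(a: int, b: int):
    """Each example: the sequence [a, b, "="], with (a + b) mod P as the next token."""
    input_tokens = [a, b, EQUALS_TOKEN]  # what the transformer sees
    target_token = (a + b) % P           # the "correct next token" it must learn to predict
    return input_tokens, target_token

# The model is never shown the rule itself, only lots of examples like this one.
print(to_training_example(100, 50))  # ([100, 50, 113], 37)
```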
Of course, even if you’re using deep-learning, once the model simulating addition becomes good enough at the simulation, you get basically perfect addition. Which becomes indistinguishable from Real Addition. Even if it is, at the same time, Artificial Addition.
Such is the nature of algorithms.
The relevant question nowadays though is, do human brains work more like binary adders or more like transformer models? Which simulation method is more similar to the human brain’s way of simulating things? The associative deep-learning way, or the symbolic rules-based way?
(Hint: Humans memorise examples and give “wrong” answers too when initially learning things. Even humans have to start somewhere.)
And so I ask again:
You gotta try anyways 🥲.
This is the same way that you need to run certain algorithms to find out whether the 505 trillionth digit of pi is or isn’t 9. You could also try just guessing, which is just another name for a particular algorithm. It’s empirically not a great one for this task though.