Thinking in Uncertainty
What is non-determinism exactly?
One thing that lives rent-free in my head is conflation.
It could be a consequence of PhD training. I really dreaded seeing the term “overloaded” (meaning multiple meanings assigned to a term) as a comment, whether from my supervisor or paper reviewers. It usually meant a painful and laborious scrub and re-structuring of my research paper.
So when I see concepts or terms being conflated, I somehow can’t get it out of my head.
One common example is the conflation of AI regulation, AI governance and AI risk management (a topic for another day).
Another is the loose use of the concept of “determinism” to mean different things in different contexts.
Thinking Machines, the startup founded by Mira Murati, the ex-CTO of OpenAI, published its first piece online titled “Defeating Nondeterminism in LLM Inference” around a month ago.
It triggered a fair number of posts waxing lyrical over it. Some were clear about its contribution - results that showed how to achieve reproducibility for large language models (LLMs), i.e. same input → same output. Even the article was crystal clear on what it meant by determinism. It started with “Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language models.”
But others, some of whom I am not sure even read the paper, started conflating all uncertainty in AI with what Thinking Machines was solving for. Some even linked it to hallucinations, which is kind of ridiculous. All Thinking Machines did was propose a method to ensure that for the same input, we get the same output. It had absolutely nothing to do with LLMs being confidently wrong about an answer (except that we could now reproduce the same wrong answer to the same question every time).
So what’s really going on here?
The reality is that AI systems are still fundamentally probabilistic. They still exhibit various forms of uncertainty.
Thinking Machines solved just one specific dimension of non-determinism, focusing on the numerical and implementation aspects of how we run inference. But other fundamental types of uncertainty remain in AI systems, and understanding them is critical for AI risk management.
We use terms like “deterministic,” “probabilistic,” and “uncertain” to mean different things in different contexts. And as AI systems evolve from machine learning to generative AI to agentic AI, the nature of uncertainty evolves with them.
In my previous article, “Thinking in Risks,” I introduced the 3 U’s framework—Uncertainty (known unknowns), Unexpectedness (unknown unknowns), and Unexplainability (difficulty understanding AI decisions).
Today, I want to zoom in on the Uncertainty dimension.
The Four Sources of Uncertainty
We need to separate three fundamentally different sources of non-determinism at the model level, plus one additional source that emerges at the system level:
Implementation Non-Determinism: What Thinking Machines tackled, as discussed above.
Knowledge Uncertainty: What the model doesn’t know.
Irreducible Uncertainty: Data noise and task ambiguity.
System Uncertainty: Exemplified by the uncertainties of agents interacting with dynamic environments.
Implementation Non-Determinism (What Thinking Machines Was Focusing On)
This isn’t really “uncertainty” about what the model knows. It’s just implementation variability. Different ways of training or running inference on the same model can produce different outputs, even though the underlying model weights and inputs are identical.
There are several sources of this variability:
Random seeds and initialization. When models sample from probability distributions to initialize their parameters, different random seeds lead to different outcomes.
Floating-point arithmetic. In mathematics, (a + b) + c always equals a + (b + c). But in floating-point arithmetic used by computers, different computation orders can yield slightly different results.
Parallel processing variability. GPUs execute operations concurrently, and operations can complete in variable order.
Sampling strategies. For LLMs, when temperature is set to 0 (greedy decoding), the model always picks the highest probability token. But with temperature above 0, the model samples from the probability distribution, introducing randomness.
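The floating-point point in the list above can be demonstrated in a few lines of Python. The values here are arbitrary, chosen only to expose double-precision rounding:

```python
# Mathematically, (a + b) + c always equals a + (b + c). In IEEE 754
# doubles, the order of operations changes which intermediate results
# get rounded, so the two groupings can disagree.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # a + b cancels exactly to 0.0, then + 1.0 -> 1.0
right = a + (b + c)  # b + c rounds back to -1e16, because 1.0 is smaller
                     # than the gap between adjacent doubles at 1e16 -> 0.0

print(left, right)   # 1.0 0.0 -- same inputs, different result
```

This is exactly why summing the same numbers in a different order on a GPU can produce slightly different totals.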
Thinking Machines developed a technical solution that ensures the same computation order regardless of variations in batching, eliminating this source of variability.
Here’s the crucial nuance: Even with perfect reproducibility, the model still outputs a probability distribution internally. For “What is your mood?”, the model might have [Happy: 49.9%, Depressed: 50.1%].
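To make the Happy/Depressed example concrete, here is a minimal sketch of greedy decoding versus temperature sampling over a two-token distribution. The logit values are made up to reproduce roughly the 49.9%/50.1% split above; real LLM vocabularies are vastly larger:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution.
    Lower temperature sharpens it; temperature -> 0 approaches argmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["Happy", "Depressed"]
logits = [2.0, 2.004]   # toy scores giving roughly 49.9% / 50.1%

probs = softmax(logits)
greedy = tokens[probs.index(max(probs))]   # temperature 0: always "Depressed"

rng = random.Random(42)                    # fixed seed makes sampling reproducible
sampled = rng.choices(tokens, weights=probs, k=5)  # five draws from the distribution

print(probs)    # ~[0.499, 0.501]
print(greedy)   # Depressed
print(sampled)
```

Even with the sampling seed pinned for reproducibility, the near-50/50 distribution is still there underneath: the model is no less "probabilistic" for being deterministic to run.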
I’m not saying that implementation variability is unimportant. It’s critical for debugging and testing, and for reviews and validations. There is a reason most computer science research papers devote a section to implementation details: so that such reproducibility can be achieved. Even when exact reproducibility isn’t required, you need to understand the sources of variability and their limits.
Knowledge Uncertainty (What the Model Doesn’t Know)
This is what most people think of when they hear “AI uncertainty.” The model is not adequately trained. It’s operating outside its training distribution. The technical term is epistemic uncertainty, from epistemology, the theory of knowledge.
Some examples:
A credit scoring model trained on prime borrowers makes confident predictions on subprime applicants, even though it has never seen this type of borrower before.
Fraud detection systems fail on novel fraud patterns not in their training data.
Medical diagnosis AI trained on one demographic shows uncertainty when applied to others.
Generative AI confidently offering medical advice on non-existent conditions or drug interactions.
Unlike Implementation Non-Determinism, this can’t be solved by smart engineering.
You need more and better data, or better training techniques. It’s also partially addressable through better calibration, uncertainty quantification, and knowledge grounding. But all models will always have knowledge boundaries.
The question isn’t whether these boundaries exist, but whether we know where they are and can manage them appropriately.
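One common uncertainty-quantification technique alluded to above is ensemble disagreement: train several models and treat the spread of their predictions as a proxy for knowledge (epistemic) uncertainty. Where members agree, the data supported a confident fit; where they diverge, the model is extrapolating past its knowledge boundary. The sketch below uses hand-picked toy linear models rather than anything actually trained; the coefficients are assumptions for illustration only:

```python
import statistics

# Three toy "ensemble members": linear models that, by construction,
# agree closely near a notional training range (x in [0, 10]) but
# extrapolate differently. Coefficients are illustrative, not fitted.
members = [
    lambda x: 1.00 * x + 0.1,
    lambda x: 1.02 * x - 0.1,
    lambda x: 0.98 * x + 0.0,
]

def ensemble_predict(x):
    """Return the mean prediction and the spread across members.
    The spread serves as a rough proxy for epistemic uncertainty."""
    preds = [m(x) for m in members]
    return statistics.mean(preds), statistics.pstdev(preds)

mean_in, spread_in = ensemble_predict(5.0)      # inside the training range
mean_out, spread_out = ensemble_predict(500.0)  # far outside it

print(f"x=5:   mean={mean_in:.2f}, spread={spread_in:.3f}")
print(f"x=500: mean={mean_out:.2f}, spread={spread_out:.3f}")
# The spread grows with distance from the training data: the ensemble
# "knows what it doesn't know", at least directionally.
```

This is the spirit of deep ensembles and similar methods: they don’t remove the knowledge boundary, but they help you see where it is.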
Such uncertainty in Generative AI models may also lead to unexpected behavior: when the model does not know the answer, it generates a believable answer that is entirely wrong, i.e. a hallucination.
Irreducible Uncertainty
This is the type of uncertainty that no amount of training data or model improvement can eliminate. It comes from two fundamental sources:
Data noise: The measurements themselves contain randomness, from sensor precision limits, human labeling inconsistencies, and natural variation in the phenomena being measured.
Task ambiguity: The problem genuinely has multiple valid answers due to unstated preferences, legitimate expert disagreement, or creative choices.
The technical term is aleatoric uncertainty, from the Latin alea, a die, i.e. chance.
Unlike Knowledge Uncertainty, where more data helps the model learn, here the variability is inherent. Noise in real-world data is ever-present. Many problems have multiple valid solutions.
Some examples:
Recommendation systems: “Best product for this user” depends on unstated preferences, current context, and timing that the system can’t fully know. User preferences are inherently variable, not just unknown.
Noisy sensor data: A temperature sensor with ±0.5°C precision will yield different readings even for the same actual temperature. The measurement noise is irreducible.
Medical diagnosis: Two expert radiologists legitimately disagree on a borderline case. This isn’t lack of expertise. It’s genuine ambiguity in the medical evidence.
Generative AI: “Summarize this article”, but what length, which audience, what focus? “Write a marketing email”, but what tone, length, call-to-action, and audience assumptions?
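The sensor example in the list above can be made concrete with a short simulation. The ±0.5°C figure comes from the text; modelling the noise as Gaussian at that scale is an assumption for illustration:

```python
import random
import statistics

rng = random.Random(0)    # fixed seed so the demo itself is reproducible

TRUE_TEMP = 21.0          # the actual temperature being measured
SENSOR_SIGMA = 0.5        # noise scale matching the +/-0.5 C precision in the text

def read_sensor():
    """One reading = truth plus irreducible measurement noise (assumed Gaussian)."""
    return TRUE_TEMP + rng.gauss(0.0, SENSOR_SIGMA)

readings = [read_sensor() for _ in range(10_000)]

# Averaging many readings recovers the underlying temperature...
print(round(statistics.mean(readings), 2))    # close to 21.0
# ...but the spread of individual readings never shrinks. That spread is
# the aleatoric part: no amount of modelling removes it.
print(round(statistics.pstdev(readings), 2))  # close to 0.5
```

More data sharpens your estimate of the mean, but each new reading is still ±0.5°C noisy, which is exactly the distinction between knowledge and irreducible uncertainty.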
This is irreducible variability in the problem itself.
Less noisy data or clearer task specifications could help, to an extent. But beyond that, the remaining variability is, as the name suggests, irreducible.
Expecting a single deterministic answer misunderstands the issue. The solution isn’t to eliminate the variability but to clarify requirements, bound acceptable outputs, and embrace appropriate variability where it makes sense. Calibration and uncertainty quantification could also help.
System-Level Uncertainty
Here’s where things get even more interesting, and more complicated.
The Critical Distinction: Models vs Systems.
A model makes a single prediction or generation: classify this transaction, generate this summary, answer this question. Input to output.
In the extreme, a system could be a multi-step, tool-using, adaptive automaton that orchestrates multiple model calls, tools, and decisions. It maintains memory, perceives its environment, plans actions, and adapts based on feedback. These days, we call this Agentic AI, which we shall focus on here.
Such system-level complexity introduces new sources of uncertainty.
Welcome to the world of autonomous agents that perceive, reason, plan, and act. They use tools, maintain memory, and sometimes coordinate with other agents. They exhibit behaviors that go beyond individual model predictions.
The root cause of uncertainty here is environmental uncertainty and complexity.
Unlike the prior sources of uncertainty, which generally concern the model itself, this uncertainty arises from agents operating in dynamic, uncertain environments. The environment can change between steps. The agent’s actions can change the environment, creating feedback loops. Multiple agents interact, producing multi-agent dynamics. Perfect knowledge of the environment is impossible.
Such uncertainty in Agentic AI spans both Uncertainty and Unexpectedness from my 3 U’s framework.
Known Unknowns: We know the system will behave variably. We can enumerate sources - environment changes, tool states, different reasoning paths, and developers can anticipate some of these behaviors, and evaluate and test the Agentic AI system.
Example: “Agent might take different paths to achieve the same goal” ← This is expected variability. Developers can develop tests to understand and mitigate this, and set up monitoring metrics and thresholds.
Crossing into Unexpectedness (Unknown Unknowns): The system does something we never imagined. Even experts are genuinely surprised.
Example: “Agent discovered a loophole we never considered” ← This is genuine surprise. We may need additional controls such as red-teaming to address this.
There is also the dimension of time that we need to consider across knowledge, irreducible, and system uncertainty. Changes in data distributions, in the environment, or even in knowledge itself can also lead to uncertainties. But going deeper into this aspect is beyond the scope of this note, and is a topic for another day.
Conclusion
Different types of uncertainty require fundamentally different controls.
Conflating them leads to misaligned expectations and ineffective risk management.
As we build more and more complex AI, understanding these distinctions becomes critical, not just technically, but for risk management.
When someone claims to have “solved AI uncertainty,” ask: Which type?
When you’re building AI systems, ask: Which types of uncertainty am I dealing with?
When you’re managing AI risk, ask: Do my controls match the uncertainty type?
When you’re communicating with stakeholders, ask: Are we using “probabilistic” to mean the same thing?
Which types of uncertainty do you encounter most in your work?
#AI #MachineLearning #AIRiskManagement #AgenticAI #ThinkingInAI #Uncertainty #AIGovernance #ResponsibleAI
Read more about the full 3 U’s framework (Uncertainty, Unexpectedness, and Unexplainability) in my previous article: “Thinking in Risks”
https://quaintitative.substack.com/p/thinking-in-risks-uncertainty-unexpectedness
The Thinking Machines paper - https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/


Bit of a digression but I think you might be too optimistic when you write "When you’re communicating with stakeholders, ask: Are we using “probabilistic” to mean the same thing?".
Anecdote time: I once had to explain to a fairly senior executive why the average of the entire population wasn't just the average of the sub-group averages. And when he did finally get it, he still said something like well - that's all well and good but all of us "know" the number should really be X, your data must be wrong.
The craziest thing was that wasn’t even my main point. I wanted to show that the population had a bi-modal distribution, so taking a “point estimate”, whether the median or the mean, to feed into a financial calculation that was going to be used to set policies was not going to be “representative”, and a scenario-based approach made more sense.
Couldn't agree more, what if misinterpretations of core terms like 'determinism' - especially the precise definiton Thinking Machines presented - lead to a muddled understanding of actual AI capabilites, resulting in ineffective governance frameworks or even misplaced public trust?