This post is the fourth in a series on the topic of Verified Certified Robustness for Neural Networks. These posts accompany and explain our recent paper A Formally Verified Robustness Certifier for Neural Networks that appeared at CAV 2025, and the broader research agenda that that paper initiates.
The series so far comprises the following posts:
- Part I: Verified or Certified Robustness? Why don’t we have both?
- Part II: When “Verified Robust Accuracy” Isn’t, Actually, Verified
- Part III: Formally Verified Certified Robustness
- Part IV: Breaking a Verified Certifier (this post)
In the previous posts I explained the formally verified robustness certifier that we built, which implements a provably sound method for efficiently checking the robustness of neural network outputs.
In this post I am going to dig into the limits of what “provably sound” means in practice, for our verified certifier. I’ll show how—even though our certifier is proved sound—there exist (albeit highly unlikely) situations in which it can be made to give misleading answers, when applied to robust-trained models for problems like MNIST and CIFAR-10. Along the way, we’ll learn some (what I found to be) pretty surprising facts about how modern machine learning substrates like TensorFlow work in reality.
This post assumes you’ve read the third post, “Formally Verified Certified Robustness”, on how we formally specified robustness.
So, how can we break a verified certifier? The short answer is by exploiting the gap between what our certifier’s formal specification says about how neural networks behave, and how they actually behave in practice.
The Real vs Ideal
This gap is something that I explicitly mentioned as a potential weakness of our certifier towards the end of the third blog post in this series. To recap: to prove our certifier sound, we had to formally specify what it means for a neural network to be robust. That involved writing down in logic how (simple, dense feed-forward) neural networks (with no bias terms and only ReLU activations) behave.
Our certifier’s formal specification formalises neural networks using real arithmetic. This choice is incredibly common in work that uses mathematics to reason about neural networks. However, in reality neural networks are not implemented using real arithmetic. Instead, they use floating point arithmetic.
For example, we formalised how a neural network n operates on an input vector v by a function called NN(n,v). NN’s definition says that v is a vector of real numbers, and that NN(n,v) returns a vector of real numbers. Using this definition, we can reason about how a neural network would behave when given an input vector containing an irrational number like π.
But in reality no actual, deployed neural network can operate over inputs like π. Instead, common machine learning frameworks like TensorFlow and PyTorch produce neural networks that operate over floating-point numbers, which are designed (only) to approximate real numbers. For example, the (single precision) floating point approximation of π is 3.14159274101257324 (which is a bit larger than π’s actual value, which is 3.1415926535897932384626433…).
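You can see this rounding for yourself in a couple of lines of Python:

```python
import numpy as np

# pi rounded to single precision, printed with enough digits to see the error
print(f"{np.float32(np.pi):.17f}")   # 3.14159274101257324
```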
Real arithmetic is much simpler to reason about than floating point arithmetic, and this is why everyone uses the former to formalise neural networks. In this sense, real arithmetic formalises how we wish neural networks behaved in an ideal world. How they actually behave is a different matter that we’ll dig into below.
(As a side note, one might contend that using the term “real arithmetic” to refer to how we imagine neural network arithmetic to behave in an ideal world—rather than how it actually behaves in the, ahem, real world—is confusing. I couldn’t agree more.)
Broken Toys
So the way we formalised neural networks doesn’t exactly match reality. This is important because any real situation that differs from the ideal world captured by our formal specification is not guaranteed to be covered by our certifier’s soundness theorem. Put more simply: if we can find a situation where reality differs from the ideal world, then it may be possible to construct scenarios in which our certifier says that something is robust when in reality it is not.
A Real vs Floating Point Discrepancy
Here is but one simple example. Let’s imagine we have two numbers: w1 and w2, such that:
$$0 < w_1 < w_2 < 1$$
Then for any positive number x, we expect that:
$$ w_1 \cdot x < w_2 \cdot x $$
This is certainly true in real arithmetic. However, it is not guaranteed to be true in floating point arithmetic. In particular, in floating point arithmetic, if w1 and w2 are very close together, then it is possible for certain values of x to have that:
$$ w_1 \cdot x = w_2 \cdot x $$
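Here is a quick numpy sketch of the phenomenon (illustrative only, not taken from the paper): pick two adjacent float32 weights strictly between 0 and 1 and search randomly for a positive x whose products with them round to the same float32 value. Such an x is not guaranteed for any particular draw, which is why the sketch searches.

```python
import numpy as np

w1 = np.float32(0.98)
w2 = np.nextafter(w1, np.float32(1.0))   # smallest float32 strictly greater than w1
assert 0 < w1 < w2 < 1

rng = np.random.default_rng(0)
for _ in range(10_000):
    x = np.float32(rng.uniform(0.0, 1.0))
    if x > 0 and w1 * x == w2 * x:       # a tie: impossible in real arithmetic
        print(f"x = {x!r} gives w1*x == w2*x == {(w1 * x)!r}")
        break
```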
Weaponising this Discrepancy
Can this discrepancy, between how real and floating point arithmetic each behave, undermine the guarantees provided by our verified certifier’s soundness theorem? In short: yes, at least for specially crafted toy neural networks.
One such network is depicted below. It takes 1-element vectors x as its input and produces 2-element vectors y = [y1,y2] as its output. It contains just two neurons, with weights w1 and w2 respectively, no bias terms, and uses no activation functions.
Thus we have for each i ∈ {1,2}:
$$y_i = w_i \cdot x$$
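For concreteness, here is a minimal Keras sketch of this toy network (a single Dense layer with no bias and no activation); the weight values shown are placeholders, with the actual attack values appearing below.

```python
import numpy as np
import tensorflow as tf

# y = [w1 * x, w2 * x]: one input, two outputs, no bias, no activation
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(2, use_bias=False, activation=None),
])
model.set_weights([np.array([[0.5, 0.75]], dtype=np.float32)])  # kernel = [[w1, w2]]
print(model.predict(np.array([[0.2]], dtype=np.float32)))       # approximately [[0.1, 0.15]]
```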
Let’s suppose I am an adversary who can choose the values of the weights for this neural network, and that I also have the ability to influence what inputs x the neural network is given. My goal is to find weights and inputs that cause our verified certifier to produce misleading answers: specifically, to say that an output is robust at some perturbation bound ε even though it is not.
The adversary’s goal is to find weights w1, w2 and inputs xa, xb such that:
$$\mathit{ArgMax}([w_1 \cdot x_a, w_2 \cdot x_a])\not=\mathit{ArgMax}([w_1 \cdot x_b, w_2 \cdot x_b])$$
when computed using floating point arithmetic; however, when computed using real arithmetic the adversary wants the opposite to be true:
$$\mathit{ArgMax}([w_1 \cdot x_a, w_2 \cdot x_a])=\mathit{ArgMax}([w_1 \cdot x_b, w_2 \cdot x_b])$$
Without loss of generality, let’s assume w1 < w2. Then, in real arithmetic, for any positive input x, we have:
$$w_1 \cdot x < w_2 \cdot x$$
and so
$$\mathit{ArgMax}([w_1 \cdot x, w_2 \cdot x])=1$$
always (assuming vectors are indexed from 0, so that 1 is the index of the second element). Thus, in real arithmetic we will have for any positive xa, xb:
$$\mathit{ArgMax}([w_1 \cdot x_a, w_2 \cdot x_a])=\mathit{ArgMax}([w_1 \cdot x_b, w_2 \cdot x_b])$$
since both sides will be equal to 1.
However, as we said above, we know this is not guaranteed to hold for floating point arithmetic, because for certain choices of the weights and inputs x we can have:
$$ w_1 \cdot x = w_2 \cdot x $$
in which case (assuming ArgMax breaks ties by choosing the lowest index, as is practically universal):
$$\mathit{ArgMax}([w_1 \cdot x, w_2 \cdot x])=0$$
With that in mind, the adversary simply needs to find weights w1, w2 and positive inputs xa, xb such that:
- w1 < w2
- w1 · xa = w2 · xa in floating point arithmetic, and
- w1 · xb < w2 · xb in floating point arithmetic
Having found such a combination, it’s possible that our certifier will certify the output [w1 · xb, w2 · xb] produced by input xb as robust at ε = ||xb-xa||, even though in reality xa is a counter-example. (Here we write ||v|| for the Euclidean norm of vector v.)
Using Z3 to Find an Attack
The constraints above can be encoded as a constraint satisfaction problem over floating point arithmetic and given to an SMT solver like Z3, which can find values for the weights and inputs that cause our certifier to produce misleading answers.
I wrote such an encoding as a Z3 Python program to find suitable values for the neural network weights and inputs, and then to confirm that they lead to non-robust results when the neural network is implemented in Keras/TensorFlow, and finally to encode the neural network and the counter-example output for consumption by our verified certifier.
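For illustration, here is a minimal sketch of the core constraints of such an encoding (not the actual script), over Z3’s IEEE single-precision floating point theory and assuming every operation rounds to nearest, ties to even:

```python
from z3 import FPs, FPVal, Float32, RNE, Solver, fpEQ, fpLT, fpMul, sat

F = Float32()
rm = RNE()                                   # round to nearest, ties to even
w1, w2, xa, xb = FPs('w1 w2 xa xb', F)
zero, one = FPVal(0.0, F), FPVal(1.0, F)

s = Solver()
s.add(fpLT(zero, w1), fpLT(w1, w2), fpLT(w2, one))    # 0 < w1 < w2 < 1
s.add(fpLT(zero, xa), fpLT(zero, xb))                 # positive inputs
s.add(fpEQ(fpMul(rm, w1, xa), fpMul(rm, w2, xa)))     # tie at xa
s.add(fpLT(fpMul(rm, w1, xb), fpMul(rm, w2, xb)))     # no tie at xb

if s.check() == sat:
    m = s.model()
    print(m[w1], m[w2], m[xa], m[xb])
```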
Doing so finds weights and inputs:
- w1 = 0.9890594482421875,
- w2 = 0.989059507846832275390625,
- xa = 0.157976210117340087890625, and
- xb = 0.19721232354640960693359375,
such that (in 32-bit floating point arithmetic):
- ||xb - xa|| = 0.03923611342906951904296875 and
- w1 · xb = 0.195054709911346435546875 and
- w2 · xb = 0.19505472481250762939453125,
and for which our certifier happily (and misleadingly) certifies the output vector [w1 · xb, w2 · xb] as robust at values of ε well above ||xb - xa||. For instance, our certifier certifies this output for ε = 0.2471875250339508056640625, which is more than 6 times ||xb - xa||. This certification is misleading because we know that in practice [w1 · xb, w2 · xb] is not robust: xa is well within ε of xb, yet yields a different answer, since w1 · xa = w2 · xa and hence ArgMax([w1 · xa, w2 · xa]) = 0.
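These values are easy to re-check in numpy, assuming (as in the Keras implementation of the toy network) that each output is a single float32 multiplication:

```python
import numpy as np

w1 = np.float32(0.9890594482421875)
w2 = np.float32(0.989059507846832275390625)
xa = np.float32(0.157976210117340087890625)
xb = np.float32(0.19721232354640960693359375)

ya = np.array([w1 * xa, w2 * xa])
yb = np.array([w1 * xb, w2 * xb])
print(ya[0] == ya[1])                 # True: a tie at xa
print(np.argmax(ya), np.argmax(yb))   # 0 1 (np.argmax breaks ties at the lowest index)
print(abs(xb - xa))                   # about 0.0392361, the distance ||xb - xa||
```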
Long story short: SMT solvers can be used to construct toy neural networks, inputs and perturbation bounds for which our certifier gives misleading answers. This happens because of the discrepancy between floating point and real arithmetic.
Attacking Robust Image Classifiers
Showing the existence of toy neural networks, specially crafted by an adversary, that cause our certifier to give misleading results is one thing. But what about for more realistic neural networks?
Our certifier is designed to be deployed with robust-trained neural networks. Indeed, we evaluated it against robust-trained classifiers for standard problems like MNIST and CIFAR-10 to show its effectiveness.
Can we exploit the gap between idealised real arithmetic and actual neural network implementation for such models?
Again, the answer is yes—at least for specific (perhaps very unlikely) scenarios.
The Power of Ties
We saw earlier how floating point arithmetic could produce answers that we knew were wrong, in the form of ties: output vectors containing two equal elements that also happened to be the two maximal elements of the vector.
Specifically, in the earlier example we found an input xa that (when run through the neural network evaluated using floating point arithmetic) produced an output vector whose elements were equal, i.e., were tied. The existence of xa meant that the nearby vector xb was mistakenly classified as robust, precisely because the way the model treated xa in floating point (producing a tie, and thereby altering the ArgMax) was different to how it would have treated xa in idealised real arithmetic (in which no tie would have been produced).
Therefore, to attack this same problem for much larger models like MNIST and CIFAR-10, we might seek places where we expect the model to give inaccurate outputs due to floating-point phenomena like ties.
Indeed, the counter-example above consisted of two points xa and xb belonging to different classes, (only) due to the fact that xa produced a tie which caused the ArgMax to change.
In this sense, points xa that produce ties can be seen as footholds from which we might hope to construct certification counter-examples, by searching for nearby points xb for which ArgMax(yb) ≠ ArgMax(ya) but where xb is certified robust at ε ≥ ||xa-xb||. Call this the counter-example condition.
For image classifiers, we want the xa and xb to be, well, images. Ideally drawn from (or close to) the same data distribution on which these models were trained and evaluated.
From Foothold to Attack
This leads to the following search procedure:
1. Given a starting image xnat, use an existing adversarial example search algorithm to find a nearby point x~tie that crosses the model’s decision boundary (ArgMax(y~tie) ≠ ArgMax(ynat)) but is very close to it, and is thereby likely to produce a near-tie.
2. Search between xnat and x~tie to find an initial counter-example: a pair of points xa and xb that meet the counter-example condition above (sketched in the code below).
3. Try to extend the counter-example by searching outwards from each of xa and xb, to identify a final counter-example pair that meets the counter-example condition and maximises ε.
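The core of step 2 looks roughly like the sketch below. The helpers predict_label and certified_epsilon are hypothetical stand-ins for our actual code (they are not library functions): the first returns the ArgMax of the model’s float32 output on an image, and the second returns the perturbation bound at which our verified certifier certifies the model’s output on an image (or 0 if it declines to certify).

```python
import numpy as np

def find_initial_counterexample(x_nat, x_tie, predict_label, certified_epsilon,
                                steps=60):
    """Bisect between x_nat and x_tie (float32 images with different labels)
    to pin down a pair (xa, xb) straddling the decision boundary, then check
    the counter-example condition: different labels, yet xb certified at an
    epsilon that covers xa. A sketch only, not the real search code."""
    lo, hi = x_nat, x_tie
    for _ in range(steps):
        mid = ((lo + hi) / 2).astype(np.float32)
        if predict_label(mid) == predict_label(lo):
            lo = mid
        else:
            hi = mid
    xa, xb = hi, lo                            # very close together, labels differ
    eps = certified_epsilon(xb)
    if predict_label(xa) != predict_label(xb) and np.linalg.norm(xa - xb) <= eps:
        return xa, xb, eps                     # the certification of xb is misleading
    return None
```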
We implemented this procedure as a Python program. For step 1 we used the DeepFool search algorithm, implemented in the Adversarial Example Toolbox.
When run on the image classification models over which we evaluated our verified certifier, it finds instances in which our certifier gives misleading answers.
One example for each of the models we tried is depicted below, noting that I have enlarged each image for ease of viewing. You can see both the original test point xnat and the final counter-example point xb, along with the perturbation bound ε for which our certifier said xb was robust. In each case we found a nearby xa (not shown) such that ||xb-xa|| < ε.
In general, this search procedure is able to produce counter-examples certified for ε around c × 10⁻ᵈ, where d is 6 or 7 and 0 < c < 10.
Practical Significance
Does this matter? After all, the perturbation bounds we are talking about are minuscule in practice. One typically considers perturbation bounds of at least 0.14–0.3 for these kinds of images: the bounds here are some 5 orders of magnitude smaller than that!
Therefore, I think one can rightly contend that these examples show how our certifier can be misleading in principle but perhaps not in practice.
I conjecture that somebody cleverer than me could come up with a better search procedure able to find practical counter-examples at larger perturbation bounds ε.
Indeed, clever people have previously shown how to construct practical counter-examples for certain kinds of robustness checkers. That approach works for robustness checkers that are complete, in the sense that, when given a point xb that the checker says is not robust at perturbation bound ε, they return an example point xa with ||xb-xa|| ≤ ε showing why xb is not robust at ε.
Our verified certifier is not complete, so we had to come up with a different method for searching for potential counter-examples. Nonetheless, the counter-examples it finds still demonstrate the potential for our certifier to give misleading answers.
The Difficulty of Formally Modelling the TensorFlow Implementation
What might we do about this?
One obvious avenue would be to try to work out how to precisely model neural network behaviour in logic and then prove theorems that bound the difference between how a neural network might behave in reality (when implemented using a framework like TensorFlow) vs in the ideal world of real arithmetic.
For instance, we could imagine augmenting the NN specification above, which operates over vectors of real numbers, with one that instead operates over vectors of floating point numbers.
However, this alone won’t fully close the specification gap between the ideal world and reality. This is because modern frameworks like TensorFlow that implement machine learning models do not (exactly) implement floating point semantics.
I stumbled on this by accident while writing the code to find the attack on the MNIST model above. Specifically, when running this code, it found inputs xb for which yb ≠ 0 when computed by TensorFlow; however, when computed using real arithmetic (technically arbitrary precision rationals) or 32-bit floating point arithmetic, yb = 0.
I spent far too much time understanding how this discrepancy could arise. Those more familiar with TensorFlow might already be able to guess.
It happens in particular for inputs xb that when processed by the neural network (implemented using standard floating point or real arithmetic) produce pre-activations (i.e. the numbers fed into a layer’s activation function) that are all negative but very close to 0. After processing by the ReLU activation function, those pre-activations become 0, and those 0s then propagate unchanged through the remainder of the network (since the neural network in question uses no bias terms) producing the 0 output (under floating point or real arithmetic).
However, in TensorFlow it is possible for a pre-activation that is negative but close to 0 to be approximated instead by a (very small) positive number. For example, I found cases in which small-magnitude negative pre-activations like -2.676 × 10⁻⁹ would (when computed by TensorFlow) instead come out like 1.118 × 10⁻⁸. When this happens, the result of applying ReLU is no longer the 0 vector, which means that the final model output yb under TensorFlow would be non-zero.
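A sketch of the kind of check that surfaces this discrepancy appears below: it compares a layer’s pre-activations as computed by TensorFlow against a strict, operation-by-operation float32 dot product (rounding to single precision after every multiply and add, which is one reasonable reading of standard floating point semantics). Here W and x are stand-ins for the real model’s first-layer kernel and a suspect input.

```python
import numpy as np
import tensorflow as tf

def strict_f32_dot(col, x):
    """Sequential float32 dot product, rounding after every operation."""
    acc = np.float32(0.0)
    for wi, xi in zip(col.astype(np.float32), x.astype(np.float32)):
        acc = np.float32(acc + np.float32(wi * xi))
    return acc

def compare_first_layer(W, x):
    """Report units whose pre-activation sign differs between TensorFlow
    and the strict float32 computation (a sign flip changes the ReLU output).
    W: (num_inputs, num_units) float32 kernel; x: float32 input vector."""
    tf_pre = tf.matmul(tf.constant(x[None, :], tf.float32),
                       tf.constant(W, tf.float32)).numpy()[0]
    for j in range(W.shape[1]):
        f32_pre = strict_f32_dot(W[:, j], x)
        if (tf_pre[j] > 0) != (f32_pre > 0):
            print(f"unit {j}: tensorflow={tf_pre[j]!r}, strict float32={f32_pre!r}")
```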
The following Venn diagram depicts the situation graphically. Here we imagine the neural network is represented by a function f. The subscripts indicate whether f is implemented using real arithmetic (subscript ℝ) or by TensorFlow (subscript TF). The left overlapping part of the diagram shows how there exist inputs x for which f(x) might be non-zero when computed by TensorFlow but 0 when computed under real arithmetic (which is the surprising case we just discussed). The right overlapping part of the diagram shows the less surprising case in which floating point rounding might produce the 0 vector output even though under real arithmetic the answer may be non-zero. The top and bottom overlapping parts depict cases in which real arithmetic and TensorFlow are in agreement about whether a model output is zero.
In short: TensorFlow doesn’t implement standard floating point semantics; it sometimes agrees with neither floating point nor real arithmetic about whether an answer is 0. I suspect that formally specifying precisely how frameworks like TensorFlow implement neural networks is likely to be very challenging.
Conclusion
This post leaves many questions open. We showed how our verified certifier can be made to produce misleading answers in principle, by exploiting the gap between how neural networks behave in reality vs in the ideal world of real arithmetic. But it remains open how practical these discrepancies are in real-world use cases.
We also explained why closing this gap is likely to be very challenging, if it involves precisely modelling how practical machine learning frameworks like TensorFlow operate in reality. It remains open how we might close this gap otherwise.
If you’re interested in helping to answer these questions, please get in touch.
In the meantime, thanks for reading.