Breaking a Verified Certifier

Verified Certified Robustness for Neural Networks - Part IV

Note: this post was heavily updated on March 22, 2026 with the inclusion of the counter-examples against the adversarially-biased MNIST model.

This post is the fourth in a series on the topic of Verified Certified Robustness for Neural Networks. These posts accompany and explain our recent paper A Formally Verified Robustness Certifier for Neural Networks, which appeared at CAV 2025, the more recent Lipschitz-Based Robustness Certification Under Floating-Point Execution, and the broader research agenda that those papers initiate.

In the previous posts I explained the formally verified robustness certifier that we built, which implements a provably sound method for efficiently checking the robustness of neural network outputs.

In this post I am going to dig into the limits of what “provably sound” means in practice, for our verified certifier. I’ll show how—even though our certifier is proved sound—there exist situations in which it can be made to give misleading answers when applied to robust-trained models for problems like MNIST and CIFAR-10.

This post assumes you’ve read the third post, “Formally Verified Certified Robustness”, on how we formally specified robustness.

So, how can we break a verified certifier? The short answer is by exploiting the gap between what our certifier’s formal specification says about how neural networks behave, and how they actually behave in practice.

The Real vs Ideal

This gap is something that I explicitly mentioned as a potential weakness of our certifier towards the end of the third blog post in this series. To recap: to prove our certifier sound, we had to formally specify what it means for a neural network to be robust. That involved writing down in logic how (simple, dense feed-forward) neural networks (with no bias terms and only ReLU activations) behave.

Our certifier’s formal specification formalises neural networks using real arithmetic. This choice is incredibly common in work that uses mathematics to reason about neural networks. However, in reality neural networks are not implemented using real arithmetic. Instead, they use floating point arithmetic.

For example, we formalised how a neural network n operates on an input vector v by a function called NN(n,v). NN’s definition says that v is a vector of real numbers, and that NN(n,v) returns a vector of real numbers. Using this definition, we can reason about how a neural network would behave when given an input vector containing an irrational number like π.
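
To give a flavour (this is a sketch of the shape of the definition, not a quote from the formal development): for a network n with weight matrices W1, …, WL, no bias terms, and ReLU applied componentwise at each hidden layer, NN is essentially the composition

$$\mathit{NN}(n, v) = W_L \cdot \mathrm{ReLU}(W_{L-1} \cdot (\cdots \, \mathrm{ReLU}(W_1 \cdot v) \cdots))$$

with every product and activation interpreted over the real numbers.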

But in reality no actual, deployed neural network can operate over inputs like π. Instead, common machine learning frameworks like TensorFlow and PyTorch produce neural networks that operate over floating-point numbers, which are designed (only) to approximate real numbers. For example, the (single precision) floating point approximation of π is 3.14159274101257324 (which is a bit larger than π’s actual value, which is 3.1415926535897932384626433…).
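
You can observe this directly in any framework’s single-precision arithmetic. For example, with NumPy (a quick illustration, not part of our tooling):

```python
import numpy as np

# Round π to the nearest representable single-precision (float32) value.
pi32 = np.float32(np.pi)

print(f"{pi32:.17f}")  # 3.14159274101257324
print(pi32 > np.pi)    # True: the float32 value slightly overshoots π
```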

Real arithmetic is much simpler to reason about than floating point arithmetic, and this is why everyone uses the former to formalise neural networks. In this sense, real arithmetic formalises how we wish neural networks behaved in an ideal world. How they actually behave is a different matter that we’ll dig into below.

(As a side note, one might contend that using the term “real arithmetic” to refer to how we imagine neural network arithmetic to behave in an ideal world—rather than how it actually behaves in the, ahem, real world—is confusing. I couldn’t agree more.)

Broken Toys

So the way we formalised neural networks doesn’t exactly match reality. This is important because any real situation that differs from the ideal world captured by our formal specification is not guaranteed to be covered by our certifier’s soundness theorem. Put more simply: if we can find a situation where reality differs from the ideal world, then it may be possible to construct scenarios in which our certifier says that something is robust when in reality it is not.

A Real vs Floating Point Discrepancy

Here is but one simple example. Let’s imagine we have two numbers: w1 and w2, such that:

$$0 < w_1 < w_2 < 1$$

Then for any positive number x, we expect that:

$$ w_1 \cdot x < w_2 \cdot x $$

This is certainly true in real arithmetic. However, it is not guaranteed to be true in floating point arithmetic. In particular, if w1 and w2 are very close together, then for certain values of x it is possible that:

$$ w_1 \cdot x = w_2 \cdot x $$
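
This is easy to witness concretely. The sketch below (my illustration, using NumPy) takes two adjacent float32 weights, one unit in the last place (ulp) apart, and scans for a positive x whose products collide:

```python
import numpy as np

# Two adjacent float32 weights: w2 is the smallest float32 strictly
# greater than w1, so 0 < w1 < w2 < 1 holds exactly.
w1 = np.float32(0.75)
w2 = np.nextafter(w1, np.float32(1.0))

# Scan random positive inputs for one where the float32 products
# round to the very same value, despite w1 < w2.
rng = np.random.default_rng(0)
for _ in range(1000):
    x = np.float32(rng.random())
    if x > 0 and w1 * x == w2 * x:
        print(f"w1 < w2 but w1*x == w2*x == {w1 * x} for x = {x}")
        break
```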

Weaponising this Discrepancy

Can this discrepancy, between how real and floating point arithmetic each behave, undermine the guarantees provided by our verified certifier’s soundness theorem? In short: yes, at least for specially crafted toy neural networks.

One such network is depicted below. It takes 1-element vectors x as its input and produces 2-element vectors y = [y1,y2] as its output. It contains just two neurons with weights w1 and w2 respectively, has no bias terms, and uses no activation functions.

A toy neural network with two neurons

Thus we have for each i ∈ {1,2}:

$$y_i = w_i \cdot x$$
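
In Keras/TensorFlow (which we use later to confirm the attack), this toy network is a single dense layer with no bias and no activation. A minimal sketch, with placeholder weights:

```python
import numpy as np
import tensorflow as tf

w1, w2 = 0.25, 0.75  # placeholders: the adversary's actual values come later

# One dense layer mapping a 1-element input x to [w1*x, w2*x]:
# no bias terms, no activation function.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(2, use_bias=False, activation=None),
])
model.layers[0].set_weights([np.array([[w1, w2]], dtype=np.float32)])

x = np.array([[0.5]], dtype=np.float32)
print(model(x).numpy())  # [[w1*x, w2*x]], computed in float32
```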

Let’s suppose I am an adversary who can choose the values of the weights for this neural network, and that I also have the ability to influence what inputs x the neural network is given. My goal is to find weights and inputs that cause our verified certifier to produce misleading answers: specifically, to say that an output is robust at some perturbation bound ε even though it is not.

Concretely, that means finding weights w1, w2 and inputs xa, xb such that:

$$\mathit{ArgMax}([w_1 \cdot x_a, w_2 \cdot x_a])\not=\mathit{ArgMax}([w_1 \cdot x_b, w_2 \cdot x_b])$$

when computed using floating point arithmetic; however, when computed using real arithmetic the adversary wants the opposite to be true:

$$\mathit{ArgMax}([w_1 \cdot x_a, w_2 \cdot x_a])=\mathit{ArgMax}([w_1 \cdot x_b, w_2 \cdot x_b])$$

Without loss of generality, let’s assume w1 < w2. Then, in real arithmetic, for any positive input x, we have:

$$w_1 \cdot x < w_2 \cdot x$$

and so

$$\mathit{ArgMax}([w_1 \cdot x, w_2 \cdot x])=1$$

always (assuming vectors are indexed from 0, so that 1 is the index of the second element). Thus, in real arithmetic we will have for any positive xa, xb:

$$\mathit{ArgMax}([w_1 \cdot x_a, w_2 \cdot x_a])=\mathit{ArgMax}([w_1 \cdot x_b, w_2 \cdot x_b])$$

since both sides will be equal to 1.

However, as we said above, we know this is not guaranteed to hold for floating point arithmetic, because for certain choices of the weights and inputs x we can have:

$$ w_1 \cdot x = w_2 \cdot x $$

in which case (assuming ArgMax breaks ties by choosing the lowest index, as is practically universal):

$$\mathit{ArgMax}([w_1 \cdot x, w_2 \cdot x])=0$$

With that in mind, the adversary simply needs to find weights w1, w2 and positive inputs xa, xb such that:

  1. w1 < w2
  2. w1 · xa = w2 · xa in floating point arithmetic, and
  3. w1 · xb < w2 · xb in floating point arithmetic.

Having found such a combination, it’s possible that our certifier will certify the output [w1 · xb, w2 · xb] produced by input xb as robust at ε = ||xb-xa||, even though in reality xa is a counter-example. (Here we write ||v|| for the Euclidean norm of vector v.)

Using Z3 to Find an Attack

The constraints above can be encoded as a constraint satisfaction problem over floating point arithmetic, of the kind SMT solvers like Z3 can solve, to find values for the weights and inputs that can cause our certifier to produce misleading answers.

I wrote such an encoding as a Z3 Python program to find suitable values for the neural network weights and inputs, and then to confirm that they lead to non-robust results when the neural network is implemented in Keras/TensorFlow, and finally to encode the neural network and the counter-example output for consumption by our verified certifier.
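
A minimal sketch of the heart of that encoding, using Z3’s floating-point theory via its Python API (this covers just the three conditions above; the full script also emits the network and counter-example for the certifier):

```python
from z3 import FP, FPVal, Float32, RNE, Solver, fpEQ, fpGT, fpLT, fpMul, sat

f32 = Float32()
w1, w2 = FP('w1', f32), FP('w2', f32)
xa, xb = FP('xa', f32), FP('xb', f32)
rm = RNE()  # round-to-nearest-even, the IEEE-754 default

zero, one = FPVal(0.0, f32), FPVal(1.0, f32)
s = Solver()
s.add(fpLT(zero, w1), fpLT(w1, w2), fpLT(w2, one))  # condition 1: 0 < w1 < w2 < 1
s.add(fpGT(xa, zero), fpGT(xb, zero))               # positive inputs
s.add(fpEQ(fpMul(rm, w1, xa), fpMul(rm, w2, xa)))   # condition 2: tie at xa
s.add(fpLT(fpMul(rm, w1, xb), fpMul(rm, w2, xb)))   # condition 3: no tie at xb

if s.check() == sat:
    m = s.model()
    print(m[w1], m[w2], m[xa], m[xb])
```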

Doing so finds weights and inputs:

  • w1 = 0.9890594482421875,
  • w2 = 0.989059507846832275390625,
  • xa = 0.157976210117340087890625, and
  • xb = 0.19721232354640960693359375,

such that (in 32-bit floating point arithmetic):

  • ||xb - xa|| = 0.03923611342906951904296875 and
  • w1 · xb = 0.195054709911346435546875 and
  • w2 · xb = 0.19505472481250762939453125,

and for which our certifier happily (and misleadingly) certifies the output vector [w1 · xb, w2 · xb] as robust at ε well above ||xb - xa||. For instance, our certifier certifies this output for ε = 0.2471875250339508056640625, which is more than 6 times ||xb - xa||. This certification is misleading because we know that in practice [w1 · xb, w2 · xb] is not robust: xa is well within ε of xb, yet yields a different answer since w1 · xa = w2 · xa, hence ArgMax([w1 · xa, w2 · xa]) = 0.
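
You can check these values directly in NumPy’s float32 arithmetic (a sanity check of the three conditions, independent of Z3):

```python
import numpy as np

w1 = np.float32(0.9890594482421875)
w2 = np.float32(0.989059507846832275390625)
xa = np.float32(0.157976210117340087890625)
xb = np.float32(0.19721232354640960693359375)

print(w1 < w2)             # True: the weights really are distinct
print(w1 * xa == w2 * xa)  # True: a tie at xa, so ArgMax flips to 0
print(w1 * xb < w2 * xb)   # True: no tie at xb, so ArgMax stays 1
print(abs(xb - xa))        # 0.039236113..., far below the certified ε
```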

Long story short: SMT solvers can be used to construct toy neural networks, inputs and perturbation bounds for which our certifier gives misleading answers. This happens because of the discrepancy between floating point and real arithmetic.

Attacking Robust Image Classifiers

Showing the existence of toy neural networks, specially crafted by an adversary, that cause our certifier to give misleading results is one thing. But what about for more realistic neural networks?

Our certifier is designed to be deployed with robust-trained neural networks. Indeed, we evaluated it against robust-trained classifiers for standard problems like MNIST and CIFAR-10 to show its effectiveness.

Can we exploit the gap between idealised real arithmetic and actual neural network implementation for such models?

Again, the answer is yes—at least for specific scenarios.

The Power of Ties

We saw earlier how floating point arithmetic could produce answers that we knew were wrong, in the form of ties: output vectors containing two equal elements that also happened to be the two maximal elements of the vector.

Specifically, in the earlier example we found an input xa that (when run through the neural network evaluated using floating point arithmetic) produced an output vector whose elements were equal, i.e., were tied. The existence of xa meant that the nearby vector xb was mistakenly classified as robust, precisely because the way the model treated xa in floating point (producing a tie, and thereby altering the ArgMax) was different to how it would have treated xa in idealised real arithmetic (in which no tie would have been produced).

Therefore, to attack this same problem for much larger models like MNIST and CIFAR-10, we might seek places where we expect the model to give inaccurate outputs due to floating-point phenomena like ties.

Indeed, the counter-example above consisted of two points xa and xb belonging to different classes, only because xa produced a tie that caused the ArgMax to change.

In this sense, points xa that produce ties can be seen as footholds from which we might hope to construct certification counter-examples, by searching for nearby points xb for which ArgMax(yb) ≠ ArgMax(ya) but where xb is certified robust at ε ≥ ||xa-xb||. Call this the counter-example condition. This condition is depicted graphically below.

A certified robustness counter-example

For image classifiers, we want the xa and xb to be, well, images. Ideally drawn from (or close to) the same data distribution on which these models were trained and evaluated.

From Foothold to Attack

This leads to the following search procedure:

  1. Given a starting image xnat, use an existing adversarial example search algorithm to find a nearby point x~tie that crosses the model’s decision boundary (ArgMax(y~tie) ≠ ArgMax(ynat)) but remains very close to that boundary, and is thereby likely to produce a near-tie.

Using an adversarial search algorithm to find a near-tie point

  2. Search between xnat and x~tie to find an initial counter-example: a pair of points xa and xb that meet the counter-example condition above (a sketch of this bisection search appears below).

Finding an initial counter-example via bisection search

  3. Try to extend the counter-example by searching outwards from each of xa and xb, to identify a final counter-example pair that meets the counter-example condition above and maximises ε, thereby producing a final counter-example certified at some ε ≥ ||xb-xa||.

Counter-example extension by search

Final counter-example

We implemented this procedure as a Python program. For step 1 we used the DeepFool search algorithm, implemented in the Adversarial Example Toolbox.
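
For step 2, a simple bisection along the segment between xnat and x~tie suffices. Here is a minimal sketch of the idea (not our actual implementation; `model` is assumed to be a Keras-style callable mapping a batch of inputs to a batch of output vectors):

```python
import numpy as np

def bisect_for_counterexample(model, x_nat, x_tie, steps=60):
    # Classify a single input by the index of its maximal output.
    def label(x):
        return int(np.argmax(np.asarray(model(x[None]))[0]))

    nat_label = label(x_nat)
    lo, hi = x_nat, x_tie  # invariant: lo is labelled like x_nat, hi is not
    for _ in range(steps):
        mid = ((lo + hi) / 2).astype(x_nat.dtype)
        if label(mid) == nat_label:
            lo = mid
        else:
            hi = mid
    return lo, hi  # candidate pair (xb, xa): nearby, labelled differently
```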

When run on the image classification models over which we evaluated our verified certifier, it finds instances in which our certifier gives misleading answers.

One example for each of the models we tried is depicted below, noting that I have enlarged each image for ease of viewing. You can see both the original test point xnat and the final counter-example point xb, along with the perturbation bound ε for which our certifier said xb was robust. In each case we found a nearby xa (not shown) such that ||xb-xa|| < ε.

Counter-examples for which our certifier gives misleading results

In general, for models evaluated using float32 arithmetic (TensorFlow’s default), this search procedure is able to produce counter-examples certified for ε up to c × 10⁻⁶, where 0 < c < 4.

Decreasing the precision at which the model is evaluated allows one to produce counter-examples certified to larger ε. For example, when the same MNIST model is evaluated in bfloat16, we find counter-examples certified robust at ε ≥ 0.139.

Counter-examples produced at different levels of model precision

Practical Significance

These counter-examples show how our certifier can be misleading in principle. But do they translate into practice?

A practical counter-example would involve a perturbation bound ε close to that used in practice for verified robustness. For MNIST, Fashion MNIST and CIFAR-10 models that would mean an ε of at least 0.1 to 0.3.

For unmodified models evaluated at float32, our search procedure produces counter-examples certified for ε up to around 4 × 10⁻⁶—many orders of magnitude below practical perturbation bounds. Decreasing the evaluation precision helps (as we saw above with bfloat16), but ε ≈ 0.1 is still borderline.

So for natural (unmodified) models at standard precision, these counter-examples hint at the potential for our certifier to be unreliable, but are not yet practical demonstrations of it.

Can we do better?

Amplifying Floating-Point Errors

The counter-examples above exploit naturally occurring floating-point rounding errors, which tend to be tiny at float32 precision. But what if an adversary could amplify those errors?

As mentioned earlier in this series, our verified certifier handles neural networks with no bias terms, and the models we have been attacking were therefore trained without biases. But there is nothing fundamental about this restriction—a certifier could support bias terms (indeed, our new one does). So: what happens if we add them?

The Construction

Consider a neural network with L dense layers, with weight matrices W1, …, WL and no bias terms. Pick a large constant B (say, 10⁶) and modify the network as follows:

  • At the second-to-last hidden layer (layer L-1), add a flat bias vector bL-1 with every entry equal to B.
  • At the output layer (layer L), add a compensating bias bL = -(bL-1 · WL).
  • Leave all other layers unchanged (with zero biases).
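
In code, the construction might look something like this (a sketch over raw NumPy weight matrices, with a hypothetical helper name; not our actual attack code):

```python
import numpy as np

def add_compensating_biases(weights, B=1e6):
    """Given the weight matrices W1..WL of a bias-free network, build
    per-layer biases: a flat bias of B at layer L-1 and a compensating
    bias of -(b_{L-1} @ WL) at the output layer L."""
    biases = [np.zeros(W.shape[1], dtype=W.dtype) for W in weights]
    biases[-2] = np.full(weights[-2].shape[1], B, dtype=weights[-2].dtype)
    biases[-1] = -(biases[-2] @ weights[-1])  # cancels B's output contribution
    return biases
```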

What does this accomplish?

In real arithmetic, the effect of bL-1 is approximately cancelled by bL at the output layer. I say approximately because the ReLU activation at layer L-1 means the cancellation is not exact: the large bias bL-1 changes which neurons are active, so the model computes a slightly different function than the original.

In floating-point arithmetic, things are very different. Adding B = 10⁶ to the second-to-last hidden layer inflates the intermediate values at that layer enormously. When those inflated values are multiplied by WL at the output layer, the matrix-vector multiplication accumulates rounding errors proportional to the inflated values. When the compensating bias bL then subtracts the large contribution, the large values cancel—but the rounding errors, being much smaller than the values being subtracted, survive.

The result: the floating-point outputs of the modified network can differ substantially from their real-arithmetic counterparts, even at float32 precision.
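
The core effect is easy to see in isolation. This toy illustration (plain NumPy, nothing to do with any particular network) pushes a small value up by 10⁶ and back down again:

```python
import numpy as np

small = np.float32(0.123456)
big   = np.float32(1e6)

# Near 10^6, adjacent float32 values are 0.0625 apart, so adding `big`
# rounds `small`'s contribution to the nearest multiple of 0.0625.
roundtrip = (small + big) - big
print(small, roundtrip)  # 0.123456 vs 0.125: the rounding error survives
```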

Crucially, the Lipschitz constants of the network are unchanged. Bias terms don’t affect the gradient, so they don’t affect the Lipschitz bounds. This means a real-arithmetic certifier, which uses Lipschitz constants to certify robustness, will certify the modified network exactly as it would have certified the original.

The Model Looks Legitimate

We applied this construction to the same robust-trained MNIST model we attacked earlier. Despite the massive bias values, the modified model retains 98.23% test accuracy (down from 98.40% for the original model—a drop of just 0.17 percentage points). A user inspecting the model’s test accuracy would have no reason to suspect anything is wrong.

The Results

Running the same counter-example search procedure against this adversarial model at float32, we generated 39 counter-examples with ε values reaching up to 0.47. Of these, 35 have ε ≥ 0.1 and 6 have ε ≥ 0.3—the evaluation radius used for MNIST throughout our paper and this blog series.

These are no longer tiny discrepancies. They are counter-examples at semantically meaningful perturbation radii. We observed the same trend for analogous adversarial models constructed for Fashion MNIST and CIFAR-10.

Three examples from MNIST are shown below. In each case, the original MNIST test image xnat is on top, and the counter-example xb is on the bottom. Our verified certifier would certify xb as robust at the indicated ε, even though a nearby point xa (not shown) exists within ε of xb that the model classifies differently.

Counter-examples for the adversarial biased MNIST model

Conclusion

We showed how our verified certifier can be made to produce misleading answers by exploiting the gap between how neural networks behave in reality (using floating-point arithmetic) vs in the ideal world of real arithmetic that our certifier’s formal specification assumes.

For unmodified models at standard float32 precision, the resulting counter-examples have tiny ε values and are not of practical concern. But when an adversary is allowed to choose the model’s weights—for instance by adding large compensating biases that amplify floating-point rounding errors—the situation changes dramatically. The resulting models look legitimate (retaining high test accuracy) yet admit counter-examples at perturbation radii that are semantically meaningful.

In the next post in this series, we will see how to fix this problem by building a certifier that accounts for floating-point effects.

If you’re interested in this line of work, please get in touch.

In the meantime, thanks for reading.