Breaking a Verified Certifier

Verified Certified Robustness for Neural Networks - Part IV

Note: this post was heavily updated on March 22, 2026 with the inclusion of the counter-examples against the adversarially-biased MNIST model.

This post is the fourth in a series on the topic of Verified Certified Robustness for Neural Networks. These posts accompany and explain our recent paper A Formally Verified Robustness Certifier for Neural Networks, which appeared at CAV 2025, the more recent Lipschitz-Based Robustness Certification Under Floating-Point Execution, and the broader research agenda that those papers initiate.

In the previous posts I explained the formally verified robustness certifier that we built, which implements a provably sound method for efficiently checking the robustness of neural network outputs.

In this post I am going to dig into the limits of what “provably sound” means in practice, for our verified certifier. I’ll show how—even though our certifier is proved sound—there exist situations in which it can be made to give misleading answers when applied to robust-trained models for problems like MNIST and CIFAR-10.

This post assumes you’ve read the third post, “Formally Verified Certified Robustness”, on how we formally specified robustness.

So, how can we break a verified certifier? The short answer is by exploiting the gap between what our certifier’s formal specification says about how neural networks behave, and how they actually behave in practice.

The Real vs Ideal

This gap is something that I explicitly mentioned as a potential weakness of our certifier towards the end of the third blog post in this series. To recap: to prove our certifier sound, we had to formally specify what it means for a neural network to be robust. That involved writing down in logic how (simple, dense feed-forward) neural networks (with no bias terms and only ReLU activations) behave.

Our certifier’s formal specification formalises neural networks using real arithmetic. This choice is incredibly common in work that uses mathematics to reason about neural networks. However, in reality neural networks are not implemented using real arithmetic. Instead, they use floating point arithmetic.

For example, we formalised how a neural network n operates on an input vector v by a function called NN(n,v). NN’s definition says that v is a vector of real numbers, and that NN(n,v) returns a vector of real numbers. Using this definition, we can reason about how a neural network would behave when given an input vector containing an irrational number like π.
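
To give a flavour (this is a sketch of the shape of the definition, not a quote from the formal development): for a network n with weight matrices W1, …, WL, no bias terms, and ReLU applied componentwise at each hidden layer, NN is essentially the composition

$$\mathit{NN}(n, v) = W_L \cdot \mathrm{ReLU}(W_{L-1} \cdot (\cdots \, \mathrm{ReLU}(W_1 \cdot v) \cdots))$$

with every product and activation interpreted over the real numbers.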

But in reality no actual, deployed neural network can operate over inputs like π. Instead, common machine learning frameworks like TensorFlow and PyTorch produce neural networks that operate over floating-point numbers, which are designed (only) to approximate real numbers. For example, the (single precision) floating point approximation of π is 3.14159274101257324 (which is a bit larger than π’s actual value, which is 3.1415926535897932384626433…).
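
You can observe this directly in any framework’s single-precision arithmetic. For example, with NumPy (a quick illustration, not part of our tooling):

```python
import numpy as np

# Round π to the nearest representable single-precision (float32) value.
pi32 = np.float32(np.pi)

print(f"{pi32:.17f}")  # 3.14159274101257324
print(pi32 > np.pi)    # True: the float32 value slightly overshoots π
```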

Real arithmetic is much simpler to reason about than floating point arithmetic, and this is why everyone uses the former to formalise neural networks. In this sense, real arithmetic formalises how we wish neural networks behaved in an ideal world. How they actually behave is a different matter that we’ll dig into below.

(As a side note, one might contend that using the term “real arithmetic” to refer to how we imagine neural network arithmetic to behave in an ideal world—rather than how it actually behaves in the, ahem, real world—is confusing. I couldn’t agree more.)

Broken Toys

So the way we formalised neural networks doesn’t exactly match reality. This is important because any real situation that differs from the ideal world captured by our formal specification is not guaranteed to be covered by our certifier’s soundness theorem. Put more simply: if we can find a situation where reality differs from the ideal world, then it may be possible to construct scenarios in which our certifier says that something is robust when in reality it is not.

A Real vs Floating Point Discrepancy

Here is but one simple example. Let’s imagine we have two numbers: w1 and w2, such that:

$$0 < w_1 < w_2 < 1$$

Then for any positive number x, we expect that:

$$ w_1 \cdot x < w_2 \cdot x $$

This is certainly true in real arithmetic. However, it is not guaranteed to be true in floating point arithmetic. In particular, if w1 and w2 are very close together, then for certain values of x it is possible that:

$$ w_1 \cdot x = w_2 \cdot x $$
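
This is easy to witness concretely. The sketch below (my illustration, using NumPy) takes two adjacent float32 weights, one unit in the last place (ulp) apart, and scans for a positive x whose products collide:

```python
import numpy as np

# Two adjacent float32 weights: w2 is the smallest float32 strictly
# greater than w1, so 0 < w1 < w2 < 1 holds exactly.
w1 = np.float32(0.75)
w2 = np.nextafter(w1, np.float32(1.0))

# Scan random positive inputs for one where the float32 products
# round to the very same value, despite w1 < w2.
rng = np.random.default_rng(0)
for _ in range(1000):
    x = np.float32(rng.random())
    if x > 0 and w1 * x == w2 * x:
        print(f"w1 < w2 but w1*x == w2*x == {w1 * x} for x = {x}")
        break
```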

Weaponising this Discrepancy

Can this discrepancy, between how real and floating point arithmetic each behave, undermine the guarantees provided by our verified certifier’s soundness theorem? In short: yes, at least for specially crafted toy neural networks.

One such network is depicted below. It takes 1-element vectors x as its input and produces 2-element vectors y = [y1,y2] as its output. It contains just two neurons with weights w1 and w2 respectively, has no bias terms, and uses no activation functions.

A toy neural network with two neurons

Thus we have for each i ∈ {1,2}:

$$y_i = w_i \cdot x$$
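
In Keras/TensorFlow (which we use later to confirm the attack), this toy network is a single dense layer with no bias and no activation. A minimal sketch, with placeholder weights:

```python
import numpy as np
import tensorflow as tf

w1, w2 = 0.25, 0.75  # placeholders: the adversary's actual values come later

# One dense layer mapping a 1-element input x to [w1*x, w2*x]:
# no bias terms, no activation function.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(2, use_bias=False, activation=None),
])
model.layers[0].set_weights([np.array([[w1, w2]], dtype=np.float32)])

x = np.array([[0.5]], dtype=np.float32)
print(model(x).numpy())  # [[w1*x, w2*x]], computed in float32
```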

Let’s suppose I am an adversary who can choose the values of the weights for this neural network, and that I also have the ability to influence what inputs x the neural network is given. My goal is to find weights and inputs that cause our verified certifier to produce misleading answers: specifically, to say that an output is robust at some perturbation bound ε even though it is not.

Concretely, that means finding weights w1, w2 and inputs xa, xb such that:

$$\mathit{ArgMax}([w_1 \cdot x_a, w_2 \cdot x_a])\not=\mathit{ArgMax}([w_1 \cdot x_b, w_2 \cdot x_b])$$

when computed using floating point arithmetic; however, when computed using real arithmetic the adversary wants the opposite to be true:

$$\mathit{ArgMax}([w_1 \cdot x_a, w_2 \cdot x_a])=\mathit{ArgMax}([w_1 \cdot x_b, w_2 \cdot x_b])$$

Without loss of generality, let’s assume w1 < w2. Then, in real arithmetic, for any positive input x, we have:

$$w_1 \cdot x < w_2 \cdot x$$

and so

$$\mathit{ArgMax}([w_1 \cdot x, w_2 \cdot x])=1$$

always (assuming vectors are indexed from 0, so that 1 is the index of the second element). Thus, in real arithmetic we will have for any positive xa, xb:

$$\mathit{ArgMax}([w_1 \cdot x_a, w_2 \cdot x_a])=\mathit{ArgMax}([w_1 \cdot x_b, w_2 \cdot x_b])$$

since both sides will be equal to 1.

However, as we said above, we know this is not guaranteed to hold for floating point arithmetic, because for certain choices of the weights and inputs x we can have:

$$ w_1 \cdot x = w_2 \cdot x $$

in which case (assuming ArgMax breaks ties by choosing the lowest index, as is practically universal):

$$\mathit{ArgMax}([w_1 \cdot x, w_2 \cdot x])=0$$

With that in mind, the adversary simply needs to find weights w1, w2 and positive inputs xa, xb such that:

  1. w1 < w2
  2. w1 · xa = w2 · xa in floating point arithmetic, and
  3. w1 · xb < w2 · xb in floating point arithmetic.

Having found such a combination, it’s possible that our certifier will certify the output [w1 · xb, w2 · xb] produced by input xb as robust at ε = ||xb-xa||, even though in reality xa is a counter-example. (Here we write ||v|| for the Euclidean norm of vector v.)

Using Z3 to Find an Attack

The constraints above can be encoded as a constraint satisfaction problem over floating point arithmetic, of the kind SMT solvers like Z3 can solve, to find values for the weights and inputs that can cause our certifier to produce misleading answers.

I wrote such an encoding as a Z3 Python program to find suitable values for the neural network weights and inputs, and then to confirm that they lead to non-robust results when the neural network is implemented in Keras/TensorFlow, and finally to encode the neural network and the counter-example output for consumption by our verified certifier.
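
A minimal sketch of the heart of that encoding, using Z3’s floating-point theory via its Python API (this covers just the three conditions above; the full script also emits the network and counter-example for the certifier):

```python
from z3 import FP, FPVal, Float32, RNE, Solver, fpEQ, fpGT, fpLT, fpMul, sat

f32 = Float32()
w1, w2 = FP('w1', f32), FP('w2', f32)
xa, xb = FP('xa', f32), FP('xb', f32)
rm = RNE()  # round-to-nearest-even, the IEEE-754 default

zero, one = FPVal(0.0, f32), FPVal(1.0, f32)
s = Solver()
s.add(fpLT(zero, w1), fpLT(w1, w2), fpLT(w2, one))  # condition 1: 0 < w1 < w2 < 1
s.add(fpGT(xa, zero), fpGT(xb, zero))               # positive inputs
s.add(fpEQ(fpMul(rm, w1, xa), fpMul(rm, w2, xa)))   # condition 2: tie at xa
s.add(fpLT(fpMul(rm, w1, xb), fpMul(rm, w2, xb)))   # condition 3: no tie at xb

if s.check() == sat:
    m = s.model()
    print(m[w1], m[w2], m[xa], m[xb])
```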

Doing so finds weights and inputs:

  • w1 = 0.9890594482421875,
  • w2 = 0.989059507846832275390625,
  • xa = 0.157976210117340087890625, and
  • xb = 0.19721232354640960693359375,

such that (in 32-bit floating point arithmetic):

  • ||xb - xa|| = 0.03923611342906951904296875 and
  • w1 · xb = 0.195054709911346435546875 and
  • w2 · xb = 0.19505472481250762939453125,

and for which our certifier happily (and misleadingly) certifies the output vector [w1 · xb, w2 · xb] as robust at ε well above ||xb - xa||. For instance, our certifier certifies this output for ε = 0.2471875250339508056640625, which is more than 6 times ||xb - xa||. This certification is misleading because we know that in practice [w1 · xb, w2 · xb] is not robust: xa is well within ε of xb, yet yields a different answer since w1 · xa = w2 · xa, hence ArgMax([w1 · xa, w2 · xa]) = 0.
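
You can check these values directly in NumPy’s float32 arithmetic (a sanity check of the three conditions, independent of Z3):

```python
import numpy as np

w1 = np.float32(0.9890594482421875)
w2 = np.float32(0.989059507846832275390625)
xa = np.float32(0.157976210117340087890625)
xb = np.float32(0.19721232354640960693359375)

print(w1 < w2)             # True: the weights really are distinct
print(w1 * xa == w2 * xa)  # True: a tie at xa, so ArgMax flips to 0
print(w1 * xb < w2 * xb)   # True: no tie at xb, so ArgMax stays 1
print(abs(xb - xa))        # 0.039236113..., far below the certified ε
```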

Long story short: SMT solvers can be used to construct toy neural networks, inputs and perturbation bounds for which our certifier gives misleading answers. This happens because of the discrepancy between floating point and real arithmetic.

Attacking Robust Image Classifiers

Showing the existence of toy neural networks, specially crafted by an adversary, that cause our certifier to give misleading results is one thing. But what about for more realistic neural networks?

Our certifier is designed to be deployed with robust-trained neural networks. Indeed, we evaluated it against robust-trained classifiers for standard problems like MNIST and CIFAR-10 to show its effectiveness.

Can we exploit the gap between idealised real arithmetic and actual neural network implementation for such models?

Again, the answer is yes—at least for specific scenarios.

The Power of Ties

We saw earlier how floating point arithmetic could produce answers that we knew were wrong, in the form of ties: output vectors containing two equal elements that also happened to be the two maximal elements of the vector.

Specifically, in the earlier example we found an input xa that (when run through the neural network evaluated using floating point arithmetic) produced an output vector whose elements were equal, i.e., were tied. The existence of xa meant that the nearby vector xb was mistakenly classified as robust, precisely because the way the model treated xa in floating point (producing a tie, and thereby altering the ArgMax) was different to how it would have treated xa in idealised real arithmetic (in which no tie would have been produced).

Therefore, to attack this same problem for much larger models like MNIST and CIFAR-10, we might seek places where we expect the model to give inaccurate outputs due to floating-point phenomena like ties.

Indeed, the counter-example above consisted of two points xa and xb belonging to different classes, only because xa produced a tie that caused the ArgMax to change.

In this sense, points xa that produce ties can be seen as footholds from which we might hope to construct certification counter-examples, by searching for nearby points xb for which ArgMax(yb) ≠ ArgMax(ya) but where xb is certified robust at ε ≥ ||xa-xb||. Call this the counter-example condition. This condition is depicted graphically below.

A certified robustness counter-example

For image classifiers, we want the xa and xb to be, well, images. Ideally drawn from (or close to) the same data distribution on which these models were trained and evaluated.

From Foothold to Attack

This leads to the following search procedure:

  1. Given a starting image xnat, use an existing adversarial example search algorithm to find a nearby point x~tie that crosses the model’s decision boundary (ArgMax(y~tie) ≠ ArgMax(ynat)) but remains very close to that boundary, and is thereby likely to produce a near-tie.

Using an adversarial search algorithm to find a near-tie point

  2. Search between xnat and x~tie to find an initial counter-example: a pair of points xa and xb that meet the counter-example condition above (a sketch of this bisection search appears below).

Finding an initial counter-example via bisection search

  3. Try to extend the counter-example by searching outwards from each of xa and xb, to identify a final counter-example pair that meets the counter-example condition above and maximises ε, thereby producing a final counter-example certified at some ε ≥ ||xb-xa||.

Counter-example extension by search

Final counter-example

We implemented this procedure as a Python program. For step 1 we used the DeepFool search algorithm, implemented in the Adversarial Example Toolbox.
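
For step 2, a simple bisection along the segment between xnat and x~tie suffices. Here is a minimal sketch of the idea (not our actual implementation; `model` is assumed to be a Keras-style callable mapping a batch of inputs to a batch of output vectors):

```python
import numpy as np

def bisect_for_counterexample(model, x_nat, x_tie, steps=60):
    # Classify a single input by the index of its maximal output.
    def label(x):
        return int(np.argmax(np.asarray(model(x[None]))[0]))

    nat_label = label(x_nat)
    lo, hi = x_nat, x_tie  # invariant: lo is labelled like x_nat, hi is not
    for _ in range(steps):
        mid = ((lo + hi) / 2).astype(x_nat.dtype)
        if label(mid) == nat_label:
            lo = mid
        else:
            hi = mid
    return lo, hi  # candidate pair (xb, xa): nearby, labelled differently
```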

When run on the image classification models over which we evaluated our verified certifier, it finds instances in which our certifier gives misleading answers.

One example for each of the models we tried is depicted below, noting that I have enlarged each image for ease of viewing. You can see both the original test point xnat and the final counter-example point xb, along with the perturbation bound ε for which our certifier said xb was robust. In each case we found a nearby xa (not shown) such that ||xb-xa|| < ε.

Counter-examples for which our certifier gives misleading results

In general, for models evaluated using float32 arithmetic (TensorFlow’s default), this search procedure is able to produce counter-examples certified for ε up to c × 10⁻⁶, where 0 < c < 4.

Decreasing the precision at which the model is evaluated allows one to produce counter-examples certified to larger ε. For example, when the same MNIST model is evaluated in bfloat16, we find counter-examples certified robust at ε ≥ 0.139.

Counter-examples produced at different levels of model precision

Practical Significance

These counter-examples show how our certifier can be misleading in principle. But do they translate into practice?

A practical counter-example would involve a perturbation bound ε close to that used in practice for verified robustness. For MNIST, Fashion MNIST and CIFAR-10 models that would mean an ε of at least 0.1 to 0.3.

For unmodified models evaluated at float32, our search procedure produces counter-examples certified for ε up to around 4 × 10⁻⁶—many orders of magnitude below practical perturbation bounds. Decreasing the evaluation precision helps (as we saw above with bfloat16), but ε ≈ 0.1 is still borderline.

So for natural (unmodified) models at standard precision, these counter-examples hint at the potential for our certifier to be unreliable, but are not yet practical demonstrations of it.

Can we do better?

Amplifying Floating-Point Errors

The counter-examples above exploit naturally occurring floating-point rounding errors, which tend to be tiny at float32 precision. But what if an adversary could amplify those errors?

As mentioned earlier in this series, our verified certifier handles neural networks with no bias terms, and the models we have been attacking were therefore trained without biases. But there is nothing fundamental about this restriction—a certifier could support bias terms (indeed, our new one does). So: what happens if we add them?

The Construction

Consider a neural network with L dense layers, with weight matrices W1, …, WL and no bias terms. Pick a large constant B (say, 10⁶) and modify the network as follows:

  • At the second-to-last hidden layer (layer L-1), add a flat bias vector bL-1 with every entry equal to B.
  • At the output layer (layer L), add a compensating bias bL = -(bL-1 · WL).
  • Leave all other layers unchanged (with zero biases).
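
In code, the construction might look something like this (a sketch over raw NumPy weight matrices, with a hypothetical helper name; not our actual attack code):

```python
import numpy as np

def add_compensating_biases(weights, B=1e6):
    """Given the weight matrices W1..WL of a bias-free network, build
    per-layer biases: a flat bias of B at layer L-1 and a compensating
    bias of -(b_{L-1} @ WL) at the output layer L."""
    biases = [np.zeros(W.shape[1], dtype=W.dtype) for W in weights]
    biases[-2] = np.full(weights[-2].shape[1], B, dtype=weights[-2].dtype)
    biases[-1] = -(biases[-2] @ weights[-1])  # cancels B's output contribution
    return biases
```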

What does this accomplish?

In real arithmetic, the effect of bL-1 is approximately cancelled by bL at the output layer. I say approximately because the ReLU activation at layer L-1 means the cancellation is not exact: the large bias bL-1 changes which neurons are active, so the model computes a slightly different function than the original.

In floating-point arithmetic, things are very different. Adding B = 10⁶ to the second-to-last hidden layer inflates the intermediate values at that layer enormously. When those inflated values are multiplied by WL at the output layer, the matrix-vector multiplication accumulates rounding errors proportional to the inflated values. When the compensating bias bL then subtracts the large contribution, the large values cancel—but the rounding errors, being much smaller than the values being subtracted, survive.

The result: the floating-point outputs of the modified network can differ substantially from their real-arithmetic counterparts, even at float32 precision.
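
The core effect is easy to see in isolation. This toy illustration (plain NumPy, nothing to do with any particular network) pushes a small value up by 10⁶ and back down again:

```python
import numpy as np

small = np.float32(0.123456)
big   = np.float32(1e6)

# Near 10^6, adjacent float32 values are 0.0625 apart, so adding `big`
# rounds `small`'s contribution to the nearest multiple of 0.0625.
roundtrip = (small + big) - big
print(small, roundtrip)  # 0.123456 vs 0.125: the rounding error survives
```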

Crucially, the Lipschitz constants of the network are unchanged. Bias terms don’t affect the gradient, so they don’t affect the Lipschitz bounds. This means a real-arithmetic certifier, which uses Lipschitz constants to certify robustness, will certify the modified network exactly as it would have certified the original.

The Model Looks Legitimate

We applied this construction to the same robust-trained MNIST model we attacked earlier. Despite the massive bias values, the modified model retains 98.23% test accuracy (down from 98.40% for the original model—a drop of just 0.17 percentage points). A user inspecting the model’s test accuracy would have no reason to suspect anything is wrong.

The Results

Running the same counter-example search procedure against this adversarial model at float32, we generated 39 counter-examples with ε values reaching up to 0.47. Of these, 35 have ε ≥ 0.1 and 6 have ε ≥ 0.3—the evaluation radius used for MNIST throughout our paper and this blog series.

These are no longer tiny discrepancies. They are counter-examples at semantically meaningful perturbation radii. We observed the same trend for analogous adversarial models constructed for Fashion MNIST and CIFAR-10.

Three examples from MNIST are shown below. In each case, the original MNIST test image xnat is on top, and the counter-example xb is on the bottom. Our verified certifier would certify xb as robust at the indicated ε, even though a nearby point xa (not shown) exists within ε of xb that the model classifies differently.

Counter-examples for the adversarial biased MNIST model

Conclusion

We showed how our verified certifier can be made to produce misleading answers by exploiting the gap between how neural networks behave in reality (using floating-point arithmetic) vs in the ideal world of real arithmetic that our certifier’s formal specification assumes.

For unmodified models at standard float32 precision, the resulting counter-examples have tiny ε values and are not of practical concern. But when an adversary is allowed to choose the model’s weights—for instance by adding large compensating biases that amplify floating-point rounding errors—the situation changes dramatically. The resulting models look legitimate (retaining high test accuracy) yet admit counter-examples at perturbation radii that are semantically meaningful.

In the next post in this series, we will see how to fix this problem by building a certifier that accounts for floating-point effects.

If you’re interested in this line of work, please get in touch.

In the meantime, thanks for reading.