When "Verified Robust Accuracy" Isn't, Actually, Verified

Verified Certified Robustness for Neural Networks - Part II

This post is the second in a series on the topic of Verified Certified Robustness for Neural Networks. These posts accompany and explain our recent paper A Formally Verified Robustness Certifier for Neural Networks that will appear at CAV 2025, and the broader research agenda that that paper initiates.

The series so far comprises the following posts:

In the previous post I explained the scalability challenges faced by much prior work on verified robustness for neural networks, and how approaches instead based on certified robustness might provide a way to circumvent those challenges.

Our work is heavily inspired by the ideas of Leino et al. on training robust neural networks.

As part of their work, they develop an efficient way to certify the robustness of individual model outputs.

As I said in my previous post, we discovered soundness issues in both the design and implementation of their approach. I’ll explain more below what those issues mean but, in short, they make their certifier untrustworthy. An adversary can leverage them to make someone believe that a model’s answers are robust when they are not.

Our work, among other things, overcomes those problems by designing, implementing and formally verifying a robustness certifier heavily inspired by Leino et al.’s approach—though we necessarily make different design and implementation decisions to avoid the problems that we identified in their certifier.

This post is all about those problems. Its purpose is to highlight how tricky this sort of code can be to design and implement correctly, and why doing so is crucial for soundness. In later posts I intend to explain how formal verification helps us to get this code right the first time.

If you’re curious about the title of this post, I explain that at the end. In short: I wish machine learning folks would avoid using the adjective “verified” when no formal verification has taken place.

As I’ve said before, I am a massive fan of Leino et al.’s work, and remain so. They are the giants on whose shoulders our work stands.

Efficient Robustness Certification via Lipschitz Bounds

Before going into the problems we discovered in Leino et al.’s certifier (and that are briefly summarised in our paper), I need to describe a little about how their certifier was designed to enable efficient robustness certification of model outputs.

We can think of a neural network as a function f(x) that maps inputs x to outputs y.

What would it mean for function f to be robust for the input x? Roughly, it means that if we change x a little bit, f’s answer y = f(x) doesn’t change too much either.

How much a function’s output changes in response to changes in its input is captured by its Lipschitz constant.

We don’t actually need to know what a Lipschitz constant is, precisely. It suffices to know that the smaller the Lipschitz constant is for a neural network, the more robust the neural network will be.
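For the curious, the precise definition is simple: a constant $K \ge 0$ is a Lipschitz constant for $f$, with respect to some chosen norm $||\cdot||$, if

$$ ||f(\mathbf{x}) - f(\mathbf{x}')|| \le K\,||\mathbf{x} - \mathbf{x}'|| \quad \text{for all inputs } \mathbf{x}, \mathbf{x}'. $$

A small $K$ guarantees that small input perturbations can only ever produce small output changes.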

The second thing to know about Lipschitz constants for the kinds of neural networks we consider in our work is that we can bound them by working out the Lipschitz constants for individual layers of the neural network and then multiplying them together.
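In symbols, writing $\mathrm{Lip}(g)$ for a Lipschitz constant of $g$: if the network is a composition of layers, $f = f_n \circ \dots \circ f_1$, then

$$ \mathrm{Lip}(f) \le \mathrm{Lip}(f_n) \times \dots \times \mathrm{Lip}(f_1) $$

so a sound upper bound for each layer multiplies out to a sound upper bound for the whole network.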

Finally, there exist algorithms for estimating quantities from which we can derive Lipschitz constants of individual neural network layers. Specifically, for Dense (fully-connected) layers using ReLU activations, the power iteration algorithm can be used for this purpose. That is the algorithm that Leino et al.’s implementation uses, in particular.

So, Leino et al. employ power iteration to estimate Lipschitz bounds. Power iteration is a pretty cheap way of doing this. So cheap, in fact, that it can be employed during neural network training to repeatedly estimate Lipschitz constants, in order to incorporate (indirectly) their minimisation into the training objective.
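To make this concrete, here is a minimal NumPy sketch of the textbook power method applied to a dense layer’s weight matrix $W$ (for a ReLU layer $\mathbf{x} \mapsto \mathrm{ReLU}(W\mathbf{x} + \mathbf{b})$, the layer’s Lipschitz constant is bounded by $W$’s largest singular value, which is what the power method estimates). This sketches the general technique only; it is not Leino et al.’s code:

```python
import numpy as np

def power_iteration(W, iters=100, seed=0):
    """Estimate the largest singular value (spectral norm) of W, which
    bounds the Lipschitz constant of a dense ReLU layer with weights W."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)                 # start from a random unit vector
    for _ in range(iters):
        u = W.T @ (W @ v)                  # one step of the power method on W^T W
        v = u / np.linalg.norm(u)          # re-normalise (assumes u is non-zero)
    return np.linalg.norm(W @ v)           # approaches the largest singular value from below
```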

Even better, having estimated Lipschitz bounds, Leino et al. then use them to very cheaply certify individual output points as robust, using just a couple of arithmetic operations and comparisons.
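To give a flavour of just how cheap that certification check can be, here is a simplified sketch of a Lipschitz-based check of my own (deliberately conservative, and not Leino et al.’s exact check): assuming $K$ is a sound upper bound on the network’s Lipschitz constant, an output is robust at perturbation bound $\epsilon$ whenever the gap between its two largest logits exceeds $2\epsilon K$.

```python
import numpy as np

def certified_robust(logits, K, eps):
    """Simplified Lipschitz-based robustness check (illustrative only).
    K is assumed to be a sound upper bound on the network's Lipschitz
    constant with respect to the l2 norm; eps is the perturbation bound."""
    sorted_logits = np.sort(logits)
    margin = sorted_logits[-1] - sorted_logits[-2]   # gap between top two classes
    # Any perturbation of size <= eps moves each logit by at most eps * K,
    # so the gap between any two logits can shrink by at most 2 * eps * K.
    # If the current gap is larger than that, the predicted class cannot change.
    return margin > 2 * eps * K
```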

So, what could possibly go wrong?

Design-Level Soundness Issues

Broadly speaking, there are two kinds of things that could go wrong: problems with Leino et al.’s design and, separately, problems with how they implemented their design. We’ll start with design-level issues.

Issue 1: Power Iteration Does Not Guarantee Soundness

The first problem that we identified pretty early on is that when using power iteration it is very difficult to guarantee that the quantities it computes will be upper bounds. If we use power iteration to estimate Lipschitz constants but we cannot guarantee that the estimates we compute will be upper bounds, then we cannot guarantee—even in principle—that the answers produced by our certifier will be sound.

Power iteration is an iterative approximation algorithm. It runs a loop that on each iteration attempts to get closer to the true answer. The algorithm typically runs until convergence, i.e., when the answers produced by consecutive iterations are (sufficiently) equal. However, while the algorithm tends to converge in practice, it is not guaranteed to do so.

Worse, it approximates its result from below, which means that intermediate results are likely to be under-estimates. If we run the algorithm to convergence, we might choose to conclude that the answer it produces must be close enough to the true quantity—even if the computed result is a bit of an under-estimate, that may not matter much in practice.

However, from the point of view of wanting to prove that power iteration produces a sound result, this doesn’t help us. From the standpoint of soundness, any underestimate is unsound. Therefore we concluded that power iteration was not a suitable foundation for our work on formal verification.

In our work we avoid this issue by using a different iterative approximation algorithm known as Gram iteration, which approximates from above instead.
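For intuition, here is a minimal NumPy sketch of the idea behind Gram iteration (the mathematical idea only: it ignores the floating-point concerns discussed below, and it is not the implementation from our paper). Squaring the Gram matrix $W^{\mathsf{T}}W$ repeatedly and taking a Frobenius norm yields an over-approximation of $W$’s largest singular value that tightens with every iteration:

```python
import numpy as np

def gram_iteration_bound(W, iters=6):
    """Upper bound on the spectral norm of W via Gram iteration (sketch).

    Since sigma(W)^2 = sigma(W^T W), and sigma(A) <= ||A||_F for any matrix A,
    squaring the Gram matrix k times gives
        sigma(W) <= || (W^T W)^(2^k) ||_F ** (1 / 2**(k+1)),
    a bound that approaches sigma(W) from above as k grows. Each iterate is
    rescaled, with the scale tracked in log space, to avoid overflow."""
    A = W.T @ W
    log_scale = 0.0                      # the "true" iterate is exp(log_scale) * A
    for _ in range(iters):
        fro = np.linalg.norm(A)          # Frobenius norm of the current iterate
        A = A / fro
        log_scale += np.log(fro)
        A = A @ A                        # square the (rescaled) Gram matrix
        log_scale *= 2
    return float(np.exp((log_scale + np.log(np.linalg.norm(A))) / 2 ** (iters + 1)))
```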

It remains an open question whether the issues with power iteration are exploitable in practice. I suspect probably not, or at least not in a way distinct from some of the other issues I describe below.

Implementation-Level Soundness Issues

The implementation-level soundness issues we identified are all exploitable, depending on your threat model. Some are easily fixable, while others would require more invasive changes to Leino et al.’s implementation.

Issue 2: A Subtle Normalisation Bug

The first exploitable issue arises from a subtle error in Leino et al.’s power iteration implementation.

We reported this issue in February 2025 but at the time of writing it remains unacknowledged.

Let’s put aside the fact that power iteration is not guaranteed to compute a sound estimate of the Lipschitz constant, and assume that it does so when it is correctly implemented.

Recall that power iteration is an iterative approximation algorithm. One of the things that it iteratively approximates is a vector that we’ll call $\mathbf{v}$. The core of the algorithm involves repeatedly taking the current approximation $\mathbf{v}$ and multiplying it by the neural network layer’s weight matrix $M$, and then normalising the resulting vector by dividing by its length. If we write $\mathbf{v}_i$ for the current value of $\mathbf{v}$, and $\mathbf{v}_{i+1}$ for the new value of $\mathbf{v}$ computed at each iteration, then power iteration computes the following recurrence (where we write $||\mathbf{u}||$ to denote the length of a vector $\mathbf{u}$):

$$ \mathbf{v}_{i+1} = \frac{M\mathbf{v}_i}{||M\mathbf{v}_i||} $$

But what if the divisor $||M\mathbf{v}_i||$ is zero? When implementing power iteration, one has to be careful to avoid division by zero. To do that, Leino et al. add a small positive quantity that we’ll call $e$ to the divisor, and so compute:

$$ \mathbf{v}_{i+1} = \frac{M\mathbf{v}_i}{||M\mathbf{v}_i||+e} $$

Computing the recurrence this way is quite efficient, and it is also differentiable. That matters because this code is run as part of the training algorithm’s loss function, and differentiable loss functions are key to enabling standard machine learning training methods like gradient descent.

However, it is not sound.

When $||M\mathbf{v}_i||$ is very small, adding $e$ has the effect of artificially inflating the divisor. Doing so causes the resulting $\mathbf{v}_{i+1}$ to be too small. This can lead to the algorithm as a whole severely under-estimating the Lipschitz constants.

This can happen when the matrix $M$ has very small entries. Recall that $M$ is the matrix of weights in the neural network layer whose Lipschitz constant we are estimating. This issue can therefore cause their implementation to under-estimate the Lipschitz constants of neural network layers that have very small weights.
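To see the mechanism concretely, here is a hypothetical NumPy illustration (the value of $e$, the iteration count, and the toy weight matrix are all made up for this sketch, and this is not Leino et al.’s code). When the layer’s weights are tiny, the inflated divisor shrinks $\mathbf{v}$ a little on every iteration, and the final Lipschitz estimate collapses towards zero:

```python
import numpy as np

def power_iteration_eps(M, iters=1000, e=1e-9, seed=0):
    """The recurrence from above, v <- Mv / (||Mv|| + e), with the final
    Lipschitz estimate taken as ||Mv|| (illustrative only)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = M @ v
        v = u / (np.linalg.norm(u) + e)   # "+ e" avoids division by zero, but when
                                          # ||Mv|| is tiny it inflates the divisor,
                                          # so v shrinks a little on every iteration
    return np.linalg.norm(M @ v)          # the layer's estimated Lipschitz constant

# A toy "layer" with very small weights (symmetric, so this simple recurrence
# really does target its largest singular value):
M = 1e-8 * np.array([[3.0, 0.0],
                     [0.0, 1.0]])
print(np.linalg.norm(M, 2))       # true Lipschitz constant: 3e-08
print(power_iteration_eps(M))     # essentially zero: a severe under-estimate
```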

That means that Leino et al.’s certifier can incorrectly tell you that a model is robust when—in reality—it is not.

Exploitability

I said this issue is exploitable. How, and under which conditions?

Suppose you are an adversary. You decide to post online a machine learning model that you claim enjoys a certain level of accuracy and robustness, when evaluated on a test dataset. I, as a consumer of your model, might decide to deploy it. But first I’d like to validate your claims about the model’s robustness. So I run the model on the test dataset and then use Leino et al.’s certifier to tell me which of the test points were not only accurately classified by the model, but were also robust. If you as the adversary have chosen the model weights carefully to ensure they are sufficiently small, then this issue will be triggered and will cause Leino et al.’s certifier to say that the model is indeed as robust as you (the adversary) have claimed it to be. In reality, the model may be highly non-robust.

Of course, you need a way of training a model that is highly accurate and leads to the model having very small weights (at least in some layers). A simple way to do this is to train your model as normal. Then, you repeatedly halve all of the weights in its second-to-last layer while simultaneously doubling the weights in its final (output) layer. Doing so does not meaningfully change the model’s Lipschitz constants (because the halving and doubling act to cancel each other out). However, eventually the weights in the second-to-last layer get so small that Leino et al.’s power iteration implementation massively under-estimates the Lipschitz constant of the second-to-last layer. Once this happens, their certifier will be misled into reporting non-robust outputs as being robust.
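Here is a hypothetical sketch of that trick on plain NumPy weight arrays (the variable names and the number of rounds are made up for illustration):

```python
def shrink_penultimate_layer(W_penult, b_penult, W_final, rounds=40):
    """Hypothetical sketch: make the second-to-last layer's weights tiny
    without changing the function the network computes."""
    for _ in range(rounds):
        W_penult *= 0.5   # the second-to-last layer's weights shrink ...
        b_penult *= 0.5   # ... (halving its bias too keeps its ReLU output exactly
                          #      halved, because ReLU is positively homogeneous)
        W_final *= 2.0    # ... while doubling the output layer undoes the halving
    return W_penult, b_penult, W_final
```

The 0.5 and the 2 cancel in the product of the two layers’ Lipschitz constants, and the network’s predictions are unchanged, but after enough rounds the penultimate layer’s weights become small enough to trigger the under-estimation described above.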

In a simple experiment to demonstrate this issue, we trained an MNIST model using standard (non-robust) training, achieving 98.45% test accuracy. After performing the halving and doubling described above for a number of iterations, we evaluated the resulting model on the 10,000 MNIST test points. We used Leino et al.’s certifier to tell us which of those 10,000 points were both accurately classified and robust (at ε = 1.58). It happily reported that the model was accurate and robust for 98.42% of the 10,000 points, and that it was robust (at ε = 1.58) for 99.96% of them.

Of course, the model was not robust at all—indeed, we didn’t train it to be robust. To confirm this, we then used standard adversarial methods for generating adversarial examples. These methods search for inputs that are within ε of each test point but cause the model to produce a different classification.

We were able to generate adversarial examples at ε = 1.58 for 8,682 of the test points that Leino et al.’s certifier said were robust, despite their certifier reporting that the model’s robustness was 99.96%.

To put it another way, this model, when evaluated on the MNIST test set, causes Leino et al.’s certifier to be fooled about 87% of the time!

Here is but one example of the 8,682 we generated. On the left is the original MNIST test image (image 6,169 of the 10,000 images in the MNIST test set), which is correctly classified as a “4”. In the middle is the noise added to this image, whose norm does not exceed 1.58. On the right is the resulting adversarial example, which is within ε = 1.58 of the original image, but which the MNIST model classifies as a “6”. Leino et al.’s certifier mistakenly certifies the original MNIST test image as robust at ε = 1.58.

MNIST test image 6169 and adversarial example that Leino et al.’s certifier mistakenly certifies as robust.

Below is a second example, created from MNIST test image 9,599. Here the norm of the added noise is much smaller, and well below the perturbation bound of ε = 1.58. Leino et al.’s certifier certifies the original image on the left as robust at ε = 1.58, despite the adversarial example pictured on the right.

MNIST test image 9599 and adversarial example that Leino et al.’s certifier mistakenly certifies as robust.

Issue 3: Floating-Point Imprecision

We believe the subtle normalisation bug is relatively easy to fix, and we proposed a fix when we reported it. That issue arises in models that have small weights.

A much more pernicious issue that also arises for models with small weights is that Leino et al.’s implementation uses floating point arithmetic. This is a natural choice because their implementation has to be run at training time, and floating point arithmetic is highly efficient. However, it necessarily limits the numerical precision of their implementation.

Consequently, for models whose weights are exceedingly tiny, Leino et al.’s implementation can mistakenly compute Lipschitz constants of 0. This happens because of loss of precision.
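Here is a tiny self-contained illustration of that effect (the specific numbers are made up): in 32-bit floating point, squaring a sufficiently small weight underflows to zero, so a norm computed naively from such weights, and with it the layer’s estimated Lipschitz constant, comes out as exactly 0.

```python
import numpy as np

w = np.float32(1e-30)
print(w * w)                                  # 0.0: 1e-60 underflows in float32
weights = np.full((64, 64), w, dtype=np.float32)
print(np.sqrt((weights ** 2).sum()))          # 0.0, although the exact value, 6.4e-29,
                                              # is comfortably representable in float32
```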

Exploitability

This issue can be exploited in exactly the same way as the subtle normalisation bug described above, by crafting models that are highly non-robust but have very tiny weights. We confirmed this by fixing the normalisation issue and then running the same exploitation procedure as described above.

Doing so produces an MNIST model whose Lipschitz constants Leino et al.’s certifier mistakenly calculates as 0. This causes it to mistakenly conclude that the model is 100% robust! Of course, it is not.

Issue 4: A Subtle Certification Routine Bug

The final issue we discovered has nothing to do with power iteration or Lipschitz constants. Instead, it is in the part of Leino et al.’s certifier that is run to check whether an individual output point is robust, and which does so by using the Lipschitz constants already computed via power iteration.

So for the purposes of this bug, let’s assume Leino et al.’s implementation correctly computes Lipschitz constants.

We originally reported this issue in March 2025 but at the time of writing it also remains unacknowledged.

This bug is a tricky implementation error that is triggered when a neural network produces an output vector whose elements are all equal: in that case, Leino et al.’s implementation will certify the output as robust.

The canonical example is the 0 output vector (i.e., the vector whose elements are all 0).

Importantly, their certifier will certify any such output for any perturbation bound ε.

Exploitability

This issue is also exploitable; however, it is likely far less concerning than the normalisation bug described above.

It can be exploited by an adversary who has the ability to craft model inputs that produce particular outputs. For instance, suppose an adversary has so-called white-box access to a model, i.e., they know the model architecture and its weights. This adversary can then use gradient descent to search for inputs that map to output vectors with equal components.

We were able to exploit this issue by training simple globally robust models (e.g. for MNIST, Fashion MNIST, and CIFAR-10) using Leino et al.’s implementation and then running gradient descent to search for inputs that produce the 0 output vector (using the Mean Absolute Error loss function).

Doing so produces inputs that Leino et al.’s certifier will say are robust for any perturbation bound, no matter how large. Of course, no input of any meaningful classifier is genuinely robust at arbitrarily large perturbation bounds.

Therefore, in a technical sense, this issue makes Leino et al.’s certifier unsound: there exist instances in which it mistakenly says an answer is robust even though it is not. However, the danger posed by this issue is limited.

In practice, the kinds of inputs that produce the 0 output vector tend to be pretty meaningless, even though they certainly seem to exist and to be easy to find for even these simple kinds of models. Below are the images that produce the 0 output vector, which Leino et al.’s certifier mistakenly certifies as robust for any perturbation bound ε, for simple MNIST, Fashion MNIST and CIFAR-10 models.

Images for which trained classifiers produce the zero output vector.

What’s in this post’s name?

Having described the various issues we discovered, let me close out this post by explaining its title.

When talking about certified robustness, a common measure that is used in the machine learning community is something called Verified Robust Accuracy (VRA). Given a machine learning model evaluated over some test data set, VRA measures the proportion of test points that the model correctly classifies and that are also robust (at some chosen perturbation bound ε).
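In symbols (my notation, just for this post): writing $T$ for the test set, $f$ for the model, and certifying robustness at perturbation bound $\epsilon$,

$$ \mathrm{VRA}_{\epsilon} = \frac{|\{\, \mathbf{x} \in T : f \text{ classifies } \mathbf{x} \text{ correctly and } \mathbf{x} \text{ is certified robust at } \epsilon \,\}|}{|T|} $$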

In fact, we already encountered VRA (though I didn’t call it that) when discussing the exploitability of the normalisation bug. There I described how we exploited that bug by training an MNIST model that was not at all robust, but for which Leino et al.’s certifier reported a VRA of 98.42% at ε = 1.58, even though we were able to produce adversarial examples for 8,682 of the 10,000 test points.

A model’s robustness is an upper bound on its VRA (since the model can be robust for points that it does not classify accurately). The MNIST exploit for the normalisation bug above shows that the model was robust for at most 1,318 of the 10,000 test points, i.e. its maximum robustness was 13.18%. This means that its VRA is no higher than about 13%—nowhere near 98.42%.

To my knowledge, the best known VRA at ε = 1.58 for an MNIST model is in fact about 63%.

This brings me to two points:

  1. Employing the adjective “verified” when talking about quantities computed by unverified (and indeed buggy) implementations is problematic
  2. This is especially so for fields like Machine Learning whose current progress is so often driven by empirical findings

In particular, by exploiting either the normalisation or floating-point issues described above, I could create an ML model that I claim has exceptional VRA, well beyond what has been possible to achieve with existing robust training algorithms. You, as a peer reviewer, might be tempted to check my claim by employing a state-of-the-art robustness certifier. Yet you may very well be misled: the VRA you compute could bear no relation to the model’s actual performance.

This is yet another reminder of why it’s important to be precise when using words like “verified”, in order to avoid creating a false sense of security.

As a formal verification researcher, I would naturally contend that the qualifier “verified” should be reserved for cases in which some formal verification has actually taken place.

One way to think about our verified robustness certifier, then, is that it puts the “verified” into “Verified Robust Accuracy”. Indeed, our certifier is immune to the issues I’ve described in this post.

Exactly how it achieves that immunity is something I intend to make the subject of a future post in this series.

Exploit Code

In the meantime, if you’re curious about any of the bugs or exploits I’ve described in this post, you can find our exploit code on GitHub, in the scripts/ directory:

One way to run and play with this code is within the Docker container of the artifact that accompanies our CAV 2025 paper.

As always, please get in touch if you’re interested in this work.

Acknowledgements

Thanks to James Tobler, who is lead author of the paper that is the subject of this series, for insightful feedback on earlier drafts of this post.