Variational algorithm design

Cost Functions

During this lesson, we'll learn how to evaluate a cost function:

  • First, we'll learn about the Qiskit Runtime primitives
  • Then, we'll define a cost function $C(\vec\theta)$: a problem-specific function that defines the problem's goal for the optimizer to minimize (or maximize)
  • Finally, we'll define a measurement strategy with the Qiskit Runtime primitives to trade off speed against accuracy


Cost function workflow


All physical systems, whether classical or quantum, can exist in different states. For example, a car on a road can have a certain mass, position, speed, or acceleration that characterize its state. Similarly, quantum systems can also have different configurations or states, but they differ from classical systems in how we deal with measurements and state evolution. This leads to unique properties such as superposition and entanglement that are exclusive to quantum mechanics. Just like we can describe a car's state using physical properties like speed or acceleration, we can also describe the state of a quantum system using observables, which are mathematical objects.

In quantum mechanics, states are represented by normalized complex column vectors, or kets ($|\psi\rangle$), and observables are Hermitian linear operators ($\hat{H}=\hat{H}^{\dagger}$) that act on the kets. An eigenvector ($|\lambda\rangle$) of an observable is known as an eigenstate. Measuring an observable for one of its eigenstates ($|\lambda\rangle$) will give us the corresponding eigenvalue ($\lambda$) as readout.
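As a concrete illustration (not part of the original lesson), the Pauli $Z$ observable is Hermitian, and its eigenstates $|0\rangle$ and $|1\rangle$ carry eigenvalues $+1$ and $-1$:

```python
import numpy as np

# Pauli-Z observable
Z = np.array([[1, 0], [0, -1]])

# Hermitian: equal to its own conjugate transpose
assert np.allclose(Z, Z.conj().T)

# Eigendecomposition: eigh returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(Z)
print(eigenvalues)  # [-1.  1.]
```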

If you're wondering how to measure a quantum system and what you can measure, Qiskit offers two primitives that can help:

  • Sampler: Given a quantum state $|\psi\rangle$, this primitive obtains the probability of each possible computational basis state.
  • Estimator: Given a quantum observable $\hat{H}$ and a state $|\psi\rangle$, this primitive computes the expected value of $\hat{H}$.

The Sampler primitive

The Sampler primitive calculates the probability of obtaining each possible state $|k\rangle$ from the computational basis, given a quantum circuit that prepares the state $|\psi\rangle$. It calculates

$$p_k = |\langle k | \psi \rangle|^2 \quad \forall k \in \mathbb{Z}_2^n \equiv \{0,1,\cdots,2^n-1\},$$

where $n$ is the number of qubits, and $k$ is the integer representation of any possible output binary string $\{0,1\}^n$ (i.e., an integer in base $2$).

Qiskit Runtime's Sampler runs the circuit multiple times on a quantum device, performing measurements on each run, and reconstructing the probability distribution from the recovered bit strings. The more runs (or shots) it performs, the more accurate the results will be, but this requires more time and quantum resources.

However, since the number of possible outputs grows exponentially with the number of qubits $n$ (i.e., $2^n$), the number of shots would also need to grow exponentially to capture a dense probability distribution. Therefore, Sampler is only efficient for sparse probability distributions: the target state $|\psi\rangle$ must be expressible as a linear combination of computational basis states whose number of terms grows at most polynomially with the number of qubits:

$$|\psi\rangle = \sum^{\text{Poly}(n)}_k w_k |k\rangle.$$

The Sampler can also be configured to retrieve probabilities from a subsection of the circuit, representing a subset of the total possible states.

The Estimator primitive

The Estimator primitive calculates the expectation value of an observable $\hat{H}$ for a quantum state $|\psi\rangle$, where the outcome probabilities can be expressed as $p_\lambda = |\langle\lambda|\psi\rangle|^2$, with $|\lambda\rangle$ the eigenstates of the observable $\hat{H}$. The expectation value is then defined as the average of all possible outcomes $\lambda$ (i.e., the eigenvalues of the observable) of a measurement of the state $|\psi\rangle$, weighted by the corresponding probabilities:

$$\langle\hat{H}\rangle_\psi := \sum_\lambda p_\lambda \lambda = \langle \psi | \hat{H} | \psi \rangle$$

However, calculating the expectation value of an observable directly is not always possible, since we often don't know its eigenbasis. Qiskit Runtime's Estimator estimates the expectation value on a real quantum device by breaking down the observable into a combination of other observables whose eigenbases we do know.

In simpler terms, Estimator breaks down any observable that it doesn't know how to measure into simpler, measurable observables called Pauli observables.

Any operator can be expressed as a combination of the $4^n$ Pauli operators

$$\hat{P}_k := \sigma_{k_{n-1}}\otimes \cdots \otimes \sigma_{k_0} \quad \forall k \in \mathbb{Z}_4^n \equiv \{0,1,\cdots,4^n-1\},$$

such that

$$\hat{H} = \sum^{4^n-1}_{k=0} w_k \hat{P}_k$$

where $n$ is the number of qubits, $k \equiv k_{n-1} \cdots k_0$ for $k_l \in \mathbb{Z}_4 \equiv \{0, 1, 2, 3\}$ (i.e., integers in base $4$), and $(\sigma_0, \sigma_1, \sigma_2, \sigma_3) := (I, X, Y, Z)$.

After performing this decomposition, Estimator derives a new circuit $V_k|\psi\rangle$ for each observable $\hat{P}_k$ (i.e., from the original circuit) to effectively diagonalize the Pauli observable in the computational basis and measure it. We can easily measure Pauli observables because we know $V_k$ ahead of time, which is not generally the case for other observables.

For each $\hat{P}_{k}$, the Estimator runs the corresponding circuit on a quantum device multiple times, measures the output state in the computational basis, and calculates the probability $p_{kj}$ of obtaining each possible output $j$. It then looks up the eigenvalue $\lambda_{kj}$ of $\hat{P}_k$ corresponding to each output $j$, multiplies by $w_k$, and adds all the results together to obtain the expected value of the observable $\hat{H}$ for the given state $|\psi\rangle$:

$$\langle\hat{H}\rangle_\psi = \sum_{k=0}^{4^n-1} w_k \sum_{j=0}^{2^n-1}p_{kj} \lambda_{kj}.$$

Since calculating the expectation value of all $4^n$ Paulis is impractical (i.e., exponentially growing), Estimator can only be efficient when many of the $w_k$ are zero (i.e., a sparse rather than dense Pauli decomposition). Formally, for this computation to be efficiently solvable, the number of non-zero terms has to grow at most polynomially with the number of qubits $n$: $$\hat{H} = \sum^{\text{Poly}(n)}_k w_k \hat{P}_k.$$

The reader may notice the implicit assumption that the probability distributions also need to be sparse, as explained for Sampler, which means

$$\langle\hat{H}\rangle_\psi = \sum_{k}^{\text{Poly}(n)} w_k \sum_{j}^{\text{Poly}(n)}p_{kj} \lambda_{kj}.$$

Guided example to calculate expectation values

Let's assume the single-qubit state $|+\rangle := H|0\rangle = \frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)$, and observable

$$\begin{aligned} \hat{H} & = \begin{pmatrix} -1 & 2 \\ 2 & 1 \end{pmatrix}\\[1mm] & = 2X - Z \end{aligned}$$

with the following theoretical expectation value: $\langle\hat{H}\rangle_+ = \langle+|\hat{H}|+\rangle = 2$.

Since we do not know how to measure this observable, we cannot compute its expectation value directly, so we re-express it as $\langle\hat{H}\rangle_+ = 2\langle X \rangle_+ - \langle Z \rangle_+$, which can be shown to evaluate to the same result by noting that $\langle+|X|+\rangle = 1$ and $\langle+|Z|+\rangle = 0$.

Let's see how to compute $\langle X \rangle_+$ and $\langle Z \rangle_+$ directly. Since $X$ and $Z$ do not commute (i.e., they don't share the same eigenbasis), they cannot be measured simultaneously, so we need the following auxiliary circuits:



We can now carry out the computation manually using Sampler and check the results on Estimator:



Sampler results:
  >> Expected value of X: 1.00000
  >> Expected value of Z: 0.00000
  >> Total expected value: 2.00000
Estimator results:
  >> Expected value of X: 1.00000
  >> Expected value of Z: 0.00000
  >> Total expected value: 2.00000

Mathematical rigor (optional)

Expressing $|\psi\rangle$ with respect to the basis of eigenstates of $\hat{H}$, $|\psi\rangle = \sum_\lambda a_\lambda |\lambda\rangle$, it follows that:

$$\begin{aligned} \langle \psi | \hat{H} | \psi \rangle & = \bigg(\sum_{\lambda'}a^*_{\lambda'} \langle \lambda'|\bigg) \hat{H} \bigg(\sum_{\lambda} a_\lambda | \lambda\rangle\bigg)\\[1mm] & = \sum_{\lambda}\sum_{\lambda'} a^*_{\lambda'}a_{\lambda} \langle \lambda'|\hat{H}| \lambda\rangle\\[1mm] & = \sum_{\lambda}\sum_{\lambda'} a^*_{\lambda'}a_{\lambda} \lambda \langle \lambda'| \lambda\rangle\\[1mm] & = \sum_{\lambda}\sum_{\lambda'} a^*_{\lambda'}a_{\lambda} \lambda \cdot \delta_{\lambda, \lambda'}\\[1mm] & = \sum_\lambda |a_\lambda|^2 \lambda\\[1mm] & = \sum_\lambda p_\lambda \lambda \end{aligned}$$
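This identity is easy to verify numerically; here is a short check using the observable $2X - Z$ and state $|+\rangle$ from the guided example (an illustrative choice):

```python
import numpy as np

H = np.array([[-1, 2], [2, 1]])      # 2X - Z
psi = np.array([1, 1]) / np.sqrt(2)  # |+>

# Eigendecomposition: columns of `vecs` are the eigenstates |lambda>
lam, vecs = np.linalg.eigh(H)
p = np.abs(vecs.conj().T @ psi) ** 2  # p_lambda = |<lambda|psi>|^2

lhs = np.real(psi.conj() @ H @ psi)   # <psi|H|psi>
rhs = np.sum(p * lam)                 # sum_lambda p_lambda * lambda
print(lhs, rhs)  # both equal 2.0
```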

Since we do not know the eigenvalues or eigenstates of the target observable $\hat{H}$, we first need to consider its diagonalization. Given that $\hat{H}$ is Hermitian, there exists a unitary transformation $V$ such that $\hat{H}=V^\dagger \Lambda V$, where $\Lambda$ is the diagonal eigenvalue matrix, so $\langle j | \Lambda | k \rangle = 0$ if $j\neq k$, and $\langle j | \Lambda | j \rangle = \lambda_j$.

This implies that the expected value can be rewritten as:

$$\begin{aligned} \langle\psi|\hat{H}|\psi\rangle & = \langle\psi|V^\dagger \Lambda V|\psi\rangle\\[1mm] & = \langle\psi|V^\dagger \bigg(\sum_{j=0}^{2^n-1} |j\rangle \langle j|\bigg) \Lambda \bigg(\sum_{k=0}^{2^n-1} |k\rangle \langle k|\bigg) V|\psi\rangle\\[1mm] & = \sum_{j=0}^{2^n-1} \sum_{k=0}^{2^n-1}\langle\psi|V^\dagger |j\rangle \langle j| \Lambda |k\rangle \langle k| V|\psi\rangle\\[1mm] & = \sum_{j=0}^{2^n-1}\langle\psi|V^\dagger |j\rangle \langle j| \Lambda |j\rangle \langle j| V|\psi\rangle\\[1mm] & = \sum_{j=0}^{2^n-1}|\langle j| V|\psi\rangle|^2 \lambda_j \end{aligned}$$

Given that, if a system is in the state $|\phi\rangle = V |\psi\rangle$, the probability of measuring $|j\rangle$ is $p_j = |\langle j|\phi \rangle|^2$, the above expected value can be expressed as:

$$\langle\psi|\hat{H}|\psi\rangle = \sum_{j=0}^{2^n-1} p_j \lambda_j.$$

It is very important to note that the probabilities are taken from the state $V |\psi\rangle$ instead of $|\psi\rangle$. This is why the matrix $V$ is absolutely necessary.

You might be wondering how to obtain the matrix $V$ and the eigenvalues $\Lambda$. If you already had the eigenvalues, there would be no need to use a quantum computer, since the goal of variational algorithms is to find these eigenvalues of $\hat{H}$.

Fortunately, there is a way around that: any $2^n \times 2^n$ matrix can be written as a linear combination of $4^n$ tensor products of $n$ Pauli matrices and identities, all of which are both Hermitian and unitary with known $V$ and $\Lambda$. This is what Runtime's Estimator does internally by decomposing any Operator object into a SparsePauliOp.

Here are the Operators that can be used:

$$\begin{array}{c|c|c|c} \text{Operator} & \sigma & V & \Lambda \\[1mm] \hline I & \sigma_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} & V_0 = I & \Lambda_0 = I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \\[4mm] X & \sigma_1 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} & V_1 = H =\frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} & \Lambda_1 = \sigma_3 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \\[4mm] Y & \sigma_2 = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix} & V_2 = HS^\dagger =\frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\cdot \begin{pmatrix} 1 & 0 \\ 0 & -i \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -i \\ 1 & i \end{pmatrix}\quad & \Lambda_2 = \sigma_3 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \\[4mm] Z & \sigma_3 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} & V_3 = I & \Lambda_3 = \sigma_3 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \end{array}$$
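The table entries can be checked numerically; a brief sketch verifying $\hat{P}=V^\dagger \Lambda V$ for each single-qubit Pauli:

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]])
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]])
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
Sdg = np.array([[1, 0], [0, -1j]])  # S^dagger

V = {"I": I2, "X": H, "Y": H @ Sdg, "Z": I2}
Lam = {"I": I2, "X": Z, "Y": Z, "Z": Z}
paulis = {"I": I2, "X": X, "Y": Y, "Z": Z}

# Each Pauli satisfies P = V^dagger Lambda V, with V and Lambda from the table
for name, P in paulis.items():
    assert np.allclose(P, V[name].conj().T @ Lam[name] @ V[name])
```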

So let's rewrite $\hat{H}$ with respect to the Paulis and identities:

$$\hat{H} = \sum_{k_{n-1}=0}^3 \cdots \sum_{k_0=0}^3 w_{k_{n-1}...k_0} \sigma_{k_{n-1}}\otimes \cdots \otimes \sigma_{k_0} = \sum_{k=0}^{4^n-1} w_k \hat{P}_k,$$

where $k = \sum_{l=0}^{n-1} 4^l k_l \equiv k_{n-1}...k_0$ for $k_{n-1},...,k_0\in \{0,1,2,3\}$ (i.e., base $4$), and $\hat{P}_{k} := \sigma_{k_{n-1}}\otimes \cdots \otimes \sigma_{k_0}$:

$$\begin{aligned} \langle\psi|\hat{H}|\psi\rangle & = \sum_{k=0}^{4^n-1} w_k \sum_{j=0}^{2^n-1}|\langle j| V_k|\psi\rangle|^2 \langle j| \Lambda_k |j\rangle \\[1mm] & = \sum_{k=0}^{4^n-1} w_k \sum_{j=0}^{2^n-1}p_{kj} \lambda_{kj}, \end{aligned}$$

where $V_k := V_{k_{n-1}}\otimes \cdots \otimes V_{k_0}$ and $\Lambda_k := \Lambda_{k_{n-1}}\otimes \cdots \otimes \Lambda_{k_0}$, such that $\hat{P_k}=V_k^\dagger \Lambda_k V_k$.

Cost functions

In general, cost functions are used to describe the goal of a problem and how well a trial state is performing with respect to that goal. This definition can be applied to various examples in chemistry, machine learning, finance, optimization, and so on.

Let's consider a simple example of finding the ground state of a system. Our objective is to minimize the expectation value of the observable representing energy (the Hamiltonian $\hat{\mathcal{H}}$):

$$\min_{\vec\theta} \langle\psi(\vec\theta)|\hat{\mathcal{H}}|\psi(\vec\theta)\rangle$$

We can use the Estimator to evaluate the expectation value and pass this value to an optimizer to minimize. If the optimization is successful, it will return a set of optimal parameter values $\vec\theta^*$, from which we will be able to construct the proposed solution state $|\psi(\vec\theta^*)\rangle$ and compute the observed expectation value $C(\vec\theta^*)$.

Notice how we will only be able to minimize the cost function for the limited set of states that we are considering. This leads us to two separate possibilities:

  • Our ansatz cannot represent the solution state: If this is the case, our optimizer will never find the solution, and we need to experiment with other ansatzes that might cover our search space more accurately.
  • Our optimizer is unable to find this valid solution: Optimizers can search globally or locally. We'll explore what this means in a later section.

All in all, we will be performing a classical optimization loop, but relying on a quantum computer to evaluate the cost function. From this perspective, one could think of the optimization as a purely classical endeavor where we call some quantum subroutine each time the optimizer needs to evaluate the cost function.




Example mapping to non-physical systems

The maximum cut (Max-Cut) problem is a combinatorial optimization problem that involves dividing the vertices of a graph into two disjoint sets such that the number of edges between the two sets is maximized. More formally, given an undirected graph G=(V,E)G=(V,E), where VV is the set of vertices and EE is the set of edges, the Max-Cut problem asks to partition the vertices into two disjoint subsets, SS and TT, such that the number of edges with one endpoint in SS and the other in TT is maximized.

We can apply Max-Cut to various problems, including clustering, network design, and phase transitions. We'll start by creating a problem graph.
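The original cell builds the graph with a graph library; a dependency-free sketch using an adjacency matrix for a hypothetical 4-node ring graph (the specific graph is an assumption, not the lesson's original):

```python
import numpy as np

n = 4  # number of nodes

# Symmetric adjacency matrix of a 4-node ring: edges (0,1), (1,2), (2,3), (3,0)
w = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    w[i, j] = w[j, i] = 1

print(w)
```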



This problem can be expressed as a binary optimization problem. For each node $0 \leq i < n$, where $n$ is the number of nodes of the graph (in this case $n=4$), we will consider the binary variable $x_i$. This variable will have the value $1$ if node $i$ is in the group that we label $1$, and $0$ if it's in the other group, which we label $0$. We will also denote by $w_{ij}$ (element $(i,j)$ of the adjacency matrix $w$) the weight of the edge that goes from node $i$ to node $j$. Because the graph is undirected, $w_{ij}=w_{ji}$. Then we can formulate our problem as maximizing the following cost function:

$$\begin{aligned} C(\vec{x}) & =\sum_{i,j=0}^n w_{ij} x_i(1-x_j)\\[1mm] & = \sum_{i,j=0}^n w_{ij} x_i - \sum_{i,j=0}^n w_{ij} x_ix_j\\[1mm] & = \sum_{i,j=0}^n w_{ij} x_i - \sum_{i=0}^n \sum_{j=0}^i 2w_{ij} x_ix_j \end{aligned}$$
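This cost function is straightforward to evaluate classically; a sketch assuming a 4-node ring graph as the adjacency matrix (an illustrative choice, not the lesson's original graph):

```python
import numpy as np

# Adjacency matrix of an illustrative 4-node ring graph
w = np.array([
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
])

def maxcut_cost(x, w):
    """C(x) = sum_ij w_ij * x_i * (1 - x_j): total weight of cut edges."""
    x = np.asarray(x)
    return float(np.sum(w * np.outer(x, 1 - x)))

# The alternating partition {0, 2} vs {1, 3} cuts all four edges
print(maxcut_cost([0, 1, 0, 1], w))  # 4.0
```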