## Introduction to convolution with link to filters in computer vision

Convolution is arguably one of the most important operations in computer vision. It can be used to modify an image (e.g. blurring), find relevant structures (e.g. edge detection) or infer arbitrary features (e.g. in machine learning). It is often one of the first steps in an image processing pipeline. Abstractly speaking, this operation takes an image as input and produces an image as output by operating on each pixel of the input image. For this, it takes the neighbourhood of the pixel into account, and its behaviour can be configured by so-called kernels. This makes it a very powerful and versatile operation.

At its core, convolution is, first of all, a mathematical operation with general properties and not necessarily restricted to the computer vision domain (it is popular in signal processing in general). However, the gap between the mathematical definition and the actual calculations we perform can be confusing at first glance. The goal of this article is to provide a deep understanding of the convolution operation and to bridge the gap between the theoretical definition and the practical operations. For this, we are going to look at various perspectives: we start in the 1D discrete domain, move on to the 1D continuous case and conclude in the 2D discrete world (skip to this last section if you are only interested in image processing).

### 1D – discrete

Let's start with the one-dimensional and discrete case. There is a very good example at BetterExplained which we want to take up here and use as a basis. Suppose you are a doctor leading a medical study in which you oversee the test of a new medicine (a new pill) that has a strict plan of intake. A patient has to come to your office on three consecutive days, taking three pills on the first, two pills on the second and one pill on the third day. Now, assume further that you don't have just one but multiple patients. To make things even more complicated, the patients start their treatment on different days. Let's assume that on the first day just one patient comes to your office. He/she will get three pills from you. On the second day, two new patients and the one from yesterday come to your office. The patient from yesterday gets two pills and the other two will get three pills each, leaving a total of eight pills you need to have in stock in your office (to satisfy all demands)1. Of course, you want to keep track of the number of pills required per day. You could go on with business as usual, but this will probably get complicated and confusing. Maybe there is a better way...

We are now going to optimize this problem. We have two variables in this scenario: the plan for the intake of the pills per treatment day $$\tau = 1, 2, 3$$ of a patient and the number of new patients coming to the office on each day $$t = 1, 2, \ldots$$ of the experiment. In the end, we would like to retrieve the number of pills we need to hand out per experiment day $$t$$.

The plan is a list of values denoting the number of pills required per treatment day $$\tau$$ of one patient, i.e. $$\fvec{k} = (3, 2, 1)$$ in this example. The number of patients can also be stored in a list denoting the number of new patients arriving on a day $$t$$ of the experiment. Let us assume that this list would look like $$\fvec{s} = (1, 2, 3, 4, 5)$$ for our example (1 new patient comes on the first, 2 new patients come on the second day, etc.). We now want to combine the two lists in a way so that we know how many pills we need per experiment day $$t$$. For this, we summarize the information in a table.

The top row shows the plan with the number of pills in decreasing order. Below it, the numbers of new patients are arranged (visibility depends on the experiment day $$t$$). We multiply the two lists together over the treatment days $$\tau$$ so that each column denotes the total number of pills required for first-time, second-time and third-time patients. Then we sum up these values, and this result is the total number of required pills per experiment day $$t$$.

This is a very dynamic scenario and we end up with one table per experiment day $$t$$. This is essentially what you can control with the slider and has the effect of sliding the patient list from left to right over the plan list. However, if we used the original order of the patient list, the number 5 would come first but there is only one patient on the first day. So, we have to reverse the patient list: $$(5, 4, 3, 2, 1)$$.

Hopefully, we can agree that this representation makes the problem much clearer than the manual (naive) approach. Just with a combination of basic operations (multiplication and addition), we can easily retrieve the value we want.

You may have noticed that some terms in the previous paragraphs were highlighted. In summary:

• Plan: a function $$\label{eq:Convolution_ExamplePlan} k(\tau) = \begin{cases} k_{\tau + 1} & 0 \leq \tau \leq 2 \\ 0 & \text{else} \end{cases}$$ with the treatment day $$\tau$$ as argument and the number of required pills as the function value2.
• Patients: a function $$\label{eq:Convolution_ExamplePatients} s(t) = \begin{cases} s_t & 1 \leq t \leq 5 \\ 0 & \text{else} \end{cases}$$ with the day $$t$$ of the experiment as argument and the number of new patients coming on that day as the function value.
• Reverse order of the patient list (helpful to model our problem).
• Multiplication of the values in the patient and plan lists along the treatment days $$\tau$$.
• Addition of the previous multiplication results.
• New function $$p(t)$$ which returns the total number of pills we need to have in stock on each day $$t$$ of the experiment.

These points provide the ingredients for the definition of the convolution operation.

Definition 1: Convolution [1D, discrete]

Let $$s(t)$$ and $$k(t)$$ denote a one-dimensional and discrete ($$t \in \mathbb{Z}$$) signal as well as kernel, respectively. Then, applying the convolution operator $$*$$ on both functions

$$\label{eq:Convolution_1DDiscrete} {\color{Mulberry}p}(t) = {\color{LimeGreen}s}(t) * {\color{Aquamarine}k}(t) = {\color{Mahogany}\sum_{\tau=-\infty}^{\infty}} {\color{LimeGreen}s}({\color{Orange}-}{\color{Mahogany}\tau} + t) {\color{Red}\cdot} {\color{Aquamarine}k}({\color{Mahogany}\tau})$$

yields the response function $$p(t)$$. In essence, this operation reverses the signal $${\color{LimeGreen}s}$$, combines it with the kernel $${\color{Aquamarine}k}$$ and aggregates the result into a new function $${\color{Mulberry}p}$$.

It is, of course, not really practical to have a sum which iterates over $$\infty$$. However, this is not a problem when our discrete functions store only a finite number of elements (always the case in practice). In this case, we can define the values outside the relevant range as 0 (like we did in \eqref{eq:Convolution_ExamplePlan} and \eqref{eq:Convolution_ExamplePatients}) so that most of the multiplications disappear. In the end, we only iterate over the relevant intersection between $$s(t)$$ and $$k(t)$$. It makes sense to let the relevant range be defined by the smaller function so that we save computations. In our example, this is the plan $$k(\tau)$$, which has only $$n=3$$ elements, letting us rewrite \eqref{eq:Convolution_1DDiscrete} as

\begin{equation*} {\color{Mulberry}p}(t) = {\color{Mahogany}\sum_{\tau = 0}^{n-1}} {\color{LimeGreen}s}({\color{Orange}-}{\color{Mahogany}\tau} + t) {\color{Red}\cdot} {\color{Aquamarine}k}({\color{Mahogany}\tau}). \end{equation*}

This function can easily be evaluated. For example, when we want to know how many pills we need to hand out on the third day of the experiment, we calculate

\begin{equation*} p(3) = s(3) \cdot k(0) + s(2) \cdot k(1) + s(1) \cdot k(2) = 9 + 4 + 1 = 14. \end{equation*}
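The finite sum can be written down directly in code. The following is a minimal sketch (the function name `pills_needed` is my own); the `np.convolve` call at the end is just a cross-check against NumPy's built-in implementation:

```python
import numpy as np

def pills_needed(t, s, k):
    """Evaluate p(t) = sum_tau s(-tau + t) * k(tau) for the pill example.

    s[i] is the number of new patients on experiment day i+1, k[j] the
    number of pills on treatment day j+1; values outside these ranges
    are treated as zero.
    """
    total = 0
    for tau in range(len(k)):   # iterate only over the (small) kernel
        day = t - tau           # -tau + t
        if 1 <= day <= len(s):
            total += s[day - 1] * k[tau]
    return total

s = [1, 2, 3, 4, 5]   # new patients per experiment day
k = [3, 2, 1]         # pills per treatment day

print(pills_needed(3, s, k))   # 14, as computed above
print(np.convolve(s, k))       # [ 3  8 14 20 26 14  5]
```

The full convolution also reveals the tail of the experiment: even though no new patients arrive after day 5, pills are still needed on days 6 and 7 for the remaining treatments.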

To visualize the function $$p(t)$$ of \eqref{eq:Convolution_1DDiscrete}, we can imagine that the patient function $$s(-\tau + t)$$ slides over the plan function $$k(\tau)$$. At each step $$t$$, the two functions get multiplied together and summed up. This is also illustrated in the following animation.

The abbreviations which have been used so far are not arbitrarily chosen. $$s$$ stands for the signal, since convolution is often applied to signals (audio, images, etc.). $$k$$ is the kernel, which defines the kind of operation we want to perform. There are, for example, kernels for blurring, averaging or finding the derivatives of a signal. Even the convolution variable $$\tau$$ is chosen deliberately. Both $$t$$ and $$\tau$$ denote some point in time: $$t$$ specifies the timestamp currently of interest and $$\tau$$ iterates over the complete relevant time range which we need to consider in order to calculate the result at step $$t$$.

In our example, we flipped the signal (patients) and slid it over the kernel (plan). This was not much of a problem because both vectors contain only a few numbers. However, with larger signals (like from the real world) this procedure would get impractical. Luckily, the convolution operation has the commutativity property, i.e.

$$\label{eq:Convolution_Commutativity} s(t)*k(t) = k(t)*s(t).$$

In other words: it doesn't matter whether we flip and slide the signal or the kernel. In terms of the previous example, this would mean that we could also have fixed the patient data and reversed and slid the plan list over the patient list. The result is the same. In practice, we can always flip and slide the smaller of the two vectors and save computation time.
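This property is easy to check numerically. A quick sketch using the numbers from the pill example:

```python
import numpy as np

signal = np.array([1, 2, 3, 4, 5])   # patients
kernel = np.array([3, 2, 1])         # plan

# Flipping and sliding the signal or the kernel yields the same response
assert np.array_equal(np.convolve(signal, kernel),
                      np.convolve(kernel, signal))
```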

### 1D – continuous

The next step is to extend \eqref{eq:Convolution_1DDiscrete} to the continuous case. This turns out to be pretty simple since we basically have to replace the sum with an integral:

Definition 2: Convolution [1D, continuous]

Let $$s(t)$$ and $$k(t)$$ denote a one-dimensional and continuous ($$t \in \mathbb{R}$$) signal as well as kernel, respectively. Then, applying the convolution operator $$*$$ on both functions

$$\label{eq:Convolution_1DContinuous} p(t) = s(t)*k(t) = \int_{-\infty}^{\infty} \! s(-\tau + t) \cdot k(\tau) \, \mathrm{d}\tau$$

yields the response function $$p(t)$$.

So, instead of adding up a discrete set of values, we integrate over the complete (relevant) range. This means that we are interested in the area underneath $$s(-\tau + t) \cdot k(\tau)$$ for each step $$t$$. However, the integral over the complete range $$]-\infty;\infty[$$ makes it really hard to express the resulting function analytically. It is often possible, though, if at least one of the functions is non-zero only on a limited range or the functions themselves are not very complicated. To look at an example, let's consider the increasing step function

\begin{equation*} s(t) = \begin{cases} 1 & t \geq 0 \\ 0 & t < 0 \end{cases} \end{equation*}

as a signal. In an image, this could for example model a sudden contrast increase, i.e. an edge. As the kernel, the difference of Gaussians function shall be used:

$$\label{eq:Convolution_KernelContinuous} k(t) = \operatorname{DoG}_{\sigma_1,\sigma_2}(t) = G_{\sigma_1}(t) - G_{\sigma_2}(t).$$

This is an approximation of the normalized version of the Laplacian of Gaussian function which is created by applying the Laplace $$\nabla^{2}$$ operator to a Gaussian function. In 2D images, this implements edge detection based on the second order derivatives. Since noise amplifies with increasing order of derivatives, it is often a good idea to smooth the signal first with a Gaussian. The Laplacian of Gaussian combines both operations (smoothing and derivative calculation)3. The following animation shows how the response of \eqref{eq:Convolution_1DContinuous} forms a short impulse response.

So, why does the response function look the way it does? Well, first note that the $$\operatorname{DoG}$$ function consists of a positive part surrounded by two negative parts. When sliding over the step function, it first overlaps with the right negative part. In this range, the product $$s(-\tau + t) \cdot k(\tau)$$ reduces to $$k(\tau)$$, since the step function equals 1 wherever it is non-zero. The integral inherits its sign from the kernel and is hence negative, so we get a negative response for this first part.

When sliding further, the positive part of the kernel function reaches $$t \geq 0$$. Now, we have a positive and a negative part “active”, which raises the response again (the two integral areas partly cancel each other). The response keeps increasing until the second negative part reaches the step function. After this point, the response decreases again due to the left negative part coming in. The response will approach (but never reach) 0, since the $$\operatorname{DoG}$$ function consists of two Gaussian functions which have non-zero values everywhere in $$\mathbb{R}$$.
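This behaviour can be reproduced by approximating the continuous convolution on a fine grid. The grid spacing and the two standard deviations below are my own choices, not values from the text:

```python
import numpy as np

def gaussian(t, sigma):
    return np.exp(-t**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

dt = 0.01                                    # grid spacing (assumption)
t = np.arange(-6, 6, dt)
step = (t >= 0).astype(float)                # the increasing step function
dog = gaussian(t, 1.0) - gaussian(t, 1.6)    # DoG kernel (sigmas assumed)

# Riemann-sum approximation of the continuous convolution
response = np.convolve(step, dog, mode='same') * dt

# Negative dip first, then a positive peak, then decay towards 0
assert response.min() < 0 < response.max()
assert np.argmin(response) < np.argmax(response)
```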

### 2D – discrete/continuous

It is now time to add an additional dimension so that we finally reach the image domain. This means that our functions now depend on two variables, e.g. $$x$$ and $$y$$. In the convolution formulas, this results in an additional sum or integral, respectively. The following table summarizes the convolution formulas for the 1D/2D and discrete/continuous cases.

|  | 1D | 2D |
| --- | --- | --- |
| Discrete | $$s(t)*k(t) = \sum_{\tau = -\infty}^{\infty} s(-\tau + t) \cdot k(\tau)$$ | $$s(x, y)*k(x, y) = \sum_{\tau = -\infty}^{\infty} \sum_{\kappa = -\infty}^{\infty} s(-\tau + x, -\kappa + y) \cdot k(\tau, \kappa)$$ |
| Continuous | $$s(t)*k(t) = \int_{-\infty}^{\infty} \! s(-\tau + t) \cdot k(\tau) \, \mathrm{d}\tau$$ | $$s(x, y)*k(x, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \! s(-\tau + x, -\kappa + y) \cdot k(\tau, \kappa) \, \mathrm{d}\tau \, \mathrm{d}\kappa$$ |

We now integrate over the complete relevant area around each point and “each point” means all points in a 2D grid. This sounds like an expensive operation (which it also sometimes is) but when the kernel size is small compared to the image size, it is manageable. Personally, I have never used the 2D continuous convolution formula before (just mentioned for completeness), so let's stick to the 2D discrete case from now on.

As already mentioned, convolution is a very important operation in computer vision4. The signal is most of the time just the image itself. The interesting part is the kernel, since it defines how the image is altered. The kernel is centred5 at the current pixel position and flipped, and then each element of the kernel gets multiplied with the corresponding pixel value in the image: the kernel centre with the current pixel and the surrounding kernel elements with the neighbours of the current pixel. This is what we called the area around a position. To make this point clear, consider the following signal matrix $$A$$ and the $$3 \times 3$$ kernel $$K$$

\begin{equation*} A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & {\color{Aquamarine}5} & 6 \\ 7 & 8 & 9 \\ \end{pmatrix} \quad \text{and} \quad K = \begin{pmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \\ \end{pmatrix}. \end{equation*}

For the highlighted position, this results in the following operation

\begin{equation*} (A*K)_{22} = 9 k_{11}+8 k_{12}+7 k_{13}+6 k_{21}+5 k_{22}+4 k_{23}+3 k_{31}+2 k_{32}+k_{33}. \end{equation*}

This shows how the flipping of the kernel works: basically, each index is mirrored independently around the centre, so that e.g. $$k_{13} \rightarrow k_{31}$$ or, to put it differently, the kernel matrix gets rotated by 180° (note that this is not the same as a transposition: e.g. $$k_{12} \rightarrow k_{32}$$, not $$k_{21}$$). Note that usually the kernel is flipped and not the image. This is possible due to the commutativity property (\eqref{eq:Convolution_Commutativity}) and in general a good idea since the kernel will usually be much smaller than the image.
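The flip can be checked numerically. In the following sketch, the concrete kernel values are hypothetical placeholders for $$k_{11}, \ldots, k_{33}$$:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
K = np.arange(1, 10).reshape(3, 3)   # hypothetical values for k11..k33

flipped = K[::-1, ::-1]              # mirror both indices: a 180° rotation
assert np.array_equal(flipped, np.rot90(K, 2))
assert not np.array_equal(flipped, K.T)   # flipping is not a transposition

# Response at the centre position (A*K)_22
center = np.sum(A * flipped)
print(center)   # 165 = 9*1 + 8*2 + 7*3 + ... + 1*9
```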

It is now time for a small example. Let's look at the following image $$I$$ and the kernel $$K_{\nabla^{2}}$$

\begin{equation*} I = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 0 & {\color{Aquamarine}1} & 2 & 0 \\ 0 & 0 & 1 & 0 \\ 3 & 0 & 0 & 0 \end{pmatrix} \quad \text{and} \quad K_{\nabla^{2}} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix} \end{equation*}

and see how we can calculate the convolution response for the highlighted position

\begin{alignat*}{4} 0 \cdot 0 & {}+{} & 1 \cdot 1 & {}+{} & 0 \cdot 1 & {}+{} & \\ 1 \cdot 0 & {}+{} & (-4) \cdot 1 & {}+{} & 1 \cdot 2 & {}+{} & \\ 0 \cdot 0 & {}+{} & 1 \cdot 0 & {}+{} & 0 \cdot 1 & {}{} & = {\color{Aquamarine}-1} = (I*K_{\nabla^{2}})_{22}. \end{alignat*}

Basically, we position the kernel at the target position in the image, multiply the two areas together and aggregate the result. If you compare these computations with the formula from the table, you might notice the similarity. To get the response function, the steps are repeated for every image pixel (this is the sliding of the kernel over the image) resulting in the following response image (assuming that every pixel value outside the image is zero6)

\begin{equation*} I * K_{\nabla^{2}} = \left( \begin{array}{cccc} 1 & -2 & -1 & 1 \\ 1 & {\color{Aquamarine}-1} & -5 & 2 \\ 3 & 2 & -2 & 1 \\ -12 & 3 & 1 & 0 \\ \end{array} \right). \end{equation*}
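These numbers can be reproduced with a small zero-padding implementation of the 2D discrete convolution. The helper `conv2d` below is my own sketch, not a library function:

```python
import numpy as np

def conv2d(image, kernel):
    """2D discrete convolution with zero padding (output has the input size)."""
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]   # rotate the kernel by 180°
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(image)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            # Centre the flipped kernel at (y, x) and aggregate the products
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * flipped)
    return out

I = np.array([[0, 1, 1, 0],
              [0, 1, 2, 0],
              [0, 0, 1, 0],
              [3, 0, 0, 0]])
K = np.array([[0,  1, 0],
              [1, -4, 1],
              [0,  1, 0]])
print(conv2d(I, K))
```

The printed matrix matches the response image above (the flip has no visible effect here since this particular kernel is symmetric under a 180° rotation).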

The previous computations are also visualized in the following figure which focuses on the kernel positioning and the sliding over the input image.

Since this article is assigned to the computer vision category, it would be a shame to not show at least one image example. Therefore, the following figure shows an example image which gets convolved by the same kernel as in the previous example.

Did you notice that the index $$\nabla^2$$ was used for the kernel? This is not a coincidence: the kernel $$K_{\nabla^2}$$ is precisely the 2D discrete version of the Laplace operator, and like in the 1D continuous case (approximated in the form of \eqref{eq:Convolution_KernelContinuous}) this filter responds most strongly to edges, e.g. visible at the whiskers of the cat.

There are many other filters and it is beyond the scope of this article to cover them all here. Instead, you can visit this website where you see other filters and examples which run directly in the browser. When you scroll down a bit, you can even test your own kernels interactively.

One final note, though. The convolution operation is also of special interest in the machine learning domain in the form of Convolutional Neural Networks (CNNs). The idea is to introduce a stack of layers in the network hierarchy, each basically performing the convolution operation on the previous input, e.g. the input image or the output of another convolutional layer. The trick is that the kernels of the layers are not predefined but learned instead. That is, based on a set of labelled examples and a target function, the network adapts the kernel weights (the elements $$k_{ij}$$) automatically. This way the kernels fit very well to the application domain without the need for manual feature engineering.7

Footnotes:

1. Don't worry if this description sounds confusing. The goal is that you see in a moment that the convolution operation is perfectly suited for this problem.
2. Precisely speaking, the function $$k(\tau)$$ does not expect the treatment day $$\tau$$ but rather its index $$\tau - 1$$ as argument (first non-zero value at $$\tau = 0$$). This is not necessary in general and an application-dependent choice instead. Here, it has the effect that we yield 0 total pills for the day $$t=0$$ which sounds reasonable for our problem.
3. It also resembles the edge detection which our retina ganglion cells perform.
4. However, it is sometimes termed differently as linear filtering with the kernel being the filter.
5. This is an implementation detail not taken into account by the mathematical definition. You would need to manually add some constant to centre your kernel, e.g. $$+ 1$$ for a $$3 \times 3$$ kernel in the 2D discrete case: $$s(x, y)*k(x, y) = \sum_{\tau = 0}^{n-1} \sum_{\kappa = 0}^{n-1} s(-\tau + x, -\kappa + y) \cdot k(\tau + 1, \kappa + 1)$$.
6. This is one case of the general border handling problem. See this Wikipedia section or page 113 of the book Digital Image Processing (second edition) by Burger and Burge for further details.
7. If you are interested and want to learn more about this topic, then I can recommend the last chapter of the book Neural Networks and Deep Learning by Michael A. Nielsen.