Notes on the formalism

This page collects various things which contribute, all together, to tracing the picture of how to formalise images data into quantifiable entities, and the mathematics behind some building blocks of doing computer vision. An image is nothing else than a matrix of values, single numbers in the case of grayscale ones and arrays of 3 numbers in the case of colour images in something like the RGB space, for instance.


A pixel value is given in terms of colour or intensity. Intensity is used in grayscale, it identifies the brightness of pixels and has a value between 0 and 255. For colour you have, for example, the values RGB (a triple) for red, green and blue, each reporting their respective intensities.

The alpha channel

In a 32-bit graphic system, 8 bits are used to encode each of the three colours (RGB) and 8 further bits are used for the alpha channel, representing transparency: specifies how colours should be merged when overlapped.

Colour space and colour model

A colour space describes the organisation of colours, a colour model is a way to represent colours as tuples of numbers. For example, Adobe RGB and sRGB are two different colour spaces, both based on the RGB colour model.

Colour model

Figure from Wikipedia, user Datumizer, licence CC BY-SA 3.0

There are five major colour models:

  • CIE (1931, was created by the International Commission on Illumination, or CIE, from its French name): was the first attempt to link the wavelengths (pure colours) to the colours as perceived by humans. It uses the tristimulus values: the human eye has three kinds of cone cells, each of which has the peak of sensitivity for light at a given wavelength, so three parameters can be used, corresponding to the levels of the stimuli to the three types of cells.

  • RGB (red, green, blue): describes what light produces a given colour. Several colour spaces can be derived from this model.

  • YUV (luma plus chroma): it is built with a luma (brightness, achromatic) value and two chroma (colour information) values. YPbPr is its scaled version and YCbCr is its scaled digital version.

  • HSV, also known as HSB (hue, saturation, value/brightness): is a coordinate transformation of RGB (a cube) into a cylindrical space. Note that there is also HSL (L for lightness), which is similar. The HSV space is a cylinder:H[0,359],S,V[0,100]H \in [0,359], S,V \in [0,100]. The HSV space has been designed to represent colour properties in a way that is more in adherence to the human eye perception and is particularly useful in cases where illumination matters. The hue of a colour is the base (dominant) colour that composes it, or better, the degree to which our colour is far away from the basic hues (yellow, orange, red, violet, blue, green); the saturation of a colour is its intensity; the value of a colour represents its lightness/darkness.

  • CMYK (cyan, magenta, yellow, key -black-): it is used in printing, describes what inks need to be used so that the reflected light produces the given colour.

Colour space

Colour spaces are:

  • LMS (long/medium/short), where long, medium and short refer to the wavelengths of peak sensitivity of cone cells: L for long (in the blue), M for medium (in the green) and S for short (in the red);

  • XYZ (tristimulus): humans perceive light in the green (medium) part of the electromagnetic spectrum as brighter than those in the red (long) and the blue (short) parts of the spectrum. Y is the luminance; Z the blue stimulation; X is a linear combination of the cone responses. So, at fixed X, the plane XZ contains all the cromaticities at that luminance. The cromaticity is the quality of colour regardless of luminance and is given by hue and saturation. This colour space encodes all colours that a typical human (with no colour deficiencies) can see.

Note that we can pass from the LMS space to the XYZ one by virtue of a transformation, which surfaces the fact that Z is equivalent to S, Y is a linear combination of L and M, and X is a linear combination of all three.

Pixel connectivity

Pixel connectivity is the way pixels relate to neighbours, and there's two ways we can define it:

  • 4-line: each pixel is connected to all those which touch one of their edges

  • 8-line: each pixels is connected to all those which touch their edges and corners

Moments of an image

What are

Image moments are weighted averages of the pixels' intensities. They share a similarity to the definition of moments in probability, where the intensity plays the role of the probability density function.

For a grayscale image a raw moment is defined as

Mij=xyxiyjI(x,y) ,M_{ij} = \sum_x \sum_y x^i y^j I(x, y) \ ,

where(x,y)(x, y)is a cell of the image and I its intensity;i+ji+jgives the order of the moment

Central moments are

μij=xy(xxˉ)i(yyˉ)jI(x,y) ,\mu_{ij} = \sum_x \sum_y (x-\bar x)^i (y-\bar y)^j I(x, y) \ ,

where xˉ=M10M00\bar x = \frac{M{10}}{M{00}}andyˉ=M01M00\bar y = \frac{M{01}}{M{00}}.

It can be derived that

μpq=mpnq(pm)(qn)(xˉ)pm(yˉ)qnMmn\mu_{pq} = \sum_m^p \sum_n^q \binom{p}{m} \binom{q}{n} (- \bar{x})^{p-m} (- \bar{y})^{q - n} M_{mn}

Some interesting results are:

  • central moments are invariant with respect to translation;

  • the second order moments define the orientation of the image;

  • for an analogy to physics, the 0-th moment has the same role as the mass of the object, the first moments are analogous to the center of mass and the second moments to the moments of inertia;

  • if the intensity is considered as a density, so that M00=1M_{00} = 1, the first moments M10M_{10}andM01M_{01}are the mean values in each coordinate, the second moments are the variances of the horizontal and vertical projections,M11M_{11}is their covariance.

Object features based on moments

Moments are used to calculate features of the objects displayed in an image. Let us refer here to a binary image (see page).

Zeroth moment: area of object

The area of an object is directly linked to the 0-th raw moment. In fact, the area is simply the sum of the 1's, that is, the total number of white pixels:

M00=xyx0y0I(x,y)=xyI(x,y) ,M_{00} = \sum_x \sum_y x^0 y^0 I(x, y) = \sum_x \sum_y I(x, y) \ ,

First moments: Center of mass

The center of mass is given by, in each coordinate, by

M10=dxdy x I(x,y)  ;  M01=dxdy y I(x,y)M_{10} = \iint d x d y \ x \ I(x, y) \ \ ; \ \ M_{01} = \iint d x d y \ y \ I(x, y)

Second moments: Inertia tensor, orientation, roundness and eccentricity

In analogy to mechanical moments, the central second order image moments μ20\mu_{20}, μ02\mu_{02}andμ11\mu_{11}contain terms in whichρ(x,y)\rho(x, y)is multiplied by the square of the distance from the center of mass. They compose the inertia tensor of the rotation of the object around its center of gravity:

J=[μ20μ11μ11μ02]J = \begin{bmatrix} \mu_{20} & -\mu_{11} \\ \mu_{11} & \mu_{02} \end{bmatrix}

From this, several parameters can be derived.

The eigenvalues of J:

λ1,2=12(μ20+μ02)±4μ112(μ20μ02)2\lambda_{1,2} = \sqrt{\frac{1}{2} (\mu_{20} + \mu_{02}) \pm \sqrt{4 \mu_{11}^2 - (\mu_{20} - \mu_{02})^2}}

give the main inertial axes of the rotation, which correspond to the semi-major and semi-minor axes of the ellipse which can be used as an approximation of the object.

The orientation of the object-ellipse is the angleθ\thetabetween the x axis and the axis around which the object can be rotated with minimal inertia (the direction of the major semi-axis a). It corresponds to the eigenvector with minimal eigenvalue:

θ=12arctan2μ11μ20μ02\theta = \frac{1}{2} \arctan{\frac{2 \mu_{11}}{\mu_{20} - \mu_{02}}}

The roundness is defined as

k=p22πA\mathcal{k} = \frac{p^2}{2 \pi A}

where p is the perimeter of the object and A its area. It is 1 for a circle and greater than 1 for other objects.

The eccentricity is defined as

ϵ=a2b2a=(μ20μ02)24μ112μ20+μ02  0ϵ1\epsilon = \frac{\sqrt{a^2 - b^2}}{a} = \frac{(\mu_{20} - \mu_{02})^2 - 4 \mu_{11}^2}{\mu_{20} + \mu_{02}} \ \ 0 \leq \epsilon \leq 1

Scale invariance and Hu invariants

Givenf(x,y)f'(x, y), a function of a new image scaled by factorλ\lambda, so thatf(x,y)=f(xλ,yλ)f'(x, y) = f(\frac{x}{\lambda}, \frac{y}{\lambda}), if we rescale in such a way thatx=xλx' = \frac{x}{\lambda}andy=yλy' = \frac{y}{\lambda}, we havedx=λdxdx = \lambda dx'and dy=λdydy = \lambda dy', so

Setting the total area to 1, thenμ00=λ2μ00=1\mu'_{00} = \lambda^2 \mu_{00} = 1, soλ=μ001/2\lambda = \mu_{00}^{-1/2}. The scale invariant (invariant by both scale and translation) is

npq=1μ00p+q+22μpq .n_{pq} = \frac{1}{\mu_{00}^{\frac{p+q+2}{2}}} \mu_{pq} \ .

Hu in his paper has calculated invariants to translation, scale and rotation, called the Hu invariants.

Image reconstruction

An image can be reconstructed from its moments, when known. Assuming that all moments MpqM_{pq} of a functionf(x,y)f(x, y)and of orderN=p+qN = p+qare known up to orderNmaxN_{max}, it is possible to obtain function g(x,y)g(x, y)whose moments match those of the original function up to orderNmaxN_{max}:

g(x,y)=g00+g10x+g01y+g20x2+g11xy++gpqxpyqg(x, y) = g_{00} + g_{10}x + g_{01}y + g_{20}x^2 + g_{11}xy + \ldots + g_{pq}x^py^q

That is,

g(x,y)=pqgpqxpyq ,  Nmax=p+qg(x, y) = \sum_p \sum_q g_{pq} x^p y^q \ , \ \ N_{max} = p + q

Assuming that the image is a continuous function bounded asx[1,1]x \in [-1, 1],y[1,1]y \in [-1, 1] , then

1111dxdyg(x,y)xpyq=Mpq\int \limits_{-1}^1 \int \limits_{-1}^1 dx dy g(x, y) x^p y^q = M_{pq}

and substituting the expansion above gives a set of equations which can be solved for the gpqg_{pq} in terms ofMpqM_{pq}.

Operators on an image

The Sobel operator

The Sobel (or Sobel-Feldman) operator is an operator used as a filter to create an image with emphasised edges. It is a discrete differentiation operator: computes an approximation of the gradient of the image density function .

The algorithm consists in applying a convolution to the image with a filter in both directions. Two kernels, KxK_xandKyK_y, are convolved with the image matrix to calculate approximations of the derivative in the two directions:

Kx=[101202101],   Ky=[121000121]K_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} , \ \ \ K_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}

The convolution is applied at each point of the image, so that the matrix convolved with the kernel is the 3×33 \times 3 matrix centered on the point under consideration.

KxK_xandKyK_yrepresent the change in thexxandyydirections, respectively. They can both decomposed as the product of an averaging and a differentiation kernels, so that for instance

Kx=[121][101]K_x = \begin{bmatrix} 1\\ 2\\ 1 \end{bmatrix} \begin{bmatrix} -1 & 0 & 1 \end{bmatrix}

This means they compute the gradient with smoothing.

The gradient magnitude

K=Kx2+Ky2K = \sqrt{K_x^2 + K_y^2}

is computed at each point of the image, as well as the gradient direction

θ=arctan(Ky,Kx) .\theta = \arctan{(K_y, K_x)} \ .


  1. M K Hu, Visual pattern recognition by moment invariants, IRE transactions on information theory, 8.2, 1962

  2. I Sobel, An isotropic 3×3 image gradient operator, Machine Vision for three-dimensional Sciences, 1990