Canberra distance Print
Algorithms - Similarity
Written by Jan Schulz   
Thursday, 13 September 2007 08:21

Canberra distance

 

 

Objective

The Canberra distance is a metric function often used for data scattered around an origin. It was introduced in 1966 (Lance & Williams 1966) and is today mainly used in the form of 1967 (Lance & Williams 1967).

Equation

The Canberra metric is similar to the Manhattan distance (which itself is a special form of the Minkowski distance). The distinction is that the absolute difference between the variables of the two objects is divided by the sum of the absolute variable values prior to summing. The generalised equation is given in the form:

This is a slightly modified form compared to the original form given by Lance & Williams (1966) and was suggested by Adkins (reference in Lance & Williams 1967). In the equation dCAD is the Canberra distance between the two objects i and j, k is the index of a variable and n is the total number of variables y.

Synonyms

No common synonyms are established for the Canberra metric.

Usage

In the original form of the Canberra metric data must not be signed. The modified form according to Adkins (in Lance & Williams 1967) has the property that the result becomes unity when the variables are of opposite sign. It is useful in the special case where signs represent differences in kind rather than in degree (Lance & Williams 1967). Anyhow, it is mainly used for values > 0. This metric is easily biased for measures around the origin and very sensitive for values close to 0, where it is more sensitive to proportional than to absolute differences (Lance & Williams 1967). This feature becomes more apparent in higher dimensional space, respectively an increasing number of variables. It is in turn less influenced than the Manhattan distance by variables with high values (Krebs 1989). As a very sensitive measurement it is applicable to identify deviations from normal readings (e.g. Emran & Ye 2001).

Algorithm

The algorithm controls whether the data input matrix is rectangular or not. If not the function returns FALSE and a defined, but empty output matrix. When the matrix is rectangular the Canberra distance is calculated. Therefore the dimensions of the respective arrays of the output matrix and the titles for the rows and columns set. As the result is a square matrix, which is mirrored along the diagonal only values for one triangular part and the diagonal are computed. When errors occur during computation the function returns FALSE.

Source

Function dist_Canberra (InputMatrix : t2dVariantArrayDouble; Var OutputMatrix : t2dVariantArrayDouble) : Boolean;
// The function dist_CalcCanberraMatrix calculates the Canberra dissimilarity
// matrix between several cases, which are expected in the rows. The variables are
// expected in the columns. Function returns FALSE if at least one cell can not be
// calculated. The result matrix is returned in OutputMatrix.
// (c) Dr. Jan Schulz, 25.April 2007; www.code10.info
Var InputCols : Integer;
InputRows : Integer;
OutputMatrixSize : Integer;
RunnerY : Integer;
RunnerX : Integer;
Numerator : Double;
Denominator : Double;
i : Integer;
FirstVal : Double;
Summed : Double;
NoOfValues : Integer;
SecondVal : Double;
Dissimilarity : Double;
Begin

// if one dimension is zero
If Not mtx_IsRectangular (InputMatrix, InputRows, InputCols) THen
Begin
// create an empty matrix, return FALSE and exit
mtx_Create (OutputMatrix, 1, 1, 0, 'Erroneous Canberra distance matrix');
dist_Canberra := False;
Exit;
end;

// create an output matrix of required size
mtx_Create (OutputMatrix, InputRows, InputRows, NaN, 'Canberra distance matrix');

//copy the respective titles
For RunnerY := Low (InputMatrix.RowTitle) to High (InputMatrix.RowTitle) do
Begin
// names for rows and columns are the same in this triangualary matrix
OutputMatrix.RowTitle [RunnerY] := InputMatrix.RowTitle [RunnerY];
OutputMatrix.ColTitle [RunnerY] := InputMatrix.RowTitle [RunnerY];
end;


// let's expect the best case ...
dist_Canberra := True;

// compare every object
For RunnerY := Low (OutputMatrix.Cells) to High (OutputMatrix.Cells) do
Begin
// with every other
For RunnerX := Low (OutputMatrix.Cells) to RunnerY do
Begin
NoOfValues := 0;
Summed := 0;

// include all variables in analysis
For i := 0 to High (InputMatrix.Cells [0]) do
Begin
FirstVal := InputMatrix.Cells [RunnerX, i];
SecondVal := InputMatrix.Cells [RunnerY, i];

// no 'Not A Number' at all?
If Not (IsNAN (FirstVal) Or IsNAN (SecondVal)) THen
Begin
Numerator := Abs (FirstVal - SecondVal);
Denominator := Abs (FirstVal) + Abs (SecondVal);

If Denominator <> 0 THen
Begin
// increase number of variable pairs and add the term
Inc (NoOfValues);
Summed := Summed + (Numerator / Denominator);
end
Else
Begin
// can not calculate as denominator is zero
dist_Canberra := False;
end;
end
Else
Begin
// can not calculate as invalid numbers were found
dist_Canberra := False;
end;
end;

// at least one valid pair of variables found?
If NoOfValues > 0 THen
Begin
Dissimilarity := Summed;
end
Else
Begin
Dissimilarity := NAN;
end;

// set the value on both sides of the diagonal or diagonal itself
OutputMatrix.Cells [RunnerX, RunnerY] := Dissimilarity;
OutputMatrix.Cells [RunnerY, RunnerX] := Dissimilarity;
end;
end;
end;
 

Example

For a data matrix aInputMatrix of the type t2dVariantArrayDouble, populated with:

Data

Var1

Var2

Var3

Case1

1

1

1

Case2

1

1

0

Case3

2

2

2

Case4

10

10

10

Case5

11

11

11

Case6

10

5

0

the call of:

aBooleanVar := dist_Canberra (aInputMatrix, aOutputMatrix);

returns the respective Canberra distance matrix in aOutputMatrix:

Canberra

distance

Case1

Case2

Case3

Case4

Case5

Case6

Case1

0

1.000

1.000

2.455

2.500

2.485

Case2

1.000

0

1.667

2.636

2.667

1.485

Case3

1.000

1.667

0

2.000

2.077

2.095

Case4

2.455

2.636

2.000

0

0.143

1.333

Case5

2.500

2.667

2.077

0.143

0

1.423

Case6

2.485

1.485

2.095

1.333

1.423

0

Although the Euclidean distance between the objects Case1 and Case3 is the same as between Case4 and Case5, the Canberra distance indicates a higher distance between the objects Case1 and Case3. This is due to the fact that the analysis is more sensitive for values closer to the origin. The distance between Case2 and Case6 is shorter, as the joint zeroes in variable 3 are omitted.

Literature

Emran S.M., Ye N. (2001): Robustness of Canberra metric in computer intrusion detection. Proceedings of the 2001 IEEE, Workshop on Information Assurance and Security, United States Military Academy, West Point, New York, 5-6 June, 2001.

Krebs C.J. (1989): Ecological Methodology. Harper-Collins, New-York.

Lance G.N., Williams W.T. (1966): Computer programs for hierarchical polythetic classification (“similarity analysis”). Computer Journal 9:60-64.

Lance G.N., Williams W.T. (1967): Mixed-data classificatory programs, I.) Agglomerative Systems. Australian Computer Journal 1:15-20.

 

Last Updated on Friday, 18 March 2011 18:16