Jaccard similarity
Objective
The Jaccard similarity (Jaccard 1902, Jaccard 1912) is a common index for binary variables. It is defined as the quotient between the intersection and the union of the pairwise compared variables among two objects.
Equation
In the equation dJAD is the Jaccard distance between the objects i and j. For two data records with n binary variables y the variable index k ranges from 0 to n-1. Four different combinations between yi,k and yj,k can be distinguished when comparing binary variables. These combinations are (0/0), (0/1), (1/0) and (1/1). The sums of these combinations can be grouped by:
- J01: the total number of variables being 0 in yi and 1 in yj.
- J10: the total number of variables being 1 in yi and 0 in yj.
- J11: the total number of variables being 1 in both yi and yj.
- J00: the total number of variables being 0 in both yi and yj.
As each paired variable belongs to one of these groups it can be easily seen that:
J00 + J01 + J10 + J11 = n
As the Jaccard similarity is based on joint presence, J00 is discarded.
The Jaccard dissimilarity is defined as dJAD = 1- dJAS.
In some cases the Jaccard similarity is computed as dJAS=2dBCD/(1+dBCD), where dBCD is the Bray–Curtis dissimilarity. This equation does not reduce values to binary states. Thus, results are different when using on the one hand a presence/absence matrix and on the other hand a count matrix. The results are the same, when the count matrix is converted to a binary matrix beforehand.
Synonyms
The Jaccard similarity or Jaccard similarity coefficient is often called Jaccard index. Anyhow, the term Jaccard index is sometimes used for the Jaccard dissimilarity, while the Jaccard dissimilarity is sometimes called Jaccard distance. It can be observed that the terms Jaccard similarity and Jaccard dissimilarity are not precisely separated and sometimes seem to be used synonymical or confused, although results represent opposite meanings. Thus, one should carefully inspect the intention of the analysis.
Usage
The Jaccard similarity can be used, when intersted in binary differences between two or more objects. Especially in ecological research investigations often focus on the presence/absence between several sites. When interested in characterising compared sites by the possibility of species to settle there abundances are often negligible.
Algorithm
The algorithm controls whether the data input matrix is rectangular or not. If not the function returns FALSE and a defined, but empty output matrix. When the matrix is rectangular the Jaccard similarity will be calculated. Therefore the dimensions of the respective arrays of the output matrix are set, and the titles for the rows and columns set. As the result is a square matrix, which is mirrored along the diagonal only values for one triangular part and the diagonal are computed. When errors occur during computation the function returns FALSE.
For practical reasons the implementation of the algorithm does not necessarily need true binary data. It distinguishes whether a value is 0 or within a certain threshold close to it. In this case it will be interpreted as logical FALSE, e.g. absence. Values being larger than the given threshold are interpreted as logical TRUE, e.g. presence. Thus, it is possible without further preparation to pass a count matrix to the function. As the given threshold affects all values equally it does not alter its metric characteristic.
To calculate the Jaccard dissimilarity the Jaccard similarity matrix is computed first and thereafter transformed.
Source
Function dist_JaccardSimilarity (InputMatrix : t2dVariantArrayDouble; aZeroThresh: Double; Var OutputMatrix : t2dVariantArrayDouble) : Boolean; // The function dist_JaccardSimilarity calculates the Jaccard similarity between // several cases based on binary values (e.g. presences/absences or TRUE/FALSE). // The variable aZeroThresh defines the minumum value, that is required to // interpret a variable as being logically TRUE. // The cases are expected in the rows. The variables are expected // in the columns. Function returns FALSE if at least one cell can not be // calculated. The result matrix is returned in OutputMatrix. // (c) Dr. Jan Schulz, 05.May 2008; www.code10.info Var OutputMatrixSize : Integer; InputCols : Integer; InputRows : Integer; RunnerY : Integer; RunnerX : Integer; J01 : Integer; J10 : Integer; J11 : Integer; i : Integer; Denominator : Integer; Quotient : Double; FirstVal : Boolean; SecondVal : Boolean; Begin // if one dimension is zero or matrix not rectangular If Not mtx_IsRectangular (InputMatrix, InputRows, InputCols) THen Begin //create an empty matrix, return FALSE and exit mtx_Create (OutputMatrix, 1, 1, 0, 'Erroneous Jaccard similarity matrix'); dist_JaccardSimilarity := False; Exit; end;
// Threshold for binary level needs to be positive aZeroThresh := Abs (aZeroThresh);
// let's expect the best case ... dist_JaccardSimilarity := True;
// define and set the row dimension of the result matrix OutputMatrixSize := High (InputMatrix.Cells) + 1; SetLength (OutputMatrix.Cells, OutputMatrixSize);
// create the column dimension of the array For RunnerY := Low (OutputMatrix.Cells) to High (Outputmatrix.Cells) do Begin SetLength (OutputMatrix.Cells [RunnerY], OutputMatrixSize); end;
// define title of matrix OutputMatrix.MatrixName := 'Jaccard similarity matrix';
// Set Row/Col-Title for the new matrix SetLength (OutputMatrix.RowTitle, OutputMatrixSize); SetLength (OutPutMatrix.ColTitle, OutPutMatrixSize);
For RunnerY := Low (InputMatrix.RowTitle) to High (InputMatrix.RowTitle) do Begin // names for rows and columns are the same in this triangualary matrix OutputMatrix.RowTitle [RunnerY] := InputMatrix.RowTitle [RunnerY]; OutputMatrix.ColTitle [RunnerY] := InputMatrix.RowTitle [RunnerY]; end;
// compare every object For RunnerY := Low (OutputMatrix.Cells) to High (OutputMatrix.Cells) do Begin //with every other object For RunnerX := Low (OutputMatrix.Cells) to RunnerY do Begin J01 := 0; J10 := 0; J11 := 0;
//include all variables in analysis For i := 0 to High (InputMatrix.Cells [0]) do Begin // directly convert to binary variables FirstVal := Abs (InputMatrix.Cells [RunnerX, i]) > aZeroThresh; SecondVal := Abs (InputMatrix.Cells [RunnerY, i]) > aZeroThresh;
If FirstVal And SecondVal THen Begin J11 := J11 + 1; end Else Begin If FirstVal Then J10 := J10 + 1; If SecondVal Then J01 := J01 + 1; end; end;
Denominator := J01 + J10 + J11; If Denominator > 0 THen Quotient := J11 / Denominator Else Begin Quotient := NaN; dist_JaccardSimilarity := False; end;
// set the calculated value on both sides of diagonal and diagonal itself OutputMatrix.Cells [RunnerX, RunnerY] := Quotient; OutputMatrix.Cells [RunnerY, RunnerX] := Quotient; end; end; end;
Function dist_JaccardDissimilarity (InputMatrix : t2dVariantArrayDouble; aZeroThresh: Double; Var OutputMatrix : t2dVariantArrayDouble) : Boolean; // The function dist_JaccardDissimilarity calculates the Jaccard dissimilarity // matrix between several cases, which are expected in the rows. The variables are // expected in the columns. Function returns FALSE if at least one cell can not be // calculated. The result matrix is returned in OutputMatrix. This function depends // on the function CalcCanberraDissimilarityMatrix // (c) Dr. Jan Schulz, 05.May 2008; www.code10.info Var RunnerX : Integer; RunnerY : Integer; Begin // calculate the Canberra Dissimilarity matrix Result := dist_JaccardSimilarity (InputMatrix, aZeroThresh, OutputMatrix);
// convert the similarity into a dissimilarity matrix For RunnerY := Low (OutputMatrix.Cells) to High (OutputMatrix.Cells) do Begin For RunnerX := Low (OutputMatrix.Cells [RunnerY]) to High (OutputMatrix.Cells [RunnerY]) do Begin OutPutMatrix.Cells [RunnerY, RunnerX] := 1 - OutPutMatrix.Cells [RunnerY, RunnerX]; end; end;
If Result THen OutputMatrix.MatrixName := 'Jaccard dissimilarity matrix' Else OutputMatrix.MatrixName := 'Erroneous Jaccard dissimilarity matrix'; end;
Example
For a data matrix aInputMatrix of the type t2dVariantArrayDouble, populated with:
Data
|
Var1
|
Var2
|
Var3
|
Case1
|
1
|
1
|
1
|
Case2
|
1
|
1
|
0
|
Case3
|
2
|
2
|
2
|
Case4
|
10
|
10
|
10
|
Case5
|
11
|
11
|
11
|
Case6
|
10
|
5
|
0
|
the call of:
aBooleanVar := dist_JaccardSimilarity (aInputMatrix, aOutputMatrix);
returns the respective Jaccard similarity matrix in aOutputMatrix:
Jaccard
similarity
|
Case1
|
Case2
|
Case3
|
Case4
|
Case5
|
Case6
|
Case1
|
1
|
0.667
|
1
|
1
|
1
|
0.667
|
Case2
|
0.667
|
1
|
0.667
|
0.667
|
0.667
|
1
|
Case3
|
1
|
0.667
|
1
|
1
|
1
|
0.667
|
Case4
|
1
|
0.667
|
1
|
1
|
1
|
0.667
|
Case5
|
1
|
0.667
|
1
|
1
|
1
|
0.667
|
Case6
|
0.667
|
1
|
0.667
|
0.667
|
0.667
|
1
|
Based on the binary analysis it can be seen that only objects having a zero value among their variables show a lower similarity. This zero is interpreted as logical FALSE. All other objects are completely similar in this analysis, as their values are interpreted as logical TRUE.
Literature
Jaccard P. (1902): Lois de distribution florale. Bulletin de la Socíeté Vaudoise des Sciences Naturelles 38:67-130.
Jaccard P. (1912): The distribution of the flora in the alpine zone. New Phytologist 11(2):37-50.
|