Tools for Quantitative Archaeology

DIST: Calculate Similarity and Distance Measures

Using one of a number of functions, program DIST computes a similarity or distance matrix for a set of count, presence/absence, or attribute state data. The program reads an input data matrix composed of integer values for some number of variables for each observation. The program is limited to 180 observations and a total of about 16,000 data values in the input data matrix (91 variables for the maximum number of cases). The program prints the matrix and writes it (without the diagonal) to a file that can be used in further analysis, such as multidimensional scaling.

PROGRAM OPERATION

To start the program type: DIST<Enter>

You can interrupt the program at any time by typing <Ctrl><Break>. The program prompts are generally self-explanatory. Conventions used in the prompts are described in detail in the section entitled "Program Conventions", and the specific prompts are discussed below.

DESCRIPTION OF PROGRAM PROMPTS

File Name for Input Counts {.ADF} ?

Name (or path) of the Antana format file with the input data or CON to read data from the keyboard. Note: To get data out of SYSTAT to use with DIST you can either use SYSTAT to write out the file in ASCII form or you can use my CONVSYS program which will read a SYSTAT file and write out an Antana format file ready for DIST to use.

Title ?

Title used in labeling the output. It may be omitted.

Coefficients Computed:
1=Simple Matching +/- (a+d)/(a+b+c+d)
2=Jaccard +/- a/(a+b+c)
3=Gower Coefficient (Matches/Comparisons)
4=Gower Coefficient (Σ|x1-x2|/Range)/Comparisons
5=Euclidean Distance (Input Data)
6=Euclidean Distance (Proportion-Program Calculates Proportion)
7=Brainerd Robinson Coefficient (Program Calculates Percents)
8=Pearson's r - (beware)
9=Richness abs(R1-R2)/Nvar
10=Joint Presence: a
11=Joint Presence: Z
Enter Number of Desired Measure ?

This prompt is asking you to choose a measure. The Simple Matching Coefficient, Jaccard's Coefficient, the Gower Coefficient, Euclidean distance, and Pearson's r are standard measures that are described in Sneath and Sokal (1973) and in many other places. Note that Euclidean distance with counts does not standardize the counts, and as a consequence will generally not provide an appropriate measure when counts vary widely from variable to variable.

The Simple Matching and Jaccard's Coefficients are similarity measures defined for presence/absence data. The program expects presence/absence data to be coded with 0=absent and 1=present. For convenience, the program will automatically treat any positive count as presence, but it requires 0 for absence. Both measures count the number of variables for which the two observations have a=joint presences, d=joint absences, b=present in case 1 and absent in case 2, and c=present in case 2 and absent in case 1. These measures range from 0 (dissimilar) to 1 (similar). If either case is coded with a negative number, the comparison is ignored (in both the numerator and the denominator).
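The tallying of a, b, c, and d described above can be sketched in Python as follows. This is an illustration only, not the DIST source code; the function names are hypothetical, and returning -1 when the denominator is zero is an assumption (in particular, how DIST handles a Jaccard comparison consisting only of joint absences is not stated here).

```python
def binary_counts(row1, row2):
    """Tally a, b, c, d over variables, skipping pairs with a negative code."""
    a = b = c = d = 0
    for x, y in zip(row1, row2):
        if x < 0 or y < 0:
            continue  # negative value: comparison is ignored entirely
        p1, p2 = x > 0, y > 0  # any positive count is treated as presence
        if p1 and p2:
            a += 1  # joint presence
        elif p1:
            b += 1  # present in case 1 only
        elif p2:
            c += 1  # present in case 2 only
        else:
            d += 1  # joint absence
    return a, b, c, d


def simple_matching(row1, row2):
    a, b, c, d = binary_counts(row1, row2)
    n = a + b + c + d
    return (a + d) / n if n else -1.0  # assumption: -1 flags "no valid comparisons"


def jaccard(row1, row2):
    a, b, c, _ = binary_counts(row1, row2)
    n = a + b + c
    return a / n if n else -1.0  # assumption: -1 when the denominator is zero
```

For example, the rows [1, 0, 3, 0, -1] and [1, 1, 0, 0, 1] give a=b=c=d=1 (the last variable is skipped), so the Simple Matching Coefficient is 2/4 and Jaccard's Coefficient is 1/3.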

The Gower Coefficient is a similarity measure used to evaluate similarity between items with nominal, ordinal, or presence/absence data, or some combination of these variable types. The program expects attribute states to be coded as non-negative integers, where each integer represents a distinct attribute state. The value -1 (actually any negative number) is a special attribute state, usually meaning not relevant, that indicates that any comparison with that value should not be considered. (The generalized Gower coefficient that also combines ratio-scale data is described below.) The Gower Coefficient is defined to be the number of attributes for which the two cases have identical values (other than -1) divided by the number of attributes for which both cases have non-negative values (i.e., the number of matches divided by the number of valid comparisons). For solely presence/absence data this measure is equivalent to the simple matching coefficient.

Jaccard, SMC, and Gower (ungeneralized) require integer input and support the convention that a comparison with a negative number is ignored. A warning is printed if there are no valid comparisons between cases. In this case a similarity or distance cannot be computed and the distance is printed as -1.
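As a rough illustration of the matches/comparisons definition and the -1 flag, the ungeneralized Gower coefficient might be computed as in the following Python sketch. The function name is hypothetical and this is not the DIST source code.

```python
def gower_matches(row1, row2):
    """Matches divided by valid comparisons for coded attribute states."""
    comparisons = 0
    matches = 0
    for x, y in zip(row1, row2):
        if x < 0 or y < 0:
            continue  # a negative state on either case: not a valid comparison
        comparisons += 1
        if x == y:
            matches += 1
    if comparisons == 0:
        return -1.0  # no valid comparisons: reported as -1
    return matches / comparisons
```

With rows [2, 0, 5, -1] and [2, 1, 5, 3], three comparisons are valid and two of them match, giving 2/3.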

The generalized Gower coefficient evaluates presence/absence, ordinal, and nominal data in the manner described above. However, the similarity of two cases on continuous variables is incorporated in the analysis as 1.0 minus the absolute value of the numerical difference in the values divided by the variable's range. A separate file specifies how each variable is to be treated. It specifies a variable type implicitly, through a range (min,max) of valid values for each variable. A 0,0 range is used for +/-, nominal, or ordinal data, in which case exact matches are counted as 1; any other valid comparison yields 0. If any other range x,y (note x>=0 and x<y) is given, each comparison between cases contributes 1 minus the absolute value of the difference in values for the variable divided by the range of the variable, yielding a contribution between 0 (opposite ends of the range) and 1 (identical values) for each variable comparison. The sum of these contributions for all variables (of any combination of types) is divided by the number of comparisons, yielding an index with a [0,1] range. Note that variable types can be mixed within a single analysis. Data and ranges may be integer or real; however, any negative values are taken to indicate invalid comparisons.

For the generalized Gower coefficient, the file specifying variable ranges is in ADF format. The number of rows is the number of variables to be described, and the two columns represent the minimum and maximum, respectively, for each of the variables in the original data set.
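The way the range file drives the calculation can be sketched in Python as below. This is a minimal illustration under stated assumptions: the function name is hypothetical, the ranges are passed as a list of (min, max) pairs rather than read from an ADF file, and -1 is returned when no comparison is valid.

```python
def generalized_gower(row1, row2, ranges):
    """Mixed-type Gower similarity; ranges holds one (min, max) pair per variable."""
    total = 0.0
    comparisons = 0
    for x, y, (lo, hi) in zip(row1, row2, ranges):
        if x < 0 or y < 0:
            continue  # negative value: invalid comparison
        comparisons += 1
        if lo == 0 and hi == 0:
            # 0,0 range: +/-, nominal, or ordinal data, exact match counts as 1
            total += 1.0 if x == y else 0.0
        else:
            # continuous variable: 1 minus |difference| / range
            total += 1.0 - abs(x - y) / (hi - lo)
    return total / comparisons if comparisons else -1.0
```

For instance, with ranges [(0, 0), (0, 0), (0, 20), (0, 0)], the rows [1, 3, 10.0, -1] and [1, 2, 15.0, 4] contribute 1, 0, and 0.75 over three valid comparisons, for a coefficient of about 0.58.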

Euclidean distance, the Brainerd-Robinson Coefficient, and Pearson's r are generally used for count data, and the first two will give sensible results with presence/absence data coded as 0=absent and 1=present. Euclidean distance is computed in the standard way from counts or case by case proportions (calculated by the program from the counts). Euclidean distance coefficients can optionally be scaled from 0 (close) to 1 (distant), by subtracting the minimum distance from each and then dividing the result by the range of distance coefficients observed. (This value is reported on the screen).
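A short Python sketch of Euclidean distance on case-by-case proportions follows, using the first two data rows of the sample input given later in this document. The function names are hypothetical, and the sketch assumes every case has a non-zero total.

```python
import math


def proportions(counts):
    """Convert one case's counts to proportions of that case's total."""
    total = sum(counts)  # assumed non-zero
    return [c / total for c in counts]


def euclidean(v1, v2):
    """Standard Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(v1, v2)))


row1 = [106, 103, 198, 24, 73, 16, 129, 0]  # first sample row
row2 = [4, 2, 6, 1, 0, 3, 0, 0]             # second sample row
d = euclidean(proportions(row1), proportions(row2))
print(round(100 * d))  # prints 30
```

Multiplied by 100 and rounded to 0 decimal places, this value appears to agree with the first entry of the sample output.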

The Brainerd Robinson Coefficient, widely used in archaeology, is a similarity measure that ranges from 0 (dissimilar) to 200 (similar). Scaled to a 0-1 range, it is calculated as 1.0 minus half the sum of the absolute values of the differences in the proportions (the proportions are calculated from counts by the program) across all variables.
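The two forms of the coefficient can be sketched in Python as follows. The function name and the `scaled` parameter are hypothetical; the 0-200 form is written as 200 minus the sum of absolute differences in percentages, which is equivalent to the 0-1 form above multiplied by 200.

```python
def brainerd_robinson(counts1, counts2, scaled=False):
    """Brainerd Robinson similarity from two rows of counts."""
    n1, n2 = sum(counts1), sum(counts2)  # assumed non-zero
    # Sum of absolute differences in proportions; ranges from 0 to 2.
    diff = sum(abs(x / n1 - y / n2) for x, y in zip(counts1, counts2))
    if scaled:
        return 1.0 - 0.5 * diff   # 0 (dissimilar) to 1 (similar)
    return 200.0 - 100.0 * diff   # 0 (dissimilar) to 200 (similar)
```

Two cases with identical proportions, such as [1, 3] and [2, 6], score 200, while [50, 50] against [100, 0] scores 100 (0.5 when scaled).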

Pearson's r, or the product-moment correlation coefficient, ranges from -1 to 1 and is calculated in the standard way on counts. Pearson's r is sometimes treated as a similarity measure that ranges from -1 to +1; however, see Cowgill (1991), who advises strongly against this procedure.

Euclidean Distance (proportions), Brainerd Robinson, and the Richness Measure all require non-negative input with no missing values. However, non-integer data may be used (e.g. weights). For Euclidean Distance on raw data and Pearson's r, negative numbers are treated as data; no missing values are allowed.

The richness measure does not have wide utility but may be occasionally useful. This is a distance measure that ranges from 0 (close) to 1 (distant). It is calculated as the absolute value of the difference in richness, the number of classes (variables) present, of the two observations, divided by the number of variables.
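The richness distance is simple enough to state directly in code. The following Python sketch uses a hypothetical function name and assumes non-negative input with no missing values.

```python
def richness_distance(row1, row2):
    """Absolute difference in richness divided by the number of variables."""
    r1 = sum(1 for x in row1 if x > 0)  # classes present in case 1
    r2 = sum(1 for x in row2 if x > 0)  # classes present in case 2
    return abs(r1 - r2) / len(row1)
```

For the rows [106, 103, 198, 24, 73, 16, 129, 0] and [4, 2, 6, 1, 0, 3, 0, 0], richness is 7 and 5 respectively, so the distance is 2/8 = 0.25.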

Joint Presence: a. This returns a count of joint presences, the number of columns on which the two rows have positive values (negative values are treated as missing).

Joint Presence: Z. This gives a binomial probability-based Z score for the number of joint presences based on an expected proportion (if the presences in the two cases are independent in a Chi² sense) of joint presences which can be computed from the total number of present and absent categories for each case (calculations are described under Allison's standardization in the TwoWay program).

This option was designed to assess joint occurrences of types within collections. To do this calculation, the matrix must first be transposed with ADFUTIL. The Joint Presence: Z option of the DIST program then counts the number of collections in which the two types co-occur and creates an expectation for co-occurrence based on the number of collections in which either type is present. The option may well be useful in other circumstances, but it also may not be meaningful.

Output [D]istance or [S]imilarity Measure {?}

The program can convert any similarity measure to distance or vice versa. The default output is the type of matrix implied by the measure. If conversion to the other type is requested (e.g. to convert Jaccard's to distance because Antana's multidimensional scaling routine requires a dissimilarity matrix), this is accomplished by simply subtracting the observed value from the maximum or maximum possible value (1 for SMC, Jaccard's, Gower, and Richness; 200 for Brainerd Robinson; √2 for Euclidean Distance based on proportions; and the observed maximum distance for Euclidean Distance based on counts). Pearson's r, when converted to distance, ranges from 0 (close; correlation of 1) through 1 (intermediate; correlation of 0) to 2 (distant; correlation of -1).
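The conversion rule amounts to one subtraction per coefficient, as in this Python sketch. The dictionary keys and function name are hypothetical labels chosen for illustration; the maxima are those listed above, with 1 used for Pearson's r so that the converted value runs from 0 to 2.

```python
import math

# Maximum (or maximum possible) value for each measure.
MAX_VALUE = {
    "smc": 1.0,
    "jaccard": 1.0,
    "gower": 1.0,
    "richness": 1.0,
    "brainerd_robinson": 200.0,
    "euclidean_proportions": math.sqrt(2.0),
    "pearson": 1.0,
}


def convert(value, measure, observed_max=None):
    """Switch between similarity and distance by subtracting from the maximum."""
    if measure == "euclidean_counts":
        # No theoretical maximum: the observed maximum distance is used instead.
        return observed_max - value
    return MAX_VALUE[measure] - value
```

For example, a Jaccard similarity of 0.25 becomes a distance of 0.75, and a Pearson's r of -1 becomes a distance of 2.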

Scale Measure to [0,1] {N} ?

Scales the coefficient to a 0-1 range by subtracting the theoretical minimum (or the observed minimum for Euclidean Distance based on counts) from each value and dividing the result by the theoretical range of values (or the observed range for Euclidean Distance based on counts).
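A minimal Python sketch of this rescaling is given below. The function name and parameters are hypothetical; when no theoretical bounds are supplied, the observed minimum and maximum are used, as described for Euclidean Distance based on counts. The sketch assumes the range is non-zero.

```python
def scale_unit(values, lo=None, hi=None):
    """Rescale coefficients to [0,1] using theoretical or observed bounds."""
    lo = min(values) if lo is None else lo  # observed minimum if none given
    hi = max(values) if hi is None else hi  # observed maximum if none given
    span = hi - lo                          # assumed non-zero
    return [(v - lo) / span for v in values]
```

For example, scale_unit([2, 4, 6]) uses the observed bounds and returns [0.0, 0.5, 1.0], while scale_unit([50, 200], lo=0, hi=200) applies the theoretical Brainerd Robinson bounds and returns [0.25, 1.0].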

Note: Sample Sizes Range from ? to ?

Adjust for Sample Size with Monte Carlo Subsamples {N} ? N

The program informs you of the range of sample sizes of counts within the data set. This option allows sample-size adjusted Monte Carlo calculation of the distance or similarity coefficients. Rather than compute the coefficients in the ordinary way, coefficients are calculated repeatedly using subsamples of the original data.

[P]airwise or [G]lobal Sample Size Adjustment {P} ?

If Monte Carlo analysis is requested, two options are provided for choosing subsamples. First, a global or across-the-board sample size can be used, in which all coefficients are computed using subsamples of that size (or their actual sample size, whichever is smaller). Alternatively, pairwise selection of sample size can be used, in which the smaller sample size of each pair is used.

Number of Monte Carlo Trials to Run {50} ?

Random Generator Seed (0 to set from clock) {0} ?

Number of Monte Carlo trials to run. Each coefficient printed is the average of the coefficients obtained from the random subsamples. You see these prompts only if you request Monte Carlo trials. You need to enter a random number seed only if you want to reproduce exactly a previous run with Monte Carlo trials.
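The Monte Carlo procedure can be sketched in Python as follows. This is an illustration under several assumptions that are not confirmed by this document: subsamples are drawn without replacement from the individual counted items, counts are integers, and the function and parameter names are hypothetical. Passing global_size=None selects the pairwise option.

```python
import random


def subsample(counts, size, rng):
    """Draw `size` items without replacement from a row of integer counts."""
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    out = [0] * len(counts)
    for i in rng.sample(pool, size):
        out[i] += 1
    return out


def monte_carlo(counts1, counts2, measure, trials=50, global_size=None, seed=0):
    """Average `measure` over repeated subsamples of two cases."""
    rng = random.Random(seed if seed else None)  # seed 0: seeded by the system
    n1, n2 = sum(counts1), sum(counts2)
    if global_size is None:
        s1 = s2 = min(n1, n2)  # pairwise: smaller sample size of the pair
    else:
        s1, s2 = min(n1, global_size), min(n2, global_size)  # global size, capped
    total = 0.0
    for _ in range(trials):
        total += measure(subsample(counts1, s1, rng), subsample(counts2, s2, rng))
    return total / trials
```

Here `measure` is any function that takes two rows of counts and returns a coefficient. Supplying the same non-zero seed reproduces a previous run exactly.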

xx% Complete

Informs you of the progress of calculations. The program should take almost no time unless Monte Carlo trials are run. In that case, execution time goes up with the number of trials times the square of the number of observations.

Computation Ends

Print Matrix {Y} ? Y

The program allows you to print the distance or similarity matrix.

Read Case Label File {N} ?

Requests the name of the file with labels for the cases.

Output File Name {DIST.LST} ?

Requests the name of the file or device where printed output is to be placed.

Set Output Options {N} ?

Print Matrix in [T]riangular or [S]quare Form {T} ? T

Output Line Width (in characters) {80} ?

Print Width of Observation Labels {3} ?

Scale Factor (Multiple Values by) {1} ?

Number of Decimal Places to Print {2} ?

The output options can be used to print a distance or similarity matrix in more readable forms. Requesting a Triangular matrix prints a lower triangular matrix, while Square prints the symmetric square matrix. The scale factor can be used to multiply all values by a number. It often enhances interpretability to scale a measure whose range is 0-1 by 100 and then print with 0 decimal places.

Write Coefficients to File For Analysis {Y} ? Y

Output File for Matrix {.DST} ?

If you wish to further analyze the coefficients, e.g. using multidimensional scaling, reply Y to the first prompt. Then enter the name of the file to which the output similarity or distance matrix will be written.

Output Matrix Form: [U]pper (Antana) or [L]ower (SYSTAT) ?

The program output is a triangular matrix, without the diagonal. The matrix can be written in two ways. The usual way of looking at a triangular matrix is a lower matrix, that part of the matrix below (to the left of) the diagonal. The upper matrix has the matrix above the diagonal; the order of the coefficients is 1,2; 1,3; ... 1,NOBS; 2,3; 2,4; ... 2,NOBS; ... NOBS-1,NOBS. If you are going to use the output with Antana analysis procedures, choose U; if you are going to use it with SYSTAT, choose L. If you write an upper matrix, the program inserts an Antana triangular matrix heading indicating the number of observations. In either case, the program simply writes out NOBS*(NOBS-1)/2 distance values separated by blanks.
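The two orderings can be illustrated with a short Python sketch that flattens a full symmetric matrix. The function names are hypothetical, the Antana heading is not reproduced, and the row-by-row order shown for the lower matrix (2,1; 3,1; 3,2; ...) is an assumption, since only the upper order is spelled out above.

```python
def upper_triangle(matrix):
    """Values above the diagonal in the order 1,2; 1,3; ... NOBS-1,NOBS."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def lower_triangle(matrix):
    """Values below the diagonal, row by row (assumed order 2,1; 3,1; 3,2; ...)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(1, n) for j in range(i)]
```

Either function returns NOBS*(NOBS-1)/2 values, which could then be written to a file separated by blanks.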

Note: to get a triangular matrix into SYSTAT use the following steps. Replace text included in lower case with an appropriate name for 'filename' and number of observations for 'nobs.'

DATA

SAVE filename

GET 'filename.DST'

INPUT OBS(1-nobs)

TYPE=SIMILAR or TYPE=DISSIMILARITY or TYPE=CORRELATION (use the appropriate entry)

DIAGONAL=ABSENT

RUN

?? Distance Measures Will Be Written to ???.DST

Program End.

SAMPLE PROGRAM INPUT

#Hinkson Site Rough Counts#
17 8
#red white  gray brown sflak lflak sbone lbone#
106   103   198    24    73    16   129     0
4     2     6     1     0     3     0     0
1     1     3     6     5     6     9     0
58    36    14     8    31    37    46     1
125    62   143    22    27    45    64    11
41    18   100     5    35     3     4     0
225    83   299    37   108   120    79     2
19    20    38     2    37     7     0     0
452   330   562    78   317   199   388    12
636   248   739   223   136   120   448     3
446   186   427    74   176   124    32     0
149   141   206    85    38    90  1248    13
7    20   109     0     6    16     0     0
152   128   201     5    77    67   188    22
115    21    45    14     8     5    76    12
1535   866  1352   441   881   516   308    91
122   119    61    12    19    38    58     0

SAMPLE PROGRAM OUTPUT

Hinkson Site Rough Counts
Distance Matrix of Euclidean Distance (%) Coefficients
H13:GK1:      30
H13:GK2:      37      51
H13:GK3:      29      40      32
H13:M01:      15      19      40      26
H13:M02:      27      28      56      49      27
H13:M03:      18      17      41      29      10      23
H13:M04:      28      35      48      39      30      24      24
H13:P01:      10      28      33      21      12      30      14      25
H13:R01:      14      25      38      28       8      29      15      33      14
H15:M01:      24      19      48      31      14      24      11      24      19
H15:R01:      50      73      43      50      57      75      62      73      52
H15:U01:      46      39      71      70      47      31      44      48      51
H16:M02:      10      31      34      22      14      34      19      32       8
H16:R01:      31      41      45      27      25      48      32      48      28
H17:M03:      21      24      40      24      14      29      13      22      14
H17:R01:      26      33      45      20      22      44      28      38      21
H12:M01 H13:GK1 H13:GK2 H13:GK3 H13:M01 H13:M02 H13:M03 H13:M04 H13:P01

H15:M01:      19
H15:R01:      53      69
H15:U01:      49      49      87
H16:M02:      15      25      46      53
H16:R01:      22      32      49      70      26
H17:M03:      18      10      64      53      21      30
H17:R01:      25      25      58      63      22      28      21
H13:R01 H15:M01 H15:R01 H15:U01 H16:M02 H16:R01 H17:M03


Page Last Updated - 02-Jun-2007