Tools for Quantitative Archaeology


DIVERS: Monte Carlo Diversity/Sample Size Analysis


      Program DIVERS allows analyses of diversity to control for the influence of sample size. Two dimensions of diversity are used: richness, the number of different categories present; and evenness, the evenness of the distribution of counts across the categories. A companion program, DIVPLT, graphically displays the results produced by DIVERS.


      The measure of evenness used is H/Hmax, also known as a J-score. H is the negative of the sum, over all categories with counts or proportions in the assemblage greater than 0, of pi*log10(pi), where pi is the proportion of items in the assemblage belonging to category i. Hmax is simply log10(k), where k is the number of categories used. (If desired, any number of categories with no representation can be included.) This measure varies from 0 (only one category present) to 1.0 (all categories present in equal numbers or proportions). In this implementation, Hmax is held constant for all assemblages, actual and simulated, rather than varying with the number of categories present. Thus, the H/Hmax values given are simply scaled values of H, and H values can be obtained by multiplying the H/Hmax values by Hmax.
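
      As an illustration of this definition, the following Python sketch computes H/Hmax for a single assemblage. It is not taken from the DIVERS source; the function name evenness and its arguments are assumptions made for this example. The counts are the non-zero counts of the LA PALOMA assemblage from Sample Dataset 3, and k = 44 is the number of non-zero categories in the sample model distribution.

```python
import math

def evenness(counts, k):
    """Return H/Hmax (J-score) using base 10 logs and a fixed Hmax = log10(k)."""
    total = sum(counts)
    # Only categories with a count greater than 0 contribute to H.
    h = -sum((c / total) * math.log10(c / total) for c in counts if c > 0)
    return h / math.log10(k)

# Non-zero counts for LA PALOMA (zero counts contribute nothing to H).
la_paloma = [4, 1, 2, 2, 3, 1, 1, 1, 1, 4, 1, 2]

# Sample size, richness, and evenness of the assemblage.
print(sum(la_paloma), len(la_paloma), round(evenness(la_paloma, 44), 4))
# prints: 23 12 0.6155
```

      These are the same three values listed for LA PALOMA in Sample Dataset 1.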

 

      See Kintigh (1984, 1989) for more information concerning the use of and rationale behind this procedure.


RUNNING DIVERS

 

      To run the program type: DIVERS<Enter>


      Program parameters are read interactively, while the data are read from a file (not in Antana format) prepared using any text editor that creates plain ASCII files. See the "Program Conventions" section at the beginning of this documentation for information on ASCII files. Because DIVERS relies on Monte Carlo methods, it can be time consuming on a microcomputer. For this reason, after the parameters have been read, but before diversity computations begin, it provides an estimate of the time that will be needed, so that the parameters can be readjusted or the program canceled.


      It is possible to cancel execution once the program has begun by typing Ctrl-Break when the program is waiting for a response to a prompt. Once all parameters have been read, and the "OK to Proceed" prompt has been answered in the affirmative, many machines will not recognize a Ctrl-Break, and those that do will not respond until the next time the program tries to write output. To exit more quickly, you can do a soft reset (Ctrl-Alt-Del) without causing any damage.


DIVERS PROGRAM PROMPTS


      Although the many prompts listed in the next section are useful for tailoring the program more closely to one's needs, an understanding of each parameter is not necessary to run the program. In fact, the only mandatory prompt is for the input file name. The program will run in a reasonable way if <Enter> is hit for each prompt except that one.


      If an inappropriate or invalid response is given, a message will be displayed. For numeric prompts, the message will indicate the range of possible values. The program also has certain built-in limits on the size of the data sets that can be accommodated. Running the program on data that exceed these bounds requires that the program be recompiled and a new .EXE file produced. Don't worry about this unless it happens to you, but for reference the current limits are:


    maxnint=60; { maximum number of increments }

    maxelmts=200; { max number of different elements }

    maxpoints=200; { maximum number of point data points }

    titlelen=60; { maximum run title length in characters}

    maxtrials=2000; { maximum number of trials }

    maxhistint=100; { maximum number of j score hist intervals}


EXPLANATION OF DIVERS PROGRAM PROMPTS


Data Input File ?


      The path name of the file that contains the input data. If the file does not exist, the program will display a message and prompt again. See DIVERS INPUT DATA below.


Reading Input Data...

  Model Distribution Read

  Number of Elements Read: 57

  Number of Non-0 Elements: 44

  Number of Points Read: 5


      This display informs you of the progress of the program in reading your input. If any errors are found in reading the input, error messages will be displayed at this time. If a model distribution separate from the one that might be obtained from the raw data associated with input points is found, the "Model Distribution Read" message is displayed. The "Number of Elements Read" is the number of values in the frequency distribution (either a separate model distribution, or a distribution built from the point raw data). The "Number of Non-0 Elements" is the number of elements that have a value greater than 0 in the model distribution (see Number of Different Categories to Use in Computing Hmax, below). The "Number of Points Read" is the number of separate data points that were read.


Output Listing File {DIVERS3A.LST} ?


      A valid pathname to which program output will be written. If the file already exists, you will be asked if the file may be deleted.


Listing Line Length in Characters {80} ?

Listing Left Margin Spaces {0} ?

Number of Lines to Print on a Listing Page {60} ?

Number of Blank Lines to Print at Top of Page {0} ?


      These prompts define the output format. The listing line length is the number of characters that will be printed, exclusive of any left margin. The line length should be at least 60. These print options, once entered, are written to a file called DIVERS.DEF in the current directory. The program does NOT prompt before erasing an existing file of this name, so do not keep another file called DIVERS.DEF in the current directory.


Default Output Options:

  Linelen=80 Left Margin=0 Pagelen=60 Top Margin=0

  Default Output Options OK {Y} ?


      This two line display and prompt indicates that a DIVERS.DEF file has been found on the current directory and the output options contained (which may be different than those listed), have been read. To simply accept these values, hit <enter>. Replying N or NO, will allow you to enter these values again, and a new DIVERS.DEF file will be written.


Print Histograms for Sample Sizes Run {Y} ?


      By default, the program produces a histogram of richness and evenness (along with summary statistics) for each sample size run. These histograms are useful for understanding the program operation and for getting a visual feeling for whether a sufficient number of trials was run (in which case the histogram will show a relatively smooth bell shape, with skewing or truncation at large and small sample sizes). However, in many cases the summary statistics, which are always printed, may be all that are necessary. If the histograms are not needed, the output can be greatly shortened by replying No to this prompt. (Note: a negative response only slightly reduces program run time.)


Produce an Output Plot File {Y} ?

Plot File Name {DIVERS3A.PLT} ?


      Reply Yes or accept the default to the first prompt if you want the program to produce a file that can be read by DIVPLT to graphically display the results of the analysis. If a plot file is requested, the file name is requested.


Produce an Output File of Tabulated Results {N} ?

Output File Name {DIVERS3A.OUT} ?


      If you wish to manipulate or otherwise display the summary data produced by the program (perhaps using Microsoft Chart or Lotus 1-2-3), the program can produce an output file like the summary listing, but without the headings and forms control. If an output file is requested, the file name is then requested.


Run Title ?


      An optional title that will be printed at the top of each output page. The title will be truncated to fit the available space so titles 40 characters or shorter are to be preferred. If no title is desired, simply hit <enter>.


Specify Random Number Seed Value {N} ?


      This prompt asks if you want to explicitly set the seed of the random number generator. In general, this need not be done. If the reply is <Enter> or negative, the program will set the seed with an arbitrary value (which is printed in the listing file) and continue without issuing the subsequent seed prompts. Setting the seed is useful only if you want to reproduce a run exactly, which requires the same sequence of random numbers. To reproduce a run, look at the parameter listing of the run you want to reproduce, and enter the same seed values (as well as the same values for the other parameters that affect the amount of computation, such as the number of trials and the sample size range).


Number of Trials is [C]omputed from Sample Size or [F]ixed {C} ?

Number of Random Trials to Run for Sample Sizes < 10 {200} ?


      If you wish, the program will adjust the number of trials for sample size (hit enter, c, or C); otherwise it will use the same number of trials for each sample size (hit f or F). Computing the number of trials reduces the amount of computer time used without seriously affecting the reliability of the results, since the stability of the results increases with sample size. The number of trials at a given sample size is based on the number of trials run for sample sizes of 1 to 9, which is specified in the next prompt. Then, for each increase in order of magnitude of sample size, the number of trials is halved.  That is, if the number of trials for sample sizes less than 10 is 200, then 100 trials are run for sample sizes of 10-99, 50 for 100-999, etc.  For guidance in picking the number of trials, see the information provided with the description of the next prompt.
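
      The halving rule can be sketched in Python as follows. This is an illustration rather than the program's own code: the function name trials_for is hypothetical, and the use of integer division and a floor of one trial are assumptions, since the documentation does not say how odd values are halved.

```python
def trials_for(sample_size, base_trials=200):
    """Halve base_trials once for each order of magnitude of sample_size."""
    # 0 for sizes 1-9, 1 for 10-99, 2 for 100-999, and so on.
    orders = len(str(sample_size)) - 1
    # Integer halving and the minimum of 1 trial are assumptions of this sketch.
    return max(1, base_trials // (2 ** orders))

for n in (5, 10, 99, 100, 999):
    print(n, trials_for(n))
# prints:
# 5 200
# 10 100
# 99 100
# 100 50
# 999 50
```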


Number of Random Trials to Run {100} ?


      If the Fixed option is chosen, then this is the next prompt received. Enter here the number of random trials to be run for each simulated sample (both interval and point samples, as described below). No simple rule can be given to guide the user in determining how many trials are necessary. However, for smaller sample sizes, larger numbers of categories, and less even model distributions, more trials will be needed. Bell-shaped curves in evenness and richness histograms produced by the program tend to indicate that an adequate number of trials have been run. However, note that with small sample sizes the possible values of evenness are limited, and the histogram will never become smooth. In this case, if the program is run several times at the same sample sizes on the same data and similar results are obtained (as judged by the means and standard deviations of richness and evenness), this suggests that an adequate number of trials have been run for those sample sizes.


Run Trials for a Range of Sample Sizes {Y} ?


      Answer yes to compute expected diversity for a range of possible sample sizes. This option is used to build a diversity curve like those presented in the references cited above. The computation of expected diversity for the sample sizes associated with any point data entered is carried out independently of this prompt (see "Run Random Trials with Sample Sizes of the Points Read").


Note: Data Set Sample Sizes Range from 23 to 152

Smallest Sample Size to Run {20} ?

Largest Sample Size to Run {200} ?


      These two prompts determine the range over which the random trials are run, that is, the range over which the expected diversity curve is computed. If point data have been entered, the default values associated with these prompts are "round" numbers that include the range of sample sizes of the actual data.


Sample Size Increment is [W]ide, [N]arrow, or [F]ixed {W} ?


      If "Run Trials for a Range of Sample Sizes" was answered yes, then expected diversity is computed at the minimum sample size and at additional sample sizes up to the maximum. If one chooses the Wide or Narrow option, the program will choose an approximately logarithmic spread of sample sizes to run. If the Fixed option is chosen, the "Sample Size Increment" prompt, below, is issued.


      With the Wide option the program computes a relatively wide sample size increment, which produces a rough curve. The Narrow option will produce a smoother curve but take quite a bit more time. In either case, the program will adjust the minimum and maximum sample sizes to "round" numbers. The Wide increment uses sample sizes at the following values within the range of the minimum and the maximum sample sizes specified: 0, 10, 20, 50, 100, 200, 500, 1000, 2000.... Thus, if the minimum is 20 and the maximum is 200, the following sample sizes are used: 20, 50, 100, 200. The Narrow increment at a given sample size is about one fifth that of the Wide increment. As a result, the curve is smoother, but it takes about 5 times as long to compute. The sequence of sample sizes for the Narrow increment goes as follows: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000, 1200....
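
      The two sequences listed above follow a regular pattern, which the Python sketch below reproduces. The function names and the stepping rule are inferred from the listed values only and are not taken from the program; the Wide sketch covers only the values from 10 upward.

```python
def wide_sizes(lo, hi):
    """Sample sizes 10, 20, 50, 100, 200, 500, ... that fall within [lo, hi]."""
    sizes, decade = [], 10
    while decade <= hi:
        sizes += [m * decade for m in (1, 2, 5) if lo <= m * decade <= hi]
        decade *= 10
    return sizes

def narrow_sizes(lo, hi):
    """Sample sizes 1..10, 12..20, 25..50, 60..100, 120..200, ... within [lo, hi]."""
    sizes, n = [], 1
    while n <= hi:
        if n >= lo:
            sizes.append(n)
        decade = 10 ** (len(str(n)) - 1)   # largest power of ten not exceeding n
        if n < 2 * decade:
            step = decade // 5             # e.g. 10, 12, 14, ... up to 20
        elif n < 5 * decade:
            step = decade // 2             # e.g. 20, 25, 30, ... up to 50
        else:
            step = decade                  # e.g. 50, 60, 70, ... up to 100
        n += max(1, step)                  # sizes below 10 advance by 1
    return sizes

print(wide_sizes(20, 200))     # [20, 50, 100, 200]
print(narrow_sizes(20, 60))    # [20, 25, 30, 35, 40, 45, 50, 60]
```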


Sample Size Increment {45} ?


      If the computed interval (Wide or Narrow) is not used, the user may specify an even increment to be used. The program default increment is calculated to provide 4 or 5 sample sizes spread over the range. A smaller number here increases the resolution of the diversity curve, at the expense of additional computation. If the increment does not divide evenly into the range between the minimum and maximum, then the largest sample size computed will be the largest one that does not exceed the maximum. If the minimum and maximum sample sizes are the same, then random trials are run for that sample size alone, and this prompt is not displayed.


Estimated Run Time in Minutes for Sample Size Range

   PC: 3.7 AT: 1.0

 > PC w/ 8087: 0.6 AT w/ 80287: 0.3

   Selections: 22000 Element-Trial Iterations: 13200


      This display indicates the approximate time that will be required to do the computation for the range of sample sizes with the specified increment with IBM microcomputers configured in different ways. Newer 286, 386, and 486 machines will run the program much faster. The numbers following "Selections" and "Element-Trial Iterations" indicate the amount of computing to be done in the two most time-consuming parts of the program. These two numbers are described below in the section on Program Operation.
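
      The two counts in the sample displays can be reproduced with a short Python sketch. It assumes the defaults shown in the prompts (computed trials starting at 200, Wide increment from 20 to 200) and the 44 non-zero elements of the sample data. The function names are hypothetical, and the assumption that only non-zero elements are counted is inferred from the displayed numbers, not stated by the program.

```python
def trials_for(sample_size, base_trials=200):
    """Computed number of trials: halved for each order of magnitude (assumed rule)."""
    return max(1, base_trials // 2 ** (len(str(sample_size)) - 1))

def workload(sample_sizes, n_elements, base_trials=200):
    """Return (selections, element-trial iterations) summed over the sample sizes."""
    # One random selection per item drawn in every trial.
    selections = sum(n * trials_for(n, base_trials) for n in sample_sizes)
    # One pass over the elements for every trial.
    iterations = sum(n_elements * trials_for(n, base_trials) for n in sample_sizes)
    return selections, iterations

# Sample size range 20 to 200 with the Wide increment.
print(workload([20, 50, 100, 200], 44))       # (22000, 13200)
# Sample sizes of the five points in the sample datasets.
print(workload([152, 69, 35, 53, 23], 44))    # (25600, 19800)
```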


Run Random Trials with Sample Sizes of the Points Read {Y} ?


      If data are read that describe actual data points, this option allows the computation of expected diversity for those exact sample sizes. In addition, the program will compute the percentile values of richness and evenness of the actual data with respect to the expected curve.


Estimated Run Time in Minutes for Point Sample Sizes

   PC: 5.4 AT: 1.5

 > PC w/ 8087: 0.8 AT w/ 80287: 0.4

   Selections: 25600 Element-Trial Iterations: 19800


      This message provides a time estimate for the computation of expected diversity for the sample sizes of the input points. See the information provided with the similar display, above.


Number of Different Categories to Use in Computing Hmax {44} ?


      The number of categories used in computing Hmax. Usually, this will be the number of non-zero categories in the model distribution (the program default, which is data dependent), but there are situations in which a larger number may be desired.


Confidence Interval Width (0.0<=interval<=1.0) {0.9} ?


      The interval centered on the median richness and evenness of the trial samples that is marked on the output and may be plotted as a reference line. A value of 0.9 indicates that 90% of all trials will fall within the range, a value of 0.5 indicates that 50% will fall within the range. Since richness and evenness values often cluster, the actual percentage in any interval may be MORE than the specified proportion, but never less than that value.
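
      One way to obtain an interval with this "at least" property is sketched below in Python. It is an assumption about the approach, not the program's documented algorithm: an equal number of trials is cut from each tail of the sorted trial values, and ties at the interval ends are kept. The function names are hypothetical.

```python
import math

def central_interval(values, width=0.9):
    """Return (low, high) bounds that contain at least `width` of the values."""
    ordered = sorted(values)
    n = len(ordered)
    # Number of trials that may be cut from each tail; the small epsilon
    # guards against floating-point error in (1 - width).
    cut = math.floor(n * (1.0 - width) / 2.0 + 1e-9)
    cut = min(cut, (n - 1) // 2)   # never let the two ends cross
    return ordered[cut], ordered[n - 1 - cut]

def coverage(values, low, high):
    """Proportion of values that fall within [low, high]."""
    return sum(low <= v <= high for v in values) / len(values)

spread = list(range(1, 101))          # 100 distinct trial values
low, high = central_interval(spread, 0.9)
print(low, high, coverage(spread, low, high))      # 6 95 0.9

clustered = [3] * 50 + [4] * 50       # richness values that cluster
low, high = central_interval(clustered, 0.5)
print(low, high, coverage(clustered, low, high))   # 3 4 1.0
```

      The second case shows the clustering effect: a 0.5 interval ends up containing all of the trials.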


OK to Proceed {Y} ?


      This prompt gives you a chance to go back and reenter program parameters. It is especially useful if the time estimate appears too long and you want to adjust the number of trials, the increment, or the minimum and maximum sample sizes to cut down the time. Answering yes starts the basic computation. Answering no takes you back to the output format prompt.


Program Running; Please Be Patient

  200

Compute Time: 0.22

   23

Compute Time: 0.45


Total Compute Time: 0.67 Minutes


End DIVERS


      This display indicates that the program is running, and indicates the sample size most recently completed. The first number and Compute Time value are associated with the trials for the sample size interval, the second set of numbers are associated with the trials for the sample sizes of the points read. The number printed near the left side of the screen is the sample size last COMPLETED, in the first case 200, in the second, 23. The first two compute time values are printed only when the respective parts of the program have finished execution. The last two lines will be displayed at the termination of the program run. Until the run on the first sample size is complete, only the first line of this display will show. If you need to interrupt the program, see RUNNING DIVERS above.


DIVERS INPUT DATA


      Input data is stored in a standard ASCII file produced using Notepad or another editor. If a word processing editor is used, make sure that the data set is saved as an ASCII file, that is that no additional codes are stored with it.


      The input data file consists of two parts, Model Data and Point Data.  Model Data compose the distribution that is used to generate the expected diversity. Point Data are used to examine the relationship between actual data values and the expected diversity produced from the model data. No point data are required for program operation if Model Data are explicitly provided. Whenever a terminating "*" is used, it must be preceded by one or more blanks.


Model Data


      The model data may be entered as a sequence of real numbers, followed by a "*". Any number of values may be on a line, separated by one or more spaces (NOT commas). The program will continue reading numbers until a "*" or other non-numeric character is read. Note that a blank must appear before the * or other non-numeric character. No more than the program limit of maxelmts values may be entered (see above). The values read may be counts, percentages, or proportions that make up the model distribution.


      Alternatively, the model distribution may be built by summing the values given with the Point Data. In this case, the absence of explicit model data must be indicated by a single initial line with a "*" on it.


Point Data


      Point data consists of two parts for each point (assemblage), label information and diversity data. Label information is read from the first line following the Model Data or the last line of the previous entry of Point Data. The diversity data is read starting on the next line, until a "*" or other non-numeric character is found.


      Label Information. The label line may have three types of information on it. First a single character that indicates the plot symbol to be used by DIVPLT, for example, a '+' or an 'A'. Any standard character may be used.


       Next, a label of up to titlelen characters for the point. This label may be separated from the plot symbol by any number of blanks, but must be on the same line. The label may include any characters, including blanks.  All characters up to the first "/" (if any) will be plotted as a label on the plot. The entire string (with the "/" changed to a blank) will be plotted in a legend.


      Diversity Data.  Diversity information for the point is read starting on the line following the label information line. Diversity information may be of two types.


      If a model distribution was explicitly provided, diversity data may consist of three values for the point, sample size, richness, and evenness, on one or more lines, terminated with a "*". Sample size and richness are read as integers (no decimal point), and evenness as a real number. Sample size is the number of items observed. Richness is the number of different categories (as defined by the model distribution) observed. Evenness is the H/Hmax value of the observed distribution, accurate to at least four decimal places (the program rounds to three decimal places when computing percentiles).  Base 10 logs are used.


      Alternatively, a list of counts (or percentages, or proportions) representing the categories for the data point will be read. The program will compute each point's sample size, richness, and evenness. This data list has the same form as the model distribution list, including the terminating "*". If no model distribution was provided, the number of values expected for each point will be determined by the number of values given for the first point. Thus, the same number of values must be entered for each point and for the model distribution, if any. If no explicit model distribution is provided, the values for the first, second, third, etc. categories for all points are summed to provide a composite distribution. In this case counts would ordinarily be used; however, percentages or proportions may be used if the composite distribution should weight each point (assemblage) equally (but the sample size will not be available).
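
      To make the file layout concrete, the following Python sketch reads a data file of the kind described above and builds the composite distribution. It is a simplified illustration, not the program's reader: the function name parse_divers is hypothetical, the tiny dataset is invented for the example, and the sketch does not distinguish the three-value form (sample size, richness, evenness) from a list of counts.

```python
def parse_divers(text):
    """Return (model, points) where points is a list of (symbol, label, values)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    pos = 0

    def read_numbers():
        nonlocal pos
        values = []
        while pos < len(lines):
            tokens = lines[pos].split()
            pos += 1
            for tok in tokens:
                try:
                    values.append(float(tok))
                except ValueError:
                    return values   # "*" or another non-numeric token ends the list
        return values

    model = read_numbers()          # empty if the file starts with a lone "*"
    points = []
    while pos < len(lines):
        label_line = lines[pos].strip()
        pos += 1
        symbol, label = label_line[0], label_line[1:].strip()
        points.append((symbol, label, read_numbers()))
    return model, points

# Invented three-category example with no explicit model distribution.
text = """ *
A One/FIRST SITE
 2 0 1 *
B Two/SECOND SITE
 1 1
 0 *
"""
model, points = parse_divers(text)
# With no model data, sum the point values category by category.
composite = [sum(col) for col in zip(*(values for _, _, values in points))]
print(model, composite)   # [] [3.0, 1.0, 1.0]
```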


      The choice of the input format is a matter of convenience. Usually, either the model distribution will be provided along with computed richness and evenness values for the points (Sample Dataset 1), or no model distribution will be given, but full data will be given for each point (Sample Dataset 2). Use of a model distribution and raw data is useful only if a model distribution other than the composite distribution is desired (Sample Dataset 3). The user need not specify the input form used; the program will figure it out. Three sample datasets are included below. All three of these sample datasets will produce the same results. (Here, the model distribution of Sample Dataset 3 is redundant, since it is simply the sum of the point assemblages.)


Sample Dataset 1

3 41 12 4 1 3 12 38 9 21 2 2 25 11 7 11 3 4 12 6 5 4 7 1 8 1 2   
  4 2 2 1 1 0 7 2 1 1 0 3 0 1 0 6 0 0 0 3 9 0 0 0 17 10 0 7 0 0 *
  * Alta/ALTAMIRA
  152 38 0.8639 *
  * Mina/CUETO DE LA MINA
   69 27 0.7810 *
  * Cier/EL CIERRO
   35 15 0.6568 *
  * Juyo/EL JUYO
   53 19 0.6584 *
  * Palo/LA PALOMA
   23 12 0.6155 *

Sample Dataset 2

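 *
O Alta/ALTAMIRA
    2 12  7  1  0  3 12 15  0  3  1  1 12  7  3 11
    3  1  7  2  4  3  3  1  5  1  1  1  0  2  0  1
    0  0  1  1  1  0  3  0  0  0  2  0  0  0  1  4
    0  0  0  5  4  0  5  0  0  *
O Mina/CUETO DE LA MINA
    1 12  2  0  1  0  0  3  1  5  0  1  4  3  1  0
    0  1  2  4  0  1  1  0  1  0  0  2  2  0  1  0
    0  7  0  0  0  0  0  0  1  0  1  0  0  0  2  2
    0  0  0  6  1  0  0  0  0  *
O Cier/EL CIERRO
    0  5  1  2  0  0  0  7  3  2  1  0  4  0  2  0
    0  0  2  0  0  0  1  0  0  0  0  0  0  0  0  0
    0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
    0  0  0  2  1  0  0  0  0  *
O Juyo/EL JUYO
    0  8  2  1  0  0  0 12  3  9  0  0  2  1  1  0
    0  1  1  0  1  0  2  0  1  0  1  1  0  0  0  0
    0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  2
    0  0  0  0  3  0  0  0  0  *
O Palo/LA PALOMA
    0  4  0  0  0  0  0  1  2  2  0  0  3  0  0  0
    0  1  0  0  0  0  0  0  1  0  0  0  0  0  0  0
    0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1
    0  0  0  4  1  0  2  0  0  *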

Sample Dataset 3

3 41 12 4 1 3 12 38 9 21 2 2 25 11 7 11 3 4 12 6 5 4 7 1 8 1 2
  4 2 2 1 1 0 7 2 1 1 0 3 0 1 0 6 0 0 0 3 9 0 0 0 17 10 0 7 0 0 *
O Alta/ALTAMIRA
    2 12  7  1  0  3 12 15  0  3  1  1 12  7  3 11
    3  1  7  2  4  3  3  1  5  1  1  1  0  2  0  1
    0  0  1  1  1  0  3  0  0  0  2  0  0  0  1  4
    0  0  0  5  4  0  5  0  0  *
O Mina/CUETO DE LA MINA
    1 12  2  0  1  0  0  3  1  5  0  1  4  3  1  0
    0  1  2  4  0  1  1  0  1  0  0  2  2  0  1  0
    0  7  0  0  0  0  0  0  1  0  1  0  0  0  2  2
    0  0  0  6  1  0  0  0  0  *
O Cier/EL CIERRO
    0  5  1  2  0  0  0  7  3  2  1  0  4  0  2  0
    0  0  2  0  0  0  1  0  0  0  0  0  0  0  0  0
    0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0
    0  0  0  2  1  0  0  0  0  *
O Juyo/EL JUYO
    0  8  2  1  0  0  0 12  3  9  0  0  2  1  1  0
    0  1  1  0  1  0  2  0  1  0  1  1  0  0  0  0
    0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  2
    0  0  0  0  3  0  0  0  0  *
O Palo/LA PALOMA
    0  4  0  0  0  0  0  1  2  2  0  0  3  0  0  0
    0  1  0  0  0  0  0  0  1  0  0  0  0  0  0  0
    0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1
    0  0  0  4  1  0  2  0  0  *

DESCRIPTION OF DIVERS PROGRAM OPERATION


      For each sample size, from the minimum to the maximum, a number of samples (determined by the Number of Random Trials prompt) is chosen from the frequency distribution of the data, and the distribution of the number of different categories present (the richness) is plotted in a histogram.  Next, the distribution of the evenness (H/Hmax) values of the trial samples is summarized and plotted in a histogram.  Finally, this information is summarized by tables and plots in which the mean and a confidence interval around the richness and evenness values are plotted against the sample size.  The point data (if any) are superimposed on this plot.  All output goes to the output file and the plots to a plotter. (The plot option is not now implemented.)
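
      The core of this procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the DIVERS implementation: each item is drawn independently with probability proportional to the model distribution, Hmax uses the number of non-zero model categories, and the function names and the seed argument are hypothetical. The model values are the first ten values of the model distribution in the sample datasets, used here only to keep the example short.

```python
import math
import random

def evenness(counts, k):
    """H/Hmax with base 10 logs and Hmax fixed at log10(k)."""
    total = sum(counts)
    h = -sum((c / total) * math.log10(c / total) for c in counts if c > 0)
    return h / math.log10(k)

def simulate(model, sample_size, trials, seed=1):
    """Return lists of richness and evenness values, one entry per random trial."""
    rng = random.Random(seed)                 # fixed seed makes a run reproducible
    categories = range(len(model))
    k = sum(1 for w in model if w > 0)        # non-zero categories (must exceed 1)
    richness, even = [], []
    for _ in range(trials):
        counts = [0] * len(model)
        # Draw sample_size items in proportion to the model distribution.
        for i in rng.choices(categories, weights=model, k=sample_size):
            counts[i] += 1
        richness.append(sum(1 for c in counts if c > 0))
        even.append(evenness(counts, k))
    return richness, even

model = [3, 41, 12, 4, 1, 3, 12, 38, 9, 21]
rich, even = simulate(model, sample_size=23, trials=100)
print("mean richness:", sum(rich) / len(rich))
print("mean evenness:", sum(even) / len(even))
```

      Repeating the call for each sample size in the chosen range gives the means from which an expected diversity curve can be built.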


      CPU Time Required.  The execution time of the program is basically a function of two quantities, each summed over the sample sizes that are run: the number of random selections made, and the product of the number of random trials and the number of elements.  Thus, a single run of 500 trials for a sample size of 100 with 50 different categories results in 50,000 selections and 25,000 element-trial iterations.  With the time estimates, the program provides the number of selections that will be performed and the number of element-trial iterations required by the run.  In general, with a constant increment and number of trials, the number of selections increases with the square of the maximum sample size requested, as given by:


SELECTIONS=0.5*TRIALS*(MAX**2/INC+MAX). 
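
      This formula can be checked against a direct count. The Python sketch below assumes that the sample sizes run from INC up to MAX in steps of INC with the same number of trials at each size; the function names are hypothetical.

```python
def selections_exact(trials, max_size, inc):
    """Count selections directly: every trial draws one item per unit of sample size."""
    sizes = range(inc, max_size + 1, inc)   # INC, 2*INC, ..., not exceeding MAX
    return trials * sum(sizes)

def selections_formula(trials, max_size, inc):
    """SELECTIONS=0.5*TRIALS*(MAX**2/INC+MAX)"""
    return 0.5 * trials * (max_size ** 2 / inc + max_size)

print(selections_exact(100, 200, 50), selections_formula(100, 200, 50))
# prints: 50000 50000.0
print(selections_exact(100, 400, 50), selections_formula(100, 400, 50))
# prints: 180000 180000.0
```

      Doubling MAX from 200 to 400 raises the count from 50,000 to 180,000, close to the fourfold increase expected from growth with the square of MAX.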


      If a computed interval (INC=0) is used, the increase in selections with increasing MAX is essentially linear, estimated by: 


SELECTIONS=1.8*MAX*TRIALS (for rough intervals)

SELECTIONS=7.8*MAX*TRIALS (for smooth intervals)


      If computed trials and intervals (TRIALS<0 and INC=0) are used then the number of selections increases much more slowly.  The relationship is estimated closely (at a MAX of 50 or more) by:


SELECTIONS=4*TRIALS*MAX*0.81**(3.32*log10(MAX)) (for rough)

SELECTIONS=16*TRIALS*MAX*0.81**(3.32*log10(MAX)) (for smooth)


COMPOSITION OF .PLT FILES PRODUCED BY DIVERS


      If you look at the .PLT file produced by DIVERS (not the .HPG file produced by DIVPLT that actually contains the plot) you will see that it contains all the data needed to create the plots. You can read that data into a statistics or graphing program and use it to create plots. For DIVERS, the composition of the .PLT file is as follows. You won't need all of this information:


title, date, no. of elements, min ss, max ss, increment, no. of intervals, no. of points, conf. level, iproginc, ipointtrial, iinctrial, disteven, hmax, avlevpct (you don't need these)


for each element, the element proportion


for each sample size interval: ss, mean richness, std of richness, richness conf. interval low, richness conf. interval high, minimum richness, maximum richness, mean evenness, std of evenness, evenness conf. interval low, evenness conf. interval high, minimum evenness, maximum evenness.


for each data point: symbol, label, sample size, number of trials, richness, evenness (and, depending on the computations, richness lower percentile, richness percentile, richness upper percentile, evenness lower percentile, evenness percentile, evenness upper percentile).



Page Last Updated - 08-Dec-2007