Tools for Quantitative Archaeology


CNTCNV: Transform and Aggregate Sparse Count Data Input into Data Matrix

 

      Program CNTCNV was written to manipulate count data with an abbreviated input format. For most statistical programs, count data must be entered with a single count for each possible category. In cases where there are many 0 counts this can be a tedious and error-prone procedure. CNTCNV allows fast and accurate input of count data sets with many 0 values, provides flexible mechanisms to aggregate both cases and counted categories, and produces a labeled listing and a data matrix suitable for input to most statistical programs.

 

      CNTCNV assumes that the data have been coded with certain format rules consistently followed. The conventions it shares with others of my programs are described in the section entitled "Program Conventions." CNTCNV assumes that the first part of each line in the input file has provenience information, which may be either character or numeric, that identifies the individual cases for the program. Second, it allows for the coding of a manual total for each case that the computer will compare with its computed total. If this check is requested, an error message is printed if the totals do not match. Use of this feature on the initial coding is recommended, because it allows catching input errors due to improper transcription of values.

 

      The count data are input as a set of row and column coordinates for each case, followed by a count for that row and column. Thus, if you were coding types 1-30 and had 8 vessel forms for which you wanted counts, a code of 27 3 47 could be interpreted as a count of 47 for type 27 with vessel form 3. The program allows you to specify the length of the row, column, and count fields. (The program allows this input for row numbers alone, when no column has been specified. In this case the column field is omitted in each row-column-count set. No count may exceed 32767.) Thus, an input line to the program consists of arbitrary provenience information (which may have any characters) and an optional manual sum, followed by sets of row-column-counts, one after another. The row-column-count sets must follow all other information on a line of data.

 

      The total line length may not exceed 256 characters, but a shorter length, such as 80, is recommended. The program will continue to read row-column-count sets until the end of each line is reached. It is important to note that the row, column, and count fields must be entirely numeric, and that each field must always be the same width. Thus, the count field width must be set up to accept the widest count that will be found in the data set. Each field is treated independently, so there is no need to fill the fields with 0's. If the row field width were 2, the column field width 1, and the count width 3, the following are equivalent: '273047' and '273 47'. Note also that the fields within each set are not separated by blanks (although this effect can be achieved by expanding the field widths to be larger than necessary), and that the row-column-count sets likewise need not be separated by blanks.

 

      If the sets of counts extend past the end of a line, a new line should be started, with the necessary provenience information repeated (the optional sum need not be repeated). The row-column-count fields must start in the same position in each record. All records with the same provenience key will be considered the same case, no matter how many lines they are on. Note also that the lines need not be of the same length; the program simply reads to the end of each line.
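Under these rules, a single input line can be parsed with simple fixed-width slicing. The sketch below (in Python, a hypothetical illustration, not the program's own code) uses the field layout of the sample data later in this documentation: a 31-character provenience, a 4-character manual sum, and 2/1/3-character row/column/count fields; in the real program these widths are supplied interactively.

```python
def parse_line(line, prov_len=31, sum_len=4, row_w=2, col_w=1, cnt_w=3):
    """Parse one fixed-width count input line (hypothetical sketch).

    Returns (provenience string, manual sum, list of (row, col, count)).
    Field widths default to those of the sample data in this document.
    """
    prov = line[:prov_len]
    manual_sum = int(line[prov_len:prov_len + sum_len])
    body = line.rstrip()               # read sets until the end of the line
    counts = []
    pos = prov_len + sum_len
    set_w = row_w + col_w + cnt_w
    while pos + set_w <= len(body):
        row = int(body[pos:pos + row_w])
        col = int(body[pos + row_w:pos + row_w + col_w])
        cnt = int(body[pos + row_w + col_w:pos + set_w])
        counts.append((row, col, cnt))
        pos += set_w
    return prov, manual_sum, counts
```

Applied to the first sample line below, this yields counts (73,1,19), (73,2,11), and so on; lines sharing a provenience key would then be combined before output.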

 

      The program provides several options for convenience. First, the row-column-count fields may be checked to make sure that the rows are input in ascending order and that, within each row, the columns are in ascending order. Thus, if care is taken in preparing the records to input the fields in order with no duplication, this provides a useful check against coding errors. Further, the program requires that the user specify the valid row-column combinations; any combinations not specified will be flagged as errors. Finally, as mentioned above, a manual total for each provenience can be entered and checked against the computer's total, again to help detect data entry errors.
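The ascending-order check can be stated compactly: treating each (row, column) pair lexicographically catches out-of-order rows, out-of-order columns within a row, and duplicates all at once. A minimal sketch (hypothetical, not the program's code):

```python
def check_ascending(counts):
    """Return True if rows ascend and, within each row, columns ascend,
    with no duplicated row-column pairs. `counts` is a list of
    (row, col, count) tuples, as parsed from one provenience."""
    prev = (-1, -1)
    for row, col, _ in counts:
        if (row, col) <= prev:      # out of order or duplicated
            return False
        prev = (row, col)
    return True
```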

 

      Row numbers may be from 0 to a maximum specified in the program constant section (e.g. 100) and columns may be from 0 to a similarly specified constant (e.g. 8). Thus, coding forms should be set up to use more rows than columns. However, not all row or column numbers need be used, as long as no row or column number exceeds its specified maximum. Thus row numbers may be specified in a hierarchical fashion 10, 11, 12, 13, 20, 21, 22, 23, 24, 30..., where 20 may be an indeterminate category for the more specific codes 21, 22, 23, and 24.

 

      The program allows lumping of cases through a simple mechanism. A set of columns within the provenience portion of the record is specified as the key. As long as the input data set has all lines with the same key together in the file, all the proveniences with the same key are combined on output. This can easily be achieved by using SORTLINE or another program to sort the lines according to the key field. Because CNTCNV uses only one contiguous key string, you may want to use my MVC program to construct a contiguous key string from separated fields, and then use SORTLINE to sort the records according to that key so that all records with the same key appear together.

 

      For example, if, within the provenience portion of the record, you had a column with E for excavated or S for surface collections, you could get total counts for these two collection types by sorting the file on that column and specifying that column alone as the key string. However, if you lump original proveniences you will not want to simultaneously use the ascending row-column check or the total check, since superfluous warnings will be issued. One limitation of the program is that no single count may exceed 32767.
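Lumping adjacent records with equal keys is the same operation as Python's itertools.groupby over a sorted stream. The sketch below (hypothetical; the (provenience, counts-dict) record shape is an assumption about the parsed in-memory form, not the program's internals) shows the idea:

```python
from itertools import groupby

def lump_adjacent(records, key_pos, key_len):
    """Combine adjacent records whose key substrings match.

    `records` is a sequence of (provenience_string, {category: count})
    pairs; records sharing a key must already be adjacent (e.g. after
    sorting the file on the key field)."""
    def key_of(rec):
        return rec[0][key_pos:key_pos + key_len]

    for key, group in groupby(records, key=key_of):
        totals = {}
        for _, counts in group:
            for cat, n in counts.items():
                totals[cat] = totals.get(cat, 0) + n
        yield key, totals
```

For instance, with a one-character key in column 1 distinguishing E(xcavated) from S(urface), two adjacent E records merge into one output case.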

 

      The program reads character labels for the row and column numeric codes. There is an optional facility to read a numeric provenience identifier that allows a label (such as a site name) to be attached to each record (in this case several proveniences may use the same label). However, as with the row and column fields, this number must be less than a specified maximum (400) and greater than or equal to 0, although not all intermediate numbers need be used.

 

      For example, the input below has provenience information in the first 31 characters (which, in this case, all form the key), followed by a four character total, followed by sets of 2 character row numbers, 1 character column numbers, and 3 character counts. In this case proveniences (with the first 31 characters the same) span multiple records. These data are used in all of the examples in this documentation.

        10        20         30         40         50        60        70
    +    v    +    v     +    v     +    v     +    v    +    v    +    v    +   
 2  3                      1194 257731 19732 11733 91734 13735 99736  4742  1 
 2  3                      1194 257743  1745  4591  4592  4403  4405  2 
 4       1171850  990   1   394  60291  2292  2293  4295  3191  1192  1 
 4       1171850  990   1   394  60731  1732  1733  5734  1735 12741  2743  6 
 4       1171850  990   1   394  60744  1745 17746  1 
 4       1161860  890   1   393  41291  1293  1295  3191  1192  5195  2 
 4       1161860  890   1   393  41731  2732  1733  2735  9741  1742  2743  1 
 4       1161860  890   1   393  41744  1745  9 
 5        371300  850   1   212  45291  3293  2295  2191  1192  4193  2195  1 
 5        371300  850   1   212  45733 15735  6741  1743  5745  2541  1  

 

      Finally, the program allows aggregation of several row-column combinations into a single output field. By default, each valid row-column combination produces one output field. This aggregation is specified by reading an additional file, which also provides labels for the output categories (otherwise labels are created from the row and column labels).

 

      The output listing includes a display of the format that the output data set will take for reading of the fixed-format data by other programs, and an optional labeled listing for each provenience of all output categories, their percentage of the provenience total, and a total count for the provenience. The program also provides a grand total of each category from all proveniences combined. This listing is illustrated below:

Output Record Format 
Record  1: Record Key (A31) 
Record  2 Fields (20I4): 1   2   3   4   5   6   7 
 
Key:  2  3                      1194 
  Buffwares                  6    2.3% 
  Redwares                   8    3.1% 
  Plainware                243   94.6% 
  Total                    257 
 
Key:  4       1171850  990   1   394 
  Tucson Basin Red/Brown    13   21.7% 
  Plainware                 47   78.3% 
  Total                     60 
 
Key:  4       1161860  890   1   393 
  Tucson Basin Red/Brown    13   31.7% 
  Plainware                 28   68.3% 
  Total                     41 
 
Key:  5        371300  850   1   212 
  Tucson Basin Red/Brown    15   33.3% 
  Redwares                   1    2.2% 
  Plainware                 29   64.4% 
  Total                     45 
 
 
All Proveniences Combined 
  Tucson Basin Red/Brown    41   10.2% 
  Buffwares                  6    1.5% 
  Redwares                   9    2.2% 
  Plainware                347   86.1% 
  Total                    403 

 

      The output data set is written in Antana form (see the section entitled "Program Conventions"). The file begins with a line that specifies the number of observations and count variables in the data set. Each observation consists of the optional provenience label (as an Antana comment line) and the provenience key (also as an Antana comment line), followed by counts for each of the output categories, formatted so that no record is more than 80 characters long and so that the output counts are an even width (specified by the user in response to a prompt). A sample output matrix for the input cases above is shown here:

 

#Cases# 4  #Vars# 7
# 2  3                      1194
    0    0    6    8  243    0    0
# 4       1171850  990   1   394
   13    0    0    0   47    0    0
# 4       1161860  890   1   393
   13    0    0    0   28    0    0
# 5        371300  850   1   212
   15    0    0    1   29    0    0

      In order to use this output data set with Antana or any other program that requires free-format input, make sure that the output count width is 1 larger than necessary to show the largest (aggregated) count; each count will then be separated from adjacent ones by one or more blanks.
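The output layout described above (a header line, comment lines for each key, and fixed-width counts broken at 80 characters) can be sketched as follows. This is a hypothetical Python illustration; details of the Antana format beyond what this documentation states are assumptions.

```python
def format_antana(cases, n_vars, width=5, max_len=80):
    """Build output-matrix lines in the layout shown above.

    `cases` is a list of (key_string, counts_list) pairs; `width` is the
    user-chosen output count field width. Returns the lines of the file."""
    lines = [f"#Cases# {len(cases)}  #Vars# {n_vars}"]
    for key, counts in cases:
        lines.append(f"#{key}")                # key as an Antana comment line
        line = ""
        for c in counts:
            field = f"{c:{width}d}"
            if len(line) + len(field) > max_len:   # keep records <= 80 chars
                lines.append(line)
                line = ""
            line += field
        lines.append(line)
    return lines
```

With a field width of 5 and seven output categories, each case fits on one 35-character record, as in the sample matrix above.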

 

      The program prompts the user for information about the precise format of the input count records, and reads several files that define the row-column validity, labels, and any aggregation required. The sample interactive session shown here, used with the other files described in this documentation, produced the listing and output data set shown above (these files are included on your disk so that you can experiment):

Note: CNTCNV has current limits of:
  100 Row,   8 Column,   200 Output Categories
  To change this limit the program must be recompiled
Output Listing File {.LST} ? cntcnv
All Adjacent Records with the Same Key Will Be Lumped
  Position of Key String {1} ?
  Length of Key String ? 3
Check Computed Sum Against Sum Coded on Record {N} ? N
Position of 1st Row Field {25} ?
Length of Row Field {2} ?
If No Column Code Was Used Answer 0 to Next Question
  Length of Column Field {1} ?
Length of Count Field {3} ?
Output [C]ount or [P]ercent {C} ? C
Output Field Width Must Accommodate Largest Aggregated Count
  Length of Output Count Field {4} ?
Check Ascending Order of Rows and Columns {N} ? N
Complete Output Listing By Provenience {N} ? N
Row-Column Codes Are Checked for Validity In Input
  Row-Column Code Validity File {CNTCNV.RCV} ?
  Reading: CNTCNV.RCV
  Completed Reading: CNTCNV.RCV
Read Provenience ID Names from File {Y} ? Y
  Provenience Name File (NUL for None) {CNTCNV.APL} ?
  Reading: CNTCNV.APL
  Completed Reading: CNTCNV.APL
  Position of Numeric Prov ID Field {1} ?
  Length of Numeric Prov ID Field {3} ?
Row Name File (NUL for None) {CNTCNV.ARL} ?
  Reading: CNTCNV.ARL
  Completed Reading: CNTCNV.ARL
Column Name File (NUL for None) {CNTCNV.ACL} ?
  Reading: CNTCNV.ACL
  Completed Reading: CNTCNV.ACL
Row-Column Counts May be Aggregated On Output
  Aggregate Rows & Columns for Output {Y} ? Y
  Aggregation Description File {CNTCNV.AGG} ?
  Reading: CNTCNV.AGG
  Completed Reading: CNTCNV.AGG
Lower Limit of Sample Size to Output Record {0} ?
Input Count File {CNTCNV.DAT} ?
Output Count File {CNTCNV.ADF} ?
Output Row Label File (NUL for none) {CNTCNV.ARL} ?
File Exists; OK to Erase It {N} ?
Input Count File {CNTCNV.DAT} ?
Output Count File {CNTCNV.ADF} ?
Output Row Label File (NUL for none) {CNTCNV.ARL} ? cntcnv2.arl
Output Column Label File (NUL for none) {CNTCNV.ACL} ? cntcnv2.acl
Begin Processing Data
    4 Records Written
Rewriting CNTCNV.ADF w/ Antana Header
Writing Column Label File
Program End

The input file formats are described below.

 

      Row-Column Validity File. This file consists of any number of sets of three numbers. The first number represents the row number and the second and third the minimum and maximum column numbers. Thus, 27 1 8 indicates that columns 1-8 are valid for type 27. If no column numbers are used each valid row number should be entered in the form 27 0 0. If only columns 1 and 5-8 were valid for row 27, this must be entered as 27 1 1 27 5 8. These sets of three numbers may be in any order, and they may extend from line to line; however, all three numbers of each set must be on a single line. The numbers are read as a free-format stream, so they have no required format except that they must be separated by at least one blank and may not contain decimal points or other punctuation or non-numeric characters. A sample of such a file, used in the other examples included herein, is reproduced below:

 1 1 8  10 1 8  11 1 8  12 1 8  13 1 8  14 1 8  15 1 8  16 1 8 
17 1 8  18 1 8  19 1 8  20 1 8  21 1 8  22 1 8 
23 1 8  24 1 8  28 1 8  29 1 8  30 1 8  31 1 8  32 1 8  33 1 8 
34 1 8  35 1 8  39 1 8   4 1 8  40 1 8  41 1 8 
49 1 8   5 1 8  50 1 8  51 1 8  52 1 8  53 1 8  54 1 8  55 1 8 
59 1 8   6 1 8  61 1 8  62 1 8  68 1 8  69 1 8  70 1 8 
71 1 8  73 1 8  74 1 8  81 1 8  89 1 8  98 1 8  99 1  

      Prov-ID, Row Label, and Column Label Files. These files have the same simple format. Each line represents a single label. Each line has an integer number that may be preceded by any number of blanks, and must be followed by at least one blank. The remainder of the line is read as the label for that number. Thus, in the Column Label File, the line '2 Bowl Rim' will associate column 2 with the label Bowl Rim. The labels need not be entered in numerical order. A Column Label file is reproduced below:

1 Bowl Body 
2 Bowl Rim 
3 Jar Body 
4 Jar Rim 
5 Indet. Body 
6 Indet. Rim 
7 Handle 
8 Figurine
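Parsing these label files is a one-split-per-line affair: the integer code before the first blank, then the rest of the line as the label. A hypothetical sketch:

```python
def read_labels(lines):
    """Parse a Prov-ID, Row Label, or Column Label file: each line holds
    an integer code (optionally preceded by blanks), at least one blank,
    then the label for that code. Labels may appear in any order."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        num, _, label = line.partition(" ")
        labels[int(num)] = label.strip()
    return labels
```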

       Aggregation File. In this file sets of lines define the row-column composition of output categories. The first line of each set is read as the output category label. Output category numbers are assigned sequentially by the program and must not be entered. The second line of the set begins with a number that represents the number of succeeding row-column specifications to be read for this output category. This count is followed by that number of specifications, which may take one of two forms. A single row-column specification is added to the output category by a number of the form row*1000+column. Thus, 27003 indicates row 27 column 3. To include all valid columns for a specific row, indicate only the row number. Because row numbers will be less than 1000, any number less than 1000 is interpreted in this way. The two forms of specification may be mixed on a single record. A sample aggregation file is listed below:

Tucson Basin Red/Brown 
 17 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 28 29 
Other Decorated 
 10  1 41 51 53 61 62 68 69 81 89 
Buffwares 
  9  4 30 31 32 33 34 35 39 40 
Redwares 
 13  5 50 52001 52002 52003 52004 52005 52006 52007 52008 54 55 59 
Plainware 
  6  6 70 71 73 74 49 
Historic 
  1 98 
Other 
  1 99 
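Decoding one specification is a single division: values of 1000 or more encode row*1000+column (as in 52001 for row 52, column 1 in the Redwares set above), and bare values under 1000 name a whole row. A hypothetical sketch:

```python
def decode_spec(n):
    """Decode one aggregation specification number. Returns (row, col);
    col is None when the specification names a whole row (all of its
    valid columns), i.e. when n < 1000."""
    if n < 1000:
        return n, None              # bare row number: every valid column
    return n // 1000, n % 1000      # row*1000+column encoding
```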


Page Last Updated - 02-Jun-2007