In an effort to jump-start use of the online data and to promote collaborative map-making, the NEH-funded digitization team at Columbia has sought to produce a machine-readable version of the printouts. Abbyy FineReader OCR (optical character recognition) software was used for this purpose. The pages were first zoned and analyzed to identify the tables of data on each page, and the text in each of the series was then subjected to a few hours of training to enhance accuracy. After full machine reading and some initial cleanup, all of the pages were exported to Excel spreadsheets. Those spreadsheets have been subjected to various cleanup processes, and we will continue to work on their further refinement. In their current state, they give a good sense of the range of answers provided to a given question, but all will require verification against the images of the originals before serving as the basis for solid research. We will be making the sets of answers to specific questions available in the LCAAJResearch collection (described elsewhere in this guide), in the form of editable Google spreadsheets, as researchers express interest in those questions. This staged upload will allow us to concentrate our cleanup on areas of the most interest, as well as to take advantage of improvements in the overall accuracy of the data set over time. It is our hope that a growing set of collectively edited spreadsheets will make possible increasingly powerful use of this data.
For the time being, it is possible to download searchable PDFs of the printouts in their uncorrected state. They will be available as zipped folders containing individual page images only. Once unzipped, the pages can be combined into individual volumes using software such as Adobe Acrobat that can merge PDFs. We hope, in the near future, to replace the individual page images with complete volumes. Please note that these PDFs represent images of the page with the OCRed text embedded beneath them.
In the meantime, if you are interested in OCRed printouts for specific datasets, please contact us at email@example.com. We will be posting printouts for individual questions upon request.
In the late 1960s and early 1970s, about half of the data collected by the project was transferred onto punch cards and read onto computer tapes to create lists that could facilitate easy entry of answers onto maps. Those printouts were created in four separate batches, which were not combined into a single file (probably because of the limitations of computing at the time). The original tapes have disappeared, and would probably not be readable today in any case. The output of those batches remains, however, in four separate series, each the printed record of a batch of answers that was input to a computer at some point in that period. (Their names describe their appearance on the shelves of the former LCAAJ workroom in Philosophy Hall -- the "dot" series had colored dots on their spines, and the "Black Binder" was contained in a series of black binders.)
In general, depending on whether your question is found in the Eastern questionnaire, the Western questionnaire, or both, you will need to use different groups of printouts:
A list of links to the pages in each of the printouts where the answers to specific questions begin is available at the bottom of this page.
As illustrated below, the data in the printouts is arranged in columns. PAGQUNO, INTVN, and RESPONSES are likely to be of primary interest, but the meanings of all the columns are described briefly below.
PAR (Paragraph): Not used in most printouts, except for a few in the Black Binder series in volumes 194 and 195. This column was designed to allow answers to be sorted according to Linguistic Topic. (Use of the volumes containing these PAR sorts is not currently supported by this guide, but may be incorporated into a future update.)
LOCREF (Location Reference): Indicates a location other than the interviewee's location to which all or part of this answer applies.
PAGQUNO (Page and Question Number): This seven-digit string consists of four parts. The first three digits represent the page of the questionnaire, the next two the number of the question on that page, and the next digit a subquestion, if any (with a value of zero if there is no subquestion). The final digit, starting from zero, records multiple answers from the specific interviewee, who is identified in the next column. (The PAGQUNO in the questionnaire itself is only 6 digits in length (9 in the Western Questionnaire), since it lacks this final digit.)
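The layout described above can be sketched in a few lines of Python, for readers who want to split PAGQUNO values taken from the spreadsheets. This is only an illustration of the 3 + 2 + 1 + 1 digit scheme: the names `Pagquno` and `parse_pagquno` are hypothetical, and the sketch assumes the value has already been corrected to exactly seven digits, which uncorrected OCR output may not be.

```python
from typing import NamedTuple


class Pagquno(NamedTuple):
    """The four parts of a seven-digit PAGQUNO (hypothetical helper type)."""
    page: int          # digits 1-3: page of the questionnaire
    question: int      # digits 4-5: question number on that page
    subquestion: int   # digit 6: subquestion, 0 if there is none
    answer_index: int  # digit 7: counts multiple answers, starting from 0


def parse_pagquno(value: str) -> Pagquno:
    """Split a printout PAGQUNO string into its four parts."""
    digits = value.strip()
    # Assumption: a valid value is exactly seven ASCII digits.
    if len(digits) != 7 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected a seven-digit PAGQUNO, got {value!r}")
    return Pagquno(
        page=int(digits[0:3]),
        question=int(digits[3:5]),
        subquestion=int(digits[5]),
        answer_index=int(digits[6]),
    )


# First recorded answer to question 03 on page 190, no subquestion.
first_answer = parse_pagquno("1900300")
```

Raising an error on anything that is not seven digits is a deliberate choice here: OCR misreads are better caught and checked against the page images than silently parsed.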
Note that while the master Western Questionnaire lists questions in the form of 9 digits, the first three digits representing the page in the Western Questionnaire on which a question occurs, followed by the actual question number, which may reflect an Eastern question to which it corresponds, these are presented slightly differently in the printouts. Thus, for example, the master Questionnaire at EYDES and here lists a question on page 245 about the last meal before a fast as 245190030, since it corresponds to question 190030 in the Eastern Questionnaire, but appears on page 245 of the Western Questionnaire. Conversely, another question on that page asking whether Christians attended their synagogue is listed as 245245015, since it is unique to that page in the Western Questionnaire and has no correspondence in the Eastern Questionnaire. In the printouts, the first three digits do not appear, so the questions in the examples above would be listed as 190030 or 245015 (followed, of course, by an additional digit indicating multiple answers by the same respondent) and be sorted accordingly in the Red Dot series.
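The conversion from a master Western Questionnaire number to the form found in the printouts can be sketched as follows. The function names are hypothetical, and the second helper rests on an assumption drawn only from the two examples above: that a question unique to the Western Questionnaire repeats its page number in digits 4 to 6.

```python
def western_to_printout_key(western_code: str) -> str:
    """Drop the leading three page digits of a 9-digit Western number."""
    code = western_code.strip()
    if len(code) != 9 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a nine-digit number, got {western_code!r}")
    # The printouts omit the Western page prefix and keep the last six digits.
    return code[3:]


def appears_unique_to_western(western_code: str) -> bool:
    """Guess whether a question has no Eastern counterpart.

    Assumption (inferred from the two examples only): such questions
    repeat the Western page number as the first three digits of the
    question number, as in 245245015.
    """
    code = western_code.strip()
    if len(code) != 9 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a nine-digit number, got {western_code!r}")
    return code[:3] == code[3:6]


fast_meal_key = western_to_printout_key("245190030")
synagogue_key = western_to_printout_key("245245015")
```

In the printouts themselves, each six-digit key is followed by the additional digit that counts multiple answers from the same respondent.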
INTVN (Interviewee Number): This is the identifier for the interviewee, providing, as noted elsewhere, a rough approximation of the latitude and longitude of the locality he or she represents.
RESPONSES: This column contains the answer(s), including the actual words of the respondent along with some qualifying prefixes and suffixes describing how the answer was obtained and any additional commentary the interviewee may have offered about its usage. (For information about the special transcription scheme used to record the answer, as well as a list of the accompanying codes, see TRANSCRIPTION and NOTATION.)
***: An asterisk in this column appears to indicate that the answer to this question is continued on the line below.