Research Guides: Language and Culture Archive of Ashkenazic Jewry Digital Archive User Guide: OCRed Printout Resources

OCRed Printouts

Location in LCAAJ Printouts of Answers to Specific Questions

In an effort to jump-start use of the online data and to promote collaborative map-making, the NEH-funded digitization team at Columbia has sought to produce a machine-readable version of the printouts. Abbyy FineReader OCR (optical character recognition) software was used for this purpose. The pages were first zoned and analyzed to identify the tables of data on each page, and the software was then trained for a few hours on the text in each of the series to enhance accuracy. After full machine reading and some initial cleanup, all of the pages were exported to Excel spreadsheets. Those spreadsheets have been subjected to various cleanup processes, and we will continue to work on their further refinement. In their current state, they give a good sense of the range of answers provided to a given question, but all will require verification against the images of the originals before serving as the basis for solid research. We will be making the sets of answers to specific questions available in the LCAAJResearch collection (described elsewhere in this guide), in the form of editable Google spreadsheets, as researchers express interest in those questions. This staged upload will allow us to concentrate our cleanup on the areas of most interest, as well as to improve the overall accuracy of the data set over time. It is our hope that a growing set of collectively edited spreadsheets will make possible increasingly powerful use of this data.

For the time being, it is possible to download searchable PDFs of the printouts in their uncorrected state. They are available as zipped folders containing individual page images only. Once unzipped, the pages can be combined into individual volumes using software such as Adobe Acrobat that can merge PDFs. We hope, in the near future, to replace the individual page images with complete volumes. Please note that these PDFs are images of each page with the OCRed text embedded beneath them.
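For readers who prefer a script to Adobe Acrobat, the unzip-and-combine step might be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and the file names in the usage note are hypothetical, the page files are assumed to carry their page number in the file name, and the merge step relies on the third-party pypdf library (version 3 or later), which is not part of this guide.

```python
import re
import zipfile
from pathlib import Path


def natural_key(path):
    """Sort key that orders 'page_2.pdf' before 'page_10.pdf'."""
    # Splitting on digit runs yields alternating text and number parts.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", path.name)]


def extract_page_pdfs(zip_path, out_dir):
    """Unzip a folder of page PDFs and return their paths in page order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    # Assumption: page order can be recovered from the file names.
    return sorted(out_dir.rglob("*.pdf"), key=natural_key)


def combine_pages(page_paths, volume_path):
    """Merge the page PDFs into a single volume, in the order given."""
    # Assumption: the third-party pypdf package (3.x or later) is installed.
    from pypdf import PdfWriter

    writer = PdfWriter()
    for page in page_paths:
        writer.append(str(page))
    writer.write(str(volume_path))
    writer.close()
```

With a downloaded archive saved as, say, printouts.zip (a placeholder name), calling extract_page_pdfs("printouts.zip", "pages") and passing the result to combine_pages(..., "volume.pdf") would produce one PDF per volume. It is worth spot-checking the page order in the result, since the sort depends entirely on how the page files are named.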

In the meantime, if you are interested in OCRed printouts for specific datasets, please contact us at lcaaj@library.columbia.edu. We will be posting printouts for individual questions upon request.