Collecting and Processing Historical Documents with R

Date
Oct 24, 2023, 12:15 pm - 1:15 pm

Speaker

Details

Event Description

Many researchers are interested in conducting statistical analysis of large-scale data contained in scanned books or other documents. While OCR technology is increasingly capable of producing text versions of scanned documents, its unstructured (and often typo-ridden) products can be challenging to analyze. This talk will discuss how to digitize scanned PDF documents into computer-readable tabular data, focusing on how to use R to postprocess OCR output.

I have developed a software framework called formatted optical character recognition (fOCR) to transform scanned documents into a computer-readable database. The talk will give an overview of the code and most of the data, which comprise early 20th-century university registers and other administrative records.

*Lunch will be provided!* Please RSVP below by Monday, October 23rd, so we can order enough food!

Please contact Kim Kreiss ([email protected]), DDSS Grad Fellow, or Angela Li ([email protected]), 4th-year Social Policy and Sociology PhD, with any questions.

Sponsor
Initiative for Data-Driven Social Science