Speaker
Details
Many researchers are interested in conducting statistical analysis of large-scale data contained in scanned books or other documents. While OCR technology is increasingly capable of producing text versions of scanned documents, its unstructured (and often typo-ridden) products can be challenging to analyze. This talk will discuss how to digitize scanned PDF documents into computer-readable tabular data, focusing on how to use R to postprocess OCR output.
I have developed a software framework called formatted optical character recognition (fOCR) that transforms scanned documents into a computer-readable database. The talk will give an overview of the code and most of the data, which comprise early 20th-century university registers and other administrative records.
*Lunch will be provided!* Please RSVP below by Monday, October 23rd, so we can order enough food!
Please contact Kim Kreiss ([email protected]), DDSS Grad Fellow, or Angela Li ([email protected]), 4th-year Social Policy and Sociology PhD, with any questions.