Computational Analysis of Police Body-Worn Camera Footage
Research Question: What happens during police-civilian interactions? This ongoing work seeks to develop computational methods to automatically generate summaries of police-civilian interactions captured by body-worn cameras. Decades of policing research have relied on self-reported accounts of police-civilian interactions written by police officers. Body-worn camera footage represents a valuable source of data that has gone largely unused due to the time and cost needed to manually review thousands of hours of footage. Our automated system will streamline this process and offer a new data source to study classic questions relating to police oversight, the dynamics of violent escalation, and racial bias in policing.
Data: Body-worn camera footage from the Chicago Police Department (made publicly available by the Chicago Office of Police Accountability); body-worn camera footage from two municipal police agencies acquired through data sharing agreements; body-worn camera footage from additional agencies once an alpha version of the automated system is operational.
Methods: The first step is developing and deploying efficient audio and video annotation interfaces. These interfaces allow rapid spatio-temporal coding of key activities in the video, partially automating the process to minimize the human effort required.
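To make the coding task concrete, a spatio-temporal annotation can be represented as a record pairing an activity label with a time span and an image region. The sketch below is illustrative only; the field names and the example label are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """Hypothetical spatio-temporal annotation record for one coded activity."""
    video_id: str    # identifier of the body-worn camera clip
    activity: str    # coded activity label (example label, not from the codebook)
    start_sec: float # time the activity begins, in seconds
    end_sec: float   # time the activity ends, in seconds
    bbox: tuple      # (x, y, width, height) region of interest, normalized to 0-1

    def duration(self) -> float:
        return self.end_sec - self.start_sec

a = Annotation("clip_001", "verbal command", 12.5, 18.0, (0.2, 0.1, 0.4, 0.6))
```

Storing both the temporal extent and the spatial region lets the same record serve as training data for the recognition models in the next step.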
The second step is designing and training computer vision and speech analysis models that learn from the annotated data and automatically recognize target activities in new videos. These models make it possible to extract timelines of key events from footage at an unprecedented scale.
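One simple way to turn per-frame model predictions into an event timeline is to collapse runs of identical labels into (label, start, end) intervals. The function below is a minimal sketch under assumed labels and frame rate, not the project's actual pipeline.

```python
def frames_to_timeline(labels, fps=30):
    """Collapse a per-frame label sequence into (label, start_sec, end_sec) events."""
    events = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the sequence or on a label change.
        if i == len(labels) or labels[i] != labels[start]:
            events.append((labels[start], start / fps, i / fps))
            start = i
    return events

# Hypothetical predictions: 2 s "idle", 3 s "pursuit", 1 s "idle" at 30 fps.
preds = ["idle"] * 60 + ["pursuit"] * 90 + ["idle"] * 30
timeline = frames_to_timeline(preds, fps=30)
# timeline → [("idle", 0.0, 2.0), ("pursuit", 2.0, 5.0), ("idle", 5.0, 6.0)]
```

The resulting interval lists are the "timelines of key events" that feed the sequence analysis in the third step.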
The third step is computational analysis of the extracted timelines. Long Short-Term Memory (LSTM) recurrent neural networks, which are well suited to path-dependent sequence data, will be used to extract patterns of vocal and physical actions and reactions.
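The path dependence that makes LSTMs a natural fit comes from their recurrent hidden and cell states, which carry context forward through the sequence. The sketch below implements one standard LSTM cell step with random placeholder weights, purely to illustrate the mechanism; it is not the trained model used in this project.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One standard LSTM time step; W, U, b stack the input/forget/cell/output gates."""
    z = W @ x + U @ h + b                  # stacked gate pre-activations, shape (4H,)
    H = h.shape[0]
    i = 1 / (1 + np.exp(-z[:H]))           # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))        # forget gate
    g = np.tanh(z[2*H:3*H])                # candidate cell update
    o = 1 / (1 + np.exp(-z[3*H:]))         # output gate
    c_new = f * c + i * g                  # cell state mixes old memory and new input
    h_new = o * np.tanh(c_new)             # hidden state exposed to the next step
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 8, 4                                # assumed input and hidden sizes
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(10):                        # run the cell over a 10-step dummy sequence
    x_t = rng.normal(size=D)
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Because the cell state is updated multiplicatively by the forget gate rather than overwritten, the network can retain or discard earlier events in a timeline, which is what "path-dependent" means in practice here.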
A number of planned extensions build on these preliminary technical stages, including field tests of the system in randomized controlled trials to measure its impact on the supervisory process within police agencies.
Challenges: Modern computer vision activity-understanding systems are developed primarily for third-person views, making egocentric body-worn camera footage particularly challenging for existing models. Detecting the target events requires understanding fine-grained activities in noisy environments from an egocentric viewpoint, and further requires multi-modal reasoning that integrates audio, language, and visual cues. In addition, obtaining manual annotations of real-world police data is challenging not only because of the sheer scale of the data but also because of the need to minimize potential adverse mental health effects on annotators.