Panel Dataset of Criminal Group Presence in Mexican Municipalities
Research Question: Research on organized crime in Mexico and elsewhere is limited by a dearth of systematic, high-frequency, and high-resolution information on where criminal groups operate. We are creating a panel dataset of criminal group presence in Mexico with annual frequency (1990–2020), at the municipal level (2,500 municipalities), tracking individual criminal groups (rather than generic “organized crime presence”), which fragmented from 6–8 large drug trafficking organizations into dozens of independent groups during the period of interest. The methodological question is how to produce such a large dataset, which requires local information spanning many years and places.
Data: Data collection proceeded in two stages with different modes. In the first stage, qualitative work produced a dictionary of actors (identifying 79 groups, each potentially known by multiple names) and a dictionary of place names (not only the official names of municipalities, the equivalent of U.S. counties, but also alternative names by which they are known and the names of cities and localities linked to them). The main sources for this investigation were U.S. and Mexican government agency reports, reports by experts, and magazines and blogs specializing in tracking organized criminal activity in Mexico.
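The two dictionaries can be pictured as alias tables mapping each canonical name to the set of alternative names under which it appears in sources. The sketch below is purely illustrative: the entries, structure, and the `canonicalize` helper are our assumptions, not the project's actual data or code.

```python
# Illustrative alias tables: canonical name -> set of known alternative names.
# All entries are examples, not the project's actual dictionaries.
GROUP_ALIASES = {
    "Cartel de Sinaloa": {"Cartel de Sinaloa", "Cartel del Pacifico"},
    "Cartel de Tijuana": {"Cartel de Tijuana", "Cartel de los Arellano Felix"},
}

# Place dictionary: official municipality name -> alternative names and
# names of cities/localities linked to that municipality.
PLACE_ALIASES = {
    "Tijuana": {"Tijuana", "Playas de Tijuana"},
    "Culiacan": {"Culiacan", "Culiacan Rosales"},
}

def canonicalize(name, alias_table):
    """Return the canonical key whose alias set contains `name`, or None."""
    for canonical, aliases in alias_table.items():
        if name in aliases:
            return canonical
    return None
```

A lookup such as `canonicalize("Cartel del Pacifico", GROUP_ALIASES)` would then resolve an alternative name back to its canonical group, which is what allows mentions under different names to be counted as the same actor.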
In the second stage, we searched Google and Google News for combinations of exact terms (for example, “Cartel de Sinaloa” + “Tijuana” for a given five-year period) based on our dictionaries of group and place names, and saved the URLs from the search results. We then extracted the main text from each of 2.9 million unique and valid URLs, and will keep only the documents whose main text still contains the terms of interest, thus eliminating “false positives” in which search hits were based on text in ads, related articles, and the like. The resulting text comes mostly from national and local newspapers, but also from press releases by governments and authorities at various levels and from blog and social media posts.
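The false-positive filter described above amounts to a simple check: a scraped document is kept only if its extracted main text still contains both the group term and the place term that generated the search hit. A minimal sketch, with function names and the accent-stripping normalization as our own assumptions:

```python
import unicodedata

def normalize(text):
    """Lowercase and strip accents so that, e.g., 'Culiacán' matches 'Culiacan'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def keep_document(main_text, group_term, place_term):
    """True only if both exact search terms appear in the extracted main text.

    Documents where the terms appeared only in ads, related-article widgets,
    etc. (and hence are absent from the extracted main text) are discarded.
    """
    body = normalize(main_text)
    return normalize(group_term) in body and normalize(place_term) in body
```

For instance, a page whose extracted body mentions Tijuana but never the group term would be dropped as a false positive.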
Methods: We will train an algorithm to classify sentences according to whether or not they indicate the presence of a given group in a given place. We will first deploy BERT (Bidirectional Encoder Representations from Transformers), an open-source natural language processing (NLP) technique developed by Google that pre-trains language models on large unlabeled corpora, has been shown to capture deep language context, and has been used for tasks such as question answering and natural language inference. BERT, which is available for multiple languages, will thus “learn” from the Spanish-language text corpus we have produced and, we expect, will then classify text chunks more accurately than simpler techniques have. This stage requires the team to hand-code training and validation subsets; we expect to hand-code at least 10,000 sentences.
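The hand-coding step feeds the classifier as follows: labeled sentences (say, 1 = indicates the group's presence in the place, 0 = does not) are shuffled and split into training and validation subsets before fine-tuning. The sketch below shows only this data-preparation step; the split ratio, seed, and label scheme are illustrative assumptions, not the project's protocol.

```python
import random

def train_val_split(labeled_sentences, val_fraction=0.2, seed=42):
    """Shuffle (sentence, label) pairs and split off a validation subset.

    labeled_sentences: iterable of (text, label) pairs produced by hand-coding.
    Returns (train, validation) lists; the seed makes the split reproducible.
    """
    rng = random.Random(seed)
    data = list(labeled_sentences)
    rng.shuffle(data)
    n_val = int(len(data) * val_fraction)
    return data[n_val:], data[:n_val]
```

The training subset would then be used to fine-tune a sequence-classification head on a Spanish or multilingual BERT checkpoint (for example via the Hugging Face `transformers` library), with the validation subset held out to measure classification accuracy; the choice of checkpoint and library here is our assumption.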
Challenges: There was no existing universal list of actors to follow, but rather snapshots of data from dozens of sources, which produced an initial list of over 300 group or faction names. Sorting through conflicting information and tracing the evolving names and affiliations of groups required time-consuming archival work and the triangulation of multiple open sources.