Follow the Money: Tracking Monetization of COVID-Misinformation
Research Question: Are the monetization efforts of coordinated anti-vax groups on the internet (especially social media pages) greater than those of other groups (uncoordinated anti-vax, coordinated pro-vax, and uncoordinated pro-vax)? For our research, an effort is coordinated when multiple pages and groups push a common agenda by referencing each other in their posts, effectively forming a ‘network’ that a user navigates by clicking from one page to the next. As they coordinate, these pages and groups try to earn as many clicks and dollars as they can. An effort is uncoordinated when a single person or group pushes an agenda without any outside influence.
Data: Using an approach developed by our collaborators at GW, Facebook groups are classified along 2 dimensions: (1) how they operate – coordinated or uncoordinated; and (2) their views about vaccination – anti-vax or pro-vax. Combined, these yield 4 distinct groups: coordinated anti-vax, coordinated pro-vax, uncoordinated anti-vax, and uncoordinated pro-vax. We have a list of ~3,000 web domains to which these groups send their readers. A domain refers to the address of the ‘root’ page (e.g. https://esoc.princeton.edu/), whereas the pages or webpages of a domain are its ‘child’ pages (e.g. https://esoc.princeton.edu/projects/trackingdisinformation-and-conflict); each domain has tens of thousands of webpages. We measure how much a domain and its pages try to monetize their users using several proxy metrics, which are also calculated for domains and their pages over time. By comparing the distributions of these proxies across the 4 groups, we will answer our research question.
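As a minimal sketch of the domain/page distinction and the group comparison described above, the Python below maps a page URL to its root domain and summarizes a per-domain proxy by group. The function names, the input dictionaries, and the use of the median as the summary statistic are illustrative assumptions, not the project's actual pipeline.

```python
from collections import defaultdict
from statistics import median
from urllib.parse import urlsplit


def root_domain(page_url: str) -> str:
    """Return the 'root' address (scheme and host) of a page URL."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc.lower()}/"


def group_medians(domain_scores: dict, domain_groups: dict) -> dict:
    """Summarize a per-domain proxy by group.

    domain_scores maps a domain to its proxy value (hypothetical input);
    domain_groups maps a domain to one of the 4 group labels.
    The median is an assumed summary; any distributional comparison could be used.
    """
    by_group = defaultdict(list)
    for domain, score in domain_scores.items():
        by_group[domain_groups[domain]].append(score)
    return {group: median(scores) for group, scores in by_group.items()}
```

For example, root_domain applied to the child page given above returns https://esoc.princeton.edu/, so every page of a domain can be keyed to the same root before averaging.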
Methods: We explore multiple proxies for monetization efforts:
1. The number of ads and sponsored content items on a page – eventually averaged for the domain.
2. The percentage of page area occupied by ads vs. text and non-ad images on pages of the domain – i.e. how much physical area of the webpage is covered by ads – eventually averaged for the domain.
3. The number of times the words ‘Donate,’ ‘Contribute,’ or ‘Donation’ appear across all webpages of the domain – eventually averaged for the domain.
4. The number of monetization technologies used to build the domain – in particular for shipping, payment, advertising, and marketing.
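To illustrate how a proxy such as the third one might be computed, here is a hedged Python sketch that counts the three donation words in the visible text of a page and averages the counts over a domain's pages. It uses only the standard library; the names count_donation_words and domain_average, the choice to skip script and style content, and the whole-word, case-insensitive matching are all assumptions made for this example.

```python
import re
from html.parser import HTMLParser
from statistics import mean

# Whole-word, case-insensitive match; plurals such as "donations" are not counted.
KEYWORDS = re.compile(r"\b(donate|contribute|donation)\b", re.IGNORECASE)


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_donation_words(html: str) -> int:
    """Count keyword occurrences in the visible text of one page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return len(KEYWORDS.findall(" ".join(parser.chunks)))


def domain_average(pages_html: list) -> float:
    """Average the per-page keyword count over all pages of one domain."""
    return mean(count_donation_words(html) for html in pages_html)
```

Restricting the count to visible text is a design choice for this sketch: it avoids counting the keywords when they only appear in embedded scripts or stylesheets.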
Challenges: One of the major challenges is curating the multiple proxies for monetization. Our list covers the content of >15 million webpages spanning several months, so scraping and processing them to extract monetization metrics on personal systems is challenging. The work involves coding more akin to full-stack development than to standard data science.
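One way to keep such a workload manageable on a personal system is to aggregate per-page metrics as they are produced rather than holding every page in memory. The Python sketch below shows this idea under assumed inputs: records is a hypothetical iterable of (domain, value) pairs, for example one pair per scraped page, and the function name is illustrative.

```python
from collections import defaultdict


def streaming_domain_means(records) -> dict:
    """Compute per-domain averages in one pass over (domain, value) pairs.

    Only a running total and a count are kept per domain, so memory use
    grows with the number of domains, not the number of pages.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for domain, value in records:
        totals[domain] += value
        counts[domain] += 1
    return {domain: totals[domain] / counts[domain] for domain in totals}
```

Because records can be a generator that reads pages from disk one at a time, the same function works whether the input is a few test pages or the full crawl.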