
Unpacking COVID-19 publication research themes and urban indications (Part I)

Yuan Lai
June 19th, 2020 · 4 min read

The ongoing COVID-19 pandemic has deep and broad impacts worldwide, calling for research efforts across fields to tackle the uncertainty and urgency surrounding the novel virus. With many of its pathogenic and epidemiological characteristics and transmission patterns still unknown, the new virus (SARS-CoV-2) inevitably brings inconsistencies, discrepancies, and debates within the scientific community. Additionally, the rapid transmission and the large scale of the infected population require timely responses despite these uncertainties.

Processing COVID-19 manuscripts metadata

On March 17th, the White House Office of Science and Technology Policy launched the COVID-19 Open Research Dataset Challenge (CORD-19) in partnership with the Allen Institute for Artificial Intelligence (AI2), the Chan Zuckerberg Initiative, Microsoft Research, Georgetown University’s Center for Security and Emerging Technology, and the National Institutes of Health. Hosted on Kaggle, an online community of data scientists and machine learning experts owned by Google, the published dataset contains more than 29,000 research articles (over 13,000 with full text) on SARS-CoV-2 and COVID-19 clinical studies, public health response, population characteristics, and epidemiology. Each publication has been parsed into a separate JSON file with its metadata, authors’ information, abstract, and full manuscript.
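
As a minimal sketch of what reading one of these parsed files might look like, the snippet below pulls the title, author count, first author's country, and abstract from a single record. The field names (`paper_id`, `metadata`, `authors`, `affiliation`, `location`, `country`, `abstract`, `text`) are assumptions based on the commonly documented layout of the Kaggle dataset (often referred to as CORD-19), and the helper names `summarize_paper` and `load_papers` are hypothetical. It is not the pipeline used for this post.

```python
import json
from pathlib import Path


def summarize_paper(record):
    """Pull title, author count, first-author country, and abstract from one record."""
    metadata = record.get("metadata", {})
    authors = metadata.get("authors", [])
    country = None
    if authors:
        # Affiliation details are often missing, so every level is optional.
        affiliation = authors[0].get("affiliation") or {}
        location = affiliation.get("location") or {}
        country = location.get("country")
    # The abstract is assumed to be a list of paragraph objects with a "text" field.
    abstract = " ".join(p.get("text", "") for p in record.get("abstract", []))
    return {
        "paper_id": record.get("paper_id"),
        "title": metadata.get("title", ""),
        "n_authors": len(authors),
        "first_author_country": country,
        "abstract": abstract.strip(),
    }


def load_papers(folder):
    """Summarize every *.json file under `folder` (searched recursively)."""
    return [
        summarize_paper(json.loads(path.read_text(encoding="utf-8")))
        for path in sorted(Path(folder).rglob("*.json"))
    ]
```

Using `.get()` with defaults throughout keeps the loader from failing on records that lack an abstract or an affiliation, which is likely to be common in a collection of this size.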

One core challenge is using data science and machine learning to better collect, organize, and audit the surging volume of manuscripts. This data exploration aims to unpack the domains and progression of COVID-19-related studies by analyzing currently available research publications. We adopt both quantitative and qualitative methods from computer science and urban planning with the goal of stimulating interdisciplinary discussion and research collaboration and supporting more inclusive approaches to address some immediate and long-term problems. We first present a retrospective overview of COVID-19 research challenges and progression in the data science community through this Kaggle challenge. Text analysis and visualization then identify several key findings and critical factors that are highly relevant to COVID-19.

Quantifying COVID-19 research thematic structure

This study proceeds in the following steps. First, we explore the entire collection of PDF-parsed datasets to understand the COVID-19 research landscape. To do this, we establish a pipeline to process these publications’ metadata and abstracts from an extensive collection of JSON files (n=47,731) in a Python environment. Each file includes key information such as the title, number of authors, authors’ origin (country), and the full text of the abstract, extracted from its associated research article. We further clean the abstract text data (e.g., removing stop words, lemmatization, vectorization) to generate a descriptive summary of popular words (single word, bi-gram, and tri-gram), number of authors, and origin (by the first author’s country). Using latent Dirichlet allocation (LDA) topic modeling, we train a model with the processed texts to discover underlying topic groups and corresponding keywords.
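
The cleaning and n-gram counting step described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than the actual pipeline: the stop-word list is a tiny placeholder, lemmatization is omitted (it would normally come from an NLP library), and the function names `tokenize` and `ngram_counts` are hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was",
    "were", "with",
}


def tokenize(text):
    """Lowercase, keep alphanumeric tokens (internal hyphens allowed), drop stop words."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def ngram_counts(abstracts, n):
    """Count n-grams (n=1 words, n=2 bi-grams, n=3 tri-grams) across abstracts.

    N-grams are built after stop-word removal, so they may join words that
    were separated by a stop word in the original text.
    """
    counts = Counter()
    for text in abstracts:
        tokens = tokenize(text)
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts
```

Calling `ngram_counts(abstracts, 2).most_common(20)` would then list the twenty most frequent bi-grams. Keeping internal hyphens means a term such as "sars-cov-2" survives as a single token.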

We expect to identify not only major research interests but also a small subset of publications that may relate to urban science through this exploratory analysis. To find the latter subset, we filter abstract texts based on a list of urban science-related vocabulary, such as “urban planning”, “public health”, “environment”, “social distancing”, “transportation”, “mobility”, “housing”, “community”, and “race”. This process extracts a subset (n=3,914) from all manuscripts to further identify publications that may relate to cities and urban science. Analyzing urban-related articles will provide more detailed insights into current research interests and key findings relevant to planning, policy, and operation in cities. Using topic modeling, we quantify each article’s thematic composition with four components:

  1. Epidemiological research on infection prevention and control, including the effectiveness of different response strategies and public health measures, such as quarantine, community contact reduction, travel restriction, social distancing at school and workplaces, personal protective equipment (PPE), and public health digital surveillance. Popular terms representing this theme include “public”, “outbreak”, “pandemic”, “social”, “epidemic”, “spread”, “population”, “transmission”, “global”, “distancing”, “response”, etc. We consider this the most relevant theme for urban science.
  2. Virological research on SARS-CoV-2, including its genetic sequence, origin, evolution, and genomic differences by geography, transmission, incubation, mutation, and stability in various environments. This also includes material studies on viral shedding from humans (stool, urine, blood, nasal discharge), the persistence of the virus on different surface materials (e.g., copper, stainless steel, plastic), the virus’ susceptibility to cleaning or disinfecting agents, the physical science of the virus spread, and decontamination mechanics, as well as virus transmission patterns involving seasonality, environment (e.g., humidity, temperature), community spread, and asymptomatic transmission during incubation. Popular terms representing this theme include “cell”, “protein”, “virus”, “host”, “immune”, “intracellular”, “gene”, “antiviral”, “replication”, etc.
  3. Clinical trials and medical evidence for therapeutic interventions, including the efficacy of treatment, as well as diagnostic findings on infected patients and antibody testing. This also includes patient descriptions, virus incubation period, length of hospital stay, and asymptomatic likelihood. Studies regarding high-risk patient groups with a medical history and pre-existing conditions, such as hypertension, diabetes, heart disease, cardiovascular and cerebrovascular diseases, and respiratory diseases, are also included. Popular terms representing this theme include “sars”, “influenza”, “mers”, “patient”, “acute”, “clinical”, “pathogen”, “syndrome”, etc.
  4. Others represent miscellaneous themes besides the above three.
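
To make "thematic composition with four components" concrete, here is a toy collapsed Gibbs sampler for LDA written from scratch in plain Python. The post does not say which LDA implementation or hyperparameters were used, so everything here is an assumption for illustration: the function names `lda_gibbs` and `top_terms`, the values of `alpha`, `beta`, and `n_iter`, and the choice of Gibbs sampling itself. A real analysis on thousands of abstracts would more likely rely on an established library.

```python
import random


def lda_gibbs(docs, n_topics=4, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs: list of token lists. Returns (doc_topic, topic_word_counts, vocab),
    where doc_topic[d] is the topic mixture of document d and sums to 1.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    ndk = [[0] * n_topics for _ in docs]           # doc-topic counts
    nkw = [[0] * n_words for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                            # tokens per topic

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                ndk[d][k] -= 1
                nkw[k][wid] -= 1
                nk[k] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][wid] + beta)
                    / (nk[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                ndk[d][k] += 1
                nkw[k][wid] += 1
                nk[k] += 1

    # Smoothed per-document topic proportions.
    doc_topic = [
        [(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    return doc_topic, nkw, vocab


def top_terms(nkw, vocab, k, n=5):
    """Vocabulary terms most often assigned to topic k."""
    ranked = sorted(range(len(vocab)), key=lambda i: nkw[k][i], reverse=True)
    return [vocab[i] for i in ranked[:n]]
```

Each row of `doc_topic` is one article's composition across the four topics. The topics come out unlabeled, so mapping an index to a theme such as epidemiology or virology is presumably a manual step based on inspecting `top_terms` for each topic.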

The interactive dashboard enables users to quickly browse urban-related manuscripts in different countries, sorted by relevance to epidemiology. In the next post, we will discuss two questions related to urban science:

  1. How do epidemiological research findings on virus transmission and community spread indicate the new norm of urban life?
  2. How does social science contribute to COVID-19 research, especially when addressing unexpected conflicts and controversies involving socioeconomic equity, environmental justice, data ethics, and policy fairness?
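
The urban vocabulary filter and the dashboard ordering described earlier (urban-related manuscripts per country, sorted by relevance to epidemiology) can be sketched as follows. The vocabulary is the one quoted in this post; the matching rule, the record fields (`abstract`, `country`, `themes`), and the function names are hypothetical, chosen only for illustration.

```python
import re
from collections import defaultdict

# Vocabulary quoted in the post; the matching rule below is an assumption.
URBAN_TERMS = [
    "urban planning", "public health", "environment", "social distancing",
    "transportation", "mobility", "housing", "community", "race",
]

# A leading word boundary keeps "race" from matching inside "trace",
# while still letting "environment" match "environmental".
URBAN_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in URBAN_TERMS) + r")",
    re.IGNORECASE,
)


def is_urban_related(abstract):
    """True if any urban term starts a word somewhere in the abstract."""
    return URBAN_PATTERN.search(abstract) is not None


def rank_by_country(papers, theme="epidemiology"):
    """Group urban-related papers by country, most theme-relevant first."""
    grouped = defaultdict(list)
    for paper in papers:
        if is_urban_related(paper["abstract"]):
            grouped[paper["country"]].append(paper)
    return {
        country: sorted(group, key=lambda p: p["themes"][theme], reverse=True)
        for country, group in grouped.items()
    }
```

Broad terms such as "community" and "environment" will still match many abstracts with no urban focus, so a filter like this is better read as a first pass than as a precise classifier.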

Since the beginning of the pandemic, we have witnessed debates and mistrust in science, revealing injustice, biases, and uncertainty in treatment, testing, policy, and public services. In MIT Course 11-6 (Urban Science and Planning with Computer Science), we believe in the importance of addressing broader social, environmental, and political challenges at an urban scale through both technology and planning methods. In Part II, we will further discuss how cities and urban science experts can integrate scientific insights with action and further contribute to collective research, and we will note the impact of potential missing data on under-represented population groups. We hope this data visualization can support researchers who are interested in cities and further connect scientific insights with local community actions.

© 2020 MIT Civic Data Design Lab