Analyzing COVID-19 by Numbers in Countries

A visual approach

by: Ilyas Ustun

Welcome to my analysis of COVID-19, in which I use visualization to show the current situation and how quickly the disease is spreading in selected countries.

The data for this analysis was obtained from Kaggle (https://www.kaggle.com/c/covid19-global-forecasting-week-3/data) and is curated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE).

In [1]:
# Make plots wider 
options(repr.plot.width=10, repr.plot.height=5, repr.plot.res=250)
In [28]:
## Importing packages

library(tidyverse) # metapackage with lots of helpful functions
In [3]:
# list.files(path = "covid19-global-forecasting-week-3")
In [4]:
df = read_csv("covid19-global-forecasting-week-3/train.csv");
# df_tst = read_csv("../input/covid19-global-forecasting-week-2/test.csv")
# df_sub = read_csv("../input/covid19-global-forecasting-week-2/submission.csv")
Parsed with column specification:
cols(
  Id = col_double(),
  Province_State = col_character(),
  Country_Region = col_character(),
  Date = col_date(format = ""),
  ConfirmedCases = col_double(),
  Fatalities = col_double()
)

In [5]:
# df %>% str()

Sample data:

In [6]:
df %>% head()
A tibble: 6 × 6
   Id  Province_State  Country_Region  Date        ConfirmedCases  Fatalities
<dbl>  <chr>           <chr>           <date>               <dbl>       <dbl>
    1  NA              Afghanistan     2020-01-22               0           0
    2  NA              Afghanistan     2020-01-23               0           0
    3  NA              Afghanistan     2020-01-24               0           0
    4  NA              Afghanistan     2020-01-25               0           0
    5  NA              Afghanistan     2020-01-26               0           0
    6  NA              Afghanistan     2020-01-27               0           0
In [7]:
df %>% summary()
       Id        Province_State     Country_Region          Date           
 Min.   :    1   Length:22950       Length:22950       Min.   :2020-01-22  
 1st Qu.: 8170   Class :character   Class :character   1st Qu.:2020-02-09  
 Median :16356   Mode  :character   Mode  :character   Median :2020-02-28  
 Mean   :16356                                         Mean   :2020-02-28  
 3rd Qu.:24541                                         3rd Qu.:2020-03-18  
 Max.   :32710                                         Max.   :2020-04-05  
 ConfirmedCases     Fatalities      
 Min.   :     0   Min.   :    0.00  
 1st Qu.:     0   1st Qu.:    0.00  
 Median :     0   Median :    0.00  
 Mean   :   702   Mean   :   31.65  
 3rd Qu.:    59   3rd Qu.:    0.00  
 Max.   :131646   Max.   :15887.00  
In [8]:
# df$Country_Region %>% unique()

Let's get the daily total cases and fatalities for each country.

In [9]:
df_daily = df %>%
    group_by(Country_Region, Date) %>%
    # Sum over provinces/states to get one row per country and date
    summarize(ConfirmedCases = sum(ConfirmedCases), Fatalities = sum(Fatalities)) %>% 
    ungroup()
In [29]:
library(scales)
library(RColorBrewer)
In [11]:
df_daily_nz = df_daily %>% 
    filter(ConfirmedCases > 0) %>%
#     mutate(ConfirmedCases = ifelse(ConfirmedCases==0, 1,ConfirmedCases)) %>%
    group_by(Country_Region) %>%
    arrange(Date, `.by_group` = TRUE)
In [12]:
df_daily_nz = df_daily_nz %>%
    group_by(Country_Region) %>%
    mutate(NumberDays = rank(Date)) %>%
    ungroup()
In [13]:
df_daily_nz %>% filter(Country_Region=='US')
A tibble: 27 × 5
Country_Region  Date        ConfirmedCases  Fatalities  NumberDays
<chr>           <date>               <dbl>       <dbl>       <dbl>
US              2020-03-10             892          28           1
US              2020-03-11            1214          36           2
US              2020-03-12            1596          40           3
US              2020-03-13            2112          47           4
US              2020-03-14            2658          54           5
US              2020-03-15            3431          63           6
US              2020-03-16            4565          85           7
US              2020-03-17            6353         108           8
US              2020-03-18            7715         118           9
US              2020-03-19           13608         200          10
US              2020-03-20           19025         244          11
US              2020-03-21           25414         307          12
US              2020-03-22           33663         426          13
US              2020-03-23           43586         551          14
US              2020-03-24           53659         705          15
US              2020-03-25           65701         941          16
US              2020-03-26           83759        1208          17
US              2020-03-27          101580        1578          18
US              2020-03-28          121326        2023          19
US              2020-03-29          140734        2464          20
US              2020-03-30          161655        2975          21
US              2020-03-31          188018        3870          22
US              2020-04-01          213214        4753          23
US              2020-04-02          243295        5922          24
US              2020-04-03          275426        7083          25
US              2020-04-04          308690        8403          26
US              2020-04-05          336912        9615          27

Above we can see the data for the US only. The ConfirmedCases and Fatalities columns hold cumulative counts.

The US series is odd: its first non-zero data point already shows 892 cases, as if the data from earlier days had been lost. Or perhaps that many cases really were reported in a single day. No clue!

In [14]:
countries = c('US', 'Turkey', 'Germany', 'Italy', 'Spain', 'Netherlands', 'Belgium', 'Canada')
In [15]:
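# Rank countries within each NumberDays value, i.e. compare countries at the
# same number of days since their first confirmed case (rank 1 = most cases).
df_daily_nz = df_daily_nz %>%
    group_by(NumberDays) %>%
    arrange(desc(ConfirmedCases)) %>%
    mutate(CountryRank = rank(desc(ConfirmedCases))) %>%
    ungroup()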
In [16]:
# df_daily_nz %>% head()
In [17]:
library(RColorBrewer)
darkcols <- brewer.pal(n=8, "Dark2")
palette = c('#d6d6c2', darkcols)
# palette
In [18]:
# print(palette)

The countries selected for this analysis are listed below.

In [19]:
countries
  1. 'US'
  2. 'Turkey'
  3. 'Germany'
  4. 'Italy'
  5. 'Spain'
  6. 'Netherlands'
  7. 'Belgium'
  8. 'Canada'
In [20]:
# df_daily_nz %>% 
#     mutate(Highlight=if_else(Country_Region %in% countries, Country_Region, "1_other")) %>%
# #     filter(Country_Region %in% countries) %>% 
#     ggplot(aes(x=NumberDays, y=ConfirmedCases, color=Highlight, group=Country_Region)) +
#     geom_line(size=1, alpha=0.8) +
#     scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
#                                         labels = trans_format("log10", math_format(10^.x))) +
#     scale_x_continuous(name='Number of Days', breaks = seq(0, 70, 4)) +
#     scale_color_manual(values = palette) +
#     theme(axis.text.x = element_text(angle=90))

Ranking of countries by cumulative number of cases each day

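The following plot shows how quickly these countries climbed into the top 10 in the world by cumulative number of confirmed cases. Note that CountryRank is computed within each NumberDays value, so a country's rank compares it with other countries at the same number of days since their first confirmed case. The Netherlands and Turkey seem to reach the top 10 quite quickly.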
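
To make the ranking choice concrete, here is a minimal sketch on made-up numbers for two hypothetical countries, A and B. The values and the names toy, by_date and by_day are illustrative assumptions, not part of the dataset. It contrasts ranking within each calendar date with ranking within each NumberDays value, which is the grouping used in In [15] above.

```r
library(dplyr)

# Toy cumulative counts (made-up values); country B reports its first case a day later.
toy <- tibble(
  Country_Region = c("A", "A", "A", "B", "B"),
  Date = as.Date(c("2020-03-01", "2020-03-02", "2020-03-03",
                   "2020-03-02", "2020-03-03")),
  ConfirmedCases = c(10, 20, 40, 50, 300)
) %>%
  group_by(Country_Region) %>%
  mutate(NumberDays = rank(Date)) %>%   # day index since the country's first case
  ungroup()

# Rank within each calendar date: who has the most cases on that date?
by_date <- toy %>%
  group_by(Date) %>%
  mutate(CountryRank = rank(desc(ConfirmedCases))) %>%
  ungroup()

# Rank within each NumberDays value: who had the most cases n days into its outbreak?
by_day <- toy %>%
  group_by(NumberDays) %>%
  mutate(CountryRank = rank(desc(ConfirmedCases))) %>%
  ungroup()

print(by_date)
print(by_day)
```

On its first day, country A is ranked 1 by calendar date (it is the only country reporting) but 2 by days since first case (B had more cases on its own first day), so the two rankings can tell different stories.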

In [21]:
df_daily_nz %>% 
#     mutate(Highlight=if_else(Country_Region %in% countries, Country_Region, "zOther")) %>%
    filter(Country_Region %in% countries) %>% 
    ggplot(aes(x=Date, y=CountryRank, color=Country_Region)) +
    geom_line(size=1.5, alpha=0.8) +
#     scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
#                                         labels = trans_format("log10", math_format(10^.x))) +
    scale_y_reverse() + 
    scale_x_date(name='', date_breaks = '4 days') +
    scale_color_brewer(palette = "Dark2") +
    theme(axis.text.x = element_text(angle=90))

Daily number of cumulative cases confirmed by country

The x axis shows the date
The y axis shows the cumulative number of cases in 10-fold increments (log10 formatted)

In [22]:
df_daily_nz %>% 
    filter(Country_Region %in% countries) %>% 
    ggplot(aes(x=Date, y=ConfirmedCases, color=Country_Region)) +
    geom_line(size=1.5, alpha=0.8) +
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                                        labels = trans_format("log10", math_format(10^.x))) +
    scale_x_date(name='', date_breaks = '4 days') +
    scale_color_brewer(palette = "Dark2") +
    theme(axis.text.x = element_text(angle=90))

The plot above shows how the cumulative number of cases is growing in each country. The increase in the US is quite worrisome. Turkey is another country where the epidemic is growing rapidly.

To understand this better, let's align the origin for each country to the day its first case was observed. The x axis will then be the number of days since the first observed case.
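
As a small illustration of this alignment, the sketch below uses made-up counts for two hypothetical countries. NumberDays is computed as in the notebook (the rank of the date among the rows kept), while DaysSinceFirst is an alternative introduced here only for comparison: it counts calendar days from the first case. The two agree whenever every date has a row, and differ when a date is missing.

```r
library(dplyr)

# Toy cumulative counts (made-up values); country "B" has no row for 2020-03-03.
toy <- tibble(
  Country_Region = c("A", "A", "A", "B", "B", "B"),
  Date = as.Date(c("2020-03-01", "2020-03-02", "2020-03-03",
                   "2020-03-01", "2020-03-02", "2020-03-04")),
  ConfirmedCases = c(0, 3, 9, 1, 2, 8)
)

aligned <- toy %>%
  filter(ConfirmedCases > 0) %>%            # drop days before the first case
  group_by(Country_Region) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(
    NumberDays = rank(Date),                            # position among the reported days
    DaysSinceFirst = as.integer(Date - min(Date)) + 1L  # calendar days, aware of gaps
  ) %>%
  ungroup()

print(aligned)
```

For country B the last row gets NumberDays 3 but DaysSinceFirst 4, because the missing date is skipped by the rank-based index.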

Daily number of cumulative cases confirmed by country

Origins aligned to be the day when the first case was observed

The x axis shows the number of days since first observation
The y axis shows the cumulative number of cases in 10-fold increments (log10 formatted)

In [23]:
df_daily_nz %>% 
    filter(Country_Region %in% countries) %>% 
    ggplot(aes(x=NumberDays, y=ConfirmedCases, color=Country_Region)) +
    geom_line(size=1.5, alpha=0.8) +
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                                        labels = trans_format("log10", math_format(10^.x))) +
    scale_x_continuous(name='Number of Days', breaks = seq(0, 70, 4)) +
    scale_color_brewer(palette = "Dark2") +
    theme(axis.text.x = element_text(angle=0))

Lines that rise sharply early on (further to the left) mean that the virus is spreading more quickly. The US, Turkey and the Netherlands show a rapid increase compared to the other countries. Turkey, especially in its first week, shows a much steeper slope, meaning an even quicker increase in the cumulative number of people infected. This is very serious. It will lead to hospitals being overwhelmed quickly, which is the main problem experienced in many countries.
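
One way to put a number on "steeper" is to fit a straight line to log10 of the cumulative cases: the slope is then the number of powers of ten gained per day. The sketch below does this for the US counts listed in the table earlier in this notebook. A single straight-line fit is a simplifying assumption (growth was not constant over the period), and the variable names are illustrative.

```r
# Cumulative US confirmed cases, copied from the US table above
# (2020-03-10 through 2020-04-05).
us_cases <- c(892, 1214, 1596, 2112, 2658, 3431, 4565, 6353, 7715, 13608,
              19025, 25414, 33663, 43586, 53659, 65701, 83759, 101580,
              121326, 140734, 161655, 188018, 213214, 243295, 275426,
              308690, 336912)
number_days <- seq_along(us_cases)

# Straight-line fit on the log10 scale; the slope is in powers of ten per day.
fit <- lm(log10(us_cases) ~ number_days)
slope <- unname(coef(fit)["number_days"])

daily_factor <- 10^slope            # average multiplicative growth per day
doubling_days <- log10(2) / slope   # days needed for the count to double

cat("slope:", round(slope, 3),
    " daily factor:", round(daily_factor, 2),
    " doubling time (days):", round(doubling_days, 1), "\n")
```

A larger slope means a larger daily growth factor and a shorter doubling time, which is what a steeper line on the log-scale plot is showing.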

The number of cases in Turkey has risen from 1,000 to 10,000 in the last 8 days. That is a 10-fold increase in just over a week!
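
To translate a 10-fold rise over 8 days into everyday terms, the short calculation below assumes the growth rate was constant over those 8 days (a simplification) and derives the implied daily growth factor and doubling time. The variable names are illustrative.

```r
fold_increase <- 10   # from 1,000 to 10,000 cases
days <- 8

# If growth were constant, cases multiply by this factor every day.
daily_factor <- fold_increase^(1 / days)

# Time for the count to double at that constant rate.
doubling_days <- days * log(2) / log(fold_increase)

round(daily_factor, 2)    # 1.33, i.e. roughly 33% more cases each day
round(doubling_days, 1)   # 2.4 days
```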

Let's now check the plots for fatalities.

In [24]:
df_daily_nz = df_daily %>% 
    filter(Fatalities > 0) %>%
#     mutate(ConfirmedCases = ifelse(ConfirmedCases==0, 1,ConfirmedCases)) %>%
    group_by(Country_Region) %>%
    arrange(Date, `.by_group` = TRUE)
In [25]:
df_daily_nz = df_daily_nz %>%
    group_by(Country_Region) %>%
    mutate(NumberDays = rank(Date)) %>%
    ungroup()

Daily number of fatalities by country

The x axis shows the date
The y axis shows the cumulative number of fatalities in 10-fold increments (log10 formatted)

In [26]:
df_daily_nz %>% 
    filter(Country_Region %in% countries) %>% 
    ggplot(aes(x=Date, y=Fatalities, color=Country_Region)) +
    geom_line(size=1.5, alpha=0.8) +
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                                        labels = trans_format("log10", math_format(10^.x))) +
    scale_x_date(name = '', date_breaks = '4 days') +
    scale_color_brewer(palette = "Dark2") +
    theme(axis.text.x = element_text(angle=90))

Unfortunately, the dire situation in Italy and Spain is obvious from the graph. Turkey is also alarming: although it does not have as many fatalities, the sharp increase in its first 2 weeks calls for attention.

Daily number of fatalities by country

Origins aligned to be the day when the first fatality was observed

The x axis shows the number of days since first observation
The y axis shows the cumulative number of fatalities in 10-fold increments (log10 formatted)

In [27]:
df_daily_nz %>% 
    filter(Country_Region %in% countries) %>% 
    ggplot(aes(x=NumberDays, y=Fatalities, color=Country_Region)) +
    geom_line(size=1.5, alpha=0.8) +
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                                        labels = trans_format("log10", math_format(10^.x))) +
    scale_x_continuous(name='Number of Days', breaks = seq(0, 70, 4)) +
    scale_color_brewer(palette = "Dark2") +
    theme(axis.text.x = element_text(angle=0))

The last plot again confirms the severe situation in these countries. The US, Spain and Turkey are the left-most countries, which means a more rapid spread. In this plot, the earlier (further to the left) the sharp increase happens, the worse the situation is. Turkey has an especially steep slope at the beginning, which falls more in line with the other countries after the first week.

Unfortunately, Italy and Spain have already passed 10,000 fatalities, followed closely by the US. Belgium, the Netherlands, and Germany have more than 1,000 fatalities, closely followed by Turkey. These numbers are worrisome. Canada seems to be doing somewhat better than these countries.

These figures show how rapidly COVID-19 is spreading. This is especially dangerous for people who live or work together in large numbers within confined spaces: prisons, care homes, hospitals, universities, schools, factories, malls, and other closed quarters, to name a few. In environments like these, it takes only one infected person to spread the disease to a large number of people. That's why many states have shelter-at-home orders in place. This has led to the closure of many facilities. Economies are taking a big hit because of this, but it is a necessary evil to combat this unseen enemy. Social distancing is one of the best weapons we have for now, until a better solution is found.

This brings me to another point: places where people are forced to live together and cannot leave of their own will, namely prisons and jails. I hope that governments worldwide are taking the facts of this pandemic into consideration and acting accordingly, putting human life first. The overcrowded conditions of prisons in some countries put many lives at risk. The journalists, professors, researchers, teachers, students, mothers and their babies, and many thousands of innocent people who are imprisoned should be released, effective immediately. Tomorrow might be too late. This is not a time for vengeance; it is a time for mercy and compassion.

Take care, all! I hope we will beat this virus together.

Ilyas Ustun
April 6, 2020
Chicago, IL
