A visual approach¶

by: Ilyas Ustun

Welcome to my analysis of COVID-19, in which I use visualization to examine the current situation and how quickly the disease is spreading in selected countries.

The data for this analysis was obtained from https://www.kaggle.com/c/covid19-global-forecasting-week-3/data and is curated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE).

# Make plots wider 
options(repr.plot.width=10, repr.plot.height=5, repr.plot.res=250)

## Importing packages

# This R environment comes with all of CRAN and many other helpful packages preinstalled.
# You can see which packages are installed by checking out the kaggle/rstats docker image: 
# https://github.com/kaggle/docker-rstats

library(tidyverse) # metapackage with lots of helpful functions

# list.files(path = "covid19-global-forecasting-week-3")

df = read_csv("covid19-global-forecasting-week-3/train.csv")
# df_tst = read_csv("../input/covid19-global-forecasting-week-2/test.csv")
# df_sub = read_csv("../input/covid19-global-forecasting-week-2/submission.csv")

Parsed with column specification:
cols(
  Id = col_double(),
  Province_State = col_character(),
  Country_Region = col_character(),
  Date = col_date(format = ""),
  ConfirmedCases = col_double(),
  Fatalities = col_double()
)

# df %>% str()

Sample data:

df %>% head()

df %>% summary()

       Id        Province_State     Country_Region          Date           
 Min.   :    1   Length:22950       Length:22950       Min.   :2020-01-22  
 1st Qu.: 8170   Class :character   Class :character   1st Qu.:2020-02-09  
 Median :16356   Mode  :character   Mode  :character   Median :2020-02-28  
 Mean   :16356                                         Mean   :2020-02-28  
 3rd Qu.:24541                                         3rd Qu.:2020-03-18  
 Max.   :32710                                         Max.   :2020-04-05  
 ConfirmedCases     Fatalities      
 Min.   :     0   Min.   :    0.00  
 1st Qu.:     0   1st Qu.:    0.00  
 Median :     0   Median :    0.00  
 Mean   :   702   Mean   :   31.65  
 3rd Qu.:    59   3rd Qu.:    0.00  
 Max.   :131646   Max.   :15887.00

# df$Country_Region %>% unique()

Let's get the daily total cases and fatalities for each country.

# Sum over provinces/states so that each country has one row per date
df_daily = df %>%
    group_by(Country_Region, Date) %>%
    summarize(ConfirmedCases = sum(ConfirmedCases), Fatalities = sum(Fatalities)) %>%
    ungroup()
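To illustrate what this aggregation does, here is a minimal sketch on a made-up table. The `toy` data frame and all of its values are invented for illustration and are not taken from the dataset. Countries reported by province have several rows per date, and summing collapses them into one national total per date.

```r
library(dplyr)

# Hypothetical toy data: country "A" is reported by province, country "B" as a whole
toy <- data.frame(
    Province_State = c("P1", "P2", "P1", "P2", NA, NA),
    Country_Region = c("A", "A", "A", "A", "B", "B"),
    Date = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02",
                     "2020-03-02", "2020-03-01", "2020-03-02")),
    ConfirmedCases = c(5, 3, 9, 4, 0, 2),
    Fatalities = c(0, 0, 1, 0, 0, 0)
)

# One row per country and date, with the province counts summed
toy_daily <- toy %>%
    group_by(Country_Region, Date) %>%
    summarize(ConfirmedCases = sum(ConfirmedCases),
              Fatalities = sum(Fatalities),
              .groups = "drop")

print(toy_daily)
```

Country "A" ends up with one row per date instead of two, while country "B" is unchanged because it was never split by province.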

library(scales)
library(RColorBrewer)

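# Keep only the days with at least one confirmed case, ordered by date within each country
df_daily_nz = df_daily %>% 
    filter(ConfirmedCases > 0) %>%
#     mutate(ConfirmedCases = ifelse(ConfirmedCases==0, 1,ConfirmedCases)) %>%
    group_by(Country_Region) %>%
    arrange(Date, .by_group = TRUE)

# NumberDays counts the days since the first confirmed case in each country (1 = first day)
df_daily_nz = df_daily_nz %>%
    group_by(Country_Region) %>%
    mutate(NumberDays = rank(Date)) %>%
    ungroup()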

df_daily_nz %>% filter(Country_Region=='US')

Above we can see the data for the US only. The ConfirmedCases and Fatalities columns hold cumulative counts.

The US series looks odd: its first data point already shows 892 cases, as if the data from previous days were lost. Or perhaps that many cases really were reported in a single day. No clue!
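Because the columns are cumulative, the number of new cases per day is the difference between consecutive rows. A minimal sketch using the first five US values from the table output of this notebook (the variable names are my own):

```r
# First five US cumulative confirmed-case counts (NumberDays 1 to 5), copied from the table output
us_cumulative <- c(892, 1214, 1596, 2112, 2658)

# diff() returns day-over-day changes; the first day has no previous value, so it is NA
us_new <- c(NA, diff(us_cumulative))

print(data.frame(NumberDays = 1:5, Cumulative = us_cumulative, NewCases = us_new))
```

The first row has no known increment, which is exactly why the 892 starting value is hard to interpret.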

countries = c('US', 'Turkey', 'Germany', 'Italy', 'Spain', 'Netherlands', 'Belgium', 'Canada')

# Rank countries by cumulative confirmed cases on each date (rank 1 = most cases)
df_daily_nz = df_daily_nz %>%
    group_by(Date) %>%
    mutate(CountryRank = rank(desc(ConfirmedCases))) %>%
    ungroup()

# df_daily_nz %>% head()

library(RColorBrewer)
darkcols <- brewer.pal(n=8, "Dark2")
palette = c('#d6d6c2', darkcols)
# palette

# print(palette)

The countries selected for this analysis are listed below.

countries

# df_daily_nz %>% 
#     mutate(Highlight=if_else(Country_Region %in% countries, Country_Region, "1_other")) %>%
# #     filter(Country_Region %in% countries) %>% 
#     ggplot(aes(x=NumberDays, y=ConfirmedCases, color=Highlight, group=Country_Region)) +
#     geom_line(size=1, alpha=0.8) +
#     scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
#                                         labels = trans_format("log10", math_format(10^.x))) +
#     scale_x_continuous(name='Number of Days', breaks = seq(0, 70, 4)) +
#     scale_color_manual(values = palette) +
#     theme(axis.text.x = element_text(angle=90))

Ranking of countries by cumulative number of cases each day¶

The following plot shows how quickly these countries reached the top 10 in the world by cumulative number of confirmed cases. The Netherlands and Turkey seem to have reached the top 10 quite quickly.

df_daily_nz %>% 
#     mutate(Highlight=if_else(Country_Region %in% countries, Country_Region, "zOther")) %>%
    filter(Country_Region %in% countries) %>% 
    ggplot(aes(x=Date, y=CountryRank, color=Country_Region)) +
    geom_line(size=1.5, alpha=0.8) +
#     scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
#                                         labels = trans_format("log10", math_format(10^.x))) +
    scale_y_reverse() + 
    scale_x_date(name='', date_breaks = '4 days') +
    scale_color_brewer(palette = "Dark2") +
    theme(axis.text.x = element_text(angle=90))
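As a minimal sketch of a per-date ranking like the one plotted above, here is the same idea on invented data (three made-up countries, two dates). `desc()` makes the largest count rank 1, and in the plot `scale_y_reverse()` then puts rank 1 at the top.

```r
library(dplyr)

# Hypothetical toy data: cumulative cases for three made-up countries on two dates
toy <- data.frame(
    Country_Region = rep(c("A", "B", "C"), times = 2),
    Date = as.Date(rep(c("2020-03-01", "2020-03-02"), each = 3)),
    ConfirmedCases = c(10, 50, 30, 80, 60, 40)
)

# Rank within each date; desc() makes the largest count rank 1
toy_ranked <- toy %>%
    group_by(Date) %>%
    mutate(CountryRank = rank(desc(ConfirmedCases))) %>%
    ungroup()

print(toy_ranked)
```

Country "A" moves from rank 3 on the first date to rank 1 on the second, which would appear as a rising line in the ranking plot.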

Daily number of cumulative cases confirmed by country¶

The x axis shows the date
The y axis shows the cumulative number of cases in 10-fold increments (log10 formatted)

df_daily_nz %>% 
    filter(Country_Region %in% countries) %>% 
    ggplot(aes(x=Date, y=ConfirmedCases, color=Country_Region)) +
    geom_line(size=1.5, alpha=0.8) +
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                                        labels = trans_format("log10", math_format(10^.x))) +
    scale_x_date(name='', date_breaks = '4 days') +
    scale_color_brewer(palette = "Dark2") +
    theme(axis.text.x = element_text(angle=90))

The plot above shows how the cumulative number of cases is increasing in each country. The increase in the US is quite worrisome. Turkey is another country where the epidemic is growing rapidly.

To understand this better, let's align the origin of each country's curve to the day its first case was observed. The x axis will then be the number of days since the first case.
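The alignment can be sketched on invented data as follows. Note that this sketch uses `row_number()` in place of `rank(Date)`; the two agree as long as each country has exactly one row per date, which is an assumption here.

```r
library(dplyr)

# Hypothetical toy data: country "A" sees its first case one day before country "B"
toy <- data.frame(
    Country_Region = rep(c("A", "B"), each = 4),
    Date = as.Date(rep(c("2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"), times = 2)),
    ConfirmedCases = c(1, 2, 4, 8, 0, 3, 9, 27)
)

toy_aligned <- toy %>%
    filter(ConfirmedCases > 0) %>%          # drop the days before the first case
    group_by(Country_Region) %>%
    arrange(Date, .by_group = TRUE) %>%
    mutate(NumberDays = row_number()) %>%   # 1 = day of the first case
    ungroup()

print(toy_aligned)
```

Both countries now start at NumberDays 1 even though their first cases fall on different calendar dates, so their curves can be compared from a common origin.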

Daily number of cumulative cases confirmed by country¶

Origins aligned to be the day when the first case was observed¶

The x axis shows the number of days since first observation
The y axis shows the cumulative number of cases in 10-fold increments (log10 formatted)

df_daily_nz %>% 
    filter(Country_Region %in% countries) %>% 
    ggplot(aes(x=NumberDays, y=ConfirmedCases, color=Country_Region)) +
    geom_line(size=1.5, alpha=0.8) +
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                                        labels = trans_format("log10", math_format(10^.x))) +
    scale_x_continuous(name='Number of Days', breaks = seq(0, 70, 4)) +
    scale_color_brewer(palette = "Dark2") +
    theme(axis.text.x = element_text(angle=0))

A line that rises sharply earlier (further to the left) means that the virus is spreading more quickly. The US, Turkey, and the Netherlands show a rapid increase compared to the other countries. Turkey, especially in its first week, shows a much steeper slope, meaning an even quicker increase in the cumulative number of confirmed cases. This is very serious. It will lead to hospitals getting overwhelmed quickly, which is the main problem experienced in many countries.
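On a log10 axis, steady exponential growth appears as a straight line, and its slope measures how fast the counts multiply. As a rough sketch (not part of the original analysis), the slope can be estimated with a linear fit to `log10(cases)`. The values below are the first nine US data points from the table output of this notebook; the choice of nine days is arbitrary.

```r
# US cumulative confirmed cases for NumberDays 1 to 9, copied from the table output
day <- 1:9
cases <- c(892, 1214, 1596, 2112, 2658, 3431, 4565, 6353, 7715)

# Fit a straight line to log10(cases); the slope is the growth in log10 units per day
fit <- lm(log10(cases) ~ day)
slope <- unname(coef(fit)["day"])

daily_factor <- 10^slope             # multiplicative growth per day
doubling_days <- log10(2) / slope    # days needed for the count to double

cat("slope (log10 units per day):", round(slope, 3), "\n")
cat("daily growth factor:", round(daily_factor, 2), "\n")
cat("doubling time (days):", round(doubling_days, 1), "\n")
```

A steeper line means a larger slope and therefore a shorter doubling time, which is what makes the left-most, steepest curves the most worrying.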

The number of cases in Turkey has risen from 1,000 to 10,000 in the last 8 days. That is a 10-fold increase in just over a week!
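To put that figure in perspective, a 10-fold increase over 8 days can be converted into an average daily growth factor and an implied doubling time. This is a back-of-the-envelope sketch that assumes constant exponential growth over the period:

```r
fold_increase <- 10   # from 1,000 to 10,000 cases
days <- 8             # period stated in the text

daily_factor <- fold_increase^(1 / days)              # average growth factor per day
doubling_days <- days * log(2) / log(fold_increase)   # implied doubling time in days

cat("average daily growth factor:", round(daily_factor, 3), "\n")
cat("implied doubling time (days):", round(doubling_days, 2), "\n")
```

Under that assumption, the count grows by roughly a third each day and doubles about every 2.4 days.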

Let's now check the plots for fatalities.

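# Keep only the days with at least one fatality, ordered by date within each country
df_daily_nz = df_daily %>% 
    filter(Fatalities > 0) %>%
#     mutate(ConfirmedCases = ifelse(ConfirmedCases==0, 1,ConfirmedCases)) %>%
    group_by(Country_Region) %>%
    arrange(Date, .by_group = TRUE)

# NumberDays now counts the days since the first fatality in each country (1 = first day)
df_daily_nz = df_daily_nz %>%
    group_by(Country_Region) %>%
    mutate(NumberDays = rank(Date)) %>%
    ungroup()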

Daily number of fatalities by country¶

The x axis shows the date
The y axis shows the cumulative number of fatalities in 10-fold increments (log10 formatted)

df_daily_nz %>% 
    filter(Country_Region %in% countries) %>% 
    ggplot(aes(x=Date, y=Fatalities, color=Country_Region)) +
    geom_line(size=1.5, alpha=0.8) +
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                                        labels = trans_format("log10", math_format(10^.x))) +
    scale_x_date(name = '', date_breaks = '4 days') +
    scale_color_brewer(palette = "Dark2") +
    theme(axis.text.x = element_text(angle=90))

Unfortunately, the dire situation in Italy and Spain is obvious from the graph. Turkey is another alarming case: although it does not have as many fatalities, the sharp increase in its first 2 weeks calls for attention.

Daily number of fatalities by country¶

Origins aligned to be the day when the first fatality was observed¶

The x axis shows the number of days since first observation
The y axis shows the cumulative number of fatalities in 10-fold increments (log10 formatted)

df_daily_nz %>% 
    filter(Country_Region %in% countries) %>% 
    ggplot(aes(x=NumberDays, y=Fatalities, color=Country_Region)) +
    geom_line(size=1.5, alpha=0.8) +
    scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x), 
                                        labels = trans_format("log10", math_format(10^.x))) +
    scale_x_continuous(name='Number of Days', breaks = seq(0, 70, 4)) +
    scale_color_brewer(palette = "Dark2") +
    theme(axis.text.x = element_text(angle=0))

So, the last plot again confirms the severe situation in these countries. The US, Spain, and Turkey are the left-most countries, which means a more rapid spread. In this plot, the earlier (further to the left) the sharp increase happens, the worse the situation is. Turkey has an especially steep slope at the beginning, which becomes more aligned with the other countries after the first week.

Unfortunately, Italy and Spain have already passed 10,000 fatalities, followed closely by the US. Belgium, the Netherlands, and Germany have more than 1,000 fatalities, closely followed by Turkey. These numbers are worrisome. Canada seems to be doing somewhat better than these countries.

These figures show how rapidly COVID-19 is spreading. This is especially dangerous for people who live or work together in large numbers within confined spaces: prisons, care homes, hospitals, universities, schools, factories, malls, and other closed quarters, to name a few. In environments like these, it takes only one infected person to spread the disease to a large number of people. That is why many states have shelter-at-home orders in place. This has led to the closure of many facilities. Economies are taking a big hit because of this, but it is a necessary evil to combat this unseen enemy. Social distancing is one of the best weapons we have for now, until a better solution is found.

This brings me to another point: places where people are forced to live together and cannot leave of their own will, namely prisons and jails. I hope that governments worldwide are taking the facts of this pandemic into consideration and acting accordingly by putting human life first. The overcrowded conditions of prisons in some countries put many human lives at risk. The journalists, professors, researchers, teachers, students, mothers and their babies, and many thousands of innocent people who are imprisoned should be released, effective immediately. Tomorrow might be too late. This is not a time for vengeance; it is a time for mercy and compassion.

Take care, all! I hope we will beat this virus together.

Ilyas Ustun
April 6, 2020
Chicago, IL

Id	Province_State	Country_Region	Date	ConfirmedCases	Fatalities
<dbl>	<chr>	<chr>	<date>	<dbl>	<dbl>
1	NA	Afghanistan	2020-01-22	0	0
2	NA	Afghanistan	2020-01-23	0	0
3	NA	Afghanistan	2020-01-24	0	0
4	NA	Afghanistan	2020-01-25	0	0
5	NA	Afghanistan	2020-01-26	0	0
6	NA	Afghanistan	2020-01-27	0	0

Country_Region	Date	ConfirmedCases	Fatalities	NumberDays
<chr>	<date>	<dbl>	<dbl>	<dbl>
US	2020-03-10	892	28	1
US	2020-03-11	1214	36	2
US	2020-03-12	1596	40	3
US	2020-03-13	2112	47	4
US	2020-03-14	2658	54	5
US	2020-03-15	3431	63	6
US	2020-03-16	4565	85	7
US	2020-03-17	6353	108	8
US	2020-03-18	7715	118	9
US	2020-03-19	13608	200	10
US	2020-03-20	19025	244	11
US	2020-03-21	25414	307	12
US	2020-03-22	33663	426	13
US	2020-03-23	43586	551	14
US	2020-03-24	53659	705	15
US	2020-03-25	65701	941	16
US	2020-03-26	83759	1208	17
US	2020-03-27	101580	1578	18
US	2020-03-28	121326	2023	19
US	2020-03-29	140734	2464	20
US	2020-03-30	161655	2975	21
US	2020-03-31	188018	3870	22
US	2020-04-01	213214	4753	23
US	2020-04-02	243295	5922	24
US	2020-04-03	275426	7083	25
US	2020-04-04	308690	8403	26
US	2020-04-05	336912	9615	27

Analyzing COVID-19 by Numbers in Countries
