On March 11, 2020, the World Health Organization (WHO) declared COVID-19 a global pandemic. COVID-19 is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It has now spread to more than 185 countries or territories, with more than 300,000 reported cases and more than 13,000 deaths. Amid the escalating fear over the spread of the disease, the mass media often report that vulnerability to COVID-19 is highly age-specific, with older adults the most vulnerable to its worst effects. Indeed, according to a recent study conducted by the China Centers for Disease Control and Prevention (China CDC), the case fatality rate (CFR; the proportion of deaths among confirmed cases) for patients 70 years or older can be as high as 8%-14.8%, while that for younger patients (<50 years old) remains below 0.4%.
To confirm the validity of such reports and findings, we aim to formally assess the age difference in the case fatality rate by analyzing publicly available COVID-19 epidemiological datasets using survival analysis methodology.
To that end, we use online data pulled from the Korea Centers for Disease Control & Prevention (Korea CDC) and prepared by the DS4C (Data Science for COVID-19) Project. The particular dataset of interest is PatientInfo.csv, which contains subject-level data for over 2,000 confirmed COVID-19 cases in South Korea. Key variables include:
patient_id: the ID of the patient
To clean the data, I handle patients differently depending on their state and on which fields are missing. The main rules are as follows.
Missing State
Patients with a missing state variable are dropped. State is essential in survival analysis: without it, we do not know whether a patient died. To avoid errors from an incorrectly imputed state, I drop these observations rather than impute them.
Isolated Patients
For the isolated patients, I use the date on which the data set was last updated (2020-03-21) as the censoring date.
Released Patients
For the released patients, I use the release date as the censoring date. If the release date is missing but the confirmed date is available, I use "confirmed date + overall average isolation duration (about 14 days)" as the censoring date.
Deceased Patients
For the deceased patients, the deceased date is the death date. For those with a missing deceased date, if the confirmed date is available, I use "confirmed date + overall average time between confirmation and death" as the death date.
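The rules for isolated, released, and deceased patients can be sketched as one function. This is a minimal illustration, not the code used for the analysis: the state labels ("isolated", "released", "deceased") and the function and parameter names are assumptions, and `mean_days_to_death` is a placeholder because the text does not report the average time from confirmation to death.

```python
from datetime import date, timedelta

LAST_UPDATE = date(2020, 3, 21)   # last update of the data set, per the text
MEAN_ISOLATION_DAYS = 14          # approximate average isolation time, per the text


def end_date_and_event(state, confirmed=None, released=None, deceased=None,
                       mean_days_to_death=None):
    """Return (end_date, event) for one patient.

    event is 1 for death and 0 for censoring. end_date is None when the
    rules above cannot produce a date. mean_days_to_death is a placeholder:
    its value is not given in the text.
    """
    if state == "isolated":
        return LAST_UPDATE, 0
    if state == "released":
        if released is not None:
            return released, 0
        if confirmed is not None:
            return confirmed + timedelta(days=MEAN_ISOLATION_DAYS), 0
        return None, 0
    if state == "deceased":
        if deceased is not None:
            return deceased, 1
        if confirmed is not None and mean_days_to_death is not None:
            return confirmed + timedelta(days=mean_days_to_death), 1
        return None, 1
    # Patients with a missing state are dropped before this step.
    raise ValueError(f"unknown or missing state: {state!r}")
```

Returning `None` for the date, rather than guessing, leaves the decision about unresolvable records to the caller.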
Missing Confirmed Date
For those with a missing confirmed date, if the censoring/death date is available from the rules above or from the original data set, I use weighted sampling to pick a confirmed date between the first confirmed date among all patients and this patient's censoring/death date. The weights come from the number of confirmations on each date: the more patients were confirmed on a given date, the higher the weight that date receives.
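The weighted sampling step might look like the following sketch. The function name and arguments are hypothetical. Dates on which nobody was confirmed have weight zero, so sampling only over observed confirmed dates is equivalent to sampling over the whole interval.

```python
import random
from collections import Counter
from datetime import date


def sample_confirmed_date(known_confirmed_dates, end_date, rng=random):
    """Draw a confirmed date on or before end_date, weighted by daily case counts.

    known_confirmed_dates: confirmed dates of all patients that have one.
    rng: the random module or a random.Random instance (for reproducibility).
    """
    counts = Counter(d for d in known_confirmed_dates if d <= end_date)
    if not counts:
        return None  # no candidate date; the caller decides how to handle this
    candidates = sorted(counts)
    weights = [counts[d] for d in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy data only: three confirmations on Feb 1, one on Feb 5, one on Mar 10.
toy_dates = [date(2020, 2, 1)] * 3 + [date(2020, 2, 5), date(2020, 3, 10)]
imputed = sample_confirmed_date(toy_dates, date(2020, 2, 20), random.Random(0))
```

Passing a seeded `random.Random` makes the imputation repeatable, which helps when the cleaned data set needs to be regenerated.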
Missing Age
For those with a missing age, if the birth year is available, I calculate the age from the birth year. If both age and birth year are missing, I drop the observation because of the importance of age in this analysis. (For the calculation of crude CFRs, those with a missing age could also be included as a separate category to give a fuller picture.)
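A small sketch of the age derivation, together with the mapping to the five age groups used in the analysis. The reference year of 2020 and the helper names are assumptions; the text does not state how the age was computed from the birth year.

```python
REFERENCE_YEAR = 2020  # assumption: age is taken as of the year of the data


def age_from_birth_year(birth_year, reference_year=REFERENCE_YEAR):
    """Approximate age in years from the birth year alone."""
    return reference_year - birth_year


def age_group(age):
    """Map an age in years to one of: <=40s, 50s, 60s, 70s, >=80s."""
    if age < 50:
        return "<=40s"
    if age >= 80:
        return ">=80s"
    return f"{age // 10 * 10}s"
```

For example, a patient born in 1965 is assigned an age of 55 and falls in the "50s" group.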
Missing Gender
For those with a missing gender, I assign a gender at random, with equal probability of male and female. (Alternatively, those with a missing gender can be included as a separate category.)
Outcome of interest
The outcome of interest is the time from confirmation to censoring/death. Patients whose confirmed date falls after their deceased date are dropped, because they were confirmed after their death.
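As an illustration, the outcome can be computed as a number of days, with records that have a negative duration dropped. The names below are hypothetical and the two records are toy data, not values from PatientInfo.csv.

```python
from datetime import date


def follow_up_days(confirmed, end):
    """Days from confirmation to censoring/death, or None if end precedes confirmation."""
    days = (end - confirmed).days
    return days if days >= 0 else None


def build_outcomes(records):
    """records: iterable of (confirmed, end, event); returns a list of (days, event)."""
    outcomes = []
    for confirmed, end, event in records:
        days = follow_up_days(confirmed, end)
        if days is not None:
            outcomes.append((days, event))
    return outcomes


toy_records = [
    (date(2020, 3, 1), date(2020, 3, 21), 0),   # censored at the last update
    (date(2020, 3, 10), date(2020, 3, 5), 1),   # confirmed after death: dropped
]
toy_outcomes = build_outcomes(toy_records)
```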
I divide the study sample into five age groups: <=40s, 50s, 60s, 70s, and >=80s. Figure 1 shows the age-specific CFRs calculated from the cleaned data and from the study by the China CDC. For younger patients (those in their 50s or younger), the CFRs from this data set and from the China CDC study are very close to each other. For older patients (those in their 60s or older), however, the CFRs from the China CDC are higher than those from this data set.
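The crude CFR in each age group is simply deaths divided by confirmed cases. A minimal sketch, with a hypothetical function name and toy input rather than the actual data:

```python
from collections import defaultdict


def cfr_by_group(records):
    """records: iterable of (group, died) pairs, with died True or False.

    Returns {group: deaths / confirmed cases} for every group present.
    """
    cases = defaultdict(int)
    deaths = defaultdict(int)
    for group, died in records:
        cases[group] += 1
        deaths[group] += int(died)
    return {group: deaths[group] / cases[group] for group in cases}


# Toy input only; these are not values from the data set.
toy_cfr = cfr_by_group([("70s", True), ("70s", False), ("50s", False)])
```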
KM Curves & Log-rank Test
Figure 2 shows the gender-specific and age-specific Kaplan-Meier curves for the case survival probabilities.
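For readers unfamiliar with the estimator behind these curves, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit estimate. It is an illustration under the usual conventions (censored patients count as at risk at their censoring time), not the code that produced Figure 2, and the function name is an assumption.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time.

    times: follow-up in days; events: 1 = death, 0 = censored.
    At each death time t, S is multiplied by (1 - deaths / number at risk).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Collect everyone whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve
```

Applying the function separately to each gender or age group gives one curve per stratum.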
The gender-stratified log-rank test results can be seen as follows:
Call:
The p-value is about 3e-16, suggesting that, after controlling for gender, age has a highly significant effect on survival. From the results in this section, we may conclude that COVID-19 is more dangerous for older patients than for younger patients, and that females are somewhat more likely to survive than males.
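To show the mechanics of a log-rank test, the sketch below implements the simplest case: two groups and no stratification. This is a deliberate simplification of the test reported above, which compares five age groups within gender strata; a stratified version would sum the observed-minus-expected terms and the variances across strata before forming the statistic. The function name and the 0/1 group coding are assumptions.

```python
import math


def logrank_two_groups(times, events, groups):
    """Unstratified log-rank test for two groups coded 0 and 1.

    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    obs_minus_exp = 0.0
    variance = 0.0
    death_times = sorted({t for t, e in zip(times, events) if e})
    for t in death_times:
        n = sum(1 for x in times if x >= t)                        # at risk, both groups
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == 1)
        d = sum(1 for x, e in zip(times, events) if x == t and e)  # deaths at t
        d1 = sum(1 for x, e, g in zip(times, events, groups)
                 if x == t and e and g == 1)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of the deaths in group 1 at time t.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value
```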
Cox proportional hazards model
I also fitted a Cox proportional hazards model with age groups (with >=80s as reference group) and gender as covariates. The summary table of this model is as follows:
Call:
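For reference, the fitted model can be written out as below. The notation is an assumption introduced here: the A terms are 0/1 indicators for the four non-reference age groups (>=80s being the reference), G is a 0/1 gender indicator whose reference level is not stated in the text, and h_0(t) is the unspecified baseline hazard.

```latex
h(t \mid x) = h_0(t)\,\exp\!\bigl(
    \beta_1 A_{\le 40\mathrm{s}} + \beta_2 A_{50\mathrm{s}}
  + \beta_3 A_{60\mathrm{s}} + \beta_4 A_{70\mathrm{s}}
  + \gamma\, G \bigr)
```

Under this parameterization, exp(beta_k) is the hazard ratio of age group k relative to patients aged 80 or older, holding gender fixed.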
From Figure 3, we can see that the points are mostly clustered around the identity line, indicating that overall the model fits the data reasonably well.
I also conducted the chi-square tests on proportionality. The test results are as follows:
rho chisq p
The p-value of the global test of proportionality was 0.674, greater than 0.05, so the test gives no evidence against the proportional hazards assumption.
Wald Test
Finally, I conducted a Wald test on the effect of age (chi-square with 4 degrees of freedom). The resulting p-value was about 2.5e-7, which indicates that age has a significant effect on the survival of COVID-19 patients.
1 | ### Wald test with df of 4 |
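As a reminder of what this test computes, a Wald statistic for the joint effect of age has the general form below. The notation is an assumption: the vector beta-hat holds the four estimated age-group coefficients from the Cox model, and the variance term is the corresponding 4 x 4 block of their estimated covariance matrix.

```latex
W = \hat{\beta}^{\top}\,\widehat{\mathrm{Var}}(\hat{\beta})^{-1}\,\hat{\beta}
\;\overset{\text{approx.}}{\sim}\; \chi^2_4
\quad \text{under } H_0:\ \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0,
\qquad
p = P\!\left(\chi^2_4 > W\right) = e^{-W/2}\left(1 + \tfrac{W}{2}\right)
```

The closed-form tail probability on the right holds specifically for 4 degrees of freedom.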
In conclusion, the analysis above indicates that survival among COVID-19 patients differs by age and gender. In general, younger patients are more likely to survive than older patients, and female patients are more likely to survive than male patients.
BMI/STAT 741, UW-Madison