Before I get to my story, I would like to say that you can download the notebooks I created and ran in Azure Databricks here: https://github.com/thesqlpro/blog/tree/master/notebooks
The source of my data was: https://covidtracking.com/data
I chose not to source my data directly from Maryland’s State Government site because the format was not easy to use. The official Maryland Government data has each day as a column and Zip Codes as rows, which is not as easy to work with as the data from the site above. So there may be a few discrepancies between the two sources on a day-to-day basis, but the totals are identical. On that site you can also read about their methodology for retrieving data from the various official State Government websites, and about the quality of each.
This post is in no way intended to attack anyone, be part of a political movement, promote any political, financial, or social agenda, or support any cause out there except one: highlighting how data and statistics can be used to tell stories. As data professionals, we need to give importance to the quality of data and the quality of data reporting. Basically, this is a lesson in data visualization and telling stories with data.
This is the story that the Maryland Government has produced.
It looks positive, things are changing!!! But hold on a moment.
Not trying to be the person that spoils everyone’s fun, but there is another story that I don’t think is being told here. I’m not a medical professional nor am I a healthcare statistician, but I know basic math and some fancy computer programming languages. From that knowledge, I have decided to show a different picture for COVID-19 — specifically in the State of Maryland.
The claim was a 50% drop in positive cases of COVID-19 in the above-mentioned state.
But let us take a step back and look at the raw data. Yes, the percentage of positive cases was higher on April 13th (35.78%, the highest percentage in the dataset) than on May 28th (10.04%), but if you compare the two dates you’ll notice that the actual number of cases has almost doubled. Statistics can tell different stories depending on how you convey them and which story you wish to tell. There were more tests done on May 28th than on April 13th.
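To make that arithmetic concrete, here is a minimal Python sketch. The two percentages are the ones quoted above; the test counts are made-up round numbers that I picked purely for illustration, not figures from the dataset. The point is only that a much lower percentage can still mean more positive results when far more tests are run.

```python
def positive_count(tests, percent_positive):
    """Positive results implied by a test count and a percent-positive rate."""
    return round(tests * percent_positive / 100)


# Percentages are the ones quoted in the post.
# Test counts are HYPOTHETICAL round numbers, for illustration only.
april_13 = {"tests": 1_000, "percent_positive": 35.78}
may_28 = {"tests": 7_000, "percent_positive": 10.04}

april_positives = positive_count(**april_13)
may_positives = positive_count(**may_28)

print(f"April 13th: {april_positives} positives out of {april_13['tests']} tests")
print(f"May 28th:   {may_positives} positives out of {may_28['tests']} tests")
print(f"Change in positive count: {may_positives / april_positives:.2f}x")
```

With these invented test counts, the percent positive falls from 35.78% to 10.04% while the count of positives nearly doubles, which is the same pattern described above.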
Can we say that there are more or fewer cases? It’s hard to tell because the population tested is not the same on both dates.
If we look at this snippet (from the Databricks notebook in the GitHub repo I linked above), we can see that the number of positive cases (dark blue) is going up along with the total number of tests.
The snippet above shows the percentage of positive vs. negative cases accumulated over time. You’ll see that, for the most part, April and May are around 17-20% positive. This doesn’t take into account any data on cured cases, readmission cases, or anything like that. Healthcare data is complicated, and this was one of the reasons I decided to write this blog post.
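For anyone curious how an accumulated percentage like that can be computed, here is a rough Python sketch. It is not the code from my notebook, and the daily counts below are invented placeholders rather than Maryland data; the function name is just something I made up for this example.

```python
from itertools import accumulate


def cumulative_percent_positive(daily_positive, daily_negative):
    """Percent positive based on running totals, one value per day."""
    running_positive = accumulate(daily_positive)
    running_negative = accumulate(daily_negative)
    rates = []
    for pos, neg in zip(running_positive, running_negative):
        total = pos + neg
        # Guard against days before any test results have been reported.
        rates.append(round(100 * pos / total, 2) if total else 0.0)
    return rates


# HYPOTHETICAL daily counts, for illustration only (not Maryland data).
daily_positive = [30, 50, 40]
daily_negative = [70, 150, 160]

rates = cumulative_percent_positive(daily_positive, daily_negative)
print(rates)
```

Because each value is built on running totals, a single day with a very high or very low rate moves the accumulated percentage much less than it moves the daily percentage.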
I will leave this as a preview for the story that I have told in the notebook. Please feel free to download it and try it for yourself.