Positividade* and data against COVID-19: Description of a Python-based model prototype initiated during the #WirVsVirus Hackathon
This piece is a preliminary technical report of an on-going pilot on applying Data Powered Positive Deviance in response to the Covid-19 crisis in Germany. We would be happy to receive your feedback and further ideas on our analysis.
Author: Iskriyana Vasileva, Business Intelligence Analyst and aspiring Data Scientist, with valuable contributions by Joshua Driesen & Niklas Wulkow.
This blog post covers the prototype initiated by our team in collaboration with GIZ Data Lab during the German #WirVsVirus hackathon, which took place between the 20th and 22nd of March 2020.
Please note that we are discussing a prototype developed in the context of the Data Powered Positive Deviance Initiative and a COVID-19-response Hackathon. We acknowledge the need for additional refinements and look forward to new ideas to enrich what our team has created.
What is Positive Deviance?
The basic idea is to identify positive outliers that, despite equal conditions, perform better than their peers. The conventional PD approach is, however, limited by scarce and costly primary data collection, resulting in small data samples, making it statistically difficult to identify positive deviants. To address this issue, the Data Powered Positive Deviance (DPPD) approach was started. Its goal is to utilize big data and recent analytic techniques to enrich and accelerate the original PD approach.
Read here for more information.
Our Approach to the COVID-19 challenge
In the context of the COVID-19 pandemic in Germany, our team decided to find districts (Landkreise / Kreisfreie Städte) that managed to flatten their infection curve unexpectedly well compared to structurally similar peers (more about it here). These would be the COVID-19-positive deviants. After identifying them and their positive drivers, their actions could potentially be replicated in other districts to “spread” virus-dampening behaviour.
You will be able to find our code here.
The data we used covered all German districts and was extracted from publicly available sources. The data included structural data and data on the infection spread of COVID-19. Structural data included factors such as population density, hospital capacities, nursing homes, etc. per district and was primarily extracted from Landatlas. We also added data on average age and average household size (source: official 2011 census site). The data on the infection spread encompassed the number of COVID-19 infections per district provided by the Robert Koch Institute (data & data API, both in German).
Development of the Performance Measure
Defining a performance measure was key to the project’s success. It needed to reflect a desirable deviant outcome, cover all units of analysis, and allow for precise identification. For our case, we developed an inhibition coefficient to reflect the extent of successful flattening of the COVID-19 infection curve — the lower the coefficient, the better the performance of the district in containing the infection.
Below, we applied a series of steps to calculate the coefficient.
The key component for understanding the coefficient mechanism is its “Inception”-like element, which is basically a rate of change of a rate of change. Let’s break it down and show the relevant code snippets:
New infections (new_inf): We used the absolute number of infections from the RKI and merged sub-districts into districts where necessary (e.g. Berlin) to allow combining this data with the structural data.
smoothed_new: We used a seven-day rolling average of new infections to cancel out day-of-the-week effects in the data, which resulted from inconsistent reporting of COVID-19 test outcomes during the weekend by some laboratories.
alpha_t: First rate of change — we estimated the infection spread rate (spread rate or alpha coefficient) by dividing pairs of successive values of smoothed_new, i.e. smoothed_new on dayt / smoothed_new on dayt+1. In effect, this constitutes a smoothed growth factor.
This measure constitutes a multiplicative rate of change, which reflects the exponential spread of a disease when it is unhindered. With exponential growth, the multiplicative changes are constant, not the additive/linear.
We used a threshold for the number of newly infected people on one single day for our algorithm to start including it in the data. For example, if a district has three new cases on day 1, four on day 2 and five on day 3, and the threshold is set to 5, then day 3 will be the first day for which alpha_t is calculated.
This starting threshold makes sure that only districts with a substantial COVID-19 occurrence are considered because otherwise there would not be a curve to flatten. We chose 5 because after several trials, this number provided the optimal trade-off between estimation stability and data loss.
Second rate of change — the inhibition coefficient: the linear rate of change of the alpha coefficients (spread rates) over the period between the 1st and 14th of April 2020. This period was the same for all districts. Because an unhindered infection’s curve shape corresponds to exactly one alpha value, reducing a district’s alpha is synonymous with flattening its curve. Thus, a highly negative inhibition coefficient indicates a well-performing district.
You can find more information about the basics of epidemics modelling in our background paper provided by Niklas Wulkow.
Next, we focused on the structural data. Two indices, ruralness and socioeconomic status, were created by applying principal component analysis (PCA) on a standardized set of variables as conceptualized by Thünen and applied by the German Federal Ministry of Food and Agriculture (BMEL).
Ruralness contained settlement density, area percentage of farming and forest areas, residence percentage of 1- and 2-family houses, surrounding population densities weighted by distance, and distance to the next five urban centers.
Socioeconomic status encompassed unemployment rate, mean salaries, mean income, communal tax revenue, net population migration, residence vacancies, life expectancy for men and women, and percentage of high-school dropouts.
The indices per district were correspondingly calculated by multiplying the district values of the loading variables with the loadings identified for the first principal component, as the first component is always the one with the highest variance explanation (and in our case the only one, as we are reducing the dimensionality to 1 for each index). Thanks to scikit-learn, this logic is contained in the pca_.transform method.
To account for a potential mismatch of the indices’ direction due to its dependency on the random state the machine is in when performing the analysis, the code contains some checkpoints that ensure conformity between the loading directions expected and those found by the algorithm. Thus, we ensured the creation of a ruralness index and not an urbaneness index.
The PCA-created variables — ruralness and socioeconomic status — were used to cluster the districts and identify homologues. The algorithm of choice was k-means, and the number of clusters was identified via the Elbow Method. The latter looks at the sum of squares of distances of every data point from its corresponding cluster centroid (the within-cluster sum of squares (WCSS)) and its change with a rising number of clusters. The point after which there is no sudden change to be seen corresponds to the optimal number of clusters. In our case, 5 clusters seemed to be optimal.
We were interested in the effect of skipping the clustering and instead using ruralness and socio-economic status as additional predictors.
Of the ten districts that we originally identified as PDs, 6 had also been PDs when we used clustering. However, 2 of the 5 clusters were entirely absent from this new set of PDs. In practice, this might lead to identifying PD explanators that do not translate to all other units. For example, finding early subway closures to be an effective strategy would not yield actionable insights for rural villages.
Additionally, the linear regression model calculated across all districts only accounted for 2% of the performance measure’s variance. Thus, the “traditional” two-step procedure of clustering and within-cluster prediction proved advantageous for our use-case as well.
Positive Deviants Identification
The identification of Positive Deviants is at the heart of this analysis. We searched for Positive Deviance within the clusters we had previously defined.
A few things to consider:
The predictor variables were chosen based on factors frequently discussed in the media and learned through a consultation with an epidemiologist.
To account for the fact that performing worse at time t means more room for improvement at time t+1, we also included the average growth rate between March 25th and 31st, defined as the geometric mean of the alpha_t values of those days.
The number of positive deviants per cluster (2) was chosen arbitrarily for this prototype.
A district was classified as a positive deviant when the actual inhibition coefficient was lower than the predicted one, and this difference was among the two highest in the cluster.
We started with a linear regression model for its simplicity and the ease with which we could derive information about potentially relevant factors from the regression coefficients.
When interpreting the regression weights, keep in mind that lower values of the performance measure indicate a better performance (because of a stronger curve flattening). The higher a predictor with a negative regression weight, the better the performance, and vice versa.
The preliminary goal of this prototype was the identification of positive deviants and not the out-of-sample prediction of the inhibition coefficient, which is why we did not apply a train-test split. As a result, the focus was placed on developing a model that retrospectively explains a sufficient part of the performance measure variance. Our goal was to control for those parts of the performance measure’s variance attributable to the covariates, not to accurately predict unseen values. Thus, we used linear regression for splitting the inhibition coefficient’s variance into variance explained and not explained by the predictors. Splitting the data set into training and test sets would have been counterproductive to that end.
The linear regression was able to explain between 2% and 48% of the variance per cluster. Due to this wide interval, we applied a second model: The Random Forest Regressor.
Random Forest Regressor
It is an ensemble method that uses bagging as the ensemble method and decision tree as the individual model. “Forest” because it combines multiple decision trees and “random” because it performs random sampling on both the observations and a subset of features for splitting nodes. In effect, random forest produces many individually overfitted decision trees that, due to their different samples, overfit in different ways, leading them to exhibit random errors. As a result, when brought together, their errors average to 0 and often produce a well-performing model.
For the same reasons we outlined for the linear regression model, we did not use a train-test split and applied the model to the whole data set. The variance explained by the model was between 80% and 86% (R2 score) within the clusters.
In the next step, feature importance was determined (read this article by Eryk Lewinson for background information). While the article outlines several methods, methods we adapted in our code, we’ll concentrate on the most accurate one — drop column. The method compares the performance of the model (measured by R2) with the feature included in the model to the performance when the feature is excluded. Negative importance, in this case, means that dropping a feature from the model would improve its performance. In contrast to the linear regression weights, this method illustrates only the importance of the measure and not the direction of its impact on the inhibition coefficient.
Nursing homes and/or population density were among the top factors in all clusters, perhaps because both factors enable social distancing. If your neighbor lives in a different 1-family home than you, you can better avoid infection than if you lived in the same apartment. Following the same logic, it is easier to isolate a stationary care facility than it is to ensure infection-free ambulatory care work. Keep in mind, however, that these are just speculations at this point.
Comparison between the two models & current results
Eight of the identified positive deviants in the clusters overlapped between the 2 models, sometimes with changed positions. We took this to be an indicator that their positive performance is not due to artifacts of the prediction method.
Random forest, with its high goodness of fit (retrospectively), offers suggestions for structural factors that may be of importance to a district’s general ability to flatten the curve, with PDs flattening their respective curve even above and beyond that.
Both models in their current states cannot be used to predict an unseen inhibition coefficient, as explained in the linear regression section. This is a point for future consideration.
Below is a summary of our results per cluster as of the 15th June 2020. Note that the PD graphs depict the performance measure calculated for the blue interval, controlled for mean growth during the red interval.
Suggestions for moving forward
- Refining the predictors: the structural factors can be expanded, e.g. by adding a more granular measure for age. Furthermore, their selection can be optimized by applying methods such as Chi-squared test, LASSO, Ridge regression, etc.
- Adding predictors: adding new, non-structural predictors, such as connectedness of districts with early COVID-19 hotspots measured by the frequency of travelling between the districts (input Basma Albanna). We could also add behavioral data from social media or mobility data. By using those data sources, however, data privacy and protection should become an even higher priority.
- Prototype replication with structural data from other countries.
- Application of more robust outlier-detection methods.
- A closer investigation of the factors identified by the feature importance to understand what contributed to a positive-deviant formulation and how.
I’m very glad that I participated in the hackathon and worked on this project. I have learned so much and I was lucky to find such a great team. Thank you, Joshua, Niklas & the GIZ Data Lab!
Eryk L. (2019). Explaining Feature Importance by example of a Random Forest.
Georgios D. (2019a). Random Forest Regressor explained in depth.
Georgios, D.(2019b). Random Forest regression model Advanced Topics (+ Python code snippet using Sklearn).
Pascale, R., Sternin, J. and Sternin, M. (2010b). The Power of Positive Deviance: How Unlikely Innovators Solve the World’s Toughest Problems, Informatics in Primary Care, 12, pp. 139–145.
Will K. (2018). An Implementation and Explanation of the Random Forest in Python.
Yufeng (2020). Present The Feature Importance of a Random Forest Classifier.