Malaria in Sri Lanka

About 30 years of monthly malaria case data reported by Ministry of Health districts are used to describe malaria in time and space in Sri Lanka. As explained previously, the dataset, although long, is challenging to use because of numerous large gaps. Because district boundaries changed in 1990, we propose merging the original 222 districts into 115 districts that cover the whole period of data availability. We then apply the EM algorithm discussed previously and finally perform a cluster analysis to describe the climatology and annual variability of malaria cases in Sri Lanka.

Merging districts
We previously applied the EM gap-filling algorithm to the 222 districts of the dataset, aware that the method would not handle long continuous gaps before and after 1990 very well. It became clear that the gaps before 1990 are an artifact of a redesign of the Ministry of Health districts, which increased their number from 120 to 222. Therefore, based on the literature, field knowledge, and examination of the dataset, we decided to merge those 222 districts into a more coherent set of 115 districts. We could not simply revert to the pre-1990 layout of 120 districts, because other gaps are due to absent reports (i.e. genuinely missing data); the merging is therefore more sophisticated than going back to the 120 districts. The result is a more complete dataset that takes into account both the changes in Ministry of Health district design and regional environmental and climatic homogeneity. A detailed description of the relation between the 115, 120 and 222 datasets is to come.
Figure 1 - left: 222 districts, right: 115 districts.
We are now confident in the legitimacy of this merging and that 115 districts provide a spatial resolution good enough to address the complex variability of malaria in Sri Lanka.
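The merging step can be illustrated with a minimal sketch. It is not the procedure actually used for the 115 districts, whose mapping relies on literature and field knowledge. The function name `merge_districts`, the `mapping` dictionary and the aggregation rule are all assumptions made for illustration: monthly counts of the constituent districts are summed, and a month is marked missing only when every constituent is missing, so that a district that did not yet exist before 1990 does not create a gap in the merged series.

```python
from collections import defaultdict

def merge_districts(cases, mapping):
    """Aggregate monthly case counts of old districts into merged districts.

    cases:   dict old_id -> list of monthly counts (None = no report)
    mapping: dict old_id -> merged_id (hypothetical lookup table)
    Assumed rule: sum the available counts; a merged month is missing
    only if all constituent districts are missing for that month.
    """
    groups = defaultdict(list)
    for old_id, new_id in mapping.items():
        groups[new_id].append(cases[old_id])

    merged = {}
    for new_id, series_list in groups.items():
        merged_series = []
        for month_values in zip(*series_list):
            present = [v for v in month_values if v is not None]
            merged_series.append(sum(present) if present else None)
        merged[new_id] = merged_series
    return merged
```

A stricter rule (missing as soon as one constituent is missing) would be safer against under-counting but would reintroduce the pre-1990 artifact gaps, which is why the sketch uses the more permissive one.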

EM gap filling
Gaps remain, but in a much smaller proportion than before, so the EM algorithm should be more reliable. Figure 2 shows the original covariance matrix, the estimated one, and their difference.
Figure 2 - From left to right: the covariance matrix of the original dataset; the one estimated by EM; the difference between both.
Figure 3 shows an estimated standard error of the imputed values provided by the analysis.
Figure 3 - Estimated error of the imputed values by EM; it is set to 0 where there were actual data.
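To make the quantities in Figures 2 and 3 concrete, here is a minimal sketch of EM imputation under a multivariate normal model, written with NumPy. It is an illustration under stated assumptions, not the implementation used for this analysis: the function name `em_impute`, the fixed number of iterations and the small `ridge` term added for numerical stability are all choices made here. The input is a months-by-districts array with NaN for missing values, each district is assumed to have at least one observation, and there are at least two districts.

```python
import numpy as np

def em_impute(X, n_iter=50, ridge=1e-6):
    """EM imputation for a (months x districts) array with NaN gaps.

    Returns the filled array, the estimated covariance matrix, and the
    conditional standard deviation of each imputed value (0 where observed).
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    n, p = X.shape
    mu = np.nanmean(X, axis=0)                 # initial mean from observed data
    filled = np.where(missing, mu, X)          # initial fill with column means
    cov = np.cov(filled, rowvar=False) + ridge * np.eye(p)
    err = np.zeros_like(X)

    for _ in range(n_iter):
        cond_cov_sum = np.zeros((p, p))
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():                    # nothing observed this month
                filled[i] = mu
                cond_cov_sum += cov
                err[i] = np.sqrt(np.diag(cov))
                continue
            # E-step: regress missing districts on observed ones
            S_oo = cov[np.ix_(o, o)]
            S_mo = cov[np.ix_(m, o)]
            B = np.linalg.solve(S_oo, S_mo.T).T        # S_mo @ inv(S_oo)
            filled[i, m] = mu[m] + B @ (X[i, o] - mu[o])
            C = cov[np.ix_(m, m)] - B @ S_mo.T         # conditional covariance
            cond_cov_sum[np.ix_(m, m)] += C
            err[i, m] = np.sqrt(np.clip(np.diag(C), 0.0, None))
        # M-step: update mean and covariance, including imputation uncertainty
        mu = filled.mean(axis=0)
        centered = filled - mu
        cov = (centered.T @ centered + cond_cov_sum) / n + ridge * np.eye(p)
    return filled, cov, err
```

In this sketch, `cov` plays the role of the estimated covariance matrix and `err` the role of the estimated error of the imputed values, which is zero wherever actual data exist. With long continuous gaps and many districts, a more strongly regularized variant would likely be needed.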
Given the large continuous gaps in the 115-district dataset, it is necessary to check how each of those gaps was filled and whether the filling is plausible. A cluster analysis performed on the gap-filled 115-district dataset produced some suspicious clustering in the South, where districts in a relatively homogeneous environmental and climatic area were assigned to different clusters. Figure 4 shows the result of this cluster analysis. Two districts in dark green in the South are assigned to cluster #2, whereas most other districts of that cluster are located in the North. These are the two districts whose gap filling we want to examine closely.
purple: #61; light blue: #64; yellow: #29; green: #28; dark blue: #26; red: #30
Figure 4 - Cluster analysis on the gap-filled 115 districts malaria anomalies.
Figure 5 shows the original and reconstructed time series for those 6 districts. The pre-1990 reconstructed signal for districts #64 (light blue) and #29 (yellow) clearly does not resemble that of their direct neighbors with full time series, which are characterized in particular by very strong events in the 1970s. We also notice that all 6 districts behave differently after 1990. For this reason, we choose to exclude districts #64 and #29 from the study until further data can be retrieved to better characterize the 1970s, which are of the greatest importance. Note that district #28 (green) falls in the fourth cluster (i.e. the one with little variability and few malaria cases) in spite of its post-1990 peaks and a reconstructed pre-1990 signal. Given how poor that record is, we feel this assignment makes sense.
Figure 5 - Time series of the original gappy data (red) and the gap-filled data (blue) for 6 districts in the South.
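The visual comparison with neighboring districts could be complemented by a simple numerical check. The sketch below is only a suggestion and was not part of the analysis: the function name `neighbor_agreement`, the choice of a Pearson correlation and the minimum of 3 months are assumptions. It correlates a district's imputed values with the mean of its neighbors over the months where the district was originally missing.

```python
import numpy as np

def neighbor_agreement(filled, was_missing, target, neighbors):
    """Correlation between imputed values of one district and its neighbors.

    filled:      (months x districts) gap-filled array
    was_missing: boolean array of the same shape, True where data were imputed
    target:      column index of the district to check
    neighbors:   list of column indices of neighboring districts
    """
    months = was_missing[:, target]
    if months.sum() < 3:                       # too few points to correlate
        return float("nan")
    reference = filled[months][:, neighbors].mean(axis=1)
    return float(np.corrcoef(filled[months, target], reference)[0, 1])
```

A low or negative value for a district surrounded by well-observed neighbors would flag its reconstruction for closer inspection, in the same spirit as the exclusion of districts #64 and #29 above.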