When a message appears indicating that there are no data to display, the data source you selected has no data for your query. For example, if you select “Yellow Fever,” “Russia,” and “2016” as the disease, location, and year of interest, respectively, the data source (the World Health Organization, for example) reports no data for that combination, so nothing will populate the map or boxplot in the “Historic/Global Analysis” tab. This is not a bug in the tool; it reflects a lack of data reported by the selected data source.
Population counts for the selected location of interest differ between the two available data sources. Users can use their discretion in choosing the population source from which disease incidence will be calculated. World Bank population data are provided at the country level for 1960-2016. These data are available for free at https://data.worldbank.org/indicator/SP.POP.TOTL.
LandScan uses an algorithm that combines spatial data and imagery analysis technologies with multivariate dasymetric modeling methods to disaggregate census counts within administrative boundaries across the world. LandScan data are provided free to government organizations by Oak Ridge National Laboratory at http://web.ornl.gov/sci/landscan/. We have all historical LandScan data from 1998-2016, excluding 1999 (a year for which a dataset was not generated). LandScan provides a 1 km x 1 km grid of the entire world, where the value of each 1 km² cell is the population in that cell. Using the country and state boundaries already ingested in our database courtesy of Natural Earth, we overlaid each country/state boundary on this grid, extracted the grid cells within each boundary, and summed their values to create total population counts for a given location.
In order to do this, each LandScan dataset was converted to the standard Esri ASCII grid format. Data processing was done using Python. We use rasterio to read the grid files and their metadata into memory. We use rasterstats to compute zonal statistics for each country/state boundary in our database. Specifically, we ask it to sum the values, which gives us our total population.
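The boundary-overlay-and-sum step can be illustrated with a minimal, standard-library-only sketch. This is not the rasterio/rasterstats pipeline described above; it swaps in a hand-written Esri ASCII grid parser and a ray-casting point-in-polygon test to show the same idea. It assumes a cell belongs to a boundary when its center falls inside the polygon, and the grid, polygon, and function names are made up for illustration.

```python
def parse_ascii_grid(text):
    """Parse an Esri ASCII grid into (header dict, rows of floats, top row first)."""
    lines = text.strip().splitlines()
    header = {}
    i = 0
    # Header lines look like "ncols 4": an alphabetic key followed by one value.
    while i < len(lines):
        parts = lines[i].split()
        if len(parts) != 2 or not parts[0][0].isalpha():
            break
        header[parts[0].lower()] = float(parts[1])
        i += 1
    rows = [[float(v) for v in line.split()] for line in lines[i:]]
    return header, rows


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon's vertex ring."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def zonal_sum(header, rows, polygon):
    """Sum the values of all cells whose centers fall inside the polygon."""
    size = header["cellsize"]
    nodata = header.get("nodata_value")
    nrows = int(header["nrows"])
    total = 0.0
    for r, row in enumerate(rows):
        # Row 0 is the northernmost row in the Esri ASCII layout.
        cy = header["yllcorner"] + (nrows - r - 0.5) * size
        for c, value in enumerate(row):
            cx = header["xllcorner"] + (c + 0.5) * size
            if value != nodata and point_in_polygon(cx, cy, polygon):
                total += value
    return total


# Made-up 4 x 3 population grid and a square "boundary" covering its top-left corner.
GRID_TEXT = """
ncols 4
nrows 3
xllcorner 0
yllcorner 0
cellsize 1
NODATA_value -9999
10 20 30 40
5 -9999 15 25
1 2 3 4
"""
header, rows = parse_ascii_grid(GRID_TEXT)
boundary = [(0, 1), (2, 1), (2, 3), (0, 3)]
total_population = zonal_sum(header, rows, boundary)  # 10 + 20 + 5, NODATA cell skipped
```

In the real workflow, a zonal-statistics library performs the polygon overlay and summation over the full LandScan grid for every boundary at once.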
LandScan's high-resolution gridded population counts allow us to provide users with both country- and state-level population data, whereas World Bank data are available only at the country level.
To give users access to as many comprehensive case count data sources as possible, we have incorporated several data source options for the case counts used in the incidence calculations we provide. Pan-American Health Organization (PAHO) data, for example, report slightly different case counts for dengue than the World Health Organization (WHO) data source. Furthermore, PAHO is a regional entity and only has data available for the western hemisphere. So while PAHO may be an ideal data source when investigating dengue in Brazil, for example, it would not be preferable for investigating dengue in Southeast Asia. As such, the WHO can be considered a more appropriate default data source, as it has the most complete global case counts.
The spline is calculated using SciPy's univariate spline method. We use the default parameters except for s, the smoothing factor. We set s to a fraction of len(y) times var(y), where y is the vector of values; s is therefore a fraction of the length of the data times the variance of the data.
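As a minimal sketch of the smoothing-factor calculation described above: the fraction itself is not given in this text, so the `fraction=0.5` below is purely a placeholder, and the use of population variance (rather than sample variance) is also an assumption. The function name and the data are made up for illustration.

```python
import statistics


def smoothing_factor(y, fraction):
    """Return fraction * len(y) * variance(y).

    `fraction` is a placeholder argument; the value actually used by the
    tool is not stated here. Population variance is an assumption.
    """
    return fraction * len(y) * statistics.pvariance(y)


# Made-up incidence values for eight consecutive years.
y = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s = smoothing_factor(y, fraction=0.5)  # variance is 4.0, so s = 0.5 * 8 * 4.0

# With SciPy installed, s would then be passed as the smoothing factor, e.g.:
#   from scipy.interpolate import UnivariateSpline
#   spline = UnivariateSpline(x, y, s=s)   # x: strictly increasing years
```

Scaling s by both the length and the variance of the data keeps the amount of smoothing comparable across series of different lengths and magnitudes.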
One of the goals of RED Alert is to detect potential re-emergence and this is done using a machine learning classifier. Classifiers are algorithms that learn a decision function that maps a new observation to a class (from a set of classes, e.g., spam vs. non-spam email) based on the given labeled data (known observation-class pairs, e.g., examples of spam and non-spam emails).
To create the labeled dataset for each disease, the subject matter experts (SMEs) on our team were given data for 100 countries selected at random, and they labeled each location-year pair as a re-emergence or not. For measles and cholera, the disease trends before and after 2000 appear quite different; hence, labeling and classification are performed only on data starting from 2000. For dengue, on the other hand, a lot of data are missing after 2000, so labeling and classification use a longer time frame (starting in 1980). For each disease, SMEs developed a schema that takes into account factors (e.g., raw incidence, case counts, change in incidence) that help detect potential re-emergence and guide the labeling process.
These labeled datasets were used to train classifiers for various diseases. We tried two classifiers: decision tree and random forest. For all diseases, random forest performed better than decision tree and hence RED Alert uses random forest to detect if there is a re-emergence for the given disease in a given location and year.
We performed nested cross-validation (where inner cross-validation is used to select optimal parameters and outer cross-validation is used to test for overfitting) 10 times. The results (i.e., mean and standard deviation across 10 nested cross-validations) are as follows:
Information about various performance measures can be found here.
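The nested cross-validation procedure described above can be sketched in plain Python. This is an illustration only: it swaps in a toy one-dimensional k-nearest-neighbor classifier for the random forest, uses synthetic data, and the fold counts and candidate parameter values are assumptions rather than the settings used for RED Alert.

```python
import random
import statistics


def knn_predict(train, k, x):
    """Majority vote among the k training points nearest to x (labels are 0 or 1)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def score(train, test, k):
    """Accuracy on `test` of a k-NN model that memorizes `train`."""
    return sum(knn_predict(train, k, x) == label for x, label in test) / len(test)


def split_folds(data, n_folds):
    """Split already-shuffled data into n_folds interleaved folds."""
    return [data[i::n_folds] for i in range(n_folds)]


def cross_val(data, k, n_folds):
    """Mean held-out accuracy of k-NN with parameter k over n_folds folds."""
    folds = split_folds(data, n_folds)
    scores = []
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(score(train, held_out, k))
    return statistics.mean(scores)


def nested_cv(data, k_values, outer_folds=5, inner_folds=3, seed=0):
    """Return one held-out accuracy per outer fold."""
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = split_folds(data, outer_folds)
    outer_scores = []
    for i, test in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        # Inner loop: pick the parameter using the outer-training data only.
        best_k = max(k_values, key=lambda k: cross_val(train, k, inner_folds))
        # Outer loop: evaluate the chosen parameter on data it never saw.
        outer_scores.append(score(train, test, best_k))
    return outer_scores


# Synthetic (value, label) pairs: two overlapping classes, 30 points each.
rng = random.Random(42)
data = [(rng.gauss(0, 1), 0) for _ in range(30)] + [(rng.gauss(3, 1), 1) for _ in range(30)]

# Repeat the whole nested procedure 10 times with different shuffles,
# then summarize as mean and standard deviation across the 10 runs.
run_means = [statistics.mean(nested_cv(data, [1, 3, 5], seed=s)) for s in range(10)]
summary = (statistics.mean(run_means), statistics.stdev(run_means))
```

Because the parameter is chosen without ever looking at the outer test fold, the outer scores estimate how well the whole procedure, tuning included, generalizes.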
Multiple component causes are necessary to produce a disease outbreak or a re-emergence event, and these events can arise through a variety of pathways. This visualization supplies a list of components meant to help develop hypotheses about sufficient causes in a re-emergence scenario. Not all factors will be causal in the user's situation; however, each node has been identified through a literature review as contributing to a historical outbreak or re-emergence event for the specific disease. The broad categories of host, pathogen, and environment in the center of the wheel fit the epidemiological triad. With increasing distance from the center, the contributing causes become more specific. The outermost layers, or primary indicators, are designed to provide the user with the most actionable factors that can potentially prevent re-emergence on a specific pathway.
This chart shows the association between the variable selected in the dropdown with respect to time and incidence. Here, the location of the point on the y axis shows the variable value, the location of the point on the x axis shows the year, and the size of the bubble corresponds to the incidence (per 100,000 persons). Please refer to the hover text for point-by-point information.
This chart shows the association between the variable selected in the dropdown with respect to time and incidence. Here, the location of the point on the y axis shows the variable value, the location of the point on the x axis shows the year, and the size of the bubble corresponds to the incidence (per 100,000). There is a series for every country with similar incidence (from the 'Historic/Global Analysis' tab). This means that the countries here had a disease incidence between 50% and 150% of the user's incidence in the year of interest.
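The "similar incidence" rule can be illustrated with a short sketch. It assumes the standard definition of incidence (cases per 100,000 persons) and treats the 50% and 150% bounds as inclusive, which the text does not specify; the function names, country names, and numbers are made up.

```python
def incidence_per_100k(cases, population):
    """Standard incidence rate: cases per 100,000 persons."""
    return cases * 100_000 / population


def similar_countries(incidence_by_country, user_incidence, low=0.5, high=1.5):
    """Return countries whose incidence is within [low, high] times the user's."""
    lower, upper = low * user_incidence, high * user_incidence
    return sorted(
        name
        for name, incidence in incidence_by_country.items()
        if lower <= incidence <= upper
    )


# Hypothetical incidences (per 100,000) for one disease in the year of interest.
incidence_by_country = {
    "Country A": 4.0,
    "Country B": 10.0,
    "Country C": 14.9,
    "Country D": 16.0,
    "Country E": incidence_per_100k(cases=250, population=5_000_000),  # 5.0
}

# A user incidence of 10.0 gives a window of 5.0 to 15.0.
matches = similar_countries(incidence_by_country, user_incidence=10.0)
```

Each country returned this way would appear as its own series in the chart.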
No, these points merely indicate the centroid of the country where the re-emergence event occurred, as re-emergence events are determined at a national level.