data4health is a tool developed as part of the HARMONIZE project to facilitate the access, preprocessing, and aggregation of health data at customized spatiotemporal resolutions. Originally designed for data from Colombia, Brazil, Peru, and the Dominican Republic, the tool is intended to be adaptable for any linelist health data.
The R package offers two modes of operation based on the user's coding experience:
- For users with coding experience: A wide range of functions can be directly used within R.
- For non-coding users: A graphical user interface (GUI) guides users through the data processing pipeline in an intuitive, user-friendly way.
Key Features of the R Package:
- Instructions on how to access health data
- Functions for cleaning and preprocessing health data
- Spatial harmonization, allowing aggregation to any coarser administrative unit
- Temporal harmonization, enabling aggregation to epidemiological weeks or months
- Data visualization capabilities
- Output as a .csv file, formatted to meet user-specified requirements
packages <- c("foreign", "readxl", "writexl", "shiny", "jsonlite")
install.packages(setdiff(packages, rownames(installed.packages())), repos = "http://cran.us.r-project.org")
Since the package is not yet published, you need to contact one of the developers and request a tarball of the package. You can then install it with the following line:
install.packages("/local/path/to/R-packages/harmonize.data4health_0.0.0.9000.tar.gz", repos = NULL, type = "source")
data4health has two main functionalities. For users with coding experience, it provides a series of functions that support health data analysis and can be used to simplify an existing data pipeline. Users with less coding experience can use the graphical user interface to clean and aggregate their data in a user-friendly way.
Functions
data4health_load reads a file into a dataframe in the R environment. Using this function is optional; you can also load a dataframe on your own. Currently it accepts .csv, .rds, .xls, .xlsx, and .dbf files.
data_loaded <- data4health_load("path/data.csv")
It is also possible to load multiple files into one dataframe by passing a list of filenames. In this case, all column names need to match.
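As a rough illustration of the multi-file case, the base R sketch below reads several .csv files, checks that their column names match, and stacks them into one dataframe. The helper name load_many_csv and the two example files are hypothetical; this only shows the idea and is not necessarily how data4health_load works internally.

```r
# Hypothetical helper (not part of data4health): read several .csv files
# into a single dataframe, stopping if the column names differ.
load_many_csv <- function(paths) {
  tables <- lapply(paths, read.csv, stringsAsFactors = FALSE)
  reference <- names(tables[[1]])
  for (i in seq_along(tables)) {
    if (!identical(names(tables[[i]]), reference)) {
      stop("Column names of '", paths[i], "' do not match the first file")
    }
  }
  # Stack all tables row-wise
  do.call(rbind, tables)
}

# Two small example files written to temporary locations
file_a <- tempfile(fileext = ".csv")
file_b <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:2, sex = c("F", "M")), file_a, row.names = FALSE)
write.csv(data.frame(id = 3:4, sex = c("M", "F")), file_b, row.names = FALSE)

combined <- load_many_csv(c(file_a, file_b))
```

Checking the column names before stacking makes a mismatch fail early with a clear message instead of producing a misaligned table.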
Using the following arguments of data4health_clean, you can clean and preprocess the data:
- cols_to_remove / cols_to_include: pass a vector of the column names to be removed or to be included, respectively
- remove_cols_missing: remove columns whose share of missing data is above a certain threshold
- remove_rows_missing: remove entries where a certain column has a missing value (e.g. delete all entries that have no date)
- remove_rows_threshold: remove rows based on threshold values (works similarly to data4health_filter)
- rename_columns: rename columns
- rename_values: rename values within different columns
- week_to_date: convert an epiweek to a Date object
- date_to_week: convert a date to the first date of the epiweek
- date_to_month: convert a date to the first date of the month
data_cleaned <- data4health_clean(data = data_loaded,
                                  cols_to_include = c("DT_NOTIFIC", "ID_MUNICIP", "CS_SEXO"),
                                  remove_rows_missing = c("DT_NOTIFIC"),
                                  rename_columns = c(DT_NOTIFIC = "notification_date",
                                                     ID_MUNICIP = "municipality_code",
                                                     CS_SEXO = "sex"),
                                  date_to_week = "notification_date")
You can add save = TRUE to permanently save the resulting dataframe to your local disk.
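To make the cleaning steps more concrete, here is a minimal base R sketch of roughly equivalent operations on a toy line list. It is an illustration under stated assumptions and does not reproduce the package internals: the toy data, the 50% missingness threshold, the Sunday week start, and the notification_date_month column name are all assumptions made for this example.

```r
# Toy line list using the column names from the example above
raw <- data.frame(
  DT_NOTIFIC = c("2018-01-03", NA, "2018-01-10", "2018-02-15"),
  ID_MUNICIP = c("312710", "312710", "310620", "312710"),
  CS_SEXO    = c("F", "M", "F", NA),
  EMPTY_COL  = c(NA, NA, NA, "x"),
  stringsAsFactors = FALSE
)

# Drop columns whose share of missing values exceeds a threshold (assumed: 0.5)
missing_share <- colMeans(is.na(raw))
cleaned <- raw[, missing_share <= 0.5, drop = FALSE]

# Drop rows that have no notification date
cleaned <- cleaned[!is.na(cleaned$DT_NOTIFIC), , drop = FALSE]

# Rename columns using an "old name = new name" mapping
new_names <- c(DT_NOTIFIC = "notification_date",
               ID_MUNICIP = "municipality_code",
               CS_SEXO = "sex")
names(cleaned) <- unname(new_names[names(cleaned)])

# Convert to Date, then derive the first day of the week and of the month.
# Assumption: weeks start on Sunday; the package may use another convention.
cleaned$notification_date <- as.Date(cleaned$notification_date)
weekday <- as.integer(format(cleaned$notification_date, "%w"))  # 0 = Sunday
cleaned$notification_date_week <- cleaned$notification_date - weekday
cleaned$notification_date_month <- as.Date(format(cleaned$notification_date, "%Y-%m-01"))
```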
You can filter any column with data4health_filter by passing a list specifying how you want to filter. The possibilities are:
- numeric: "over", "under", "between"
- Date: "after", "before", "during"
- character: "include", "exclude"
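The filter types listed above can be sketched in base R as plain logical conditions. This is only an illustration: the toy data and the age column are hypothetical, and the inclusive treatment of the "between" and "during" bounds is an assumption, not documented behaviour of data4health_filter.

```r
# Toy data; the "age" column is hypothetical and only used to show a numeric filter
cases <- data.frame(
  municipality_code = c("312710", "312710", "310620", "312710"),
  sex = c("F", "M", "F", "F"),
  age = c(34, 51, 8, 70),
  notification_date = as.Date(c("2018-03-01", "2018-05-12", "2018-07-30", "2019-01-15")),
  stringsAsFactors = FALSE
)

# character, "include": keep rows whose value is in the given set
keep_include <- cases$sex %in% c("F")

# Date, "during": keep rows inside a date interval (bounds assumed inclusive)
period <- as.Date(c("2018-01-01", "2018-12-31"))
keep_during <- cases$notification_date >= period[1] &
  cases$notification_date <= period[2]

# numeric, "between": keep rows inside a numeric range (bounds assumed inclusive)
keep_between <- cases$age >= 18 & cases$age <= 65

# Combine all conditions; a row is kept only if it passes every filter
filtered <- cases[keep_include & keep_during & keep_between, , drop = FALSE]
```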
data_filtered <- data4health_filter(data = data_cleaned,
                                    municipality_code = list(include = c("312710")),
                                    sex = list(include = c("F")),
                                    notification_date = list(during = c("2018-01-01", "2018-12-31")))
It is possible to aggregate the data temporally and spatially using the data4health_aggregate() function. The following arguments select the columns by which to aggregate:
- space_col: selects the column by which to aggregate spatially
- time_col: selects the column by which to aggregate temporally
- add_col: selects any additional column(s) by which you would like to aggregate
To avoid any missing timesteps or missing regions, you can also pass any of the following:
- all_times: a vector of all timesteps (using the seq() function with a start date, an end date, and a timestep is highly recommended)
- all_spaces: a vector that contains all regions
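The role of all_times and all_spaces can be illustrated with a small base R sketch: cases are counted per region and week, and every region and week combination without cases is filled with zero. The toy data and the count column name n are assumptions for this example, and the code is not meant to mirror the internals of data4health_aggregate().

```r
# Toy cleaned line list: one row per case
cases <- data.frame(
  municipality_code = c("312710", "312710", "310620"),
  notification_date_week = as.Date(c("2018-01-07", "2018-01-07", "2018-01-21")),
  stringsAsFactors = FALSE
)

# Full set of timesteps (built with seq()) and regions
all_times <- seq(as.Date("2018-01-07"), as.Date("2018-01-21"), by = "week")
all_spaces <- c("310620", "312710")

# Count cases per region and week; "n" is a hypothetical column name
cases$n <- 1
counts <- aggregate(n ~ municipality_code + notification_date_week,
                    data = cases, FUN = sum)

# Build every region x week combination and attach the counts
grid <- expand.grid(municipality_code = all_spaces,
                    notification_date_week = all_times,
                    stringsAsFactors = FALSE)
aggregated <- merge(grid, counts, all.x = TRUE)

# Combinations without any case get an explicit zero instead of NA
aggregated$n[is.na(aggregated$n)] <- 0
```

Without the complete grid, weeks or regions with no reported cases would simply be absent from the output, which can silently distort time series built from it.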
data_aggregated <- data4health_aggregate(data = data_cleaned,
                                         space_col = "municipality_code",
                                         time_col = "notification_date_week")
To visualise the results, it is recommended to use GHRexplore. Here are a few example plots.
Yet to come!
Graphical user interface
Once data4health is loaded, the user interface can be launched with the following command:
data4health_ui()
A browser window will open automatically. There you can see several tabs:
In the cleaning tab, you can perform all cleaning steps available in data4health_clean. Every step is explained, and graphs show the content of the data.
The aggregation tab does the same as data4health_aggregate().
Finally, within the visualisation tab, you can visualise the data using plots produced by GHRexplore.
HARMONIZE is an international project that aims to develop cost-effective and reproducible digital tools for stakeholders in hotspots affected by a changing climate in Latin America & the Caribbean (LAC), including cities, small islands, highlands, and the Amazon rainforest.
The HARMONIZE digital toolkits will allow local researchers and users, including national disease control programs, to link, interrogate and use multi-scale spatiotemporal data, to understand the links between environmental change and infectious disease risk in their local context, and to build robust early warning and response systems in low-resource settings.
The project consists of resources and tools developed in conjunction with different teams from Brazil, Colombia, Dominican Republic, Peru and Spain.
Within HARMONIZE, each data source has its own digital toolkit that allows local researchers and users to prepare, interrogate, and eventually merge the data spatiotemporally, to understand the links between environmental change and infectious disease risk in their local context, and to build robust early warning and response systems in low-resource settings. The other toolkits are:
The package website includes a function reference, a model outline, and case studies using the package. The site mainly concerns the release version, but you can also find documentation for the latest development version.
GHR Global Health Resilience
Authors and contributors of the package, who can be contacted with questions or feedback:
- Daniela Lührsen: AI4S Fellow, Health & Climate Data Scientist, Barcelona Supercomputing Center, Global Health Resilience
- Raquel Martins Lana: Marie Curie Fellow, Recognised Researcher, Barcelona Supercomputing Center, Global Health Resilience
- APA Format:
- TBD
