Global Unique Identifiers (GUIDs)
INCLUDE Data Hub: NDA GUID Strategy
Background
In 2020, the NIH INCLUDE (INvestigation of Co-occurring conditions across the Lifespan to Understand Down syndromE) Project funded the establishment of the INCLUDE Data Coordinating Center (DCC). The DCC aims to streamline access to and analysis of data generated from longitudinal studies of individuals with Down syndrome (DS). In March 2022, the DCC launched the INCLUDE Data Hub, an innovative researcher data repository and portal that provides access to de-identified demographics and clinical metadata, as well as multi-omics datasets, including whole genome sequences, transcriptomes, proteomes, and metabolomes. The INCLUDE Data Hub operates on a ‘registered access’ model, allowing researchers to use the portal and its datasets upon acceptance of specific usage conditions. Additionally, the hub directs users to ‘controlled access’ data (such as whole genome sequences) for specific studies, contingent on receiving authorization from an NIH Data Access Committee (DAC) via the relevant dbGaP study. As of mid-2024, the INCLUDE Data Hub administers data from more than 9,000 research participants across 13 studies, 7 of which have controlled access data in their corresponding dbGaP studies. However, the utility of the INCLUDE Data Hub is currently diminished by the inability to identify participants who are duplicated across studies (Figure 1). A strategic approach to linking data would enable researchers to merge multiple datasets and data types, enriching the available data and reducing costly redundancies in large-scale data generation (e.g., -omics), while mitigating potential risks introduced by linkage.

Figure 1: Difficulty in defining whether research participants are unique or duplicated across studies. Screenshots from the INCLUDE Data Hub when filtering for females with trisomy 21 and celiac disease.
Based on recommendations from the NICHD Office of Data Science and Sharing, which completed a comprehensive assessment of existing Privacy Protecting Record Linkage (PPRL) and associated data governance solutions, the INCLUDE DCC and the associated NIH Steering Committee proposed to implement the NIMH Data Archive Global Unique Identifier (NDA GUID) in the INCLUDE Data Hub. The NDA GUID Tool is a government-owned and government-operated PPRL technology that enables researchers to link data from multiple sources to the same individual without revealing personally identifiable information (PII). The NDA GUID has been used effectively in other research ecosystems to link complementary datasets across studies. It is already employed by key studies in the INCLUDE Data Hub, such as DS-Connect® and the Human Trisome Project (HTP), and their initial assessment of feasibility indicates that it could be widely adopted in the INCLUDE ecosystem. The generation and use of NDA GUIDs are illustrated in Figure 2.

Figure 2: Generation and use of NDA GUIDs.
Approach
dbGaP Registration for INCLUDE Data Hub distribution of NDA GUIDs
NIH INCLUDE staff has registered a study in dbGaP for managing distribution of an INCLUDE GUID Mapping File, which will include all available NDA GUIDs and their associated mappings to study-specific participant IDs across INCLUDE Data Hub datasets.
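As a rough illustration of what such a mapping might look like in practice, the sketch below reads a small CSV and indexes it by GUID. The column names (`nda_guid`, `study_code`, `participant_id`), the CSV layout, and all values are hypothetical placeholders; the actual format of the INCLUDE GUID Mapping File is not specified here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical layout: one row per (GUID, study, study-specific participant ID).
# Column names and values are illustrative placeholders, not the real file format.
SAMPLE = """nda_guid,study_code,participant_id
GUID-0001,STUDY-A,A-17
GUID-0001,STUDY-B,B-204
GUID-0002,STUDY-A,A-18
"""


def load_mapping(text):
    """Return {nda_guid: [(study_code, participant_id), ...]} from CSV text."""
    index = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        index[row["nda_guid"]].append((row["study_code"], row["participant_id"]))
    return dict(index)


mapping = load_mapping(SAMPLE)
```

In this toy data, the placeholder `GUID-0001` maps to participant IDs in two studies, which is the situation the mapping file is meant to expose to approved users.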
Requests to access the INCLUDE GUID Mapping File will be managed by the NHLBI DAC. Users will be required to sign the associated data user agreement, which will set out requirements for use of the NDA GUIDs, including:
- Approved Users must adhere to all NDA GUID Tool Terms of Use, as appropriate.
- Approved Users will not publicly distribute or publish NDA GUIDs. NDA GUIDs may be internally distributed to research team members for research purposes only.
- Approved Users are responsible for adhering to all dataset data use limitations even when one participant is represented in multiple datasets.
- Through the INCLUDE Data Hub, Approved Users who have created NDA GUIDs for participants who are part of active IRB protocols may gain access to more data about an individual participant than they, themselves, collected. Consequently, these research activities may be considered “human subjects research” within the scope of 45 C.F.R. 46. Approved Users must comply with the requirements contained in 45 C.F.R. 46, as applicable, which may require IRB approval.
GUID Generation & Submission
A step-by-step guide for how to generate GUIDs with the NDA GUID tool can be found here.
The INCLUDE DCC will have no access to personally identifiable information (PII) for any of the studies in the Data Hub. Responsibility for generating the GUIDs will reside with the study submitter, who may submit GUIDs to the DCC during data ingest. The INCLUDE Data Intake Sheet, which collects study metadata and defines submission requirements, has been modified to incorporate the use of GUIDs.
The INCLUDE Data Hub Data Submission Intake Sheet gives researchers the option to submit NDA GUIDs as part of their data submission.
- The Intake Sheet references this webpage to explain how submitted GUIDs will be shared with other researchers (i.e., after approval via the dbGaP study).
- Submitters are not required to provide GUIDs. They must work with the appropriate parties (e.g., research team, institution) to determine whether generation and submission of GUIDs is appropriate.
- In the Intake Sheet, submitters indicate which data require controlled access and which may be shared in the INCLUDE Data Hub’s registered tier. Controlled access data require the submission of the GDS Institutional Certification for study-level registration in dbGaP. While registered tier data do not require dbGaP registration, the sharing of GUIDs associated with registered tier data will be managed as controlled access through the single dbGaP study.
- The Intake Sheet references this webpage to encourage researchers to obtain consent for linkage (where feasible). Example consent language is provided that addresses consent, as well as assent when a research study involves children. This language is based on existing consent language but has not necessarily been approved by an IRB in its entirety. Revisions should be made to fit the circumstances of a given record linkage implementation, as appropriate. Note: for the legal guardian, the language refers to “your child”; for the child, the language refers to “you”:
- If [you/your child] join this study, we will gather data about [you/your child]. What we learn in this study will be put in a secure NIH-designated storage location called a data repository, where these data would be shared for future research. Information about [you/your child] will be “de-identified,” which means it will not include anything that identifies [you/your child]. [NIH] will approve researchers from all over the world to access information from the repository. Researchers will agree not to attempt to identify [you/your child]. It is possible that if [you/your child] participate[s] in more than one study, researchers may be able to combine de-identified data from multiple studies to ease the burden on researchers and participants alike. The purpose of sharing this information is to make more research possible that may improve children’s and everyone’s health. This sharing of information will be done without obtaining additional permission from [you/your child].
The Intake Sheet will reference NIMH’s instructions for generation of NDA GUIDs: https://nda.nih.gov/nda/using-the-nda-guid.html
- In order to generate the NDA GUIDs, researchers must obtain approval from the NDA Program Lead, which includes concurrence from an institutional signing official (https://nda.nih.gov/nda/standard-operating-procedures.html#sop8).
- Submitters providing NDA GUIDs must adhere to the NDA GUID Tool Terms of Use.
INCLUDE Data Hub Management of NDA GUIDs
Study-generated GUIDs will be treated with the same safeguards and precautions as other types of controlled access data currently handled by the DCC (e.g., whole genome sequences). These protections have been established to meet NIH’s requirements for genomic data repositories under the NIH Genomic Data Sharing Policy as well as other requirements.
GUIDs will not be displayed in the Data Hub registered tier. Instead, the registered tier will indicate which studies have GUIDs available and Data Hub users will have to obtain approval to access GUIDs through the dedicated dbGaP study.
The INCLUDE DCC will not complete any linkage of datasets for duplicate research participants and will not indicate in the Data Hub registered tier which individual research participants may be enrolled in more than one study. Instead, linkage will have to be completed by users who have obtained access to the INCLUDE GUID Mapping File (Figure 3).

Figure 3: Data Hub users will be able to access the INCLUDE GUID Mapping File after securing access from the corresponding dbGaP study.
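Because linkage is left to approved users, a user-side workflow might resemble the following minimal sketch. It assumes the mapping file has already been loaded as (GUID, study, participant ID) rows; the row layout, the function names, and all identifiers are hypothetical and are not part of the Data Hub or the NDA GUID Tool. It also assumes that records without a GUID cannot be linked and are therefore counted as separate individuals.

```python
from collections import defaultdict

# Hypothetical rows from the mapping file: (nda_guid, study_code, participant_id).
MAPPING_ROWS = [
    ("GUID-0001", "STUDY-A", "A-17"),
    ("GUID-0001", "STUDY-B", "B-204"),
    ("GUID-0002", "STUDY-A", "A-18"),
]


def count_unique_participants(cohort, mapping_rows):
    """Count distinct individuals in a cohort of (study_code, participant_id) pairs.

    Records sharing an NDA GUID are counted once. Records with no GUID cannot
    be linked, so each is counted as a separate individual (an assumption).
    """
    guid_of = {(study, pid): guid for guid, study, pid in mapping_rows}
    identities = set()
    for record in cohort:
        guid = guid_of.get(record)
        identities.add(("guid", guid) if guid is not None else ("unlinked", record))
    return len(identities)


def duplicates_across_studies(mapping_rows):
    """Return {nda_guid: sorted study codes} for GUIDs seen in more than one study."""
    studies = defaultdict(set)
    for guid, study, _pid in mapping_rows:
        studies[guid].add(study)
    return {guid: sorted(codes) for guid, codes in studies.items() if len(codes) > 1}


# A filtered cohort returned by two studies, as in the Figure 1 scenario.
cohort = [("STUDY-A", "A-17"), ("STUDY-B", "B-204"), ("STUDY-B", "B-999")]
unique_count = count_unique_participants(cohort, MAPPING_ROWS)
shared = duplicates_across_studies(MAPPING_ROWS)
```

Here three cohort records resolve to two individuals, since two of the records share a GUID. Any merged dataset built this way remains subject to the data use limitations of each contributing dataset.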
Final remarks
The adoption and implementation of the NDA GUID will facilitate cross-study research projects and linkage of complementary datasets for duplicate participants and thus accelerate the pace of scientific discoveries that may benefit people with DS.
