GeneTegra RDW System Overview
GeneTegra’s Research Data Warehouse (RDW) is a software platform for integrating biomedical data. Its main objective is to organize, annotate, and store in structured form the diverse datasets collected from multiple sources, readying them for meta-analysis studies. A typical meta-analysis research project involves data collection, annotation, storage, and analysis. GeneTegra RDW addresses many of the challenges that arise in such studies, among them the integration of data from multiple sources and in multiple formats, the harmonization of the data, access to a scalable warehouse that supports different kinds of data (such as clinical and sequence data), and the storage of results and provenance data so that studies can be reproduced.
GeneTegra RDW consists of three components: a model-driven Extract-Transform-Load (ETL) engine that applies Semantic Web technologies to the design and execution of data extraction, transformation, and loading procedures; a flexible dimensional model architecture designed to enable the pervasive use of ontologies for data harmonization and semantic consistency; and a Web-based user interface for managing the terminology and for configuring and executing data storage templates.
GeneTegra RDW was developed under a contract awarded by the Centers for Disease Control. Biomedical researchers, bioinformaticians, and information technology experts have been closely involved from the beginning in its development, evaluation, and testing in a variety of research endeavors.
Ontology Management
In GeneTegra RDW, a standard terminology or data dictionary used for data harmonization is stored in an ontology. Ontologies are powerful Semantic Web mechanisms that provide a logical and coherent representation of conceptual entities and their interrelations within a knowledge domain. Numerous standardized ontologies have been defined for different aspects of biomedical research and the life sciences, including diseases, biological and genetic processes, clinical and demographic patient data, and many others. The Ontology Management Web interface makes it easy to augment the GeneTegra RDW ontology with concepts specific to a particular biomedical domain.
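To make the idea of augmenting an ontology with domain-specific concepts concrete, here is a minimal sketch in Python. It is not GeneTegra RDW's implementation: the Ontology class, its methods, and the example concept names are all hypothetical, and a real system would more likely use a Semantic Web format such as OWL or RDF. The sketch only shows concepts linked by "is-a" relations and how a new concept inherits the meaning of its ancestors.

```python
class Ontology:
    """Hypothetical in-memory ontology: concepts linked by is-a relations."""

    def __init__(self):
        # Maps each concept to the set of its direct parent concepts.
        self.parents = {}

    def add_concept(self, concept, parent=None):
        """Add a concept, optionally placing it under an existing parent."""
        if parent is not None and parent not in self.parents:
            raise KeyError(f"unknown parent concept: {parent}")
        self.parents.setdefault(concept, set())
        if parent is not None:
            self.parents[concept].add(parent)

    def ancestors(self, concept):
        """Return every concept reachable by following is-a links upward."""
        seen = set()
        stack = list(self.parents[concept])
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents[current])
        return seen


# Illustrative concept names only; they are not taken from the RDW ontology.
onto = Ontology()
onto.add_concept("Disease")
onto.add_concept("Metabolic disease", parent="Disease")
onto.add_concept("Diabetes mellitus", parent="Metabolic disease")
```

With this structure, data annotated as "Diabetes mellitus" can also be found by a query for "Disease", which is the kind of semantic consistency an ontology-backed data dictionary is meant to provide.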
Ontology-Based, Model-Driven ETL Engine
The ETL engine extracts incoming biomedical data, transforms it into annotated data, and stores it in the GeneTegra RDW warehouse. It builds on the increased availability of controlled vocabularies and metadata standards to streamline data harmonization, semantic annotation, and storage of both clinical and sequence data through a highly scalable research data warehouse model. The warehouse metadata description follows a single-modification-point ETL approach: data is imported from source to destination data stores through the design and execution of custom ETL templates, so that any change (e.g., modification of source or destination entities, or changes in the ETL processes) can be easily deployed to both the ETL process and the graphical user interface. The warehouse metadata description serves three purposes: (1) it is used to materialize a database, (2) it is used to dynamically construct a web interface for loading data into the warehouse, and (3) it is used to create ETL data transfer templates through the GeneTegra RDW web portal.
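The single-modification-point idea can be sketched as follows. This is an illustration under stated assumptions, not the actual GeneTegra RDW metadata format: the WAREHOUSE_METADATA dictionary, the table and column names, and all function names are hypothetical, and SQLite stands in for whatever database the warehouse uses. The point is that one description drives all three outputs named above: the database schema, the data-loading form, and the ETL template.

```python
import sqlite3

# Hypothetical warehouse metadata description: one dimension and one fact table.
WAREHOUSE_METADATA = {
    "dim_patient": {
        "kind": "dimension",
        "columns": {
            "patient_id": "INTEGER PRIMARY KEY",
            "sex": "TEXT",
            "birth_year": "INTEGER",
        },
    },
    "fact_observation": {
        "kind": "fact",
        "columns": {
            "observation_id": "INTEGER PRIMARY KEY",
            "patient_id": "INTEGER",
            "concept": "TEXT",
            "value": "REAL",
        },
    },
}


def build_ddl(metadata):
    """(1) Derive CREATE TABLE statements from the metadata description."""
    statements = []
    for table, spec in metadata.items():
        cols = ", ".join(f"{name} {sqltype}" for name, sqltype in spec["columns"].items())
        statements.append(f"CREATE TABLE {table} ({cols})")
    return statements


def materialize(metadata):
    """Create an in-memory database whose schema matches the metadata."""
    conn = sqlite3.connect(":memory:")
    for statement in build_ddl(metadata):
        conn.execute(statement)
    return conn


def build_form_fields(metadata, table):
    """(2) Derive the fields of a data-loading form for one table."""
    return [
        {"name": name, "label": name.replace("_", " ").title()}
        for name in metadata[table]["columns"]
    ]


def build_etl_template(metadata, table, source_to_target):
    """(3) Build an ETL template, rejecting targets the metadata does not define."""
    known = set(metadata[table]["columns"])
    unknown = [target for target in source_to_target.values() if target not in known]
    if unknown:
        raise ValueError(f"columns not in {table}: {unknown}")
    return {"target_table": table, "mappings": dict(source_to_target)}
```

In this sketch, adding a column to WAREHOUSE_METADATA is the only change needed for it to appear in the schema, in the form, and as a valid ETL mapping target.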
A Dynamically Built Web Portal
The RDW Web Portal gives users the flexibility to load and examine their data, to design ETL templates, to augment or redesign the ontology, and to search for data. The warehouse metadata description file links together the dimensional data model, the system ontology, and the ETL engine, enabling the user to create data transfer templates directly from the portal. The semantically annotated warehouse metadata is encoded so that it ties to the semantics imported from the GeneTegra RDW ontology. This design enables the user interface to be generated dynamically from the dimensional model, and thus accommodates data model modifications such as adding new dimension and fact tables. The interface guides the user in constructing templates that specify ETL processes. These processes can be augmented with specialized transformation modules that implement additional data transfer tasks, including time conversions, concatenation of columns, mapping to ontology concepts via exact or fuzzy matching, and data quality rules. The ETL engine also includes a semantic annotation component that generates mappings between lexical entities and concepts in an ontology.
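The transformation tasks listed above can be illustrated with a short Python sketch. None of this is GeneTegra RDW code: the function names, the ONTOLOGY_LABELS table, and the "EX:" concept identifiers are hypothetical, and the fuzzy matching shown here uses the standard library's difflib as a stand-in for whatever matching technique the actual semantic annotation component applies.

```python
from datetime import datetime
from difflib import get_close_matches

# Hypothetical label-to-concept table; the "EX:" identifiers are made up.
ONTOLOGY_LABELS = {
    "diabetes mellitus": "EX:0001",
    "hypertension": "EX:0002",
    "asthma": "EX:0003",
}


def convert_time(value, source_format="%m/%d/%Y"):
    """Time conversion: parse a source date string and emit ISO 8601."""
    return datetime.strptime(value, source_format).date().isoformat()


def concatenate(row, columns, separator=" "):
    """Column concatenation: join several source columns into one value."""
    return separator.join(str(row[column]) for column in columns)


def map_to_concept(term, labels=ONTOLOGY_LABELS, cutoff=0.8):
    """Map a lexical term to a concept, trying an exact match before a fuzzy one."""
    key = term.strip().lower()
    if key in labels:
        return labels[key], "exact"
    close = get_close_matches(key, list(labels), n=1, cutoff=cutoff)
    if close:
        return labels[close[0]], "fuzzy"
    return None, "unmatched"


def check_range(value, low, high):
    """Data quality rule: accept a value only if it lies within [low, high]."""
    return low <= value <= high
```

Returning the match type alongside the concept lets a reviewer confirm fuzzy matches before they are loaded, which is one plausible design for keeping annotations trustworthy.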