Ansarietal. Journal of Biomedical Semantics (2025) 16:16
https://doi.org/10.1186/s13326-025-00335-4
RESEARCH Open Access
© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0
International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if
you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or
parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated
otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To
view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
A prototype ETL pipeline thatuses HL7 FHIR
RDF resources whendeploying pure functions
toenrich knowledge graph patient data
Adeel Ansari1* , Marisa Conte2 , Allen Flynn2 and Avanti Paturkar2
Abstract
Background For clinical care and research, knowledge graphs with patient data can be enriched by extracting
parameters from a knowledge graph and then using them as inputs to compute new patient features with pure
functions. Systematic and transparent methods for enriching knowledge graphs with newly computed patient features
are of interest. When enriching the patient data in knowledge graphs this way, existing ontologies and well-known
data resource standards can help promote semantic interoperability.
Results We developed and tested a new data processing pipeline for extracting, computing, and returning newly
computed results to a large knowledge graph populated with electronic health record and patient survey data. We
show that RDF data resource types already specified by Health Level 7’s FHIR RDF effort can be programmatically
validated and then used by this new data processing pipeline to represent newly derived patient-level features.
Conclusions Knowledge graph technology can be augmented with standards-based semantic data processing
pipelines for deploying and tracing the use of pure functions to derive new patient-level features from existing data.
Semantic data processing pipelines enable research enterprises to report on new patient-level computations of
interest with linked metadata that details the origin and background of every new computation.
Keywords Linked data, FHIR, Fast Healthcare Interoperability Resources, Pure functions, Data enrichment,
Knowledge graphs, ShEx validation, FAIR, ETL
*Correspondence:
Adeel Ansari
adeel.ansari@camh.ca
1 Krembil Centre for Neuroinformatics, Centre for Addiction and Mental
Health, 250 College St., Toronto, ON M5T 1R8, Canada
2 Department of Learning Health Sciences, Knowledge Systems
Laboratory, University of Michigan Medical School, Victor Vaughan
Building, Room 209, 1111 Catherine St., Ann Arbor, MI 48109, USA
Background
In the domains of biomedicine and health, an increasing
number of machine-executable pure functions are used
for making computations. A pure function is a stateless
deterministic mapping that always returns the same
computed value (or result) when given identical inputs. A
simple example is the Body Mass Index (BMI) function,
which has these two common forms:

$$\mathrm{BMI} = \frac{\text{weight (kg)}}{\text{height (m)}^{2}}$$

$$\mathrm{BMI} = 703 \times \frac{\text{weight (lbs)}}{\text{height (inches)}^{2}}$$

The inputs needed to compute with a pure function,
like weight and height for the BMI function, are called
"arguments" or "parameters" [1]. What makes pure
functions "pure" is that, when implemented in software
code, they exclusively map inputs to outputs and have no
other software effects [1].
Examples of other pure functions include functions for
computing numerical scores from questionnaire data, as
with the Generalized Anxiety Disorder 7-item Scale [2],
mathematical regression functions for estimating risks
[3] and some functions arising from machine learning
[4]. This paper explores how to leverage semantic web
technologies to track and organize the use of biomedical
pure functions to enrich existing patient data knowledge
graphs.
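As a minimal sketch of what "pure" means in code, the two common forms of the BMI function can be written in Python as shown here. The function names and the input checks are illustrative assumptions, not taken from the Knowledge Object discussed later in this paper.

```python
def bmi_metric(weight_kg: float, height_m: float) -> float:
    """BMI from weight in kilograms and height in metres."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_imperial(weight_lbs: float, height_in: float) -> float:
    """BMI from weight in pounds and height in inches (factor 703)."""
    if weight_lbs <= 0 or height_in <= 0:
        raise ValueError("weight and height must be positive")
    return 703 * weight_lbs / height_in ** 2
```

Neither function reads or writes any state outside its arguments, so repeated calls with identical inputs return identical results.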
In biomedicine, the use of machine learning methods
to generate mathematical-statistical functions is
increasing, as is the creation of functions comprised
of production rules with conditional "IF–THEN" logic.
As a consequence, the overall number of biomedical
pure functions is growing rapidly [5]. Many of these
biomedical pure functions are used to compute patient-
level properties (also called patient features) from
existing patient data.
Generally speaking, data enrichment is adding
information to an existing dataset. The focus of this
paper is specifically on data enrichment that arises from
the use of parameters available in a knowledge graph to
produce more and different data via new computations.
In this case, enriching a knowledge graph comes about by
first extracting input parameters to pure functions from a
graph, then computing new features using pure functions
outside of the graph environment, then returning those
newly computed features as additional semantic data
back to the knowledge graph from which the input
parameters came.
Datasets originating from Electronic Health Record
(EHR) databases or patient data registries can be enriched
by adding newly computed patient-level features [6,
7]. Researchers typically use these added patient-level
features to stratify patient records, explore hypothetical
cause and effect relationships, and answer a wide range
of biomedical research questions. To better support this
kind of patient dataset enrichment, this study explores
how to take advantage of linked data, knowledge graphs,
and existing Health Level 7 Fast Healthcare Interoperability
Resources (HL7 FHIR) data resource types to represent
some of the semantics of pure functions and the patient-
level computations they enable [8, 9].
Regarding linked data, the Resource Description
Framework (RDF) specifies a format where any domain
of interest can be represented using a pattern of triples,
where each triple relates a subject resource to an object
resource via a predicate relationship [10, 11]. The linked
data in any RDF dataset can be serialized in a variety of
formats and visualized as a directed knowledge graph
[12]. RDF knowledge graphs can be queried using
SPARQL [13]. These knowledge graphs can also be
used to compute novel inferences using various logical
reasoners [14].
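To make the triple pattern concrete, the sketch below stores a few subject-predicate-object triples as plain Python tuples and matches them against a pattern with wildcards, which is roughly what a single SPARQL triple pattern does. The prefixes, identifiers, and predicate names are invented for illustration; a real deployment would use an RDF library and a SPARQL endpoint rather than this toy matcher.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical triples; none of these identifiers come from the CAMH graph.
TRIPLES = [
    ("ex:patient1", "rdf:type", "fhir:Patient"),
    ("ex:obs1", "rdf:type", "fhir:Observation"),
    ("ex:obs1", "ex:subject", "ex:patient1"),
    ("ex:obs1", "ex:value", "18"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts like a query variable."""
    for triple in triples:
        if (
            (s is None or triple[0] == s)
            and (p is None or triple[1] == p)
            and (o is None or triple[2] == o)
        ):
            yield triple


# Which resources have ex:patient1 as their subject?
about_patient = [s for s, _, _ in match(TRIPLES, p="ex:subject", o="ex:patient1")]
```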
Some biomedical research and healthcare provider
organizations have begun converting their patient-level
electronic health record (EHR) data from original
proprietary data formats into RDF triples for knowledge
graphs. As an example of this, the Centre for Addiction
and Mental Health (CAMH) in Toronto, Ontario, Canada
routinely converts its EHR data into RDF triples and
then adds these triples to a patient-oriented knowledge
graph built using the open source Blue Brain Nexus
platform [15]. CAMH is Canada’s largest mental health
teaching hospital. Within CAMH, computational models
are used to understand and help treat mental illnesses.
To support this research and accelerate discovery using
knowledge graph technology, CAMH integrates vast
amounts of semantic data from genetics, brain imaging,
and EHR records [16]. The EHR data in the CAMH
knowledge graph currently represents patient data
using a variety of HL7 FHIR RDF data resource types.
Additional data models used in the CAMH graph include
the Neuroimaging Data Model (NIDM), the Provenance
Ontology (PROV-O), and various schemas from schema.org
[16]. For this study, CAMH wished to build out a
computational pipeline for enriching these patient-level
data by computing with pure functions.
Several challenges must be met when enriching patient
data in knowledge graphs by computing with pure
functions. A critical challenge is to safeguard patient data
privacy and security. One approach is to keep patient
data within the local IT security zone of the organization
that collects them rather than shuttle them to and from
computing systems outside of the organization [17]. To
keep computation local, instances of biomedical pure
functions must be deployed inside the same technical
environments where patient data reside. This study put
technology for computing with pure functions inside
CAMH’s technical environment alongside their existing
patient-oriented knowledge graph.
In addition to upholding data security, a second challenge
is for creators of pure functions to be able to
manage the growing number of them used by their
organizations. As others note, effective management of
pure functions requires metadata sufficient to make pure
functions findable, accessible, interoperable and reusable
(the FAIR principles) [18, 19]. To help with this, the
Knowledge Systems Laboratory (KSL) at the University
of Michigan (U-M) has previously created the Knowledge
Object Implementation Ontology (KOIO) [20].
We have demonstrated how KOIO can assist developers
to produce Knowledge Objects containing software
implementations of pure functions with corresponding
tests and documentation [21]. KSL currently works with
researchers in the global Mobilizing Computable Biomedical
Knowledge (MCBK) community on realizing the
FAIR principles for pure functions [22–24]. A relevant
example of this work is a Knowledge Object containing
an implementation of the BMI function with metadata
[25]. Its metadata, serialized as Turtle RDF triples in
Fig. 1 below on the left with a corresponding RDF graph
view of the key triples on the right, describe and depict
an implementation and software tests for the BMI function.
A software implementation of the BMI function is
located in a JavaScript file called bmi.js.
A third challenge to enriching knowledge graph data
is to represent pure functions, newly computed patient
features, and their provenance in a common way. Ideally,
all new RDF triples added to the graph would conform to
one or more well-known RDF schemas. For this, we ultimately
selected the emerging RDF standard coming from
the HL7 FHIR community [26]. Unlike KOIO, which we
developed but which is not widely known, HL7 FHIR RDF
provided this project with community-endorsed standard
data resource types for representing pure functions and
the new computations produced by using them [27].
Leveraging our prior work with knowledge graphs
and linked metadata about pure functions, we stood
up a local data processing pipeline for Extracting,
Transforming, and Loading (ETL) standardized HL7
FHIR RDF data resources representing pure functions,
new computations generated from using pure functions,
and the provenance of these computations. We report the
results of a technical feasibility study along these lines.
Research questions
In this study, we sought to do the following:
1. Determine the essential information required to
effectively trace and document semantic data enrichment
processes involving pure functions.
2. Determine the primary components of ETL pipelines
for semantic data enrichment processes involving
pure functions.
3. Demonstrate what is required to adopt and conform
to existing HL7 FHIR RDF Library, Observation, and
Provenance data resource standards in the context of
semantic data enrichment processes involving pure
functions.
Methods
Project initiation andmanagement
is project was initiated by team members at CAMH
and KSL. e research team includes experienced health
data and computer scientist-engineers plus library
and information scientists familiar with the health
and healthcare domains. Over more than three years,
with some intermittency due to the COVID-19 global
pandemic, the research team met many times via online
video conference to discuss and work on the design,
development, and trial demonstration of the planned ETL
pipeline for enriching CAMH’s patient data knowledge
graph. CAMH team members took the lead on software
development for the ETL pipeline. KSL team members
took the lead on metadata development and HL7 FHIR
RDF Library, Observation, and Provenance data resource
conformance testing. Project documentation was created
and stored either in the Google Docs platform at U-M
or in an instance of Confluence made available to the
project by CAMH.
Fig. 1 Knowledge Object Metadata as RDF Triples. On the left is a Turtle representation of the metadata for a Knowledge Object containing a BMI
function. On the right is a view of the key metadata triples for the BMI Knowledge Object visualized as a graph
Technical methods overview
This study utilized a number of technical methods. First,
we leveraged existing knowledge graph technology.
Second, we did formal data modeling to determine the
information required to trace and document semantic
data enrichment. Third, we adopted and validated
relevant HL7 FHIR RDF data resource types. Fourth, we
developed and packaged an example pure function with a
corresponding API mechanism and metadata about the
function. Fifth, we developed and tested an ETL pipeline
for semantic data enrichment by leveraging the prior four
efforts.
Knowledge graph availability viaCAMH
roughout the project, CAMH provided access to
development and production instances of its Knowledge
Graph loaded with patient data. CAMH’s Knowledge
Graph integrates multimodal data including: Electronic
Health Record data, patient questionnaire responses
from a local instance of REDCap, interpretations of
neuroimaging results, laboratory observations, plus sleep,
fitness and other biometric data. All of the patient data
in the CAMH Knowledge Graph are represented using
HL7 FHIR resource types except for the neuroimaging
results data, which are represented in the graph using
the Neuroimaging Data Model ontology (NIDM).
The CAMH Knowledge Graph takes a significant step
towards providing a self-serve data platform for mental
health researchers [16].
Preliminary work tooutline theinformation space
ofinterest
We began by examining information needed to trace
pure functions and their use to enrich patient data in
knowledge graphs. is work included several rounds of
data resource modeling.
We started by outlining the information space of interest
and developing our own "homegrown" RDF data resource
models based on earlier releases of KOIO (1.0 and 2.0) and
on our previous work to identify 13 categories of metadata
relevant for describing pure functions [22].
At the outset, to guide and document our iterative data
resource modeling efforts, we collaboratively developed a
set of competency questions. Our intent was that the data
resources for tracing and documenting pure functions
and their use would contain answers for each competency
question. We borrowed the method of using competency
questions from the field of ontology development, where
Competency Question-driven Ontology Authoring has
been previously described [28]. As we proceeded in this
work, our data resource modeling drew on concepts
and relationships from two other relevant ontologies:
the Function Ontology (FnO) [29] and the Provenance
Ontology (PROV-O) [30].
As a proof of concept using our own homegrown data
resource models, we manually created test instances of
RDF data resources serialized in JSON-LD. These test
data resources were loaded into a Knowledge Graph
development environment. Once loaded, we used
SPARQL queries to produce answers to the competency
questions we developed.
Adoption ofHL7 FHIR RDF data resource types
After outlining the information space of interest, we
learned that the HL7 and W3C Semantic Web Health
Care and Life Sciences communities had, as part of
the 5th Release of FHIR, developed common, openly
available semantic RDF data models for each FHIR data
resource type [31]. We used our competency questions
to determine whether the combined content of the
HL7 FHIR RDF Library, Provenance, and Observation
data resource types was sufficient for our purposes
[31]. Finding that the content of these three HL7 FHIR
RDF data resource types was sufficient, we set aside our
homegrown data resource models. Adopting HL7 FHIR
RDF data resource types provided an opportunity to
demonstrate that RDF generated by a semantic patient
data enrichment ETL pipeline can conform to an openly
available data resource standard for describing pure
functions and computations resulting from using them.
HL7 FHIR RDF data resource validation using ShEx.js
Adopting HL7 FHIR RDF allowed us to use Shape
Expressions (ShEx) to programmatically validate
instances of FHIR RDF data resources [32, 33]. This
validation focuses on the structure that HL7 resources
must have. With technical support from the HL7 FHIR
RDF community, we stood up a virtual server, loaded
it with the ShEx.js tool [33], and performed validation
tests on the FHIR RDF Observation and Provenance data
resource types produced by the ETL pipeline. (The FHIR
RDF Library data resource type is not produced by the
ETL pipeline and was not validated.) Validation using
ShEx.js initially indicated the presence of several errors.
After correcting these errors, additional validation
tests were done until conformance with the HL7 FHIR
RDF standard was achieved. The ETL pipeline software
was then developed to produce conformant HL7 FHIR
RDF Observation and Provenance resources. More
information and relevant files can be found at:
https://github.com/kgrid/fhir-rdf-validation.
Example pure function used
As an example drawn from the domain of mental health
for this study, KSL team members implemented a pure
function in Python that computes a total Patient Health
Questionnaire (PHQ-9) score and its standard
interpretation from the answers to the questionnaire
provided by individual patients [34].
PHQ-9 scores are a common measure used to screen for
severity of depressive symptoms [35].
The PHQ-9 pure function exists outside of the CAMH
knowledge graph where it can also be accessed by other
applications through a corresponding RESTful API
service. The pure function accepts the individual numeric
results from the nine items comprising the PHQ-9 as its
input parameters. It simply computes the total PHQ-9
score and provides an interpretation of that total score.
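A function of this kind can be sketched in a few lines of Python. This is not the KSL implementation: the function name, the input checks, and the severity labels are assumptions, with the cut-offs taken from the commonly published PHQ-9 severity bands (0–4, 5–9, 10–14, 15–19, 20–27). The deployed function may label its bands differently.

```python
from typing import Sequence, Tuple

# Assumed severity bands as (inclusive upper bound, label) pairs.
SEVERITY_BANDS = (
    (4, "Minimal"),
    (9, "Mild"),
    (14, "Moderate"),
    (19, "Moderately Severe"),
    (27, "Severe"),
)


def score_phq9(answers: Sequence[int]) -> Tuple[int, str]:
    """Return (total score, interpretation) for nine item answers of 0-3."""
    if len(answers) != 9:
        raise ValueError("PHQ-9 requires exactly nine item answers")
    if any(a not in (0, 1, 2, 3) for a in answers):
        raise ValueError("each PHQ-9 answer must be 0, 1, 2 or 3")
    total = sum(answers)
    for upper_bound, label in SEVERITY_BANDS:
        if total <= upper_bound:
            return total, label
    raise AssertionError("unreachable: total is always between 0 and 27")


total, interpretation = score_phq9([1, 1, 1, 1, 1, 1, 1, 1, 1])
```

Because the result depends only on the nine answers, the same function can sit behind an API and be called by any application that can supply them.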
A file with technical metadata about the origin and
characteristics of the PHQ-9 pure function was also
developed. ese metadata were later reformatted to
conform to the HL7 FHIR RDF Library data resource
type specification for describing software libraries but
not validated using ShEx.js. At CAMH, a software script
was developed to load HL7 FHIR RDF Library metadata
from the Knowledge Object into the CAMH Knowledge
Graph. is script exists outside of the ETL pipeline
because it only needs to run once to load and record
metadata about each pure function. is script can be
run on an ad-hoc basis whenever a new pure function is
to be used for semantic data enrichment at CAMH.
Development oftheETL pipeline using Apache NiFi
andpython code
To establish the ETL pipeline and complete the technical
work for this project, several existing technologies were
used. When building the ETL pipeline, Apache NiFi [36]
was leveraged for its ability to automate the flow of data
between existing software systems. CAMH already used
Apache NiFi to routinely check REDCap for new PHQ-9
responses. At CAMH, whenever new PHQ-9 responses
are detected, Apache NiFi inserts them into the CAMH
knowledge graph as a PHQ-9 response data object. We
used Apache NiFi as a tool for implementing the ETL
pipeline.
In our case, the CAMH Knowledge Graph emits
Server-Sent Events (SSEs) when data is inserted, updated,
or deleted. In our ETL pipeline implementation, Apache
NiFi was configured to monitor these SSEs specifically
for PHQ-9 responses being added to the graph. Upon
detecting such an event, Apache NiFi first retrieves new
PHQ-9 responses for a patient from the graph and then
transmits these responses via an API request to the pure
function for computation.
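The listening step can be illustrated with a small Server-Sent Events parser and filter written in plain Python. This is only a sketch of the idea, not the Apache NiFi configuration used at CAMH: the event name "Created" and the "questionnaire" payload field are hypothetical, since the event format emitted by the knowledge graph is not described here.

```python
import json
from typing import Dict, Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Group 'field: value' lines into events; a blank line ends an event."""
    event: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            if event:
                yield event
                event = {}
            continue
        if line.startswith(":"):  # comment or keep-alive line
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "data" and "data" in event:
            event["data"] += "\n" + value  # multi-line data is joined
        else:
            event[field] = value


def is_new_phq9_response(event: Dict[str, str]) -> bool:
    """Hypothetical filter: a creation event whose payload is a PHQ-9 response."""
    if event.get("event") != "Created":
        return False
    payload = json.loads(event.get("data", "{}"))
    return payload.get("questionnaire") == "PHQ-9"


# Hypothetical stream content used only to exercise the parser.
SAMPLE_STREAM = [
    ": keep-alive\n",
    "event: Created\n",
    'data: {"id": "resp-1", "questionnaire": "PHQ-9"}\n',
    "\n",
    "event: Updated\n",
    'data: {"id": "resp-0", "questionnaire": "PHQ-9"}\n',
    "\n",
    "event: Created\n",
    'data: {"id": "img-7", "questionnaire": null}\n',
    "\n",
]

triggers = [
    json.loads(e["data"])["id"]
    for e in parse_sse(SAMPLE_STREAM)
    if is_new_phq9_response(e)
]
```

Only the first sample event passes the filter, so only that response would be fetched from the graph and sent to the pure function.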
Each time new PHQ-9 scoring and interpretation
computations are made, the ETL pipeline then produces
a single new conformant HL7 FHIR RDF Observation
resource representing the new computations and
a corresponding single new conformant HL7 FHIR RDF
Provenance resource with semantic information describing
how each computation came about and linking to the
specifics of the pure function used to compute it.
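The pairing of one Observation with one Provenance resource can be sketched as a pure Python function that builds both from the computed values. This is a simplified illustration using FHIR-style JSON dictionaries rather than the FHIR RDF the pipeline actually emits. The identifiers, the choice of fields, and the way the Library is linked are assumptions, and the output has not been validated against any FHIR schema.

```python
from datetime import datetime, timezone
from typing import Tuple


def build_resources(
    obs_id: str,
    patient_ref: str,
    response_ref: str,
    library_ref: str,
    total: int,
    interpretation: str,
    recorded: datetime,
) -> Tuple[dict, dict]:
    """Build a simplified Observation and a Provenance that points at it."""
    if recorded.tzinfo is None:
        raise ValueError("recorded must be timezone-aware")
    stamp = recorded.astimezone(timezone.utc).isoformat()
    observation = {
        "resourceType": "Observation",
        "id": obs_id,
        "status": "final",
        "code": {"text": "PHQ-9 total score"},
        "subject": {"reference": patient_ref},
        "effectiveDateTime": stamp,
        "valueInteger": total,
        "interpretation": [{"text": interpretation}],
        "derivedFrom": [{"reference": response_ref}],
    }
    provenance = {
        "resourceType": "Provenance",
        "target": [{"reference": f"Observation/{obs_id}"}],
        "recorded": stamp,
        # Assumed agent: the pipeline itself, named by display text only.
        "agent": [{"who": {"display": "PHQ-9 enrichment ETL pipeline"}}],
        # Assumed linkage: inputs and the function's Library as source entities.
        "entity": [
            {"role": "source", "what": {"reference": response_ref}},
            {"role": "source", "what": {"reference": library_ref}},
        ],
    }
    return observation, provenance


observation, provenance = build_resources(
    obs_id="obs-1",
    patient_ref="Patient/example",
    response_ref="QuestionnaireResponse/example",
    library_ref="Library/phq9-example",
    total=9,
    interpretation="Mild",
    recorded=datetime(2022, 3, 3, 21, 30, tzinfo=timezone.utc),
)
```

Passing the timestamp in as an argument, rather than reading the clock inside the function, keeps the builder itself deterministic and easy to test.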
Testing thedata enrichment ETL pipeline
Manual tests were performed at each step in the pipeline
to confirm that the pipeline functioned properly. Because
Blue Brain Nexus will accept essentially any RDF, to
confirm that the data was inserted correctly with the
expected RDF data structures, SPARQL queries were
performed. Blue Brain Nexus supports conformant
SPARQL queries and provides its users with immediate
query results.
Results
To address Research Question 1, we found the essential
information required to effectively trace and document
semantic data enrichment processes involving pure
functions could be represented in two ways. First, it can
be represented as a list of competency questions (Table 1),
where each question is associated with one or more target
stakeholder group(s) thought to be most interested in it.
Second, we found that a combination of three HL7 FHIR
RDF data resource types (Library, Observation, and
Provenance) included the answers to the 13 competency
questions in Table 1. A mapping between the 13 competency
questions and elements of the Library, Observation, and
Provenance data resource types is provided in the far
right column of Table 1.
To address Research Question 2, we found a number
of components were needed to establish an ETL pipeline
for semantic data enrichment. These components are
depicted in Fig. 2.
Starting on the left of Fig. 2 above, existing data
sources send data to the CAMH Knowledge Graph. In
2019, CAMH initially created this Knowledge Graph by
deploying an instance of the Blue Brain Nexus knowledge
graph platform in an on-premise Kubernetes cluster.
When the ETL pipeline surrounded by the dotted line is
triggered, it fetches and processes data from the graph.
In general, the technical components needed to support
a semantic data enrichment ETL pipeline like the one
we developed are a knowledge graph, a listener (like
Apache NiFi), the pipeline’s software code (which we
implemented in five stages using Python), and a deployed
instance of one or more executable pure functions (which
we accessed via a local API).
The computational sequence supported by these technical
components is shown from top to bottom in
Fig. 3 immediately below. The sequence starts with a
patient providing responses to the items of the PHQ-9
in REDCap. The Apache NiFi listener checks REDCap
for newly posted patient PHQ-9 survey responses every
10 min. When one or more instances of these responses
are found, they are loaded into the Blue Brain Nexus
Knowledge Graph. Next, Blue Brain Nexus generates a
specific server-sent event message, which is detected by
Apache NiFi, causing it to trigger the ETL pipeline. After
querying data available in the graph, the pipeline’s code,
which is also running in Apache NiFi, posts a request for
a summary PHQ-9 score and its interpretation to the
KO-API. Next, the pipeline’s code generates corresponding
FHIR RDF Observation and Provenance messages
containing, respectively, the new computations and
information about how the new computations were produced.
Finally, the pipeline’s code loads the new FHIR RDF
Observation and Provenance resources into the graph,
thereby enriching it with more data.
To address Research Question 3, we learned the following
about adopting and making any pipeline’s outputs
conform to existing HL7 FHIR RDF data resource
standards. First, we created instances of the Observation
and Provenance resources, serialized them in Turtle, and
tested them for conformance with the HL7 FHIR RDF
standards using the ShEx.js validation tool.
To test these two HL7 FHIR RDF data resource types
using ShEx.js, a common schema detailing all of the
requirements for each type of data resource is required. We
used openly available HL7 FHIR RDF schemas created by
Sharma et al. for this [37]. As an example of schemas like
these, the initial lines of the Observation resource schema
that we used look like this:
Next is a complete Observation resource we created that
passed its ShEx.js validation:
Table 1 List of competency questions. These 13 competency questions developed by the research team provided a necessary outline
of the information space of interest for this project
| # | Competency Question | Target Stakeholder Groups | HL7 FHIR RDF Data Resource Type with this Information |
|---|---|---|---|
| 1 | COMPUTATION TIME & DATE: At what date and time was the patient-level computation made? | Patients, Clinicians, Researchers, IT Professionals | OBSERVATION |
| 2 | PATIENT: Who was the patient-level computation about? | Patients, Clinicians, Researchers | OBSERVATION, PROVENANCE |
| 3 | NAME OR TITLE: Which pure function was used to compute the patient-level computation? | Clinicians, Researchers, IT Professionals | LIBRARY |
| 4 | VERSION: What was the version of the pure function implemented in code that was used to compute the patient-level computation? | Clinicians, Researchers, IT Professionals | LIBRARY |
| 5 | PURE FUNCTION CREATION DATE: When was the pure function’s implementation in code created? | Researchers, IT Professionals | LIBRARY |
| 6 | PURPOSE: What was the purpose of the pure function used to generate the patient-level computation? | Patients, Clinicians, Researchers | LIBRARY |
| 7 | PURE FUNCTION IDENTIFIER: What was the persistent, unique identifier for the pure function and its implementation in code? | IT Professionals | LIBRARY, PROVENANCE |
| 8 | INPUTS: What were the parameters ("inputs") used to complete the patient-level computation? | Patients, Clinicians, Researchers | PROVENANCE |
| 9 | OUTPUTS: What was the computation generated as output using the inputs and the pure function? | Patients, Clinicians, Researchers | OBSERVATION |
| 10 | UNDERLYING EVIDENCE: What was the evidence upon which the pure function is based? | Clinicians, Researchers | LIBRARY |
| 11 | FUNCTION CALL: What was the function call to POST patient data to the API where the function is running? | IT Professionals | PROVENANCE |
| 12 | PURE FUNCTION FILES: What files were used to implement the function and its API? | IT Professionals | LIBRARY |
| 13 | PURE FUNCTION CREATOR: Who created and implemented the pure function in code? | Clinicians, Researchers, IT Professionals | LIBRARY |
We confirmed that the Observation resource above
passed its validation using ShEx.js when we received the
following truncated output from ShEx.js with no "failure
list" of errors. To show the difference, below that is the
truncated output from a failed validation test with
a "FailureList" of errors. The first failure shown in part
below is a "TypeMismatch" for the Observation resource’s
status field, meaning that the status field did not contain
an entry allowed by the HL7 FHIR RDF standard.
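The kind of constraint behind such a status failure can be illustrated with a plain Python check. This is only a rough analogue of one rule and is not how ShEx.js works internally. The allowed codes listed are the Observation status codes defined in FHIR R4, used here as an assumption, and other FHIR releases may define additional codes.

```python
from typing import List

# Observation.status codes from FHIR R4 (assumed; other releases may differ).
ALLOWED_OBSERVATION_STATUS = frozenset(
    {
        "registered",
        "preliminary",
        "final",
        "amended",
        "corrected",
        "cancelled",
        "entered-in-error",
        "unknown",
    }
)


def status_failures(observation: dict) -> List[str]:
    """Return a list of failure messages for the status field (empty if valid)."""
    status = observation.get("status")
    if status is None:
        return ["missing required field: status"]
    if status not in ALLOWED_OBSERVATION_STATUS:
        return [f"status {status!r} is not an allowed code"]
    return []
```

For example, a resource with status "complete" would produce one failure, while the same resource with status "final" would produce none.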
Fig. 2 Components of a Semantic Data Enrichment ETL Pipeline. The dashed line surrounds the actual ETL pipeline. From left to right,
the components start with data sources routinely feeding a knowledge graph with new data. An event listener detects when select new data
reaches the graph and triggers the five stages of the software pipeline to enrich the data in the graph by using a pure function API backed
by an executable implementation of a pure function
We performed iterative validation tests like these for our
Observation and Provenance resource examples until our
examples of both resources passed their ShEx.js validation.
At that point, the code for Stage 4 of the ETL pipeline
(see Fig. 2) was programmed to produce conformant new
instances of the Observation and Provenance data resources.
Next, we learned how to use SPARQL queries of an
enriched knowledge graph like the one below to confirm
that the HL7 FHIR RDF Observation and Provenance
data resources loaded by the ETL pipeline into the CAMH
Knowledge Graph at the final pipeline Stage 5 contain
information sufficient to answer our competency questions.
Fig. 3 Semantic Data Enrichment ETL Pipeline Sequence Diagram. The computational sequence supported by the pipeline is detailed
As shown in Fig.4 further below, since we are querying
a graph the SPARQL queries we developed take advan-
tage of semantic data links between new observation and
provenance resources and existing patient, PHQ-9 ques-
tionnaire response, and (software) Library resources.
When subgraphs like that in Fig.4 are queried using
the SPARQL query above, the CAMH Knowledge Graph
produces rows of query output like the example rows in
Table2 below.
This study addressed three research questions. The
results above list essential information needed to trace
and document semantic data enrichment, show the
components of an ETL pipeline for enriching an RDF data
store, detail what is required for such an ETL pipeline
to be able to produce conformant HL7 FHIR RDF data
resources, and show how to query such data resources
using SPARQL.
Fig. 4 Relationships Between Relevant Data Resources. The ETL pipeline produces the HL7 FHIR RDF Observation and Provenance resources
on the left which are related to the Patient, Questionnaire Response, and (software) Library resources on the right as shown
Table 2 Two rows of a report showing data of interest about the computation of a derived patient feature. Each row shows data
for one simulated patient including the numeric output from the pure function in the "phq9_total_score" column, the computed
interpretation of the numeric output in the "phq9_total_score_interpretation" column, the date and time when the computations
of response and interpretation were generated (prov_recorded_dt) and the version of the pure function used to compute them
(ko_version)

| patient_id | phq9_total_score | phq9_total_score_interpretation | prov_recorded_dt | ko_version |
|---|---|---|---|---|
| 123456 (not a real ID) | 18 | Severe | 2022-03-03T21:30:00.000Z | 1.0.0 |
| 654321 (not a real ID) | 22 | Moderately Severe | 2022-03-03T19:30:00.000Z | 1.0.0 |

Discussion
We conducted this technical study to show how biomedical
research organizations like CAMH that already use
knowledge graphs can further enrich their graph data by
computing with pure functions. To a significant degree,
we overcame the three challenges noted above. To
uphold patient data security and integrity, the ETL pipeline
we developed operates entirely inside CAMH’s IT
security zone. Patient data never leaves CAMH. To assist
in managing growing pure function use over time, we
showed how every computation made with a pure function
can be associated with key provenance information,
including information about the implemented version of
any pure function. CAMH values this provenance information
for computations about patients because it enables
their researchers to understand when and how each
computation was produced. Finally, to show how existing
semantic data resource standards can be leveraged
when enriching graph data with pure functions, we
programmed the pipeline to generate valid HL7 FHIR RDF
Observation and Provenance data resources.
For our example pure function in this study, we selected a simple but widely
used scoring and interpretation function relevant to clinical depression. This
function combines patient-reported data from the Patient Health
Questionnaire-9 instrument to generate a summary PHQ-9 score and an
interpretation of that PHQ-9 score [38]. In this example, computing the PHQ-9
summary score and its interpretation in a standard, reliable, and consistent
way enables CAMH to enrich their patient data knowledge graph automatically
every time new PHQ-9 responses are logged.
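To make the idea of a scoring-and-interpretation pure function concrete, the following is a minimal Python sketch. It is not the knowledge object used in the study [34]. The function names are hypothetical, and the severity bands are the commonly cited PHQ-9 bands, assumed here for illustration; the study's own implementation remains the authority for what CAMH computes.

```python
from typing import Sequence

# Assumed severity bands (upper bound of each band, label). These follow the
# commonly cited PHQ-9 bands and are NOT taken from the study's source code.
_ASSUMED_BANDS = (
    (4, "Minimal"),
    (9, "Mild"),
    (14, "Moderate"),
    (19, "Moderately severe"),
    (27, "Severe"),
)


def phq9_total_score(responses: Sequence[int]) -> int:
    """Sum the nine PHQ-9 item responses, each scored 0 to 3."""
    if len(responses) != 9:
        raise ValueError("PHQ-9 needs exactly nine item responses")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each response must be an integer from 0 to 3")
    return sum(responses)


def phq9_interpretation(total: int) -> str:
    """Map a total score (0 to 27) to a severity label."""
    if not 0 <= total <= 27:
        raise ValueError("total score must be between 0 and 27")
    for upper_bound, label in _ASSUMED_BANDS:
        if total <= upper_bound:
            return label
    raise AssertionError("unreachable: bands cover 0 to 27")
```

Both functions depend only on their arguments and have no side effects, so the same responses always yield the same score and label. For example, the responses `[1, 2, 1, 2, 1, 2, 1, 1, 1]` sum to 12, which this sketch labels "Moderate".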
While CAMH’s implementation of the pipeline uses SPARQL queries to fetch PHQ-9
data stored in Blue Brain Nexus, organizations are not required to migrate
their existing data to a knowledge graph platform. The modular design of pure
functions allows the pipeline architecture to be source-agnostic as to its
input parameters. Organizations can query PHQ-9 data directly from their
current systems so long as the query returns the information required for the
pure function signature. However, a key requirement for optimal pipeline
performance is having a versioned data management approach where individual
data points are assigned unique and persistent identifiers as mentioned in the
FAIR principles. This is what enables the pipeline to maintain a provenance
chain that can trace computational results back to their source observations,
regardless of the underlying storage systems.
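The source-agnostic pattern described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline built at CAMH: the row layout (`patient_id`, `items`, `id`, `answer`), the `DerivedFeature` record, and the `enrich` function are all hypothetical names. The sketch shows only that any source can feed the pure function as long as each input carries a persistent identifier that is copied into the provenance of the result.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable, List, Mapping, Sequence, Tuple


@dataclass(frozen=True)
class DerivedFeature:
    """Hypothetical record pairing a computed value with its provenance."""
    patient_id: str
    value: int
    source_ids: Tuple[str, ...]  # persistent identifiers of the inputs used
    function_version: str        # version of the pure function applied
    recorded: str                # ISO 8601 timestamp of the computation


def enrich(
    rows: Iterable[Mapping],
    score: Callable[[Sequence[int]], int],
    function_version: str,
    recorded: datetime,
) -> List[DerivedFeature]:
    """Apply a pure scoring function to rows fetched from any source."""
    results = []
    for row in rows:
        items = row["items"]
        results.append(
            DerivedFeature(
                patient_id=row["patient_id"],
                value=score([item["answer"] for item in items]),
                source_ids=tuple(item["id"] for item in items),
                function_version=function_version,
                recorded=recorded.isoformat(),
            )
        )
    return results


# Made-up input: one simulated patient whose nine answers each carry an
# identifier. The URN scheme is invented for this example.
example_rows = [
    {
        "patient_id": "simulated-001",
        "items": [
            {"id": f"urn:example:answer:{n}", "answer": 1} for n in range(9)
        ],
    }
]
example = enrich(
    example_rows, sum, "1.0.0", datetime(2022, 3, 3, 21, 30, tzinfo=timezone.utc)
)
```

Passing the timestamp in as an argument, rather than reading the clock inside `enrich`, keeps the function deterministic and easy to test. Whether the rows come from a SPARQL query or a relational database does not matter to the function.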
Because the pure function is designed independently of the pipeline, it can be
reused across different applications, pipelines, and computing environments,
provided the input data conforms to the expected function signature. This
reusability ensures computational consistency across systems and facilitates
validation and testing efforts.
Additionally, the semantic data enrichment ETL pipeline we developed can now
be fitted with other locally executable pure functions beyond PHQ-9 scoring.
In biomedicine overall, computing with pure functions is increasingly
important, as hundreds of medical calculators, biomedical equations, scoring
tools, and computable guidelines exist that are widely and frequently used. In
future work, we look forward to fitting semantic data enrichment pipelines
like the one developed for this study with a wide array of biomedical pure
functions relevant to mental health research. Being source-agnostic allows
organizations to leverage their existing data infrastructure while gaining the
provenance benefits demonstrated in our implementation, making the pipeline
broadly applicable across diverse healthcare IT environments for any
computable biomedical function that follows similar architectural patterns.
Several ongoing developments suggest that the number of biomedical pure
functions is likely to increase. The most obvious of these may be the recent
growth in small and large-scale machine learning and deep learning (ML/DL)
models for biomedicine. The medical informatics literature is increasingly
filled with reports of new statistical-mathematical pure functions arising
from ML/DL [5]. At the individual patient level, computations using a growing
number of ML/DL pure functions inform and assist with disease categorization
and subcategorization, outlier detection, outcomes prediction, and many other
tasks. Other work is ongoing to improve widely used biomedical equations for
estimating patient-level features, including physiological functioning and
health risks. As more and higher quality patient data become available [39],
and as the use of race as a proxy for genetics gets set aside [40], widely
used biomedical equations get reworked. As examples, consider how once widely
used equations for predicting kidney function (e.g., the Cockcroft-Gault
equation) and for estimating cardiovascular disease risk (e.g., Framingham
Risk Score) have been replaced by newer equations. As these changes occur,
being able to trace how computations were made enables organizations to
distinguish computations made with older and newer versions of the pure
functions they use. For clinicians and patients, being able to trust that
computations have been properly made is critical. For researchers,
understanding how individual patient-level computations come about is
necessary for reproducing data analyses. For risk managers, tracing the
origins of select high stakes biomedical computations may be desirable.
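As a small illustration of why recording the function version matters, the Python sketch below separates computation records by the version that produced them. The field names follow the report columns in Table 2 (patient_id, phq9_total_score, ko_version). The records, the version "2.0.0", and both function names are invented for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def group_by_function_version(
    records: Iterable[Mapping[str, object]],
) -> Dict[str, List[Mapping[str, object]]]:
    """Group computation records by the pure function version recorded."""
    groups = defaultdict(list)
    for record in records:
        groups[str(record["ko_version"])].append(record)
    return dict(groups)


def computed_with_other_version(
    records: Iterable[Mapping[str, object]], current_version: str
) -> List[Mapping[str, object]]:
    """Return records whose version differs from the one now in use."""
    return [r for r in records if r["ko_version"] != current_version]


# Illustrative records only; identifiers, scores, and versions are made up.
records = [
    {"patient_id": "A", "phq9_total_score": 7, "ko_version": "1.0.0"},
    {"patient_id": "B", "phq9_total_score": 12, "ko_version": "1.0.0"},
    {"patient_id": "C", "phq9_total_score": 3, "ko_version": "2.0.0"},
]
by_version = group_by_function_version(records)
stale = computed_with_other_version(records, "2.0.0")
```

Here `stale` holds the two records produced by version 1.0.0, which an organization might flag for recomputation or analyze separately. The comparison is deliberately a plain inequality; ordering versions would need an agreed versioning scheme that this sketch does not assume.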
It is also true that biomedical pure functions can be used in combination to
arrive at "compound computations." For example, estimates of kidney function
may serve as inputs to computable phenotype functions that determine the
stages of chronic kidney disease. It is easy to imagine more elaborate
examples where many pure functions are used in conjunction to compute new
patient-level features. In one of our past projects, more than 40 pure
functions were used to compute a battery of estimates of the risks and
benefits of preventive medical services [21]. In clinical medicine, now is the
time to develop reliable procedures for producing and reporting singular and
compound computations arising from the use of pure functions. This work shows
how knowledge graphs and related semantic web technology can assist in this
regard.
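The kidney example above can be sketched as a composition of two pure functions in Python. This is an illustration of composition only and is not clinical guidance. The function names are hypothetical. The first function uses the Cockcroft-Gault formula only because the text names it; that formula estimates creatinine clearance in mL/min and, as noted above, has been superseded. The second function assigns the widely used GFR categories G1 to G5, whereas staging chronic kidney disease in practice relies on a current eGFR equation and further clinical criteria.

```python
def cockcroft_gault(
    age_years: float,
    weight_kg: float,
    serum_creatinine_mg_dl: float,
    female: bool,
) -> float:
    """Estimate creatinine clearance (mL/min) with Cockcroft-Gault."""
    if serum_creatinine_mg_dl <= 0:
        raise ValueError("serum creatinine must be positive")
    clearance = ((140 - age_years) * weight_kg) / (72 * serum_creatinine_mg_dl)
    return clearance * 0.85 if female else clearance


def gfr_category(estimate: float) -> str:
    """Map a kidney function estimate to a GFR category label."""
    if estimate >= 90:
        return "G1"
    if estimate >= 60:
        return "G2"
    if estimate >= 45:
        return "G3a"
    if estimate >= 30:
        return "G3b"
    if estimate >= 15:
        return "G4"
    return "G5"


def kidney_category(
    age_years: float,
    weight_kg: float,
    serum_creatinine_mg_dl: float,
    female: bool,
) -> str:
    """Compound computation: the output of one pure function feeds the next."""
    return gfr_category(
        cockcroft_gault(age_years, weight_kg, serum_creatinine_mg_dl, female)
    )
```

Because each step is pure, the compound result is reproducible from the original inputs. A provenance record for the compound computation would need to name the version of each function in the chain, not just the last one.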
This work began as an information modeling exercise. As the modeling work
unfolded, it became apparent that our "homegrown" models overlapped
considerably in their content with existing data resource models from the HL7
FHIR community. Adopting HL7 FHIR’s RDF data resource models brought several
advantages. First, relinquishing "homegrown" information models makes this
work applicable to a wider audience. Second, adopting HL7 FHIR RDF models made
it possible to complete a demonstration of automated data resource conformance
testing and validation using the newly created ShEx toolkit. This obviated any
need to develop a unique data conformance testing and validation mechanism of
our own. Third, showing how HL7 FHIR RDF can be used in a practical
application like our ETL pipeline may help others to do similar things.
The focus of this work is exclusively on semantic data enrichment using
domain-specific pure functions implemented external to knowledge graphs. This
kind of data enrichment is quite different from using logical reasoners to
produce new inferences from existing graph data. A critical limitation of this
work is that it did not implement the example pure function or its separate
API using semantic technologies. We look forward to future work along these
lines where we describe the inputs and outputs of pure functions using
semantic data and deploy pure functions using semantic API technology.
This work is further limited in the following ways. No attempt was made to
develop the ETL pipeline using other tooling even though Jupyter Notebooks and
other available tools could be used to build similar pipelines. Only one pure
function implementation was used, and the data enrichment ETL pipeline that
resulted is likely over-fit to the example PHQ-9 pure function.
We also encountered limitations related to FHIR RDF. A limitation in the FHIR
RDF R5 vocabulary is that not all of the FHIR RDF R5 URIs resolve to a web
page. This makes it harder to understand the meaning of some FHIR RDF R5 data.
Through our discussions with the FHIR RDF R5 implementation team, we
understand work to address this issue is ongoing. We have provided some links
to the actual definitions from the FHIR vocabulary for terms used in the above
query:
One other thing we noted is that in FHIR RDF’s R5
release, the use of the property fhir:value was changed
to fhir:v for primitive values to avoid any conflicts with
other uses of fhir:value throughout the FHIR standard.
Conclusions
Using open source technologies, an existing patient knowledge graph, and HL7
FHIR RDF Library, Observation, and Provenance linked data resources, a
software pipeline was built to enrich a patient-oriented knowledge graph with
computations arising from using a pure function. This approach to data
enrichment is an example of bringing "compute" to a knowledge graph
environment using widely available technologies and in conformance with an
emerging data standard (HL7 FHIR RDF). This work enables secure and traceable
computation for healthcare and biomedical research.
Abbreviations
ETL Extract, Transform, and Load
FHIR Fast Healthcare Interoperability Resources
HL7 Health Level 7
PHQ-9 Patient Health Questionnaire 9
RDF Resource Description Framework
Acknowledgements
The authors would like to thank David Booth, Eric Prud’hommeaux and the
HL7 FHIR RDF for Semantic Interoperability Working Group for their interest
and willingness to share information about the emerging HL7 FHIR RDF
standard. This project also benefited from early information modelling efforts
that included Robert Xiao during his term at CAMH.
Authors’ contributions
A.A. and A.F. conceived of this study. A.A., A.F., and A.P. did the information
space modeling work. A.P. completed the ShEx validation work with help from
A.F. A.A. developed and tested the ETL data enrichment pipeline at CAMH.
M.C. reviewed and commented on all results. A.A., A.F., A.P., and M.C. wrote and
edited the manuscript.
Funding
No grant funding was used to support this work.
Data availability
The data and software supporting the conclusions of this article are included
within the article and in additional files to which links are provided.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Received: 16 December 2024 Accepted: 1 July 2025
References
1. Bartosz Milewski. Chapter 3 "Pure Functions, Laziness, I/O, and Monads" in "Basics of Haskell". School of Haskell. 2013. Online at: https://www.schoolofhaskell.com/school/starting-with-haskell/basics-of-haskell/3-pure-functions-laziness-io. Retrieved 2024-06-19.
2. Spitzer RL, Kroenke K, Williams JB, Löwe B. A brief measure for assessing generalized anxiety disorder: the GAD-7. Arch Intern Med. 2006;166(10):1092–7.
3. Lu T, Silveira PP, Greenwood CM. Development of risk prediction models for depression combining genetic and early life risk factors. Front Neurosci. 2023;18(17):1143496.
4. Su C, Xu Z, Pathak J, Wang F. Deep learning in mental health outcome research: a scoping review. Transl Psychiatry. 2020;10(1):116.
5. Federico CA, Trotsyuk AA. Biomedical data science, artificial intelligence, and ethics: navigating challenges in the face of explosive growth. Ann Rev Biomed Data Sci. 2024;10:7.
6. Shivade C, Raghavan P, Fosler-Lussier E, Embi PJ, Elhadad N, Johnson SB, Lai AM. A review of approaches to identifying patient phenotype cohorts using electronic health records. J Am Med Inform Assoc. 2014;21(2):221–30.
7. Miotto R, Li L, Kidd BA, Dudley JT. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci Rep. 2016;6(1):1.
8. Braunstein ML. Health Informatics on FHIR: How HL7’s API is transforming healthcare. Springer; 2022. https://link.springer.com/book/10.1007/978-3-030-91563-6#bibliographic-information.
9. Duda SN, Kennedy N, Conway D, Cheng AC, Nguyen V, Zayas-Cabán T, Harris PA. HL7 FHIR-based tools and initiatives to support clinical research: a scoping review. J Am Med Inform Assoc. 2022;29(9):1642–53.
10. Resource Description Framework, W3C At: https://www.w3.org/RDF/. Accessed 20 June 2024.
11. Ajileye T, Motik B. Materialisation and data partitioning algorithms for distributed RDF systems. J Web Semantics. 2022;1(73):100711.
12. JSON for Linking Data. At: https://json-ld.org/. Accessed 20 June 2024.
13. Quilitz B, Leser U. Querying distributed RDF data sources with SPARQL. In The Semantic Web: Research and Applications: 5th European Semantic Web Conference, ESWC 2008, Tenerife, Canary Islands, Spain, June 1-5, 2008 Proceedings 5 2008 (pp. 524-538). Springer Berlin Heidelberg.
14. Mishra RB, Kumar S. Semantic web reasoners and languages. Artif Intell Rev. 2011;35:339–68.
15. Sy MF, Roman B, Kerrien S, Mendez DM, Genet H, Wajerowicz W, Dupont M, Lavriushev I, Machon J, Pirman K, Neela MD. Blue brain nexus: an open, secure, scalable system for knowledge graph management and data-driven science. Semantic Web. 2023;14(4):697–727.
16. Rotenberg DJ, Chang Q, Potapova N, Wang A, Hon M, Sanches M, Bogetic N, Frias N, Liu T, Behan B, El-Badrawi R. The CAMH neuroinformatics platform: a hospital-focused Brain-CODE implementation. Front Neuroinform. 2018;6(12):77.
17. Jeelani OF, Njie M, M Korzhuk V. Methods and algorithms of ensuring data privacy in AI-based healthcare systems and technologies. In Conference Proceedings, Paris France April 2024 Apr 11 (Vol. 11, p. 12).
18. Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3(1): 160018.
19. Barker M, Chue Hong NP, Katz DS, Lamprecht AL, Martinez-Ortiz C, Psomopoulos F, et al. Introducing the FAIR principles for research software. Sci Data. 2022;9(1): 622.
20. Knowledge Systems Lab. KOIO - The knowledge object implementation ontology. GitHub - kgrid/koio: Available from: https://github.com/kgrid/koio. Cited 2025 Apr 14.
21. Flynn A, Taksler G, Caverly T, Beck A, Boisvert P, Boonstra P, et al. CBK model composition using paired web services and executable functions: a demonstration for individualizing preventive services. Learn Health Syst. 2023;7(2): e10325.
22. Alper BS, Flynn A, Bray BE, Conte ML, Eldredge C, Gold S, et al. Categorizing metadata to help mobilize computable biomedical knowledge. Learning Health Systems. 2022;6(n/a): e10271.
23. Flynn A, Conte M, Boisvert P, Richesson R, Landis-Lewis Z, Friedman C. Linked metadata for FAIR digital objects carrying computable knowledge. Res Ideas Outcomes. 2022;10(8):e94438.
24. McCusker J, McIntosh LD, Shaffer C, Boisvert P, Ryan J, Navale V, Topaloglu U, Richesson RL. Guiding principles for technical infrastructure to support computable biomedical knowledge. Learn Health Syst. 2023;7(3): e10352.
25. Knowledge Systems Lab. BMI Calculator Knowledge Object. Available from: https://github.com/kgrid/koio/tree/master/examples/bmi_calculator_v_3. Cited 2025 Apr 14.
26. Prud’hommeaux E, Collins J, Booth D, Peterson KJ, Solbrig HR, Jiang G. Development of a FHIR RDF data transformation and validation framework and its evaluation. J Biomed Inform. 2021;1(117):103755.
27. HL7 FHIR RDF Representation. Accessed at: https://hl7.org/fhir/rdf.html December 5, 2024.
28. Ren Y, Parvizi A, Mellish C, Pan JZ, Van Deemter K, Stevens R. Towards competency question-driven ontology authoring. In The Semantic Web: Trends and Challenges: 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25–29, 2014. Proceedings 11 2014 (pp. 752–767). Springer International Publishing.
29. IDLab. The Function Ontology. The function ontology. Available from: https://fno.io/. Cited 2025 April 14.
30. PROV-O: The PROV Ontology. Available from: https://www.w3.org/TR/prov-o/. Cited 2025 Apr 14.
31. Health Level 7. Resource description framework representation. Available from: https://www.hl7.org/fhir/rdf.html. Cited 2025 April 14.
32. Solbrig HR, Prud’hommeaux E, Grieve G, McKenzie L, Mandel JC, Sharma DK, Jiang G. Modeling and validating HL7 FHIR profiles using semantic web shape expressions (ShEx). J Biomed Inform. 2017;67(1):90–100.
33. Prud’hommeaux E. ShEx.js. Available from: https://github.com/shexjs/shex.js. Cited 2024 Dec 5.
34. Knowledge Systems Lab. PHQ9 Knowledge Object. Available from: https://github.com/kgrid-objects/CAMH/blob/main/collection/99999-CAMH2-v1.0/PHQ9_algorithm.js. Cited 2025 April 14.
35. Kroenke K, Spitzer RL, Williams JBW. The PHQ-9. J Gen Intern Med. 2001;16(9):606–13.
36. Apache Software Foundation. Apache NiFi. Apache NiFi. Available from: https://nifi.apache.org/. Cited 2025 Apr 14.
37. Sharma DK, Prud’hommeaux E, Booth D, Nanjo C, Jiang G. Shape expressions (ShEx) schemas for the FHIR R5 specification. J Biomed Inform. 2023;148(Dec): 104534.
38. Kroenke K, Spitzer RL, Williams JB. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. 2001;16(9):606–13.
39. International Data Corporation. 2018. Digitization of the world from edge to core. Available from: https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf. Cited 2025 Apr 14.
40. National Academies of Sciences and Medicine. Using population descriptors in genetics and genomics research: a new framework for an evolving field. Washington, DC: The National Academies Press; 2023. Available from: https://nap.nationalacademies.org/catalog/26902/using-population-descriptors-in-genetics-and-genomics-research-a-new.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.