
Ansarietal. Journal of Biomedical Semantics (2025) 16:16

https://doi.org/10.1186/s13326-025-00335-4

RESEARCH Open Access

International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long

as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if

you modiﬁed the licensed material. You do not have permission under this licence to share adapted material derived from this article or

parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated

otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not

permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To

view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


A prototype ETL pipeline thatuses HL7 FHIR

RDF resources whendeploying pure functions

toenrich knowledge graph patient data

Adeel Ansari1*, Marisa Conte2, Allen Flynn2 and Avanti Paturkar2

Abstract

Background For clinical care and research, knowledge graphs with patient data can be enriched by extracting parameters from a knowledge graph and then using them as inputs to compute new patient features with pure functions. Systematic and transparent methods for enriching knowledge graphs with newly computed patient features are of interest. When enriching the patient data in knowledge graphs this way, existing ontologies and well-known data resource standards can help promote semantic interoperability.

Results We developed and tested a new data processing pipeline for extracting, computing, and returning newly computed results to a large knowledge graph populated with electronic health record and patient survey data. We show that RDF data resource types already specified by Health Level 7’s FHIR RDF effort can be programmatically validated and then used by this new data processing pipeline to represent newly derived patient-level features.

Conclusions Knowledge graph technology can be augmented with standards-based semantic data processing pipelines for deploying and tracing the use of pure functions to derive new patient-level features from existing data. Semantic data processing pipelines enable research enterprises to report on new patient-level computations of interest with linked metadata that details the origin and background of every new computation.

Keywords Linked data, FHIR, Fast health interoperability resources, Pure functions, Data enrichment, Knowledge

graphs, ShEx validation, FAIR, ETL

*Correspondence:
Adeel Ansari
adeel.ansari@camh.ca
1 Krembil Centre for Neuroinformatics, Centre for Addiction and Mental Health, 250 College St., Toronto, ON M5T 1R8, Canada
2 Department of Learning Health Sciences, Knowledge Systems Laboratory, University of Michigan Medical School, Victor Vaughan Building, Room 209, 1111 Catherine St., Ann Arbor, MI 48109, USA

Background

In the domains of biomedicine and health, an increasing number of machine-executable pure functions are used for making computations. A pure function is a stateless deterministic mapping that always returns the same computed value (or result) when given identical inputs. A simple example is the Body Mass Index (BMI) function, which has these two common forms:

$$\mathrm{BMI} = \frac{\mathrm{weight\,(kg)}}{\mathrm{height\,(m)}^{2}}$$

$$\mathrm{BMI} = 703 \times \frac{\mathrm{weight\,(lbs)}}{\mathrm{height\,(inches)}^{2}}$$

The inputs needed to compute with a pure function, like weight and height for the BMI function, are called "arguments" or "parameters" [1]. What makes pure functions "pure" is that, when implemented in software code, they exclusively map inputs to outputs and have no other software effects [1].

Examples of other pure functions include functions for computing numerical scores from questionnaire data, as with the Generalized Anxiety Disorder 7-item Scale [2], mathematical regression functions for estimating risks [3], and some functions arising from machine learning [4]. This paper explores how to leverage semantic web technologies to track and organize the use of biomedical pure functions to enrich existing patient data knowledge graphs.
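To make the notion of purity concrete, the two common forms of the BMI function can be sketched in Python as follows. This is an illustrative sketch only: the function names and the input checks are our assumptions and are not taken from any implementation discussed in this paper.

```python
def bmi_metric(weight_kg: float, height_m: float) -> float:
    """BMI from weight in kilograms and height in metres.

    Pure: the result depends only on the two arguments, and nothing
    outside the function is read or modified.
    """
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_imperial(weight_lbs: float, height_in: float) -> float:
    """BMI from weight in pounds and height in inches (conversion factor 703)."""
    if weight_lbs <= 0 or height_in <= 0:
        raise ValueError("weight and height must be positive")
    return 703 * weight_lbs / height_in ** 2


if __name__ == "__main__":
    # Identical inputs always give identical outputs.
    print(round(bmi_metric(70.0, 1.75), 1))
    print(round(bmi_imperial(154.0, 69.0), 1))
```

Because neither function reads or writes any state, a given pair of inputs can be recomputed at any later time and will yield the same value, which is what makes provenance records for such computations meaningful.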

In biomedicine, the use of machine learning methods

to generate mathematical-statistical functions is

increasing as is the creation of functions comprised

of production rules with conditional"IF–THEN"logic.

As a consequence, the overall number of biomedical

pure functions is growing rapidly [5]. Many of these

biomedical pure functions are used to compute patient-

level properties (also called patient features) from

existing patient data.

Generally speaking, data enrichment is adding information to an existing dataset. The focus of this paper is specifically on data enrichment that arises from the use of parameters available in a knowledge graph to produce more and different data via new computations. In this case, enriching a knowledge graph comes about by first extracting input parameters to pure functions from a graph, then computing new features using pure functions outside of the graph environment, then returning those newly computed features as additional semantic data back to the knowledge graph from which the input parameters came.
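The extract, compute, and return pattern described above can be sketched with a toy in-memory graph. In this minimal sketch, triples are plain Python tuples and the prefixed names (`ex:patient1`, `ex:weightKg`, `ex:heightM`, `ex:bmi`) are hypothetical placeholders; a real deployment would query an RDF store instead.

```python
from typing import Callable, Iterable, Sequence

# A triple is (subject, predicate, object); this stands in for an RDF store.
Triple = tuple[str, str, object]


def extract_parameters(
    graph: Iterable[Triple], subject: str, predicates: Sequence[str]
) -> dict[str, object]:
    """Step 1: pull the requested input parameters for one subject from the graph."""
    wanted = set(predicates)
    return {p: o for s, p, o in graph if s == subject and p in wanted}


def enrich(
    graph: set[Triple],
    subject: str,
    predicates: Sequence[str],
    function: Callable[..., object],
    new_predicate: str,
) -> set[Triple]:
    """Steps 2 and 3: compute outside the graph, then return a graph with the new triple."""
    params = extract_parameters(graph, subject, predicates)
    missing = [p for p in predicates if p not in params]
    if missing:
        raise KeyError(f"missing input parameters: {missing}")
    value = function(*(params[p] for p in predicates))
    # The original graph is left untouched; a new, enriched set is returned.
    return graph | {(subject, new_predicate, value)}


if __name__ == "__main__":
    graph = {
        ("ex:patient1", "ex:weightKg", 70.0),
        ("ex:patient1", "ex:heightM", 1.75),
    }
    enriched = enrich(
        graph,
        "ex:patient1",
        ["ex:weightKg", "ex:heightM"],
        lambda w, h: round(w / h ** 2, 1),
        "ex:bmi",
    )
    for triple in sorted(enriched, key=str):
        print(triple)
```

Keeping the computation in a separate function, rather than inside the graph, mirrors the separation between the knowledge graph and the pure function described in the text.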

Datasets originating from Electronic Health Record (EHR) databases or patient data registries can be enriched by adding newly computed patient-level features [6, 7]. Researchers typically use these added patient-level features to stratify patient records, explore hypothetical cause and effect relationships, and answer a wide range of biomedical research questions. To better support this kind of patient dataset enrichment, this study explores how to take advantage of linked data, knowledge graphs, and existing Health Level 7 Fast Healthcare Interoperability Resources (HL7 FHIR) data resource types to represent some of the semantics of pure functions and the patient-level computations they enable [8, 9].

Regarding linked data, the Resource Description Framework (RDF) specifies a format where any domain of interest can be represented using a pattern of triples, where each triple relates a subject resource to an object resource via a predicate relationship [10, 11]. The linked data in any RDF dataset can be serialized in a variety of formats and visualized as a directed knowledge graph [12]. RDF knowledge graphs can be queried using SPARQL [13]. These knowledge graphs can also be used to compute novel inferences using various logical reasoners [14].

Some biomedical research and healthcare provider

organizations have begun converting their patient-level

electronic health record (EHR) data from original

proprietary data formats into RDF triples for knowledge

graphs. As an example of this, the Centre for Addiction

and Mental Health (CAMH) in Toronto, Ontario, Canada

routinely converts its EHR data into RDF triples and

then adds these triples to a patient-oriented knowledge

graph built using the open source Blue Brain Nexus

platform [15]. CAMH is Canada’s largest mental health

teaching hospital. Within CAMH, computational models

are used to understand and help treat mental illnesses.

To support this research and accelerate discovery using

knowledge graph technology, CAMH integrates vast

amounts of semantic data from genetics, brain imaging,

and EHR records [16]. The EHR data in the CAMH knowledge graph currently represents patient data using a variety of HL7 FHIR RDF data resource types. Additional data models used in the CAMH graph include the Neuroimaging Data Model (NIDM), the Provenance Ontology (PROV-O), and various schemas from schema.org [16]. For this study, CAMH wished to build out a computational pipeline for enriching these patient-level data by computing with pure functions.

Several challenges must be met when enriching patient

data in knowledge graphs by computing with pure

functions. A critical challenge is to safeguard patient data

privacy and security. One approach is to keep patient

data within the local IT security zone of the organization

that collects them rather than shuttle them to and from

computing systems outside of the organization [17]. To

keep computation local, instances of biomedical pure

functions must be deployed inside the same technical

environments where patient data reside. This study put technology for computing with pure functions inside CAMH’s technical environment alongside their existing patient-oriented knowledge graph.

In addition to upholding data security, a second challenge is for creators of pure functions to be able to manage the growing number of them used by their organizations. As others note, effective management of pure functions requires metadata sufficient to make pure functions findable, accessible, interoperable and reusable (the FAIR principles) [18, 19]. To help with this, the Knowledge Systems Laboratory (KSL) at the University of Michigan (U-M) has previously created the Knowledge Object Implementation Ontology (KOIO) [20]. We have demonstrated how KOIO can assist developers to produce Knowledge Objects containing software implementations of pure functions with corresponding tests and documentation [21]. KSL currently works with researchers in the global Mobilizing Computable Biomedical Knowledge community (MCBK) on realizing the FAIR principles for pure functions [22–24]. A relevant example of this work is a Knowledge Object containing an implementation of the BMI function with metadata [25]. Its metadata, serialized as Turtle RDF triples on the left of Fig. 1 with a corresponding RDF graph view of the key triples on the right, describe and depict an implementation and software tests for the BMI function. A software implementation of the BMI function is located in a JavaScript file called bmi.js.

A third challenge to enriching knowledge graph data is to represent pure functions, newly computed patient features, and their provenance in a common way. Ideally, all new RDF triples added to the graph would conform to one or more well-known RDF schemas. For this, we ultimately selected the emerging RDF standard coming from the HL7 FHIR community [26]. Unlike KOIO, which we developed but which is not well known, HL7 FHIR RDF provided this project with community-endorsed standard data resource types for representing pure functions and new computations produced by using them [27].

Leveraging our prior work with knowledge graphs

and linked metadata about pure functions, we stood

up a local data processing pipeline for Extracting,

Transforming, and Loading (ETL) standardized HL7

FHIR RDF data resources representing pure functions,

new computations generated from using pure functions,

and the provenance of these computations. We report the

results of a technical feasibility study along these lines.

Research questions

In this study, we sought to do the following:

1. Determine the essential information required to effectively trace and document semantic data enrichment processes involving pure functions.

2. Determine the primary components of ETL pipelines for semantic data enrichment processes involving pure functions.

3. Demonstrate what is required to adopt and conform to existing HL7 FHIR RDF Library, Observation, and Provenance data resource standards in the context of semantic data enrichment processes involving pure functions.

Methods

Project initiation andmanagement

is project was initiated by team members at CAMH

and KSL. e research team includes experienced health

data and computer scientist-engineers plus library

and information scientists familiar with the health

and healthcare domains. Over more than three years,

with some intermittency due to the COVID-19 global

pandemic, the research team met many times via online

video conference to discuss and work on the design,

development, and trial demonstration of the planned ETL

pipeline for enriching CAMH’s patient data knowledge

graph. CAMH team members took the lead on software

development for the ETL pipeline. KSL team members

took the lead on metadata development and HL7 FHIR

RDF Library, Observation, and Provenance data resource

conformance testing. Project documentation was created

and stored either in the Google Docs platform at U-M

or in an instance of Conﬂuence made available to the

project by CAMH.

Fig. 1 Knowledge Object Metadata as RDF Triples. On the left is a Turtle representation of the metadata for a Knowledge Object containing a BMI

function. On the right is a view of the key metadata triples for the BMI Knowledge Object visualized as a graph


Technical methods overview

This study utilized a number of technical methods. First, we leveraged existing knowledge graph technology. Second, we did formal data modeling to determine the information required to trace and document semantic data enrichment. Third, we adopted and validated relevant HL7 FHIR RDF data resource types. Fourth, we developed and packaged an example pure function with a corresponding API mechanism and metadata about the function. Fifth, we developed and tested an ETL pipeline for semantic data enrichment by leveraging the prior four efforts.

Knowledge graph availability viaCAMH

roughout the project, CAMH provided access to

development and production instances of its Knowledge

Graph loaded with patient data. CAMH’s Knowledge

Graph integrates multimodal data including: Electronic

Health Record data, patient questionnaire responses

from a local instance of REDCap, interpretations of

neuroimaging results, laboratory observations, plus sleep,

ﬁtness and other biometric data. All of the patient data

in the CAMH Knowledge Graph are represented using

HL7 FHIR resource types except for the neuroimaging

results data, which are represented in the graph using

the Neuroimaging Data Model ontology (NIDM).

e CAMH Knowledge Graph takes a signiﬁcant step

towards providing a self-serve data platform for mental

health researcher [16].

Preliminary work tooutline theinformation space

ofinterest

We began by examining information needed to trace

pure functions and their use to enrich patient data in

knowledge graphs. is work included several rounds of

data resource modeling.

We started by outlining the information space of interest and developing our own "homegrown" RDF data resource models based on earlier releases of KOIO (1.0 and 2.0) and on our previous work to identify 13 categories of metadata relevant for describing pure functions [22].

At the outset, to guide and document our iterative data resource modeling efforts, we collaboratively developed a set of competency questions. Our intent was that the data resources for tracing and documenting pure functions and their use would contain answers for each competency question. We borrowed the method of using competency questions from the field of ontology development, where Competency Question-driven Ontology Authoring has been previously described [28]. As we proceeded in this work, our data resource modeling drew on concepts and relationships from two other relevant ontologies: the Function Ontology (FnO) [29] and the Provenance Ontology (PROV-O) [30].

As a proof of concept using our own homegrown data resource models, we manually created test instances of RDF data resources serialized in JSON-LD. These test data resources were loaded into a Knowledge Graph development environment. Once loaded, we used SPARQL queries to produce answers to the competency questions we developed.

Adoption ofHL7 FHIR RDF data resource types

After outlining the information space of interest, we

learned that the HL7 and W3C Semantic Web Health

Care and Life Sciences communities had, as part of

the 5th Release of FHIR, developed common, openly

available semantic RDF data models for each FHIR data

resource type [31]. We used our competency questions

to determine whether the combined content of the

HL7 FHIR RDF Library, Provenance, and Observation

data resource types was suﬃcient for our purposes

[31]. Finding that the content of these three HL7 FHIR

RDF data resource types was suﬃcient, we set aside our

homegrown data resource models. Adopting HL7 FHIR

RDF data resource types provided an opportunity to

demonstrate that RDF generated by a semantic patient

data enrichment ETL pipeline can conform to an openly

available data resource standard for describing pure

functions and computations resulting from using them.

HL7 FHIR RDF data resource validation using ShEx.js

Adopting HL7 FHIR RDF allowed us to use Shape Expressions (ShEx) to programmatically validate instances of FHIR RDF data resources [32, 33]. This validation focuses on the structure that HL7 resources must have. With technical support from the HL7 FHIR RDF community, we stood up a virtual server, loaded it with the ShEx.js tool [33], and performed validation tests on the FHIR RDF Observation and Provenance data resource types produced by the ETL pipeline. (The FHIR RDF Library data resource type is not produced by the ETL pipeline and was not validated.) Validation using ShEx.js initially indicated the presence of several errors. After correcting these errors, additional validation tests were done until conformance with the HL7 FHIR RDF standard was achieved. The ETL pipeline software was then developed to produce conformant HL7 FHIR RDF Observation and Provenance resources. More information and relevant files can be found at: https://github.com/kgrid/fhir-rdf-validation.

Example pure function used

As an example drawn from the domain of mental health for this study, KSL team members implemented a pure function in Python for computing a total Patient Health Questionnaire (PHQ-9) score and its standard interpretation from answers to the questionnaire provided by individual patients [34]. PHQ-9 scores are a common measure used to screen for severity of depressive symptoms [35].

The PHQ-9 pure function exists outside of the CAMH knowledge graph, where it can also be accessed by other applications through a corresponding RESTful API service. The pure function accepts the individual numeric results from the nine items comprising the PHQ-9 as its input parameters. It computes the total PHQ-9 score and provides an interpretation of that total score.
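A scoring and interpretation function of the kind described above might look like the following minimal Python sketch. It is not the function used in this study. The function name, the returned keys, and the severity bands (the commonly published PHQ-9 cut points of 5, 10, 15, and 20) are assumptions made for illustration, and the labels used by the study’s own implementation may differ.

```python
from typing import Sequence

# Assumed severity bands (upper bound inclusive); not taken from the study's code.
SEVERITY_BANDS = (
    (4, "Minimal"),
    (9, "Mild"),
    (14, "Moderate"),
    (19, "Moderately Severe"),
    (27, "Severe"),
)


def score_phq9(item_scores: Sequence[int]) -> dict:
    """Sum the nine PHQ-9 item scores (each 0 to 3) and label the total.

    Pure: no I/O, no globals modified, same inputs always give the same output.
    """
    if len(item_scores) != 9:
        raise ValueError("PHQ-9 requires exactly nine item scores")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item score must be 0, 1, 2, or 3")
    total = sum(item_scores)
    # Pick the first band whose upper bound is at or above the total.
    interpretation = next(label for upper, label in SEVERITY_BANDS if total <= upper)
    return {"total_score": total, "interpretation": interpretation}


if __name__ == "__main__":
    print(score_phq9([1, 1, 1, 1, 1, 1, 1, 1, 1]))
    print(score_phq9([2, 2, 2, 1, 1, 1, 1, 1, 1]))
```

Exposing such a function behind an API only adds transport; the mapping from nine item scores to a total and a label is unchanged.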

A file with technical metadata about the origin and characteristics of the PHQ-9 pure function was also developed. These metadata were later reformatted to conform to the HL7 FHIR RDF Library data resource type specification for describing software libraries but not validated using ShEx.js. At CAMH, a software script was developed to load HL7 FHIR RDF Library metadata from the Knowledge Object into the CAMH Knowledge Graph. This script exists outside of the ETL pipeline because it only needs to run once to load and record metadata about each pure function. This script can be run on an ad-hoc basis whenever a new pure function is to be used for semantic data enrichment at CAMH.

Development oftheETL pipeline using Apache NiFi

andpython code

To establish the ETL pipeline and complete the technical

work for this project, several existing technologies were

used. When building the ETL pipeline, Apache NiFi [36]

was leveraged for its ability to automate the ﬂow of data

between existing software systems. CAMH already used

Apache NiFi to routinely check REDCap for new PHQ-9

responses. At CAMH, whenever new PHQ-9 responses

are detected, Apache NiFi inserts them into the CAMH

knowledge graph as a PHQ-9 response data object. We

used Apache NiFi as a tool for implementing the ETL

pipeline.

In our case, the CAMH Knowledge Graph emits Server-Sent Events (SSEs) when data is inserted, updated, or deleted. In our ETL pipeline implementation, Apache NiFi was configured to monitor these SSEs specifically for PHQ-9 responses being added to the graph. Upon detecting such an event, Apache NiFi first retrieves new PHQ-9 responses for a patient from the graph and then transmits these responses via an API request to the pure function for computation.

Each time new PHQ-9 scoring and interpretation computations are made, the ETL pipeline then produces a single new conformant HL7 FHIR RDF Observation resource representing the new computations and a corresponding single new conformant HL7 FHIR RDF Provenance resource with semantic information describing how each computation came about and linking to the specifics of the pure function used to compute it.
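The event-driven step described above, in which one computation yields one Observation and one linked Provenance record, can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the dictionaries are not conformant HL7 FHIR RDF, their keys only loosely echo FHIR element names, the identifiers are hypothetical, and `call_pure_function` stands in for the API request rather than reproducing the pipeline’s actual code.

```python
from datetime import datetime, timezone
from typing import Callable, Optional, Sequence


def enrich_on_event(
    patient_id: str,
    response_id: str,
    answers: Sequence[int],
    call_pure_function: Callable[[Sequence[int]], dict],
    library_id: str,
    observation_id: str,
    now: Optional[datetime] = None,
) -> tuple[dict, dict]:
    """Build one Observation-like and one Provenance-like record for a new response."""
    recorded = (now or datetime.now(timezone.utc)).isoformat()
    # Stand-in for the API request to the deployed pure function.
    result = call_pure_function(answers)

    observation = {
        "resourceType": "Observation",
        "id": observation_id,
        "status": "final",
        "subject": patient_id,
        "effectiveDateTime": recorded,
        "value": result,
    }
    provenance = {
        "resourceType": "Provenance",
        "target": observation_id,   # which computation this record describes
        "recorded": recorded,       # when the computation was made
        "patient": patient_id,
        "used_inputs": {"source": response_id, "answers": list(answers)},
        "used_function": library_id,  # link to the metadata describing the function
    }
    return observation, provenance


if __name__ == "__main__":
    obs, prov = enrich_on_event(
        patient_id="ex:patient/123456",
        response_id="ex:questionnaire-response/1",
        answers=[1, 1, 1, 1, 1, 1, 1, 1, 1],
        call_pure_function=lambda a: {"total_score": sum(a)},
        library_id="ex:library/phq9-1.0.0",
        observation_id="ex:observation/1",
    )
    print(obs)
    print(prov)
```

Passing the function call in as a parameter keeps this step testable without a running API, and the shared `observation_id` is what ties each Provenance record to exactly one Observation.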

Testing thedata enrichment ETL pipeline

Manual tests were performed at each step in the pipeline

to conﬁrm that the pipeline functioned properly. Because

Blue Brain Nexus will accept essentially any RDF, to

conﬁrm that the data was inserted correctly with the

expected RDF data structures, SPARQL queries were

performed. Blue Brain Nexus supports conformant

SPARQL queries and provides its users with immediate

query results.

Results

To address Research Question 1, we found the essential information required to effectively trace and document semantic data enrichment processes involving pure functions could be represented in two ways. First, as a list of competency questions (Table 1) where each question is associated with one or more target stakeholder group(s) thought to be most interested in it. Second, we found that a combination of three HL7 FHIR RDF data resource types (Library, Observation, and Provenance) included the answers to the 13 competency questions in Table 1. A mapping between the 13 competency questions and elements of the Library, Observation, and Provenance data resource types is provided in the far right column of Table 1.

To address Research Question 2, we found a number of components were needed to establish an ETL pipeline for semantic data enrichment. These components are depicted in Fig. 2.

Starting on the left of Fig. 2, existing data sources send data to the CAMH Knowledge Graph. In 2019, CAMH initially created this Knowledge Graph by deploying an instance of the Blue Brain Nexus knowledge graph platform in an on-premises Kubernetes cluster. When the ETL pipeline surrounded by the dotted line is triggered, it fetches and processes data from the graph. In general, the technical components needed to support a semantic data enrichment ETL pipeline like the one we developed are a knowledge graph, a listener (like Apache NiFi), the pipeline’s software code (which we implemented in five stages using Python), and a deployed instance of one or more executable pure functions (which we accessed via a local API).

The computational sequence supported by these technical components is shown from top to bottom in Fig. 3. The sequence starts with a patient providing responses to the items of the PHQ-9 in REDCap. The Apache NiFi Listener checks REDCap for newly posted patient PHQ-9 survey responses every 10 min. When one or more instances of these responses are found, they are loaded into the Blue Brain Nexus Knowledge Graph. Next, Blue Brain Nexus generates a specific Server-Sent Event message which is detected by Apache NiFi, causing it to trigger the ETL pipeline. After querying data available in the graph, the pipeline’s code, which is also running in Apache NiFi, posts a request for a summary PHQ-9 score and its interpretation to the KO-API, the API through which the pure function is accessed. Next, the pipeline’s code generates corresponding FHIR RDF Observation and Provenance resources containing, respectively, the new computations and information about how the new computations were produced. Finally, the pipeline’s code loads the new FHIR RDF Observation and Provenance resources into the graph, thereby enriching it with more data.

To address Research Question 3, we learned the following about adopting and making any pipeline’s outputs conform to existing HL7 FHIR RDF data resource standards. First, we created instances of the Observation and Provenance resources, serialized them in Turtle, and tested them for conformance with the HL7 FHIR RDF standards using the ShEx.js validation tool.

To test these two HL7 FHIR RDF data resource types using ShEx.js, a common schema detailing all of the requirements for each type of data resource is required. We used openly available HL7 FHIR RDF schemas created by Sharma et al. for this [37]. As an example of schemas like these, the initial lines of the Observation resource schema that we used look like this:

Next is a complete Observation resource we created that passed its ShEx.js validation:

Table 1 List of competency questions. These 13 competency questions developed by the research team provided a necessary outline of the information space of interest for this project

No. | Competency Question | Target Stakeholder Groups | HL7 FHIR RDF Data Resource Type with this Information
1 | COMPUTATION TIME & DATE: At what date and time was the patient-level computation made? | Patients, Clinicians, Researchers, IT Professionals | OBSERVATION
2 | PATIENT: Who was the patient-level computation about? | Patients, Clinicians, Researchers | OBSERVATION, PROVENANCE
3 | NAME OR TITLE: Which pure function was used to compute the patient-level computation? | Clinicians, Researchers, IT Professionals | LIBRARY
4 | VERSION: What was the version of the pure function implemented in code that was used to compute the patient-level computation? | Clinicians, Researchers, IT Professionals | LIBRARY
5 | PURE FUNCTION CREATION DATE: When was the pure function’s implementation in code created? | Researchers, IT Professionals | LIBRARY
6 | PURPOSE: What was the purpose of the pure function used to generate the patient-level computation? | Patients, Clinicians, Researchers | LIBRARY
7 | PURE FUNCTION IDENTIFIER: What was the persistent, unique identifier for the pure function and its implementation in code? | IT Professionals | LIBRARY, PROVENANCE
8 | INPUTS: What were the parameters ("inputs") used to complete the patient-level computation? | Patients, Clinicians, Researchers | PROVENANCE
9 | OUTPUTS: What was the computation generated as output using the inputs and the pure function? | Patients, Clinicians, Researchers | OBSERVATION
10 | UNDERLYING EVIDENCE: What was the evidence upon which the pure function is based? | Clinicians, Researchers | LIBRARY
11 | FUNCTION CALL: What was the function call to POST patient data to the API where the function is running? | IT Professionals | PROVENANCE
12 | PURE FUNCTION FILES: What files were used to implement the function and its API? | IT Professionals | LIBRARY
13 | PURE FUNCTION CREATOR: Who created and implemented the pure function in code? | Clinicians, Researchers, IT Professionals | LIBRARY

We confirmed that the Observation resource above passed its validation using ShEx.js when we received the following truncated output from ShEx.js with no "failure list" of errors. To show the difference, below that is the truncated output from a failed validation test with a "FailureList" of errors. The first failure shown in part below is a "TypeMismatch" for the Observation resource’s status field, meaning that the status field did not contain an entry allowed by the HL7 FHIR RDF standard.
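A pass or fail check of this kind can be automated once the validator’s output has been parsed as JSON. The Python sketch below walks a parsed result and collects every error listed under a node typed "FailureList". The overall shape of the result object, and every key other than the "FailureList" and "TypeMismatch" names mentioned in the text, are assumptions made for illustration rather than a description of the actual ShEx.js output format.

```python
def find_failures(node) -> list:
    """Recursively collect the errors of every 'FailureList' node in a parsed result."""
    failures = []
    if isinstance(node, dict):
        if node.get("type") == "FailureList":
            failures.extend(node.get("errors", []))
        # Keep walking, since failure lists may be nested inside other nodes.
        for value in node.values():
            failures.extend(find_failures(value))
    elif isinstance(node, list):
        for item in node:
            failures.extend(find_failures(item))
    return failures


def is_conformant(result) -> bool:
    """Treat a result as conformant when it contains no failure entries."""
    return not find_failures(result)


if __name__ == "__main__":
    # Hypothetical result objects, shaped only for this illustration.
    passing = {"node": "ex:observation/1", "result": {"type": "ShapeTest"}}
    failing = {
        "node": "ex:observation/2",
        "result": {
            "type": "FailureList",
            "errors": [{"type": "TypeMismatch", "property": "status"}],
        },
    }
    print(is_conformant(passing))
    print([error["type"] for error in find_failures(failing)])
```

A check like this could gate the loading step so that only resources with no reported failures are written to the graph.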

Fig. 2 Components of a Semantic Data Enrichment ETL Pipeline. The dashed line surrounds the actual ETL pipeline. From left to right, the components start with data sources routinely feeding a knowledge graph with new data. An event listener detects when select new data reaches the graph and triggers the five stages of the software pipeline to enrich the data in the graph by using a pure function API backed by an executable implementation of a pure function


We performed iterative validation tests like these for our Observation and Provenance resource examples until our examples of both resources passed their ShEx.js validation. At that point, the code for Stage 4 of the ETL pipeline (see Fig. 2) was programmed to produce conformant new instances of the Observation and Provenance data resources.

Next, we learned how to use SPARQL queries of an enriched knowledge graph like the one below to confirm that the HL7 FHIR RDF Observation and Provenance data resources loaded by the ETL pipeline into the CAMH Knowledge Graph at the final pipeline Stage 5 contain information sufficient to answer our competency questions.

Fig. 3 Semantic Data Enrichment ETL Pipeline Sequence Diagram. The computational sequence supported by the pipeline is detailed


As shown in Fig.4 further below, since we are querying

a graph the SPARQL queries we developed take advan-

tage of semantic data links between new observation and

provenance resources and existing patient, PHQ-9 ques-

tionnaire response, and (software) Library resources.

When subgraphs like that in Fig.4 are queried using

the SPARQL query above, the CAMH Knowledge Graph

produces rows of query output like the example rows in

Table2 below.

is study addressed three research questions. e

results above list essential information needed to trace

and document semantic data enrichment, show the com-

ponents of an ETL pipeline for enriching an RDF data

store, detail what is required for such an ETL pipeline

to be able to produce conformant HL7 FHIR RDF data

resources, and show how to query such data resources

using SPARQL.

Fig. 4 Relationships between relevant data resources. The ETL pipeline produces the HL7 FHIR RDF Observation and Provenance resources on the left, which are related to the Patient, Questionnaire Response, and (software) Library resources on the right as shown

Table 2 Two rows of a report showing data of interest about the computation of a derived patient feature. Each row shows data for one simulated patient, including the numeric output from the pure function in the "phq9_total_score" column, the computed interpretation of the numeric output in the "phq9_total_score_interpretation" column, the date and time when the computations of response and interpretation were generated (prov_recorded_dt), and the version of the pure function used to compute them (ko_version)

patient_id | phq9_total_score | phq9_total_score_interpretation | prov_recorded_dt | ko_version
123456 (not a real ID) | 18 | Moderately Severe | 2022-03-03T21:30:00.000Z | 1.0.0
654321 (not a real ID) | 22 | Severe | 2022-03-03T19:30:00.000Z | 1.0.0

Discussion

We conducted this technical study to show how biomedical research organizations like CAMH that already use knowledge graphs can further enrich their graph data by computing with pure functions. To a significant degree, we overcame the three challenges noted above. To uphold patient data security and integrity, the ETL pipeline we developed operates entirely inside CAMH's IT security zone. Patient data never leaves CAMH. To assist in managing growing pure function use over time, we showed how every computation made with a pure function can be associated with key provenance information, including information about the implemented version of any pure function. CAMH values this provenance information for computations about patients because it enables their researchers to understand when and how each computation was produced. Finally, to show how existing semantic data resource standards can be leveraged when enriching graph data with pure functions, we programmed the pipeline to generate valid HL7 FHIR RDF Observation and Provenance data resources.
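The pairing of a computation with its provenance can be pictured with the minimal Python sketch below. It is an illustration only and not the pipeline's implementation, which emits HL7 FHIR RDF Observation and Provenance resources. The `ComputationRecord` class, the `compute_with_provenance` helper, and all field names are assumptions made for this example, and the built-in `sum` merely stands in for a versioned pure function.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class ComputationRecord:
    """Hypothetical record pairing one computed value with its provenance."""
    patient_id: str
    value: int
    recorded: str          # ISO 8601 timestamp of the computation (UTC)
    function_name: str
    function_version: str


def compute_with_provenance(
    patient_id: str,
    inputs: Sequence[int],
    func: Callable[[Sequence[int]], int],
    function_name: str,
    function_version: str,
    now: Optional[datetime] = None,
) -> ComputationRecord:
    """Apply a pure function and capture when and with which version it ran."""
    timestamp = now if now is not None else datetime.now(timezone.utc)
    return ComputationRecord(
        patient_id=patient_id,
        value=func(inputs),
        recorded=timestamp.isoformat(timespec="milliseconds"),
        function_name=function_name,
        function_version=function_version,
    )


# Usage with simulated data; `sum` stands in for a versioned pure function.
record = compute_with_provenance(
    patient_id="123456",  # not a real ID
    inputs=[2, 2, 2, 2, 2, 2, 2, 2, 2],
    func=sum,
    function_name="phq9_total_score",
    function_version="1.0.0",
    now=datetime(2022, 3, 3, 21, 30, tzinfo=timezone.utc),
)
print(record)
```

Passing the timestamp in as an optional argument keeps the helper deterministic under test, while the function version travels with every result.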

For our example pure function in this study, we selected a simple but widely used scoring and interpretation function relevant to clinical depression. This function combines patient reported data from the Patient Health Questionnaire-9 instrument to generate a summary PHQ-9 score and an interpretation of that PHQ-9 score [38]. In this example, computing the PHQ-9 summary score and its interpretation in a standard, reliable, and consistent way enables CAMH to enrich their patient data knowledge graph automatically every time new PHQ-9 responses are logged.
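To make the idea of a scoring and interpretation pure function concrete, here is a minimal Python sketch of PHQ-9 scoring. It is not the knowledge object used in this study; the function names and label strings are assumptions for illustration. The severity bands (0-4, 5-9, 10-14, 15-19, 20-27) follow the commonly published PHQ-9 cut points [38].

```python
from typing import List, Sequence, Tuple

# Inclusive upper bound of each severity band, paired with an illustrative label.
PHQ9_BANDS: List[Tuple[int, str]] = [
    (4, "Minimal"),
    (9, "Mild"),
    (14, "Moderate"),
    (19, "Moderately Severe"),
    (27, "Severe"),
]


def phq9_total_score(responses: Sequence[int]) -> int:
    """Sum the nine item responses, each scored 0 to 3."""
    if len(responses) != 9:
        raise ValueError("PHQ-9 requires exactly nine item responses")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each PHQ-9 item must be scored 0, 1, 2, or 3")
    return sum(responses)


def phq9_interpretation(total: int) -> str:
    """Map a total score (0 to 27) to its severity band."""
    if not 0 <= total <= 27:
        raise ValueError("PHQ-9 total score must be between 0 and 27")
    for upper, label in PHQ9_BANDS:
        if total <= upper:
            return label
    raise AssertionError("unreachable: every valid total falls in a band")


def phq9(responses: Sequence[int]) -> Tuple[int, str]:
    """Pure function: the output depends only on the nine responses."""
    total = phq9_total_score(responses)
    return total, phq9_interpretation(total)


print(phq9([2, 2, 2, 2, 2, 2, 2, 2, 2]))
print(phq9([3, 3, 3, 3, 3, 3, 2, 1, 1]))
```

Because the function reads no external state and has no side effects, the same nine responses always yield the same score and interpretation, which is what makes its results reproducible and traceable.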

While CAMH's implementation of the pipeline uses SPARQL queries to fetch PHQ-9 data stored in Blue Brain Nexus, organizations are not required to migrate their existing data to a knowledge graph platform. The modular design of pure functions allows the pipeline architecture to be agnostic about the source of its input parameters. Organizations can query PHQ-9 data directly from their current systems so long as the query returns the information required by the pure function signature. However, a key requirement for optimal pipeline performance is a versioned data management approach in which individual data points are assigned unique and persistent identifiers, as described in the FAIR principles. This is what enables the pipeline to maintain a provenance chain that can trace computational results back to their source observations, regardless of the underlying storage systems.
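The source-agnostic idea can be sketched in Python as a pair of small adapters that reshape records from different systems into the same function input. Everything named here is hypothetical: the column names `q1` to `q9`, the SPARQL variable names `item1` to `item9` and `response`, the `urn:example:` identifier, and the stand-in `total_score` function. Only the binding layout (`{"var": {"type": ..., "value": ...}}`) follows the standard SPARQL 1.1 JSON results format.

```python
from typing import Any, Callable, Dict, Iterable, List, Mapping, Tuple

# (persistent identifier of the source response, nine item scores)
Phq9Input = Tuple[str, List[int]]


def from_tabular_row(row: Mapping[str, Any]) -> Phq9Input:
    """Adapter for a hypothetical relational or CSV row with columns q1..q9."""
    return str(row["response_id"]), [int(row[f"q{i}"]) for i in range(1, 10)]


def from_sparql_binding(binding: Mapping[str, Mapping[str, str]]) -> Phq9Input:
    """Adapter for one binding of a SPARQL 1.1 JSON result (hypothetical variables)."""
    items = [int(binding[f"item{i}"]["value"]) for i in range(1, 10)]
    return binding["response"]["value"], items


def total_score(items: List[int]) -> int:
    """Stand-in pure function: depends only on its arguments."""
    return sum(items)


def enrich(
    records: Iterable[Mapping[str, Any]],
    adapter: Callable[[Mapping[str, Any]], Phq9Input],
    func: Callable[[List[int]], int],
) -> List[Dict[str, Any]]:
    """Run the same pure function over records from any source."""
    results = []
    for record in records:
        source_id, items = adapter(record)
        # Keeping the source identifier preserves the provenance chain.
        results.append({"derived_from": source_id, "value": func(items)})
    return results


tabular = [{"response_id": "urn:example:qr-1", **{f"q{i}": 2 for i in range(1, 10)}}]
sparql = [{
    "response": {"type": "uri", "value": "urn:example:qr-1"},
    **{f"item{i}": {"type": "literal", "value": "2"} for i in range(1, 10)},
}]

from_table = enrich(tabular, from_tabular_row, total_score)
from_graph = enrich(sparql, from_sparql_binding, total_score)
print(from_table == from_graph)
```

Only the adapters know about the storage system; the pure function and the enrichment loop stay unchanged, and each result carries the persistent identifier of the data it was derived from.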

Because the pure function is designed independently of the pipeline, it can be reused seamlessly across different applications, pipelines, and computing environments, provided the input data conform to the expected function signature. This reusability ensures computational consistency across systems and facilitates validation and testing efforts.

Additionally, the semantic data enrichment ETL pipeline we developed can now be fitted with other locally executable pure functions beyond PHQ-9 scoring. In biomedicine overall, computing with pure functions is increasingly important, as hundreds of medical calculators, biomedical equations, scoring tools, and computable guidelines exist that are widely and frequently used. In future work, we look forward to fitting semantic data enrichment pipelines like the one developed for this study with a wide array of biomedical pure functions relevant to mental health research. Being source-agnostic allows organizations to leverage their existing data infrastructure while gaining the provenance benefits demonstrated in our implementation, making the pipeline broadly applicable across diverse healthcare IT environments for any computable biomedical function that follows similar architectural patterns.

Several ongoing developments suggest that the number of biomedical pure functions is likely to increase. The most obvious of these may be the recent growth in small and large-scale machine learning and deep learning (ML/DL) models for biomedicine. The medical informatics literature is increasingly filled with reports of new statistical-mathematical pure functions arising from ML/DL [5]. At the individual patient level, computations using a growing number of ML/DL pure functions inform and assist with disease categorization and subcategorization, outlier detection, outcomes prediction, and many other tasks. Other work is ongoing to improve widely used biomedical equations for estimating patient-level features, including physiological functioning and health risks. As more and higher quality patient data become available [39], and as the use of race as a proxy for genetics gets set aside [40], widely used biomedical equations get reworked. As examples, consider how once widely used equations for predicting kidney function (e.g., the Cockcroft-Gault equation) and for estimating cardiovascular disease risk (e.g., Framingham Risk Score) have been replaced by newer equations. As these changes occur, being able to trace how computations were made enables organizations to distinguish computations made with older and newer versions of the pure functions they use. For clinicians and patients, being able to trust that computations have been properly made is critical. For researchers, understanding how individual patient-level computations come about is necessary for reproducing data analyses. For risk managers, tracing the origins of select high stakes biomedical computations may be desirable.
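Once each computation carries the version of the function that produced it, separating older from newer results reduces to a simple grouping step. The short Python sketch below assumes hypothetical report rows shaped like a query output with `patient_id` and `ko_version` fields; the version "2.0.0" and the third patient ID are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def group_by_version(rows: Iterable[Mapping[str, str]]) -> Dict[str, List[str]]:
    """Group patient IDs by the function version recorded in provenance."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        groups[row["ko_version"]].append(row["patient_id"])
    return dict(groups)


def computed_with_other_version(
    rows: Iterable[Mapping[str, str]], current_version: str
) -> List[str]:
    """List patients whose result came from a version other than the current one."""
    return [r["patient_id"] for r in rows if r["ko_version"] != current_version]


# Simulated rows; none of these are real patient identifiers.
rows = [
    {"patient_id": "123456", "ko_version": "1.0.0"},
    {"patient_id": "654321", "ko_version": "1.0.0"},
    {"patient_id": "111111", "ko_version": "2.0.0"},
]

print(group_by_version(rows))
print(computed_with_other_version(rows, current_version="2.0.0"))
```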

It is also true that biomedical pure functions can be used in combination to arrive at "compound computations." For example, estimates of kidney function may serve as inputs to computable phenotype functions that determine the stages of chronic kidney disease. It is easy to imagine more elaborate examples where many pure functions are used in conjunction to compute new patient-level features. In one of our past projects, more than 40 pure functions were used to compute a battery of estimates of the risks and benefits of preventive medical services [21]. In clinical medicine, now is the time to develop reliable procedures for producing and reporting singular and compound computations arising from the use of pure functions. This work shows how knowledge graphs and related semantic web technology can assist in this regard.
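A compound computation of the kind described above can be sketched in Python by composing two pure functions. This is an illustration under stated assumptions, not a clinical tool: the Cockcroft-Gault equation estimates creatinine clearance in mL/min, whereas the G1 to G5 categories are defined on GFR in mL/min/1.73 m², so feeding one into the other is a deliberate simplification, and actual staging of chronic kidney disease also depends on persistence over time and other markers. The function names are hypothetical.

```python
from typing import Tuple


def cockcroft_gault(
    age_years: float, weight_kg: float, serum_creatinine_mg_dl: float, female: bool
) -> float:
    """Estimate creatinine clearance (mL/min) with the Cockcroft-Gault equation."""
    if serum_creatinine_mg_dl <= 0:
        raise ValueError("serum creatinine must be positive")
    clearance = ((140 - age_years) * weight_kg) / (72 * serum_creatinine_mg_dl)
    return clearance * 0.85 if female else clearance


def gfr_category(gfr: float) -> str:
    """Map a kidney function estimate to a GFR category (G1 to G5)."""
    thresholds = [(90, "G1"), (60, "G2"), (45, "G3a"), (30, "G3b"), (15, "G4")]
    for lower_bound, label in thresholds:
        if gfr >= lower_bound:
            return label
    return "G5"


def kidney_category(
    age_years: float, weight_kg: float, serum_creatinine_mg_dl: float, female: bool
) -> Tuple[float, str]:
    """Compound computation: the first function's output feeds the second."""
    estimate = cockcroft_gault(age_years, weight_kg, serum_creatinine_mg_dl, female)
    return estimate, gfr_category(estimate)


# Simulated inputs only.
print(kidney_category(60, 72, 1.0, female=False))
print(kidney_category(80, 72, 2.0, female=False))
```

Since the composed result depends on two functions, tracing it requires the version of each one, which is why provenance matters even more for compound computations.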

This work began as an information modeling exercise. As the modeling work unfolded, it became apparent that our "homegrown" models overlapped considerably in their content with existing data resource models from the HL7 FHIR community. Adopting HL7 FHIR's RDF data resource models brought several advantages. First, relinquishing "homegrown" information models makes this work applicable to a wider audience. Second, adopting HL7 FHIR RDF models made it possible to complete a demonstration of automated data resource conformance testing and validation using the newly created ShEx toolkit. This obviated any need to develop a unique data conformance testing and validation mechanism of our own. Third, showing how HL7 FHIR RDF can be used in a practical application like our ETL pipeline may help others to do similar things.

The focus of this work is exclusively on semantic data enrichment using domain-specific pure functions implemented external to knowledge graphs. This kind of data enrichment is quite different from using logical reasoners to produce new inferences from existing graph data. A critical limitation of this work is that it did not implement the example pure function or its separate API using semantic technologies. We look forward to future work along these lines where we describe the inputs and outputs of pure functions using semantic data and deploy pure functions using semantic API technology.

This work is further limited in the following ways. No attempt was made to develop the ETL pipeline using other tooling even though Jupyter Notebooks and other available tools could be used to build similar pipelines. Only one pure function implementation was used, and the data enrichment ETL pipeline that resulted is likely over-fit to the example PHQ-9 pure function.

We also encountered limitations related to FHIR RDF. A limitation in the FHIR RDF R5 vocabulary is that not all of the FHIR RDF R5 URIs resolve to a web page. This makes it harder to understand the meaning of some FHIR RDF R5 data. Through our discussions with the FHIR RDF R5 implementation team, we understand work to address this issue is ongoing. We have provided some links to the actual definitions from the FHIR vocabulary for terms used in the above query:

One other thing we noted is that in FHIR RDF's R5 release, the use of the property fhir:value was changed to fhir:v for primitive values to avoid any conflicts with other uses of fhir:value throughout the FHIR standard.

Conclusions

Using open source technologies, an existing patient knowledge graph, and HL7 FHIR RDF Library, Observation, and Provenance linked data resources, a software pipeline was built to enrich a patient-oriented knowledge graph with computations arising from using a pure function. This approach to data enrichment is an example of bringing "compute" to a knowledge graph environment using widely available technologies and in conformance with an emerging data standard (HL7 FHIR RDF). This work enables secure and traceable computation for healthcare and biomedical research.

Abbreviations

ETL Extract, Transform, and Load

FHIR Fast Healthcare Interoperability Resources

HL7 Health Level 7

PHQ-9 Patient Health Questionnaire 9

RDF Resource Description Framework

Acknowledgements

The authors would like to thank David Booth, Eric Prud'hommeaux and the HL7 FHIR RDF for Semantic Interoperability Working Group for their interest and willingness to share information about the emerging HL7 FHIR RDF standard. This project also benefited from early information modelling efforts that included Robert Xiao during his term at CAMH.

Authors’ contributions

A.A. and A.F. conceived of this study. A.A., A.F., and A.P. did the information space modeling work. A.P. completed the ShEx validation work with help from A.F. A.A. developed and tested the ETL data enrichment pipeline at CAMH. M.C. reviewed and commented on all results. A.A., A.F., A.P., and M.C. wrote and edited the manuscript.

Funding

No grant funding was used to support this work.

Data availability

The data and software supporting the conclusions of this article are included within the article and in additional files to which links are provided.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.


Competing interests

The authors declare no competing interests.

Received: 16 December 2024 Accepted: 1 July 2025

References

1. Bartosz Milewski. Chapter 3 "Pure Functions, Laziness, I/O, and Monads" in "Basics of Haskell". School of Haskell. 2013. Online at: https://www.schoolofhaskell.com/school/starting-with-haskell/basics-of-haskell/3-pure-functions-laziness-io. Retrieved 2024-06-19.

2. Spitzer RL, Kroenke K, Williams JB, Löwe B. A brief measure for assessing generalized anxiety disorder: the GAD-7. Arch Intern Med. 2006;166(10):1092–7.

3. Lu T, Silveira PP, Greenwood CM. Development of risk prediction models for depression combining genetic and early life risk factors. Front Neurosci. 2023;18(17):1143496.

4. Su C, Xu Z, Pathak J, Wang F. Deep learning in mental health outcome research: a scoping review. Transl Psychiatry. 2020;10(1):116.

5. Federico CA, Trotsyuk AA. Biomedical data science, artificial intelligence, and ethics: navigating challenges in the face of explosive growth. Ann Rev Biomed Data Sci. 2024;10:7.

6. Shivade C, Raghavan P, Fosler-Lussier E, Embi PJ, Elhadad N, Johnson SB, Lai AM. A review of approaches to identifying patient phenotype cohorts using electronic health records. J Am Med Inform Assoc. 2014;21(2):221–30.

7. Miotto R, Li L, Kidd BA, Dudley JT. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci Rep. 2016;6(1):1.

8. Braunstein ML. Health Informatics on FHIR: How HL7's API is transforming healthcare. Springer; 2022. https://link.springer.com/book/10.1007/978-3-030-91563-6#bibliographic-information.

9. Duda SN, Kennedy N, Conway D, Cheng AC, Nguyen V, Zayas-Cabán T, Harris PA. HL7 FHIR-based tools and initiatives to support clinical research: a scoping review. J Am Med Inform Assoc. 2022;29(9):1642–53.

10. Resource Description Framework, W3C. At: https://www.w3.org/RDF/. Accessed 20 June 2024.

11. Ajileye T, Motik B. Materialisation and data partitioning algorithms for distributed RDF systems. J Web Semantics. 2022;1(73):100711.

12. JSON for Linking Data. At: https://json-ld.org/. Accessed 20 June 2024.

13. Quilitz B, Leser U. Querying distributed RDF data sources with SPARQL. In The Semantic Web: Research and Applications: 5th European Semantic Web Conference, ESWC 2008, Tenerife, Canary Islands, Spain, June 1-5, 2008 Proceedings 5 2008 (pp. 524-538). Springer Berlin Heidelberg.

14. Mishra RB, Kumar S. Semantic web reasoners and languages. Artif Intell Rev. 2011;35:339–68.

15. Sy MF, Roman B, Kerrien S, Mendez DM, Genet H, Wajerowicz W, Dupont M, Lavriushev I, Machon J, Pirman K, Neela MD. Blue brain nexus: an open, secure, scalable system for knowledge graph management and data-driven science. Semantic Web. 2023;14(4):697–727.

16. Rotenberg DJ, Chang Q, Potapova N, Wang A, Hon M, Sanches M, Bogetic N, Frias N, Liu T, Behan B, El-Badrawi R. The CAMH neuroinformatics platform: a hospital-focused Brain-CODE implementation. Front Neuroinform. 2018;6(12):77.

17. Jeelani OF, Njie M, M Korzhuk V. Methods and algorithms of ensuring data privacy in AI-based healthcare systems and technologies. In Conference Proceedings, Paris France April 2024 Apr 11 (Vol. 11, p. 12).

18. Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3(1):160018.

19. Barker M, Chue Hong NP, Katz DS, Lamprecht AL, Martinez-Ortiz C, Psomopoulos F, et al. Introducing the FAIR principles for research software. Sci Data. 2022;9(1):622.

20. Knowledge Systems Lab. KOIO - The knowledge object implementation ontology. GitHub - kgrid/koio: Available from: https://github.com/kgrid/koio. Cited 2025 Apr 14.

21. Flynn A, Taksler G, Caverly T, Beck A, Boisvert P, Boonstra P, et al. CBK model composition using paired web services and executable functions: a demonstration for individualizing preventive services. Learn Health Syst. 2023;7(2):e10325.

22. Alper BS, Flynn A, Bray BE, Conte ML, Eldredge C, Gold S, et al. Categorizing metadata to help mobilize computable biomedical knowledge. Learning Health Systems. 2022;6(n/a):e10271.

23. Flynn A, Conte M, Boisvert P, Richesson R, Landis-Lewis Z, Friedman C. Linked metadata for FAIR digital objects carrying computable knowledge. Res Ideas Outcomes. 2022;10(8):e94438.

24. McCusker J, McIntosh LD, Shaffer C, Boisvert P, Ryan J, Navale V, Topaloglu U, Richesson RL. Guiding principles for technical infrastructure to support computable biomedical knowledge. Learn Health Syst. 2023;7(3):e10352.

25. Knowledge Systems Lab. BMI Calculator Knowledge Object. Available from: https://github.com/kgrid/koio/tree/master/examples/bmi_calculator_v_3. Cited 2025 Apr 14.

26. Prud'hommeaux E, Collins J, Booth D, Peterson KJ, Solbrig HR, Jiang G. Development of a FHIR RDF data transformation and validation framework and its evaluation. J Biomed Inform. 2021;1(117):103755.

27. HL7 FHIR RDF Representation. Accessed at: https://hl7.org/fhir/rdf.html December 5, 2024.

28. Ren Y, Parvizi A, Mellish C, Pan JZ, Van Deemter K, Stevens R. Towards competency question-driven ontology authoring. In The Semantic Web: Trends and Challenges: 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25–29, 2014. Proceedings 11 2014 (pp. 752–767). Springer International Publishing.

29. IDLab. The Function Ontology. The function ontology. Available from: https://fno.io/. Cited 2025 April 14.

30. PROV-O: The PROV Ontology. Available from: https://www.w3.org/TR/prov-o/. Cited 2025 Apr 14.

31. Health Level 7. Resource description framework representation. Available from: https://www.hl7.org/fhir/rdf.html. Cited 2025 April 14.

32. Solbrig HR, Prud'hommeaux E, Grieve G, McKenzie L, Mandel JC, Sharma DK, Jiang G. Modeling and validating HL7 FHIR profiles using semantic web shape expressions (ShEx). J Biomed Inform. 2017;67(1):90–100.

33. Prud'hommeaux E. ShEx.js. Available from: https://github.com/shexjs/shex.js. Cited 2024 Dec 5.

34. Knowledge Systems Lab. PHQ9 Knowledge Object. Available from: https://github.com/kgrid-objects/CAMH/blob/main/collection/99999-CAMH2-v1.0/PHQ9_algorithm.js. Cited 2025 April 14.

35. Kroenke K, Spitzer RL, Williams JBW. The PHQ-9. J Gen Intern Med. 2001;16(9):606–13.

36. Apache Software Foundation. Apache NiFi. Apache NiFi. Available from: https://nifi.apache.org/. Cited 2025 Apr 14.

37. Sharma DK, Prud'hommeaux E, Booth D, Nanjo C, Jiang G. Shape expressions (ShEx) schemas for the FHIR R5 specification. J Biomed Inform. 2023;148(Dec):104534.

38. Kroenke K, Spitzer RL, Williams JB. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. 2001;16(9):606–13.

39. International Data Corporation. 2018. Digitization of the world from edge to core. Available from: https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf. Cited 2025 Apr 14.

40. National Academies of Sciences and Medicine. Using population descriptors in genetics and genomics research: a new framework for an evolving field. Washington, DC: The National Academies Press; 2023. Available from: https://nap.nationalacademies.org/catalog/26902/using-population-descriptors-in-genetics-and-genomics-research-a-new.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
