
Semantic Data Platform for Healthcare
D3.1 Sketch of system architecture specification
WP3 – Architecture and Requirements
Lead beneficiary: MUG
Date: 31/03/2014
Nature: Report
Dissemination level: PU

D3.1 – Sketch of system architecture specification 
WP3: Architecture and Requirements 
Dissemination level: Public 
Authors: Philipp Daumke, Carla Haid, Luke Mertens (Averbis), 
Stefan Schulz (MUG) 
Version: 1-0 Final 
TABLE OF CONTENTS
DOCUMENT INFORMATION
DOCUMENT HISTORY
DEFINITIONS
EXECUTIVE SUMMARY
KEY WORDS (WORDLE STYLE)
INTRODUCTION
ABOUT SEMCARE
MOTIVATION AND BACKGROUND
PROJECT DESCRIPTION
ABOUT THIS DOCUMENT
AIM OF THIS DOCUMENT
DOCUMENT STRUCTURE
APPLICATION SCENARIO / REQUIREMENTS
USE CASE
BACKGROUND AND MOTIVATION
APPROACH
TOPICS OF INTEREST AND THEIR (TEXTUAL) REPRESENTATION IN EHRS
REQUIREMENTS
FUNCTIONAL REQUIREMENTS
NON-FUNCTIONAL REQUIREMENTS
ARCHITECTURE
OVERVIEW
INTERFACES
DATA MODELS
INPUT DATA
TERMINOLOGIES
I2B2 STAR SCHEMA
SEMCARE PATIENT RECORD SOLR DOCUMENT
SEMCARE DATA LOADING FLOW
MODULES & FUNCTIONAL VIEW
SEMCARE DATA IMPORTER
SOLR
AVERBIS TEXT ANALYTICS (AEP)
SEMCARE PORTAL WEB APPLICATION
I2B2 APPLICATIONS
THIRD PARTY TOOLS AND APPLICATIONS
SCALABILITY
USERS & ROLES
OPEN POINTS
DATA PRIVACY / TECHNICAL AND ORGANIZATIONAL SECURITY PROCEDURES
DATA PROCESSING
DATA TRANSFER AND DATA LOCATION
ROLE CONCEPT
AVAILABILITY CONTROL
DATA SEPARATION CONTROL

© Copyright 2014-2015 SEMCARE Consortium. This project has received funding from the European Union's
Seventh Programme for research, technological development and demonstration under grant agreement No 611388.

TABLE OF FIGURES
FIGURE 1: SYSTEMS INVOLVED IN THE SEMCARE ARCHITECTURE
FIGURE 2: ARCHITECTURE SKETCH
FIGURE 3: ARCHITECTURE LAYERING
FIGURE 4: AVERBIS SEARCH REST API
FIGURE 5: REFINEMENT PROCESS FOR CRITERIA
FIGURE 6: I2B2 STAR SCHEMA
FIGURE 7: I2B2 CUSTOM_META TABLE
FIGURE 8: CUSTOM METADATA IN I2B2 TERM NAVIGATOR
FIGURE 9: SOLR TO I2B2 MAPPING
FIGURE 10: MAPPING OF SOLR DOCUMENTS TO I2B2 DATABASE
FIGURE 11: SEMCARE DATA LOADING FLOW
FIGURE 12: TALEND OPEN STUDIO DATA IMPORTER
FIGURE 13: SEMCARE COMPONENTS

DOCUMENT INFORMATION
Project number: ICT-611388
Acronym: SEMCARE
Full title: Semantic Data Platform for Healthcare
EU Project officer: Saila Rinne
Deliverable: D3.1 Sketch of system architecture specification
Work package: WP3 Architecture and Requirements
Date of delivery: Contractual 31.03.2014
Version: V1.0 Final
Status: Final
Nature: Report
Dissemination level: Public
Authors (Partner): Philipp Daumke, Carla Haid, Luke Mertens (Averbis), Stefan Schulz (MUG)
Responsible author: Stefan Schulz, Partner: MUG, Phone: +43 699 150 96 270
DOCUMENT HISTORY
DESCRIPTION
Initial creation
Corrections, comments, additions
Corrections, additions
Corrections, additions
Corrections, additions
Internal formal review (A. Honrado, E. Chavarría)

DEFINITIONS
• Partners of the SEMCARE Consortium are referred to herein according to the following codes:
  AVERBIS - Averbis GmbH (Germany) – Coordinator
  EMC - Erasmus Universitair Medisch Centrum Rotterdam (Netherlands) – Beneficiary
  MUG - Medical University of Graz (Austria) – Beneficiary
  SGUL - Saint George's University of London (UK) – Beneficiary
  SYNAPSE - Synapse Research Management Partners S.L. (Spain) – Beneficiary
• Project: The sum of all activities carried out in the framework of the Grant Agreement.
• AEP: Averbis Extraction Platform; text analysis tool that extracts information units such as facts
and relations from unstructured text
• CUI: Concept unique identifier in the Unified Medical Language System (UMLS)
• EHR: Electronic health record; clinical data record of a patient
• ETL: Extract, transform, load; data warehousing process that is often used to integrate data
from multiple sources. A common ETL tool is Talend Open Studio.
• GUI: Graphical user interface of the application
• Graph DB: Database that uses graph structures with nodes, edges and properties to represent and
store data. Compared with a relational database, it can be faster and scale better for large,
highly connected data sets.
• HL7v2 format: Health Level Seven, version 2; widely used standard for the exchange of electronic
health data
• i2b2: Informatics for Integrating Biology and the Bedside; scalable informatics framework for
the use of existing clinical data in research
• REST: Representational State Transfer; architectural style for communication between two
components, here using JSON (JavaScript Object Notation) messages
• Solr: Open source search platform built on Apache Lucene, with the Java client SolrJ
• Terminology: General term for information artefacts that provide controlled terms for a domain,
identifiers of meaning and semantic relations, e.g. SNOMED, ICD-10, MeSH
• Term Browser: Tool provided by Averbis to load, view, modify and export terminologies. It can
also be used to create new terminologies.
• UIMA: Unstructured Information Management Architecture; framework by Apache enabling the
generation of analysis pipelines for arbitrary content such as text, image or video data

EXECUTIVE SUMMARY 
The initial task in work package 3 is the agreement on a generic architecture for the semantic data 
platform SEMCARE. This document gives a first overview of the planned system architecture for the 
project. The considerations about the architecture are a fundamental step in the development of such 
a data platform. They therefore constitute an essential task right from the beginning of the project. 
The architectural design decisions are driven by several dimensions. First, the use cases covered by 
the project must be defined in order to evaluate the resulting requirements. As the SEMCARE 
software will be installed within the different partner hospitals, it must also be considered that the 
integration into the clinical IT landscape should be simple. Furthermore, aspects about data 
governance, privacy and security should be kept in mind when developing the system architecture. 
Another important requirement is the scalability of the system to allow processing of large data sets. 
Finally, the SEMCARE architecture should be constructed in a way that enables seamless
integration of other platforms and applications, which is also called an 'Open Architecture'. This can
be achieved by using standard components.
The main goal of the architecture is to provide a framework to extract meaningful information out of a 
broad range of structured and unstructured information from the Electronic Health Record. To this 
end, several systems and resources have to be integrated within a common framework. Some of 
these components are brought in and adapted by the partners, such as an extraction platform and a 
terminology browser. Others are available as free software such as indexing tools and semantic 
repositories. Domain terminology resources constitute another cornerstone of this framework. 
Whereas the coverage of existing terminologies is already very good for English, the two other
languages addressed by SEMCARE, viz. Dutch and German, are less well served. Filling these
terminology gaps will require a combination of automated and manual term acquisition.

KEY WORDS (Wordle style, http://www.wordle.net/)
[Figure: word cloud of key words]
1. Introduction
1.1. About SEMCARE
1.1.1. Motivation and Background
The exploitation of medical data from clinical trials, and thus the monitoring and improvement of
healthcare delivery, is of increasing interest. However, up to 80% of clinical trials fail to meet their
patient enrolment quotas on time. This recruitment delay currently costs the pharmaceutical industry
up to $8 million per day in lost revenue. The SEMCARE project will provide a more efficient
way of recruiting patients, which will help to prevent such delays.
SEMCARE also addresses another challenge in health care: the identification of rare diseases.
Doctors often find such diseases hard to diagnose because they are little known, which leaves a
number of patients undiagnosed or wrongly diagnosed. SEMCARE will use available clinical patient
data to combine signs and symptoms, thereby detecting undiagnosed patients suffering from rare
diseases. This will help to speed up research on this group of diseases. For pharmaceutical
companies, every newly diagnosed patient is of great interest, as each generates up to $300,000 in
drug revenue per year.
1.1.2. Project description
The two-year research project SEMCARE 'Semantic Data Platform for Healthcare' is funded by the
European Commission's Seventh Framework Programme. The aim of the project is the development
of a software platform that facilitates the diagnosis of rare diseases in various health care contexts
and supports the selection of appropriate patients for clinical studies, based on the automated,
contextual evaluation of existing patient data. SEMCARE will combine current text-mining
technologies with multi-lingual terminologies in order to develop solutions for typical problems that
arise when interpreting medical narratives, e.g. ambiguities, abbreviations, spelling variations or
typos. Testing and optimization of the software for analysing routine medical data in clinics will be
performed in leading European health centres in Great Britain, the Netherlands and Austria.
1.2. About this document
1.2.1. Aim of this document
A fundamental task in work package 3 is the agreement on a generic architecture for the semantic
data platform SEMCARE. This document gives a first overview of the planned system architecture for
the project. The considerations about the architecture are a crucial step in the development of such a
data platform, which makes them essential right from the beginning of the project. To design the
architecture, it must be defined which use cases the project covers and which requirements result
from them. This deliverable contains only the basic requirements. More specific, user-defined
requirements related to the prototype will be provided in D3.2.
1.2.2. Document structure
This document is structured into four main parts. Following an introduction to the SEMCARE
project and the document, the use case on which the project will focus is described. Defining
the use case is necessary to identify the requirements and demands placed on the platform and the
underlying architecture. As a third step, we show the generic design of the SEMCARE architecture
and describe the different modules and how they interact. Finally, the document includes
information about the technical and organizational procedures performed at the hospitals with
regard to data privacy and security in the context of the SEMCARE systems.
2. Application Scenario / Requirements 
The three participating European health centres have agreed on one first general use case on which 
they will focus during the project. The use case is called 'Risk Stratification and Differential 
Diagnosis of Patients suffering from transient loss of consciousness'. This use case is 
described in detail in the following subsections. 
2.1. Use case
2.1.1. Background and motivation
Cardiovascular disease is the cause of 47% of all deaths in Europe, the majority of which are related
to underlying coronary artery disease [2]. Sudden cardiac death accounts for approximately half of
coronary artery disease related deaths [3] and also occurs in those with non-coronary artery disease
related cardiovascular diseases such as cardiomyopathies and inherited channelopathies. Sudden
death is also more prevalent in patients with epilepsy and is often unexplained, in which case it is
known as SUDEP (Sudden Unexplained Death in Epilepsy). We are currently unable to determine who
is at most risk of SUDEP.
The symptom of transient loss of consciousness (T-LOC) occurs in up to 50% of the general
population and leads to 1% of all hospital admissions [4,5,6]. A wide range of conditions can lead to
T-LOC. Causes of T-LOC can be broadly categorized as cardiac (such as arrhythmia, in which case it
is known as syncope) or non-cardiac (such as epilepsy). Cardiac syncope carries a much more sinister
prognosis as it is associated with sudden cardiac death. Fortunately, effective treatments, such as
anti-ischemic and heart failure medication and implantation of an implantable cardioverter-defibrillator
(ICD), can dramatically improve outcomes. Unfortunately, the clinically assigned aetiology and
prognosis of T-LOC is frequently incorrect [4], predominantly due to an inability to differentiate between
cardiac syncope and epilepsy and a lack of appreciation of high-risk markers such as exertional
T-LOC, T-LOC with palpitation and function and/or pre-existent coronary and/or structural heart disease.

[2] European Cardiovascular Disease Statistics, 2012 edition. European Heart Network and European Society of Cardiology.
[3] Myerburg RJ, Junttila MJ. Sudden cardiac death caused by coronary heart disease. Circulation. 2012 Feb 28;125(8):1043-52.
[4] Fitzpatrick AP, Cooper P. Diagnosis and management of patients with blackouts. Heart. 2006 Apr;92(4):559-68.
[5] Petkar S, Jackson M, Fitzpatrick A. Management of blackouts and misdiagnosis of epilepsy and falls. Clin Med. 2005 Sep-Oct;5(5):514-20.
[6] Brignole M et al. A new management of syncope: prospective systematic guideline-based evaluation of patients referred urgently to general hospitals. Eur Heart J. 2006 Jan;27(1):76-82. Epub 2005 Nov 4.
2.1.2. Approach 
A number of phenotypic features can help risk stratify patients, most of which are available from 
routine assessment and investigations. Using a semantic data platform, we seek to identify high-risk 
patient cohorts based on patient-level criteria scattered in heterogeneous clinical data contained in 
electronic healthcare records (EHRs). 
Subjects belonging to a universal set of interest will have their electronic medical records processed 
for natural language expressions that denote often detailed descriptions about patients' clinical 
history, procedures or investigations planned or carried out. In our specific use case, the cases of 
interest are patients with prior myocardial infarction (MI), syncope of presumed cardiac origin or 
seizure disorder. The universal set of interest is defined as patients with Transient Loss of 
Consciousness and/or Sudden Cardiac Arrest and/or sustained Ventricular Arrhythmia and/or 
Cardiomyopathy and/or Ischemic Heart Disease and/or Seizure Disorder. 
However, the information extraction methodology we develop and describe is generic and could be
adapted to any patient cohort and medical inquiry.
2.1.3. Topics of interest and their (textual) representation in EHRs
The following lists medical topics of interest, such as procedures and investigations, as well as
information about the patients' medication and history, that will be used to identify subjects
belonging to the universal set of interest described in the use case above. The table below lists only
the most frequent topics of interest. Contents of electronic medical records will be processed for
typical phrases for topics and attributes. The values of the attributes are assumed to be numeric or
Boolean and are therefore not of terminological interest. This means that, for example, "normal ECG"
would be represented by the attribute "ECG normal" and the value "true", and "QRS interval 0.12 s"
would be represented by the attribute "QRS interval in seconds" and the numeric value "0.12".
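The attribute/value representation described above can be illustrated with a minimal sketch. It is not the AEP implementation: the regular expressions and the function name `extract_attributes` are assumptions made for illustration, and only the two example phrases from the text are covered.

```python
import re
from typing import Dict, Union

Value = Union[bool, float]

# Hypothetical patterns for the two examples given in the text.
QRS_PATTERN = re.compile(r"QRS interval\s+(\d+(?:\.\d+)?)\s*s\b", re.IGNORECASE)
NORMAL_ECG = re.compile(r"\bnormal (?:ECG|electrocardiogram)\b", re.IGNORECASE)
ABNORMAL_ECG = re.compile(r"\babnormal (?:ECG|electrocardiogram)\b", re.IGNORECASE)


def extract_attributes(text: str) -> Dict[str, Value]:
    """Map indicative phrases to attributes with Boolean or numeric values."""
    attributes: Dict[str, Value] = {}
    # Check "abnormal" first so that it is never read as "normal".
    if ABNORMAL_ECG.search(text):
        attributes["ECG normal"] = False
    elif NORMAL_ECG.search(text):
        attributes["ECG normal"] = True
    match = QRS_PATTERN.search(text)
    if match:
        attributes["QRS interval in seconds"] = float(match.group(1))
    return attributes
```

For instance, `extract_attributes("normal ECG")` yields the Boolean attribute "ECG normal", while a phrase such as "QRS interval 0.12 s" yields the numeric attribute "QRS interval in seconds".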
For each topic of interest, some examples of indicative phrases and related attributes are listed in the
table below.
Table: Topics of interest, indicative phrases for topics and indicative phrases for related attributes
(selected examples in English, Dutch and German)

Topic of interest: Electrocardiogram
Indicative phrases for topics: "electrocardiogram", "elektrocardiogram", "Elektrokardiogram"
Indicative phrases for related attributes: "normal", "normaal", "abnormal ECG", "abnormal electrocardiogram", "PR Interval Duration", "atrioventricular", "atrioventriculaire", "AV", "QRS Interval Duration", "T wave inversion", "T wave abnormality", "ST segment depression", "ST segment elevation", "Bundle Branch Block", "RBBB", "LBBB", "pathological Q Waves", "Atrial fibrillation"

Topic of interest: Exercise Tolerance Test
Indicative phrases for topics: "exercise tolerance test", "ETT"
Indicative phrases for related attributes: "normal", "ischemic", "ST segment depression", "T wave inversion", "blood pressure response", "ventricular tachycardia", "VT", "ventricular ectopics", "VEs", "ectopics present", "ectopics absent", "couplets", "triplets", "salvos", "PVCs", "premature ventricular contractions"

Indicative phrases for topics: "holter monitoring", "holter", "24 hour tape", "48 hour tape", "event ..."
Indicative phrases for related attributes: "non sustained VT", "non sustained ventricular tachycardia", "nsVT", "ventricular tachycardia", "ventricular ectopics", "VEs", "ectopics present", "couplets", "triplets", "salvos", "PVCs", "premature ventricular contractions"

Indicative phrases for topics: "cardiac catherisation", "catherization", "cath", "angiogram", "coronary angiogram"
Indicative phrases for related attributes: "normal", "unobstructed", "normal coronaries", "normal coronary arteries", "normal coronary angiography", "smooth coronary arteries", "stenosis", "stenoses", "obstruction"

Indicative phrases for topics: "echocardiogram", "echocardiografie", "echo", "TTE", "heart scan", "Echokardiogramm"
Indicative phrases for related attributes: "normal heart", "no cardiomyopathy", "normal ...", "ejection fraction", "ventricular function", "ventricular dysfunction", "poor ventricular function", "impaired LV", "impaired left ventricular", "aortic stenosis", "mitral stenosis", "pulmonary hypertension"

Indicative phrases for topics: "CMR", "CMRI", "Cardiac MRI", "MRI - Cardiac", "Kardiales MRI", "Herz-MRI"
Indicative phrases for related attributes: "normal", "ejection fraction", "ventricular function", "late gadolinium enhancement", "Scar", "regional wall motion abnormality"

Indicative phrases for topics: "Blood Tests", "Bloods", "Biochemistry", "Full Blood Count", "FBC", "Troponin", "Toxicology", "Blutbild"
Indicative phrases for related attributes: "normal", "abnormal", "elevated", "raised", "low"

Indicative phrases for topics: "age", "DOB", "date of birth", "Geburtsdatum", "Alter", "geboortedatum", "alter"

Indicative phrases for topics: "medications", "meds", "drugs History", "is on", "Medikamente"
Indicative phrases for related attributes: "drug name", "substance name", "dose", "Furosemide", "Frusemide", "Metolazone", "Eplerenone", "Spironolactone", "Dosis"

Indicative phrases for topics: "family history of", "sudden cardiac arrest", "unexplained death", "brother died suddenly", "cousin died ..."
Indicative phrases for related attributes: "degree of relative", "first", "second", "mother", "father", "brother", "sister", "aunt", "uncle", "son", "daughter", "Vater", "Mutter", "Onkel", "vader", "broer", "zuster", "tante", "oom", "zoon", "dochter"

Indicative phrases for topics: "VT", "VF", "ventricular tachycardia", "polymorphic VT", "ventricular fibrillation", "torsades", "resuscitated sudden death", "resuscitated SCD", "Arrest", "Cardiac arrest", "VF arrest", "Plötzlicher Herztod", "Sekundentod"
Indicative phrases for related attributes (in context of ventricular fibrillation): "idiopathic", "no cause", "no aetiology", "idiopathisch", "ohne erkennbare Ursache"

Indicative phrases for topics: "syncope", "near syncope", "pre-syncope", "presyncope", "blackout", "black-out", "collapse", "faint", "loss of consciousness", "LOC", "TLOC", "T-LOC", "pass out", "passing out", "passed out"
Indicative phrases for related attributes: "on exertion", "exertional", "on exercise", "exercise related", "exercise induced", "stress related", "catecholamine related", "emotion induced", "while running", "whilst running", "mid-stride", "in Verbindung mit Stress", "prolonged standing", "prodromal symptoms", "coughing", "micturition", "passing water", "urinating", "swallowing"

Indicative phrases for topics: "heart failure", "HF", "CCF", "cardiomyopathy", "breathlessness", "NYHA II", "NYHA III", "NYHA IV", "Herzversagen", "Herzinsuffizienz"
Indicative phrases for related attributes: "Severe", "Gross", "Moderate", "Mild", "NYHA Class I", "NYHA Class II", "NYHA Class III", "NYHA Class IV", "NYHA I"

Indicative phrases for topics: "Myocardial infarction", "STEMI", "nonSTEMI", "non-STEMI", "NSTEMI", "acute coronary syndrome", "ACS", "ischaemic heart disease", "IHD", "CAD", "Angina", "Previous stents"
Indicative phrases for related attributes: "Unstable Angina", "Troponin rise", "Previous stents", "PCI", "angioplasty", "CABG"

Indicative phrases for topics: "seizure disorder", "epilepsy", "fitting", "fits", "limb jerking", "status epilepticus", "Krämpfe", "Epilepsie", "Anfall"
Indicative phrases for related attributes: "Type", "Petit Mal", "Status Epilepticus"
2.2. Requirements 
This section describes the basic requirements arising from the use case described above. A more
detailed description of the user-specific requirements will be provided after the first prototype has
been developed. This description will be part of D3.2.
2.2.1. Functional requirements
In order to identify candidates matching the aforementioned criteria, arbitrary types of free-text
documents in patient records have to be gathered, pre-processed and analysed. Hence, in a first
stage, interfaces to existing clinical IT systems have to be established to consolidate the data from
each relevant resource. This stage generally also includes a data transformation process that maps,
for instance, HL7-encoded data to the target schema of a central knowledge store. ETL (extraction,
transformation, loading) tools such as Talend Open Studio [7] or Pipeline Pilot [8] are well suited to
these kinds of tasks.
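The transformation step mentioned above can be sketched as follows. This is an illustrative sketch only, not the SEMCARE importer: it assumes an HL7v2 PID segment with the default separators (`|` for fields, `^` for components), and the target field names (`patient_id`, `family_name`, and so on) are hypothetical.

```python
from typing import Dict, List


def map_pid_segment(segment: str) -> Dict[str, str]:
    """Map one HL7v2 PID segment to a flat record of a hypothetical target schema."""
    fields: List[str] = segment.split("|")
    if fields[0] != "PID":
        raise ValueError("expected a PID segment")

    def field(number: int) -> str:
        # Trailing empty fields may be omitted in HL7v2 messages.
        return fields[number] if len(fields) > number else ""

    name_parts = field(5).split("^")  # PID-5: patient name (family^given)
    return {
        "patient_id": field(3).split("^")[0],  # PID-3: patient identifier list
        "family_name": name_parts[0],
        "given_name": name_parts[1] if len(name_parts) > 1 else "",
        "birth_date": field(7),  # PID-7: date of birth
        "sex": field(8),  # PID-8: administrative sex
    }
```

In practice an ETL tool would perform this mapping through configured components rather than hand-written code; the sketch only shows the kind of field-to-schema mapping involved.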
Furthermore, identifying use-case-specific criteria (e.g. 'loss of consciousness whilst running')
within clinical narratives requires an information extraction system that can cope with a
variety of isosemantic lexical and syntactic variants found in the texts. Consequently, for each criterion
and attribute of interest, numerous synonymous expressions have to be considered in order to
achieve a high recall of relevant candidates. To handle this complexity we will use a Solr
search engine combined with several domain terminologies such as SNOMED CT, ICD-10 or MeSH. One
main focus of the SEMCARE platform is end-user support in the criteria refinement process. This
is not trivial, as it will require a dialogue with the users in order to acquire custom expressions that
enhance the terminological coverage. Details on this refinement process are described in
section 3 below.
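To illustrate how synonymous expressions raise recall, the following sketch expands a criterion into a Boolean query string in the syntax of Solr's standard query parser. The synonym table, the field name `text` and the function names are assumptions for illustration; in SEMCARE the variants would come from the domain terminologies and the refinement dialogue rather than from a hard-coded dictionary.

```python
from typing import Dict, List

# Hypothetical synonym table, using phrases from the use case.
SYNONYMS: Dict[str, List[str]] = {
    "loss of consciousness": ["syncope", "blackout", "T-LOC"],
    "whilst running": ["while running", "on exertion"],
}


def quote(phrase: str) -> str:
    """Wrap a phrase in double quotes, escaping backslashes and quotes."""
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def expand_criterion(parts: List[str], field: str = "text") -> str:
    """Build a query in which every part may match any of its known variants."""
    clauses = []
    for part in parts:
        variants = [part] + SYNONYMS.get(part, [])
        alternatives = " OR ".join(quote(variant) for variant in variants)
        clauses.append(f"{field}:({alternatives})")
    return " AND ".join(clauses)
```

Each part of the criterion becomes an OR group of phrase queries, and the groups are combined with AND, so a document matches if it contains any variant of every part.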
Another key aspect is the language of the document. Text processing tools have to take the
particular syntax and grammar into account, and the terminology used has to be specific to the
language. Furthermore, regional particularities such as punctuation have to be accounted for.
Examples are the decimal point in English, as opposed to the decimal comma in German and Dutch, or
different units of measurement used for the same laboratory observations.
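The decimal separator issue can be made concrete with a small sketch. The language codes, the function name `parse_measurement` and the restriction to simple numbers without thousands separators are assumptions for illustration; unit conversion is not covered.

```python
import re

# Decimal separator per document language (the three SEMCARE languages).
DECIMAL_SEPARATOR = {"en": ".", "de": ",", "nl": ","}

# One number pattern per separator.
NUMBER_PATTERNS = {
    ".": re.compile(r"\d+(?:\.\d+)?"),
    ",": re.compile(r"\d+(?:,\d+)?"),
}


def parse_measurement(text: str, language: str) -> float:
    """Return the first number in the text, read with the language's decimal separator."""
    separator = DECIMAL_SEPARATOR[language]
    match = NUMBER_PATTERNS[separator].search(text)
    if match is None:
        raise ValueError("no number found in text")
    return float(match.group(0).replace(separator, "."))
```

With this convention, "QRS interval 0.12 s" in an English document and "QRS-Dauer 0,12 s" in a German document yield the same numeric value.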
[7] http://talend.com/products/talend-open-studio
[8] http://accelrys.com/products/pipeline-pilot
2.2.2. Non-functional requirements
The non-functional requirements elaborate the performance characteristics of the SEMCARE system.
• The handling of the SEMCARE graphical user interface (GUI) should be easy and intuitive.
• Transparent ranking: The ranking of the results after submission of a user query should be
transparent and traceable. Users should be able to understand how they can refine their query in
order to get better results.
• Compatibility of GUI for browsers in use: The web-based GUI of the system must be compatible with
the browser versions used in the hospitals.
• Low response time: The response times while using the SEMCARE platform should be short in
order to provide a user-friendly service. The performance of the system depends on several
parameters such as:
  b) size of the index and main storage
  c) number of parallel requests
  d) strategy of authorization
• Each component of the SEMCARE architecture is platform independent as Java will be used for the
implementation.
• Security / privacy: It must be guaranteed that only authorized people can access the clinical data.
3. Architecture
3.1. Overview
3.1.1. Involved systems
Figure 1 shows an overview of the different systems involved in the SEMCARE architecture and how 
data is transferred from one system to another. 
Figure 1: Systems involved in the SEMCARE architecture 
Each of the systems is briefly described below. 
Production data system 
The production data system contains the hospital production data that may be structured or 
unstructured and is spread over different sources. 
Possible components of the system are: 
Multiple components that constitute a HIS (hospital information system) 
Staging data system
The staging data system is a copy of the hospital production data used for feeding the SEMCARE
staging system. The hospital data is copied because operating directly on the live data is usually
not allowed. By operating on a copy of the data, potential damage to the live system is avoided.
The staging data system has the same components as the production system:
HIS (hospital information system)
SEMCARE staging system 
The SEMCARE staging system reads the data from the hospital staging system. This is done via an 
ETL process that aggregates data from different data sources into one data store. A common tool for 
such an ETL process is Talend Open Studio. Once the data is loaded, patient data of interest is 
analysed, and the resulting data populates the SEMCARE staging databases as well as the Solr index.
The different components are:
• Relational database: SEMCARE data store where the aggregated clinical data is stored.
• Database importer process: An ETL process that loads data from the staging data system into the SEMCARE staging system.
• Solr server and index: Indexes documents and searches indexes.
• Graph database: Stores concept hierarchies and relations between documents and concepts. For now, this is an experimental extension to the system. It will be further evaluated whether it adds value to the SEMCARE platform.
• Averbis text analysis pipeline (AEP): Analyses text in order to extract structured data.
• SEMCARE portal for testing: Provides capability for configuring and testing the staging system.
SEMCARE production system 
The SEMCARE production system contains the structured data exported from the SEMCARE staging 
system. It is the system that is used by the end users to perform search queries and view reports. 
The system contains the following components:
• Relational database: SEMCARE data store where the aggregated clinical data is stored.
• Solr server and index: Indexes documents and searches indexes.
• Graph database: Stores concept hierarchies and relations between documents and concepts. For now, this is an experimental extension to the system. It will be further evaluated whether it adds value to the SEMCARE platform.
• Averbis text analysis pipeline (AEP): Analyses text in order to extract structured data.
• SEMCARE portal for end users: The portal for building queries and searching the system.
3.1.2. Architecture 
Figure 2 shows an overview of the complete architecture planned for the semantic analysis platform 
SEMCARE. The individual components have been described in section 3.1.1 above. 
Furthermore, the figure shows that it will be possible to apply third party tools on the data store of the 
SEMCARE production system in order to perform further analytics like visualisation or statistics. This 
will be enabled by providing a common data model (the i2b2 star schema) that can easily be used by 
third party applications (e.g. tranSMART, QlikView, Rapidminer). As a consequence, hospitals can 
install third party tools if they want to use them on the SEMCARE data. 
Figure 2: Architecture sketch 
Architecture Layering 
The SEMCARE system can be divided into three layers, which are described in the following 
paragraphs from bottom to top and shown graphically in Figure 3 below.
The bottom layer contains the data sources, which consist of different types of patient data arising in 
a hospital, for example unstructured data like discharge summaries or findings reports, and structured 
data like lab results or other routine data acquired and structured for health care, research and quality 
assessment. Coded data, which is mainly used for reimbursement, may also be available. The data is
scattered over different databases or stored in files, which can be of different format (e.g. Word, XML, 
Text, and PDF). Data may also be available as messages, generally in HL7v2 format as a universal 
health care messaging standard. 
The second layer is the semantic middleware. First, it contains tools for information extraction, ETL 
and text mining as an interface to the data sources. The loaded and analysed data is then stored in a 
unifying semantic database. This layer also includes terminologies and texts stored in a graph 
database and Solr index. The third part of the middleware is the communication between the 
SEMCARE data store and the topmost layer, which is the presentation layer. 
The presentation layer is the highest level and represents the interface to the user who could be a 
researcher, clinician or administrator. Possible components of the presentation layer are: 
the terminology editor 
a search interface including a query generator 
dashboards and analytics 
Figure 3: Architecture layering 
3.2. Interfaces
The following interfaces between components of the SEMCARE system have been identified:
• Staging data to data importer: Imports data from the hospital information system as 
documents or messages. Formats to be expected are XML, HL7, plain text, and possibly also JPEG
or other formats for scanned documents, as well as DICOM.
• Data importer to staging Solr: The SolrJ (Solr Java client) API is used for sending patient
record information to Solr to be analysed. 
• SEMCARE staging portal to staging Solr: The Averbis search REST API will be used for 
the communication between the two components. This API uses JSON messages to 
communicate with Solr. Example message definitions are shown in Figure 4: Averbis search REST API.
• SEMCARE production portal to production Solr: The Averbis search REST API will be 
used for the communication between the two components. Example message definitions are 
shown in Figure 4: Averbis search REST API. 
public class Request {
    private String query;
    private Integer rows;
    private Integer start;
    private String highlightQuery;
    private List<SortField> sortFields;
    private Boolean facetHighlighting;
    private Integer facetLimit;
    private String facetPrefix;
    private String facetSort;
    private List<Facet> facets;
    private List<Field> fields;
    private List<Param> params;
    private User user;
}

public class Result {
    private String query;
    private Integer start;
    private String highlightQuery;
    private Integer numFound;
    private List<Facet> facets;
    private List<Document> documents;
    private String didYouMean;
}
Figure 4: Averbis search REST API 
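As a rough illustration of the JSON exchange described above, the following Java sketch serialises three fields of the Request message (query, rows, start) and wraps them in an HTTP POST. The endpoint URL, the class name SearchRequestSketch and the hand-written JSON serialisation are assumptions made for this example only; the actual Averbis search REST API may look different. The request is built but not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class SearchRequestSketch {

    // Hypothetical endpoint; the real URL depends on the local installation.
    static final String SEARCH_URL = "http://localhost:8080/semcare/search";

    /** Serialises the query, rows and start fields of the Request message as JSON. */
    static String toJson(String query, int rows, int start) {
        return "{\"query\":\"" + escape(query) + "\",\"rows\":" + rows
                + ",\"start\":" + start + "}";
    }

    /** Escapes backslashes and double quotes only; enough for simple query strings. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Builds (but does not send) an HTTP POST carrying the JSON message. */
    static HttpRequest buildRequest(String json) {
        return HttpRequest.newBuilder()
                .uri(URI.create(SEARCH_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        String json = toJson("heart failure", 10, 0);
        System.out.println(json);
        System.out.println(buildRequest(json).method());
    }
}
```

In a real deployment a JSON library would normally replace the manual string building; it is used here only to keep the sketch free of external dependencies.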
3.3. Data Models 
SEMCARE employs a number of different data formats and systems. These include unstructured 
input data, relational databases, terminologies, and Solr indexes. 
3.3.1. Input Data
The input data for the SEMCARE project may vary with regards to the data source and the data 
For each treatment episode, several sources are of interest: 
• documents, either original ones (e.g. findings reports) or aggregated ones (discharge letters)
• messages, e.g. HL7v2 messages
• raw data, e.g. images, measurement data (e.g. ECG)
• database entries
The input data may exhibit different degrees of structure, such as 
• unstructured, e.g. free text, images
• semi-structured, e.g. free text with standardized organizing patterns (e.g. headings)
• structured, e.g. tables of lab values
• coded, e.g. LOINC-coded lab values, ICD-10 coded diseases
The SEMCARE system will import these different formats from the various data sources with an ETL process.
3.3.2. Terminologies 
Medical terminologies provide meaning identifiers (codes) for terms or groups of synonymous terms, 
the latter generally referred to as concepts. In SEMCARE, terminologies will enrich the search 
process by knowledge about the meaning of domain terms, their groupings into concepts, and certain 
relations between concepts such as broader / narrower. In addition, SEMCARE will enable users to 
add new concepts and terms to the existing terminology, where needed, e.g. when they miss an 
important synonym. 
As some terminologies support several languages they will also allow for multilingual text analysis by 
grouping terms from different languages into the same concept. The continual process for refining 
terminologies is described in this section. 
Figure 5: Refinement process for criteria 
As shown in Figure 5 above, the SEMCARE platform will provide a term browser and a dictionary 
creator for users to view and edit their terminology. The term browser will be able to import standard 
terminologies such as SNOMED CT, ICD-10 or MeSH and store them in a relational database (RDB). 
The users can then build their own terminology by enhancing, merging, or modifying existing 
terminologies. The most important medical terminologies are contained in the UMLS metathesaurus, 
which is a rich source of synonyms in different languages that also groups concepts into top-level 
categories via the UMLS Semantic Network. We will make use of all of this by enhancing the user 
interface of the term browser, so that also non-English terms can be used to search for concepts. 
In all stages of the terminology creation process, the terminology can be exported to the AEP analysis 
pipeline. The terminology can then be used to index and search documents via the SEMCARE search 
When the users build their search query, they may find that their terminology needs to be modified in 
order to produce better search results. They can then go back to the term browser to make changes 
to the terminology. This refinement process is crucial for optimizing the SEMCARE platform. Users 
should be able to quickly see how terminology changes affect search facets and results. 
Whereas there is a certain preference for SNOMED CT, ICD-10, and MeSH, a final decision of which 
terminologies to use for annotation will have to be made at the start of the work in WP2. Another 
decision to be made is how the known vocabulary gap for Dutch and German will be filled. One 
possible strategy is the use of machine translation, together with human review of the terms 
generated by this method. Manual additions to the terminologies, mainly driven by the use case, will 
be the option of choice wherever queries have to be fine-tuned. 
3.3.3. I2B2 Star Schema
In order to use a standard schema for the data storage and to ensure that we provide a common data 
model that is also widely used by third party providers (e.g. tranSMART), the i2b2 star schema will be 
used in SEMCARE to store the data. 
In the i2b2 star schema, observations or, more precisely, factoid (fact-like) statements, are stored in the
observation_fact table and linked to four so-called "dimension" tables for patient, visit, provider and 
concept details. These dimension tables contain descriptive information about factoid statements. 
Figure 6 below shows an overview of the i2b2 star schema. 
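The relationship between a fact and its dimensions can be sketched in a few lines of Java. The example below is an in-memory illustration only: the key names patient_num, encounter_num, provider_id and concept_cd follow the public i2b2 schema, whereas the class names, the map standing in for the concept dimension and the sample values are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

final class StarSchemaSketch {

    /** One row of observation_fact: keys into the four dimension tables plus a value. */
    static final class ObservationFact {
        final long patientNum;    // key into the patient dimension (patient_num)
        final long encounterNum;  // key into the visit dimension (encounter_num)
        final String providerId;  // key into the provider dimension (provider_id)
        final String conceptCd;   // key into the concept dimension (concept_cd)
        final String value;       // the factoid statement itself

        ObservationFact(long patientNum, long encounterNum, String providerId,
                        String conceptCd, String value) {
            this.patientNum = patientNum;
            this.encounterNum = encounterNum;
            this.providerId = providerId;
            this.conceptCd = conceptCd;
            this.value = value;
        }
    }

    // Stand-in for the concept dimension: concept code -> descriptive name (sample data).
    static final Map<String, String> CONCEPT_DIMENSION = new HashMap<>();
    static {
        CONCEPT_DIMENSION.put("ICD10:I50", "Heart failure");
    }

    /** Resolves the descriptive concept name for a fact, as a join on concept_cd would. */
    static String describe(ObservationFact fact) {
        String name = CONCEPT_DIMENSION.getOrDefault(fact.conceptCd, "unknown concept");
        return "patient " + fact.patientNum + ", visit " + fact.encounterNum
                + ": " + name + " = " + fact.value;
    }

    public static void main(String[] args) {
        ObservationFact fact = new ObservationFact(1L, 100L, "P1", "ICD10:I50", "present");
        System.out.println(describe(fact));
    }
}
```

The fact row itself carries only keys and a value; all descriptive information is looked up in the dimension, which is the central idea of the star layout.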
Figure 6: i2b2 star schema 
I2b2 also uses metadata tables to define terminologies. SEMCARE terminologies can be stored in the 
i2b2 custom_meta table (Figure 7). This table stores hierarchical terminologies that are used to build 
queries in the i2b2 query and analysis tool. The c_fullname column is used to store the full path of 
each term with the '\' character delimiting the hierarchical levels. After the custom_meta table is filled
with SEMCARE terminologies via an import process, concept_dimensions can be created that link to 
the custom_meta terms. 
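How such a hierarchical path might be assembled and taken apart is sketched below in Java. The sketch assumes that the delimiter is the backslash, as is usual in i2b2 metadata tables, and that paths start and end with the delimiter; the class name, the helper methods and the sample terminology levels are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class FullNamePathSketch {

    // Assumed delimiter between hierarchy levels (a single backslash).
    static final String DELIMITER = "\\";

    /** Joins hierarchy levels into one path that starts and ends with the delimiter. */
    static String toFullName(List<String> levels) {
        StringBuilder sb = new StringBuilder(DELIMITER);
        for (String level : levels) {
            sb.append(level).append(DELIMITER);
        }
        return sb.toString();
    }

    /** Splits a well-formed path back into its levels, dropping the outer delimiters. */
    static List<String> toLevels(String fullName) {
        String inner = fullName.substring(1, fullName.length() - 1);
        return Arrays.asList(inner.split("\\\\")); // regex for one literal backslash
    }

    public static void main(String[] args) {
        String path = toFullName(Arrays.asList("SEMCARE", "Medication", "Aspirin"));
        System.out.println(path);
        System.out.println(toLevels(path));
    }
}
```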
[Figure 7 is a column listing of the custom_meta table. Only fragments survive in this text version: the column names c_visualattributes, c_facttablecolumn and c_columndatatype, and the data types character varying (with lengths 10, 50, 700, 900 and 2000) and timestamp without time zone.]
Figure 7: i2b2 custom_meta table 
An example of how a custom terminology may look in the i2b2 term navigator is shown in Figure 8. 
Figure 8: Custom metadata in i2b2 term navigator 
In addition to the standard i2b2 tables, a new table will be created to map i2b2 records to Solr 
documents. This table will contain the encounter_num key, the original unstructured record, the Solr 
document and ID, and a copy of the CAS (Common Analysis System) object from the text analysis. 
Figure 9 shows this additional SEMCARE record table and its relation to the existing i2b2 tables. 
Figure 9: Solr to i2b2 mapping 
SEMCARE Patient Record Solr Document 
Solr documents will be used to store patient record information for text search. Each Solr document 
will contain IDs that map the Solr document to corresponding records in the i2b2 database (see also 
Figure 10 below). With this linkage, only data required for search indexing will be stored in the Solr 
document, and additional information can be pulled from the i2b2 database if needed. Dynamic fields 
can be used in the Solr document to store multiple concepts. 
References to terminology codes or concepts are stored in Solr as CUIs (concept unique identifiers) to
enable multilingual searches. Preferred terms and synonyms will not be stored in Solr because all 
documents and queries will be processed by the AEP to replace synonyms and preferred terms with 
CUIs before sending the query to Solr. 
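The effect of this replacement step on a query can be illustrated with a small Java sketch. It is not the AEP: the dictionary, the CUI value c0001 and the restriction to single-word terms are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class CuiNormalizerSketch {

    // Hypothetical mini-dictionary: synonyms from several languages share one CUI.
    static final Map<String, String> TERM_TO_CUI = new HashMap<>();
    static {
        TERM_TO_CUI.put("fever", "c0001");   // English
        TERM_TO_CUI.put("fieber", "c0001");  // German
        TERM_TO_CUI.put("koorts", "c0001");  // Dutch
    }

    /** Replaces every known single-word term with its CUI; other tokens are kept as they are. */
    static String normalize(String query) {
        List<String> out = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            String cui = TERM_TO_CUI.get(token.toLowerCase(Locale.ROOT));
            out.add(cui != null ? cui : token);
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(normalize("Fieber"));
        System.out.println(normalize("patient with fever"));
    }
}
```

Because the German, Dutch and English terms all resolve to the same identifier, a query in one language can match documents written in another, provided the documents were normalised in the same way at indexing time.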
The Solr system will provide a faceted search, which means that the search results are organized 
according to a faceted classification system, thus allowing the user to explore a collection of 
information by applying multiple filters. Facets correspond to properties of the search result. 
Solr will store multiple dynamic fields for each concept: 
• a list of all the types of concepts used for faceting 
(Note that this field is the set of all concept types in the document and it has no linkage to the 
relational database. Only individual concepts are linked to the database.) 
• a value for searching
• an ID to link to the relational database
• a path for hierarchical faceting
For example, for medication with the CUI a1234 Solr would store the following fields: 
• concept_medications="a1234,b5678,c2313"
• concept_med_val_a1234=50
• concept_med_id_a1234=123456
• concept_med_path_a1234=/c1000/b1023/a1234
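The naming pattern of the example above can be generated programmatically. The following Java sketch builds the three per-concept fields from a concept type prefix and a CUI; the class and method names are hypothetical, and the plain map merely stands in for a Solr input document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SolrConceptFieldsSketch {

    /** Builds the per-concept dynamic fields: search value, database ID and hierarchy path. */
    static Map<String, Object> conceptFields(String type, String cui, Object value,
                                             long databaseId, String path) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("concept_" + type + "_val_" + cui, value);
        fields.put("concept_" + type + "_id_" + cui, databaseId);
        fields.put("concept_" + type + "_path_" + cui, path);
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        // Set of all concepts of this type in the document, used for faceting.
        doc.put("concept_medications", String.join(",", "a1234", "b5678", "c2313"));
        doc.putAll(conceptFields("med", "a1234", 50, 123456L, "/c1000/b1023/a1234"));
        doc.forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```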
Figure 10: Mapping of Solr documents to i2b2 database 
SEMCARE Data Loading Flow 
The data loading flow begins when the data importer ETL process loads unstructured data. The 
unstructured data is stored in the relational database and then sent to Solr to be analysed and 
indexed. The Solr process and the text analysis pipeline store data in a graph database, e.g. Neo4j (see footnote 9), and
build the Solr index. Finally, the structured data from the analysis is added to the relational database
to enhance the unstructured data. A diagram of the data import flow is shown in Figure 11.
Figure 11: SEMCARE data loading flow 
The data importer process could be created with an ETL tool such as Talend Open Studio. Figure 12 
below shows an example Talend job that reads a directory of plain text files and commits them to Solr 
and a PostgreSQL database. 
 9 http://www.neo4j.org/ 
Figure 12: Talend Open Studio data importer 
3.4. Modules & Functional View 
The SEMCARE system contains the following modules and components as shown in Figure 13 below 
and described in this section. 
Figure 13: SEMCARE components 
3.4.1. SEMCARE Data Importer
The SEMCARE data importer is the entry point for health care data in the SEMCARE system. It could 
be an ETL process defined by a tool such as Talend Open Studio, or a custom coded software 
process. When it receives data, the data importer will write the unstructured data to the database and 
then send the unstructured data to Solr for analysis. 
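For the custom coded variant, the two steps described above could look roughly like the following Java sketch. It reads plain text files from a directory, passes each one first to a database sink and then to a Solr sink. The RecordSink interface, the class name and the in-memory sinks used in the demo are assumptions for this example; a real importer would write to the relational database and to Solr instead.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class DataImporterSketch {

    /** Hypothetical target of the importer, e.g. the relational database or Solr. */
    interface RecordSink {
        void accept(String recordId, String text);
    }

    /** Imports every *.txt file in the directory and returns the number of records. */
    static int importDirectory(Path dir, RecordSink database, RecordSink solr) {
        int count = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path file : files) {
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                String id = file.getFileName().toString();
                database.accept(id, text); // store the unstructured record first
                solr.accept(id, text);     // then hand it over for analysis and indexing
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    /** Self-contained demo: imports two temporary text files into in-memory sinks. */
    static int demo() {
        try {
            Path dir = Files.createTempDirectory("semcare-import");
            Files.write(dir.resolve("report1.txt"), "Discharge summary".getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("report2.txt"), "Findings report".getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("scan.pdf"), new byte[0]); // skipped by the *.txt filter
            List<String> stored = new ArrayList<>();
            List<String> indexed = new ArrayList<>();
            return importDirectory(dir, (id, text) -> stored.add(id), (id, text) -> indexed.add(id));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Imported records: " + demo());
    }
}
```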
3.4.2. Solr 
Solr is an open source search platform from Apache Lucene. In the SEMCARE project it is used to 
index and search patient record data. Solr will use the Averbis text analytic tools to create structured 
data from unstructured text. After the text is analysed, Solr will write the structured data to the relational database.
Averbis Text Analytics (AEP) 
The Averbis Extraction Platform (AEP) is a text analysis tool that can be readily applied to
arbitrary information extraction scenarios. From unstructured text, it extracts the individual information units,
such as facts and relations, that are most relevant for a user. The AEP
consists of a number of modular text analysis components, so-called Analysis Engines (AEs), which are plugged
together in the Apache UIMA framework (see footnote 10) to build an overall solution for different use cases.
Depending on the requirements, rule-based methods, statistical methods or a combination of both are used to
reveal the semantics of the content.
Annotations between AEs are exchanged using an object named Common Analysis System (CAS). 
The CAS is UIMA's object-based data structure that allows memory based storage and exchange of 
annotations with respect to pre-defined type systems of hierarchically organized annotations. With the 
aid of this data structure it is possible to generate a common base to analyse unstructured text. 
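The idea of engines exchanging annotations through a shared object can be sketched as follows. This is a deliberately simplified stand-in written in plain Java, not the UIMA API: SimpleCas, Engine and the two example engines are hypothetical, and annotations are reduced to plain strings instead of typed annotations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class AnalysisPipelineSketch {

    /** Simplified stand-in for a CAS: the document text plus the annotations added so far. */
    static final class SimpleCas {
        final String text;
        final List<String> annotations = new ArrayList<>();

        SimpleCas(String text) {
            this.text = text;
        }
    }

    /** Simplified stand-in for an Analysis Engine. */
    interface Engine {
        void process(SimpleCas cas);
    }

    /** Runs the engines in order; each engine sees the annotations of its predecessors. */
    static SimpleCas run(String text, List<Engine> engines) {
        SimpleCas cas = new SimpleCas(text);
        for (Engine engine : engines) {
            engine.process(cas);
        }
        return cas;
    }

    /** Demo pipeline: a tokenizer followed by a rule-based engine that uses its output. */
    static List<String> demo() {
        Engine tokenizer = cas -> {
            for (String token : cas.text.toLowerCase(Locale.ROOT).split("\\s+")) {
                cas.annotations.add("Token:" + token);
            }
        };
        Engine symptomDetector = cas -> {
            if (cas.annotations.contains("Token:fever")) {
                cas.annotations.add("Symptom:fever");
            }
        };
        return run("Patient has fever", Arrays.asList(tokenizer, symptomDetector)).annotations;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The second engine never looks at the raw text; it works only on what the first engine left in the shared object, which mirrors how later engines in a pipeline build on earlier annotations.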
SEMCARE Portal Web Application 
The SEMCARE portal provides a graphical user interface, which allows users to build queries on the 
clinical data and to manage the system. The users will get immediate feedback from a search, which 
helps them to decide how to refine their query in order to get better results. The portal will also provide 
users with an interface for defining and refining terminologies. More specific requirements and details 
about the user interface will be provided in D3.2. 
 10 http://opennlp.apache.org 
3.4.5. I2b2 
I2b2 tools and components such as the i2b2 query and analysis tool can be installed in the system if 
needed. I2b2 runs on the JBoss application server. 
Third Party Tools and Applications 
Third party tools can also be installed in the system as required. These tools could possibly interface 
with the i2b2 database or the Solr server, but because of the varying requirements and functionality of 
third party applications, they are not shown in Figure 13 or described in detail here. 
3.4.7. Scalability 
All of the components in the SEMCARE system can be deployed across multiple machines to support 
the processing of large data sets if needed. Multiple data importer processes can be launched to
read input data. Solr Cloud can be used to distribute Solr indexes and search processing across 
multiple machines. The Averbis text analysis pipeline can also be deployed as a distributed system. 
By adding more machines and distributing SEMCARE components the SEMCARE system can scale 
to meet the processing requirements of large data sets. 
3.5. Users & Roles 
In the context of the SEMCARE project different types of users can be distinguished. Their roles are 
briefly described below: 
Production Database Administrator 
The production database administrator manages the copying of production patient data into the 
staging data system. He/she also manages the subsequent export to the SEMCARE staging system via
an ETL process. The production database administrator is located at the hospital site.
SEMCARE Administrator 
The SEMCARE administrator is responsible for managing the SEMCARE databases, the Solr 
configuration and the SEMCARE portal. He/she configures the terminology and the text analysis.
The administrator manages the copying of data from SEMCARE staging to the SEMCARE
production environment and creates custom dashboards, scripts and third party integrations. 
SEMCARE User 
Typically, SEMCARE users will be researchers and clinicians who use the SEMCARE portal for 
search and analytics. 
The role concept will be further verified during the project and refined if needed. Furthermore, it must 
be guaranteed that all roles have the access rights to the data to be analysed at the level of the 
hospital information system. 
3.6. Open points 
A few points that are still open and need further clarification within the course of the project are listed 
below. More specific details about these points will be given in deliverable D3.2. 
• One challenge for the SEMCARE platform is the search for constellations of symptoms that are
spread over several documents. A strategy will be developed in order to cover this 
• As the SEMCARE system will be installed within the hospital, a further analysis of the IT 
landscape within the hospitals will have to be performed. The interfaces need to be defined 
and the interchange formats to be specified. 
• Another point to think about is a possible weighting of criteria for a specific use case. For 
example, it should be possible to define mandatory and optional criteria when creating a 
4. Data Privacy / Technical and organizational security procedures
4.1. Data processing 
The data processing within the scope of the SEMCARE project takes place entirely within each 
participating hospital. The project integrates into the existing IT landscape of the hospital with regards 
to admission (physical access), computer access, and data access control to the used IT components 
(servers and network components). This also affects the security of particularly sensitive health care 
data arising in a hospital. 
The architectural design of the SEMCARE platform permits data processing and storage on separate 
hard drives if needed because of the involvement of different departments and appropriate user 
4.2. Data transfer and data location 
In the scope of the SEMCARE project patient data will not leave the hospitals at any time. Patient 
data may, however, be shared between different departments of each hospital. In these cases, 
already installed (pseudo-) anonymisation processes will be applied. The de-identification procedures 
for each of the three participating hospitals are explained in detail in deliverable D1.1. 
Regarding test data, SGUL will prepare anonymised data to be used by Averbis GmbH for the 
development of algorithms, interfaces and the final product. The legal basis for the transfer of such 
test data is section 251 of the NHS Act 2006. Transferred test data will be encrypted either at rest or
in transit. The hospitals EMC and MUG will not provide any data to Averbis or to any other clinical
Both data processing and the operation of the data platform will be performed within a dedicated
server infrastructure in the hospital. It will be ensured that no project-related data is stored in
locations to which unauthorised persons have access.
Furthermore, additional encryption of data, e.g. the data stored in the Solr index, is possible by
using TrueCrypt (see footnote 11).
 11 http://www.truecrypt.org/ 
4.3. Role concept 
A role concept will be applied that assures that only authorised users can access the data related to 
the SEMCARE project. 
Data Upload, Query generation 
SEMCARE Administrator 
Create, Edit, Delete Users 
SEMCARE Administrator 
System maintenance 
Local system administrator 
A connection to the local LDAP (Lightweight Directory Access Protocol) can be implemented in order 
to take over existing access rights. 
Activities will be logged so that it can be examined whether personal data has been
entered, changed or deleted, and by whom. Only designated personnel will have access to
the system components and applications of the SEMCARE platform.
4.4. Availability control 
Actions will be considered in order to protect personal data against accidental destruction or loss. For 
example, the SEMCARE systems will not directly work on the hospital live data but on a copy (staging 
system) to ensure that no real patient data is affected in any way. 
High availability of the SEMCARE platform is not a priority as the application is not crucial for patient
4.5. Data separation control 
It must be assured that data from different scenarios or different departments are separated from 
each other. The SEMCARE architecture allows this separation if needed, e.g. different Solr indexes 
The SEMCARE systems will only be run locally and queries will only be performed on relevant patient 
data. Other information that is not relevant for the defined use case will not be extracted from the 
hospital systems. A development system and a production system will be provided separately. 
Source: http://semcare.eu/wp-content/uploads/2015/01/SEMCARE_D3.1-Architecture_v1_FINAL1.pdf