Abstract
In recent years there has been an increase in the generation of health care documents such as clinical
trials, discharge summaries, and Electronic Health Records (EHRs). These documents contain a large
amount of actionable data, namely the events and activities that occur in
health care processes. This valuable information has widened the scope for research on biomedical
literature. However, most of the data reside in free text, which makes it difficult to extract
useful information. This thesis develops methods to automatically extract semantics from health
care documents in an effort to check conformance of the treatment processes with standard treatment
guidelines.
The discharge summary is one of the major sources of information about the treatment process. A
discharge summary records one or more of a patient’s encounters with health care
service providers and is stored electronically so that it can be shared across stakeholders in the health care system.
It summarizes a wide range of information, such as chief
complaint, physical examination, vital signs, lab test results, recommended medications, and discharge
status. Text analytics over these documents has various applications and would help caregivers to
provide better health care. The main theme of the thesis is to automatically extract medical semantics
from discharge summaries. We focus on semantics such as medical entities, attributes of medical entities,
and relationships between these entities. Further, we illustrate an application of these semantics to a
Treatment Process Conformance Checking use case.
Automatically identifying medical entities from biomedical literature is referred to as Biomedical
Named Entity Recognition (BMNER). BMNER is one of the important tasks in the field of biomedical
text mining. BMNER involves identifying both continuous and discontinuous named entities.
Discontinuous entities consist of two or more non-consecutive components. For example, in the
sentence “The aortic root is moderately dilated.”, the discontinuous entity “aortic root...dilated” is composed
of two components, “aortic root” and “dilated”. Most prior work on BMNER was based on
feature-dependent machine learning techniques and focused only on continuous named entities, not
discontinuous ones. The traditional BIO tagging schema cannot tag sentences with discontinuous
entities. In this thesis, we propose a novel, systematic BIODT tagging schema to identify both continuous
and discontinuous named entities. We explore deep learning models that require limited feature
engineering for tagging the entities. Our results show that the BIODT tagging schema performs better
than traditional BIO and other tagging schemas and overcomes the label sparsity problem in identifying both
continuous and discontinuous biomedical entities. We also show that our neural network model with
the BIODT tagging schema outperforms state-of-the-art methods based on feature-dependent machine
learning techniques on the CLEF 2013 and SemEval 2013 datasets.
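As a minimal illustration of why discontinuous entities are a problem for BIO tagging, the following Python sketch tags the two components of the example entity and then decodes the tags again. The helper names bio_tags and decode_bio are hypothetical and are not taken from the thesis. The sketch does not attempt to reproduce the BIODT schema, whose tag definitions are not given here.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Tag contiguous spans (start inclusive, end exclusive) with B/I; everything else is O."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def decode_bio(tokens: List[str], tags: List[str]) -> List[str]:
    """Recover entities from BIO tags; every B tag opens a new entity."""
    entities: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


tokens = ["The", "aortic", "root", "is", "moderately", "dilated", "."]
# The two components of the discontinuous entity "aortic root...dilated".
components = [(1, 3), (5, 6)]

tags = bio_tags(tokens, components)
decoded = decode_bio(tokens, tags)
print(tags)
print(decoded)
```

Decoding yields two separate entities, “aortic root” and “dilated”, so plain BIO tags carry no information that the two components belong to one entity.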
Medical entities alone do not provide enough information to understand the condition of the patient.
In a given context, the characterization of a medical entity depends on attributes such as temporal
information, severity, and progression of the disease. In this work, we consider ten attributes that allow
us to understand the main details regarding the condition of the patient. They are Negation Indicator,
Subject Class, Uncertainty Indicator, Course Class, Severity Class, Conditional Class, Generic Class,
Body Location, DocTime Class, and Temporal Expression. In this thesis, we present a methodology
with rule-based and machine learning approaches to identify each of these attributes. We evaluate our
methodology on the ShARe/CLEF eHealth Evaluation Lab 2014 Challenge dataset in terms of attribute-level and
system-level accuracy.
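To give a flavor of what a rule-based approach to one of these attributes could look like, the sketch below checks for a negation trigger shortly before an entity mention. It is only an illustration under stated assumptions: the trigger list, the window size, and the function name negation_indicator are hypothetical and do not reproduce the rules used in the thesis.

```python
import re

# Hypothetical trigger phrases and context window; illustrative assumptions only.
NEGATION_TRIGGERS = ("no", "denies", "without", "negative for")
WINDOW = 5  # number of tokens inspected before the entity mention


def negation_indicator(sentence: str, entity: str) -> bool:
    """Return True if a trigger occurs within WINDOW tokens before the entity."""
    lowered = sentence.lower()
    position = lowered.find(entity.lower())
    if position == -1:
        return False  # entity not mentioned in this sentence
    preceding = re.findall(r"[a-z]+", lowered[:position])[-WINDOW:]
    context = " ".join(preceding)
    return any(
        re.search(rf"\b{re.escape(trigger)}\b", context)
        for trigger in NEGATION_TRIGGERS
    )


print(negation_indicator("The patient denies chest pain.", "chest pain"))
print(negation_indicator("The patient reports chest pain.", "chest pain"))
```

A simple window rule like this ignores scope boundaries and triggers that follow the entity, which is one reason such rules are often paired with machine learning approaches.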
Mining relationships between treatments, tests, and medical problems is vital in the biomedical
domain. It supports applications such as decision support systems, safety surveillance, and
new treatment discovery. In this thesis, we propose a deep learning approach that uses both word-level
and sentence-level representations to extract the relationships between treatments and problems.
Because deep learning techniques demand a large amount of training data, we use a rule-based
system particularly for relationship classes with fewer samples. Our final relations are derived by
combining the results from the deep learning and rule-based models. Our system achieved promising
performance on the relationship classes of the i2b2 2010 relation extraction task.
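One simple way to combine the two kinds of output is sketched below in Python. The combination policy, the function name combine, and the contents of RARE_CLASSES are assumptions made for illustration; the text above does not specify how the results are actually merged or which classes are treated as rare.

```python
from typing import Optional

# Placeholder set of low-sample relation classes; an assumption for illustration.
RARE_CLASSES = {"TrWP", "TrIP"}


def combine(model_label: str, rule_label: Optional[str]) -> str:
    """Prefer the rule-based label when a rule fired for a rare class; otherwise keep the model label."""
    if rule_label is not None and rule_label in RARE_CLASSES:
        return rule_label
    return model_label


print(combine("TrAP", "TrWP"))  # a rule fired for a rare class
print(combine("TrAP", None))    # no rule fired, so the model label is kept
```

Under this policy the rule-based system only overrides the model for the classes where training data is scarce, and the model's prediction stands everywhere else.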
Finally, we employ the above pipeline of tasks such as Biomedical Named Entity Recognition, Medical
Entity Attribute extractio