Abstract
Social media provides users with new opportunities to create, access, and share information independent of location and time. Recent years have witnessed an exponential rise
in the number of individuals and organizations using social media to share publicly accessible
healthcare information. Medical social media is the subset of social media in which the interests of the users are specifically devoted to medicine and health-related issues. Medical social media
encompasses healthcare-related text on generic social media platforms such as WordPress, Twitter,
Facebook, Quora, Instagram, and YouTube, as well as on medical forums such as patientslikeme.com,
https://patient.info/forums, doctorslounge.com, and kevinmd.com/blogs. People use medical social media for a variety of reasons, such as seeking answers to specific questions,
giving expert advice about a particular drug or treatment, spreading awareness, sharing experiences,
reporting discoveries and findings, voicing opinions, and forming communities. Apart from individual
users, many medical organizations such as hospitals, clinics, and insurance companies also contribute
actively to medical social media.
Medical social media data plays a crucial role in several applications, such as studying the unintended
effects of a drug (pharmacovigilance), recruiting potential participants for clinical trials, promoting a drug,
and monitoring public health and healthcare delivery. However, extracting information from medical social
media is challenging for several reasons. First, medical social media posts are highly noisy: they
are plagued with misspellings, grammatical errors, non-standard abbreviations, and slang.
Consequently, current medical information extraction tools fail to extract concept mentions such as
"imence pain in ma leg", which are very frequent in medical social media. Second, several applications
rely on human-labeled examples for training supervised machine learning models; manually creating
such datasets is effort-intensive and expensive. Lastly, several applications require the persona associated with a medical social media post. Medical social media contributors belong to various personas, such
as patient, consultant, journalist, caretaker, researcher, and pharmacist. Identifying the medical persona
from the content of a medical social media post is a challenging task.
In this thesis, we propose solutions for three important problems related to information extraction
from medical social media. In our first contribution, we address the problem of medical persona classification, which refers to computationally identifying the medical persona associated with a particular
medical social media post. We formulate this as a supervised multi-class text classification task and
propose a neural model for it. To minimize the human labeling effort, we propose a distant-supervision-based approach that heuristically obtains labeled examples for training the
model.
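To illustrate the distant-supervision idea, weak labels can be produced by matching posts against persona-indicative patterns; the specific rules below are hypothetical stand-ins, not the heuristics used in the thesis:

```python
import re

# Hypothetical keyword rules; the actual heuristics in the thesis may differ.
PERSONA_RULES = {
    "patient":    [r"\bmy (doctor|gp) (said|told)\b", r"\bi was diagnosed\b"],
    "consultant": [r"\bin my practice\b", r"\bi prescribe\b"],
    "caretaker":  [r"\bmy (mother|father|son|daughter)\b"],
}

def weak_label(post):
    """Return a heuristic persona label for a post, or None if no rule fires."""
    text = post.lower()
    for persona, patterns in PERSONA_RULES.items():
        if any(re.search(p, text) for p in patterns):
            return persona
    return None

posts = [
    "I was diagnosed with type 2 diabetes last year.",
    "In my practice, metformin is usually the first choice.",
    "My mother has been on this medication for months.",
]
print([weak_label(p) for p in posts])  # heuristic labels for training
```

Posts matched this way form a noisy but cheap training set for the supervised persona classifier, trading some label accuracy for zero manual annotation cost.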
In our second contribution, we address the task of medical concept normalization, which aims to map
concept mentions such as "not able to sleep" to their corresponding medical concepts, such as Insomnia.
We propose neural models that are capable of mapping any concept mention to its corresponding
medical concept in standard medical vocabularies such as SNOMED CT. There are several challenges
associated with existing methods for normalizing medical concept mentions. First, creating training
data is effort-intensive, as it requires manually mapping medical concept mentions to entries in a target lexicon such as SNOMED CT. Second, existing models fail to map a mention to target concepts
that were not encountered during the training phase. Third, current models have to be retrained
from scratch whenever new concepts are added to the target lexicon, which is computationally expensive. We propose a neural model that overcomes these limitations. Our model scales to millions of
target concepts and trivially accommodates a growing target lexicon without incurring a significant
computational cost. While our approach reduces the need for human-labeled examples, it does not eliminate it entirely. To overcome this practical challenge, we propose a distant-supervision-based approach to train our model. We extract informal medical phrases and medical concepts from patient discussion forums using a classifier trained on synthetic data and an off-the-shelf medical entity linker,
respectively. We use pretrained sentence encoding models to find the k-nearest phrases corresponding
to each medical concept. The resulting mappings are used to train our model, which shows significant
performance improvements over previous methods while avoiding manual labeling.
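The phrase-to-concept matching step can be sketched as follows; the character n-gram encoder here is a toy stand-in for the pretrained sentence encoders used in practice, and the example concepts and phrases are illustrative:

```python
import math
from collections import Counter

def encode(text, n=3):
    # Stand-in encoder: character n-gram counts. The actual pipeline uses
    # pretrained sentence encoding models instead of this toy representation.
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def distant_pairs(concepts, phrases, k=2):
    """For each concept, pick its k nearest phrases as weak training pairs."""
    pairs = []
    for concept in concepts:
        cv = encode(concept)
        ranked = sorted(phrases, key=lambda p: cosine(encode(p), cv), reverse=True)
        pairs.extend((phrase, concept) for phrase in ranked[:k])
    return pairs

concepts = ["Insomnia", "Migraine"]
phrases = ["cant sleep at night", "no sleep again", "awful headache", "head hurts bad"]
print(distant_pairs(concepts, phrases, k=2))  # weakly labeled (phrase, concept) pairs
```

The resulting (phrase, concept) pairs serve as weak supervision: no human annotator maps mentions to the lexicon, and new concepts only require encoding their names, not retraining from scratch.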
In our third contribution, we focus on the problem of automatic simplification of medical text. Patients and caregivers are increasingly usi