Clinical Natural Language Processing using spaCy, scispacy, and medspacy
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
The goal of clinspacy is to perform biomedical named entity recognition, Unified Medical Language System (UMLS) concept mapping, and negation detection using the Python spaCy, scispacy, and medspacy packages.
You can install the CRAN version of clinspacy with:
install.packages('clinspacy')
You can install the GitHub version of clinspacy with:
remotes::install_github('ML4LHS/clinspacy', INSTALL_opts = '--no-multiarch')
library(clinspacy)
Note: the very first time you run clinspacy_init()
or clinspacy()
after installing the package, you may receive an error stating that spaCy
was unable to be imported because it was not found. Restarting your R session should resolve the issue.
Initiating clinspacy is optional. If you do not initiate the package using clinspacy_init()
, it will be automatically initiated without the UMLS linker. The UMLS linker takes up ~12 GB of RAM, so if you would like to use the linker, you can initiate clinspacy with the linker. The linker can still be added on later by reinitiating with the use_linker
argument set to TRUE
.
clinspacy_init() # This is optional! The default functionality is to initiatie clinspacy without the UMLS linker
The clinspacy()
function can take a single string, a character vector, or a data frame. It can output either a data frame or a file name.
clinspacy('This patient has diabetes and CKD stage 3 but no HTN.')
clinspacy('HISTORY: He presents with chest pain. PMH: HTN. MEDICATIONS: This patient with diabetes is taking omeprazole, aspirin, and lisinopril 10 mg but is not taking albuterol anymore as his asthma has resolved. ALLERGIES: penicillin.', verbose = FALSE)
clinspacy(c('This pt has CKD and HTN', 'Pt only has CKD but no HTN'),
verbose = FALSE)
data.frame(text = c('This pt has CKD and HTN', 'Diabetes is present'),
stringsAsFactors = FALSE) %>%
clinspacy(df_col = 'text', verbose = FALSE)
The output_file
can then be piped into bind_clinspacy()
or bind_clinspacy_embeddings()
. This saves a lot of time because you can try different strategies of subsetting in both of these functions without needing to re-process the original data.
if (!dir.exists(rappdirs::user_data_dir('clinspacy'))) {
dir.create(rappdirs::user_data_dir('clinspacy'), recursive = TRUE)
}
mtsamples = dataset_mtsamples()
mtsamples[1:5,]
clinspacy_output_file =
mtsamples[1:5, 1:2] %>%
clinspacy(df_col = 'description',
verbose = FALSE,
output_file = file.path(rappdirs::user_data_dir('clinspacy'),
'output.csv'),
overwrite = TRUE)
clinspacy_output_file
Negated concepts, as identified by the medspacy cycontext flag, are ignored by default and do not count towards the frequencies. However, you can now change the subsetting criteria.
Note that you now need to re-provide the original dataset to the bind_clinspacy()
function.
mtsamples[1:5, 1:2] %>%
clinspacy(df_col = 'description', verbose = FALSE) %>%
bind_clinspacy(mtsamples[1:5, 1:2])
clinspacy_output_data =
mtsamples[1:5, 1:2] %>%
clinspacy(df_col = 'description', verbose = FALSE)
clinspacy_output_data %>%
bind_clinspacy(mtsamples[1:5, 1:2])
clinspacy_output_data %>%
bind_clinspacy(mtsamples[1:5, 1:2],
cs_col = 'entity')
clinspacy_output_data %>%
bind_clinspacy(mtsamples[1:5, 1:2],
subset = 'is_uncertain == FALSE & is_negated == FALSE')
clinspacy_output_file
clinspacy_output_file %>%
bind_clinspacy(mtsamples[1:5, 1:2])
clinspacy_output_file %>%
bind_clinspacy(mtsamples[1:5, 1:2],
cs_col = 'entity')
clinspacy_output_file %>%
bind_clinspacy(mtsamples[1:5, 1:2],
subset = 'is_uncertain == FALSE & is_negated == FALSE')
With the UMLS linker disabled, 200-dimensional entity embeddings can be extracted from the scispacy Python package. For this to work, you must set return_scispacy_embeddings
to TRUE
when running clinspacy()
. It’s also a good idea to write the output directly to file because the embeddings can be quite large.
clinspacy_output_file =
mtsamples[1:5, 1:2] %>%
clinspacy(df_col = 'description',
return_scispacy_embeddings = TRUE,
verbose = FALSE,
output_file = file.path(rappdirs::user_data_dir('clinspacy'),
'output.csv'),
overwrite = TRUE)
clinspacy_output_file %>%
bind_clinspacy_embeddings(mtsamples[1:5, 1:2])
The UMLS linker can be turned on (and off) even if clinspacy_init()
has already been called. The first time you turn it on, it takes a while because the linker needs to be loaded into memory. On subsequent removal and addition, this occurs much more quickly because the linker is only removed/added to the pipeline and does not need to be reloaded into memory.
clinspacy_init(use_linker = TRUE)
By turning on the UMLS linker, you can restrict the results by semantic type. In general, restricting the result in clinspacy()
is not a good idea because you can always subset the results later within bind_clinspacy()
and bind_clinspacy_embeddings()
.
clinspacy('This patient has diabetes and CKD stage 3 but no HTN.')
clinspacy('This patient with diabetes is taking omeprazole, aspirin, and lisinopril 10 mg but is not taking albuterol anymore as his asthma has resolved.',
semantic_types = 'Pharmacologic Substance')
clinspacy('This patient with diabetes is taking omeprazole, aspirin, and lisinopril 10 mg but is not taking albuterol anymore as his asthma has resolved.',
semantic_types = 'Disease or Syndrome')
This function binds columns containing concept unique identifiers with which scispacy has 99% confidence of being present with values containing frequencies. Negated concepts, as identified by the medspacy cycontext is_negated flag, are ignored and do not count towards the frequencies. However, this behavior can be changed using the subset
argument.
Note that by turning on the UMLS linker, you can restrict the results by semantic type.
clinspacy_output_file =
mtsamples[1:5, 1:2] %>%
clinspacy(df_col = 'description',
return_scispacy_embeddings = TRUE, # only so we can retrieve these below
verbose = FALSE,
output_file = file.path(rappdirs::user_data_dir('clinspacy'),
'output.csv'),
overwrite = TRUE)
clinspacy_output_file %>%
bind_clinspacy(mtsamples[1:5, 1:2])
clinspacy_output_file %>%
bind_clinspacy(
mtsamples[1:5, 1:2],
subset = 'is_negated == FALSE & semantic_type == "Diagnostic Procedure"'
)
The default embeddings are from the scispacy R package. If you want to use the cui2vec embeddings (only available with the linker enabled), you ned to set the type
arguement to cui2vec
. Up to 500-dimensional embeddings can be returned.
Note that by turning on the UMLS linker, you can restrict the results by semantic type (with either type of embedding).
With the UMLS linker enabled, you can restrict by semantic type when obtaining scispacy embeddings.
Note: The mean embeddings may be slightly different than if the linker was disabled because entities may be captured twice (as entities may map to multiple concepts). Thus, if you do not need to restrict by semantic type, the recommended setting is to turn the UMLS linker off by re-running clinspacy_init(use_linker = FALSE)
(note that use_linker = FALSE
is the default in clinspacy_init()
).
clinspacy_output_file %>%
bind_clinspacy_embeddings(mtsamples[1:5, 1:2])
clinspacy_output_file %>%
bind_clinspacy_embeddings(
mtsamples[1:5, 1:2],
subset = 'is_negated == FALSE & semantic_type == "Diagnostic Procedure"'
)
These are only available with the UMLS linker enabled.
clinspacy_output_file %>%
bind_clinspacy_embeddings(mtsamples[1:5, 1:2],
type = 'cui2vec')
clinspacy_output_file %>%
bind_clinspacy_embeddings(
mtsamples[1:5, 1:2],
type = 'cui2vec',
subset = 'is_negated == FALSE & semantic_type == "Diagnostic Procedure"'
)
cui2vec_definitions = dataset_cui2vec_definitions()
head(cui2vec_definitions)