EchoExtractor

EchoExtractor extracts Concept-Value pairs for metrics measured during an echocardiogram study. The input is a text document to be processed. The output is a dataset in which each record represents a Concept-Value relationship.

If using this system, please cite:

Patterson OV, Freiberg MS, Skanderson M, et al. (2017) Unlocking echocardiogram measurements for heart disease research through natural language processing. BMC Cardiovasc Disord 17:151. doi: 10.1186/s12872-017-0580-8

Definitions

String
a sequence of characters found in text.
For example, "lvef", "e:e", "5 mg", "50-55%", "calculated lv ejection fraction was 50-55%", and "Attending: Dr. DoLittle" are strings.
Term
a string that has a specific meaning, which may or may not be identified.
For example, "lvef" is a string, but it is also a term because it represents a clinical variable.
Concept
a specific meaning.
For example, "ef", "lvef", "lv ej frac", "ejection fraction", and "edjection fractian" (a misspelling) are all terms that represent the same concept, "left ventricular ejection fraction".
Mapping
a link between a term and a concept.
For example, "lv ef" maps to "left ventricular ejection fraction".
Value
ValueString - a string that represents a numeric value of the target concept.
Unit
a string that represents the unit of measure for the numeric value of the target concept.
For example, in the phrase "calculated lv ejection fraction was 50-55%", ValueString is "50-55" with associated Unit "%" for Term "lv ejection fraction".
Assessment
a string that represents a qualitative assessment of the target concept
For example, in a phrase “trace mitral regurgitation” Assessment is “trace” for Term “mitral regurgitation”
Concept-Value pair
an association between a term, which is mapped to a concept, and a value (qualitative or quantitative) found in text. Creating a concept-value pair hinges on correctly identifying the strings that represent the term, numeric value, and unit, and on correctly mapping the term to the dictionary entry for the concept.
For example, in the sentence "The ejection fraction was visually estimated in a range of 50% to 55%.":
  • Term is "ejection fraction"
  • ValueString is "50% to 55%"
  • Unit for the value is "%"
  • Mapping is "left ventricular ejection fraction"
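
The definitions above can be pictured as a small record type plus a term dictionary. The following is only an illustrative Python sketch, not EchoExtractor code: the class name `ConceptValuePair`, the helper `map_term`, and the four-entry dictionary are assumptions made for this example, and the real dictionary shipped with the system is far larger.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical term-to-concept dictionary (a Mapping links a Term to a Concept).
TERM_TO_CONCEPT = {
    "ef": "left ventricular ejection fraction",
    "lvef": "left ventricular ejection fraction",
    "lv ef": "left ventricular ejection fraction",
    "ejection fraction": "left ventricular ejection fraction",
}


@dataclass
class ConceptValuePair:
    """One Concept-Value record, using the field names from the definitions."""
    term: str                    # string found in text, e.g. "ejection fraction"
    value_string: Optional[str]  # numeric value as written, e.g. "50% to 55%"
    unit: Optional[str]          # unit of measure, e.g. "%"
    assessment: Optional[str]    # qualitative value, e.g. "trace"
    mapping: Optional[str]       # concept the term maps to, if any


def map_term(term: str) -> Optional[str]:
    """Return the concept for a term, or None when the term is not in the dictionary."""
    return TERM_TO_CONCEPT.get(term.strip().lower())


# The example sentence from the definitions, expressed as a record.
pair = ConceptValuePair(
    term="ejection fraction",
    value_string="50% to 55%",
    unit="%",
    assessment=None,
    mapping=map_term("ejection fraction"),
)
print(pair)
```

A term with no dictionary entry, such as an unrecognized abbreviation, yields `None` from `map_term` in this sketch, which corresponds to a string whose meaning has not been identified.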

Installation and deployment

EchoExtractor is built on the Leo architecture, which extends UIMA AS. For more information on Leo, see http://department-of-veterans-affairs.github.io/Leo/
To use EchoExtractor:

Follow the instructions to install and configure UIMA AS (Steps 2.1-2.9).

  • Start UIMA AS Broker.
  • Configure EchoExtractor reader and listeners.
  • Five readers are available:
    • FileCollectionReaderConfig.groovy - Enter the path to the input directory to read simple text files. The files need to have the .txt extension.
    • BatchDatabaseCollectionReaderConfig.groovy - Enter the database engine, database name, and input query, and update the batch parameters. If you have only one batch, set the ending index to a value less than the batch size. If you are using this reader for batch reads, add a sequential numbering column called “RowNo” to your input table. The tags {min} and {max} are automatically replaced with the starting and ending RowNo for each batch until the ending RowNo reaches the final endingIndex.
    • SQLServerPagedDatabaseCollectionReaderConfig.groovy - Enter the database engine, database name, and input query. Make sure the input query ends with an “order by” clause. The query is automatically transformed for SQL Server so that each new batch is fetched with a row-number offset. This approach becomes very slow when the number of records exceeds 2.5M, because MS SQL Server queries slow down considerably at that point.
    • MySQLBatchDatabaseCollectionReaderConfig.groovy - Enter the database engine, database name, and input query.
    • KnowtatorCollectionReaderConfig.groovy -
  • Eight listeners are available:
    • SimpleCsvListenerConfig.groovy - Enter the path to the output directory. A new file will be created with a standard output.
    • SimpleXmiListenerConfig.groovy - Enter the path to the output directory. A new directory with xmi files will be created.
    • CsvListenerConfig.groovy - this is an example of a custom CSV listener
    • DatabaseListenerConfig.groovy - this is an example of a custom database listener.
    • AuCompareCsvListenerConfig.groovy -
    • AuCompareSummaryListenerConfig.groovy -
    • KnowtatorListenerConfig.groovy -
    • MySQLDatabaseListenerConfig.groovy - this is an example of a custom database listener using mysql.

Use the runService.sh or runService.bat script to start the service.
Modify the runClient.sh or runClient.bat script to use the selected readers and listeners, then start the client.
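
To make the {min}/{max} substitution described for BatchDatabaseCollectionReaderConfig.groovy more concrete, here is a rough Python sketch of how a query template could be expanded into one query per batch. It is not the Leo implementation: the function name `batch_queries`, the table and column names in the template, and the choice of inclusive bounds are all assumptions made for illustration.

```python
def batch_queries(query_template, starting_index, ending_index, batch_size):
    """Yield one query per batch, replacing {min} and {max} with RowNo bounds.

    Bounds are treated as inclusive here, which is an assumption of this sketch.
    """
    low = starting_index
    while low <= ending_index:
        high = min(low + batch_size - 1, ending_index)
        yield query_template.replace("{min}", str(low)).replace("{max}", str(high))
        low = high + 1


# Hypothetical input query; EchoNotes, DocID, and DocText are placeholder names.
template = (
    "SELECT DocID, DocText FROM EchoNotes "
    "WHERE RowNo >= {min} AND RowNo <= {max}"
)

for query in batch_queries(template, starting_index=1, ending_index=250, batch_size=100):
    print(query)
```

With 250 rows and a batch size of 100, the sketch produces three queries covering RowNo 1-100, 101-200, and 201-250.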

System Description

The pipeline has the following modules:

  1. ConceptAnnotator - regex to identify unusual combinations of characters that describe a concept [MVA(p1/2t), AVA(i,d), …]
  2. AnatomyAnnotator - regex to identify strings that represent heart anatomy [left, right, ventricle, atrium, systole, …]
  3. MeasurementAnnotator - regex to identify strings that represent measurements [size, diameter, velocity, mean, peak, …]
  4. QValueAnnotator - regex to identify strings that represent qualitative measurements [normal, mild, severe, dysfunction, …]
  5. NumericValueAnnotator - regex to identify strings that represent numeric values for the measurements [2, 2.5, about 2, > 55, …]
  6. UnitAnnotator - regex to identify strings that represent units of measure for the numeric value [mm, hgmm, m/sec, cm^2, … ]
  7. MiddleStuffAnnotator - regex to identify text commonly used to link concepts and their values [was, is calculated at, found to be, …]
  8. MethodAnnotator - regex to identify strings that represent method or mode of measurement acquisition [mmode, doppler, Simpson’s, …]
  9. HeaderAnnotator - regex to identify strings that most often represent subsection headers in Echo reports. Used to identify scope for anatomy.
  10. ExcludeValueAnnotator - regex to identify special cases when a numeric or qualitative value should be ignored [\d+\/\d+\/\d+ because it is a date…]
  11. ExcludeConceptAnnotator - regex to identify cases when a term should be disregarded for further processing because it does not represent a valid concept when mentioned by itself [time, velocity, date, …]
  12. ModifierAnnotator - regex to identify strings that provide more insight into the context of the term [visually estimated, biplane, …]
  13. AnnotationFilter1 - a custom annotator that removes the smallest annotation when it overlaps with another annotation of the same type or some other types.
  14. RangeAnnotator - an APA that combines multiple NumericValue annotations into one that represents a range.
  15. ConceptCollectorAnnotator - a custom annotator that combines sequences of Anatomy and Measurement annotations into one annotation.
  16. AnnotationFilter2 - a custom annotator that removes the smallest annotation when it overlaps with another annotation of the same type or some other type.
  17. ConceptMapping - a custom annotator that extends LookupAnnotator with additional logic for flexible concept mapping, including the most frequent mapping. Results in Mapping annotations.
  18. ConceptDisambiguation - a custom annotator that extends ConceptMapping and changes the flexible mapping logic to include all mappings for a term. Changes Mapping annotations.
  19. RelationshipPatternAnnotator - APA that combines all other annotations into relations. Results in RelationPattern annotations.
  20. AnnotationFilter3 - a custom annotator that removes the smallest annotation when it overlaps with other annotations of RelationPattern type.
  21. RelationAnnotator - a custom annotator that contains the main logic for determining Concept-Value pairs. Results in the following annotations:
    • Relation1 - for terms that were unambiguously mapped to a single target concept
    • Relation2 - for terms that were unambiguously mapped to a single concept that is not one of the target concepts
    • Relation3 - for terms that were mapped ambiguously (had more than one mapping)
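
As a rough illustration of how the early stages fit together, the sketch below runs a few regex annotators, drops the smaller of two overlapping annotations of the same type (the idea behind the AnnotationFilter steps), and merges adjacent numeric values into a range (the idea behind RangeAnnotator). It is written in Python for brevity and is not taken from the EchoExtractor source: the `Annotation` class, the three toy patterns, and the rule that a range is two numbers joined by "-" or "to" are all assumptions of this example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    type: str
    begin: int
    end: int
    text: str


# Toy patterns for illustration; the real annotators use far richer regex resources.
PATTERNS = [
    ("Concept", re.compile(r"lv ejection fraction", re.I)),
    ("Concept", re.compile(r"ejection fraction", re.I)),
    ("NumericValue", re.compile(r"\d+(?:\.\d+)?")),
    ("Unit", re.compile(r"%")),
]


def annotate(text):
    """Run every regex over the text and collect the raw annotations."""
    return [Annotation(kind, m.start(), m.end(), m.group())
            for kind, pattern in PATTERNS for m in pattern.finditer(text)]


def filter_overlaps(annotations):
    """Keep the longest annotation among overlapping annotations of the same type."""
    kept = []
    for ann in sorted(annotations, key=lambda a: a.end - a.begin, reverse=True):
        clash = any(k.type == ann.type and ann.begin < k.end and k.begin < ann.end
                    for k in kept)
        if not clash:
            kept.append(ann)
    return sorted(kept, key=lambda a: a.begin)


def combine_ranges(text, annotations):
    """Merge two NumericValue annotations joined by '-' or 'to' into one range."""
    nums = [a for a in annotations if a.type == "NumericValue"]
    others = [a for a in annotations if a.type != "NumericValue"]
    merged, i = [], 0
    while i < len(nums):
        cur = nums[i]
        nxt = nums[i + 1] if i + 1 < len(nums) else None
        if nxt and re.fullmatch(r"\s*(?:-|to)\s*", text[cur.end:nxt.begin]):
            merged.append(Annotation("NumericValue", cur.begin, nxt.end,
                                     text[cur.begin:nxt.end]))
            i += 2
        else:
            merged.append(cur)
            i += 1
    return sorted(others + merged, key=lambda a: a.begin)


text = "calculated lv ejection fraction was 50-55%"
for ann in combine_ranges(text, filter_overlaps(annotate(text))):
    print(ann.type, repr(ann.text))
```

In this example the shorter "ejection fraction" annotation is discarded in favor of "lv ejection fraction", and "50" and "55" become a single NumericValue covering "50-55", leaving one concept, one value, and one unit for a later relation step to combine.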

The output written to a database or CSV file has the following columns:

  1. DocID - varchar(25) - document id. TIUDocumentSID for TIU docs, EchoSID for Echo691 docs, concat([RadNucMedReportIEN],‘_’,Sta3n) for RadiologyNotes.
  2. PatientID - bigint - patient id. PatientSID for TIU docs and Echo691, ScrSSN for RadiologyNotes.
  3. ReferenceDate - date - ReferenceDate from TIU docs, Datetime from Echo691, ExamDateTime for RadiologyNotes
  4. InstanceID - int - sequential number of the instance in the document
  5. SpanStart - int - Relation span start
  6. SpanEnd - int - Relation span end
  7. Snippets - varchar(1000) - Relation covered text
  8. Term - varchar(500) - Term covered text
  9. ValueString - varchar(1000) - NumericValue covered text
  10. Value - float - first numeric value of NumericValue covered text. Represents a lower bound of the range if the NumericValue is a range.
  11. Value2 - float - second numeric value of NumericValue covered text. Represents an upper bound of the range if the NumericValue is a range.
  12. Unit - varchar(25) - Units covered text
  13. Assessment - varchar(1000) - QValue covered text
  14. ConceptType - pattern used to create the RelationPattern that was used to create the Relation. Used to filter unused patterns.
  15. Mapping - varchar(5000) - a pipe-delimited numbered list of mappings. Numbering starts at 0. If Mapping is not like ‘%1%’, the field has no second mapping, so there is just one mapping for that term.
  16. Modifier - varchar(1000) - additional text that provides more context for the term. Can potentially be used in post-processing.

Target concepts:

| Concept | Definition | Example |
|---|---|---|
| aortic valve mean gradient | The difference between the ventricular pressure and the recovered aortic pressure averaged across multiple measurements. [mmHg] | AV PG; AV mean grad; aortic mean gradient |
| aortic valve orifice area | Area of the aortic valve opening measured at systole. [mm^2] | AV area |
| aortic valve regurgitation | (aka aortic insufficiency) A condition in which the aortic valve does not close tightly. Measured qualitatively [trace, mild, severe, ...] or on a scale [0..4+] | ai 1+; ar; aortic insufficiency |
| aortic valve regurgitation peak velocity | Velocity of the regurgitant jet. [m/sec] | aortic pk vel |
| aortic valve stenosis | Valve disease in which the opening of the aortic valve is narrowed. [mild, severe, ...] | AS; AV stenosis |
| e/e prime ratio | The ratio of mitral peak velocity of early filling (E) to early diastolic mitral annular velocity (E') (E/E' ratio). Used to detect left ventricular diastolic dysfunction. Normal value is < 8. | e:e’; e to e prime; E/Ea ratio |
| inter-ventricular septum dimension at end diastole | Inter-ventricular septal wall thickness. [mm] | ivs ed; IVS(ED); IVSd |
| left atrium size at end systole | Diameter of the left atrium measured at end-systole, when the LA chamber is at its greatest dimension. Normal 28-40 mm. [mm] | LA dimension; dilated LA; left atrium; LA dilatation; LA chamber size; LA |
| left ventricular dimension at end diastole | The diameter across a ventricle at the end of diastole. [mm] | LVEDD; LVIDD; LVED; LVD ed; end diastolic lv diameter |
| left ventricular dimension at end systole | Similar to the end-diastolic dimension, but measured at the end of systole (after the ventricles have pumped out blood) rather than at the end of diastole. [mm] | LVESD; LV systole |
| left ventricular size | General description of the size of the left ventricle. [normal, dilated, enlarged] | LV size; dilated left ventricle |
| left ventricular ejection fraction | The percentage of blood pumped out of a heart chamber with each contraction [%, preserved, reduced]. Same as LV systolic function or dysfunction [normal, reduced]. Same as LV contractility [normal, low, reduced] | LVEF; EF; systolic dysfunction |
| left ventricular posterior wall thickness at end diastole | The thickness of the posterior left ventricular wall. [mm] | LVPWd; post LV wall |
| mitral valve mean gradient | The pressure gradient across the mitral valve in mitral stenosis is determined by measurement of the maximum recorded velocity of the mitral jet at end-. [mmHg] | MV PG |
| mitral valve orifice area | The normal area of the mitral valve orifice is about 4 to 6 cm^2. | MVA |
| mitral valve regurgitation | (aka mitral insufficiency) The abnormal flow of blood through the mitral valve from the left ventricle to the left atrium during systole. [mild, severe, or scale 0...4] | MR 2-3+; 1+ MI; trace MI; MV insufficiency |
| mitral valve regurgitation peak velocity | Peak mitral regurgitant velocity; mitral regurgitation jet Vmax. [m/s] | mr jet vel |
| mitral valve stenosis | Narrowing of the orifice of the mitral valve of the heart. [mild, severe, is present, no evidence of] | MS; mitral stenosis |
| pulmonary artery pressure | (aka PA pressure) A measure of the blood pressure found in the pulmonary artery, usually measured during systole. Mean pulmonary arterial pressure is normally 9-18 mmHg. [mmHg, hypertension] | PAP; PASP; PA systolic pressure |
| right atrial pressure | The pressure in the thoracic vena cava near the right atrium. | RAP; RA pressure |
| tricuspid valve mean gradient | Mean diastolic gradient across the tricuspid valve. [mmHg] | TR mean grad |
| tricuspid valve orifice area | Tricuspid valve orifice area (TOA). [mm] | TR area |
| tricuspid valve regurgitation | A disorder in which the heart's tricuspid valve does not close properly, causing blood to flow backward (leak) into the right upper heart chamber (atrium) when the right lower heart chamber (ventricle) contracts. [trace, mild, 3-4+] | TR; TI; TV insufficiency |
| tricuspid valve regurgitation peak velocity | (aka tricuspid regurgitant jet velocity) Measured in order to estimate the right ventricular and pulmonary pressure. [m/s] | TR jet velocity; TR max vel; Tricuspid regurgitant vel |