
PennTURBO Documentation

The Github Pages site for PennTURBO

TURBO: Mapping Medication Text to Semantic Terms and Pharmacological Roles


TURBO (Transforming and Unifying Research with Biomedical Ontologies) is an initiative based in the University of Pennsylvania’s Institute for Biomedical Informatics. It includes automated reconciliation of data from multiple sources (via referent tracking), modeling data values as well as the things that the data are about (ontological realism), and the generation of rule-based conclusions despite missing and contradictory data. Current applications include



Questions frequently arise in biomedical and healthcare informatics along the lines of “how many of our patients received a prescription for an antiarrhythmic drug?” or “of the patients who received intervention A vs. B, which group received more prescriptions for analgesics?”

Unfortunately, the prescription records currently used by TURBO (which come from the University of Pennsylvania Healthcare System) do not consistently use terms from an ontology or controlled vocabulary like RxNorm. Furthermore, the classification of these medications into pharmaceutical classes appears incomplete, exclusive, and overly flat. Presumably, this limitation affects other EHR systems and biomedical data sets.

The TURBO team has access to prescription orders for ~50k patients with Penn Medicine Biobank consent and can pull them directly from the Penn Data Store relational database (PDS). Specifically, an ~11k whole-exome-sequenced sub-cohort is especially well characterized within TURBO at this time. In addition to this patient-centric medication information, the TURBO team can also pull general reference knowledge about medications from PDS. This is important because the textual part of medication orders is noisy and variable, but each order is keyed to a row in the medication reference table, narrowing the search space.

For example, just among the 11k whole-exome cohort, there are five order names for the medication named “HYDROMORPHONE PCA 30 MG/30ML (CNR)”:

This document specifically addresses methods for examining the textual “full name” of a medication and determining a matching term from RxNorm using lexical and machine learning techniques. Having done that, it is possible to decisively link to a large semantic universe of knowledge about those medications, including their (structural) classes, capabilities, and roles.

Beyond PDS, the TURBO team can also obtain prescription and medication knowledge from UPHS’ EPIC system via collaborators with credentials for Clarity, a SQL-accessible reshaping of EPIC’s MUMPS database. PDS and EPIC have different scopes, in terms of when and where a patient received care. Both of these sources contain RxNorm classifications for a subset of the records, but the two resources do not use the same text-to-RxNorm mapping methods. EPIC mappings come from the EPIC system itself; PDS mappings come from 3M Health Information Systems.


Many of the techniques mentioned here would be applicable to mapping any minimally constrained (but generally topical) free-text phrases to ontology terms. (This method is not intended for mining multiple concepts out of long narratives.)

A method is provided for extracting term labels to a Solr index and searching the free text against that index. One should keep in mind that Solr will return multiple hits that it finds to be relevant from a TF-IDF perspective.
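The shape of such a Solr query can be sketched as follows. This is a minimal illustration in Python, not the project's actual code (which is written in R): the base URL and core name (`term_labels`) are assumptions, while the field names (`id`, `anyLabel`, `score`) follow the query string shown later in this document.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; the core name "term_labels" is an assumption.
SOLR_BASE = "http://localhost:8983/solr/term_labels/select"

def build_solr_query(medication_text, rows=30):
    params = {
        "q": "anyLabel:" + medication_text,  # search the combined label field
        "fl": "id,anyLabel,score",           # term id, label, and TF-IDF score
        "rows": rows,                        # this project requests 30 hits
        "wt": "csv",
    }
    return SOLR_BASE + "?" + urlencode(params)

url = build_solr_query("hydromorphone")
print(url)
```

Requesting the `score` field is what makes the downstream re-ranking possible, since Solr's relevance score becomes one feature among many.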

While Solr provides a score, there is no static threshold that differentiates a good match from a poor one. Therefore, methods are also provided for training a machine learning algorithm, specifically a random forest classifier, to distinguish the correct hits from noise. While the method is distinguished by not requiring any negative training data, it does require some positive training data. Furthermore, the algorithm doesn’t actually pick one term match for each input medication text. Rather, it assumes that the input medication text denotes some unknown medication, which can be modeled with semantic term X, and then classifies each of the terms returned by Solr as

Out of the 30 results returned for each medication text fed to Solr, there could be 0, 1 or more classifications of each type. Each classification comes with a probability, and each type of classification has a different level of usefulness, so even though the algorithm doesn’t pick one result as exclusively correct, it is possible to weight and prioritize the results.
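That weighting step can be sketched like this. The class labels and usefulness weights below are hypothetical stand-ins for the project's actual classification types (which are not enumerated at this point in the document), and the term identifiers are invented.

```python
# Hypothetical usefulness weight per classification type.
CLASS_WEIGHT = {
    "exact": 1.0,        # term directly models the unknown medication
    "ingredient": 0.6,   # term models only an ingredient of it
    "unrelated": 0.0,    # noise; never surfaced
}

def prioritize(hits):
    """hits: (term_id, predicted_class, class_probability) tuples.
    Returns surviving hits ranked by weight * probability."""
    scored = [(term, CLASS_WEIGHT[cls] * prob, cls)
              for term, cls, prob in hits
              if CLASS_WEIGHT[cls] > 0]
    return sorted(scored, key=lambda h: h[1], reverse=True)

hits = [("term:A", "exact", 0.91),
        ("term:B", "ingredient", 0.88),
        ("term:C", "unrelated", 0.97)]
print(prioritize(hits))
```

Note that a high-probability "unrelated" classification is dropped entirely, while a lower-probability but more useful classification survives with a discounted score.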

Since this method is able to align free text with semantic terms, it is possible to project the properties of those terms onto whatever things are represented by the text. For example, the text “500 mg Tylenol po tabs” might be mapped to a term with the label “Acetaminophen 500 MG Oral Tablet [Tylenol]”. DrOn knows that this is a subclass of “Acetaminophen 500 MG Oral Tablet”, which is a subclass of this anonymous class expression:

has_proper_part some (scattered molecular aggregate and 
     (is bearer of some active ingredient) and 
         (is bearer of some (mass and (has measurement unit label value milligram) and 
             (has specified value value ))) and (has granular part some Acetaminophen))

Here, “Acetaminophen” is modeled as a ChEBI compound, and ChEBI asserts that acetaminophen has an analgesic role.

Here, ChEBI is being used as a value-added authority on the pharmacological and therapeutic roles borne by drugs. EPIC does provide role-like classifications, but they are sparse (22% coverage) and exclusive (one classification per drug). Nonetheless, when an EPIC classification is available, it can be reconciled against the ChEBI role(s). Furthermore, TURBO has already been designed with the option of including alternative drug-role knowledge from additional public sources, or knowledge provided by local clinical experts.

EPIC as a medication knowledge source

The most recent source of EPIC medication knowledge comes from a 25,182,649-byte file, EPIC medication hierarchy.xlsx, created on 2018-09-18.

This file is not unique by MedicationName, as each MedicationName can be assigned to zero or more RxNorm codes from two predominant categories. (If no assignment is available, the RXNORM_CODE is “-”.)

> table(EPIC_medication_hierarchy$RXNORM_CODE_LEVEL)

                - MED_FORM_STRENGTH          MED_ONLY      MED_STRENGTH 
           120264            118121            143264                 1 

In other words, there are 381,650 rows, 170,093 unique MedicationNames, 261,386 total rows with a RxNorm annotation from EPIC, and 58,282 unique medications with at least one RxNorm annotation.
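The per-medication roll-up behind those counts can be sketched with a counter over (MedicationName, RXNORM_CODE_LEVEL) pairs. The rows below are invented examples, not data from the EPIC file; “-” marks a missing assignment, as described above.

```python
from collections import Counter

# Invented example rows: (MedicationName, RXNORM_CODE_LEVEL)
rows = [
    ("MED A", "MED_ONLY"),
    ("MED A", "MED_FORM_STRENGTH"),
    ("MED B", "-"),           # no RxNorm assignment
    ("MED C", "MED_ONLY"),
]

# Count RxNorm annotations per medication, ignoring the "-" placeholder.
annotated = Counter(name for name, level in rows if level != "-")

print(annotated)        # annotations per annotated medication
print(len(annotated))   # unique medications with >= 1 RxNorm annotation
```

Applying the same logic to the real file yields the figures above: many rows per medication, and fewer unique medications with at least one annotation than total annotated rows.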

EPIC medication hierarchy.xlsx includes records that are deprecated in various ways. For the subsequent per-medication RxNorm annotation counts, only the 196,356 records with an active state (not deleted or hidden or inactive) are considered.

Additionally, the RxNorm codes used by EPIC may have been retired by RxNorm as an organization. That is independent of whether EPIC has deprecated the record or not, and it can’t be determined without checking non-EPIC references (like the interactive RxNav web site).

While it isn’t unusual for an RxNorm-annotated EPIC medication to have a single RxNorm annotation, there are EPIC medications with over 20 RxNorm MED_FORM_STRENGTH annotations and/or over 40 MED_ONLY annotations.

Hexbin histogram: number of RxNorm annotations per EPIC medication

For example, MULTI PRENATAL 27-0.8 MG PO TABS is annotated with 17 different MED_ONLY RxNorm codes, in order to represent each of the vitamin and mineral components. It’s also annotated with RxNorm term 802792, which models the combination product. Vaccines are also among the medications with a high number of RxNorm annotations.

Because of these complex, irregular relationships between EPIC medications and RxNorm terms, EPIC RxNorm assertions were not used to train this project’s machine learning classifier.

PDS as a medication knowledge source

PDS medication knowledge was obtained with SQL queries against the master data management (MDM) and operational data store (ODS) schemas and then dumped to the files mdm_r_medication_180917.csv and ods_r_medication_source_180917.csv. The two files both had 937,084 rows, and neither used any primary key values that weren’t present in the other. The MDM file contains all of the columns present in the ODS file, except for an indicator of the upstream source of medication knowledge. Therefore they were merged by primary key into a single R data frame.
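The merge itself is a simple keyed join. The sketch below uses Python's csv module over inline strings for illustration; the column names (`MEDICATION_ID`, `FULL_NAME`, `SOURCE`) are hypothetical stand-ins for the actual PDS schema, and the real work was done in an R data frame.

```python
import csv
import io

# Invented miniature versions of the MDM and ODS dumps, sharing a primary key.
mdm_csv = """MEDICATION_ID,FULL_NAME
1,ACETAMINOPHEN 500 MG TAB
2,LISINOPRIL 10 MG TAB
"""
ods_csv = """MEDICATION_ID,SOURCE
1,EMTRAC
2,SCM
"""

# Index the MDM rows by primary key ...
mdm = {r["MEDICATION_ID"]: r for r in csv.DictReader(io.StringIO(mdm_csv))}

# ... then fold in the one ODS-only column. Because neither file uses a key
# absent from the other, this is a clean 1:1 merge.
for r in csv.DictReader(io.StringIO(ods_csv)):
    mdm[r["MEDICATION_ID"]]["SOURCE"] = r["SOURCE"]

print(mdm["1"])
```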

FULL_NAME medication text values were available for 99.7% of the rows. There are only 3260 records lacking medication FULL_NAME values, all of which come from the THERADOC upstream source. Those 3260 do have strings in the SOURCE_ORIG_ID column that could probably be used as if they were FULL_NAMEs. All of the medication FULL_NAMEs were identical between the MDM and ODS sources, with the minor exception of 15 rows containing non-ASCII or non-printable characters that were tidied with different rules.

RxNorm terms in PDS

There are 833,999 unique MDM FULL_NAMEs in the 937,084 rows. RxNorm annotations are present on 48,020 rows representing 41,970 unique FULL_NAMEs. In contrast with the EPIC medication file, 99% (41,501) of those medications have only one single RxNorm annotation.

In summary: PDS has a slightly smaller set of medications annotated with RxNorm terms, but the PDS medications are much more likely to have a single RxNorm annotation.



Medication mapping executive overview

Number of Medication Texts per PDS source

| PDS Source | Medication Count |
| --- | --- |
| EMTRAC | 424,637 |
| SCM | 245,922 |
| EPIC | 178,629 |

Number of Medication Texts with useful RxNorm assertions, by PDS source

    > table(complete.uphs.with.current.rxnorm[, "source"])
     EPIC   SCM 
     7340 29573 

Random sample of medication texts with 0 Solr hits

> print(sort(sample(solr.dropouts, 10)))
 [1] "abilif"            "brovanna"          "chlorohexadine"    "cholistmethate"    "coenzymeq"         "fosomax"          
 [7] "hydroxychoroquine" "kenalo"            "midod"             "tylen"     

Number of Solr results

> nrow(result.frame)
[1] 932297

932297 Solr results corresponds roughly to (36913 − 925) queries × 26 results per query (out of 30 requested).

Number of Solr results with a direct or mapped RxNorm term available

> table(is.na(result.frame$rxnifavailable))

 FALSE   TRUE 
643729 288568 
> table(result.frame$combo_likely, useNA = 'always')

 FALSE   TRUE   <NA> 
607884  35845 288568 
> result.frame$combo_likely[is.na(result.frame$combo_likely)] <- FALSE
> table(result.frame$combo_likely, useNA = 'always')

 FALSE   TRUE   <NA> 
896452  35845      0 
> nrow(result.frame)
[1] 932297
> nrow(result.frame.rxnavailable)
[1] 607884

    > table(result.frame.rxnavailable$rxnMatchMeth)
       BP or DRON           CUI          LOOM RxNorm direct      SAME_URI 
            16177        182230         66105        343313            59 
- Boolean multi-column recast of the matchType string values
    > table(result.frame.rxnavailable$matchType)
     altLabel     label prefLabel 
       231602     19926    356356 
- Boolean multi-column recast of the single column of concatenated UMLS TUI semantic types.  Columns were only retained for TUIs which appeared in at least 1% of the rows.
- Boolean multi-column recast of the source ontology string column.  Columns were only retained for ontologies which appeared in at least 1% of the rows.
- Six string distance measures between the expanded PDS medication text and the term label returned by Solr
    - Levenshtein
    - longest common substring
    - qgram
    - cosine
    - jaccard
    - Jaro-Winkler
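Two of the distance features listed above can be sketched in pure Python. The project computed these in R (e.g. via a string-distance package), so the functions below are only an illustration of what the features measure.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def qgrams(s, q=2):
    """Set of overlapping character q-grams of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a, b, q=2):
    """Jaccard distance (not similarity) over character q-grams."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return 1 - len(ga & gb) / len(ga | gb)

print(levenshtein("tylenol", "tylen"))
print(jaccard("acetaminophen", "acetaminophen"))
```

Using several distance measures at once lets the random forest learn which notion of “close” matters for which kinds of medication text.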
    pairs.for.graph <-
      unique(result.frame.rxnavailable[, c("rxnifavailable", "RXNORM_CODE_URI")])
    > dim(pairs.for.graph)
    [1] 120585      2
X a <> ;
    <>  <>;
    <>  <>.
    > tripleCount(temp.rdf)
    [1] 361795

‘500 mg Tylenol oral tablet’ is-a ‘500 mg acetaminophen oral tablet’

Random Forest optimization

The training error doesn’t improve much beyond 200 trees, so that choice was imposed in all subsequent steps.




> sort(important.features)
 [1] "altLabel"        "cosine"          "hchars"          "hwords"          "id.scaled"       "jaccard"         "jw"             
 [8] "lcs"             "lv"              "matchType"       "ontology"        "ontology.scaled" "prefLabel"       "qchars"         
[15] "qgram"           "qwords"          "rank"            "rxnMatchMeth"    "RXNORM"          ""   "score"          
[22] "T109"            "T121"            "T200"        

Training the random forest with these settings took 744 seconds on a modern server with 32 GB of RAM and can be summarized like this:

 randomForest(formula = my.form, data = trainframe, ntree = my.ntree, mtry = my.mtry, get.importance = get.importance.Q) 
               Type of random forest: classification
                     Number of trees: 200
No. of variables tried at each split: 10

        OOB estimate of  error rate: 9.16%

The accuracy, assessed with the 10% validation set, is similar.

Confusion Matrix and Statistics (predicted classes as rows, reference classes as columns; the wide matrix output is wrapped into two column groups below)

  FALSE-FALSE-FALSE-FALSE                   20130                    556                    361                    66
  FALSE-FALSE-FALSE-TRUE                      225                   4034                     48                    21
  FALSE-FALSE-TRUE-FALSE                      200                     54                   4916                    25
  FALSE-FALSE-TRUE-TRUE                        31                     53                     10                  2437
  FALSE-TRUE-FALSE-FALSE                      419                     16                     10                     2
  FALSE-TRUE-FALSE-TRUE                        10                      1                      3                     3
  TRUE-FALSE-FALSE-FALSE                       57                      6                     16                    78
  TRUE-TRUE-FALSE-FALSE                       402                     48                      1                     4
  FALSE-FALSE-FALSE-FALSE                    643                    35                     85                   587
  FALSE-FALSE-FALSE-TRUE                      18                     1                      7                    55
  FALSE-FALSE-TRUE-FALSE                      19                    11                      8                     8
  FALSE-FALSE-TRUE-TRUE                        1                     2                     66                     1
  FALSE-TRUE-FALSE-FALSE                    7267                    10                      3                   239
  FALSE-TRUE-FALSE-TRUE                       10                   656                      1                    15
  TRUE-FALSE-FALSE-FALSE                       0                     0                   2417                     1
  TRUE-TRUE-FALSE-FALSE                      203                     7                      4                  6822

Overall Statistics
               Accuracy : 0.9108          
                 95% CI : (0.9084, 0.9132)
    No Information Rate : 0.4018          
    P-Value [Acc > NIR] : < 2.2e-16       
                  Kappa : 0.8836          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

Sensitivity                                  0.9374                       0.84606                       0.91631
Specificity                                  0.9270                       0.99230                       0.99324

Sensitivity                               0.92451                        0.8905                      0.90859
Specificity                               0.99677                        0.9846                      0.99918

                     Class: TRUE-FALSE-FALSE-FALSE Class: TRUE-TRUE-FALSE-FALSE
Sensitivity                                0.93284                       0.8828
Specificity                                0.99689                       0.9854

High level summary of lexical mapping results

Random sample of PDS FULL_NAMEs solely mapped to terms outside of RxNorm:

 [1] "Basic Metabolic Panel and Mg on Wednesday, 3/12/14 with results faxed to Dr. Smith (215) 349-8309 (Fax)"                                                                                              
 [2] "ALLO TRANSPLANT LABS EVERY MONDAY & FRIDAY"                                                                                                                                                           
 [3] "Saline Enema"                                                                                                                                                                                         
 [4] "LABS EVERY MONDAY AND THURSDAY. PLEASE DRAW CMP, CBC, MAGNESIUM, FERRITIN, PT/INR, PTT, FIBRINOGEN AND D-DIMER."                                                                                      
 [5] "u-500 regular insulin 0.15 ml subcutaneous each evening with DINNER"                                                                                                                                  
 [6] "CT Lumbar Spine without Contrast"                                                                                                                                                                     
 [7] "NEUTROGENA AGELESS INTENSIVES EX"                                                                                                                                                                     
 [8] "NEBULIZER ALBUTEROL AND ATROVENT X 3"                                                                                                                                                                 
 [9] "IFE-PG20 INTRACAVER"                                                                                                                                                                                  
[10] "Labs: CBC w/diff, CMP, PT/INT/PTT, Mg, Phos every Monday and Thursday starting 10/27/14.  Please fax results to Dr. Selina Luger at 215-662-4064.  Please call with critical results to 215-614-1847."  

FULL_NAME #6 is mapped to a term labeled “Computed tomography of thoracic and lumbar spine with contrast”.

PDS_meds_to_turbo_terms_and_roles_17col.csv contains the best match per medication text. It provides no mapping at all for 1,493 of the 833,999 PDS FULL_NAMEs. That could theoretically happen if a FULL_NAME returns zero Solr search results, but in this case it was always the result of a successful Solr search followed by the random forest failing to classify any of the Solr results as semantically acceptable.

For example, consider these FULL_NAMEs randomly selected from the group of 1,493:

 [1] "CARDIZEM 10 MG IV\r\n  INTRAVENOUS"                                                                    
 [2] "CONTINUOUS: 0\r\nBOLUS: 1\r\nINTERVAL: 10 MINUTES  INTRAVENOUS"                                        
 [4] "COMPAZINE 10 PO NOW\r\n"                                                                               
 [5] "0.3 MG  CLONIDINE PO \r\n  PER MOUTH"                                                                  
 [6] "LABETALOL 20 MG\r\nLABETALOL 40 MG\r\nLABETALOL 80 MG"                                                 
 [7] "DC DECADRON\r\n"                                                                                       
 [8] "TORADOL 30MG IM X 1\r\n  INTRAMUSCULAR"                                                                
 [9] "US PREGNANCY MULTI 1ST TRIMESTER TRANSABDOMINAL TRANSVAGINAL\r\n\r\n"                                  

The best Solr hit for #10 is “every 3-4 hours as needed”, and the lower-ranking hits have similar labels. One of the features used to train the random forest is the string similarity between the FULL_NAME and the label of the Solr match, so it is no surprise that this match was classified as semantically unrelated and therefore omitted from PDS_meds_to_turbo_terms_and_roles_17col.csv.

Methods for linking RxNorm terms to pharmacological roles

At this point, PDS medication texts have been lexically mapped to RxNorm terms that model medication products and ingredients. RxNorm was selected due to the availability of RxNorm training data within PDS and to the extensive links to RxNorm via DrOn and the UMLS.

ChEBI was selected as an authority on medication roles, due to its comprehensiveness and its commitment to BFO ontological realism.

Having said that, there is no one general semantic path to ChEBI terms from the various RxNorm terms returned from the lexical mapping. Therefore, SPARQL insert statements were used to materialize shortcut relationships from any RxNorm term to those RxNorm terms that are best aligned with ChEBI compounds.

These statements build chains of up to three predicates from a white-list:

rxnorm:form_of and rxnorm:has_form describe relationships like those between a base (dextromethorphan) and a salt form (dextromethorphan hydrochloride). Otherwise, all of those predicates describe s p o relationships in which the subject is more inclusive than the object. For example, we say that since

'cough and cold combo' rxnorm:has_ingredient acetaminophen

then

'cough and cold combo' inherits-roles-from acetaminophen

however, we don’t say that dextromethorphan has an analgesic role, despite these short semantic paths:

dextromethorphan rxnorm:ingredient-of 'cough and cold combo'
'cough and cold combo' rxnorm:has_ingredient acetaminophen
acetaminophen has-role analgesic

or these:

Coughatussin `rxnorm:brandname-of` dextromethorphan
Coughatussin `rxnorm:brandname-of` acetaminophen
acetaminophen `has-role` analgesic
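The asymmetry of that inheritance rule can be sketched in a few lines. The triples and predicate names below are simplified, invented stand-ins for the actual RxNorm and ChEBI identifiers, and plain tuples stand in for the RDF graph.

```python
# Invented triples illustrating the role-inheritance direction.
triples = {
    ("cough_and_cold_combo", "has_ingredient", "acetaminophen"),
    ("dextromethorphan", "ingredient_of", "cough_and_cold_combo"),
    ("acetaminophen", "has_role", "analgesic"),
}

def materialize_inherited_roles(triples):
    """A product inherits the roles of its ingredients, but an ingredient
    never inherits roles through the products that contain it."""
    inferred = set()
    for s, p, o in triples:
        if p == "has_ingredient":
            for s2, p2, o2 in triples:
                if s2 == o and p2 == "has_role":
                    inferred.add((s, "has_role", o2))
    return inferred

print(materialize_inherited_roles(triples))
```

The combination product inherits the analgesic role from acetaminophen, while dextromethorphan gains nothing, exactly as argued above.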

Having inserted the role inheritance statements, it becomes possible to construct a semantic path all the way from a patient’s prescription, to a PDS medication (based on its medication text), to a RxNorm term (thanks to Solr and the random forest), on to RxNorm ingredients (the sources of role inheritance), a ChEBI compound, and finally a ChEBI role or class. If PDS says a patient received a prescription for “500 mg Tylenol po tabs”, the semantic graph will say that they received a prescription for an analgesic drug.

Alternatively, there are RDF-formatted linked data sets about medications which make assertions about the capabilities and therapeutic classes of medications in a way that requires only a single hop to an RxNorm term, although the semantics of this hop is usually the somewhat vague “shares concept unique identifier”.

There is always some degree of disagreement between these authorities. PDS provides one exclusive pharmaceutical classification for a small subset of its medications, so it can provide one form of confirmation when the graph provides contradictory role information, or fails to assign a particular role to a particular medication.

Count of PDS medications with the antipsychotic role, by semantic authority

antipsychotic role venn diagram by authority

In this case, the antipsychotic role assignments in PDS are small in number (631) but largely in agreement with ChEBI and the WHO’s Anatomical Therapeutic Chemical (ATC) classification (with 566 in the intersection of all three authorities). Therefore it would seem entirely reasonable to confirm the antipsychotic role of the additional 7,204 PDS medications appearing in the intersection of the ChEBI and ATC sets, for a greater than 10-fold increase in the number of PDS medications linked to an antipsychotic role.
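The cross-authority confirmation logic amounts to set intersection and difference. The medication identifiers below are invented, and the real counts (631 PDS assignments, 566 in the triple intersection, 7,204 ChEBI-and-ATC-only) are not reproduced here.

```python
# Invented per-authority sets of medications bearing the antipsychotic role.
pds   = {"med1", "med2", "med3"}           # PDS's exclusive classification
chebi = {"med1", "med2", "med4", "med5"}   # roles inherited via ChEBI
atc   = {"med1", "med2", "med4", "med6"}   # WHO ATC-derived assignments

# Strongest confirmation: all three authorities agree.
confirmed_by_all_three = pds & chebi & atc

# Candidates to extend PDS: ChEBI and ATC agree, but PDS is silent.
candidates = (chebi & atc) - pds

print(sorted(confirmed_by_all_three), sorted(candidates))
```

Medications appearing in only one authority's set are the ones worth routing to a clinician for review, per the paragraph below.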

When there’s less agreement between authorities, that can be an indicator that subject matter expertise from a clinician might be useful.


Alternative approaches

Future opportunities

Quality of Results

For example, the misspelled medication text “hydroxychoroquine” retrieves zero Solr hits on its own, but it does when modified with the ~ (fuzzy matching) operator:

    ,anyLabel,score&q=anyLabel:hydroxychoroquine~&wt=csv

It retrieves Solr hits, with a generally reasonable score of 13.9, from multiple ontologies:

    id  anyLabel            score
        hydroxychloroquine  13.968817
        hydroxychloroquine  13.968817
        hydroxychloroquine  13.968817
        hydroxychloroquine  13.968817
        hydroxychloroquine  13.968817
        hydroxychloroquine  13.968817

Performance, code quality and consistency