ART User Guide

 

What is ART?

The Assisted Referral Tool (ART) was developed by the NIH Center for Scientific Review (CSR) to recommend potentially appropriate study sections based on the scientific content of a user's grant application.  The information you provide to ART is used only to recommend study sections and is not stored or persisted.  The recommendations made by ART are solely for the benefit of the user.

 

How ART Works

 

ART uses natural language processing and large-scale machine learning technology to make recommendations.  It works with indexed representations of both the application text that the user enters and the grant application data used to train the models.

As of August 2020, 175 study sections are represented in ART.  Models have been trained and validated for every pair of study sections, for a total of roughly 15,000 unique pairwise models.  The corpora of text for training the models are drawn from up to ten years of indexed text for each study section.
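The model count follows directly from the number of study sections, since every unordered pair of sections gets its own model.  A quick check of the arithmetic in Python:

    from math import comb

    n_sections = 175            # study sections represented as of August 2020
    print(comb(n_sections, 2))  # 15225 -> roughly 15,000 pairwise models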

After the user's application text has been entered, it is compacted into an indexed form and sent to each of the roughly 15,000 pairwise classification models.  Each pairwise model is a support vector machine (SVM) that determines which of its two study sections is more appropriate for the application text in question.  The indexed application text is thus sent to the bank of SVM models, each of which votes for a study section.
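This pairwise voting scheme is the standard one-vs-one strategy for multi-class classification.  The sketch below is a minimal illustration under that reading, not ART's actual implementation; the section names and the stand-in "models" are placeholders, and a real deployment would use trained SVMs in their place.

    from collections import Counter
    from itertools import combinations

    def tally_votes(vector, pairwise_models, sections):
        """One-vs-one voting: each pairwise model names whichever of its
        two study sections it judges more appropriate for the vector."""
        votes = Counter({s: 0 for s in sections})
        for pair in combinations(sections, 2):
            winner = pairwise_models[pair](vector)  # one of the two sections in the pair
            votes[winner] += 1
        return votes.most_common()                  # sections ranked by pairwise wins

    # Toy usage with stand-in "models" (placeholders, not real SVMs):
    sections = ["AAAA", "BBBB", "CCCC"]
    models = {pair: (lambda vec, p=pair: max(p)) for pair in combinations(sections, 2)}
    print(tally_votes([0.2, 0.8], models, sections))
    # [('CCCC', 2), ('BBBB', 1), ('AAAA', 0)]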

 

Choosing the Mode of Operation

When you enter the ART landing page, you are asked a couple of questions that determine the operating mode.

ART has essentially two modes: recommending study sections and recommending SBIR/STTR Special Emphasis Panels (SEPs).  These two modes involve completely different models that are trained independently: none of the SBIR/STTR models are used in the primary mode, and no study section models are used in SBIR/STTR mode.  Whereas study sections are recommended at the SRG (study section) level, SBIR/STTR SEPs are recommended at the IRG (parent) level.

The Animal Usage checkbox is used to filter the list of potentially relevant study sections.  If the checkbox is left unchecked, all study section models are included in the search.  If the checkbox is checked, however, study sections with less than 5% animal research are omitted from the search.
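Taken together, the two landing-page choices narrow the pool of candidate study sections before any voting happens.  The sketch below is one way to picture that narrowing; the section records, field names, and mode labels are hypothetical, and the real filtering logic is internal to ART.

    def candidate_sections(sections, mode, animal_research):
        """Pick the model pool for the chosen mode, then optionally drop
        sections whose portfolios are under 5% animal research."""
        pool = [s for s in sections if s["mode"] == mode]  # "study_section" or "sbir_sttr"
        if animal_research:
            pool = [s for s in pool if s["animal_fraction"] >= 0.05]
        return pool

    sections = [
        {"name": "AAAA", "mode": "study_section", "animal_fraction": 0.12},
        {"name": "BBBB", "mode": "study_section", "animal_fraction": 0.01},
    ]
    print([s["name"] for s in candidate_sections(sections, "study_section", True)])
    # ['AAAA']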

 

Extracting Scientific Concepts

When you enter your application text into the main textbox (and your application title into the title textbox), your application is indexed in two ways.  First, the RCDC ERA UberIndexer resource extracts scientific concepts from the RCDC Thesaurus and weights them according to their relative frequency in the text (see https://report.nih.gov/rcdc/index.aspx for a description of RCDC); concepts found in the title are given full weight.  Second, a novel indexing scheme based on a self-learned dictionary of n-grams provides an alternative index, intended to capture new concepts that may not be found in a curated thesaurus.  The two indices are concatenated into a single vector representing the text you entered.

All the grant applications used to train the models (essentially ten years' worth) are indexed the same way.  ART does not store the documents used to train the models or their indices, only the model representations.  Your text and the extracted indices are discarded when your job is complete.
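Conceptually, the result is a single feature vector made of two blocks: one from the curated thesaurus and one from the learned n-gram dictionary.  The sketch below illustrates only that concatenation; the relative-frequency weighting shown is a simplified stand-in, not the UberIndexer's actual scheme, and the single-token concept matching is a toy assumption.

    from collections import Counter

    def index_text(title, body, thesaurus_terms, ngram_dict):
        """Two-block vector: RCDC concept weights followed by n-gram weights.
        Title concepts are given full weight, as the guide describes."""
        tokens = body.lower().split()
        counts = Counter(tokens)
        total = max(len(tokens), 1)
        title_tokens = set(title.lower().split())
        concept_block = [1.0 if term in title_tokens else counts[term] / total
                         for term in thesaurus_terms]
        ngram_block = [counts[g] / total for g in ngram_dict]
        return concept_block + ngram_block  # concatenated index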

 

Recommending Study Sections

The vector representing the concatenated index of your text is sent to the bank of roughly 15,000 pairwise SVM models (or a smaller bank in the case of the SBIR/STTR SEPs), each of which votes on which of its two study sections is more appropriate for your application.  Study sections are then ranked by the number of pairwise votes they receive.  A variety of criteria determine how many study sections (generally three to six) are flagged as having "Strong" relevance and how many additional study sections (also three to six) are flagged as having "Possible" relevance.  Within these two groups, the study sections are listed alphabetically.
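The guide does not publish the exact selection criteria, so the sketch below simply takes the top-ranked sections by vote count and alphabetizes within each group; the cutoffs are arbitrary illustrations of the "three to six" ranges, and the SRG abbreviations are placeholders.

    def group_recommendations(ranked, n_strong=4, n_possible=5):
        """ranked: list of (section, votes) pairs sorted by votes, descending.
        Returns the Strong and Possible groups, alphabetized within each."""
        strong = sorted(s for s, _ in ranked[:n_strong])
        possible = sorted(s for s, _ in ranked[n_strong:n_strong + n_possible])
        return strong, possible

    ranked = [("ZZZZ", 140), ("BBBB", 133), ("AAAA", 129), ("MMMM", 120), ("CCCC", 110)]
    print(group_recommendations(ranked, n_strong=3, n_possible=2))
    # (['AAAA', 'BBBB', 'ZZZZ'], ['CCCC', 'MMMM'])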

In the example below, four study sections are listed in the "Strong" group and five in the "Possible" group.  Note that within each group the study sections are listed alphabetically by SRG abbreviation.  For each study section, links to the SRG and parent IRG descriptions are provided, along with a link to the SRG roster.  Users are encouraged to browse these links to determine whether the recommended study sections are a match for their research.

 

Tips for Using ART

1. Entering the Title is optional but strongly recommended – ART can operate without text in the Title box, but this is not recommended.  Scientific concepts found in the Title are given full weight by the RCDC ERA UberIndexer, matching the indexing scheme used for the applications that trained the models.  You could enter the title in the main text box instead, but then the UberIndexer cannot apply full weight to the title concepts.

2. Entering both the Abstract and Specific Aims is recommended – In general, an abundance of text improves performance.  ART ignores section-delimiting stop words such as "Abstract" and "Specific Aims," so it is in your interest to include both the Abstract and the Specific Aims from your application.  You can simply copy and paste the text, section headers and all.  However, we do not recommend including sections other than the Abstract and Specific Aims, because ART will not filter out other sections based on their headers.

3. Minimum text requirement – ART requires that the UberIndexer identify at least 10 scientific concepts from the RCDC Thesaurus in your text (Title and main text box combined).  For ART to classify accurately, it needs a strong enough "signal" to overcome potential "noise," and empirical testing revealed that a threshold of 10 concepts is necessary to ensure accurate classifications.  A rough illustration of this check follows this list.
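As promised above, here is a rough illustration of that minimum-signal check.  The substring-based concept matching is a naive stand-in for the UberIndexer, used only to make the threshold concrete.

    MIN_CONCEPTS = 10  # threshold from empirical testing, per the guide

    def has_enough_signal(title, body, thesaurus_terms):
        """Count distinct thesaurus concepts found in Title + main text."""
        text = f"{title} {body}".lower()
        found = {term for term in thesaurus_terms if term.lower() in text}
        return len(found) >= MIN_CONCEPTS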

 

Keeping ART Updated

As CSR periodically updates its study sections, so too must ART be updated.  This involves training new models for the new study sections and retiring models for study sections that are being phased out.  Three or four new releases of ART are issued per year to keep up with changes in study sections.  For the new models to be accurate, there must be enough assignments of applications to the new study sections.  If too few assignments are available for accurate classification, ART may fall back on business rules that redirect recommendations from retiring study sections to the new ones, until the new models can stand on their own in a future release.  For every new release of ART, a comprehensive testing and validation suite is run across all study sections.
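One way to picture the redirect behavior described above is as a simple mapping applied after voting.  The mapping and section names below are hypothetical; the guide does not specify how the business rules are actually expressed.

    # Hypothetical business rules: retiring section -> replacement section.
    REDIRECTS = {"OLD1": "NEW1", "OLD2": "NEW2"}

    def apply_redirects(recommendations):
        """Replace recommendations for retiring study sections with their
        successors until the new models can stand on their own."""
        return [REDIRECTS.get(s, s) for s in recommendations]

    print(apply_redirects(["OLD1", "AAAA"]))  # ['NEW1', 'AAAA']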