Table 1 Datasets and model parameters considered across 870 simulations

From: Aligning text mining and machine learning algorithms with best practices for study selection in systematic literature reviews

| Datasets (size) | Data scenarios | Downsampling | Word frequencies | Classification algorithms | Model metrics (b) |
| --- | --- | --- | --- | --- | --- |
| Psoriasis (4442) | Abstract screening | With | Removing words appearing < 5 times across all citations | SVM | ROC |
| Lung cancer (12,769) | Full-text screening | Without | Removing words appearing < 10 times across all citations | Naïve Bayes | Sensitivity |
| Liver cancer (8507) | Removing full-text excludes | | Removing words appearing < 100 times across all citations | Bagged CART | |
| Melanoma (3089) | | | Removing words appearing < 500 times across all citations | | |
| Obesity (5187) | | | Keeping top 50 words in terms of variable importance (a) | | |
| | | | Keeping top 100 words in terms of variable importance (a) | | |
| | | | Keeping top 500 words in terms of variable importance (a) | | |

(a) Not applicable to the SVM algorithm
(b) Not applicable to the bagged CART algorithm
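
The table lists each factor as an independent column of options, so one way to read the figure of 870 simulations is as a cross of all factors, minus the combinations the footnotes rule out. The sketch below enumerates that grid under this assumption: the variable-importance filters are skipped for SVM, and bagged CART is given a single run with no metric choice. The table itself does not state that the simulations were counted this way; the function name, the shortened labels, and the `None` placeholder are illustrative only.

```python
from itertools import product

# Factor levels copied from Table 1 (labels shortened for readability).
datasets = ["Psoriasis", "Lung cancer", "Liver cancer", "Melanoma", "Obesity"]
scenarios = ["Abstract screening", "Full-text screening", "Removing full-text excludes"]
downsampling = ["With", "Without"]
frequency_filters = ["< 5", "< 10", "< 100", "< 500"]
importance_filters = ["top 50", "top 100", "top 500"]
algorithms = ["SVM", "Naive Bayes", "Bagged CART"]
metrics = ["ROC", "Sensitivity"]


def build_grid():
    """Enumerate simulation settings, assuming a full cross of the factors
    with the two footnote exclusions applied (an assumption, not the
    authors' stated procedure)."""
    grid = []
    for dataset, scenario, sampling, algorithm in product(
        datasets, scenarios, downsampling, algorithms
    ):
        # Footnote a: variable-importance filters do not apply to SVM.
        if algorithm == "SVM":
            word_options = frequency_filters
        else:
            word_options = frequency_filters + importance_filters
        # Footnote b: the metric choice does not apply to bagged CART,
        # so it gets one run per setting (None is a placeholder).
        metric_options = [None] if algorithm == "Bagged CART" else metrics
        for words, metric in product(word_options, metric_options):
            grid.append((dataset, scenario, sampling, words, algorithm, metric))
    return grid


grid = build_grid()
print(len(grid))  # 870
```

Under this reading, each of the 30 dataset, scenario, and downsampling combinations contributes 8 SVM runs, 14 naïve Bayes runs, and 7 bagged CART runs, which gives 30 × 29 = 870 and matches the table caption.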