Table 5 Roadmap for refactoring corpora. The list of corpora comes from [32] and [33], which provide links to the corpora. Column headings indicate the steps that a corpus may need to undergo to be refactored; a dot marks each corpus that would require that step. "Get original" means the original text needs to be retrieved. "Detect spans" means the corpus is a metadata corpus, so the spans of entities need to be detected. "Alt. search" means techniques other than exact-match searching must be used.

From: Corpus Refactoring: a Feasibility Study

 

| Corpus | get original | detect spans | alt. search |
|---|:---:|:---:|:---:|
| Arabidopsis Thaliana Circadian Rhythms [34] | • | | |
| Bio1 [35] | • | | |
| BioCreative 2004 Task 1A [28] | • | | • |
| BioCreative 2004 Task 1B [36] | | • | • |
| BioCreative 2004 Task 2 [37] | | • | • |
| BioCreative 2006 Task GM [38] | | | |
| BioCreative 2006 Task GN [39] | | | |
| BioCreative 2006 Task IPS/IMS [40] | | • | • |
| BioCreative 2006 Task ISS [40] | | • | |
| BioInfer [41] | | | |
| BioText: Recognizing Abbreviation Definitions [42] | | | |
| BioText: Protein-Protein Interaction Data [43] | • | | • |
| BioText: Relations between Disease/Treatment Entities [44] | • | | |
| Brown-Genia Treebank [45] | • | | |
| DepGenia [46] | • | | |
| DIPPPI [47] | | • | • |
| EDGAR [48] | • | • | |
| GENIA [49, 50] | • | | |
| FetchProt [51] | | | |
| Human Gene ID-Serve | • | | |
| IEPA [52] | • | • | |
| ImmunoTome | • | | |
| iProLink [53] | | | |
| Medstract [54, 55] | | | |
| MedTag [7] | | | |
| OHSUMED [56, 57] | • | • | • |
| PASBio [58] | | • | |
| PASTA [59] | | | |
| PathBinder [60] | | | |
| PennBioIE [12] | | | |
| PICorpus | | | |
| ProSpecTome [61] | • | • | |
| PDG [9] | • | • | • |
| Texas [62] | • | | • |
| TREC Genomics 2004 Categorization Task [63] | | • | • |
| TREC Genomics 2005 Categorization Task [64] | | • | • |
| TREC Genomics 2006 IR Task [65] | | • | • |
| TREC Genomics 2007 IR Task [65] | | • | • |
| Wisconsin [66] | • | • | • |
| WSD [67] | | | |
| Yapex [68, 69] | • | | |