Table 5 Roadmap for refactoring corpora. The list of corpora comes from [32] and [33], which provide links to the corpora. Each column heading names a step that a corpus may need to undergo to be refactored; a corpus that requires that step is marked with a dot. "Get original" means the original text needs to be retrieved. "Detect spans" means the corpus is a metadata corpus, so the spans of entities need to be detected. "Alt. search" means techniques other than exact-match searching must be used.

From: Corpus Refactoring: a Feasibility Study

  get original detect spans alt. search
Arabidopsis Thaliana Circadian Rhythms [34]   
Bio1 [35]   
BioCreative 2004 Task 1A [28]  
BioCreative 2004 Task 1B [36]  
BioCreative 2004 Task 2 [37]  
BioCreative 2006 Task GM [38]    
BioCreative 2006 Task GN [39]    
BioCreative 2006 Task IPS/IMS [40]  
BioCreative 2006 Task ISS [40]   
BioInfer [41]    
BioText: Recognizing Abbreviation Definitions [42]
BioText: Protein-Protein Interaction Data [43]  
BioText: Relations between Disease/Treatment Entities [44]   
Brown-Genia Treebank [45]   
DepGenia [46]   
DIPPPI [47]  
EDGAR [48]  
GENIA [49, 50]   
FetchProt [51]    
Human Gene ID-Serve   
IEPA [52]  
ImmunoTome   
iProLink [53]    
Medstract [54, 55]    
MedTag [7]    
OHSUMED [56, 57]
PASBio [58]   
PASTA [59]    
PathBinder [60]    
PennBioIE [12]    
PICorpus    
ProSpecTome [61]  
PDG [9]
Texas [62]  
TREC Genomics 2004 Categorization Task [63]  
TREC Genomics 2005 Categorization Task [64]  
TREC Genomics 2006 IR Task [65]
TREC Genomics 2007 IR Task [65]  
Wisconsin [66]
WSD [67]    
Yapex [68, 69]
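
The difference between exact-match searching and the "alt. search" step named in the caption can be illustrated with a minimal Python sketch. It is not the method used in the study: the function name `find_span` and the particular fallback (ignoring case and whitespace differences) are assumptions chosen for illustration only. The sketch tries to locate an annotated mention in the retrieved original text by exact match, and falls back to a looser search when that fails.

```python
import re
from typing import Optional, Tuple


def find_span(text: str, mention: str) -> Optional[Tuple[int, int, str]]:
    """Locate `mention` in `text`.

    Returns (start, end, method), where method is "exact" or "alt",
    or None if the mention cannot be found. Hypothetical helper.
    """
    tokens = mention.split()
    if not tokens:
        # An empty or whitespace-only mention has no meaningful span.
        return None

    # Step 1: exact-match search.
    start = text.find(mention)
    if start != -1:
        return start, start + len(mention), "exact"

    # Step 2 (assumed fallback): tolerate differences in case and in
    # the amount or kind of whitespace between tokens.
    pattern = r"\s+".join(re.escape(token) for token in tokens)
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if match:
        return match.start(), match.end(), "alt"
    return None


if __name__ == "__main__":
    original = "The p53  protein binds\nMDM2 in vivo."
    for mention in ("MDM2", "P53 protein", "BRCA1"):
        print(mention, "->", find_span(original, mention))
```

A corpus whose annotations all resolve at step 1 would not need the "alt. search" step; one whose text was re-tokenized or case-normalized during annotation would rely on a fallback of this kind, or a more elaborate one.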