Penn Treebank Project
A corpus of parsed sentences. Used by many researchers for training data-driven parsing algorithms.
The RCSB Protein Data Bank (PDB)
Archive of experimentally determined 3-D structures of biological macromolecules from the Brookhaven National Laboratory.
Reuters-21578 Text Categorization Corpus
A classic benchmark for text categorization algorithms.
RISE: Repository of Information Sources used in information Extraction tasks.
Repository of online information sources: test domains for information extraction and wrapper generation tools that learn extraction rules (extraction patterns).
The StatLib Datasets Archive
A repository of datasets used in statistics and machine learning.
TechTC - Technion Repository of Text Categorization Datasets
Provides a large number of diverse test collections for use in text categorization research.
Time Series Data Library
A collection of over 500 time series, maintained by Rob Hyndman. Time series are organized by subject.
Text datasets used in information retrieval and learning in text domains.
UCI Machine Learning Repository
A repository of databases, domain theories and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
University of Maryland, INFORUM EconData
Several hundred thousand economic time series, produced by the U.S. Government and distributed by the government in a variety of formats and media, have been put into a standard, highly efficient, easy-to-use form for personal computers.
Web->KB dataset
Web pages partitioned into classes, with hyperlink data. The dataset has been used for text categorization and learning to extract symbolic knowledge from the World Wide Web.
WordSimilarity-353 Test Collection
Contains 353 English word pairs along with human-assigned similarity judgements.