Useful Toolboxes and Databases
This page contains a collection of toolboxes and datasets that are publicly available. Edited by Tianpei Xie.
Available Group Toolboxes
- Information Geometric Dimensionality Reduction (IGDR) Toolbox (Kevin M. Carter, Raviv Raich, and Alfred O. Hero III)
- AFFECT MATLAB toolbox for clustering dynamic data (Kevin S. Xu, Mark Kliger, and Alfred O. Hero III)
- Regularized graph layout MATLAB toolbox for dynamic network visualization (Kevin S. Xu, Mark Kliger, and Alfred O. Hero III)
- Bayesian non-linear unmixing (provided by Nicolas Dobigeon and Cecile Bazot): http://dobigeon.perso.enseeiht.fr/app_BFA.html.
Machine Learning and Data Science
- UCI Machine Learning Repository, commonly used in ML.
- Awesome public datasets. This site collects a large set of links to public datasets in Agriculture, Biology, Climate/Weather, Complex Networks, Computer Networks, Contextual Data, Data Challenges, Economics, Education, Energy, Finance, Geology, GIS/Environment, Government, Healthcare, Image Processing, Machine Learning, Museums, Natural Language, Physics, Psychology/Cognition, Public Domains, Search Engines, Social Networks, Social Sciences, Software, Sports, Time Series, Transportation, and Complementary Collections.
- Kaggle competitions. Useful for side projects.
- Federal Crash Databases (FARS, NASS, LTCCS). The FARS dataset is useful for drunk-driving prediction.
- Knowledge Extraction based on Evolutionary Learning (KEEL) dataset repository. Includes preprocessed datasets from the UCI repository and some other datasets such as FARS.
- The search engine for datasets in the UCI ML Repository, created by Gertjan van den Burg. Thanks to Gertjan van den Burg for his excellent work.
- The Extreme Classification Repository: Multi-label Datasets & Code
- Data and code for the study of bullying
- Relational Dataset Repository
Network analysis
- Stanford Large Network Dataset Collection, which includes a variety of datasets for network analysis, community detection, etc.
- The personal page of Mark Newman. Prof. Newman's collections are very good resources with detailed descriptions.
- The personal page of Andrew McCallum. Prof. McCallum at UMass has datasets such as Cora Information Extraction and Cora Research Paper Classification. Useful for co-citation and co-authorship networks.
Bioinformatics Databases
- Gene Expression Search Engine. This database stores the manually annotated experimental designs and keywords of thousands of gene expression experiments collected from GEO, ArrayExpress, the Japanese Toxicogenomics Project, Connectivity Map, DrugMatrix, etc. Researchers can find previously conducted experiments involving genes of interest by searching for genes or keywords.
Multi-view Face Recognition
- The Color FERET Database (from NIST). A request is required to obtain the dataset.
Gaussian Process
- GPmat (Gaussian Process Matlab toolbox)
- GPyOpt (Gaussian process optimization using Python)
- GPy (Gaussian process package for Python). This is a newer package developed by the same group that developed GPmat.
Deep Learning
- Caffe (Deep Learning toolbox). Caffe is a well-known and widely used machine-vision library that ported Matlab's implementation of fast convolutional nets to C and C++. Caffe is not intended for other deep-learning applications such as text, sound, or time series data. Like the other frameworks mentioned here, Caffe offers a Python API.
- Theano is the grand-daddy of deep-learning frameworks and is written in Python. It is a powerful tool that is widely used for research and serves the large Python community. It is well suited to data exploration, and its developers explicitly state that it is intended for research.
- Keras. A Python framework built on top of Theano.