Useful Toolboxes and Databases

This page contains a collection of toolboxes and datasets that are publicly available. Edited by Tianpei Xie.

Available Group Toolboxes

Information Geometric Dimensionality Reduction (IGDR) Toolbox (Kevin M. Carter, Raviv Raich, and Alfred O. Hero III)
AFFECT MATLAB toolbox for clustering dynamic data (Kevin S. Xu, Mark Kliger, and Alfred O. Hero III)
Regularized graph layout MATLAB toolbox for dynamic network visualization (Kevin S. Xu, Mark Kliger, and Alfred O. Hero III)
Bayesian non-linear unmixing (provided by Nicolas Dobigeon and Cecile Bazot): http://dobigeon.perso.enseeiht.fr/app_BFA.html

UCI Machine Learning Repository. A standard source of benchmark datasets for machine learning.
Awesome public datasets. This site collects a large set of links to public datasets in Agriculture, Biology, Climate/Weather, Complex Networks, Computer Networks, Contextual Data, Data Challenges, Economics, Education, Energy, Finance, Geology, GIS/Environment, Government, Healthcare, Image Processing, Machine Learning, Museums, Natural Language, Physics, Psychology/Cognition, Public Domains, Search Engines, Social Networks, Social Sciences, Software, Sports, Time Series, Transportation, and Complementary Collections.
Kaggle competitions. Useful for side projects.
Federal Crash Databases (FARS, NASS, LTCCS). The FARS dataset is useful for drunk-driver prediction.
Knowledge Extraction based on Evolutionary Learning (KEEL) dataset repository. Includes preprocessed datasets from the UCI repository and some other datasets such as FARS.
A search engine for datasets in the UCI ML Repository, created by Gertjan van den Burg. Thanks to him for his excellent work.
The Extreme Classification Repository: Multi-label Datasets & Code
Data and code for the study of bullying
Relational Dataset Repository

Stanford Large Network Dataset Collection. Includes a set of datasets for network analysis, community detection, etc.
The personal page of Mark Newman. Prof. Newman’s collections; very good resources with detailed descriptions.
The personal page of Andrew McCallum. Prof. McCallum at UMass hosts datasets such as Cora Information Extraction and Cora Research Paper Classification, which are useful for co-citation and co-authorship network analysis.
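Network datasets of this kind are often distributed as plain-text edge lists: one edge per line, two whitespace-separated node IDs, with comment lines starting with "#". The sketch below assumes that layout (check each dataset's own description before relying on it) and treats edges as undirected when counting degrees. The function names and the toy graph are hypothetical and are not taken from any of the collections above.

```python
from collections import defaultdict


def parse_edge_list(text):
    """Parse a whitespace-separated edge list, skipping blank and '#' lines."""
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Use only the first two columns; any extra columns are ignored.
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges


def degree_counts(edges):
    """Count node degrees, treating every edge as undirected (an assumption)."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


# Hypothetical toy graph in the assumed edge-list layout.
SAMPLE = """# Toy graph for illustration only
# FromNodeId ToNodeId
0 1
0 2
1 2
2 3
"""

edges = parse_edge_list(SAMPLE)
degrees = degree_counts(edges)
```

For a downloaded file, the same functions could be applied to the file's contents read as text; for large graphs a dedicated graph library is likely a better fit.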

Gene Expression Search Engine. This database stores manually annotated experimental designs and keywords for thousands of gene expression experiments collected from GEO, ArrayExpress, the Japanese Toxicogenomics Project, Connectivity Map, DrugMatrix, etc. Researchers can find previously conducted experiments involving genes of interest by searching for genes or keywords.

Caffe (Deep Learning toolbox). Caffe is a well-known and widely used machine-vision library that ported MATLAB’s implementation of fast convolutional nets to C and C++. Caffe is not intended for other deep-learning applications such as text, sound, or time-series data. Like the other frameworks listed here, Caffe provides a Python API.
Theano. Written in Python, Theano is the grand-daddy of deep-learning frameworks. It is a powerful tool that is widely used for research and serves the large Python community. It is well suited to data exploration, and its developers explicitly state that it is intended for research.
Keras. A Python framework built on top of Theano.