Research


Research Overview

(Updated: January 2021) The Hero group focuses on building foundational theory and methodology for data science, in particular for machine learning and signal processing. Data science, which lies at the intersection of mathematics, statistics, computer science, information science, and engineering, is the methodological underpinning for data collection, data management, data analysis, and data visualization. We are developing data science methods for a wide range of applications, including: network science, nuclear science, public health and personalized medicine, brain and neuroscience, environmental and earth sciences, astronomy and space science, materials science, molecular biology, genomics and proteomics, computational social science, business analytics, computational finance, information forensics, and national security. Much of the theory that the Hero group is developing applies to high dimensional data collection, analysis, and visualization. Some current projects of the group are:

  1. Development of tools to extract useful information from high dimensional datasets with many variables and few samples (large p small n). A major focus here is on establishing fundamental limits in “the large p small n” regime, which allows data analysts to “right size” their sample for reliable extraction of information. Areas of interest include: correlation mining in high dimension, i.e., inference of correlations between the behaviors of multiple agents from limited statistical samples, and dimensionality reduction, i.e., finding low dimensional projections of the data that preserve relevant information.
  2. Data representation, analysis, and fusion on non-linear, non-Euclidean structures. Examples of such data include: data that come in the form of a probability distribution or histogram (which lie on a hypersphere under the Hellinger metric); data that are defined on graphs or networks (combinatorial non-commutative structures); and data on spheres with point symmetry group structure, e.g., quaternion representations of orientation or pose.
  3. Resource constrained information-driven adaptive data collection. We are interested in sequential data collection strategies that use feedback to successively select among a number of available data sources so as to minimize energy, maximize information gain, or minimize delay to decision. A principal objective has been to develop good proxies for the reward or risk associated with collecting data for a particular task (detection, estimation, classification, tracking). We are developing strategies for model-free empirical estimation of surrogate measures including Fisher information, Rényi entropy, mutual information, and Kullback-Leibler divergence. In addition, we are quantifying the loss of plan-ahead sensing performance due to the use of such proxies.
  4. Geometric embedding of combinatorial optimization. One of the major roadblocks to making scientific progress on grand challenge problems is the curse of dimensionality. This problem is especially acute in combinatorial optimization, where the behavior of the objective function under permutations and combinations has no obvious geometric structure. Remarkably, smooth geometric structure emerges as the domain dimension grows in many Euclidean combinatorial optimization problems, including shortest path through a similarity graph and multiobjective pattern matching. This geometric embedding can lead to approximate solution of the combinatorial problem via solution of a simpler variational continuous optimization problem. Further progress in this field could lead to general combinatorial solvers that exploit the considerable machinery available in scientific computing, e.g., general ordinary differential equation (ODE) and partial differential equation (PDE) solvers. Grand challenge problems that could benefit from this research include monitoring pandemics (path analysis on epidemic proximity graphs), energy and transportation (optimal routing), and adaptive drug design (computing Pareto frontiers), to name just a few.
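The correlation mining problem in item 1 can be illustrated with a minimal sketch on synthetic data (the variables, sample sizes, and threshold below are illustrative, not the group's actual methodology): edges of a correlation graph are declared wherever the sample correlation magnitude exceeds a threshold, which in the large p, small n regime must be chosen aggressively to control false discoveries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Large p, small n setting: p = 50 variables, only n = 15 samples.
p, n = 50, 15

# Ground truth: only variables 0 and 1 are correlated; the rest are independent.
X = rng.standard_normal((n, p))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * rng.standard_normal(n)

# Sample correlation matrix (p x p).
R = np.corrcoef(X, rowvar=False)

# Correlation mining: declare an edge where |r_ij| exceeds a threshold.
# With n << p there are p*(p-1)/2 candidate pairs, so the threshold must be
# set high to keep the expected number of false edges small.
rho = 0.8  # illustrative threshold
edges = [(i, j) for i in range(p) for j in range(i + 1, p)
         if abs(R[i, j]) > rho]
print(edges)
```

The fundamental-limits results mentioned above address exactly how such a threshold must scale with p and n for reliable recovery.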
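Item 3 mentions model-free empirical estimation of entropy measures. Below is a minimal sketch of one standard such estimator, the Kozachenko-Leonenko 1-nearest-neighbor estimator of differential (Shannon) entropy, which is the alpha -> 1 limit of Rényi entropy. The 1-D case and the Gaussian test distribution are illustrative choices, not the group's specific graph-based estimators.

```python
import numpy as np

def knn_entropy_1d(x):
    """Kozachenko-Leonenko 1-NN estimator of differential entropy
    for 1-D samples, using only nearest-neighbor distances (model-free)."""
    n = x.size
    xs = np.sort(x)
    gaps = np.diff(xs)
    # Distance from each sorted sample to its nearest neighbor.
    eps = np.minimum(np.concatenate(([np.inf], gaps)),
                     np.concatenate((gaps, [np.inf])))
    euler_gamma = 0.5772156649015329
    # d = 1, so the unit-ball volume c_1 = 2.
    return np.mean(np.log(eps)) + np.log(2.0) + euler_gamma + np.log(n - 1)

rng = np.random.default_rng(1)
x = rng.standard_normal(4000)
est = knn_entropy_1d(x)
true_h = 0.5 * np.log(2 * np.pi * np.e)  # ~1.4189 nats for N(0, 1)
print(est, true_h)
```

No density model is fit at any point; the estimate is computed directly from inter-sample distances, which is what makes such surrogates usable as data-driven proxies for sensing rewards.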

These areas arise in the context of several sponsored projects in the Hero lab including the following:

  1. Mathematical representations of high dimensional spatio-temporal data (funded by NASA, DOE, ARO, AFOSR, and DARPA). This project is developing mathematical and probabilistic models for multiway data that can be used for prediction, classification, and anomaly detection. Application domains include astronomical data, network data, biomedical diagnostics, and predictive health. One project, funded by DARPA (ended in 2020), aimed to predict health and disease propagation (epidemics) over a close-knit human population based on a combination of genetic, metabolic, and wearable data. Another project, funded by NSF (ended in 2018), developed machine learning methods that can handle data that comes in the form of distributions. An example is flow cytometry data, where each cell in a blood sample is assayed and assigned a multidimensional label, including antibody, protein binding, and morphology labels. In another project, funded by NASA, the Hero group is developing physics-based machine learning approaches for prediction of solar flares from spatio-temporal measurements of active regions (sunspots) on the sun’s surface. Another project, funded by ARO, seeks to model the growth dynamics and metabolite production of bacterial biofilms, relating these behaviors across time and length scales; for this we are applying recurrent neural networks (RNN). Another project (DOE) is funding us to develop analysis tools for nuclear non-proliferation treaty verification, using on-site and remote data collection strategies to monitor declared facilities and detect undeclared facilities. In each of these areas we are developing approaches based on high dimensional data analysis, adaptive sampling (when to take a measurement or assay and where to take it from), large scale statistical inference, and multimodality data/information integration.
  2. Subspace processing for imaging and information fusion (ARO, AFOSR). Subspace models are models that are sparse in a basis spanning a low dimensional subspace of the data. Such models allow fusion of multi-modality data without overfitting and accomplish denoising of high dimensional datasets. We have developed dictionary-based learning methods, including pattern dictionaries, non-negative factor analysis, and measure-transformation generalizations of PCA, ICA, and CCA that allow non-linear components to be captured in the original coordinates (unlike kernel versions of PCA, ICA, and CCA). An area of current interest is subspace processing for tensor valued data, e.g., satellite remote sensing, which generates tensor data along time, space, and wavelength dimensions. These methods have been applied to modeling longitudinal data including: gene and proteomic expression data, mobile wearable data, bacterial community dynamics, optical and radar remote sensing, EEG and EKG data, sensor network data, and social media (Twitter) data.
  3. Network measurements and analytics (ARO, AFOSR, DOE). Network data is defined on a graph data structure and can be of exceedingly high dimension, both as measured by the number of nodes and by the number of node attributes. One focus, previously funded by NSF, is on Internet data analysis, including flows (TCP, UDP, etc.), application level (email, http), and transport (end-to-end delays, packet losses), to detect anomalies and reconstruct the topology of the network. Another focus, funded by ARO (ended in 2018), is on characterizing emergent behaviors on multi-layer social networks. An AFOSR-funded effort (ended in 2016) addressed the problem of reliably estimating structural properties of correlation graphs in sample-starved high dimensional regimes. We have been interested in problems of clustering, classification, and prediction for graph-valued data. A recent ARO-funded effort used graphon theory to establish fundamental limits on the classification accuracy of graph convolutional networks (GCN). These areas are being pursued in the context of applications including protein-protein interaction networks, network tomography, target tracking in sensor networks, and network anomaly detection. Modalities that have been investigated include Internet traffic data, email data, fMRI brain activation, 12-lead EKG monitoring and diagnostics, gene regulation networks, and sensor networks.
  4. Database indexing and retrieval (ARO-Databases, ended in 2014). In this ARO-funded project, the objective was to develop methods based on sparsity and dimensionality reduction for searching large multimedia databases of images and videos. This involved the development of scalable methods for feature selection, similarity matching, and spatio-temporal modeling that can improve precision and recall performance. See webpage (.html) for a short bibliography on this topic. Areas of focus included event detection and correlation in videos, pose estimation and 3D shape retrieval, multimodality retrieval using information theoretic measures, and multiple criteria image search. Application areas considered included automated recommendation systems and human-in-the-loop indexing and retrieval.
  5. Non-commutative information for fusion and active learning (ARO). We are developing a theoretical framework for accounting for and exploiting the intrinsic non-commutative structure of models for, and operations on, complex data structures. Non-commutativity is ubiquitous in data that carries information that is directed, asymmetric, not invariant to permutation, or that becomes stale over time. Such data may not have any natural linear ordering, e.g., graph-valued or tensor-valued data. Models for such data must account for the fact that certain operations destroy information. Most current models do not take this into account, e.g., the commonly applied independent identically distributed (iid) model, which leads to permutation-invariant summary statistics, like the sample mean or the sum of log-likelihoods, that fail to account for directed dependency in time series data. Non-commutative operations also arise in active learning, where actions taken on the fly affect the collected data; such actions are irrevocable, since they cannot be reversed after they have been applied, leading to non-commutativity. This project develops methods for exploiting non-commutativity in application areas including multimodality sensor fusion, target tracking, and inverse problems. One of our main foci is developing theory and methods that account explicitly for the non-commutative nature of systems with a human in the loop; a major example is centralized and decentralized cooperative target search, in which a pair of human and machine sensors is queried to localize an image, classify a scene, or estimate the position of a weak target.
  6. Adversarial robustness of deep neural networks (DARPA). In this project we are developing methods for hardening deep learning architectures against adversarial attacks. Deep neural networks (DNN) are especially vulnerable to attacks on their inputs, which can have devastating consequences in autonomous critical decision systems, e.g., DNNs for obstacle detection and avoidance in self-driving vehicles. We have developed a robust adversarial immune-system-inspired learning (RAILS) framework for hardening DNNs. This framework emulates the mammalian host immune response to attacks, e.g., by pathogens. Using the analogy between adversarial attacks on an in-silico DNN system and pathogen attacks on a living host system, we have created an adaptive immune system emulator (AISE) that emulates the sensing, flocking, clonal expansion, optimization, and consensus phases of the naturally occurring mammalian immune response.
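The subspace denoising described in project 2 can be sketched in its simplest form: a truncated SVD projects noisy high dimensional data onto its leading principal subspace, which suppresses noise energy outside that subspace. The dimensions, rank, and noise level below are illustrative, not taken from the project.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: a rank-3 signal subspace in p = 100 dimensions, plus noise.
p, n, r = 100, 500, 3
basis = rng.standard_normal((p, r))
coeffs = rng.standard_normal((r, n))
signal = basis @ coeffs
noisy = signal + 0.5 * rng.standard_normal((p, n))

# Subspace denoising: keep only the top-r singular directions of the data.
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
denoised = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Relative reconstruction error before and after projection.
err_noisy = np.linalg.norm(noisy - signal) / np.linalg.norm(signal)
err_denoised = np.linalg.norm(denoised - signal) / np.linalg.norm(signal)
print(err_noisy, err_denoised)
```

The projection discards the noise components orthogonal to the estimated subspace, so the denoised error is much smaller than the raw error; the dictionary and measure-transformation methods above generalize this basic linear mechanism.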
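The permutation-invariance point in project 5 can be demonstrated concretely: for a time series with directed dependence, an iid-style summary statistic is unchanged by shuffling the samples, while an order-sensitive statistic is not, so the iid summary discards the directed information. The AR(1) model and the two statistics below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# AR(1) time series: each sample depends on the previous one (directed dependence).
n, phi = 200, 0.9
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.standard_normal()

x_shuffled = x[rng.permutation(n)]

def iid_loglik(z):
    # Sum of log-likelihoods under a fixed N(0, 1) iid model (up to constants):
    # a permutation-invariant summary statistic.
    return -0.5 * np.sum(z ** 2)

def lag1_autocorr(z):
    # Lag-1 sample autocorrelation: order-sensitive.
    zc = z - z.mean()
    return np.dot(zc[1:], zc[:-1]) / np.dot(zc, zc)

print(iid_loglik(x) - iid_loglik(x_shuffled))      # ~0: shuffling changes nothing
print(lag1_autocorr(x), lag1_autocorr(x_shuffled))  # large vs. near zero
```

The iid summary is blind to the ordering that carries the AR(1) dependence, which is exactly the information loss that the non-commutative framework is designed to account for.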
