Rationale: My research explores methods from machine learning (e.g., deep reinforcement learning) and controls for automating
and improving large, complex network engineering operations and distributed systems facilities.
These systems are large and complex, and they produce large amounts of data at multiple layers. Whether within a data center or across distributed facilities,
they support mission-critical applications and require many engineers to work together to understand system deterioration, optimize performance, and
improve usability for users. My target application domains include autonomous networks and facilities for astrophysics and bioinformatics science applications.
Selected Research Projects
DAPHNE: Deep and Autonomously Performing High-Speed Networks
This project is supported by a DOE Early Career Award [Home page].
DAPHNE focuses on enabling self-driving, intelligent networks that improve response, utilization, and reliability for exascale scientific workflows.
The research builds robust networks through machine learning-based approaches, cloud computing, and software-defined networking (SDN). For example, deep learning algorithms have recently been used to process real-time events, detect anomalies, and guide autonomous cars in highway traffic.
Analogously, the proposed research couples deep learning methods with SDN for predicting real-time network behavior and avoiding data traffic congestion or degraded network performance. Distributed processing models such as cloud computing will be used to reduce learning time and improve real-time network reactions.
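As a rough illustration of the predict-then-react idea above, the sketch below forecasts link utilization and picks a path that is not expected to congest. It is not the project's method: an exponentially weighted moving average stands in for a deep learning model, and the names LinkForecaster and choose_path, the link labels, and the 0.8 threshold are all assumptions made for this example.

```python
class LinkForecaster:
    """EWMA forecast of one link's utilization (0.0 to 1.0).

    Assumption: a simple stand-in for a learned predictor.
    """

    def __init__(self, alpha=0.5):
        self.alpha = alpha      # weight given to the newest sample
        self.estimate = None    # no forecast until the first sample

    def update(self, utilization):
        """Fold one new utilization sample into the forecast and return it."""
        if self.estimate is None:
            self.estimate = utilization
        else:
            self.estimate = (self.alpha * utilization
                             + (1 - self.alpha) * self.estimate)
        return self.estimate


def choose_path(paths, forecasts, threshold=0.8):
    """Pick a path whose busiest link is forecast below the threshold.

    paths: list of paths, each a list of link names (hypothetical labels).
    forecasts: dict mapping link name to forecast utilization.
    Falls back to the path with the least-loaded bottleneck link.
    """
    def bottleneck(path):
        return max(forecasts[link] for link in path)

    for path in paths:
        if bottleneck(path) < threshold:
            return path
    return min(paths, key=bottleneck)


if __name__ == "__main__":
    forecasts = {"a-b": 0.9, "a-c": 0.3, "c-b": 0.4}
    print(choose_path([["a-b"], ["a-c", "c-b"]], forecasts))
```

In an SDN setting, the chosen path would be handed to a controller to install; that step is omitted here because it depends on the controller in use.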
As data demands from scientific communities rapidly increase, the proposed research is timely for ensuring the development of reliable and robust networks with guaranteed
high-throughput data transfer and uninterrupted performance.
This project is also exploring smart contracts and blockchains as a means of reliable machine learning communication across distributed nodes.
Panorama 360: Performance data capture and analysis for end-to-end scientific workflows
This project is supported by DOE Analytical models funding.
Scientific workflows support cutting-edge
computational science by automating many tedious and error-prone tasks on behalf of scientists.
They can help capture data at the instrument, preprocess it at an HPC cluster and move results to a visualization platform.
These workflows increase in size and complexity as experiments mature and need powerful computing, networking, and storage to support them. Advanced workflow management systems can help achieve the desired performance, scalability, and reliability.
Due to a lack of realistic data, workflows use simplistic algorithms for resource selection and task scheduling that focus on the computational aspects of workflows rather than the end-to-end process of data management. Building on the success of the Panorama project, which developed initial data collection tools for a subset of workflows, this project 1) expands the types of workflows under study, 2) applies machine learning to performance data to detect bottlenecks and anomalies and to optimize workflow performance, and 3) builds a community repository that serves as a unique resource for researchers developing algorithms and techniques for workflow management and exploring querying, provenance management, fault tolerance, and adaptivity.
The proposed work will explore a number of performance data analysis, synthesis, and characterization approaches based on machine learning, using both supervised and unsupervised learning. These algorithms and analyses will detect workflow anomalies, identify performance bottlenecks, aid in debugging and troubleshooting sources of failure, and perform optimizations. [Webpage]
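To make the unsupervised side of this concrete, here is a minimal sketch of anomaly detection on task runtimes. It uses a modified z-score based on the median and the median absolute deviation, which is one common label-free technique and not necessarily what the project uses. The function name, the task names, and the 3.5 cutoff are assumptions for illustration.

```python
import statistics


def find_runtime_anomalies(runtimes, threshold=3.5):
    """Return the tasks whose runtime is far from the median.

    runtimes: dict mapping task name to runtime in seconds (hypothetical data).
    Uses the modified z-score 0.6745 * (x - median) / MAD, which is less
    sensitive to the outliers themselves than a mean/stddev z-score.
    """
    values = list(runtimes.values())
    median = statistics.median(values)
    # Median absolute deviation: typical distance from the median.
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # All (or most) runtimes are identical; no scale to compare against.
        return []
    anomalies = []
    for task, value in runtimes.items():
        score = 0.6745 * (value - median) / mad
        if abs(score) > threshold:
            anomalies.append(task)
    return anomalies


if __name__ == "__main__":
    sample = {"t1": 10.0, "t2": 11.0, "t3": 9.5, "t4": 10.5, "t5": 42.0}
    print(find_runtime_anomalies(sample))
```

A flagged task is only a candidate: deciding whether it reflects a bottleneck or a failure would still need the workflow's other performance data.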
INDIRA: Intelligent network deployment intent rendering application
This project was supported by LBNL LDRD funding.
INDIRA aims to become a personal assistant for users by using
natural language processing, OWL and AI to converse with users to understand network requirements.
These requirements are then automatically translated into network code, and the appropriate tools are called through built-in orchestrators that communicate with diverse tools.
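The translation step can be pictured with a toy example that turns a one-line request into a structured provisioning request. This is only a sketch of the general idea: a regular expression stands in for INDIRA's natural language processing and OWL reasoning, and the phrase format, the TransferIntent fields, the site names, and the output schema are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class TransferIntent:
    """Structured form of a user's request (hypothetical fields)."""
    source: str
    destination: str
    bandwidth_gbps: float


# Assumed phrase shape: "connect <src> to <dst> at <n> Gbps".
PATTERN = re.compile(
    r"connect\s+(?P<src>\w+)\s+to\s+(?P<dst>\w+)\s+at\s+"
    r"(?P<bw>\d+(?:\.\d+)?)\s*gbps",
    re.IGNORECASE,
)


def parse_intent(text):
    """Extract a TransferIntent from free text, or None if nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return TransferIntent(match["src"], match["dst"], float(match["bw"]))


def render_request(intent):
    """Render the intent as a request an orchestrator could consume.

    The dictionary layout is an assumption, not a real tool's schema.
    """
    return {
        "endpoints": [intent.source, intent.destination],
        "min_bandwidth_gbps": intent.bandwidth_gbps,
    }


if __name__ == "__main__":
    intent = parse_intent("Please connect siteA to siteB at 10 Gbps")
    print(render_request(intent))
```

Returning None for unmatched text leaves room for a conversational follow-up, in which the assistant asks the user for the missing details.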
ESnet is researching various techniques for improving how intent is captured for science workflow performance. A video of our SC16 demonstration on INDIRA for intent-based data transfers is available here.