Advancing Software Understanding and Composition through Machine Learning-driven Semantic Analysis

The rapid growth of scientific software development has produced large and complex codebases, making it difficult for researchers to search, compare, and understand software repositories. Navigating these codebases efficiently is crucial for accelerating scientific discovery and fostering innovation. Being able to locate relevant code quickly reduces the time spent on software development, allowing researchers to focus on core scientific research rather than on managing code. It also supports better collaboration among researchers, promoting interdisciplinary advances and more comprehensive solutions to complex scientific problems. Recent work has used deep learning techniques to capture semantic similarities among repositories, enabling better collaboration, code reuse, and faster software development. However, these techniques have not been fully explored, especially in the areas of deeper semantic analysis and workflow composition.

Primary Supervisor: Dr Rosa Filgueira

Further Description

Workflows are crucial in scientific research because they organize various computational tasks, automate complex processes, and ensure reproducibility. Efficiently composing workflows is vital for integrating different software components, managing data flows, and optimizing resources. The current challenge lies in the manual composition of workflows and the inability to perform effective searches for workflows or their components.

This proposed PhD research aims to bridge this gap by delving deeper into Machine Learning (ML) for code analysis and synthesis, enhancing software understanding, workflow composition, and semantic searches. By advancing these areas, this research has the potential to significantly impact the efficiency and effectiveness of scientific research and software development processes. The development of ML-driven techniques for workflow composition will facilitate the automation and optimization of workflows, enabling researchers to focus more on innovation rather than on managing computational complexities.

Overview of research area

The primary goal of this research is to advance the understanding and composition of software through the integration and development of new ML techniques. The proposed work will focus on three interconnected objectives:

Enhanced Software Similarity Analysis: Building upon existing work such as RepoSim and RepoSnipy [1,2] and RepoGraph [3], this research will investigate novel techniques to further improve the identification of semantic similarities among software repositories. This involves exploring advanced deep learning models and embedding methods that capture semantic relationships even when code fragments have different syntactic structures. Additionally, new parsers, similar to inspect4py [4], will be created to extract the information needed for similarity analysis from repositories written in other languages (e.g., Java, R, and C).
Semantic-driven Workflow and Component Composition: The research will develop a novel framework that leverages ML-driven semantic analysis to streamline the composition of workflows and software components. Inspired by frameworks like Laminar [5], which integrates serverless computing with deep learning capabilities, this framework will use ML-enhanced techniques for code summarization, completion, and understanding. This will facilitate the efficient assembly of complex software systems, allowing for more intuitive and automated management of workflows and components, thereby benefiting both researchers and practitioners.
Advanced Semantic Search for Software and Workflows: This research will extend the concept of semantic search to encompass not only software repositories but also workflows and workflow components. By employing and creating cutting-edge ML algorithms, a semantic search engine will be designed to retrieve relevant software and workflow components based on their inherent meanings rather than relying solely on keyword matches.
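As a purely illustrative sketch of the embedding-based similarity and search idea that runs through these objectives (not the method used by RepoSim, RepoSnipy, or Laminar), the Python snippet below ranks items by cosine similarity between vectors. The item names and the three-dimensional vectors are hypothetical and hand-written; in practice they would be produced by a learned embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, index):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings, hand-written for illustration only.
# A real system would obtain these from a trained model.
toy_index = {
    "seismic-workflow": [0.9, 0.1, 0.0],
    "image-classifier": [0.1, 0.9, 0.2],
    "stream-processing-component": [0.7, 0.2, 0.3],
}
toy_query = [0.8, 0.1, 0.1]  # hypothetical embedding of a user's query

ranking = rank_by_similarity(toy_query, toy_index)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because ranking depends only on vector geometry, the same function could in principle serve repositories, workflows, and workflow components alike, provided they are embedded in a shared vector space.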

Potential research questions

How can advanced deep learning models be leveraged to improve the identification of semantic similarities among diverse software repositories?
What novel ML-driven techniques can be developed to streamline the composition of workflows and software components?
How can semantic search be extended to effectively retrieve software and workflow components based on their inherent meanings?

Student Requirements

A UK 2:1 honours degree, or its international equivalent, in a relevant subject such as computer science and informatics, physics, mathematics, engineering, biology, chemistry, or geosciences.

You must be a competent programmer in at least one of C, C++, Python, Fortran, or Java and should be familiar with mathematical concepts such as algebra, linear algebra, probability, and statistics.

English language requirements as set by the University of Edinburgh.

Student Recommended/Desirable Skills and Experience

Experience with Machine Learning (ML): Proficiency in applying ML techniques to real-world problems, particularly in the context of software engineering.
Knowledge of Deep Learning Frameworks: Familiarity with popular deep learning frameworks such as TensorFlow, PyTorch, or Keras.
Understanding of Software Development Lifecycle: Insight into various stages of software development, including version control, testing, and deployment.
Experience with Natural Language Processing (NLP): Ability to work with NLP techniques, as they are often relevant to semantic analysis tasks.

How to apply

Applications should be made via the University application form, available via the degree finder. Please note the proposed supervisor and project title from this page and include them in your application. You may also find this page a useful starting point for a research proposal, and we strongly recommend discussing it further with the potential supervisor.

References

[1] Mapping the repository landscape: harnessing similarity with RepoSim and RepoSnipy. Zihao Li and Rosa Filgueira. 2023 IEEE 19th International Conference on e-Science (e-Science).

[2] Multi-Level AI-Driven Analysis of Software Repository Similarities. Hongling Zhang, Leyu Zhang, Lei Fang, and Rosa Filgueira. 2024 IEEE 20th International Conference on e-Science (e-Science) [submitted].

[3] RepoGraph: a novel semantic code exploration tool for Python repositories based on knowledge graphs and deep learning. Christopher Williams and Rosa Filgueira. 2023 IEEE 19th International Conference on e-Science (e-Science).

[4] Inspect4py: a knowledge extraction framework for Python code repositories. Rosa Filgueira and Daniel Garijo. Proceedings of the 19th International Conference on Mining Software Repositories.

[5] Laminar: A New Serverless Stream-based Framework with Semantic Code Search and Code Completion. Zaynab Zahra, Zihao Li, and Rosa Filgueira. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis.