A Graph-Based End-to-End Defect Prediction Framework
Summary:
Software defect prediction is one of the most explored research topics in software engineering. Modern software applications are often highly complex and prone to failure. Software defect prediction (SDP) can flag components at risk of failure in the early stages of development and help developers schedule and prioritize their testing efforts, reduce costs, and ensure software quality. Traditional statistical defect prediction tools are often time-consuming and ineffective. We argue that machine learning algorithms, with their capacity for learning, classification, and knowledge representation, can capture useful properties of code that are difficult to extract manually or with alternative methods. However, the performance of machine learning tools depends on the quality of the input data. Because the programming languages used in modern applications have increasingly complex characteristics that are difficult to understand, a powerful code representation is a prerequisite: one that explores software artifacts in depth and captures useful information from different levels of abstraction of the programs. Many efforts have therefore been made to build an efficient defect prediction tool, but the results achieved so far do not yet reach high performance.
In this thesis, we focus on software defect prediction and propose a novel deep learning-based technique to enhance existing defect prediction approaches. To build predictive models, previous studies relied on classic machine learning algorithms and traditional handcrafted features (i.e., software metrics). These metrics are designed manually to capture the static properties of the code. Such methods are time-consuming and inaccurate because they fail to capture the semantics of programs. More recently, researchers have exploited deep learning algorithms based on either tree representations of programs or precise graphs representing program execution flows. However, these models do not offer high performance and do not cover all types of bugs. In particular, they often fail to capture intra-procedural dependencies, to which several bugs are related. Such information is important for modelling program functionality and can lead to more accurate defect prediction.
The training procedure requires sufficient historical data from a project to build a prediction model. It is therefore impractical for new projects, which have little or no historical data. An alternative is to train the prediction model on data from other projects. Traditional approaches rely on metrics to select appropriate projects whose characteristics are close to those of the new project. However, metrics alone do not capture enough meaningful information about projects to choose the candidates that generalize best to the new project. Differences between projects in aspects such as architecture, developer experience, coding style, and functionality make the selection task more complicated.
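For illustration only, the traditional metric-based selection described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method proposed in this thesis: the function names, the two metrics, and all values are hypothetical, and closeness is assumed to mean Euclidean distance between per-project mean metric vectors.

```python
import math


def mean_vector(rows):
    """Average each metric column over all modules of a project."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def rank_source_projects(target_rows, candidates):
    """Rank candidate projects from closest to farthest from the target.

    Assumption: similarity is the Euclidean distance between mean metric
    vectors. Real metric-based selectors typically normalize the metrics
    first; that step is omitted here for brevity.
    """
    target_mean = mean_vector(target_rows)
    distances = {
        name: math.dist(target_mean, mean_vector(rows))
        for name, rows in candidates.items()
    }
    return sorted(distances, key=distances.get)


# Hypothetical data: one row per module,
# columns are (lines of code, cyclomatic complexity).
target = [[100, 5], [200, 9]]
candidates = {
    "project_a": [[110, 6], [190, 8]],
    "project_b": [[900, 40], [1100, 60]],
}

ranking = rank_source_projects(target, candidates)
```

Because such a ranking depends only on aggregate metric values, it cannot distinguish projects that differ in architecture or coding style but happen to share similar metrics, which is the limitation noted above.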
In this thesis, we address two main tasks. First, to bridge the gap between program dependencies and defect prediction features, we propose an end-to-end deep learning algorithm that automatically learns a powerful code representation covering different levels of abstraction of the code, such as syntax, semantics, and dependencies, and then trains a defect prediction classifier on these complex features. The experimental results indicate that our approach can significantly improve on existing defect prediction approaches. Second, we propose a novel method for choosing the best candidate projects for a project that lacks historical data. We evaluate the effectiveness of our method on 10 open-source projects. The results show that carefully selecting the source projects can boost the performance of existing techniques, and even that of our proposed defect prediction framework, which otherwise uses all the other available projects without any project selection strategy.
Keywords: Defect Prediction; Deep Learning; Code Property Graph; Graph Convolutional Neural Network; Abstract Syntax Tree; Control Flow Graph; Program Dependency Graph; Program Analysis.