Integrated Solutions in Machine Learning: A Triad of Software Defect Prediction, Graph Drawing Optimization, and Insurance Classification Models / Rida Ghafoor. - (2025).

Integrated Solutions in Machine Learning: A Triad of Software Defect Prediction, Graph Drawing Optimization, and Insurance Classification Models

Rida Ghafoor
2025

Abstract

Ensuring software reliability through early-stage defect prevention and prediction is essential given today’s increasingly complex software systems. Automated testing, particularly when driven by machine learning approaches and natural language models, has emerged as a practical way to produce bug-free and efficient code. However, the field of Software Defect Prediction (SDP) faces challenges related to data quality and the limited generalizability of empirical studies.

A significant aspect of software visualization is the generation of aesthetically pleasing graph layouts, for which gradient-descent (GD) based schemes have been used to optimize differentiable loss functions. Some graph properties, however, are not easily expressed through differentiable functions. The Graph Neural Drawer (GND) framework recently proposed using neural models to express non-differentiable losses, particularly for edge intersection, so that they can subsequently be optimized via GD.

With the exponential growth of insurance documents, the classification of insurance text has become crucial. Natural language processing (NLP) and deep learning models such as BERT have proven effective in text mining and classification. However, applying general BERT models to specialized domains like insurance often yields unsatisfactory accuracy because the text data shifts from a general domain to domain-specific corpora.

This thesis addresses these challenges and explores novel approaches to software defect prediction, graph drawing, and specialized-domain text classification using BERT models. The goal is to enhance software development processes and contribute to the evolving landscape of machine learning-driven solutions in software engineering.

The thesis presents a framework for automating software defect prediction, with a primary focus on improving software development and testing processes. The framework targets eight distinct defect types and identifies defects within software code snippets. Its foundational model for defect prediction is built upon CodeBERT. The research also introduces a multi-class dataset named "SoftwareBugHarbor," designed to overcome the limitations of previous binary datasets by supporting multi-class classification and enhanced accuracy. The dataset is curated and contains over 5300 code instances.

The thesis further proposes a novel approach to improving graph layout readability by means of linear splines. The method builds on principles from Graph Neural Drawers (GND) and employs a neural model to identify crossing edges and optimize their relative positions. The control points of the linear splines are treated as "fake" vertices, which enhances the underlying layout optimization process. Qualitative and quantitative analyses across multiple graphs, optimizing various aesthetic losses, demonstrate the effectiveness of the proposed method.

Another facet of the research introduces RiskBERT, a domain-specific language representation model pre-trained on insurance corpora. RiskBERT is developed through further pre-training of LegalBERT and is applied to downstream clause/provision classification tasks, with an experimental study conducted on two datasets. The research showcases the model’s capability to analyze complex insurance texts, outperforming the previous state-of-the-art BERT model in text classification tasks.

Collectively, the thesis contributes to advancing the field of software defect prediction, offering a holistic approach to automation, improved graph layouts, and domain-specific language representation.
2025
Marco Gori
PAKISTAN
Rida Ghafoor
Files in this item:

File: Doctoral Thesis - Rida Ghafoor Hussain.pdf
Access: open access
License: Creative Commons
Size: 2.74 MB
Format: Adobe PDF

Documents in FLORE are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this resource: https://hdl.handle.net/2158/1413692