Theses
The student is required to conduct advanced research on a topic related to data science. The topic is chosen in consultation with the thesis supervisor, and the student presents a research plan in writing before research starts. The student must submit the completed research in thesis format to an examination committee and give an oral presentation of the thesis. The student is expected to furnish evidence of competence in research and a sound understanding of the data science associated with the research.
The Thesis is presented to the university in partial fulfillment of the requirements for the degree of Master of Science in the program of Data Science and Analytics.
de Guzman, Patrick – Transformer Models for Automated Bug Triaging and Duplicate Bug Detection (Supervisor: Mucahit Cevik; Co-supervisor: Ayse Basar Bener)
In the software engineering field, developer teams must handle bug reports of varying sources and formats to maintain and optimize software applications as issues arise. A team’s workflow for handling bugs involves multiple stages to review, assess, assign, and resolve bugs. For teams maintaining large-scale applications, which are exposed to greater volumes and varieties of bug reports, streamlining these processes is vital for efficient operations. This thesis focuses on bug triaging and duplicate bug detection as preliminary processing steps for the automated bucketing and assignment of bugs. In the bug triaging task, Transformer-based models are found to outperform baseline approaches in mean Rank-5, Rank-10, and Mean Reciprocal Rank across several open-source datasets from various software projects. In the duplicate bug detection task, similarity learning is employed, and Transformer-based siamese models with domain adaptation are shown to improve similarity learning capabilities, with gains in mean Area under the Curve, Recall-rate@k, and Mean Reciprocal Rank performance.
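The ranking metrics above can be made concrete with a minimal sketch. In the thesis, report embeddings come from a Transformer-based siamese model; here a toy bag-of-words embedding stands in, so only the ranking of candidate duplicates and the Mean Reciprocal Rank (MRR) computation are illustrated. All function names below are illustrative, not from the thesis.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for a learned embedding: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(query: str, candidates: list[str]) -> list[int]:
    """Return candidate indices sorted from most to least similar."""
    q = embed(query)
    sims = [(cosine(q, embed(c)), i) for i, c in enumerate(candidates)]
    return [i for _, i in sorted(sims, reverse=True)]

def mean_reciprocal_rank(rankings: list[list[int]], gold: list[int]) -> float:
    """MRR: average over queries of 1 / (1-indexed rank of the true duplicate)."""
    return sum(1.0 / (r.index(g) + 1) for r, g in zip(rankings, gold)) / len(gold)

if __name__ == "__main__":
    query = "app crashes when saving a file"
    candidates = [
        "crash on file save in the editor",
        "dark mode colours are wrong",
        "slow startup on windows",
    ]
    ranking = rank_candidates(query, candidates)
    print(ranking[0])  # index of the candidate ranked most similar
    print(mean_reciprocal_rank([ranking], gold=[0]))
```

Recall-rate@k is computed analogously: check whether the true duplicate appears among the top k indices of each ranking. A siamese model replaces `embed` with a shared encoder applied to both reports, trained so that duplicates score higher than non-duplicates.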
Helmeczi, Robert Kraig – Few-Shot Learning for Text Classification and Its Applications in Essay Scoring and Software Engineering (Supervisor: Mucahit Cevik)
Few-shot learning—the ability to train models with access to limited data—has become increasingly popular in the natural language processing (NLP) domain, as large language models such as GPT and T0 have been empirically shown to achieve high performance on numerous tasks with access to just a handful of labeled examples. Smaller language models such as BERT and its variants have also been shown to achieve strong performance with just a handful of labeled examples when combined with few-shot learning algorithms like pattern-exploiting training (PET) and SetFit. The focus of this thesis is to investigate the performance of alternative few-shot learning approaches with BERT-based models. Specifically, vanilla fine-tuning, PET, and SetFit are compared across numerous BERT-based checkpoints over an array of training set sizes. To facilitate this investigation, applications of few-shot learning are considered in automatic essay scoring—the task of automatically grading written assessments—as well as in software engineering. For each task, high-performing techniques and their associated model checkpoints are identified through detailed empirical analysis. Our results establish PET as a strong few-shot learning approach, and our analysis shows that with just a few hundred labeled examples it can achieve performance near that of fine-tuning on full-sized data sets.
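The core idea behind PET can be sketched briefly: classification is reformulated as a cloze task, where a pattern wraps the input text around a mask token and a verbalizer maps each class label to a word the masked language model should predict. The pattern and verbalizer below are illustrative choices for a sentiment-style task, not the ones used in the thesis.

```python
# Illustrative PET-style reformulation of classification as a cloze task.
MASK = "[MASK]"

def pattern(text: str) -> str:
    """Wrap the input around a mask token (one of many possible patterns)."""
    return f"{text} It was {MASK}."

# Verbalizer: map each class label to a single token for the model to predict.
VERBALIZER = {"positive": "great", "negative": "terrible"}

def cloze_example(text: str, label: str) -> tuple[str, str]:
    """Pair a cloze-style input with its verbalized target token."""
    return pattern(text), VERBALIZER[label]

if __name__ == "__main__":
    x, y = cloze_example("The essay is well argued.", "positive")
    print(x)  # The essay is well argued. It was [MASK].
    print(y)  # great
```

A pretrained masked language model is then fine-tuned on these cloze examples, so that even a handful of labeled instances provides a training signal close to the model's pretraining objective—the reason PET performs well in the low-data regime.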