PhD Proposal: Building More Human-Aligned Foundation Models: Data and Algorithms
Remote
https://umd.zoom.us/j/3924717987?pwd=QWw0OG1LcS9uL1lxdi9QZHRhS3ZrQT09
The rise of large foundation models has underscored the critical challenge of ensuring their alignment with human values and intentions. Two fundamental elements, data and algorithms, play pivotal roles in advancing this alignment.

First, I will introduce a data-centric approach to improving large language models (LLMs) by using a strong LLM to filter out low-quality training data. The method significantly reduces training time, offers cost savings, and proves effective across different datasets, base models, and LLM filters, emphasizing the importance of data quality over quantity in instruction tuning.

Second, I will introduce a reward-model training algorithm that tackles reward hacking in Reinforcement Learning from Human Feedback (RLHF), where models exploit the reward system by generating overly verbose but less meaningful responses to earn higher scores. I will also present a new evaluation protocol that accurately measures the trade-off between response length and evaluation score. Experiments demonstrate that the method effectively reduces the correlation between reward and verbosity, leading to genuinely more helpful model outputs without excessive length.
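
As a rough illustration of the data-filtering idea, the sketch below scores each instruction-response pair with a strong LLM acting as a judge and keeps only highly rated examples. The rating prompt, the 1-5 scale, the threshold, and the `query_llm` interface are placeholders for illustration, not the exact setup used in the proposal.

```python
# Hypothetical sketch of LLM-based filtering of instruction-tuning data.
# `query_llm` stands in for a call to a strong LLM (e.g. an API client);
# the prompt, scale, and threshold below are illustrative assumptions.
from typing import Callable, Dict, List
import re

RATING_PROMPT = (
    "Rate the quality of the following instruction-response pair "
    "on a scale of 1 to 5. Reply with a single number.\n\n"
    "Instruction: {instruction}\nResponse: {response}\nScore:"
)

def score_example(query_llm: Callable[[str], str], example: Dict[str, str]) -> float:
    """Ask the strong LLM for a 1-5 quality score of one training example."""
    reply = query_llm(RATING_PROMPT.format(**example))
    match = re.search(r"[1-5]", reply)
    return float(match.group()) if match else 0.0

def filter_dataset(query_llm: Callable[[str], str],
                   dataset: List[Dict[str, str]],
                   threshold: float = 4.0) -> List[Dict[str, str]]:
    """Keep only examples the strong LLM rates at or above the threshold."""
    return [ex for ex in dataset if score_example(query_llm, ex) >= threshold]

if __name__ == "__main__":
    # Toy demo with a dummy scorer standing in for the strong LLM filter.
    dummy_llm = lambda prompt: "5" if "Paris" in prompt else "2"
    data = [
        {"instruction": "Name the capital of France.", "response": "Paris."},
        {"instruction": "Explain gravity.", "response": "idk"},
    ]
    print(filter_dataset(dummy_llm, data))  # keeps only the first example
```

Filtering to a small, high-quality subset is what yields the training-time and cost savings mentioned above: the fine-tuning set shrinks while downstream quality is preserved or improved.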
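
To make the verbosity-decorrelation idea concrete, the sketch below adds a penalty on the within-batch correlation between predicted reward and response length to a standard Bradley-Terry preference loss. This is one simple, assumed way to discourage a reward model from rewarding length, not the proposal's actual algorithm; the weight `lam` and the toy batch are arbitrary placeholders.

```python
# Hedged sketch: pairwise (Bradley-Terry) reward-model loss plus a penalty
# on the batch-level correlation between predicted reward and response length.
# Illustrates the idea of decorrelating reward from verbosity; all names and
# hyperparameters here are placeholders, not the proposal's exact method.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Standard preference loss: the chosen response should score above the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def length_correlation_penalty(rewards: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Squared Pearson correlation between rewards and response lengths in the batch."""
    r = rewards - rewards.mean()
    l = lengths.float() - lengths.float().mean()
    corr = (r * l).sum() / (r.norm() * l.norm() + 1e-8)
    return corr ** 2

def reward_model_loss(r_chosen, r_rejected, len_chosen, len_rejected, lam: float = 1.0):
    """Preference loss plus a term discouraging reward-length correlation."""
    rewards = torch.cat([r_chosen, r_rejected])
    lengths = torch.cat([len_chosen, len_rejected])
    return pairwise_reward_loss(r_chosen, r_rejected) + lam * length_correlation_penalty(rewards, lengths)

if __name__ == "__main__":
    # Toy batch: scalar rewards for four chosen/rejected pairs and their token lengths.
    r_c = torch.randn(4, requires_grad=True)
    r_r = torch.randn(4, requires_grad=True)
    len_c = torch.tensor([120, 80, 200, 60])
    len_r = torch.tensor([90, 150, 70, 110])
    loss = reward_model_loss(r_c, r_r, len_c, len_r)
    loss.backward()
    print(float(loss))
```

The evaluation protocol described above would then report score and response length jointly, so that a policy trained against such a reward model is credited for helpfulness rather than for sheer verbosity.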