Statistical Challenges in Modern Machine Learning and their Algorithmic Consequences

Talk
Yeshwanth Cherapanamjeri
Time: 02.19.2025, 11:00 to 12:00

The success of modern machine learning is driven, in part, by the availability of large-scale datasets. However, their immense scale also makes the effective curation of such datasets challenging. Many classical estimators, developed under the assumption of clean, well-behaved data, fare poorly when deployed in these settings. This unfortunate scenario raises both statistical and algorithmic challenges: What are the statistical limits of estimation in these settings, and can they be achieved by computationally efficient algorithms?
In this talk, I will compare and contrast the task of addressing these challenges in two natural, complementary settings: the first featuring extreme noise and the second, extreme bias.
In the first setting, we consider the problem of estimation with heavy-tailed data, where recent work has produced estimators achieving optimal statistical performance. However, these solutions are computationally impractical, and their analyses are tailored to the specific problem at hand. I will present a simple algorithmic framework that has yielded state-of-the-art estimators for a broad class of heavy-tailed estimation problems.
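As background for this setting, a classical heavy-tailed mean estimator is median-of-means: split the sample into blocks, average each block, and take the median of the block averages. The sketch below is only an illustration of the phenomenon, not the framework from the talk; the data distribution and block count are hypothetical choices.

```python
import numpy as np

def median_of_means(x, k):
    """Split the sample into k blocks and return the median of the block means.

    The median step makes the estimate robust to the occasional extreme
    block mean that heavy-tailed samples produce.
    """
    rng = np.random.default_rng(0)
    x = rng.permutation(x)          # guard against ordered input
    blocks = np.array_split(x, k)
    return float(np.median([b.mean() for b in blocks]))

# Hypothetical example: a heavy-tailed Lomax/Pareto sample with true mean 0.5.
rng = np.random.default_rng(1)
x = rng.pareto(3.0, size=100_000)   # mean = 1 / (3 - 1) = 0.5
est = median_of_means(x, k=100)
```

Unlike the plain sample mean, whose error is inflated by the distribution's heavy right tail, the median of block means concentrates near the true mean at a sub-Gaussian rate.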
Next, I will consider the complementary setting of extreme bias under the classical Roy model of self-selection, where bias arises from the strategic behavior of the data-generating agents. I will describe algorithmic approaches to counteract this bias, yielding the first statistically and computationally efficient estimators in this setting.
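To see where the bias in the Roy model comes from, consider its simplest form: each agent has two potential outcomes and reports only the larger one, so outcomes observed in a given sector are a selected, upward-biased subsample. The simulation below uses hypothetical parameters purely to exhibit this effect; it is not the estimator from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical parameters: two potential outcomes per agent with equal
# population means (both 0), unit variances, and mild correlation.
mean = [0.0, 0.0]
cov = [[1.0, 0.2],
       [0.2, 1.0]]
w = rng.multivariate_normal(mean, cov, size=n)

# Roy-style self-selection: each agent enters the sector with the
# higher outcome, and we only observe the outcome in the chosen sector.
choose1 = w[:, 0] > w[:, 1]
naive = w[choose1, 0].mean()   # naive mean over self-selected sector-1 agents
```

Although the true population mean of the sector-1 outcome is 0, the naive average over self-selected agents is strictly positive, which is exactly the bias an estimator in this setting must undo.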
Finally, I will conclude the talk with future directions on constructing high-quality datasets when the data are drawn from a diverse, heterogeneous range of sources of varying quality and quantity.