I’ve been reviewing the From Garbage to Gold framework, and I’m intrigued by its central claim: that expanding the predictor space with error‑prone variables can outperform perfect cleaning of a smaller set. The distinction between Predictor Error and Structural Uncertainty feels like a powerful reframing of the ‘dirty data’ problem.
I’d love to discuss the practical implications of this—especially how this theory might reshape feature selection strategies, data architecture design, and the balance between cleaning vs. redundancy in real-world enterprise environments. How do others see this influencing ML workflows, particularly in high‑dimensional tabular settings?
Practically, we see this shifting enterprise ML workflows towards what we term Proactive Data-Centric AI (P-DCAI) in the paper. Instead of the traditional, reactive approach of aggressively cleaning and pruning variables — which often strips away the redundancy needed to capture the full latent signal — P-DCAI treats data architecture as an upfront strategic design choice. Feature selection becomes less about finding pristine, uncorrelated inputs and more about deliberately engineering a portfolio optimized for "novelty" (to comprehensively cover all underlying latent drivers) and "informative redundancy" (to ensure statistical reliability even when individual predictors are highly error-prone).
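As a toy illustration of that "novelty vs. informative redundancy" idea — my own correlation-based sketch in Python, not the formal criterion from the paper — you can screen a candidate feature by how strongly it correlates with the portfolio you've already selected. Low maximum correlation suggests the candidate tracks a latent driver the portfolio doesn't yet cover:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Two hidden latent drivers (unobserved in practice; simulated here).
Z1, Z2 = rng.normal(size=n), rng.normal(size=n)

# Already-selected portfolio: two noisy proxies of latent Z1 only.
selected = np.column_stack([Z1 + rng.normal(scale=1.0, size=n) for _ in range(2)])

def novelty(x, S):
    """Toy novelty score: 1 - max |corr| with the selected set.
    High values suggest the candidate covers a latent driver the
    portfolio does not yet capture (my heuristic, not the paper's)."""
    c = np.abs([np.corrcoef(x, S[:, j])[0, 1] for j in range(S.shape[1])])
    return 1.0 - c.max()

another_z1_proxy = Z1 + rng.normal(scale=1.0, size=n)  # informative redundancy
z2_proxy = Z2 + rng.normal(scale=1.0, size=n)          # genuine novelty

print("redundant proxy novelty:", round(novelty(another_z1_proxy, selected), 2))
print("novel proxy novelty:    ", round(novelty(z2_proxy, selected), 2))
```

The redundant proxy still has value under the framework — it supports error-averaging on Z1 — but the score makes clear it adds no new latent coverage, while the Z2 proxy does.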
Hello HN,
I'm Terry, the first author.
I spent the last 2.5+ years formalizing this theory to explain a strange anomaly I kept encountering in industry: models trained on vast, incredibly dirty, uncurated datasets were sometimes achieving state-of-the-art predictive performance, completely defying the "Garbage In, Garbage Out" mantra.
The TL;DR of the paper [https://arxiv.org/abs/2603.12288] is a formal mathematical proof showing why adding more error-prone variables can actually beat cleaning fewer variables to perfection.
The key is recognizing that complex systems often generate data through underlying latent structures. This allows for the partitioning of predictor-space noise into "Predictor Error" and "Structural Uncertainty," and the results follow logically. The paper also formally connects latent architecture to the prerequisites for Benign Overfitting — showing that the structural conditions that enable modern overparameterized models to generalize well arise naturally from latent generative processes.
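A quick numerical sketch of that partition (my own toy setup, not the paper's notation): when many predictors are noisy views of the same latent Z, their Predictor Error averages away as you add more of them, while the Structural Uncertainty in the outcome is untouched by any amount of predictor collection:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Latent driver Z and outcome Y = Z + u.
# u is "Structural Uncertainty": irreducible, observed by no predictor.
Z = rng.normal(size=n)
u = rng.normal(scale=0.5, size=n)
Y = Z + u

def avg_proxy(k, error_sd=2.0):
    """Average of k error-prone proxies X_j = Z + e_j,
    where e_j is "Predictor Error" (independent across proxies)."""
    E = rng.normal(scale=error_sd, size=(n, k))
    return (Z[:, None] + E).mean(axis=1)

# Averaging more dirty proxies drives the composite toward Z itself:
# residual predictor-error variance shrinks roughly like error_sd**2 / k.
for k in (1, 10, 100):
    resid_var = np.var(avg_proxy(k) - Z)
    print(f"k={k:3d}  residual predictor-error variance ≈ {resid_var:.3f}")
```

That 1/k shrinkage is the elementary mechanism behind the breadth-beats-cleaning result: each extra dirty proxy is another independent draw against the same latent.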
The theory applies broadly across domains, but the work began as an attempt to explain a specific peer-reviewed result at Cleveland Clinic Abu Dhabi — published in PLOS Digital Health [https://journals.plos.org/digitalhealth/article?id=10.1371/j...] — where we achieved 0.909 AUC predicting stroke/MI in 558k patients using thousands of uncurated EHR variables with no manual cleaning.
Important caveat: as detailed in the paper, this isn't a silver bullet. The framework strictly requires data with a latent hierarchical structure (e.g., medical diagnoses driven by unmeasured physiology, stock prices driven by hidden sentiment, sensor readings driven by underlying physical states). It also means your pre-processing effort shifts from data hygiene to data architecture.
I included a fully annotated R simulation in the repo so you can see exactly how "Dirty Breadth" beats "Clean Parsimony."
My team and I are currently operationalizing this into warehouse-native infrastructure (Snowflake, Databricks, etc.) because 80% of enterprise data is tabular, and companies are burning massive amounts of their ML budgets on data cleaning pipelines that they might not actually need.
I would love to hear your thoughts or criticisms on the theory, or how you handle high-dimensional noise in your own tabular pipelines.
I'll be hanging out in the comments to answer any questions!