Artificial Intelligence

From Data Pipelines to AI Outcomes: Quantifying the Impact of Data Engineering Decisions on Machine Learning Reliability

Authors: Thuy Thu Nguyen

The reliability and performance of machine learning (ML) systems in production depend critically on data engineering decisions made throughout the pipeline lifecycle. This comprehensive technical review synthesizes findings from 434 peer-reviewed publications spanning 2018–2026 to quantify how upstream data collection, mid-stream preprocessing and feature engineering, and downstream versioning and monitoring decisions impact ML outcomes. We examine production systems across cybersecurity, healthcare, finance, and cloud-native platforms, analyzing technical frameworks including Apache Kafka, Kubeflow, MLflow, and emerging feature stores. Our analysis reveals that data quality issues account for 60–80% of ML system failures in production, with data engineering decisions influencing model accuracy by up to 40 percentage points. We identify critical decision points across the pipeline, quantify their impacts through empirical evidence, and provide actionable frameworks for practitioners. Key findings include: (1) streaming architectures reduce latency by 10–100× while maintaining accuracy within 2–5% of batch systems; (2) automated data validation catches 70–90% of quality issues before model training; (3) feature stores reduce feature engineering time by 50–70% while improving consistency; and (4) comprehensive lineage tracking enables 3–5× faster debugging of production failures. This review establishes data-centric AI as essential for reliable ML systems and identifies critical gaps in cost-benefit analysis, cross-domain generalization, and standardized impact metrics.

Comments: 45 Pages. (Note by viXra Admin: Author name is required in the article; please submit article written with AI assistance to ai.viXra.org)

Submission history

[v1] 2026-01-19 21:07:20

Unique-IP document downloads: 538 times

Vixra.org is a pre-print repository rather than a journal. Articles hosted may not yet have been verified by peer-review and should be treated as preliminary. In particular, anything that appears to include financial or legal advice or proposed medical treatments should be treated with due caution. Vixra.org will not be responsible for any consequences of actions that result from any form of use of any documents on this website.
