Structured Data Preparation Pipeline for Machine Learning-Applications in Production

Maik Frye, Robert Heinrich Schmitt
Abstract:
The application of machine learning (ML) is becoming increasingly common in production. However, many ML projects fail because of poor data quality. To increase data quality, the data needs to be prepared. Because a wide variety of requirements must be considered, data preparation (DPP) is a challenging task and accounts for 80 % of an ML project's duration [1]. Today, DPP is still performed manually and individually, which makes it essential to structure the preparation in order to achieve high-quality data in a reasonable amount of time. We therefore present a holistic concept for a structured and reusable DPP pipeline for ML applications in production. In a first step, requirements for DPP are determined based on project experience and detailed research. Subsequently, individual steps and methods of DPP are identified and structured. The concept is successfully validated in two production use cases by preparing data sets and implementing ML algorithms.
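As an illustration only, the idea of a structured and reusable preparation pipeline can be sketched as an ordered list of small, interchangeable steps. The abstract does not name the individual steps or methods, so the three steps below (duplicate removal, median imputation, min-max scaling), all function names, and the sample sensor records are assumptions made for this sketch and not the authors' method.

```python
from statistics import median
from typing import Callable, Dict, List, Optional

# Hypothetical record type: one measurement row, None marks a missing value.
# All rows are assumed to share the same column names.
Row = Dict[str, Optional[float]]
Step = Callable[[List[Row]], List[Row]]


def drop_duplicates(rows: List[Row]) -> List[Row]:
    """Keep only the first occurrence of each identical record."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_median(rows: List[Row]) -> List[Row]:
    """Replace missing values (None) with the median of their column."""
    if not rows:
        return []
    result = [dict(row) for row in rows]
    for column in rows[0]:
        observed = [row[column] for row in rows if row[column] is not None]
        if not observed:
            continue  # no observed values to impute from; leave column as-is
        fill = median(observed)
        for row in result:
            if row[column] is None:
                row[column] = fill
    return result


def scale_min_max(rows: List[Row]) -> List[Row]:
    """Rescale every column to [0, 1]; expects no missing values."""
    if not rows:
        return []
    result = [dict(row) for row in rows]
    for column in rows[0]:
        values = [row[column] for row in rows]
        low, high = min(values), max(values)
        span = high - low
        for row in result:
            # A constant column carries no spread, so map it to 0.0.
            row[column] = (row[column] - low) / span if span else 0.0
    return result


def run_pipeline(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Apply the preparation steps in the given order."""
    for step in steps:
        rows = step(rows)
    return rows


# Made-up sample data, not taken from the paper's use cases.
raw = [
    {"temperature": 20.0, "pressure": 1.0},
    {"temperature": 20.0, "pressure": 1.0},  # exact duplicate
    {"temperature": None, "pressure": 3.0},  # missing value
    {"temperature": 40.0, "pressure": 2.0},
]
prepared = run_pipeline(raw, [drop_duplicates, impute_median, scale_min_max])
```

Because each step takes and returns a list of records, the same step list can be reused on another data set, or reordered and extended, without changing `run_pipeline`.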
Keywords:
Artificial Intelligence, Machine Learning, Data Preparation, Data Quality
Download:
IMEKO-TC10-2020-034.pdf
DOI:
-
Event details
IMEKO TC:
TC10
Event name:
TC10 Conference 2020 (ONLINE)
Title:

17th IMEKO TC10 Conference "Global trends in Testing, Diagnostics & Inspection for 2030" (2nd Conference jointly organized by IMEKO and EUROLAB aisbl)

Place:
Dubrovnik, CROATIA
Time:
20 October 2020 - 22 October 2020