Andrei Paleyes
SAP Inspiration Sessions, 07.06.2023
Software engineers?
Data scientists?
ML/data engineers?
Other?
Venture Beat, 2019
“Why do 87% of data science projects never make it into production?”
InfoWorld, 2021
“85% of AI and machine learning projects fail to deliver, and only 53% of projects make it from prototypes to production.”
Capital One and Forrester, 2022
“73% of respondents find transparency, traceability, and explainability of data flows challenging.”
Many reasons
e.g. model accuracy vs business value
or computational and labour costs
or skillset
"150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com", Bernardi et al., KDD 2019
"Challenges in Deploying Machine Learning: A Survey of Case Studies", Paleyes et al., ACM Computing Surveys, 2022
Data management in modern software is often a mess
“Effective big data mining at scale doesn't begin or end with what academics would consider data mining”
“Data scientists expend a large amount of effort to understand the data available to them, before they even begin any meaningful analysis”
“Exploratory data analysis always reveals data quality issues”
Software Services
that provides a function, or many functions,
known as interface or API,
that clients* can reuse,
together with policies to control its usage.
*A client can be anything: another software, a person, a hardware.
https://en.wikipedia.org/wiki/Service_(systems_architecture)
However...
Image credit: Ben Morris, https://www.ben-morris.com/microservices-rest-and-the-distributed-big-ball-of-mud/
“Twitter is powered by many loosely-coordinated services.”
“Since a single user action may involve many services, a data scientist wishing to analyze user behavior must first identify all the disparate data sources involved.”
“Services are normally developed and operated by different teams, which may adopt different conventions for storing and organizing log data.”
Build software with data as the first priority!
Or... roll on dataflow!
"Data flow schemas", Dennis et al., International Symposium on Theoretical Programming, 1974
Control flow: instructions are executed one after another, classic von Neumann architecture.
Dataflow: instruction is ready to execute as soon as all its inputs are available.
Control flow is about operations and their order
Dataflow is about data routes and transformations
"Towards better data discovery and collection with flow-based programming", Paleyes et al., DCAI Workshop, NeurIPS 2021
Image credit: Compex IT, https://www.compexit.co.uk/where-is-your-business-data-kept/
"An empirical evaluation of flow based programming in the machine learning deployment context", Paleyes et al., CAIN 2022
"Assuring the machine learning lifecycle: Desiderata, methods, and challenges", Ashmore et al., ACM Computing Surveys, 2021
"Decision provenance: Harnessing data flow for accountable systems", Singh et al., IEEE Access, 2018
"Desiderata for next generation of ML model serving", Akoush et al., DMML Workshop, NeurIPS 2022
"A Primer on Provenance: Better understanding of data requires tracking its history and context.", Carata et al., ACM Queue, 2014
Image credit: redgate, https://www.red-gate.com/hub/product-learning/sql-data-catalog/improving-the-quality-of-data-governance-where-to-start
"Dataflow graphs as complete causal graphs", Paleyes et al., CAIN 2023
"Causal fault localisation in dataflow systems", Paleyes et al., EuroMLSys 2023
"Causal fault localisation in dataflow systems", Paleyes et al., EuroMLSys 2023
"Intellectual Debt: With Great Power Comes Great Ignorance", Zittrain J, Medium: Berkman Klein Center Collection, 2019
But also
"Real-world Machine Learning Systems: A survey from a Data-Oriented Architecture Perspective", Cabrera et al., 2023
"Machine learning from innovation to deployment: A strategic research agenda for AutoAI", ML@CL, 2022
https://mlatcl.github.io/
https://paleyes.info/
✉ ap2169@cam.ac.uk