Pasteur Labs, 13.10.2023
Why did it fail?
What were the challenges?
“Effective big data mining at scale doesn't begin or end with what academics would consider data mining”
“Data scientists expend a large amount of effort to understand the data available to them, before they even begin any meaningful analysis”
“Exploratory data analysis always reveals data quality issues”
... software services
that provides a function, or many functions,
known as an interface or API,
that clients* can reuse,
together with policies to control its usage.
*A client can be anything: another piece of software, a person, or a hardware device.
https://en.wikipedia.org/wiki/Service_(systems_architecture)
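The definition above can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen only for illustration: `GreetingService` stands in for a service, its `greet` method is the interface, and a per-client call quota plays the role of a usage policy.

```python
from dataclasses import dataclass, field
from typing import Dict


class QuotaExceeded(Exception):
    """Raised when a client has used up its allowed number of calls."""


@dataclass
class GreetingService:
    # Policy (hypothetical): each client may call the service this many times.
    max_calls_per_client: int = 2
    # Internal bookkeeping of calls made per client.
    _calls: Dict[str, int] = field(default_factory=dict, init=False, repr=False)

    def greet(self, client_id: str, name: str) -> str:
        """The interface (API): the one function that clients can reuse."""
        used = self._calls.get(client_id, 0)
        if used >= self.max_calls_per_client:
            raise QuotaExceeded(f"client {client_id!r} exceeded its quota")
        self._calls[client_id] = used + 1
        return f"Hello, {name}!"


service = GreetingService(max_calls_per_client=2)
print(service.greet("client-a", "Ada"))  # prints "Hello, Ada!"
```

The client only sees `greet`; how the quota is stored and enforced stays inside the service.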
However...
“Twitter is powered by many loosely-coordinated services.”
“Since a single user action may involve many services, a data scientist wishing to analyze user behavior must first identify all the disparate data sources involved.”
“Services are normally developed and operated by different teams, which may adopt different conventions for storing and organizing log data.”
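A small sketch of what these quotes imply in practice, assuming two hypothetical services that log the same user's actions with different field names and timestamp formats. The record layouts below are invented for illustration and are not taken from any real system.

```python
from datetime import datetime, timezone

# Hypothetical log records from two independently developed services.
search_log = {"uid": "42", "ts": 1697190000, "event": "search"}
timeline_log = {"userId": 42, "time": "2023-10-13T09:40:05+00:00", "action": "click"}


def from_search(rec: dict) -> dict:
    """Adapter for the search service: string user id, Unix timestamp."""
    return {
        "user_id": int(rec["uid"]),
        "timestamp": datetime.fromtimestamp(rec["ts"], tz=timezone.utc),
        "event": rec["event"],
    }


def from_timeline(rec: dict) -> dict:
    """Adapter for the timeline service: integer user id, ISO 8601 timestamp."""
    return {
        "user_id": rec["userId"],
        "timestamp": datetime.fromisoformat(rec["time"]),
        "event": rec["action"],
    }


# Only after normalisation can the user's actions be placed on one timeline.
events = sorted(
    [from_timeline(timeline_log), from_search(search_log)],
    key=lambda e: e["timestamp"],
)
print([e["event"] for e in events])  # prints ['search', 'click']
```

Each additional service needs its own adapter, which is the integration effort a data scientist pays before any analysis starts.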
Build software with data as the first priority!
Or... roll on dataflow!
"Data flow schemas", Dennis et al., International Symposium on Theoretical Programming, 1974
Control flow: instructions are executed one after another, classic von Neumann architecture.
Dataflow: an instruction is ready to execute as soon as all its inputs are available.
Control flow is about operations and their order
Dataflow is about data routes and transformations
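The contrast can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not an implementation from any of the cited papers: `run_dataflow`, the node names, and the example graph are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


# Control flow: the programmer fixes the order of the operations.
def control_flow(x: float) -> float:
    a = x + 1      # step 1
    b = x * 2      # step 2
    return a + b   # step 3


# Dataflow: each node declares its inputs and fires once all of them exist.
Node = Tuple[List[str], Callable[..., float]]


def run_dataflow(nodes: Dict[str, Node], initial: Dict[str, float]):
    values = dict(initial)    # data produced so far
    fired: List[str] = []     # order in which nodes actually executed
    pending = dict(nodes)
    while pending:
        # A node is ready as soon as all its inputs are available.
        ready = [name for name, (inputs, _) in pending.items()
                 if all(i in values for i in inputs)]
        if not ready:
            raise ValueError("nodes can never fire: " + ", ".join(pending))
        for name in ready:
            inputs, fn = pending.pop(name)
            values[name] = fn(*(values[i] for i in inputs))
            fired.append(name)
    return values, fired


# The declaration order deliberately differs from the execution order.
graph: Dict[str, Node] = {
    "total": (["a", "b"], lambda a, b: a + b),
    "a": (["x"], lambda x: x + 1),
    "b": (["x"], lambda x: x * 2),
}
values, fired = run_dataflow(graph, {"x": 3.0})
print(fired)            # prints ['a', 'b', 'total']
print(values["total"])  # prints 10.0
```

Both versions compute the same result, but in the dataflow version the execution order is derived from data availability, and the graph itself records which data each step consumed.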
"Towards better data discovery and collection with flow-based programming", Paleyes et al., DCAI Workshop, NeurIPS 2021
"An empirical evaluation of flow based programming in the machine learning deployment context", Paleyes et al., CAIN 2022
"Assuring the machine learning lifecycle: Desiderata, methods, and challenges", Ashmore et al., ACM Computing Surveys, 2021
"Decision provenance: Harnessing data flow for accountable systems", Singh et al., IEEE Access, 2018
"Desiderata for next generation of ML model serving", Akoush et al., DMML Workshop, NeurIPS 2022
"A Primer on Provenance: Better understanding of data requires tracking its history and context.", Carata et al., ACM Queue, 2014
"Dataflow for machine learning operations", Paleyes and Rakowski, Kafka Summit 2023
"Dataflow graphs as complete causal graphs", Paleyes et al., CAIN 2023
"Causal fault localisation in dataflow systems", Paleyes et al., EuroMLSys 2023
"Intellectual Debt: With Great Power Comes Great Ignorance", Zittrain, Medium: Berkman Klein Center Collection, 2019
That can lead to...
"Real-world Machine Learning Systems: A survey from a Data-Oriented Architecture Perspective", Cabrera et al., 2023
"Machine learning from innovation to deployment: A strategic research agenda for AutoAI", ML@CL, 2022