Benefits of dataflow modeling for data management in software systems
Andrei Paleyes
The Ocean Cleanup Challenge, December 2022
About
- PhD student at the University of Cambridge
- Researching ML for systems and systems for ML
- Over a decade of software engineering experience
- Including several years of deploying ML at Amazon
Shoutout to the ML@CL group
- Neil Lawrence
- Eric Meissner
- Markus Kaiser
- Pierre Thodoroff
- Jessica Montgomery
- Christian Cabrera
ML deployment is hard!
VentureBeat, 2019
“Why do 87% of data science projects never make it into production?”
InfoWorld, 2021
“85% of AI and machine learning projects fail to deliver, and only 53% of projects make it from prototypes to production.”
But why?
Many reasons
e.g. model accuracy vs business value
or computational and labour costs
or skillset
One more reason
Data management in modern software is often a mess
Scaling Big Data Mining Infrastructure: The Twitter Experience
J Lin, D Ryaboy; ACM SIGKDD Explorations Newsletter, 2013
“Effective big data mining at scale doesn't begin or end with what academics would consider data mining”
“Data scientists expend a large amount of effort to understand the data available to them, before they even begin any meaningful analysis”
“Exploratory data analysis always reveals data quality issues”
Who's to blame?
Software Services
What is a service
A service is a piece of software,
that provides a function, or many functions,
known as its interface or API,
that clients* can reuse,
together with policies to control its usage.
*A client can be anything: another piece of software, a person, or a hardware device.
https://en.wikipedia.org/wiki/Service_(systems_architecture)
Service-oriented architecture is
- Scalable
- Flexible
- Modular
- Reliable
- Conducive to ownership
However...
Two services
Three services
Big ball of mud
https://www.ben-morris.com/microservices-rest-and-the-distributed-big-ball-of-mud/
Scaling Big Data Mining Infrastructure: The Twitter Experience
J Lin, D Ryaboy; ACM SIGKDD Explorations Newsletter, 2013
“Twitter is powered by many loosely-coordinated services.”
“Since a single user action may involve many services, a data scientist wishing to analyze user behavior must first identify all the disparate data sources involved.”
“Services are normally developed and operated by different teams, which may adopt different conventions for storing and organizing log data.”
What to do?
Build software with data as the first priority!
- Prioritise data while designing services - Götz et al., 2018
- Split data storage to encourage ownership - Data Meshes, Dehghani, 2019
- Cluster services by data domains - Domain-Oriented Microservice Architecture, Uber, 2020
Control flow vs data flow
Control flow is about operations and their order
Data flow is about data routes and transformations
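To make the contrast concrete, here is a minimal sketch of the same computation written both ways. The function names (`control_flow`, `drop_missing`, `mean`, `data_flow`) and the sample readings are hypothetical and chosen only for illustration; they do not come from any library mentioned in this talk.

```python
# Control flow view: the program is a sequence of operations,
# and the order of the steps is what the code spells out.
def control_flow(readings):
    cleaned = []
    for r in readings:            # step 1: drop missing values
        if r is not None:
            cleaned.append(r)
    total = 0.0
    for r in cleaned:             # step 2: aggregate
        total += r
    return total / len(cleaned)   # step 3: normalise


# Data flow view: the program is a route that data takes
# through a chain of transformations.
def drop_missing(stream):
    return (r for r in stream if r is not None)


def mean(stream):
    values = list(stream)
    return sum(values) / len(values)


def data_flow(readings):
    # readings -> drop_missing -> mean
    return mean(drop_missing(readings))


readings = [1.0, None, 2.0, 3.0]
print(control_flow(readings), data_flow(readings))
```

Both versions compute the same mean. In the second one each transformation only knows about the data it receives, so steps can be swapped, inserted, or inspected without touching the others.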
You are likely familiar with data flow already!
- Google TensorFlow
- Netflix Metaflow
- Node-RED
- Apache Spark
- Apache Beam
- Apache Kafka
- Spotify Luigi
- scikit-learn pipelines
Flow-based programming
- Known since the 1970s
- Data coupling - “loosest form of coupling”
- Software system as a data flow graph
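The idea of a software system as a data flow graph can be sketched in a few lines. This is a toy illustration, not how any particular flow-based programming framework is implemented: the `FlowGraph` class, its methods, and the node names are all hypothetical. Nodes are plain functions, edges carry data packets, and components are coupled only through the data they exchange.

```python
from collections import defaultdict, deque


class FlowGraph:
    """Hypothetical minimal dataflow graph: nodes transform data, edges route it."""

    def __init__(self):
        self.nodes = {}                  # node name -> function of one input
        self.edges = defaultdict(list)   # node name -> downstream node names

    def add_node(self, name, func):
        self.nodes[name] = func

    def connect(self, source, target):
        self.edges[source].append(target)

    def send(self, start, packet):
        """Push one packet into `start` and propagate outputs downstream.

        Returns the (node, output) pairs in the order they were produced.
        """
        trace = []
        queue = deque([(start, packet)])
        while queue:
            name, value = queue.popleft()
            output = self.nodes[name](value)
            trace.append((name, output))
            for target in self.edges[name]:
                queue.append((target, output))
        return trace


graph = FlowGraph()
graph.add_node("parse", lambda text: [float(x) for x in text.split(",")])
graph.add_node("total", sum)
graph.add_node("count", len)
graph.connect("parse", "total")
graph.connect("parse", "count")

trace = graph.send("parse", "1.5,2.5,4")
for node, output in trace:
    print(node, output)
```

Note that `total` and `count` know nothing about `parse` or about each other. The only thing that connects them is the packet on the edge, which is the data coupling the slide refers to.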
Benefits of dataflow design
- Data-oriented software
- Data discovery out of the box
- Data collection as simple as graph traversal
- Simple experimentation
- Enables causal reasoning, e.g. for monitoring
- Data lineage for security and compliance
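Several of these benefits reduce to traversing the graph. The sketch below assumes the system's dataflow graph is available as a plain adjacency dictionary; the `EDGES` contents and the helper names (`downstream`, `upstream`, `sources`) are hypothetical and serve only to illustrate discovery and lineage queries.

```python
# Hypothetical dataflow graph: each key is a data source or component,
# each value lists the nodes that consume its output.
EDGES = {
    "orders_db": ["clean_orders"],
    "clicks_log": ["sessionise"],
    "clean_orders": ["features"],
    "sessionise": ["features"],
    "features": ["model"],
    "model": [],
}


def downstream(graph, start):
    """Every node affected by `start`, found by depth-first traversal."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def upstream(graph, target):
    """Lineage of `target`: the same traversal over the reversed graph."""
    reversed_graph = {}
    for src, targets in graph.items():
        for t in targets:
            reversed_graph.setdefault(t, []).append(src)
    return downstream(reversed_graph, target)


def sources(graph):
    """Data discovery: nodes that nothing feeds into, i.e. raw data sources."""
    consumed = {t for targets in graph.values() for t in targets}
    return sorted(set(graph) - consumed)


print(sources(EDGES))                 # raw data entering the system
print(sorted(upstream(EDGES, "model")))       # what the model depends on
print(sorted(downstream(EDGES, "clicks_log")))  # what a change to the log affects
```

The `upstream` query answers lineage and compliance questions ("which data reaches the model?"), while `downstream` supports impact analysis and monitoring ("what breaks if this source degrades?").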
Dataflow references
- NoFlo.js, https://noflojs.org/
- What the Hell Is Flow-Based Programming?, Julian Matschinske, Medium 2018
- An Empirical Evaluation of Flow Based Programming in the Machine Learning Deployment Context, Paleyes, Cabrera and Lawrence, CAIN 2022
- Assessing software privacy using the privacy flow-graph, Tang and Østvold, MSR4P&S 2022
- Falkirk Wheel: Rollback Recovery for Dataflow Systems, Isard and Abadi, SoCC 2021
- Flow-based programming, Paul J. Morrison, 1994.
- Position: GDPR compliance by construction, Schwarzkopf et al., DMAH 2019
- Milan: An evolution of data-oriented programming, Tom Borchert, 2020
- Decision provenance: Harnessing data flow for accountable systems, Singh, Cobbe and Norval, IEEE Access 2018
- Pathways: Asynchronous distributed dataflow for ML, Barham et al., MLSys 2022
Open questions for dataflow design
- What is the infrastructure to use?
- Where are the tools?
- How to debug it?
- How much does it cost?
- How to monitor it and alert on it?