Dataflow software as complete causal graphs

Andrei Paleyes

Causal Digital Twins workshop, 04.01.2023

Ongoing work together with

Neil Lawrence, Siyuan Guo, Bernhard Schölkopf

Overview of the talk

  • Motivation
  • Motivation
  • Motivation
  • Our work

ML deployment is hard!

Venture Beat, 2019

“Why do 87% of data science projects never make it into production?”

InfoWorld, 2021

“85% of AI and machine learning projects fail to deliver, and only 53% of projects make it from prototypes to production.”

Capital One and Forrester, 2022

“73% of respondents find transparency, traceability, and explainability of data flows challenging.”

But why?

Many reasons

e.g. model accuracy vs business value

or computational and labour costs

or skillset

One more reason

Data management in modern software is often a mess

Scaling Big Data Mining Infrastructure: The Twitter Experience J Lin, D Ryaboy; ACM SIGKDD Explorations Newsletter, 2013

“Effective big data mining at scale doesn't begin or end with what academics would consider data mining”
“Data scientists expend a large amount of effort to understand the data available to them, before they even begin any meaningful analysis”
“Exploratory data analysis always reveals data quality issues”

Who's to blame?

Software Services

What is a service

A service is a piece of software,

that provides a function, or many functions,

known as an interface or API,

that clients* can reuse,

together with policies to control its usage.


*A client can be anything: another piece of software, a person, or a piece of hardware.

https://en.wikipedia.org/wiki/Service_(systems_architecture)

Service oriented architecture is

  • Scalable
  • Flexible
  • Modular
  • Reliable
  • Encourages ownership

However...

Two services

Three services

Big ball of mud

https://www.ben-morris.com/microservices-rest-and-the-distributed-big-ball-of-mud/

Scaling Big Data Mining Infrastructure: The Twitter Experience J Lin, D Ryaboy; ACM SIGKDD Explorations Newsletter, 2013

“Twitter is powered by many loosely-coordinated services.”
“Since a single user action may involve many services, a data scientist wishing to analyze user behavior must first identify all the disparate data sources involved.”
“Services are normally developed and operated by different teams, which may adopt different conventions for storing and organizing log data.”

What to do?

Build software with data as the first priority!
  • Prioritise data while designing services - Götz et al., 2018
  • Split data storage to encourage ownership - Data Meshes, Dehghani, 2019
  • Cluster services by data domains - Domain-Oriented Microservice Architecture, Uber, 2020

Or... roll on dataflow!

Control flow vs dataflow

Control flow is about operations and their order

Dataflow is about data routes and transformations

Boring slide

Control flow: instructions are executed one after another, as in the classic von Neumann architecture.

Dataflow: an instruction is ready to execute as soon as all of its inputs are available.
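
To make the contrast concrete, here is a minimal Python sketch (function and node names are hypothetical, not from the talk): the same three-step pipeline written first as explicit control flow, then as a small dataflow graph whose nodes fire as soon as their inputs are available.

```python
# Placeholder transformations, hypothetical stand-ins for real services.
def clean(raw):
    return raw.strip().lower()

def extract(cleaned):
    return len(cleaned)

def score(features):
    return features * 0.1

# Control flow: the program spells out the order of operations.
def control_flow(raw):
    cleaned = clean(raw)         # step 1
    features = extract(cleaned)  # step 2
    return score(features)       # step 3

# Dataflow: declare nodes and the data routes between them;
# a node executes as soon as all of its inputs are available.
dataflow_graph = {
    "clean":   {"inputs": ["raw"],     "fn": clean},
    "extract": {"inputs": ["clean"],   "fn": extract},
    "score":   {"inputs": ["extract"], "fn": score},
}

def run_dataflow(graph, raw):
    values = {"raw": raw}
    while len(values) < len(graph) + 1:  # naive scheduler
        for name, node in graph.items():
            if name not in values and all(i in values for i in node["inputs"]):
                values[name] = node["fn"](*[values[i] for i in node["inputs"]])
    return values["score"]

assert control_flow("  Hello  ") == run_dataflow(dataflow_graph, "  Hello  ")
```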

Dataflow ideas date back to the 1970s

Work by Arvind, Jack Dennis, and J. Paul Morrison

It has seen only limited adoption

Examples of dataflow approaches

  • Flow-based programming
  • MapReduce
  • Streaming

You are likely familiar with dataflow already!

  • TensorFlow
  • Metaflow
  • Node-RED
  • Apache Spark
  • Apache Beam
  • Apache Kafka
  • Luigi

Benefits of dataflow design

  • Data oriented software
  • Data discovery out of the box
  • Data collection as simple as graph traversal
  • Simple experimentation
  • Data lineage for security and compliance

This is a causal graph

  • Arrows show causal relationships
  • No separate causal discovery step needed
  • Complete by design (see the sketch below)
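
As a minimal sketch (node names are hypothetical), the wiring of a dataflow program can be read directly as a causal DAG, e.g. with networkx:

```python
import networkx as nx

# Hypothetical dataflow wiring: an edge A -> B means "output of A is an input of B".
# Since data reaches a node only through its declared inputs, these edges are
# exactly the causal relationships between the nodes' outputs.
dataflow_edges = [
    ("orders", "pricing"),
    ("orders", "inventory"),
    ("pricing", "invoice"),
    ("inventory", "invoice"),
]

causal_graph = nx.DiGraph(dataflow_edges)

# Complete by design: no causal discovery step needed.
assert nx.is_directed_acyclic_graph(causal_graph)
print(nx.ancestors(causal_graph, "invoice"))  # all upstream causes of "invoice"
```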

Ok, what is it good for?

Scenario 1: fault localisation

Deployed bug affects not just one node...

Scenario 1: fault localisation

...but many!

Scenario 2: business analysis

Data shift affects not just one stream...

Scenario 2: business analysis

...but many!

Scenario 3: experimentation

Sounds good!

But does this work?

We will have to see!

What?

We need to demonstrate the applicability of graphical causal model (GCM) based methods, namely attribution, to dataflow programs.

How?

  • Choose a dataflow program
  • Simulate one of the scenarios above
  • Confirm shift in the target metric
  • Collect data "before" and "after"
  • Compute attribution for the distribution shift *


* We use GCM-based inference via dowhy.gcm; a minimal sketch follows below.
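
A minimal sketch of the attribution step, assuming the hypothetical graph from the earlier sketch and two pandas DataFrames of per-node outputs logged before and after the simulated change (the CSV file names are placeholders):

```python
import networkx as nx
import pandas as pd
import dowhy.gcm as gcm

# Hypothetical dataflow wiring, reused directly as the causal graph.
causal_graph = nx.DiGraph([
    ("orders", "pricing"),
    ("orders", "inventory"),
    ("pricing", "invoice"),
    ("inventory", "invoice"),
])

# Assumed logs of node outputs, one column per node, collected before and
# after the simulated change (file names are placeholders).
data_before = pd.read_csv("before.csv")
data_after = pd.read_csv("after.csv")

causal_model = gcm.ProbabilisticCausalModel(causal_graph)
gcm.auto.assign_causal_mechanisms(causal_model, data_before)

# Attribute the shift in the target metric ("invoice") to individual nodes.
attributions = gcm.distribution_change(causal_model, data_before, data_after, "invoice")
print(attributions)  # maps each node to its contribution to the change
```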

Where?

  • Toy examples
  • Node-RED
  • NoFlo.js
  • Naiad
  • Apache Spark

Open questions

  • Datasets as parameters
  • Scalability
  • Non-linearity
  • Costs

Summary

  • Services are good
  • But the world needs software built with data as the first priority
  • Dataflow design may help
  • Dataflow software is a complete causal graph
  • We can do causal inference on it for all kinds of use cases

https://mlatcl.github.io/