Andrei Paleyes
Causal Digital Twins workshop, 04.01.2023
Neil Lawrence | Siyuan Guo | Bernhard Schölkopf
VentureBeat, 2019
“Why do 87% of data science projects never make it into production?”
InfoWorld, 2021
“85% of AI and machine learning projects fail to deliver, and only 53% of projects make it from prototypes to production.”
Capital One and Forrester, 2022
“73% of respondents find transparency, traceability, and explainability of data flows challenging.”
Many reasons
e.g. model accuracy vs business value
or computational and labour costs
or skillset
Data management in modern software is often a mess
“Effective big data mining at scale doesn't begin or end with what academics would consider data mining”
“Data scientists expend a large amount of effort to understand the data available to them, before they even begin any meaningful analysis”
“Exploratory data analysis always reveals data quality issues”
Software Services
A service is software that provides a function, or many functions,
known as its interface or API,
that clients* can reuse,
together with policies to control its usage.
*A client can be anything: another piece of software, a person, or a hardware device.
https://en.wikipedia.org/wiki/Service_(systems_architecture)
However...
“Twitter is powered by many loosely-coordinated services.”
“Since a single user action may involve many services, a data scientist wishing to analyze user behavior must first identify all the disparate data sources involved.”
“Services are normally developed and operated by different teams, which may adopt different conventions for storing and organizing log data.”
Or... roll on dataflow!
Control flow is about operations and their order
Dataflow is about data routes and transformations
Control flow: instructions are executed one after another, as in the classic von Neumann architecture.
Dataflow: an instruction is ready to execute as soon as all its inputs are available.
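To make the dataflow firing rule concrete, here is a minimal sketch in Python. It is illustrative only, not any particular dataflow framework: `run_dataflow` and the node names `doubled`, `squared` and `total` are hypothetical. Each node runs as soon as all of its inputs have values, regardless of the order in which the nodes were declared.

```python
def run_dataflow(nodes, initial):
    """Run a toy dataflow graph.

    nodes:   name -> (function, list of input names)   (hypothetical format)
    initial: name -> value for the graph's source inputs
    Returns the computed values and the order in which nodes fired.
    """
    values = dict(initial)
    pending = dict(nodes)
    order = []
    progressed = True
    while pending and progressed:
        progressed = False
        for name in list(pending):
            func, inputs = pending[name]
            # Firing rule: a node is ready once every input is available.
            if all(i in values for i in inputs):
                values[name] = func(*(values[i] for i in inputs))
                order.append(name)
                del pending[name]
                progressed = True
    if pending:
        raise ValueError(f"nodes never became ready: {sorted(pending)}")
    return values, order


# "total" is declared first but fires last, because it waits for its inputs.
nodes = {
    "total": (lambda a, b: a + b, ["doubled", "squared"]),
    "doubled": (lambda x: 2 * x, ["x"]),
    "squared": (lambda x: x * x, ["x"]),
}
values, order = run_dataflow(nodes, {"x": 3})
print(values["total"], order)
```

In a control-flow program the same computation would run in the order the statements were written; here the order falls out of data availability alone.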
Dataflow ideas date back to the 1970s
Work by Arvind, Jack Dennis, J. Paul Morrison
Has seen limited adoption
Deployed bug affects not just one node...
...but many!
Data shift affects not just one stream...
...but many!
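A rough sketch of why a single fault spreads, assuming the dataflow program is represented as a directed graph. The graph, the node names and the `downstream` helper are all hypothetical. Everything reachable from the faulty node consumes its output, directly or indirectly, and is therefore potentially affected.

```python
from collections import deque


def downstream(edges, source):
    """Return every node reachable from `source`, excluding `source` itself."""
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen and nxt != source:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical dataflow graph: node -> nodes that consume its output.
edges = {
    "ingest": ["clean"],
    "clean": ["features", "report"],
    "features": ["model"],
    "model": ["dashboard"],
}

# A bug deployed in "clean" can surface in every node fed by it.
affected = downstream(edges, "clean")
print(sorted(affected))
```

The same traversal applies to a shifted data stream: start from the stream's source node instead of the buggy one.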
Sounds good!
But does this work?
We will have to see!
We need to demonstrate applicability of graphical causal model-based methods, namely attribution, to dataflow programs.