This is the second part of a three-part blog series. You can read the first part here and the third part here.
Most toddlers can say about 20 words by the time they are 18 months old. By age two, they start combining two words to make simple sentences, such as “baby crying.” As all parents are painfully aware, by the time children are three or four years old, they become extremely curious and can’t stop asking questions. Russian has a special word for kids of that age: pochemuchka, from the word pochemu, which means why.
But what happens in children’s brains as they start discovering the world around them? What happens is that their brains start building models — and we shall see later what building a model implies.
As we grow up, we learn many different things. Some of them we learn by doing, and some of them from other people’s experiences (stories, books, movies…). So we learn, for instance, that ice is slippery. But why is ice slippery? The simple answer can be: well, because everyone knows that ice is slippery! That answer was satisfactory enough to humans for millions of years, until we started building knowledge based on math and science. Then we discovered that ice being slippery has something to do with Brownian motion, the alignment of water molecules as water turns to a solid state, and the fact that water under pressure can change from a solid state back to liquid. For a better explanation, enjoy this video!
We learn not only by observing things around us and labeling them as facts, but also by trying to understand the underlying principles that guide these events.
In very simple terms, if we look at a given phenomenon described by the formula y = f(x), we are not only interested in what happens to the output y as the input x changes; we are even more intrigued by the transfer function f itself! This is where knowledge modeling comes in: in understanding the transfer function. Once we learn models, we can quickly start deriving all sorts of ideas from them by combining all of these transfer functions together, like in one giant convolutional network. And as mentioned earlier, some of these functions we learn ourselves, while others we have been told to believe in (which can sometimes get us into all sorts of other problems).
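To make the idea of combining transfer functions a bit more concrete, here is a minimal Python sketch. The two toy models (`celsius_to_kelvin` and `is_frozen`) and the helper `compose` are made up for illustration; they only show how small, understood models chain into a new one.

```python
from functools import reduce


def compose(*functions):
    """Chain transfer functions left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda value, f: f(value), functions, x)


# Two tiny, hypothetical "models" of the world (illustrative only).
def celsius_to_kelvin(celsius):
    return celsius + 273.15


def is_frozen(kelvin):
    # Assumes plain water at normal atmospheric pressure.
    return kelvin < 273.15


# A new model derived purely by combining the two we already understand.
water_is_ice = compose(celsius_to_kelvin, is_frozen)

print(water_is_ice(-5))  # True
print(water_is_ice(20))  # False
```

Because each f is known, the combined model answers questions for inputs we have never observed, which a memorized table of (x, y) pairs cannot do.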
That is also the reason we say that humans are good at working with small data sets — and it is also the reason we tend to take shortcuts when confronted with larger data sets.
As explained in the first blog post of this three-post series, deep learning nets do not model these relations. Going back to the world of y = f(x), they try to fit the best set of matrix decompositions inside f, with huge sets of x and y available to play with, while making no assumptions about what f actually represents. That is why deep learning can label a goat as a bird or as a giraffe (even though one may argue that seeing a goat in a tree is not as unusual as it seems).
As we will see later on, the best knowledge modeling comes from a Bayesian causal understanding of the world around us, so it is no surprise that a lot of the latest research effort in deep learning is based on Bayesian methods, something called Bayesian deep learning. But even that direction of research is focused more on getting better parameter estimates than on fixing the explainability problem of deep learning.
In one of his latest articles, Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution, Judea Pearl says that in order “to achieve human level intelligence, learning machines need the guidance of a model of reality, similar to the ones used in causal inference tasks.”
Further on, he describes a three-level causal hierarchy, together with the characteristic questions that can be answered at each level. The levels are titled Association, Intervention, and Counterfactuals.
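The gap between the first two levels can be seen in a toy example. The sketch below is my own illustration, not taken from Pearl's article: a hidden common cause Z drives both X and Y, while X has no effect on Y at all. All probabilities are invented. Association asks for P(Y | X), intervention asks for P(Y | do(X)), computed here with the standard adjustment over Z.

```python
from itertools import product

# Hypothetical toy model: Z is a common cause of X and Y.
P_Z = {0: 0.5, 1: 0.5}                      # P(Z = z)
P_X1_GIVEN_Z = {0: 0.1, 1: 0.9}             # P(X = 1 | Z = z)
# P(Y = 1 | X = x, Z = z); note that the value never depends on x.
P_Y1_GIVEN_XZ = {(0, 0): 0.2, (1, 0): 0.2, (0, 1): 0.8, (1, 1): 0.8}


def joint(z, x, y):
    """P(Z = z, X = x, Y = y) under the toy model."""
    px = P_X1_GIVEN_Z[z] if x == 1 else 1 - P_X1_GIVEN_Z[z]
    py = P_Y1_GIVEN_XZ[(x, z)] if y == 1 else 1 - P_Y1_GIVEN_XZ[(x, z)]
    return P_Z[z] * px * py


def p_y1_given_x(x):
    """Level 1, association: P(Y = 1 | X = x), by conditioning."""
    numerator = sum(joint(z, x, 1) for z in (0, 1))
    denominator = sum(joint(z, x, y) for z, y in product((0, 1), repeat=2))
    return numerator / denominator


def p_y1_do_x(x):
    """Level 2, intervention: P(Y = 1 | do(X = x)), adjusting for Z."""
    return sum(P_Z[z] * P_Y1_GIVEN_XZ[(x, z)] for z in (0, 1))


print(p_y1_given_x(1))  # about 0.74: seeing X = 1 makes Y = 1 look likely
print(p_y1_do_x(1))     # 0.5: forcing X = 1 changes nothing about Y
```

A learner that only sees (X, Y) pairs would conclude that X predicts Y, and it would be right about prediction and wrong about what happens when someone actually changes X.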
So, the real question is then how to combine deep learning, which is superior in so many applications, with Bayesian nets?
My view is that rather than “fixing” deep learning with inference models from the Bayesian world, or bringing some of the ideas of deep learning into Bayesian nets, a much more natural solution would be to place deep learning nets below Bayesian nets, and use deep learning nets as one of many sensory inputs into the Bayesian rule engine. That way, we can combine the best of both worlds.
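Here is a minimal sketch of what "deep learning as a sensor" could look like, using nothing but Bayes' rule. It is not the Waylay engine: the function `posterior`, the prior, and the classifier's confusion rates are all hypothetical numbers chosen for illustration. The classifier's label is treated as one noisy observation, and the layer above weighs it against what it already believes.

```python
def posterior(prior, likelihoods, observed_label):
    """Bayes' rule over discrete hypotheses.

    prior:        {hypothesis: P(hypothesis)}
    likelihoods:  {hypothesis: {label: P(sensor reports label | hypothesis)}}
    """
    unnormalized = {
        h: prior[h] * likelihoods[h][observed_label] for h in prior
    }
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical context knowledge: on this farm, most animals are goats.
prior = {"goat": 0.9, "bird": 0.1}

# Hypothetical confusion rates of an image classifier for animals in trees:
# P(classifier says label | true animal).
classifier_likelihoods = {
    "goat": {"goat": 0.8, "bird": 0.2},
    "bird": {"goat": 0.1, "bird": 0.9},
}

# The deep learning net reports "bird"; the Bayesian layer combines that
# reading with its prior instead of accepting the label as a fact.
belief = posterior(prior, classifier_likelihoods, "bird")
print(belief)  # goat is still about twice as likely as bird
```

The design choice is that the net never has the final word: its output is evidence with a known error profile, and other sensors could be folded in by applying the same update again.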
I wish I could say that this is a novel idea, but while browsing the literature, I bumped into Attention Schema Theory (AST), a neuroscientific theory of how we came to be aware of ourselves, described in the article A New Theory Explains How Consciousness Evolved. Here are some important parts of that article (with my emphasis):
“Even before the evolution of a central brain, nervous systems took advantage of a simple computing trick: competition. Neurons act like candidates in an election, each one shouting and trying to suppress its fellows. At any moment only a few neurons win that intense competition, their signals rising up above the noise and impacting the animal’s behavior. This process is called selective signal enhancement, and without it, a nervous system can do almost nothing. Selective signal enhancement is so primitive that it doesn’t even require a central brain. [..] The cortex is like an upgraded tectum. Unlike the tectum, which models concrete objects like the eyes and the head, the cortex must model something much more abstract. According to the AST, it does so by constructing an attention schema, a constantly updated set of information that describes what covert attention is doing moment-by-moment and what its consequences are.”
In short, the AST account of brain evolution is very similar to the idea of having deep learning nets, pattern matching, and Bayesian models put together, one on top of the other!
Here is a typical deployment model of the Waylay platform, with the Waylay rules engine based on Bayesian nets, the smart agents, and sensors and actuators built in any of three ways: based on real-time data, based on deep learning prediction models (with parameters learned from ML models), or directly on top of the ML model.
In my next blog post, I will take one real example to show how all these blocks work together.