However, there are hints that a kind of cross-pollination can occur between the two worlds when a problem in the industrial setting of machine learning can spark intriguing theoretical questions.
“This actually arose out of an escalation with a customer,” recalled Soatto. When an artificial intelligence program is put into production, it is one part in a whole chain of computer processes. With deep learning forms of AI, the behavior of the software can change as new and improved versions are produced. That can break things in the chain of processes, causing headaches for the customer.
“In reality, this is only a very small part of a production system,” explained Soatto, referring to an AI program such as an image classifier. In the customer complaint, he related, the new technology was breaking much older software the customer was still relying on. “Customers came back and said, ‘we had problems with your model,’ and we realized they’re using a model from four years ago!”
Thus began a quest by Soatto and colleagues into a new realm of exploration: how to make newer AI programs compatible with their predecessors.
That customer complaint led to a paper this past June, presented at the CVPR conference, an academic gathering studying computer vision. Soatto and his team approached the issue of compatibility as a constraint-satisfaction problem, taking a neural net and asking it to have guarantees beyond just being accurate in making predictions.
Specifically, they asked if the new version of the neural net could get more accurate without introducing new errors. Think of a classifier of cats and dogs: if the new neural net gets better overall but suddenly mis-categorizes some pictures of cats or dogs that the old one got right, that’s not a good trade-off. Making a mistake where a previous program was fine is called a “negative flip.”
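The negative-flip idea can be made concrete with a few lines of code. This is a minimal sketch, not the paper's evaluation code: it simply counts the fraction of samples the old model classified correctly but the new model gets wrong.

```python
import numpy as np

def negative_flip_rate(old_preds, new_preds, labels):
    """Fraction of samples the old model got right but the new model gets wrong."""
    old_preds = np.asarray(old_preds)
    new_preds = np.asarray(new_preds)
    labels = np.asarray(labels)
    old_correct = old_preds == labels
    new_wrong = new_preds != labels
    return float(np.mean(old_correct & new_wrong))

# Example: the new model is more accurate overall (4/5 correct vs. 3/5),
# yet it introduces one negative flip on the last sample.
labels    = [0, 1, 1, 0, 1]
old_preds = [0, 1, 0, 1, 1]   # 3 correct
new_preds = [0, 1, 1, 0, 0]   # 4 correct, but flips the last sample
print(negative_flip_rate(old_preds, new_preds, labels))  # 0.2
```

Even though overall accuracy improves, the nonzero flip rate is exactly the kind of regression that breaks a customer's downstream pipeline.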
In the paper, “Positive-Congruent Training: Towards Regression-Free Model Updates,” Soatto and colleagues solve the constraint-satisfaction problem of reducing such negative flips by setting the goal in a novel way. They took the traditional objective function, the so-called cross-entropy loss, which governs how well the neural net predicts the cat or dog, and they added to it a second objective function, requiring the neural net to make sure to do well on the same predictions the old neural net got right.
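The shape of that combined objective can be sketched as follows. The paper's actual second term is a form of "focal distillation"; this illustrative NumPy version substitutes a simple KL-divergence penalty, applied only on samples the old model got right, and the weighting `beta` is a hypothetical knob, not a value from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def pc_loss(new_logits, old_logits, labels, beta=1.0):
    """Sketch of a positive-congruent objective: standard cross-entropy
    plus a penalty, applied only where the old model was correct, that
    pushes the new model to match the old model's (correct) outputs."""
    n = len(labels)
    p_new = softmax(new_logits)
    p_old = softmax(old_logits)
    # standard cross-entropy on the true labels
    ce = -np.log(p_new[np.arange(n), labels]).mean()
    # mask: samples the old model classified correctly
    old_correct = p_old.argmax(axis=1) == labels
    # distillation-style KL term on those samples only
    kl = (p_old * (np.log(p_old) - np.log(p_new))).sum(axis=1)
    extra = (kl * old_correct).mean()
    return ce + beta * extra
```

With `beta=0` this reduces to plain cross-entropy; raising `beta` trades a little headroom on new accuracy for fewer negative flips.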
By moving beyond a single objective function and framing the matter as a constraint-satisfaction problem, Soatto and his team opened up a new area of applied deep learning, which they have christened “Graceful AI.”
The name is an umbrella term for the principle that a given problem involves multiple goals at once, not a single measure of accuracy.
“The models we develop have to play nicely with everything around them, not just train the best model you can,” said Soatto. Another way to look at it is being “respectful for criteria beyond just optimizing for performance.”
The paper demonstrates that the academic focus on producing the most highly performing AI program is not the only way to arrive at really interesting problems, Soatto told ZDNet.
“As an academic, you spend most of your time trying to invent problems that don’t exist,” observed Soatto. “Very rarely you get lucky, and you end up with something that’s useful to the world.”
Being at AWS, by contrast, “you get constantly exposed and bombarded with real problems that are fascinating that don’t have a solution.”
The practical result is that “any [AWS] customer who employs one of these models knows that from that point on, they will be able to ingest any subsequent improvement without having to change any of their post-processing.”
In Soatto’s view, every customer complaint is an opportunity to uncover intriguing questions. When a customer complaint comes up, “There is something to be understood, something is not working the way we thought.”
In the case of positive-congruent training, he said, “we stopped and asked, Why is it that we don’t train models that are compatible with whatever is surrounding them?”
There is a payoff here for pure research. The practical question opened the door to deeper matters that touch upon theoretical issues, such as why machine learning is or is not able to generalize beyond the training data.
“The elephant in the room for machine learning is you really don’t care how you do in the training set because you will never, ever see it again,” said Soatto. “What you care about are a small number of errors in the test set, which is sequestered, and you don’t see it.”
The “schism” between the two is all about what are called inductive biases, the thing that “connects the test data, which you don’t have access to, and the training data.”
“This is a problem that has been open and obsessing us for decades,” he said. It goes back to the grandfathers of AI, Alan Turing and Norbert Wiener. The mystery of AI programs is a conundrum: “You cannot create information by torturing the data, but everything we do to data is torturing the data – we do stuff to it.”
Theoretical questions of learning representations get to the heart of what scientist Claude Shannon theorized as the very nature of information, said Soatto, to wit, What representations are optimal for a task, in terms of being maximally informative?
On that score, “there are some very strange and fascinating phenomena,” said Soatto.
The positive-congruent work and other research that Soatto and his team have produced share certain themes. In particular, a current of equivalence runs through the works: the question of what makes two neural nets the same or different for a given task.
The AI team’s projects sometimes come down to searching for neural nets that have an overlap, despite differences. The exact nature of the overlap may not always be clear but is tantalizing nonetheless.
For example, in the positive-congruent paper, Soatto and the team tested what happens when old and new networks are developed as ensembles, groups of similar neural nets with varying hyper-parameters. They found they could “future-proof” old neural nets, in a way, because the ensemble of old neural nets collectively had less divergence from the new neural nets in terms of examples that went wrong.
“Ensembles are very interesting,” said Soatto. “We have not solved it completely.”
In another piece, “Compatibility-Aware Heterogeneous Visual Search,” Soatto and colleagues ask whether it’s possible to develop neural nets that are more computationally efficient while giving up as little accuracy as possible. They use a popular approach to automatically design neural nets, called “neural architecture search,” and they set another constraint-satisfaction problem: the new network must be “compatible” with an existing neural net. A smaller neural net may save on processing as long as it produces a representation that is compatible with a larger neural network – larger in terms of the number of parameters.
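What "compatible representations" means in a visual-search setting can be illustrated with a toy retrieval check. This sketch is an assumption-laden simplification of the paper's setup: a small model embeds the queries, a large model embeds the gallery, and compatibility means nearest-neighbor search still returns the right items across the two models.

```python
import numpy as np

def cosine_top1(query_emb, gallery_embs):
    """Index of the nearest gallery embedding by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return int(np.argmax(g @ q))

def cross_model_accuracy(small_query_embs, big_gallery_embs,
                         query_labels, gallery_labels):
    """Fraction of queries (embedded by the small model) whose nearest
    neighbor in the gallery (embedded by the large model) carries the
    correct label. Compatibility means this stays close to the accuracy
    obtained when the large model embeds both sides."""
    hits = 0
    for q, y in zip(small_query_embs, query_labels):
        hits += int(gallery_labels[cosine_top1(q, big_gallery_embs)] == y)
    return hits / len(query_labels)
```

The practical payoff is that an expensive gallery, embedded once by the big model, never has to be re-indexed when a cheaper query-side model is swapped in.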
Soatto and colleagues have also inverted the question of similarity between neural nets by, for example, asking what happens to a neural net if a given sample of data is left out. In the paper “Estimating Informativeness Of Samples With Smooth Unique Information,” they define the information value of a single sample in a data set by asking how the weight values of a neural net are different with and without that individual data point.
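The leave-one-out intuition behind that paper can be sketched on a model small enough to retrain exactly. The paper defines a smoothed, information-theoretic measure for deep nets; the version below is only illustrative, using closed-form ridge regression as a stand-in for "training" and the distance between weight vectors as the informativeness score.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression weights (a stand-in for 'training')."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def sample_informativeness(X, y, i, lam=1e-3):
    """Leave-one-out sketch: how far do the learned weights move when
    sample i is removed from the training set?"""
    w_full = fit_ridge(X, y, lam)
    mask = np.arange(len(y)) != i
    w_loo = fit_ridge(X[mask], y[mask], lam)
    return float(np.linalg.norm(w_full - w_loo))
```

A redundant sample barely moves the weights, while a sample that is the sole carrier of some pattern moves them a lot, which is exactly the sense in which it is "informative."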
Again, the work has multiple theoretical implications. It points the way to possible bounds on the generalization capability of a neural net, meaning how well it can be applied to new examples beyond training data when making predictions in the real world. And the work can provide insight into how much information may “leak out” of a neural net or be disclosed about a given example. That theoretical question is also a crucial practical issue in terms of privacy because leaked information can potentially de-anonymize entities in a data set.
Much of the Graceful AI work has been done with computer vision types of problems, though Soatto notes “the framework is general, so the considerations apply to other forms of architectures.”
All of these explorations into things such as backward compatibility touch upon a broad area of continued exploration, known as “continual learning.” That challenge is “still an open problem,” said Soatto.
“We want to get an opportunity to work on new problems that are impactful and meaningful and end up in the hands of thousands of developers,” he said. “We hire people so that we can get them exposed to real problems that don’t have a solution. Here we have people who join our team, and in six months, their work is in the hands of tens of thousands of people,” something that is unheard of for most scientists.
Although positive-congruent training has been implemented in AWS, Soatto and the team acknowledge that intriguing questions remain unanswered.
At the end of the paper, Soatto and colleagues write that the new training regimen still involves tricky trade-offs. The best-performing solution, ensembles, is impractical when making live predictions. The simpler approach, known as focal distillation, brings with it some increase in error rate, which is undesirable.
The paper ends with the caveat that the authors have “only scratched the surface” of PC training. More work remains to be done.
“These are not hypothetical academic questions,” said Soatto, “These are questions that, if we are able to successfully address, they will really simplify the life of customers and developers who need to insert these models into their pipeline.”