A Revolution in Privacy Constraints for the World of AI
Artificial intelligence is one of the most disruptive trends the world is facing today. Many companies are shifting toward AI applications and trying to work out where its capabilities fit their business best. We should see aggressive adoption over the next couple of years. According to McKinsey, AI could deliver additional economic activity of $13T by 2030, or 16% of today’s GDP.
Global, widespread adoption of AI will take some time and will require a few breakthroughs first:
- Finding a way to balance AI’s appetite for data against data privacy.
- With more than 30 Zettabytes of data being transferred for training, solving the inherent bandwidth, storage, and computing costs of that volume of data.
- Closing the talent gap: companies still lack people, or access to people, capable of leading the adoption and use of AI.
So Much Data…
Data generation will keep growing rapidly over the coming years. According to IDC, the data collected will grow from 33 Zettabytes in 2018 to 175 Zettabytes in 2025. Moreover, 90 ZB (more than 50%) of this data will be generated by IoT devices (anything from connected cars to a smart microwave). Imagine all of the insights, learnings and artificial intelligence that can be created with all that data! The issue is that this data has to be transferred to the cloud or to local servers to be accessible for analytics and to let AI experts use it in their models. The four main problems with the way data is currently handled and made accessible for analysis are:
- Privacy: all this data (most of which is personal) is travelling to and from servers and the cloud.
- Latency and network costs: even if you have a fast server, transferring the data to the server for training is often the bottleneck. For reference, the data collected from 1 hour of driving by a connected car takes over 10 hours to upload.
- Scalability: the more connected devices there are, the more server power you need and the more expensive the data transfer gets.
- Training costs / server costs: training costs are growing. More companies are trying to reduce training time, but doing so usually costs more. You don’t need me to tell you that, at the end of the day, time is money.
If AI is the Spaceship then Data is the Rocket Fuel
Artificial intelligence models are built to enable detection, classification, and prediction of future events. For that, we can embrace the cloud, but we need to deal very carefully with the data we send to or store in these data centres. It is often impractical to send all the data to a centralised location because of bandwidth, storage, and privacy concerns. Since machine learning models are built from this collected data, we have to be extremely aware of the sensitive nature of the data used for model building. There are many risks in transferring data from the edge to the cloud; in fact, many industries are heavily regulated in what data they can export from their local devices. For reference, in the world of healthcare, X-rays or ultrasounds can be treated as anonymous (you are not likely to recognise a person from their X-ray) and so are easily transferred to a server to train models that detect health conditions. However, since you cannot transfer medical files along with the images, the companies analysing those X-rays don’t know whether they belong to an 18-year-old with previous health issues or a 55-year-old with a clean bill of health. Knowing those details can greatly increase the accuracy of the AI analysis, and accuracy of detection is the main barrier for AI-driven healthcare tech.
Living on the edge
There is more computing power in an iPhone 6 than NASA had in its computers during the early Apollo days. As edge devices become more powerful and collect more data, one has to stop and ask: why not run all that fancy artificial intelligence and machine learning directly on the edge devices, where all the data is being generated anyway?
Training on the edge vs inference at the edge
Before we dive into the potential solution, here is a quick explanation of the difference between training and inference (inference is often mistaken for training, so we thought we would clear that up).
Inference at the edge: Over the past few years, the demand for real-time solutions has increased drastically. People want to aim their phones at an object and have the phone tell them immediately what it is, or generate a 3D image in real time, to name a few examples. This opened the way for a new generation of deep learning systems that shift the deployment of models to the edge devices, so that they can continuously infer, or predict, based on data sourced from the same device. This process of deploying deep learning models at the edge is currently what people think of when they hear the term “AI at the edge”. As advanced as this system is, it still requires the models to be trained on a centralised server. The edge devices receive a predetermined model that can only infer from data it recognises and can no longer improve (hence the need for constant updates). This is already fairly common practice.
Training on the edge: Actually training the model at the edge is being explored by only a limited group of companies. Moreover, it is regarded as extremely difficult because of the very tight constraints that edge devices impose on deep learning model training. Once those constraints are overcome, the edge device can start to train its own models, meaning that, for the first time, NO data has to leave the edge and all those zettabytes can stay exactly where they are.
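To make the distinction concrete, here is a minimal Python sketch of the two operations on a tiny logistic-regression model. It is only an illustration under stated assumptions: the function names (`predict`, `train_step`), the sample readings and the learning rate are all hypothetical, and real on-device training of deep learning models is far more involved.

```python
import math

def predict(weights, bias, features):
    """Inference: apply fixed parameters to one input. Nothing is learned."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train_step(weights, bias, samples, learning_rate=0.5):
    """Training: one gradient-descent pass over local samples.

    Returns updated parameters; the samples themselves are never sent anywhere.
    """
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for features, label in samples:
        error = predict(weights, bias, features) - label
        for i, x in enumerate(features):
            grad_w[i] += error * x
        grad_b += error
    n = len(samples)
    new_weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    new_bias = bias - learning_rate * grad_b / n
    return new_weights, new_bias

# Hypothetical readings collected on the device: (features, label).
local_samples = [([0.1, 0.2], 0), ([0.9, 0.8], 1), ([0.2, 0.1], 0), ([0.8, 0.9], 1)]

weights, bias = [0.0, 0.0], 0.0
before = predict(weights, bias, [0.9, 0.9])   # inference with the untrained model
for _ in range(200):
    weights, bias = train_step(weights, bias, local_samples)
after = predict(weights, bias, [0.9, 0.9])    # inference after on-device training
```

An inference-only device can call `predict` but is stuck with whatever parameters it was shipped. A device that can also run `train_step` improves its own parameters using data that stays local.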
So what does training at the edge really solve?
The four added values of training at the edge:
1. Privacy and information security
Forget what you know about hashing, anonymising, encryption, encryption keys, and other ways of rendering data private. Training at the edge means there is no need to encode or transform the data so that it can be transferred since, you guessed it, it doesn’t need to go anywhere in order to be trained on. This is crucial because even if a company holds, for argument’s sake, a strongly “anonymised” dataset, it can still put user privacy at risk if combined with other data. Latanya Sweeney (Director of the Data Privacy Lab at the Institute for Quantitative Social Science (IQSS) at Harvard) has documented that de-identifying data and applying ad hoc generalisations are not sufficient to render data anonymous, because combinations of attributes are often unique and can be used to re-identify individuals. Another way companies deal with data is to strip specific identifiers from it; though that makes the data more private, it undermines the integrity of the data and the accuracy of the prediction model. With edge computing, an application can make sure that sensitive data is processed on the edge and that no data is sent to the cloud for analysis. By doing so, it redefines the way we understand privacy and keeps the data intact to create the most accurate models!
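The re-identification point can be illustrated with a short Python sketch. The records below are invented toy data, and the helper `unique_fraction` is a hypothetical name, not part of any library. The sketch simply measures what share of rows becomes unique once several harmless-looking attributes are combined.

```python
from collections import Counter

# Invented toy records with names already removed ("anonymised").
# Each tuple is (postcode, birth_year, sex); every value is made up.
records = [
    ("02138", 1965, "F"),
    ("02138", 1965, "M"),
    ("02139", 1965, "F"),
    ("02139", 1990, "M"),
    ("02139", 1990, "M"),
    ("02140", 1990, "F"),
]

def unique_fraction(rows, columns):
    """Share of rows whose values in the chosen columns appear exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)

sex_only = unique_fraction(records, [2])          # a single attribute
all_three = unique_fraction(records, [0, 1, 2])   # the attributes combined
```

In this toy set no row is unique by sex alone, yet four of the six rows are unique once postcode, birth year and sex are combined. Anyone holding a second dataset with the same three fields and a name attached could single those rows out.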
2. Reduced time and network latency
As mentioned before, the power and flexibility of cloud computing has enabled many scenarios that were impossible before. Think about how the accuracy of image or voice recognition algorithms has improved in recent years. However, this accuracy has a price: the time needed to get an image or a piece of audio recognised is significantly affected by the non-negligible yet unavoidable network delays incurred when data is shipped to the cloud and the results are computed and sent back to the edge. For reference, think of the massive amounts of unstructured data collected by self-driving cars; the cars must essentially act as data centres. Sending data to the cloud for training or inference could cost valuable upload time. To act efficiently on the data, and to one day deliver self-driving cars, we must first be able to train and infer on the road. With edge computing, traffic from the various IoT devices is handled efficiently by many distributed data centres rather than by a single (overburdened) server. Edge computing therefore greatly reduces network traffic.
Finally, when data is processed faster, the performance of the application (that is, the IoT device) improves greatly.
The table below shows the time Google estimates it takes to transfer certain amounts of data from a device to a centralised server at a given network speed. As an example, take a connected vehicle that generates 3 TB of data per hour. According to Google, even at a maximum 5G speed of 1 Gbps it would take at least 9 hours to send the information from the car to the server.
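As a rough cross-check of the connected-vehicle figure, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope estimate, not Google’s method: it assumes decimal units, and the `efficiency` parameter (the share of the nominal link speed actually sustained) is a hypothetical knob introduced here for illustration.

```python
def transfer_hours(data_terabytes, link_gbps, efficiency=1.0):
    """Hours needed to move a volume of data over a network link.

    Assumes decimal units (1 TB = 10**12 bytes, 1 Gbps = 10**9 bits/s).
    `efficiency` is an assumed fraction of the nominal rate actually sustained.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = data_terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

ideal = transfer_hours(3, 1)            # perfect link: about 6.7 hours
derated = transfer_hours(3, 1, 0.74)    # assumed 74% sustained throughput: about 9 hours
```

Even on a perfect 1 Gbps link, one hour of driving data takes more than six hours to upload. Real links rarely sustain their nominal rate, which is consistent with the longer estimate quoted above; the 74% value is chosen purely to show how the gap could arise.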
3. Scalability
Distributed learning addresses scalability issues that traditional ML has not yet solved, such as model robustness and the amount of data that can be analysed. Training on the edge lets models learn from the largest quantities of input data available.
A great example of this appears in this NVIDIA article, which illustrates the challenge of scaling a model as the amount of data increases. The case is quite simple yet shockingly disturbing. Take a fleet of 100 autonomous (or partially autonomous) vehicles that produce around 1 TB of data per hour (not far-fetched at all). The author estimates, in his most conservative scenario, that training the whole model will take between 21 and 166 days. His less conservative view puts the training time at between 197 and 1,556 days (more than four years at the upper end). Training at the edge means that the more edge units you have, the faster you train.
4. Meaningful cost effectiveness
As datasets grow larger and models become more complex, training machine learning models increasingly requires distributing the optimisation of model parameters over multiple machines. This can greatly increase cloud server and data centre costs. Existing machine learning algorithms are designed for highly controlled environments (a classic server is one example) where the data is distributed among machines and high-bandwidth networks are available.
As the cost of running complex training on cloud providers rises, training models on the edge can reduce spending on expensive cloud providers and data centres to practically zero.