Monday, November 27, 2023

Blazing Past a Major AI Bottleneck

There is more to human life than anyone can ever imagine, and yet the thing that stands out most is our ability to grow at a consistent clip. That ability has already delivered some huge milestones, with technology emerging as a major member of the group. The reason we hold technology in such high regard is, by and large, its skill set, which has guided us toward a reality nobody could have imagined otherwise. Look beyond the surface for a second, though, and it becomes clear that this run was just as much a product of how we applied those skills in the real world. That latter component gave the creation a spectrum-wide presence and, as a result, set off a full-blown tech revolution. The revolution, of course, went on to scale up the human experience through some genuinely unique avenues, and yet technology continues to bring forth the right goods even after feats so notable. This has become more and more evident in recent times, and if one new discovery ends up having the desired impact, it will only push that trend onto a higher pedestal going forward.

The research teams at the Massachusetts Institute of Technology and the MIT-IBM Watson AI Lab have developed a technique that lets deep-learning models adapt to new sensor data directly on an edge device. Before unpacking the development, it helps to understand the problem it solves. Deep-learning models that power artificial intelligence chatbots need constant fine-tuning with fresh data to deliver the customization users expect. Smartphones and other edge devices lack the memory and computational power this fine-tuning requires, so the current framework works around that by uploading user data to cloud servers, where the model is updated. The problem is that this data transmission consumes huge amounts of energy, and it also carries security risks, since sensitive user data is sent to a cloud server that can always be compromised.

Having covered the problem, we can now look at how the new technique takes it on. Named PockEngine, the solution comes equipped with the means to determine which parts of a huge machine-learning model actually need updating to improve accuracy. Complementing that, it only stores and computes with those specific pieces, leaving the rest undisturbed. This marks a major shift. Running an AI model normally means inference, a process in which a data input is passed from layer to layer until a prediction is generated. The heavier work comes afterward: during training and fine-tuning, the model goes through a phase known as backpropagation. Backpropagation, in case you weren't aware, involves comparing the output to the correct answer, then running the model in reverse and updating each layer so the model's output gets closer to that answer. Because every layer has to be updated, the entire model and all intermediate results have to be stored, which makes fine-tuning a high-maintenance process.

Fortunately, there is a loophole: not all layers in the neural network matter for improving accuracy, and even for layers that do matter, the entire layer may not need to be updated. The surplus components therefore don't need to be stored. What's more, you don't have to propagate all the way back to the very first layer; the process can be stopped somewhere in the middle. Exploiting these loopholes, PockEngine first fine-tunes each layer, one at a time, on a certain task, and measures the accuracy improvement after each individual layer. That methodology identifies the contribution of each layer and the trade-offs between accuracy and fine-tuning cost, while automatically determining the percentage of each layer that needs to be fine-tuned.
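To make the idea concrete, here is a minimal sketch of that layer-wise profiling step in PyTorch. It is written for this article rather than taken from the paper: the tiny model, the synthetic data, and the keep-only-the-best-layer selection rule are invented for illustration, and the real PockEngine additionally decides what fraction of each layer to update and performs its analysis ahead of deployment rather than in a Python loop.

```python
import torch
import torch.nn as nn

# Hypothetical toy model and synthetic data, used only to illustrate the idea.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                      nn.Linear(32, 32), nn.ReLU(),
                      nn.Linear(32, 2))
x, y = torch.randn(256, 16), torch.randint(0, 2, (256,))
loss_fn = nn.CrossEntropyLoss()

def accuracy():
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

def finetune(steps=20, lr=1e-2):
    # Only parameters with requires_grad=True are updated, so only they need
    # gradients and stored activations during backpropagation.
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

baseline = accuracy()
gains = {}
layers = [m for m in model if isinstance(m, nn.Linear)]

# Step 1: profile each layer's contribution by fine-tuning it in isolation.
for i, layer in enumerate(layers):
    state = {k: v.clone() for k, v in model.state_dict().items()}
    for p in model.parameters():
        p.requires_grad_(False)
    for p in layer.parameters():
        p.requires_grad_(True)
    finetune()
    gains[i] = accuracy() - baseline
    model.load_state_dict(state)   # restore weights before probing the next layer

# Step 2: keep only the most useful layer(s) trainable; the frozen rest never
# need their gradients or intermediate results stored during fine-tuning.
selected = {i for i, g in sorted(gains.items(), key=lambda kv: -kv[1])[:1]}
for i, layer in enumerate(layers):
    for p in layer.parameters():
        p.requires_grad_(i in selected)
finetune(steps=100)
print(f"baseline={baseline:.2f}, after sparse fine-tuning={accuracy():.2f}")
```

The point of the sketch is simply that parameters left frozen drop out of the backward pass entirely, which is where the memory and compute savings come from.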

“On-device fine-tuning can enable better privacy, lower costs, customization ability, and also lifelong learning, but it is not easy. Everything has to happen with a limited number of resources. We want to be able to run not only inference but also training on an edge device. With PockEngine, now we can,” said Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, a distinguished scientist at NVIDIA, and senior author of an open-access paper describing PockEngine.

Another way the solution sets itself apart concerns timing. Put simply, the traditional backpropagation graph is generated during runtime, which demands a massive amount of computation. PockEngine instead does this work during compile time, while the model is being prepared for deployment. It essentially deletes bits of code to remove unnecessary layers or pieces of layers, creating a pared-down graph of the model to be used during runtime, and then performs further optimizations on that graph to improve efficiency. What makes the feature all the more valuable is that the entire process only needs to be carried out once.
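The snippet below mimics that end result in plain PyTorch rather than reproducing PockEngine's actual compiler: a made-up model is split, once and ahead of time, into a frozen backbone and a trainable head (an arbitrary split chosen for illustration), so the backward graph built at runtime only ever covers the small trainable portion.

```python
import torch
import torch.nn as nn

# Hypothetical model; the layer sizes and the frozen/trainable split are made up.
backbone = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
head = nn.Linear(64, 2)

# "Compile time": decide once, before deployment, which part of the model
# participates in training. The backbone is frozen, so no autograd graph
# (and no stored activations) will ever be built for it.
for p in backbone.parameters():
    p.requires_grad_(False)
backbone.eval()
opt = torch.optim.SGD(head.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y):
    # "Runtime": the backward graph covers only the head, a pared-down
    # version of the full training graph.
    with torch.no_grad():          # the backbone runs inference only
        feats = backbone(x)
    loss = loss_fn(head(feats), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

x, y = torch.randn(32, 16), torch.randint(0, 2, (32,))
print(train_step(x, y))
```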

The researchers have already run some initial tests on their latest brainchild, applying PockEngine to deep-learning models on different edge devices, including Apple M1 chips and the digital signal processors common in many smartphones and Raspberry Pi computers. Going by the available details, the solution performed on-device training up to 15 times faster, without any drop in accuracy, and it also cut back sharply on the amount of memory required for fine-tuning. Once that bit was done, the team applied the solution to the large language model Llama-V2, where PockEngine reduced each fine-tuning iteration from seven seconds to less than one second.

“This work addresses growing efficiency challenges posed by the adoption of large AI models such as LLMs across diverse applications in many different industries. It not only holds promise for edge applications that incorporate larger models, but also for lowering the cost of maintaining and updating large AI models in the cloud,” said Ehry MacRostie, a senior manager in Amazon’s Artificial General Intelligence division.
