July 1, 2019
The Internet of Things (IoT) and edge computing have triggered the proliferation of connected devices, products that perform a rich assortment of functions and sport expanded capabilities. This, in turn, has increased the level of product design complexity engineers must tackle. The complexity becomes particularly evident when selecting processing resources for one of these devices.
The problem is that engineers must not only meet a growing variety of seemingly conflicting design demands—like low-power operation and high performance—but they must do so using a bewildering assortment of processing options.
Unfortunately, there are few easy answers. For processor architectures and technologies in the era of the IoT, one size never fits all. It’s true that all the devices populating the network perform certain fundamental and universal tasks, which general-purpose, single-processing architectures can handle. The catch is that the IoT now requires chips that can perform specialized tasks, such as machine learning, voice or gesture recognition and security. These demands have caused designers to turn to a growing and evolving class of accelerators. Market forces further complicate the designer’s task of demanding shorter development cycles and reduced development costs, making the processor selection process even more critical.
Single-Purpose vs. Multi-Function Processors
One of the first decisions an IoT device designer must make is whether to create a system using a collection of function-specific chips or a single chip that integrates multiple functions.
Although a function-specific chip may increase the board space and bill of material (BOM) costs, the approach offers several advantages. By keeping the application functions separate from the specialized tasks, the developer minimizes code complexity. The integration of the two software pieces can be challenging, especially with a single-core architecture.
This approach also affords access to richer feature sets. Products embedding the specialized tasks and general-purpose features often come with fewer options because of memory size and peripherals. Separating application and specific functionality means developers avoid real-time issues, such as conflicts between running a motor and handling a Wi-Fi communication stack.
On the other hand, a multi-function chip approach clearly takes up the smallest printed circuit board (PCB) space and possibly comes at a lower cost. As a result, it is well suited for simpler applications that target lowest cost and higher integration.
One solution to this dilemma may pertain to a technique that compartmentalizes functions. STMicroelectronics uses this approach in its STM32WB family of wireless microcontrollers (MCUs). With this product, ST has built a multicore series of MCUs and has “walled off” the connectivity MCU to protect it from interruption from the application processor.
“This dual-core architecture allows concurrent execution of the two pieces of software and provides greater flexibility to the software teams developing each function,” says Renaud Bouzereau, marketing manager for STM32 high-performance products, microcontroller division, STMicroelectronics.
Finding a Balance of Power
Because of the proliferation of mobile wearable and remotely operating devices, power efficiency has assumed particular significance in IoT product development, often defining key aspects of the system processors’ specifications.
Improving a processor architecture’s power efficiency often comes at the expense of performance, die area or both. The most basic factors that influence the choice of a power efficient processor for a given task or application include the instruction set architecture and clock speed.
“A complex instruction set architecture, like Intel x86, utilizes many complex instructions and more hardware compared to a reduced instruction set architecture (RISC) like ARMv8,” says Nihaar Mahatme, technology lead for the microcontroller business line at NXP Semiconductors. “The simplified hardware of RISC chips requires fewer transistors to be powered, reducing the power consumption. This has made ARM chips popular for battery-powered mobile devices and wearables, while higher performance x86 processors are better suited for personal computers, laptops and servers.”
For a given microarchitecture, a lot of trade-offs between power and performance are made through choices such as processor pipeline depth, out-of-order execution, size of the address and data bus, multi-core architectures, frequency of operations and memory hierarchy. Each of these options requires a careful trade-off between power, performance and area.
Designers should also recognize that many IoT devices require always-on performance, which relies on the implementation of a sleep mode. Under these conditions, leakage current becomes a problem when it wastes energy.
“One area often overlooked is leakage current because many IoT devices are often sleeping, so the leakage current dominates,” says Paul Washkewicz, co-founder and vice president of marketing of Eta Compute. “In fact, the real-time clock is completely dominated by leakage and is often the main function that is always on to keep the IoT device synchronized. Often the gains made going from 90 nm to 40 nm—for the SoC [system on a chip] to get more performance or shrink the die for cost—lock the SoC out of markets dominated by off leakage currents.”
Designers seeking greater energy efficiency must consider power dissipation, which varies with the processing platform used.
“Power dissipation is very important in microprocessor–based applications because the power dissipation is much higher than on most MCUs,” says Bouzereau. “However, on high-performance MCUs, the power dissipation becomes more important and requires more attention from the developer. The key point is the maximum junction temperature, and how the package can help dissipate this power.”
Developers have a number of techniques at their disposal to mitigate the effects of power dissipation. They can limit the power inside the chip by limiting the operating frequency and the activity of the chip (peripherals activity) and by leveraging low-power modes between high-processing periods.
Designers also can select an ultra-low-power processor and a package with lower thermal resistance. Additionally, they can improve the power dissipation capability of the package by using heat sinks or air circulation, but things get more challenging when the processor is inside a sealed casing.
One way to track and compare targets for processor power consumption and performance is the EEMBC CoreMark benchmark, which helps board and system designers to compare energy efficiency. The benchmark measures the performance of microcontrollers and CPUs deployed in embedded systems using the following algorithms:
- list processing (find and sort);
- matrix manipulation (common matrix operations);
- state machine; and
- cyclic redundancy check (CRC).
The CRC algorithm provides a workload commonly seen in embedded applications and ensures correct operation of the CoreMark benchmark, essentially acting as a self-checking mechanism. This process performs a 16-bit CRC on the data contained in elements of the linked list to verify correct operation. The system is designed to run on devices from eight-bit microcontrollers to 64-bit microprocessors.
Supporting Artificial Intelligence
The relationship between power consumption and the implementation of artificial intelligence (AI) will pose more challenges to IoT developers. Here, again, no single processor architecture or technology fits all use cases.
Most AI-enabled IoT applications target speech recognition and response, image classification, tracking video features and processing sensor data to analyze behavioral patterns. Each of these applications can have widely different requirements for performance, response latency and power.
“The common underlying theme for all the architectures is that they require arithmetic logic units or multiply-accumulate units (MACs) to perform arithmetic operations, as well as large amounts of on-chip and/or off-chip memory to hold input data and intermediate computation results,” says NXP’s Mahatme. “A higher number of MACs corresponds to higher performance. This must be complemented with the right kind of memory hierarchy to ensure that the processing elements are never starved of data, or become bandwidth-limited. Often a right mix of hierarchical on-chip RAMs, tightly coupled memories and DRAM are employed to manage the performance and minimize power consumption due to memory accesses.”
In addition to these factors, designers should note that software plays a crucial role in how the processing element is programmed. Flexible application programming interfaces (APIs) simplify deploying and optimizing ML models on diverse hardware, especially those models built on different frameworks, such as TensorFlow, Caffe and Pytorch.
One of the biggest trade-offs a designer must make is the decision to use a fixed accelerator or a programmable one. A case may be made for one over the other, depending on the segment.
“But for the hundreds of thousands of applications in small, energy-constrained devices, it’s best to have programmability along with any accelerator in order to target the many applications of IoT,” says Washkewicz.
From Sensor Data to Real-Time Decision-Making
The conversion of real-time sensor data brings a clash between competing demands for power efficiency, high performance, AI implementation and cost. Large volumes of data are being generated from sensors in distributed settings like smart cities, industrial plants and autonomous vehicles. For fast real-time response, developers increasingly turn to edge computing. This raises the bar for computing resources in edge nodes. As a result, hardware must be complemented with real-time operating systems and fast-response software to handle interrupts and events swiftly.
ML for continuously changing real-time sensor data will also require learning on the fly because real-time systems cannot deploy and forget. For this adaptability, new hardware and software features that can be updated or reconfigured may be required for machine learning inference. With vast amounts of data being generated from multiple sources, it has become critical to extract just enough information required to perform the given AI task to minimize time and energy spent on redundant information processing. Such features are built into customized accelerators today.
In selecting the right processor architectures for such systems, some of the key considerations include the kind of scheduling algorithms and their effect on system performance, how the processor prioritizes and responds to interrupts and the different constraints on scheduling interrupts of varying priority.
Some developers challenge the ability of current digital processing architectures to handle fast-response challenges. “Most systems that digitize and require a high-speed bus will not be able to perform high-speed, closed-loop processing because of latency issues,” says David Schie, CEO of AIStorm. “Most digital systems will be unable to perform high-speed, real-time decision-making because they have no information about the sensor data until it is digitized and communicated to the AI system, by which time it is too late to make a data pruning decision.”
The Analog Alternative
The crux of this challenge is in the contention that the current generation of specialized accelerators are not up to providing the real-time processing, low power consumption and low cost required to use AI in many IoT applications. According to the challengers, the solution is to eliminate or minimize digitization by processing input data from sensors in analog.
Chipmakers recognize that by adopting analog computing or approximating it, both analog and in-memory processing avoid the power-hungry binary weighted arithmetic multiply-add operations that conventional ML accelerators need. They do, however, have concerns about the approach.
“Both analog and in-memory processing will be welcome for energy-efficient machine learning,” says Mahatme. “However ... the dependence on specific charge-based storage, memristive or resistive elements, limits their applicability to only certain technologies. Analog processing similarly requires sense-amplifiers, A/D and D/A for processing, and interfacing with the rest of the chip, making it less scalable. Secondly, digital communication to and from these architectures will still consume power.”
Mahatme says interest in the alternative has been rising, but it may be a few years before IoT devices reap the benefits.
About the Author