IoT gateways have become a critical component of IoT deployments today. In this post, we try to understand the need for IoT Gateways and the role they play in an IoT solution architecture.
Integrating ‘Things’ With the Cloud
Some IoT appliances are sufficiently advanced to support the full extent of the TCP/IP stack and to securely communicate directly with your IoT Cloud.
However, we often encounter lightweight IoT sensors and actuators that support local communication interfaces only – such as Zigbee, Bluetooth, RS232, RS485 etc. They do not have the capability or the compute power to support a full TCP/IP stack.
In such cases, an IoT Gateway acts as an intermediary device that is deployed on the field. It provides multiple local interfaces – sensors and actuators connect to these local interfaces:
Analog-to-Digital Converter (ADC)
The software on the gateway is then responsible, to aggregate information from sensors and dispatch it to the IoT Cloud. Also, the gateway may receive commands from the cloud, which it relays further to the sensors and actuators via the local interfaces.
IoT Gateways takes care of the protocol impedance mismatch between your IoT Cloud and your sensors (or actuators).
An IoT gateway filters data at the network edge so that only relevant data is dispatched to the IoT cloud. Here are some examples where this is useful:
Sensors often ‘chirp’ data periodically. A sensor may emit data a much higher frequency than actually needed by your application.
Data from sensors may include edge values and boundary-conditions which could be ignored.
Sometimes sensors misfire or provide bad sample values which can be discerned and ignored at the outset.
If all such sample values are dispatched to the Cloud it consumes additional network bandwidth; And such data may not be useful to your application at all.
An IoT gateway allows you to specify filtering rules, so that only useful data is sent to the cloud.
Edge filtering helps sanitize your sensor data before dispatching to the IoT cloud.
In addition to filtering sample values, an IoT gateway also offers some stream processing capability to aggregate and to shape data coming from the sensors. For example:
Some sensors offer non-linear response curves. Their sampled values may have to be transformed to a linear scale before transmission to the IoT Cloud.
Sensor response may be within a very wide bit range (Say 128 bits) and needs to be scaled down (Say to 16 bits), since your application does not need such a high resolution of measurement.
Sensors may exhibit hysteresis, which needs to be compensated for.
Sensors may exhibit temperature sensitivity, which needs to be compensated for.
A sensing element may need an average of 5 sample measurements to determine a more precise answer.
Data shaping ensures that any quirks and idiosyncrasies in your sensors are handled before sample values are dispatched to the IoT Cloud.
Most IoT applications involve some kind of a ‘control loop’. For example, if the temperature reaches a certain threshold, we need to shut-off the furnace.
A typical control loop involves one or more sensing element, a decision tree (rules engine), and a command to the actuator. Any control loop exhibits a latency of its own.
While the business logic of the control loop could be implemented on the IoT Cloud, certain applications may require a much faster response time.
In such cases, the business rules (decision making) are localized to the IoT gateway itself. A gateway can trigger an actuator based on certain conditions.
IoT Gateway enables tighter control loops with low response latency.
Aggregating and rolling-up data at the edge (field) before sending it to the Cloud saves substantial bandwidth. IoT gateways often provide data aggregation and analytics capabilities so that only concise information is dispatched to the cloud for further processing and archival.
Enterprise systems often need to ingest telemetry data from the field. However, we need to ensure that appropriate enterprise security mechanisms are enforced before data can be ingested.
For example, lightweight IoT sensors may not have the capability to support TLS, HTTPS, Client Certificates, VPN tunnels etc. which are a standard part of enterprise security today.
An IoT gateway can provide such capabilities which integrating with your enterprise system or with the IoT cloud.
IoT gateways support the necessary enterprise security standards to ensure that only data from trusted client devices is ingested by your enterprise systems.
IoT Cloud platforms support a variety of protocols such as HTTPS, WebSockets, MQTT, AMQP etc. IoT Gateways provide the ability to connect to an IoT Cloud platform over these protocols.
Another role of IoT Gateways is to monitor the health of deployed sensors on the field, and to notify the IoT Cloud in case of an errant sensor.
IoT Gateways referred in this post are often called as Field Gateways, as they are often installed on the field (such as a factory floor).
Field gateways are different than Protocol Gateways which are a common component of IoT Cloud platforms. Protocol gateways are software components which run in the IoT Cloud (not on the field), and offer termination for various IoT protocols such as HTTPS, WebSockets, MQTT etc.
Field gateways can integrate with Protocol Gateways too!
Interface Capabilities: RS232, PCI, Zigbee, Bluetooth etc.
Network Capabilities: Ethernet, WiFi.
Embedded OS: Hardened OS such as WindRiver Linux, Ubuntu Core.
If you’re building smart solutions that involve primitive sensors and actuators, IoT Gateways can be an indispensable part of your solution. They offer the ability to integrate with your sensors locally, support multiple cloud protocols, and an ability to filter and shape your data before transmission to the IoT Cloud.
IoT platforms are an essential part of IoT solutions today. They help accelerate the development of IoT applications and also ensure the requisite level of security, remote management, and integration capabilities in your solution.
There are several established platform providers in the market today such as – AWS IoT, ThingWorx, Azure IoT, Xively et. al. Many of these platforms share common features and architectural patterns.
In this post, we explore the architectural components and essential patterns to be considered in your IoT solutions.
We also share our wishlist of desired features for IoT Cloud Platforms. Such a wishlist is quite useful when trying to evaluate and choose a platform for a specific IoT solution.
Device Connectivity and Protocol Support
IoT devices support a variety of protocols, so any mature IoT platform should include support for multiple protocols such as: MQTT, AMQP, CoAP, STOMP, WebSockets, XMPP etc.
A component within an IoT platform which handles (terminates) these protocols is often called as the Cloud Gateway. Such gateways need to be highly scalability with an ability to process millions of messages each day.
Most IoT protocols use a message-centric, asynchronous communication model instead of the traditional Request-Response model of Web Applications. Hence, IoT platforms often include a scalable message bus infrastructure that is responsible for routing messages between devices and application services. Messages are delivered to one or more recipients using a pub-sub delivery model.
Device connectivity is often divided into two logical channels – control and data. The QoS levels and the exact protocols used for each logical channel may vary depending on specific application needs.
A Control Channel: To deliver device commands, health status, updates etc.
A Data Channel: To carry actual telemetry data, sampling values, from devices to the platform.
Unified Device Management Capabilities
Device management is a must-have feature for any IoT platform today. This includes capabilities enumerated below. Such capabilities are typically exposed as an admin dashboard with can be used by IoT Ops personnel.
Device Inventory: Tracking inventory of devices (things).
Device Health: Capturing heartbeat and health status of devices.
Remote Configuration Management: Remote management of device configuration using two-way sync capabilities.
Remote Device Management: Remote management of the device state – wipe, lock, activate.
Device Firmware Upgrades: Over-the-air firmware upgrades with canary releases.
Remote Logging: Remote access to device logs and capturing error reports from devices.
Nearly all CIOs rate ‘security’ to be a paramount concern for IoT applications today. Any IoT Platform hence needs to offer robust security features out-of-the-box. These include:
Device Identity: Establish a secure device identity using client certificates or other cryptographic means.
Device Enrollment: Securely enroll and authorize IoT devices to the platform.
Device Policy: Fine-grained authorization control to restrict device traffic coming into the IoT platform. Restrict what devices can publish, and what they can subscribe to.
Secure Communication Channels: Provide secure tunnels for communication between devices and the platform (TLS / SSL / IPSec / Private Networks etc).
Secure Firmware Delivery: Deliver signed software updates and checksum verifications during firmware upgrades.
This includes the ability to capture data streams from devices in real-time and performing analytics to drive business decision making.
Analytics can be offered in four flavors:
Predictive analytics using machine learning and,
The underlying analytics platform should be ready for scale, with an ability to handle millions (or even billions) of telemetry messages each day.
Support for Business Rules
This component provides ‘extensibility’ to an IoT Platform. This is where business logic (specific to your IoT application) gets codified.
It includes a business rules engine which can be customized to your business requirements, and it also includes a micro-services stack where custom code (business logic, lambda functions etc.) can be deployed by the application developer.
The rules engine often forms an important part of the ‘control loop’ for IoT applications. For example: If the temperature of a furnace exceed a certain threshold, a specific business rule triggers, and this may send a ‘cut-off’ command to the electric furnace.
Rules engines provide a DSL (Domain Specific Language) to express business rules. A common pattern to express rules is also the IFFT (If-This-Then-That). Alternately, you can codify your business logic in a programming language of your choice and deploy it as micro services.
Rules engines and micro services hook into the message bus so that they are able to receive real-time telemetry data and dispatch commands to devices.
Most enterprise systems offer standard protocols such as REST, SOAP, and HTTPS to facilitate integration with other systems. Enterprise cloud platforms also offer capabilities such as Big Data Stores, Large File Stores, Notification Services etc.
To build a complete IoT solution, devices need to integrate with legacy enterprise solutions and enterprise cloud applications. IoT platforms hence need to provide connectors to such enterprise and cloud services. These connectors would be invoked by the business rules or by the micro services running on the IoT platform.
The rapid growth of IoT paradigms today has made it necessary to accelerate ‘goto market’ timelines for IoT solution providers. Leveraging an IoT platform is a great way to achieve this goal.
IoT platforms provide cross-cutting concerns such as connectivity, security, management, and analytics so that solution developers do not reinvent the wheel. It is critical for you to evaluate your chosen IoT platform against these set of features before you embark on your journey. Now go build something awesome!
Over the past few years, consumer applications and enterprise solutions have been rapidly adopting the Node.js stack. In this paper we explore three key areas relevant to Node.js infrastructure – scalability, resilience, and mature DevOps.
Challenges with Node
Underutilizing Multi-Core CPUs
Node.js inherently operates in a single-process execution model. However, most modern production-grade hardware has multiple CPU cores.
So the execution of Node.js on modern hardware results in heavy utilization of one CPU core (due to CPU affinity), while leaving the other CPU cores underutilized.
No Isolation Between The Server Engine and Your Biz Logic
Mature servers such as Apache Tomcat or Apache HTTPd bring some degree of process-isolation (or thread-isolation) between the server core and the developer’s business code. Faults in business code do not crash the core server engine itself.
Node.js does not inherently have such an isolation. Node developers typically initialize a web server and run business code within the same process. If the business code throws an exception, it causes the server itself to crash!
The Risks of A Weakly Typed Language
Try-Catch Doesn’t Suffice
A try-catch block will not catch exceptions that occur in an async callback function contained within itself. Such is the nature of async callbacks!
Which means, there is only a limited use of try-catch blocks for the purposes of error handling and defensive programming in Node.js
Slow Memory Leaks
The Node Package Manager has become the de-facto approach to import libraries (modules) into your Node application today. Developers are ‘trigger happy’ about the use of many public NPM modules in their application code.
However many npm modules exist in the wild, and have rarely been curated or intensively tested. Some of them misbehave, throw unexpected exceptions, or slowly leak memory at runtime. Moreover, such leaks may be difficult to catch in your profiling tests as the rate of the memory leak could be slow and only adds-up over time.
Utilizing Multi-Core CPUs
To effectively utilize multiple CPU cores and to achieve a higher application throughput, one can think of the following approach:
Spawn multiple worker processes to execute our Node.js code.
The kernel scheduler will allocate these worker processes across the available CPU cores on the system: Since Linux kernels often prefer CPU-affinity, each worker process is likely to get allocated to a specific core for it’s entire lifetime.
Distribute inbound HTTP requests evenly across these worker processes (We see how this really happens later).
As inbound requests arrive, each worker process services the requests allocated to it and thus starts utilizing it’s own CPU core.
Comparisons With Apache MPM Pre-fork
At a first glance, this approach seems very similar to Apache’s MPM Pre-fork module, which spawns multiple child processes at startup and delegates incoming requests to them. However there is one key difference!
The Apache-Way of Doing Things
Apache’s I/O is blocking in nature. Which means, a child process will receive a request and will often block waiting for I/O to complete (Say, a file read from the disk). In order to increase the concurrency in this case, we spawn a pool of additional child processes. So while some processes are blocked, other processes can continue serving new inbound requests from the clients.
We can keep increasing our concurrency by increasing the number of child processes – but only to a certain limit!
Since each process has it’s own memory footprint and we soon reach a limit for the number of child processes that we can spawn without causing excessive thrashing of the virtual memory on this system. At some point, there will be too many child processes vying for attention from the OS scheduler, and the cost of swapping the process images in-and-out of the disk will be prohibitive.
The Node-Way of Doing Things
A Node process, on the other hand, does not block for I/O at all. Which means, the process can service more and more inbound requests until it’s own CPU core is nearly saturated.
Since each Node process has the ability to saturate it’s own CPU core, the number of Node processes required here to achieve high concurrency is equal to the number of CPUs on that machine. Fewer processes means, an overall lower memory footprint.
The Node.js cluster module adopts this approach. Let us explore the workings of the cluster module further in the next section.
Figure 2: Node’s non-blocking IO. One process per CPU core.
The Node.js Cluster Module
The primary Node.js process is called the master. Using the cluster module, the master can spawn additional worker processes and tell them which Node.js code to execute. This works much like the Unix fork() where a master process spawns child processes.
IPC Channel Between Master and Workers
Whenever a new worker is spawned, the cluster module sets up an IPC (Inter Process Communication) channel between the master and that worker processes. Thru this IPC mechanism, the master and worker can exchange brief messages and socket descriptors with each other.
Listening to Inbound Connections
Once spawned, the worker process is ready to service inbound connections an invokes a listen() call on a certain HTTP port. Node.js internally rewires this call as follows:
The worker sends a message to the master (via the IPC channel), asking the master to listen on the specified port.
The master starts to listen on that port (if it is not already listening).
The master is now aware that a specific worker has indicated interest in servicing inbound requests arriving on that port.
While it may seem that the worker is invoking the listen(), the actual job of listening to inbound requests is done by the master itself.
Load Balancing Between Worker Processes
When an inbound request arrives, the master accepts the inbound socket, and adopts a round-robin mechanism to decide ‘which worker’ should this request be delegated to.
The master then hands-over the socket descriptor of this request to that worker over the IPC channel. This round-robin mechanism is part of the Node core and helps accomplish the load balancing of the inbound traffic between multiple workers.
Minimize Responsibilities of The Master
Let your master process do a minimal amount of work and be responsible only for:
Spawning worker processes at the start of your server.
Managing the lifecycle of your worker processes.
Delegating all inbound requests to workers.
In particular, do not encapsulate your business logic in the master process. And do not load any unwanted npm modules in the master.
Most runtime errors are likely to occur due to buggy business code or npm modules used by your code. By encapsulating business code only in the workers, such errors will impact (or crash) a worker processes, but not impact your master process.
This gives you a stable, unhindered, master process which can ‘baby-sit’ worker processes at all times. As we shall see later, the master is responsible to manage workers and ensure that enough healthy workers are available to service your inbound traffic.
If the master itself crashes or is badly-behaved (because it ran buggy business code), there is no caretaker left for your workers anymore.
Replenishing Worker Processes
It is possible that worker processes die over time. This can happen due to various reasons (Running out of memory, receiving a Unix signal to forcefully kill itself, programming bug causes an abrupt crash in a certain path of execution).
When a worker dies, Node.js notifies your master process with an event. At that point, in the event handler, your master process should spawn a new worker process. This ensures that we have enough workers in our pool to service inbound traffic.
Gracefully Killing A Worker Process
As we shall see in later sections, there are several scenarios where you would like to gracefully shutdown (kill) a worker process in your cluster. Let us understand how we can accomplish a graceful worker shutdown:
Suppose, the master decides to elegantly kill a specific worker &om the present pool of workers in the cluster.
The master sends a signal or a message to that worker (via the IPC channel) asking the worker to gracefully kill itself.
At this point, that worker disconnects from the IPC channel, so it stops accepting new inbound HTTP requests from the master.
The worker attempts to gracefully finish any in-flight requests that it has already accepted in the past. (So that we don’t drop in-flight requests and we don’t send errors to our clients).
After giving itself time to gracefully complete in-flight requests, the worker attempts to close any resources it had acquired (DB connections, cache connections, sockets, file handles etc).
Then, the worker kills itself.
If the worker does not manage to kill itself elegantly within a certain window of time, say 10 seconds, the master decides to forcefully kill the worker by sending it a signal.
Dealing With Uncaught Exceptions
Even with meticulous programming and defensive tactics, unhandled exceptions are likely to occur at runtime, and your Node server has to deal with it. But how?
When an unhandled exception occurs, Node.js offers your worker process a way to ‘catch’ it. However, Node creator Ryan Dahl, mentions that the event loop is likely to be in an indeterminate state at that point in time.
So it is best to kill that worker process as soon as you can, and spawn a new worker as a replacement. Here is what you should do:
Return a HTTP 500 for the request that resulted in an unhandled exception.
Perform the steps for a graceful worker shutdown (as we’ve seen before) and then let the worker process kill itself.
When the master notices that a worker has died, it spawns a new worker at that point.
Consider the worker to be ‘unhealthy’ anytime it catches an unhanded exception. The above steps would be an elegant way to deal with the situation.
Periodic Roll-Over of the Cluster
We’ve seen earlier how some npm modules could potentially result in slow memory leaks. Sometimes your own own code could leak memory as well.
To keep the cluster healthy, it is recommended that your master periodically kill all worker processes and spawn new ones. But this needs to be done elegantly – If you kill all workers at once, there would be nobody left to do the work and the server’s throughput will drop momentarily.
We adopt a process called as slow rolling of workers as follows: • At regular intervals, say every 12 hours, the master initiates the rolling process.
The master chooses one worker from the present pool of workers in the cluster and decides to gracefully kill that worker proces. (We’ve already seen the steps to gracefully kill a worker in an earlier section).
At the same time, the master spawns a new worker process to replenish capacity.
Once the rollover of this worker is completed, the master picks the next worker from the initial pool to gracefully roll that one, and the process continues until all workers are ‘rolled over’.
Rolling workers is a great way to keep your node cluster healthy over elongated periods of time.
Preventing A ‘Self Inflicted’ Denial of Service Attack
So far we’ve looked at how our master can baby-sit worker processes and spawn new ones if a worker dies. But here is an interesting scenario to consider:
An inbound HTTP request results in execution of a specific (buggy) code that throws an uncaught exception.
The worker process kills itself, and then the master spawns a new worker right away!
Subsequent HTTP requests again result in uncaught exceptions in the new worker. The new worker decides to kill itself, and this self destructive cycle continues over and again.
The process of continuously spawning a new process over and again (in an indefinite loop), causes your OS to thrash. Anything else that is possibly running on that machine will get impacted too. This represents a ‘self inflicted’ denial of service attack.
The ultimate resolution to this problem would be to fix that buggy code and redeploy it. But until then, you need to add some safeguards to prevent such a run-away conditions from occurring on your machine:
When the master spawns a new worker, it watches if the new worker process survives for a certain number of seconds (threshold).
If the worker dies within that threshold of time, the master infers that something is seriously wrong within the mainline code itself (and not just within some corner case).
At this point, the master should dispatch a panic message into your logs, or invoke a HTTP call to an alerting service – that gets your rapid response team into action!
Also, the master should to throttle the rate of spawning new process at this point, so as not to thrash the OS or impact other things running on that machine.
It is likely that this machine would soon be devoid of any useful workers to serve any requests. But at least you have: (A) Notified your rapid response team to swing in action (B) Prevented a run- away condition in your OS.
In a subsequent section, we talk about how your master process can improvise this further and safeguard your server from deploying such buggy (self destructing) code.
Zero Downtime Restarts
The utopia for a mature DevOps is to have a zero downtime restart capability on the production servers. This would mean the following things:
The development team can push new code snapshots to a live server without shutting down the server itself (even for a moment).
All in-flight and ongoing requests continue to be processed normally without clients noticing any errors.
The new code cuts-over seamlessly and soon new client requests get served by the newly deployed code.
With Node.js this is not an utopia anymore. We have already looked at most of the ingredients which can make zero downtime restart a reality:
Suppose you have a running Node.js cluster that is serving Version 1 of your code. All the modules from your source code are already loaded and cached by Node’s module cache.
Now you place Version 2 of your code on the file system. And you send a signal to the master to initiate a graceful rollover of the entire cluster (We’ve seen those details earlier).
At this point, the master will gracefully kill one worker at a time, and span a new worker as a replenishment. This new worker will read Version 2 of your code and starting serving requests suing the Version 2 code.
As an additional safeguard, if the new worker dies within a short threshold of time, the master infers that the new code may have a buggy mainline and hence it does not proceed with the graceful restart for other workers.
There will be a brief period during which some of your workers are serving Version 1 of the code, and some others are already serving Version 2. This may or may not be okay, depending on the circumstances.
Wrapping Requests in Domains
We explored earlier how a simple try-catch block does not suffice to catch exceptions that occur within asynchronous callbacks.
Node.js has now introduced the concept of domains to elegantly handle asynchronous errors. Your implementation hence needs to do the following:
Wrap every inbound request in a domain. Wrap all event emitters from that request in that same domain.
Write an error handler on that domain which can catch runtime errors that occur within that domain.
When you encounter an error in this domain, gracefully kill the present worker process and re-spawn a new one.
This approach is similar to handling uncaught exceptions which we described earlier. The key difference is that we are wrapping individual requests within the scope of each domain. This helps isolate faults within a request (context) and dealing with that specific request more elegantly.
Delegating Work to a Front-Proxy
Node.js does have the capability to accept inbound HTTPS traffic, but this task is best done by a front-proxy such as Nginx which can run on the same machine as your Node server. Configure Nginx as a reverse proxy and let it terminate inbound SSL connections. Alternatively, a front load balancer could also do that for you.
Compressing HTTP Streams
There are npm modules to achieve gzip compression of HTTP streams. But you would rather delegate this job to the front-proxy such as Nginx. This way your Node.js server can focus on serving your core business logic and delegate such tasks to Nginx.
We’ve spoken earlier about how a single Node.js process can, in theory, saturate a CPU core by accepting more and more inbound requests. In reality, you would not like to reach the peak of your machine capacity in production. The front-proxy can play an important role in traffic throttling and making sure your Node.js machine does not fully saturate.
Other Recommended Practices
Running as a Non Privileged User
This may be an obvious consideration when building any server runtime: For reasons of security, you do not want your Node.js processes to run with root privileges! So make sure you create a non-privileged user and have the node processes to run under that user on your system. This has been a standard guideline, of course, when creating any daemon on Linux.
Keeping IPC Messages Lean
The IPC channel between the master and child processes is only intended to exchange short, control messages. Do not abuse this channel to send large business payloads.