Understanding IoT Gateways – The Glue for Industrial Internet of Things

IoT gateways have become a critical component of IoT deployments today. In this post, we try to understand the need for IoT Gateways and the role they play in an IoT solution architecture.

Integrating ‘Things’ With the Cloud

Some IoT appliances are sufficiently advanced to support the full extent of the TCP/IP stack and to securely communicate directly with your IoT Cloud.

However, we often encounter lightweight IoT sensors and actuators that support local communication interfaces only – such as Zigbee, Bluetooth, RS232, RS485 etc. They do not have the capability or the compute power to support a full TCP/IP stack.

In such cases, an IoT Gateway acts as an intermediary device deployed in the field. It provides multiple local interfaces to which sensors and actuators connect:

  • ZigBee
  • ZWave
  • Bluetooth
  • BLE
  • RS485
  • RS232
  • SPI
  • Digital IO
  • Analog-to-Digital Converter (ADC)

The software on the gateway is then responsible for aggregating information from the sensors and dispatching it to the IoT Cloud. The gateway may also receive commands from the cloud, which it relays to the sensors and actuators via the local interfaces.

An IoT Gateway takes care of the protocol impedance mismatch between your IoT Cloud and your sensors (or actuators).

Edge Filtering

An IoT gateway filters data at the network edge so that only relevant data is dispatched to the IoT cloud. Here are some examples where this is useful:

  • Sensors often ‘chirp’ data periodically. A sensor may emit data at a much higher frequency than your application actually needs.
  • Data from sensors may include edge values and boundary conditions which could be ignored.
  • Sometimes sensors misfire or produce bad sample values, which can be discerned and ignored at the outset.

Dispatching all such sample values to the Cloud consumes additional network bandwidth, and the data may not be useful to your application at all.

An IoT gateway allows you to specify filtering rules, so that only useful data is sent to the cloud.

Edge filtering helps sanitize your sensor data before it is dispatched to the IoT cloud.
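
To make this concrete, here is a minimal sketch of what such a filtering rule might look like in gateway software written in Node.js. The sensor callback, the thresholds, and the sendToCloud() uplink are all hypothetical stand-ins for a real gateway SDK:

    // Hypothetical edge-filtering sketch. onSample() would be wired to a
    // local sensor interface; sendToCloud() stands in for the cloud uplink.
    const REPORT_INTERVAL_MS = 60 * 1000;        // the app needs one sample per minute
    const VALID_RANGE = { min: -40, max: 125 };  // discard misfires outside the sensor spec

    let lastDispatch = 0;

    function onSample(value) {
      // Rule 1: ignore bad samples and boundary values outside the valid range.
      if (value < VALID_RANGE.min || value > VALID_RANGE.max) return;

      // Rule 2: down-sample a 'chirpy' sensor to the rate the app actually needs.
      const now = Date.now();
      if (now - lastDispatch < REPORT_INTERVAL_MS) return;

      lastDispatch = now;
      sendToCloud({ type: 'temperature', value: value, ts: now });
    }

    function sendToCloud(msg) { console.log('dispatch:', msg); } // stub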

Data Shaping

In addition to filtering sample values, an IoT gateway also offers some stream processing capability to aggregate and shape the data coming from the sensors. For example:

  • Some sensors offer non-linear response curves. Their sampled values may have to be transformed to a linear scale before transmission to the IoT Cloud.
  • Sensor readings may span a very wide bit range (say 128 bits) and may need to be scaled down (say to 16 bits), since your application does not need such a high resolution of measurement.
  • Sensors may exhibit hysteresis, which needs to be compensated for.
  • Sensors may exhibit temperature sensitivity, which needs to be compensated for.
  • A sensing element may need its readings averaged over 5 samples to yield a more precise value.

Data shaping ensures that any quirks and idiosyncrasies in your sensors are handled before sample values are dispatched to the IoT Cloud.
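
As an illustrative sketch (again with hypothetical names), shaping logic on the gateway might average a window of 5 samples and linearize the sensor’s response curve before dispatch:

    // Hypothetical data-shaping sketch: average 5 raw samples, then map the
    // sensor's non-linear response onto a linear scale before dispatch.
    const WINDOW = 5;
    const samples = [];

    // Placeholder transfer function; a real sensor would use its datasheet curve.
    function linearize(raw) {
      return Math.sqrt(raw) * 10;
    }

    function onRawSample(raw) {
      samples.push(raw);
      if (samples.length < WINDOW) return;

      const avg = samples.reduce((a, b) => a + b, 0) / WINDOW;
      samples.length = 0; // reset the averaging window

      sendToCloud({ type: 'temperature', value: linearize(avg), ts: Date.now() });
    }

    function sendToCloud(msg) { console.log('dispatch:', msg); } // stub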

Control Loops

Most IoT applications involve some kind of a ‘control loop’. For example, if the temperature reaches a certain threshold, we need to shut off the furnace.

A typical control loop involves one or more sensing elements, a decision tree (rules engine), and a command to an actuator. Any control loop exhibits a latency of its own.

While the business logic of the control loop could be implemented on the IoT Cloud, certain applications may require a much faster response time.

In such cases, the business rules (decision making) are localized to the IoT gateway itself. A gateway can trigger an actuator based on certain conditions.

An IoT gateway enables tighter control loops with low response latency.
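
As a hedged sketch (all names hypothetical), such a localized rule might look like this on the gateway:

    // Hypothetical local control loop: decide on the gateway and actuate locally,
    // without waiting for a cloud round-trip. shutOffFurnace() is a stub for a
    // command sent over a local interface (e.g. Digital IO).
    const MAX_TEMP_C = 250;

    function onTemperatureSample(tempC) {
      if (tempC >= MAX_TEMP_C) {
        shutOffFurnace(); // low-latency, local decision
      }
    }

    function shutOffFurnace() { console.log('furnace cut-off command sent'); }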

Edge Analytics

Aggregating and rolling up data at the edge (in the field) before sending it to the Cloud saves substantial bandwidth. IoT gateways often provide data aggregation and analytics capabilities so that only concise information is dispatched to the cloud for further processing and archival.

Edge Security

Enterprise systems often need to ingest telemetry data from the field. However, we need to ensure that appropriate enterprise security mechanisms are enforced before data can be ingested.

For example, lightweight IoT sensors may not have the capability to support TLS, HTTPS, Client Certificates, VPN tunnels etc. which are a standard part of enterprise security today.

An IoT gateway can provide such capabilities while integrating with your enterprise systems or with the IoT cloud.

IoT gateways support the necessary enterprise security standards to ensure that only data from trusted client devices is ingested by your enterprise systems.

Cloud Integration

IoT Cloud platforms support a variety of protocols such as HTTPS, WebSockets, MQTT, AMQP etc. IoT Gateways provide the ability to connect to an IoT Cloud platform over these protocols.

Health Monitoring

Another role of IoT Gateways is to monitor the health of sensors deployed in the field, and to notify the IoT Cloud in case of an errant sensor.

Noteworthy Points

The IoT Gateways referred to in this post are often called Field Gateways, as they are typically installed in the field (such as on a factory floor).

Field gateways are different from Protocol Gateways, which are a common component of IoT Cloud platforms. Protocol gateways are software components that run in the IoT Cloud (not in the field) and offer termination for various IoT protocols such as HTTPS, WebSockets, MQTT etc.

Field gateways can integrate with Protocol Gateways too!

Components of An IoT Gateway

  • Compute Capabilities: CPU, Memory, Persistent Store.
  • Interface Capabilities: RS232, PCI, Zigbee, Bluetooth etc.
  • Network Capabilities: Ethernet, WiFi.
  • Embedded OS: Hardened OS such as Wind River Linux, Ubuntu Core.

Wrapping Up…

If you’re building smart solutions that involve primitive sensors and actuators, IoT Gateways can be an indispensable part of your solution. They offer the ability to integrate with your sensors locally, support for multiple cloud protocols, and the ability to filter and shape your data before transmission to the IoT Cloud.


MQTT: A Protocol for the Internet of Things

Connecting smart appliances requires a robust and lightweight protocol that facilitates efficient M2M (Machine-to-Machine) communication.

Such protocols are expected to work in conditions of low bandwidth and intermittent connectivity. The protocol implementation should also require a small code footprint to run on devices having limited computational power. See this post for further details about the desired capabilities of M2M protocols.

The MQTT (MQ Telemetry Transport) Protocol was designed nearly 15 years ago to meet such constraints and to facilitate efficient transportation of telemetry data from embedded devices. With the emergence of IoT today, this protocol has risen to prominence – MQTT is an ISO standard and the open source community offers an extensive set of SDKs and Libraries that support MQTT.

MQTT Concepts

If you’re familiar with Enterprise Messaging Systems, you will recognize many of the essential concepts in MQTT.

A Pub-Sub Delivery Model

MQTT uses a ‘publish-subscribe’ model for message delivery. All parties that wish to communicate with each other connect to a centralized message broker. The broker acts as a mediator, and all messages are routed via the broker itself.

Parties which connect to the broker can be smart appliances, sensors, actuators, and application services.

A sender, such as a temperature sensor, publishes messages (temperature values) to the broker. To ensure that messages reach only the relevant recipients, MQTT uses the concept of topics. A topic is a string that represents a virtual address. One or more recipients can indicate their interest in a topic by subscribing to it (beforehand).

Example:

A furnace controller may be interested in receiving messages that contain temperature information, but not messages carrying acceleration information. A logging service, on the other hand, may be interested in receiving all messages. In this case, the topics could be ‘temperature’ and ‘acceleration’.

Each message thus represents an independent package of information and also contains the name of the ‘topic’ that this message pertains to.

When the broker receives a message, it determines the topic to which the message belongs. It then determines the recipients who are interested in that topic at that point in time. The broker is then responsible for forwarding the message to each of the interested recipients.

The number of subscribers attached to a topic may vary over time as the system evolves. This hub-and-spoke architecture of MQTT ensures that senders and recipients are decoupled from each other, and it gives us a 1-to-N communication capability between parties.

Simply put, you could think of four stages here:

Connect: All parties ‘connect’ to a message broker over a TCP connection.

Subscribe: Each party informs the broker about specific topics of its interest. Each topic is represented by a unique string.

Publish: Parties publish messages to the broker from time to time – say, when a temperature sensor has sampled temperature information.

Delivery: The broker inspects the topic specified within each message and routes the message to everyone who has subscribed to that topic.

As we shall see later, there is a bit more to it – If the broker determines that an interested recipient is currently not reachable (disconnected), it may queue the messages internally for future delivery.
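
These four stages map directly onto client code. Here is a minimal sketch using the open source MQTT.js client (the ‘mqtt’ package on npm); the broker URL and topic name are placeholders:

    // Connect: establish a TCP connection to the broker.
    const mqtt = require('mqtt');
    const client = mqtt.connect('mqtt://broker.example.com:1883');

    client.on('connect', () => {
      // Subscribe: tell the broker which topic we are interested in.
      client.subscribe('temperature');

      // Publish: send a message to the broker.
      client.publish('temperature', JSON.stringify({ value: 21.5 }));
    });

    // Delivery: the broker forwards messages on subscribed topics to us.
    client.on('message', (topic, payload) => {
      console.log(`${topic}: ${payload.toString()}`);
    });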

TCP Connectivity

MQTT uses TCP/IP as the underlying transport mechanism. MQTT devices establish a persistent TCP connection with the broker at all times. This acts as a two-way point-to-point messaging channel between the device and the broker itself.

If the TCP connection is broken, the device attempts a reconnection with the broker. While the device is offline, the broker will buffer (queue) messages until that device comes back online.

Note that devices do not have any direct transport layer connectivity with each other. The broker acts as a central hub to which everybody connects.

MQTT Topics

Every MQTT message contains a ‘topic’ within the message header. The topic is the primary means of routing messages to intended recipients (subscribers). You can think of a topic as a virtual address to which a message is destined.

  • A topic is a simple string such as ‘temperature’. A sensor may be publishing messages (current temperature information) to this topic periodically.
  • Topics can be hierarchically organized by specifying ‘/’ as the separator. For example, if a building has multiple sensors, they could be organized as:
    • building/floor-1/temperature
    • building/floor-1/humidity
    • building/floor-2/temperature
    • building/floor-2/humidity
  • Topic strings also support wildcard characters ‘+’ and ‘#’
    • A ‘+’ matches a single level within the hierarchy, and it can occur at any level within the hierarchy.
    • A ‘#’ matches any number of remaining levels, and it can occur only at the end.
  • For Example: A client who subscribes to ‘building/+/temperature’ would receive messages from both: ‘building/floor-1/temperature’ and ‘building/floor-2/temperature’
  • For Example: A client who subscribes to ‘building/floor-1/#’ would receive messages from all topics under ‘building/floor-1/’.
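
With MQTT.js, wildcard subscriptions look like this (the broker URL is a placeholder):

    const mqtt = require('mqtt');
    const client = mqtt.connect('mqtt://broker.example.com:1883');

    client.on('connect', () => {
      client.subscribe('building/+/temperature'); // all floors, temperature only
      client.subscribe('building/floor-1/#');     // everything under floor-1
    });

    client.on('message', (topic, payload) => {
      console.log(`${topic}: ${payload.toString()}`);
    });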

Transient vs Durable Subscriptions

  • A client connects to the broker over a TCP connection. Due to network conditions, this TCP connection may get dropped from time to time, and the client reconnects each time. Each such connection represents a temporary physical session between the two parties.
  • However, the logical association between a client and its subscribed topics can outlast these temporary session outages. This concept is called durable subscriptions.
  • In effect, the client has informed the broker: “Please keep my messages for this topic, while I’m offline. I’ll come back and pick those up later”.
  • When a client connects to the broker, it can specify if this connection is Transient (Clean Session Flag = 1) or Persistent (Clean Session Flag = 0).
  • If the Clean Session Flag = 1, the broker considers this to be a transient connection. If this client abruptly disconnects later, the broker would not ‘keep’ any messages on behalf of this client for the topics of interest.
  • If the Clean Session Flag = 0, the broker considers this to be a durable connection. In this case, if the client abruptly disconnects later, the broker would ‘keep’ all messages, on topics subscribed by this client, with the assumption that the client would come back in the future to pick those up. When the client connects again (with Clean Session Flag = 0), the queued messages are delivered to this client.

Supporting durable connections means the broker has to track additional ‘state’ information on behalf of the clients.
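
In MQTT.js terms, a durable session is requested via the connect options; a sketch (the client ID and broker URL are placeholders):

    const mqtt = require('mqtt');

    const client = mqtt.connect('mqtt://broker.example.com:1883', {
      clientId: 'furnace-controller-01', // stable identity across reconnects
      clean: false                       // Clean Session Flag = 0 (durable)
    });

    client.on('connect', () => {
      // Subscribe at QoS 1 so the broker queues messages while we are offline.
      client.subscribe('building/floor-1/temperature', { qos: 1 });
    });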

Concept of Retained Messages

  • A client can publish a message to a specific topic, and flag this message to be ‘retained’ by the broker.
  • The broker then delivers this message to the current subscribers of this topic and also retains this message for future use.
  • In the future, when any new clients subscribe to this topic, the broker will automatically deliver this ‘retained message’ to those new subscribers right away.

This model is very useful in scenarios where a subscriber should receive the ‘last known good value’ of something. Say a subscriber is interested in receiving temperature information from a sensor; it can receive the ‘last known temperature’ from the broker (instead of waiting for the publisher to publish this information again).

Now the publisher only has to publish temperature values to the broker when the temperature changes. This approach helps reduce network chatter and conserves the publisher’s energy as well.
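
With MQTT.js, retaining a message is a per-publish flag; a sketch:

    const mqtt = require('mqtt');
    const client = mqtt.connect('mqtt://broker.example.com:1883');

    client.on('connect', () => {
      // retain: true asks the broker to store this as the 'last known good value'.
      client.publish('building/floor-1/temperature', '21.5', { retain: true });
    });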

Will and Testament

Given the intermittent network connectivity, it would be useful if interested recipients can be automatically notified when a particular device goes offline. This capability is achieved using a ‘Last Will & Testament’ (LWT) as follows:

  • When a device connects to a broker, it informs the broker about its ‘Last Will’, and the broker remembers this information for the future.
  • Later, if that device abruptly goes offline, the broker automatically dispatches this ‘Will’ (message) to any interested parties who may have previously subscribed to this.

LWT is thus a useful way of notifying all interested parties when an IoT appliance goes offline.

For example: When a security camera abruptly disconnects, an interested Service may receive the LWT message from the broker, and this Service can further send an SMS notification to the home owner.
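
With MQTT.js, the will is registered in the connect options; a sketch with a placeholder topic and payload:

    const mqtt = require('mqtt');

    const client = mqtt.connect('mqtt://broker.example.com:1883', {
      will: {
        topic: 'devices/camera-42/status', // delivered only on abrupt disconnect
        payload: 'offline',
        qos: 1,
        retain: true // late subscribers also learn the last known status
      }
    });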

Keep Alive Messages

Sometimes client devices crash abruptly, or suffer an abrupt network disconnection. This can result in a half-open TCP connection on the broker – the broker continues to think that it has an active TCP socket with this client.

The ‘keep alive’ is introduced as a timeout mechanism by which parties can determine if the connection is still alive.

  • During the establishment of the connection, the client specifies the ‘keep alive’ duration to the broker. The broker remembers this value.
  • The ‘keep alive’ interval is the longest duration of time that the client and broker can endure without exchanging any message between themselves.
  • The broker maintains a ‘keep alive’ timer for each client:
    • If the broker does not receive any messages from the client for a duration of 1.5 x the ‘keep alive’ duration, it assumes that the client has disconnected.
    • Upon receipt of a message from the client within the ‘keep alive duration’, the broker resets the ‘keep alive’ timer for that client.
  • However, if the client does not have any new information to publish within the ‘keep alive’ interval, it can simply send a PINGREQ message to the broker and receive a PINGRESP back. This serves as a heartbeat mechanism between the two parties.
  • It is the responsibility of the client to keep sending messages within the ‘keep alive’ interval.
  • If the ‘keep alive’ threshold for a client is exceeded, the broker will do the following:
    • Forcibly close the TCP connection with this client.
    • If the client had specified an LWT (Last Will and Testament), it will be published to all interested recipients.
    • If the client has a durable subscription, the broker will retain all QoS 1 and QoS 2 messages pertaining to the client’s subscriptions until the client connects again.
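
Client libraries typically handle the PINGREQ/PINGRESP exchange automatically; you only choose the interval. A sketch with MQTT.js:

    const mqtt = require('mqtt');

    const client = mqtt.connect('mqtt://broker.example.com:1883', {
      keepalive: 60 // seconds; the broker assumes a disconnect after ~1.5x this
    });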

Quality of Service

MQTT deals with each message as an individual package of information. What are the delivery guarantees for a message to reach its intended recipients? MQTT brokers offer three QoS levels, as explained below:

Level 0: At Most Once Delivery: There is no guarantee of message delivery by the broker. This needs minimal overhead, but applications need to be designed with the assumption that a message may not be delivered as intended.

Level 1: At Least Once Delivery: There is a guarantee that a message will be delivered to each listener at least once (but it could be more than once). In this case, the handling of the message on the recipient needs to be idempotent – since a particular message may be received more than once.

Level 2: Exactly Once Delivery: There is a guarantee that a message will be delivered to each interested listener exactly once. Ensuring this requires additional overhead in the broker.
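
In MQTT.js, the QoS level is chosen per subscription and per publish; a sketch:

    const mqtt = require('mqtt');
    const client = mqtt.connect('mqtt://broker.example.com:1883');

    client.on('connect', () => {
      client.subscribe('alerts', { qos: 2 });              // exactly once
      client.publish('telemetry', '42', { qos: 0 });       // at most once
      client.publish('commands/cutoff', 'on', { qos: 1 }); // at least once
    });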

Getting Started with MQTT

MQTT Version 3.1.1 is the latest standard and is an OASIS specification today. Below are some additional references to get you started:

MQTT Message Brokers: Eclipse Mosquitto, Mosca, ActiveMQ, and RabbitMQ are some examples of MQTT broker products.

MQTT Client SDKs: The Eclipse Paho project offers an open source implementation of MQTT client SDKs. Libraries are provided for C, Java, Python, JavaScript and other programming languages.

MQTT Cloud Brokers: Most IoT Cloud providers, such as Amazon AWS IoT, provide a device gateway which supports a scalable MQTT message broker that devices can connect to.

MQTT: Summary

  • Asynchronous model of communication using discrete messages (events).
  • Hub-and-Spoke architecture using a centralized broker.
  • Instead of creating new network standards, MQTT uses ubiquitous IP networks with persistent TCP connections as the underlying transport mechanism.
  • Publish-Subscribe mechanism to decouple data producers (publishers) and data consumers (subscribers) using message queues (topics).
  • Topics represent ‘virtual addresses’ to which messages get delivered.
  • Low protocol overhead, with a fixed header of just 2 bytes; low network bandwidth usage.
  • Supports durable connections which outlast temporary TCP sessions.
  • Broker caches messages for each durable connection until the device reconnects.
  • Supports multiple Quality of Service levels for message delivery.
  • Supports keep alive timeout to detect if a device goes offline.
  • Supports Last Will and Testament (LWT) to notify parties if a device goes offline.

Architectural Features of IoT Cloud Platforms

IoT platforms are an essential part of IoT solutions today. They help accelerate the development of IoT applications and also ensure the requisite level of security, remote management, and integration capabilities in your solution.

There are several established platform providers in the market today – AWS IoT, ThingWorx, Azure IoT, Xively et al. Many of these platforms share common features and architectural patterns.

In this post, we explore the architectural components and essential patterns to be considered in your IoT solutions.

We also share our wishlist of desired features for IoT Cloud Platforms. Such a wishlist is quite useful when trying to evaluate and choose a platform for a specific IoT solution.

Device Connectivity and Protocol Support

IoT devices support a variety of protocols, so any mature IoT platform should include support for multiple protocols such as: MQTT, AMQP, CoAP, STOMP, WebSockets, XMPP etc.

A component within an IoT platform which handles (terminates) these protocols is often called the Cloud Gateway. Such gateways need to be highly scalable, with an ability to process millions of messages each day.

Most IoT protocols use a message-centric, asynchronous communication model instead of the traditional Request-Response model of Web Applications. Hence, IoT platforms often include a scalable message bus infrastructure that is responsible for routing messages between devices and application services. Messages are delivered to one or more recipients using a pub-sub delivery model.

Device connectivity is often divided into two logical channels – control and data. The QoS levels and the exact protocols used for each logical channel may vary depending on specific application needs.

  • A Control Channel: To deliver device commands, health status, updates etc.
  • A Data Channel: To carry actual telemetry data, sampling values, from devices to the platform.

Unified Device Management Capabilities

Device management is a must-have feature for any IoT platform today. This includes the capabilities enumerated below, which are typically exposed as an admin dashboard that can be used by IoT Ops personnel.

  • Device Inventory: Tracking inventory of devices (things).
  • Device Health: Capturing heartbeat and health status of devices.
  • Remote Configuration Management: Remote management of device configuration using two-way sync capabilities.
  • Remote Device Management: Remote management of the device state – wipe, lock, activate.
  • Device Firmware Upgrades: Over-the-air firmware upgrades with canary releases.
  • Remote Logging: Remote access to device logs and capturing error reports from devices.

Security Features

Nearly all CIOs rate ‘security’ as a paramount concern for IoT applications today. Any IoT Platform hence needs to offer robust security features out-of-the-box. These include:

  • Device Identity: Establish a secure device identity using client certificates or other cryptographic means.
  • Device Enrollment: Securely enroll and authorize IoT devices to the platform.
  • Device Policy: Fine-grained authorization control to restrict device traffic coming into the IoT platform. Restrict what devices can publish, and what they can subscribe to.
  • Secure Communication Channels: Provide secure tunnels for communication between devices and the platform (TLS / SSL / IPSec / Private Networks etc).
  • Secure Firmware Delivery: Deliver signed software updates and checksum verifications during firmware upgrades.

Telemetry Analytics

This includes the ability to capture data streams from devices in real time and to perform analytics that drive business decision making.

Analytics can be offered in four flavors:

  • Real-time analytics,
  • Batch analytics,
  • Predictive analytics using machine learning, and
  • Interactive analytics.

The underlying analytics platform should be ready for scale, with an ability to handle millions (or even billions) of telemetry messages each day.

Support for Business Rules

This component provides ‘extensibility’ to an IoT Platform. This is where business logic (specific to your IoT application) gets codified.

It includes a business rules engine which can be customized to your business requirements, and it also includes a micro-services stack where custom code (business logic, lambda functions etc.) can be deployed by the application developer.

The rules engine often forms an important part of the ‘control loop’ for IoT applications. For example: if the temperature of a furnace exceeds a certain threshold, a specific business rule triggers, and it may send a ‘cut-off’ command to the electric furnace.

Rules engines provide a DSL (Domain Specific Language) to express business rules. A common pattern for expressing rules is IFTTT (If-This-Then-That). Alternately, you can codify your business logic in a programming language of your choice and deploy it as micro services.

Rules engines and micro services hook into the message bus so that they are able to receive real-time telemetry data and dispatch commands to devices.

Integration Capabilities

Most enterprise systems offer standard protocols such as REST, SOAP, and HTTPS to facilitate integration with other systems. Enterprise cloud platforms also offer capabilities such as Big Data Stores, Large File Stores, Notification Services etc.

To build a complete IoT solution, devices need to integrate with legacy enterprise solutions and enterprise cloud applications. IoT platforms hence need to provide connectors to such enterprise and cloud services. These connectors would be invoked by the business rules or by the micro services running on the IoT platform.

Wrapping Up…

The rapid growth of IoT today has made it necessary to accelerate ‘go-to-market’ timelines for IoT solution providers. Leveraging an IoT platform is a great way to achieve this goal.

IoT platforms provide cross-cutting concerns such as connectivity, security, management, and analytics so that solution developers do not reinvent the wheel. It is critical to evaluate your chosen IoT platform against this set of features before you embark on your journey. Now go build something awesome!


Architectural Design Patterns for an IoT Platform

IoT solutions exhibit architectural characteristics that are often different from traditional Web or Mobile applications. We express these ideas as a series of patterns below.

Use these as a checklist when designing your next IoT solution or when evaluating other IoT platforms.

Pattern 1: Data Ingestion

IoT solutions require the ability to ingest events and data generated by IoT devices. The solution should be able to ingest such real-time telemetry data at a massive scale (i.e. millions, or even billions, of messages) and yet stay highly performant (i.e. low latency to accept and persist messages).

Design the ability to ingest and persist massive amounts of telemetry data via a primary ‘data channel’ in your IoT solution.

Pattern 2: Control Channel

Most web architectures adopt a Request-Response model of communication: The client sends a request to a service, and receives a response in return. While immensely useful, this model does not suffice for IoT applications.

IoT solutions need the ability to send commands and notifications to IoT Devices in real time. We need a two-way communication channel with the following characteristics:

  • The solution should allow either party to initiate communication.
  • The channel should also offer guarantees to ensure delivery of commands (i.e. Resilience against packet drops etc.)
  • Extensibility to support new commands in the future as the solution evolves.
  • Ensure that commands are delivered securely with sufficient mutual authorizations.

Protocols such as MQTT are well suited to act as such a ‘control channel’ with Quality of Service (QoS) guarantees for message delivery.

Design for a secure, two-way, control channel which can be initiated by either the Device or by the Cloud. Ensure ‘at least once’ delivery semantics for messages in this control channel.

Pattern 3: Loose Coupling

IoT devices often encounter intermittent connectivity in their wireless networks. Also, devices sometimes suspend their radio interfaces or hibernate entirely in order to conserve power.

The IoT communication channels hence need an asynchronous model of communication. Messages and commands should be queued for delivery if a device is intermittently disconnected or in a sleep mode.

Moreover, the set of devices deployed in the field changes with time: A temperature sensor may not know who the exact recipient of its telemetry information is. It may be one or more existing thermostats, or new thermostats and services in the future.

IoT communication channels hence require a loose-binding between the message senders and recipients. A topic-based, multicast communication is a very useful model to consider in your solution design.

Pattern 4: IoT Gateway

Lightweight IoT devices often operate within a PAN (Personal Area Network) using protocols such as Zigbee, Z-Wave, or BLE. These protocols do not support an IP stack to communicate directly with the IoT Cloud over WAN.

An IoT Gateway is an intermediate device that communicates with sensors and actuators using low-level protocols. It further offers the ability to connect with your IoT Cloud using IP-based protocols.

The Gateway offers capabilities such as sanitization of telemetry data, aggregation and edge-analytics of telemetry data.

Devices which do not support the IP stack will integrate using an IoT Gateway.

Pattern 5: Business Rules Engine

Once ingested, we need to run business rules on this data to derive control decisions and business insights. This could be done at two points in time:

The Hot Path

Ability to run business rules on real-time data streams. This enables our IoT platform to make decisions instantly, route information, or control objects in real time. For example, if a temperature threshold of the furnace is crossed, we need to recalibrate the heating system with immediate effect.

The Cold Path

This involves offline processing of the persisted telemetry data. For example, aggregating the temperature over the last three months to determine the average temperature of the furnace.

Provide an ability to process data streams in real-time, or to process them post-facto.

Pattern 6: Heartbeat

IoT devices tend to fall off the grid from time to time. This may be due to battery outages, network outages, hardware failures etc. Devices should send a ‘heartbeat’ to the Cloud platform from time-to-time which includes information about the internal health of the device.

Design devices to periodically send a heartbeat or a keep-alive signal to your IoT Cloud.

Pattern 7: Canary Firmware Releases

Most IoT solutions require the ability to remotely upgrade firmware running on your devices. The firmware needs to be securely downloaded, and verified for integrity before replacing the older firmware on the device.

Moreover, firmware upgrades should be incrementally rolled-out, instead of being pushed to all devices at once. This helps limit the risks due to a faulty firmware release, as well as reduces the burden on your Cloud infrastructure when devices attempt to download the new firmware.

Ability to incrementally roll out firmware upgrades to manage risks of new software releases.

Pattern 8: Unified Endpoint Management

Ability to manage endpoint devices remotely – a dashboard and a set of APIs to perform this task.

Pattern 10: Device Authorization

Web or Mobile applications typically authenticate users with credentials (User Name and Password). IoT devices are often headless and offer no means to provide such credentials.

IoT devices often use client certificates to authorize themselves with the IoT Cloud.

Use client certificates for device authorization. Make sure the client private keys are stored in a secure key-chain on the device.

Pattern 11: State Synchronization

This is similar to the concept of Device Shadows offered by platforms like Amazon AWS IoT. It creates a representation of the device ‘state’ as an object in the Cloud. Web and Mobile applications can then interact with this ‘state’ object in the Cloud to manipulate the state. These changes are then synchronized with the real-world object.

Provide a ‘state’ object which mirrors the real-world state of your device.
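
As a sketch, a state document loosely modeled on AWS IoT Device Shadows might look like this (the field names are illustrative):

    // Applications write to 'desired'; the device updates 'reported' when it
    // synchronizes. The platform computes the difference and pushes it to the
    // device when the device comes online.
    const shadow = {
      state: {
        desired:  { targetTempC: 22 }, // set by a web or mobile app
        reported: { targetTempC: 21 }  // last state reported by the device
      },
      version: 17 // increases on each update, for conflict detection
    };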

Pattern 12: Device Registry

IoT solutions need to keep track of all deployed devices and information about them – such as device identifiers, device certificates, device configuration etc. A device registry provides such a capability.


Protocol Design Challenges in M2M Communication

Connecting smart IoT appliances requires a robust and lightweight protocol that facilitates efficient Machine-to-Machine (M2M) communication. In this post, we explore some of the interesting challenges in the design of protocols for M2M communication.

Low Bandwidth

IoT devices deployed in the field are often connected using networks that have low-bandwidth or offer inconsistent throughput. Hence, the protocol overheads need to be very low – such as having small protocol headers, using variable length headers etc.

For example: Field devices are often connected using 2G / 3G carrier networks that offer low bandwidth.

Two-way Communication

IoT devices often require a two-way communication channel, whereby communication can be initiated by either party.

For example: Say, a device needs to send telemetry data to a Backend Application, or the Application wants to dispatch commands to this device. Unlike a traditional HTTP Request-Response model, the communication could be initiated by either of the parties. Polling is an inefficient way of doing this over HTTP.

Intermittent Connectivity

Wide area wireless networks (carrier networks) often experience flaky connectivity. So the underlying protocol needs resilience against connectivity problems. After an abrupt disconnection, when a device comes back online later, any pending messages for this device should be automatically delivered to it.

For example: IoT devices can fade in-and-out of network connectivity as their geo-location changes, or they switch to a ‘hibernation mode’ to conserve power.

Low Compute Footprint

Embedded devices often have low compute capabilities – CPUs with lower energy consumption, lower clock rates, and low memory availability. So the protocol implementation should require a minimal compute and memory overhead on those devices.

One-to-Many Communication Model

Unlike conventional protocols such as HTTP, which allow only 1-to-1 communication, an IoT protocol should allow a one-to-N communication model.

For example: Messages from a temperature sensor could be routed to a telemetry logging service, a temperature control logic, and a monitoring dashboard – all at the same time!

Asynchronous Model

Any party can dispatch messages at will, sometimes without even knowing if the recipient is online at that point in time. As devices fade in and out of network connectivity, or hibernate from time to time, it would be inefficient for senders to keep polling recipients over an HTTP request-response model. Hence, the underlying messaging protocol is expected to be asynchronous in nature.

For example: Most IoT devices sense real world events and trigger messages based on the occurrence of those events. Sensors can dispatch this information without really knowing if the intended recipients are online / offline at that point in time.

For example: Applications need to instruct devices by dispatching commands to them. Applications could dispatch commands without knowing if the recipient device is online or offline at that point.

Decoupling of Participants

The protocol needs an appropriate ‘routing mechanism’ whereby a message can be routed to a set of interested recipients, without the sender necessarily knowing who the exact recipients are. The set of recipients for a certain type of message could change as the ecosystem evolves.

For example: A logger device may be interested in temperature information today. Tomorrow, a new dashboard may also be interested to receive this information. The temperature emitter (sensor) may be oblivious to who the recipients actually are.

Routing Complexity

The routing complexity and responsibility should be encapsulated within the IoT Cloud (and not within the devices themselves). The device should be offered a simpler model to connect, dispatch, and to receive messages asynchronously.

Security

This is a paramount concern in IoT today. The protocol should support a strong security model to protect data-in-motion using a robust PKI. It needs to offer capabilities such as secure endpoint identity (client certificates), real-time authorization controls, and real-time policy enforcement.

Adoption of Standards

Most embedded OSes and all cloud OSes today support the TCP/IP stack. So a suitable IoT protocol is expected to run on top of the standard TCP/IP stack to ensure the highest compatibility across multiple hardware and OS platforms.

Wrapping Up…

M2M communication brings its unique set of challenges – intermittent connectivity, asynchronous communication needs, and a need to decouple participants. Having an intermediate messaging queue (broker) is a great way to address many of these requirements. IoT protocols such as MQTT therefore adopt a pub-sub, queue-based architecture and are better suited for IoT applications today.

A Node.js Infrastructure for Scalability, Fault Resilience, and Zero Downtime

Over the past few years, consumer applications and enterprise solutions have been rapidly adopting the Node.js stack. In this paper we explore three key areas relevant to Node.js infrastructure – scalability, resilience, and mature DevOps.

Challenges with Node

Underutilizing Multi-Core CPUs

Node.js inherently operates in a single-process execution model. However, most modern production-grade hardware has multiple CPU cores.

Since Node.js runs your JavaScript on a single thread, executing it on modern hardware results in heavy utilization of one CPU core, while leaving the other CPU cores underutilized.

No Isolation Between The Server Engine and Your Biz Logic

Mature servers such as Apache Tomcat or Apache HTTPd bring some degree of process-isolation (or thread-isolation) between the server core and the developer’s business code. Faults in business code do not crash the core server engine itself.

Node.js does not inherently have such an isolation. Node developers typically initialize a web server and run business code within the same process. If the business code throws an uncaught exception, it crashes the server itself!

The Risks of A Weakly Typed Language

The weakly typed nature of JavaScript makes it easy for developers to let defects slip through – there is no compilation phase and no type checking.

This means JavaScript code is a lot more prone to bad references and null pointer exceptions that get discovered only at runtime. It is hard to anticipate and catch every possible exception in advance. So weakly-typed languages bear greater risks in the runtime environment.

Try-Catch Doesn’t Suffice

A try-catch block will not catch exceptions that occur in an async callback function defined within it. Such is the nature of async callbacks!

This means try-catch blocks are of only limited use for error handling and defensive programming in Node.js.

Slow Memory Leaks

The Node Package Manager has become the de-facto approach to import libraries (modules) into your Node application today. Developers are ‘trigger happy’ about pulling many public npm modules into their application code.

However, many npm modules in the wild have rarely been curated or intensively tested. Some of them misbehave, throw unexpected exceptions, or slowly leak memory at runtime. Moreover, such leaks may be difficult to catch in your profiling tests, as the leak could be slow, adding up only over time.

Utilizing Multi-Core CPUs

An Approach

To effectively utilize multiple CPU cores and to achieve a higher application throughput, one can think of the following approach:

  • Spawn multiple worker processes to execute our Node.js code.
  • The kernel scheduler will allocate these worker processes across the available CPU cores on the system. Since Linux kernels often prefer CPU affinity, each worker process is likely to get allocated to a specific core for its entire lifetime.
  • Distribute inbound HTTP requests evenly across these worker processes (we will see how this really happens later).
  • As inbound requests arrive, each worker process services the requests allocated to it and thus starts utilizing its own CPU core.

Comparisons With Apache MPM Pre-fork

At first glance, this approach seems very similar to Apache’s MPM Pre-fork module, which spawns multiple child processes at startup and delegates incoming requests to them. However, there is one key difference!

Figure 1: Process-per-CPU model of traditional servers.

The Apache-Way of Doing Things

Apache’s I/O is blocking in nature. This means a child process will receive a request and will often block waiting for I/O to complete (say, a file read from the disk). To increase concurrency in this case, we spawn a pool of additional child processes. While some processes are blocked, other processes can continue serving new inbound requests from the clients.

We can keep increasing our concurrency by increasing the number of child processes – but only to a certain limit!

Since each process has its own memory footprint, we soon reach a limit on the number of child processes we can spawn without causing excessive thrashing of the virtual memory on the system. At some point, there will be too many child processes vying for attention from the OS scheduler, and the cost of swapping process images in and out of the disk becomes prohibitive.

The Node-Way of Doing Things

A Node process, on the other hand, does not block for I/O at all. This means the process can service more and more inbound requests until its own CPU core is nearly saturated.

Since each Node process has the ability to saturate its own CPU core, the number of Node processes required to achieve high concurrency equals the number of CPUs on the machine. Fewer processes mean an overall lower memory footprint.

The Node.js cluster module adopts this approach. Let us explore the workings of the cluster module further in the next section.


Figure 2: Node’s non-blocking IO. One process per CPU core.

The Node.js Cluster Module

Spawning Workers

The primary Node.js process is called the master. Using the cluster module, the master can spawn additional worker processes and tell them which Node.js code to execute. This works much like the Unix fork() where a master process spawns child processes.

Figure 3: How the cluster module works in Node.js.

IPC Channel Between Master and Workers

Whenever a new worker is spawned, the cluster module sets up an IPC (Inter-Process Communication) channel between the master and that worker process. Through this IPC mechanism, the master and worker can exchange brief messages and socket descriptors with each other.

Listening to Inbound Connections

Once spawned, the worker process is ready to service inbound connections and invokes a listen() call on a certain HTTP port. Node.js internally rewires this call as follows:

  • The worker sends a message to the master (via the IPC channel), asking the master to listen on the specified port.
  • The master starts to listen on that port (if it is not already listening).
  • The master is now aware that a specific worker has indicated interest in servicing inbound requests arriving on that port.

While it may seem that the worker is invoking the listen(), the actual job of listening for inbound requests is done by the master itself.

Load Balancing Between Worker Processes

When an inbound request arrives, the master accepts the inbound socket and uses a round-robin mechanism to decide which worker this request should be delegated to.

The master then hands over the socket descriptor of this request to that worker over the IPC channel. This round-robin mechanism is part of the Node core and helps accomplish load balancing of the inbound traffic between multiple workers.
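
Putting these pieces together, here is a minimal cluster sketch (the port number is arbitrary):

    // Master: fork one worker per CPU core, nothing else.
    // Workers: run the same HTTP server; listen() is rewired so that the
    // master accepts sockets and round-robins them to workers.
    const cluster = require('cluster');
    const http = require('http');
    const os = require('os');

    if (cluster.isMaster) {
      os.cpus().forEach(() => cluster.fork());
    } else {
      http.createServer((req, res) => {
        res.end(`handled by worker ${process.pid}\n`);
      }).listen(8000);
    }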

Recommended Practices

Minimize Responsibilities of The Master

Let your master process do a minimal amount of work and be responsible only for:

  • Spawning worker processes at the start of your server.
  • Managing the lifecycle of your worker processes.
  • Delegating all inbound requests to workers.
  • Nothing else!

In particular, do not encapsulate your business logic in the master process. And do not load any unwanted npm modules in the master.

Most runtime errors are likely to occur due to buggy business code or npm modules used by your code. By encapsulating business code only in the workers, such errors will impact (or crash) a worker process, but not your master process.

This gives you a stable, unhindered master process which can ‘baby-sit’ worker processes at all times. As we shall see later, the master is responsible for managing workers and ensuring that enough healthy workers are available to service your inbound traffic.

If the master itself crashes or misbehaves (because it ran buggy business code), there is no caretaker left for your workers anymore.

Replenishing Worker Processes

It is possible that worker processes die over time. This can happen for various reasons: running out of memory, receiving a Unix signal that forcefully kills the process, or a programming bug causing an abrupt crash in a certain path of execution.

When a worker dies, Node.js notifies your master process with an event. At that point, in the event handler, your master process should spawn a new worker process. This ensures that we have enough workers in our pool to service inbound traffic.
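
In the master, this replenishment is a small event handler; a sketch (continuing the master code from the earlier cluster example):

    const cluster = require('cluster');

    cluster.on('exit', (worker, code, signal) => {
      console.error(`worker ${worker.process.pid} died (${signal || code})`);
      cluster.fork(); // replenish the pool
    });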

Gracefully Killing A Worker Process

As we shall see in later sections, there are several scenarios where you would like to gracefully shut down (kill) a worker process in your cluster. Let us understand how we can accomplish a graceful worker shutdown:

  • Suppose the master decides to elegantly kill a specific worker from the present pool of workers in the cluster.
  • The master sends a signal or a message to that worker (via the IPC channel) asking the worker to gracefully kill itself.
  • At this point, that worker disconnects from the IPC channel, so it stops accepting new inbound HTTP requests from the master.
  • The worker attempts to gracefully finish any in-flight requests that it has already accepted in the past. (So that we don’t drop in-flight requests and we don’t send errors to our clients).
  • After giving itself time to gracefully complete in-flight requests, the worker attempts to close any resources it had acquired (DB connections, cache connections, sockets, file handles etc).
  • Then, the worker kills itself.
  • If the worker does not manage to kill itself elegantly within a certain window of time, say 10 seconds, the master decides to forcefully kill the worker by sending it a signal.
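
A hedged sketch of this handshake (the timings and message names are illustrative):

    const cluster = require('cluster');
    const http = require('http');

    const GRACE_MS = 10 * 1000; // forceful-kill window

    if (cluster.isMaster) {
      const worker = cluster.fork();

      // Later, when the master decides to retire this worker:
      setTimeout(() => {
        worker.send('shutdown'); // ask nicely over the IPC channel
        const timer = setTimeout(() => worker.kill('SIGKILL'), GRACE_MS);
        worker.on('exit', () => clearTimeout(timer));
      }, 60 * 1000);
    } else {
      const server = http.createServer((req, res) => res.end('ok\n')).listen(8000);

      process.on('message', (msg) => {
        if (msg !== 'shutdown') return;
        // Stop accepting new requests, drain in-flight ones, then exit.
        // Close DB connections, caches, and file handles here as well.
        server.close(() => process.exit(0));
      });
    }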

Dealing With Uncaught Exceptions

Even with meticulous programming and defensive tactics, unhandled exceptions are likely to occur at runtime, and your Node server has to deal with them. But how?

When an unhandled exception occurs, Node.js offers your worker process a way to ‘catch’ it. However, Node creator Ryan Dahl mentions that the event loop is likely to be in an indeterminate state at that point in time.

So it is best to kill that worker process as soon as you can, and spawn a new worker as a replacement. Here is what you should do:

  • Return an HTTP 500 for the request that resulted in the unhandled exception.
  • Perform the steps for a graceful worker shutdown (as we’ve seen before) and then let the worker process kill itself.
  • When the master notices that a worker has died, it spawns a new worker at that point.

Consider the worker to be ‘unhealthy’ any time it catches an unhandled exception. The above steps are an elegant way to deal with the situation.
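
A sketch of the worker-side handler ('server' refers to this worker’s HTTP server; returning the 500 to the specific failing request is easier with domains, covered in a later section):

    process.on('uncaughtException', (err) => {
      console.error('uncaught exception, retiring this worker:', err);

      // Stop accepting new work; drain in-flight requests, then die.
      server.close(() => process.exit(1));

      // Safety net in case draining takes too long.
      setTimeout(() => process.exit(1), 10 * 1000).unref();
    });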

Periodic Roll-Over of the Cluster

We’ve seen earlier how some npm modules could potentially result in slow memory leaks. Sometimes your own code could leak memory as well.

To keep the cluster healthy, it is recommended that your master periodically kill all worker processes and spawn new ones. But this needs to be done elegantly – If you kill all workers at once, there would be nobody left to do the work and the server’s throughput will drop momentarily.

Figure 4: Rolling worker processes.

We adopt a process called slow rolling of workers, as follows:

  • At regular intervals, say every 12 hours, the master initiates the rolling process.

  • The master chooses one worker from the present pool of workers in the cluster and decides to gracefully kill that worker process. (We’ve already seen the steps to gracefully kill a worker in an earlier section.)
  • At the same time, the master spawns a new worker process to replenish capacity.
  • Once the rollover of this worker is completed, the master picks the next worker from the initial pool to gracefully roll that one, and the process continues until all workers are ‘rolled over’.

Rolling workers is a great way to keep your node cluster healthy over extended periods of time.

Preventing A ‘Self Inflicted’ Denial of Service Attack

So far we’ve looked at how our master can baby-sit worker processes and spawn new ones if a worker dies. But here is an interesting scenario to consider:

  • An inbound HTTP request results in the execution of specific (buggy) code that throws an uncaught exception.
  • The worker process kills itself, and then the master spawns a new worker right away!
  • Subsequent HTTP requests again result in uncaught exceptions in the new worker. The new worker decides to kill itself, and this self-destructive cycle continues over and over again.

Continuously spawning a new process over and over again (in an indefinite loop) causes your OS to thrash. Anything else running on that machine will be impacted too. This represents a ‘self inflicted’ denial of service attack.

The ultimate resolution to this problem is to fix that buggy code and redeploy it. But until then, you need to add some safeguards to prevent such a runaway condition from occurring on your machine:

  • When the master spawns a new worker, it watches if the new worker process survives for a certain number of seconds (threshold).
  • If the worker dies within that threshold of time, the master infers that something is seriously wrong within the mainline code itself (and not just within some corner case).
  • At this point, the master should dispatch a panic message into your logs, or invoke an HTTP call to an alerting service – that gets your rapid response team into action!
  • Also, the master should throttle the rate of spawning new processes at this point, so as not to thrash the OS or impact other things running on that machine.

It is likely that this machine will soon be devoid of any useful workers to serve requests. But at least you have: (A) notified your rapid response team to swing into action, and (B) prevented a runaway condition in your OS.
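
A sketch of such a safeguard in the master (the thresholds are illustrative):

    const cluster = require('cluster');

    const MIN_UPTIME_MS = 10 * 1000;    // dying sooner than this is suspicious
    const RESPAWN_DELAY_MS = 30 * 1000; // throttled respawn interval
    const birth = new Map();            // worker.id -> fork timestamp

    function fork() {
      birth.set(cluster.fork().id, Date.now());
    }

    cluster.on('exit', (worker) => {
      const lifetime = Date.now() - birth.get(worker.id);
      birth.delete(worker.id);

      if (lifetime < MIN_UPTIME_MS) {
        console.error('PANIC: worker died too young; throttling respawn');
        // ...alert your rapid response team here (log hook, HTTP call, etc.)
        setTimeout(fork, RESPAWN_DELAY_MS); // do not thrash the OS
      } else {
        fork(); // normal replenishment
      }
    });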

In a subsequent section, we talk about how your master process can take this further and safeguard your server against deploying such buggy (self-destructing) code.

Zero Downtime Restarts

The utopia for mature DevOps is to have a zero downtime restart capability on the production servers. This means the following:

  • The development team can push new code snapshots to a live server without shutting down the server itself (even for a moment).
  • All in-flight and ongoing requests continue to be processed normally without clients noticing any errors.
  • The new code cuts-over seamlessly and soon new client requests get served by the newly deployed code.

With Node.js, this is not a utopia anymore. We have already looked at most of the ingredients which can make zero downtime restarts a reality:

  • Suppose you have a running Node.js cluster that is serving Version 1 of your code. All the modules from your source code are already loaded and cached by Node’s module cache.
  • Now you place Version 2 of your code on the file system. And you send a signal to the master to initiate a graceful rollover of the entire cluster (We’ve seen those details earlier).
  • At this point, the master will gracefully kill one worker at a time and spawn a new worker as a replenishment. Each new worker will read Version 2 of your code and start serving requests using the Version 2 code.
  • As an additional safeguard, if the new worker dies within a short threshold of time, the master infers that the new code may have a buggy mainline and hence it does not proceed with the graceful restart for other workers.

There will be a brief period during which some of your workers are serving Version 1 of the code, and some others are already serving Version 2. This may or may not be okay, depending on the circumstances.

Wrapping Requests in Domains

We explored earlier how a simple try-catch block does not suffice to catch exceptions that occur within asynchronous callbacks.

Node.js has now introduced the concept of domains to elegantly handle asynchronous errors. Your implementation hence needs to do the following:

  • Wrap every inbound request in a domain. Wrap all event emitters from that request in that same domain.
  • Write an error handler on that domain which can catch runtime errors that occur within that domain.
  • When you encounter an error in this domain, gracefully kill the present worker process and re-spawn a new one.

This approach is similar to the handling of uncaught exceptions described earlier. The key difference is that we are wrapping individual requests within the scope of a domain. This helps isolate faults within a request (context) and deal with that specific request more elegantly.
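
A sketch of this wiring, based on the pattern from the Node.js documentation (handleRequest() is a stub for your business logic):

    const domain = require('domain');
    const http = require('http');

    const server = http.createServer((req, res) => {
      const d = domain.create();
      d.add(req);
      d.add(res); // event emitters from this request join the domain

      d.on('error', (err) => {
        console.error('request failed:', err);
        try {
          res.statusCode = 500;
          res.end('internal error');
        } catch (e) { /* the response may already be unusable */ }
        // Then gracefully retire this worker and let the master respawn it.
      });

      d.run(() => handleRequest(req, res)); // business logic runs in the domain
    });

    function handleRequest(req, res) { res.end('ok\n'); } // stub
    server.listen(8000);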

Delegating Work to a Front-Proxy

Terminating HTTPS

Node.js does have the capability to accept inbound HTTPS traffic, but this task is best done by a front-proxy such as Nginx, which can run on the same machine as your Node server. Configure Nginx as a reverse proxy and let it terminate inbound SSL connections. Alternatively, a front load balancer could also do this for you.

Compressing HTTP Streams

There are npm modules to achieve gzip compression of HTTP streams, but you would rather delegate this job to the front-proxy such as Nginx. This way, your Node.js server can focus on serving your core business logic.

Traffic Throttling

We’ve spoken earlier about how a single Node.js process can, in theory, saturate a CPU core by accepting more and more inbound requests. In reality, you would not like to reach the peak of your machine capacity in production. The front-proxy can play an important role in traffic throttling and making sure your Node.js machine does not fully saturate.

Other Recommended Practices

Running as a Non Privileged User

This may be an obvious consideration when building any server runtime: for reasons of security, you do not want your Node.js processes to run with root privileges! Make sure you create a non-privileged user and run the node processes under that user on your system. This has always been a standard guideline when creating any daemon on Linux.

Keeping IPC Messages Lean

The IPC channel between the master and child processes is only intended to exchange short control messages. Do not abuse this channel to send large business payloads.


Accelerate Enterprise Mobility with Mobile Backend as a Service

Why MBaaS?

Enterprises that invest in a mobility roadmap are often faced with interesting challenges today:

(a) How do we accelerate the development of our mobility solutions?

(b) How do we achieve the desired level of visibility and management controls for our deployed mobile solutions?

(c) How do we optimize the costs of Infrastructure and Mobile Ops?

(d) How do we create a coherent architecture that spans all our mobile apps?

(e) How do we effectively leverage legacy technology investments into our new mobility roadmap?

Several mobile middleware platforms have evolved over the past few years to address these challenges and accelerate enterprise mobile implementations.

Such middleware platforms are termed MBaaS (Mobile Backend as a Service) or MEAP (Mobile Enterprise Application Platforms). They are typically multi-tenant, cloud-based PaaS offerings; some of them can even run on private cloud or hybrid cloud infrastructures within your enterprise.

In this post, we take a deeper look at the core capabilities of MBaaS platforms and provide detailed guidelines to choose the right MBaaS provider for your needs.

Legacy Enterprise Services

Most enterprise mobility solutions need to use legacy enterprise services in order to access existing business data and workflows.

Consider the example of a B2B Commerce Mobile App that allows customers to place orders. Such an App will need to: (a) Verify stock-availability against an Inventory System, (b) Fetch customer address information from a CRM System, (c) Place an order into an Order Fulfillment System, and finally, (d) Generate an invoice in the Accounts Receivables System.

Completing the entire business workflow via a mobile App would involve calling all these legacy systems (or services) in a specific sequence, often feeding the results of one service call into the next – a process called ‘Service Orchestration’.

Figure: MBaaS service orchestration

Service Orchestration

MBaaS platforms offer the ability to invoke and orchestrate multiple backend enterprise services. As a developer, you write code to perform this service orchestration and deploy it onto the MBaaS Cloud.

Some MBaaS platforms also offer declarative ways to define ‘how my services should be orchestrated’ – this reduces the extent of coding needed for service orchestration.

Since the intelligence to orchestrate complex services gets encapsulated within the MBaaS itself, you can now expose a simpler REST interface to all your mobile clients. The mobile client no longer directly interacts with complex enterprise systems – making the client App lightweight and simpler to build.

Many MBaaS platforms also offer capabilities to invoke services asynchronously (non-blocking I/O), thus improving data-delivery performance for your mobile clients.
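To make this concrete, here is a hedged sketch of an orchestration endpoint deployed on an MBaaS runtime; the service modules (inventory, crm, fulfillment) are hypothetical wrappers around your legacy systems, and Express is used purely for illustration:

```js
const express = require('express');
const app = express();
app.use(express.json());

// Hypothetical wrappers around the legacy enterprise services
const inventory = require('./services/inventory');
const crm = require('./services/crm');
const fulfillment = require('./services/fulfillment');

app.post('/api/orders', async (req, res) => {
  try {
    // Independent lookups run in parallel over non-blocking I/O
    const [stock, customer] = await Promise.all([
      inventory.checkStock(req.body.sku),
      crm.getCustomer(req.body.customerId),
    ]);

    if (!stock.available) {
      return res.status(409).json({ error: 'Out of stock' });
    }

    // The dependent step feeds results from the previous calls
    const order = await fulfillment.placeOrder({
      sku: req.body.sku,
      shipTo: customer.address,
    });
    res.status(201).json({ orderId: order.id });
  } catch (err) {
    res.status(502).json({ error: 'A backend service failed' });
  }
});

app.listen(8080);
```

The mobile client sees a single REST call; the fan-out to legacy systems stays encapsulated in the middleware.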

Normalize Heterogeneous Data

Enterprise systems have been built over the past few decades using legacy software and heterogeneous technology stacks. As a result, the backend data sources in an enterprise often exist in varied formats and require heterogeneous protocols and authentication schemes to access.

MBaaS platforms offer the ability to massage and normalize heterogeneous data sources into a single homogeneous data format (typically JSON). By abstracting heterogeneous sources into a common data format, your mobile Apps no longer deal with multiple legacy data formats, authentication schemes, or access protocols. This makes the architecture of your mobile apps simpler and more coherent.
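A sketch of such a normalization step, with two hypothetical source systems, might look like this:

```js
// Normalize customer records from two hypothetical backends into one JSON shape
function normalizeCustomer(source, record) {
  if (source === 'crm-soap') {
    // Fields as they appear in a parsed SOAP/XML response
    return { id: record.CustomerID, name: record.FullName, email: record.EMailAddr };
  }
  if (source === 'billing-csv') {
    // Columns from a fixed-layout CSV export
    return { id: record[0], name: record[1], email: record[2] };
  }
  throw new Error('Unknown source: ' + source);
}
```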

Figure: Normalizing heterogeneous protocols and data formats via MBaaS

Managing Service Granularity

The granularity of legacy services may not be the right fit for direct consumption by mobile clients today. The service payloads could be too large (coarse-grained services) or too sparse (fine-grained services).

Large payloads would mean: (a) Frequent drops and timeouts on your mobile carrier network, (b) High response latencies from the backend, (c) Unwanted or unnecessary data reaching the mobile client, (d) Excessive CPU and memory overhead in mobile clients.

Small payloads could mean too many HTTP round-trips from the mobile client to fetch the required data or to complete the required business transaction (and hence a slower App).

By encapsulating the underlying business services, MBaaS offers the ability to manage the ‘granularity’ of services exposed to your mobile clients. MBaaS can consolidate responses from multiple fine-grained services, or filter data from a coarse-grained service to expose just the ‘right sized’ service interface for mobile consumption.
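Continuing the hypothetical service wrappers from the orchestration sketch above, a ‘right sized’ mobile endpoint could consolidate and trim backend responses like so:

```js
// 'crm' and 'fulfillment' are the same hypothetical service wrappers as above
app.get('/api/mobile/account/:id', async (req, res) => {
  // Consolidate two fine-grained services into one response
  const [profile, orders] = await Promise.all([
    crm.getCustomer(req.params.id),
    fulfillment.recentOrders(req.params.id),
  ]);

  res.json({
    name: profile.name,
    // Trim coarse order records down to the fields the screen actually renders
    orders: orders.map((o) => ({ id: o.id, status: o.status, total: o.total })),
  });
});
```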

Backend Evolution

Enterprise systems and services constantly evolve over time: Service interfaces get redefined to meet the evolving needs of your business. Old services get deprecated or retired; New information systems get deployed to replace legacy ones.

MBaaS acts as a loose coupling between the enterprise backend and your mobile Apps. If enterprise systems evolve or service interfaces change, the orchestration rules can be modified within the MBaaS itself, without having to re-publish a new App to all your users every time.

Mobile API Versioning

Mobile Apps themselves evolve over time, with new features and capabilities every few months. Often, multiple versions of an App exist across your users’ devices (since not all users upgrade at once), and each App version is tied to specific REST APIs.

MBaaS platforms offer versioning capabilities for REST APIs that are exposed to your mobile clients. This enables multiple App versions to thrive in production at the same time, and the latest App versions can be incrementally rolled out to your users.
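As an illustration (using Express purely as an example runtime), two API versions can be served side by side:

```js
const express = require('express');
const app = express();

// Older Apps keep calling /v1 while newer builds move to /v2
const v1 = express.Router();
v1.get('/profile', (req, res) =>
  res.json({ name: 'Jane Doe' })); // flat shape the old App expects

const v2 = express.Router();
v2.get('/profile', (req, res) =>
  res.json({ name: { first: 'Jane', last: 'Doe' } })); // richer shape for the new App

app.use('/v1', v1);
app.use('/v2', v2);
app.listen(8080);
```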

Securing Enterprise Boundaries

MBaaS acts as an added layer of security in front of legacy enterprise services so that internal services do not have to be directly exposed to the public Internet.

Moreover, MBaaS platforms offer additional security features such as: (a) A mobile-specific authentication layer, (b) SSO capabilities for Apps, and (c) Session filters for all real-time mobile traffic.

Audit Trail and Compliances

MBaaS can be leveraged to capture a trail of all “chatter” between mobile clients and the enterprise backend.

An enterprise can keep track of which user accessed what enterprise data, at what time, via which mobile App. This may be necessary for compliance and policy requirements in your enterprise.

Runtime for Mobile Workflows

The ‘mobile first’ way of doing things often involves disruptive workflows that are not always a mirror of legacy business flows. For example, if a user has added items to her shopping cart but has not ‘checked out’ for the past few days, we may want to send a push notification to that user.

This requires additional business logic to be implemented in the backend which may not exist in your legacy system. MBaaS platforms provide a runtime environment for such additional mobile-specific business logic and triggers.

Data Synchronization

Offline access is a common requirement for mobile Apps today. This requires the intelligence to facilitate a ‘two way’ data sync between mobile clients and backend data sources.

Many MBaaS platforms offer APIs to facilitate such a two-way data sync. This includes: (a) Prefetching specific data objects to your mobile client, (b) Identifying stale objects on the client and automatically refreshing those from the backend, (c) Identifying dirty or modified objects on the client, (d) Performing a two-way data interchange and merge of the client’s data with the backend, (e) Elegantly handling merge conflicts in the data objects. In most platforms, this sync can be performed either automatically or on-demand by the user.
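The exact APIs vary by platform; the following is a deliberately simplified sketch of the push and pull phases, where localStore and syncApi are hypothetical client-side helpers:

```js
// 'localStore' and 'syncApi' are hypothetical client-side helpers
async function twoWaySync() {
  // Push phase: upload dirty (locally modified) objects
  const dirty = await localStore.getDirtyObjects();
  for (const obj of dirty) {
    const result = await syncApi.upload(obj);
    if (result.conflict) {
      // Simplistic server-wins merge; real platforms offer richer strategies
      await localStore.save(result.serverVersion);
    }
    await localStore.markClean(obj.id);
  }

  // Pull phase: fetch whatever changed on the backend since our last sync
  const changes = await syncApi.changesSince(await localStore.lastSyncToken());
  await localStore.applyChanges(changes);
}
```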

Figure: MBaaS reference architecture

Last Mile Caching

MBaaS platforms also offer a last-mile caching layer for your mobile Apps. This is typically a cluster of in-memory cache nodes (products such as Redis are commonly used by MBaaS providers for this purpose).

Slow-moving data or master data can be cached in the MBaaS cache to avoid deeper backend calls to your enterprise services each time. Data that is common across multiple logged-in mobile users can also be cached here.
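As an illustrative sketch, a cache-aside lookup against Redis (using the node-redis v4 client as an example) might look like this; fetchFromBackend is a hypothetical deep call into your enterprise stack:

```js
const { createClient } = require('redis'); // node-redis v4 client, as an example

const cache = createClient();
cache.connect(); // v4 clients connect explicitly

// Cache-aside lookup for slow-moving master data
async function getMasterData(key) {
  const hit = await cache.get(key);
  if (hit) return JSON.parse(hit); // served from the last-mile cache

  const fresh = await fetchFromBackend(key); // hypothetical deep backend call
  await cache.setEx(key, 3600, JSON.stringify(fresh)); // keep for one hour
  return fresh;
}
```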

Mobile CDN and File Storage

Some MBaaS platforms offer a Content Delivery Network (CDN) for binary content required by your mobile Apps – Images, Videos, Static Resources, Documents.

This helps scale out your mobile deployment without straining the enterprise backend infrastructure. It can also act as a scalable cloud-based file store for content uploaded by mobile users.

Performance and Scale

By using last-mile caching, offline and sync capabilities, and mobile CDNs, the API calls from mobile clients avoid going deeper into the enterprise stack.

This reduces response latencies when clients fetch data, resulting in better mobile App performance. It also reduces the strain on your legacy enterprise infrastructure and helps scale out your mobile Apps to millions of users by leveraging the scalability of the MBaaS cloud platform.

Mobile Analytics

In-app analytics is a powerful way to understand user behavior and to tune your App’s user experience. Some MBaaS platforms offer mobile analytics capabilities, including a client SDK, an analytics engine, and a dashboard. They also offer visibility into the number and velocity of REST API calls being made by your mobile clients.

Enterprise Software Connectors

Many MBaaS platforms offer baked-in connectors to specific enterprise software such as SAP, Oracle, Microsoft CRM, SFDC etc. Instead of using generic SOAP or REST interfaces for backend integration, such product-specific connectors help accelerate the development of mobility solutions in your enterprise and help leverage the features of your legacy software better.

User Engagement Features

Push notifications are an important way to drive mobile user engagement. Some MBaaS platforms offer APIs to trigger push notifications (these are typically wrappers on top of APNS or GCM). This eliminates the need to have separate integrations with APNS or GCM or to leverage other third party providers for this purpose.

Cloud Object Store

Mobile Apps often require structured data storage for App-specific data: information such as mobile user profiles, mobile-specific user preferences, persistent mobile user sessions, user stats, etc.

Such a store may not exist in your legacy enterprise infrastructure, so some MBaaS platforms provide a cloud-based store for JSON objects along with a client API to access this object store. This is typically a scalable NoSQL database platform that is managed by your MBaaS provider.

Social Connectors

Mobile Apps targeted towards your customers, employees, or partners often have social media integration as a critical requirement.

MBaaS platforms offer APIs to easily integrate with various social media platforms (such as Facebook, Twitter, and LinkedIn). This lets your mobile users perform a Single Sign-On (SSO) into your App using their social avatars, or share content from your mobile App to social platforms.

Improved DevOps Cadence

Most MBaaS platforms integrate directly with your source code repository and offer the ability to push the latest code from there to multiple MBaaS runtime environments with a single click (say, to the Dev, Stage, QA, and Production environments of the MBaaS runtime). This eliminates downtime when upgrading your middleware code.

Most platforms also provide a self-service management portal (dashboard) to monitor the middleware, including information about the provisioned capacity, the utilized capacity, and the overall health of the runtime. This eases the load on your DevOps/SysOps teams and brings a mature operational cadence.

Build Farm

Some MBaaS platforms offer build farms which can create packaged builds for your native or hybrid Apps (including iOS, Android, and Windows Mobile). They can manage your AppStore signing keys and publish the signed builds directly to the AppStore or Marketplace.

This helps streamline your build process, and you no longer have to rely on individual developer machines to perform production builds for your enterprise Apps.

Wrapping Up

Choosing an MBaaS platform judiciously is critical to your enterprise mobility strategy. Given the significant acceleration that an MBaaS can bring to your implementations, make sure you evaluate the available options carefully with respect to the capabilities outlined in this post.

Building Internet-Scale Web Platforms with the Amazon Elastic Load Balancer

Introduction

The Elastic Load Balancer distributes your Application’s inbound traffic to multiple Web Servers running on EC2 instances. This offers the following key benefits to your architecture:

Increased Throughput: This increases the capacity of your Web infrastructure to handle additional traffic (i.e., horizontally scaling out).

Avoiding Single Points of Failure: An individual Web Server is no longer a single point of failure, since traffic is distributed across multiple server instances. This makes your application much more resilient.


Figure-1: ELB distributes inbound traffic across multiple EC2 instances.

Maintaining Healthier Servers: The risk of overloading or overwhelming a single Web server is now minimized due to distribution of traffic. This increases the chances of your individual Web servers staying healthier over much longer periods of time.

An architect needs to consider several critical aspects of a deployment such as:

  • How do I truly achieve ‘internet-scale’? What does my ‘scaled out’ architecture look like?
  • How does the ELB schedule incoming traffic?
  • What if my load balancer itself becomes a single point of failure?
  • How does my design guarantee fault-tolerance, resiliency, and high-availability?
  • What security features does the load balancer offer for my inflight traffic?

In this post, we answer these questions and also help you understand why the ELB is more effective than a home-brewed load balancing solution using Nginx or Apache.
