Written by:
Nathan Alderson - @nathanalderson
Jonathan Hood - @jonathan_hood
Published: 31 October 2016

Note: "Introducing Firefly Part I - Two Scale Problems" is Part I of this blog series.

No hurts. With respect. Be responsible. Have fun.

These represent the full extent of the family rules in the Alderson household. With two adults, two teenagers, and two youngsters living in one house, conflict is inevitable. However, rather than attempting to enumerate the infinite list of behaviors we do and don't want from our children (and ourselves!), we instead choose to focus on a few key ideas and let the rest flow from there. Brushing your teeth and doing your homework? That falls under being responsible. Slamming doors? Let's try that again with respect. Pillow fight? Have fun (but no hurts)! The point is, these four rules aren't really rules at all–they're more like guiding principles.

In a previous post, I introduced Firefly as our microservices cloud platform. I described how we faced two separate scale problems, and how those drove us to certain architectural decisions. I also described some technology choices we had made. Underlying all of these decisions, however, is a set of principles and thought processes, along with a whole lot of study and prototyping. When we began the Firefly project, Jonathan and I felt that it was important to capture these thoughts explicitly. Since then, we have continued to review them periodically, and we have indeed found them valuable in steering our ongoing design processes.

We don't claim that the concerns listed here are exhaustive, or that they won't change over time. We also realize that engineering is often the process of balancing competing concerns. By making the following assertions, we are stating that we will strive to emphasize and prioritize solutions which align with these principles.

First, we have identified some fundamental principles that have driven our architectural decisions:

  • Scale. Firefly needs to scale well to handle networks on the order of millions of devices. This implies scale at several levels, including device communication, data processing, and data storage.
  • High Availability. When applications built atop Firefly become a one-stop-shop for managing, monitoring, and diagnosing entire carrier networks, downtime due to system failure or planned maintenance becomes unacceptable.
  • Platform Architecture. We should support the development of management applications by providing a platform which presents useful APIs and hides things like scale issues and network protocol details from the application developer.
  • Reuse. The challenges we face in designing and building this system are not new. Highly scalable, resilient platforms have been built many times, particularly in the web space. Network Management Systems of varying quality are abundant. We should learn from these systems and strive to incorporate off-the-shelf and standards-based solutions whenever possible.
  • Security. Firefly will be responsible for managing sensitive data and critical network infrastructure. As such, the security and integrity of the system must be a priority from the beginning.

These principles lead to some basic design decisions:

  • Horizontal Scaling. To reach the level of scale and resiliency that we need, we will have to scale horizontally across a network of commodity servers. This could include both running different parts of the system on different physical or virtual devices, as well as potentially scaling parts of the system dynamically based on load (elastic scaling).
  • Asynchronous Design. Again, to achieve the level of scale required, we will need to be highly concurrent in our processing. The prevailing practice in modern web-scale and big-data applications is to achieve this concurrency with an asynchronous, event-driven architecture rather than a massively multithreaded approach.
  • Modularity. A highly modular and decoupled design achieves many valuable goals. It improves testability and maintainability, isolates failure, enables distributed development, and will ultimately allow our system to grow and adapt to meet ever-changing market needs.
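
To make the asynchronous design point concrete, here is a minimal sketch in Python's asyncio. It is not Firefly code: `poll_device`, `poll_all`, and the `limit` parameter are hypothetical names invented for illustration, and the network request is simulated with a short sleep. The idea it shows is that one event loop can keep many device requests in flight at once without a thread per device.

```python
import asyncio


async def poll_device(device_id: int) -> tuple[int, str]:
    # Hypothetical stand-in for a non-blocking request to one device.
    # While this coroutine waits, the event loop runs other coroutines.
    await asyncio.sleep(0.01)
    return device_id, "ok"


async def poll_all(device_ids: list[int], limit: int = 100) -> dict[int, str]:
    # The semaphore caps how many requests are in flight at once, so a
    # large device list cannot overwhelm the network or the devices.
    in_flight = asyncio.Semaphore(limit)

    async def bounded(device_id: int) -> tuple[int, str]:
        async with in_flight:
            return await poll_device(device_id)

    # gather() schedules every coroutine on the same event loop and
    # returns results in the order the inputs were given.
    results = await asyncio.gather(*(bounded(d) for d in device_ids))
    return dict(results)


if __name__ == "__main__":
    statuses = asyncio.run(poll_all(list(range(1000))))
    print(len(statuses), "devices polled")
```

All thousand simulated requests share a single thread here. A thread-per-request design would need a thousand stacks and the scheduling overhead that comes with them, which is the trade-off the asynchronous design decision is meant to avoid.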

These same principles also imply some things about the interfaces we present:

  • Message-Based. Components should interact with the platform and with each other through a non-blocking, message-based API. This facilitates the horizontal scaling and asynchronous design goals mentioned above. Additionally, it encourages better decoupling among parts of the system than alternatives like RPC.
  • Resource-Oriented. To achieve a uniform style across the platform, components should be resource-oriented in nature. Driving consistent messaging semantics throughout the platform will help make the platform feel like a cohesive whole.
  • Model-Driven. Wherever possible, the API should be based on models. This makes the APIs easier to generate, maintain, document, and understand. It also leads to better consistency throughout the entire stack from management application to network element.
  • Layered. Application developers will need to be able to interact with the platform at various logical levels. For example, in the ADTRAN Mosaic Cloud Platform some developers will just want to talk to individual network elements (although they shouldn’t be concerned with actual network protocols like SNMP–see the discussion on model-driven APIs above). Others will want to interact at a network services layer. Things outside of the server platform like the web GUI or the customer’s own OSS will need APIs like REST, XML/SOAP, and TL1.
  • Encrypted. To prevent eavesdropping and man-in-the-middle attacks, all communication into and out of the platform should be signed and encrypted. Even the data center network in some applications may not be trusted, so both inter- and intra-cluster communication should be encrypted by default. Sensitive data should be encrypted on disk.
  • Controlled. APIs and data stores should have adequate access controls so that only authorized users can gain access. All operations should be authenticated, authorized, and accounted for. Also, only authorized connections should be allowed to backing services (such as databases).

For a while–I kid you not–my family had to outlaw the word "sure." One of my boys seemed incapable of uttering that particular word without infusing it with potentially toxic levels of sass. "With respect," therefore, got clarified as a complete ban on that one colloquial adverb. Eventually, of course, he moved on and found new ways to sarcastically express his disapproval, so that rule got dropped and replaced with new clarifications of what speaking with respect does and does not entail.

Similarly, while our architecture has morphed and our technology choices have shifted as the Firefly platform has evolved, these guiding principles have remained remarkably unchanged. We continue to clarify them in light of new design patterns, new customer requirements, and new technologies. Eventually, some of these principles might even be modified. By clearly stating our vision for the system, however, we have been able to facilitate the rapid development of an industry-leading software platform that's enabling communities and connecting lives.