What do I do as a software architect?

"I am the Architect. I created the matrix. I’ve been waiting for you." -- The Architect

I present my view on what an architect does, or should do, and what I believe are the important things to keep in mind.

Topics: Introduction. Making decisions using trade-offs. Research and evaluate technology and tools. Create proof-of-concept of big picture. Systematically divide the goal into smaller solvable problems. Create knowledge base and training for others. Continuous monitoring and improvement of the system. Identify and address disruptive technologies. Conclusions.


Introduction

In early 2000, I contributed to the Columbia InterNet Extensible Multimedia Architecture, or CINEMA [pdf] for short. This early Voice-over-IP system architecture consisted of separate components for various roles: call routing, conferencing, voice/video mail, gatewaying, and interactive dialogs. It was deployed in our Computer Science department during the early days of SIP (Session Initiation Protocol) innovation at Columbia University. Not only did it contribute to several academic papers detailing many novel ideas, but it also became a generic VoIP architecture template for others to adopt.

Since then I have been involved in many other system and software architectures as a principal contributor at various organizations. My last two job titles, Lead Architect and Principal (WebRTC) Architect, also reflect the roles and responsibilities I take on. In this article I attempt to provide a background of what I do as an architect, offer recommendations for others, and summarize the high-level roles and responsibilities of an architect.

There are many other articles, blog posts, and even books about software architecture and what an architect does. If you have read some of those, you either get a very vague idea of what the job involves, or you get a clearer picture that the job involves each and every aspect of the software. There is some truth in that. Here, I do not endorse or refute the claims of other authors about what an architect does. I merely present my view of what an architect does, or should do, and what I believe are the important things to keep in mind.

There are many types of software architects - from a helper to a marketing- and sales-driven CTO, to a team lead who focuses more on people management than on technology, to a glorified software engineer promoted to the next appealing job title. This diversity makes it hard to define what an architect does. It is better to treat architect as a job responsibility rather than a job title. In an early stage startup, a CTO or co-founder often fills the role. In many agile oriented teams, a senior engineer or tech lead often serves the purpose. Regardless of the type of architect, there are some common things that the role involves.

In my opinion, at the high level, there are three things to do - (a) think ahead, (b) make decisions using trade-offs, and (c) communicate. 

Making decisions using trade-offs

Making decisions using trade-offs is the core of what an architect does. The other two - thinking ahead and communicating - merely support those decisions and trade-offs. Depending on the work situation, job responsibilities, prior product baggage, or individual personality traits, the proportions of these may be uneven. For example, one architect may spend more time on longer term strategies (think ahead), while another may deal with day-to-day fire fighting. One may decide to continue with an existing outdated framework, while another may incorporate every new fancy thing in the product (trade-offs). There is no single right answer - there are only experiences and results to share! What is perfectly right for one may be outrageously wrong for another.

In that case, it may appear that personal opinions play a big role. Not true! I will show how domain expertise, refined over years of experience and knowledge, along with the habit of analysis and experimentation, guides an architect. The difference between a personal opinion and an informed opinion is huge.

I have seen first hand the problems and frustrations caused when someone with limited domain experience, often a CTO, team lead, or product manager, makes architectural decisions without using the tools of analysis, experimentation, or a proper understanding of the trade-offs. Those are the situations where personal opinion plays a big role. Gut feeling about an architectural decision is important too. But that should come after a good amount of domain expertise, experience, and knowledge, or from past analysis, prior art, or experimentation.

At one time, when deciding to pick the right video codec for a conferencing product, my initial research suggested H.264 for performance and VP8 for browser compatibility. Browser compatibility was not an issue, since we targeted only one browser. So the initial opinion was to switch to H.264 immediately. After some quick experimentation, I confirmed the performance benefit, but at the same time discovered new critical compatibility issues with our media server when a network optimization was applied. The analysis of the trade-offs allowed me to create the system with both codecs included, dynamically selected and changed based on runtime, network, and system conditions. This resulted in a smooth rollout for all users, instead of leaving out some users with unfavorable devices or networks.
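As a rough illustration of that dynamic selection, here is a minimal JavaScript sketch of reordering WebRTC codec preferences at runtime; the condition names and the decision rule are hypothetical, not the exact logic used in that product.

```javascript
// A minimal sketch of dynamic codec selection in a WebRTC client.
// The runtime checks and thresholds are hypothetical illustrations.
function preferCodec(transceiver, preferredMimeType) {
  const capabilities = RTCRtpReceiver.getCapabilities('video');
  if (!capabilities || !transceiver.setCodecPreferences) return; // not supported
  const codecs = capabilities.codecs;
  // Move the preferred codec (e.g., 'video/H264' or 'video/VP8') to the front.
  const ordered = [
    ...codecs.filter(c => c.mimeType === preferredMimeType),
    ...codecs.filter(c => c.mimeType !== preferredMimeType),
  ];
  transceiver.setCodecPreferences(ordered);
}

function chooseCodec({ hardwareDecoderAvailable, mediaServerSupportsH264 }) {
  // Fall back to VP8 whenever the H.264 path is known to be problematic.
  return hardwareDecoderAvailable && mediaServerSupportsH264
    ? 'video/H264'
    : 'video/VP8';
}
```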

Although you can hire someone with a good amount of experience and domain knowledge, the habit of analysis and experimentation is equally important for sustained success in the architect role. More often than not, a decision has to be made in new, uncharted territory, and the ability to determine the right course is crucial for the role - and that usually needs quick experiments, prototyping, or analysis of trade-offs.

Let's look at some more tasks that architects might do, or are expected to do, and how they apply the ability to think ahead and make decisions using trade-offs.

Research and evaluate technology and tools

Once, when I was creating a chat service at a startup, there were two choices - go with a well-proven chat protocol such as XMPP, or invent a new proprietary one on top of a modern messaging protocol such as MQTT. My research and evaluation involved studying the complexity, the need for security and obfuscation, the speed of development, and future maintenance. Given that I had significant experience with XMPP, it would have been the natural choice in any other situation. But I went with the alternative due to the unique aspects of the new company. It helped tremendously in creating a customizable and lightweight interaction, and eventually became a product differentiator.

At one time in another organization, when I was revamping our video conferencing system from an expensive multipoint control unit (MCU) to a scalable selective forwarding unit (SFU), there were at least three choices of media servers to pick from. Although third-party prior research provided some guidance on the performance numbers, I had many open questions, e.g., whether it would work with our unique telephony requirement, or how easily customizable it was. I created a proof-of-concept implementation to verify the basic flows, measured primitive performance numbers to correlate with the existing research, considered the trade-offs of the programming language the server was written in and the maintenance overhead over the next few years, and eventually picked one. It served us well!

Such early decisions are often a crucial part of an architect's job - especially the decisions that will become too hard to change later. Sometimes, I try to delay the decision by incorporating flexible design, i.e., to pick any of the choices dynamically, or to shield the decision behind another layer of indirection. However, in many cases, a decision must be made now. And the ability to quickly and effectively research and evaluate technologies and tools plays a vital role.
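To make the layer-of-indirection idea concrete, here is a minimal sketch, with hypothetical class and method names, of hiding the media-server choice behind a small adapter so the decision can be revisited later without touching the rest of the code.

```javascript
// A minimal sketch of shielding a media-server decision behind an adapter.
// The class and method names are hypothetical, for illustration only.
class MediaServerAdapter {
  async createRoom(roomId) { throw new Error('not implemented'); }
  async addParticipant(roomId, userId) { throw new Error('not implemented'); }
}

class ServerA extends MediaServerAdapter {
  async createRoom(roomId) { /* call server A's API here */ }
  async addParticipant(roomId, userId) { /* server A specific signaling */ }
}

class ServerB extends MediaServerAdapter {
  async createRoom(roomId) { /* call server B's API here */ }
  async addParticipant(roomId, userId) { /* server B specific signaling */ }
}

// The rest of the application depends only on the adapter,
// so the actual server can be selected by configuration.
function makeMediaServer(config) {
  return config.server === 'A' ? new ServerA() : new ServerB();
}
```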

This involves thinking ahead on how the particular tool or technology will serve us, and at what cost; using prior research as well as hands-on experimentation to understand the trade-offs; making a decision; and documenting the decision along with the thought process and influencing information for future reference.

If such crucial decisions are made without due diligence, e.g., "let's pick ReactJS as our framework because it is more popular than others", or "let's use Google Cloud as they do more heavy lifting for us than others", then the software system will sooner or later suffer the consequences. And when that happens, without prior documentation about the decision, it becomes an ugly blame game, or a long and expensive refactoring project.

Create proof-of-concept of the big picture

The ability to comprehend the big picture and to quickly create proof-of-concept implementations is another crucial part of an architect's job. Some people may argue that an architect is not required to write code; seeing the big picture and hiring a developer to implement the core concept is equally effective in such cases. Luckily, I have the gift of rapid prototyping.

I was hired at one startup to create interconnectivity between web (Flash Player) based media and telephony. After landing there, it was expected that I would create a demonstration in about six months, which would then be used to enhance the existing product. I created the initial implementation in less than a month! The CXOs could not believe it. And it created a lasting first impression. Although that particular project did not get productized for another year due to other logistics, I got the opportunity to re-architect the video communication platform in a clean-slate, modular manner that aligned with the longer term vision of the big picture of the product.

At another small company with a very small team, we divided the proposed software implementation into three parts, so that the three senior folks, including me, could each take control of and care for one part. However, this created an unnecessary dependency - the client couldn't work until the server was ready, and the server could not be tested until the client was working. Although I was the principal owner of the client system, I also created lightweight prototypes of the servers - one for messaging and another for APIs. This allowed us to identify system integration problems early on, provided me with a way to fully develop and demonstrate the client software even before any of the servers was ready, and created an interface template that the server implementations could adhere to in the future. Since many of our demonstrations were client facing, we could start demo'ing to potential customers before our real cloud server infrastructure was done.
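As a flavor of such a lightweight stand-in server, a sketch along these lines can unblock client development before the real backend exists; the endpoints and payloads are made-up placeholders, not the actual interface we used.

```javascript
// A minimal sketch of a stand-in API server for client development.
// Endpoints and payloads are hypothetical placeholders, not the real interface.
const http = require('http');

const cannedResponses = {
  '/api/login': { token: 'fake-token', userId: 'u1' },
  '/api/rooms': [{ id: 'r1', name: 'Demo room' }],
};

http.createServer((req, res) => {
  const body = cannedResponses[req.url];
  if (body) {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(body));
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8080, () => console.log('mock server on :8080'));
```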

At another company, as we planned to enhance our product from small 12-party video meetings to very large meetings with 100 or more participants, we faced several challenges. A good video layout was one such challenge. Developing a video layout with so many live participants would be inherently slow due to the test requirements and the long develop-debug cycle. So I decided to separate the layout logic from the rest of the system, and created a proof-of-concept for the layout alone. Then I worked backwards to integrate the layout logic into the overall system - one step at a time, e.g., by using images instead of live videos, or by forking the same video multiple times instead of having actual participants join. This greatly improved the speed of development, testing, and bug fixing of the feature. We kept the ability to test such large video layouts in our production software. It continued to be used to quickly replicate and fix layout bugs in large video meetings, instead of having to actually hold real, but time-consuming, large meetings.
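To show what separating out the layout logic can look like, here is a small sketch of a grid layout computed as a pure function, which can be exercised with placeholder images instead of live video; the sizing rules are simplified assumptions.

```javascript
// A minimal sketch of grid layout as a pure function, testable without live video.
// The sizing rules are simplified assumptions for illustration.
function computeGrid(participantCount, containerWidth, containerHeight) {
  const cols = Math.ceil(Math.sqrt(participantCount));
  const rows = Math.ceil(participantCount / cols);
  const tileWidth = Math.floor(containerWidth / cols);
  const tileHeight = Math.floor(containerHeight / rows);
  const tiles = [];
  for (let i = 0; i < participantCount; i++) {
    tiles.push({
      x: (i % cols) * tileWidth,
      y: Math.floor(i / cols) * tileHeight,
      width: tileWidth,
      height: tileHeight,
    });
  }
  return { rows, cols, tiles };
}

// Example: layout for a 100-participant view in a 1280x720 container.
console.log(computeGrid(100, 1280, 720).tiles.length); // 100
```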

I will write a separate article on the art of rapid prototyping to quickly create proof-of-concept implementations. But, in summary, it involves the ability to see the core application logic in the big picture, abstracting out other things that are less important, and quickly connecting various interfaces to demonstrate the whole system. While others may disagree, I often find relying on existing web frameworks for rapid prototyping to be a big bottleneck, and I usually prefer to use raw HTML5, CSS3, and pure JavaScript.

Creating a proof-of-concept aligns with thinking ahead, by identifying potential problems early on, and is useful in making technology decisions, by quickly evaluating the viability of certain ideas. Furthermore, architecture documents often describe how the system looks or behaves from different points of view - from the high-level big picture to low level implementation details. The big picture creates a cohesive understanding, and it is often seeded in the initial proof-of-concept implementation.

Systematically divide the goal into smaller solvable problems

Once the technology path towards the goal is in sight, additional resources are added to continue the design and implementation. The big problems are divided into smaller solvable problems depending on the team structure and the engineering processes followed. Since the architect created the initial prototype or was aware of the original goal in great technological detail, he or she is expected to divide the big tasks into smaller ones.

This is one area I have sometimes struggled with, largely because of a lack of clear understanding of other folks' aptitudes and abilities. A task small enough for one may be a many-month project for another. At one organization, after creating the initial prototype of a large video conference system, I went ahead and created about four high level tasks for the other developers in the team to work on. I was shocked when the team came back with a projected estimate of several months for each task. So, in the short term, I took up productization of two of the high level tasks myself, and further subdivided the other two into ten or so smaller subtasks. This also allowed the team to fit them into their bi-weekly sprints over the next several months.

Another problem with this task division effort is that if the collection of smaller tasks takes too long, then they need to be prioritized, re-prioritized, and again, re-prioritized. This can result in solving many easy problems while delaying the hard ones. It also usually involves duplicating work, creating throw-away work, or creating a half-baked system whenever requirements change. In one project at an organization, several short term decisions for quick gains were made during the development phase. Those decisions were supposed to be changed eventually for better robustness and stability. Instead, they got set in stone and became very hard to change later on. This not only frustrated the architect (me), but also the other developers who ended up working on that piece of software lacking the expected robustness and stability.

One way to solve this is to set the right expectations about the time the project will take. The expectation should keep open-ended time estimates for unknowns, and should show the estimation trade-offs against software quality, robustness, and stability. This is because accurate estimation of a software implementation timeline is a very hard problem, especially for projects based on cutting edge technologies with a lot of unknowns. Generally, if the prototype took N amount of effort, then incorporating it into a product can take 10x or more. Adding more resources to the implementation phase may reduce the time, but does not cause a proportional reduction.

Some people believe in the 80-20 rule or its variant, i.e., 80 percent of the implementation is done with 20 percent of the effort, and the remaining implementation takes a lot more. With this belief, one way to set the expectation is as follows. Wait until 80 percent of the feature, task, or implementation is completed before adding more resources or announcing it. Keeping fewer resources on the project until it is almost complete also reduces the pressure from upper management to show quick progress or to cut corners. Once the core developers are satisfied that most of the feature or implementation is done, then adding more resources to polish and finish the remaining tasks is not too hard to estimate accurately. If the announcement of the project deadline is made based on such informed estimates, it is more likely to be fulfilled or even beaten.

Unfortunately, this idea runs counter to the more common agile practices and team oriented development effort, where every member of the team is expected to know and contribute to every part of the software. This is because the above proposal assigns one (or two) core developers who are expected to complete most of the implementation without any pressure from outside and without any process overhead. My attempts to convince others to adopt this idea were not successful in that organization. However, in my prior projects and other places, this model worked quite well for implementing complex software pieces based on emerging technologies.

Dividing the tasks is largely about making decisions based on trade-offs involving several factors, including developer skills, expected timeline, and component modularity. Communicating the reasons for such a division, and its timeline, to various stakeholders is also important.

Create knowledge base and training for others

With my research background, I have written many academic papers, technical reports, and articles. I often find it a lot easier to describe a new project proposal or system after an initial prototype is implemented. Many new questions and challenges arise after the initial implementation, and these usually form the basis of the proposal. Such a technical-report-style document is one way to share knowledge.

Knowledge sharing by an architect happens in many forms - synchronously in meetings and live presentations, as well as asynchronously via stored documents and knowledge bases. My opinion is that creating documents is a lot more productive than engaging in live debate or brainstorming to answer architectural challenges. Even if live sessions are involved, they should be done in the context of a pre-created document or knowledge base, where significant thought has gone into creating that document. This saves time, avoids bias toward certain personality traits, and creates a meaningful digital trail that can guide and contribute to the overall architecture.

The architect must adapt how the information is shared based on the work situation, the job responsibilities and, most importantly, the audience. Creating hundreds of pages of architecture documents for a team with a 140-character communication style or a meeting-first habit is likely a waste of time. Creating one video tutorial to describe a complex system to managers, customer support, and developers alike is not going to work either.

In the past I have often relied on technical report-style documentation, followed by a meeting or training session to summarize the architecture. I assume that a typical audience is just like me. When I need to communicate with others, such as product managers, customer support, or operations, I end up creating a separate deck and hosting a separate session.

This is not the best option in some situations. Some teams prefer a wiki page with fixed templates, others have PowerPoint decks or Word documents, and some have tons of video recordings of sessions discussing the architecture. In a computer science software engineering class at the undergraduate or graduate level, many different types of documents and templates are taught. Modern software engineering has largely moved away from such traditional practices, especially in small and agile organizations. Nevertheless, one thing is worth remembering - no single document or approach is the right one, and any document will soon become stale. Keeping the document close to the code, e.g., via API document generation or automated message flow generation, can keep it relatively fresh, but only reaches a limited audience.
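As a small example of keeping documentation close to the code, a doc-generation-friendly comment on a hypothetical function might look like this; tools such as JSDoc can turn these into API documents automatically.

```javascript
/**
 * Joins the given conference room and returns the local participant handle.
 * A hypothetical function, shown only to illustrate doc-generation-friendly comments.
 *
 * @param {string} roomId - identifier of the conference room
 * @param {{audio: boolean, video: boolean}} media - which local media to publish
 * @returns {Promise<Object>} resolves to the local participant handle
 */
async function joinRoom(roomId, media) {
  // ... implementation elided ...
}
```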

The format of the document is not terribly important. What is important is that the document captures any crucial and non-trivial architecture and design decisions, and that the information is shared with the right audience - be it the software developer, test engineer, product designer, customer support, operations lead, or people manager. For example, a manager may only be interested in the high level components - which feature belongs to which part, and how complex each part is. A test engineer may be interested in message flows, example scenarios, and corner cases that are, or are not, handled, but not much about the internal application logic.

At one company, I found that the nature and complexity of my early implementation left the other developers in the dark; they were reluctant to contribute due to unfamiliarity with the new technology and the new programming paradigm. I created hundreds of pages of slide decks, in a tutorial style, and presented them to the team in over ten live sessions with recordings, covering technology background, various components, complex application logic, and various design decisions. This served not only as the bootstrap material for new developers but also as an extensive record of what had been implemented and what was planned for the future.

The same set of slide decks included information about how the video conference layout worked in grid, spotlight, or paginated modes, and how it was updated with newer features such as the active talker indication. However, that technical slide deck was not appropriate for product designers. So I created a separate set of detailed screen transition diagrams, highlighting the important aspects of various layouts. Such a document was intended to help product designers consider all relevant states and application logic whenever any change is proposed in the video layout.

At another startup where I worked, spending time writing documents instead of code was discouraged. After creating a few software applications, when it was time to leave, I was encouraged to transfer the knowledge about those applications verbally, in a meeting, to another developer. The developer felt confident at that time. But a few weeks after I left, they reached out to me for help. Due to my new employment obligations, I could not. From what I heard later, those applications got abandoned.

To avoid such situations, at the next startup, I took the initiative and spent my last few days with another competent engineer, creating video walk-throughs of all the important pieces of code I wrote, going file-by-file, function-by-function, and in some cases, line-by-line. Even if the software is not important now, creating some form of documentation never hurts for the unpredictable future.

Continuous monitoring and improvement of the system

Emerging technology software often behaves like a dynamic living organism. Depending on what is fed to it, it can behave differently. It lives in an environment that can change frequently and can affect how it behaves. For example, a change in the browser's handling of certain WebRTC APIs can break the application's media flow. An application designed and tested with residential routers and firewalls can misbehave behind a stricter corporate firewall that does not keep a persistent WebSocket or TCP connection alive for more than, say, five minutes. An application that expects the input webcam video in landscape orientation may suddenly be fed portrait-mode video from a mobile camera, causing unexpected and excessive cropping.
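As one concrete mitigation for such idle-timeout firewalls, an application-level heartbeat over the WebSocket is common; a minimal sketch follows, where the interval and message format are assumptions.

```javascript
// A minimal sketch of an application-level WebSocket heartbeat, to survive
// firewalls that drop idle connections. Interval and message format are assumptions.
function keepAlive(ws, intervalMs = 60 * 1000) {
  const timer = setInterval(() => {
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(JSON.stringify({ type: 'ping', ts: Date.now() }));
    }
  }, intervalMs);
  ws.addEventListener('close', () => clearInterval(timer));
}

const ws = new WebSocket('wss://example.com/signaling');
ws.addEventListener('open', () => keepAlive(ws));
```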

In the past, I have focused on three aspects of software monitoring: (1) abnormal behavior, (2) failure and recovery, and (3) quality of service. There is some overlap between the first two. Let me explain these.

Abnormal behavior is when the software does something unexpected. Often this is reported as a bug or alert, which needs to be fixed. For example, uploading a PNG image works, but a JPEG one causes the application to keep spinning. Logging such abnormal behavior at the error or warning level is important. Such error logs should be monitored and addressed.

Failure, and recovery from failure, deal with problems that are usually known in advance. For example, if the call setup is not completed within some duration, it is terminated or retried. Logs of the important control flow in the software are useful. Moreover, message sequence diagrams or state transition diagrams generated dynamically from the running system are great for visualizing and monitoring such events. They help in diagnosing problems quickly and effectively, instead of having to go over a large set of log lines.
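A minimal sketch of such a known-in-advance failure, handled with a timeout, a warning log, and a retry; the timeout value and the attemptCallSetup() function are hypothetical.

```javascript
// A minimal sketch of failure handling for call setup: time out, log, and retry.
// The timeout value and the attemptCallSetup() function are hypothetical.
async function setupCallWithRetry(attemptCallSetup, { timeoutMs = 10000, retries = 2 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await Promise.race([
        attemptCallSetup(),
        new Promise((_, reject) =>
          setTimeout(() => reject(new Error('call setup timed out')), timeoutMs)),
      ]);
    } catch (err) {
      console.warn(`call setup attempt ${attempt + 1} failed: ${err.message}`);
      if (attempt === retries) throw err; // give up after the last retry
    }
  }
}
```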

Quality of service usually means how well the system performs in doing what it is supposed to do. This is measured independent of failures or abnormal behavior, and is measured when everything is working - but perhaps not working to the best of its ability. For example, a video call is connected, but the picture quality is poor even with the user's high speed Internet connection. Or there is a lot of noise in the audio path even with only a small fraction of packet loss shown in the network.

Monitoring, and subsequent improvement based on it, are often domain specific. For example, a video conferencing system will need a different kind of monitoring tool than a voice-only telephony system or a web-based low latency chat application. Domain expertise is required in thinking ahead, as well as in making decisions about what to monitor, how to interpret the metrics, and what to improve.

At one company, we were using a third-party product for quality monitoring of the video conferencing system. After that product was acquired by a competitor, we started looking for an alternative, but did not find anything that met our needs. I decided to spend a few weeks putting together an in-house quality monitoring tool. Over the next several months, I sporadically spent additional effort further customizing it for our unique application logic, as well as creating another integrated tool for connectivity diagnostics, application state insight, and message sequence visualization among the various components of our system. That tool not only helped us reduce the diagnostic time of an issue from several days to a few hours, and sometimes minutes, but also provided a platform to experiment with many WebRTC-style key performance metrics.

Unlike the traditional telephony (VoIP) system's simple metrics for quality and performance, WebRTC stats are quite comprehensive and extensive. Deriving a few crucial metrics from tens or hundreds of individual data points requires deep domain expertise as well as relentless experimentation.
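As an example of deriving a couple of crucial numbers from the raw stats, the sketch below pulls a packet loss ratio and round-trip time out of getStats(); the field names follow the standard WebRTC stats dictionary, but the aggregation into two numbers is a deliberate simplification.

```javascript
// A minimal sketch of deriving simple quality metrics from WebRTC getStats().
// The aggregation into a single "quality" picture is a deliberate simplification.
async function sampleQuality(peerConnection) {
  const report = await peerConnection.getStats();
  let packetsLost = 0, packetsReceived = 0, rttSeconds = null;
  report.forEach(stat => {
    if (stat.type === 'inbound-rtp' && stat.kind === 'video') {
      packetsLost += stat.packetsLost || 0;
      packetsReceived += stat.packetsReceived || 0;
    }
    if (stat.type === 'candidate-pair' && stat.state === 'succeeded') {
      rttSeconds = stat.currentRoundTripTime;
    }
  });
  const lossRatio = packetsReceived ? packetsLost / (packetsLost + packetsReceived) : 0;
  return { lossRatio, rttSeconds };
}
```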

At one organization, people were already familiar with the traditional mean opinion score (MOS) for audio quality measurement, and were inclined to apply a similar concept and calculation to Internet video flows based on WebRTC. Even though internal testing in a controlled environment went great, the project eventually became a wasted effort because the calculated quality score rarely reflected the actual quality experience. I attributed this failure to a lack of understanding of how WebRTC differs from traditional voice calls, and to misplaced decision making that targeted only low hanging fruit.

In the pursuit of continuous improvement, one important lesson for software architects is to focus on hard problems - and delegate away the low hanging fruit. Misplaced priorities often cause frustration and leave the product worse than what one started with. Architects should fight for what they believe in, and when the situation demands, refuse to give in, especially for important and hard problems. Honest communication is the key here.

When improvement decisions have to be made, not everything is known. For example, decisions on replacing the video codec with a new one, or changing the bitrate allocation algorithm to improve performance, require some data. However, such data is not available until the change is deployed. Such situations benefit from delaying the decision and from allowing any option to be chosen via some configuration. Using A/B testing and feature flag based deployment is recommended to gather the necessary data.

At one company, I made it a common practice to always put performance or quality related improvements, or changes to existing working logic, behind feature flags that could be turned on or off dynamically. Even the parameters that affect the behavior could be customized, so that if things did not work as expected, we could easily test with altered parameters, cutoffs, levels, factors, and such. Many such flags also got exposed as end-user customizable features available to advanced users.
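A rough sketch of such a flag-with-parameters setup, and of code consulting it, might look as follows; the flag names, defaults, and fetch endpoint are hypothetical.

```javascript
// A minimal sketch of feature flags with tunable parameters.
// Flag names, defaults, and the fetch endpoint are hypothetical.
const defaults = {
  newBitrateAllocator: { enabled: false, maxBitrateKbps: 1200, rampUpFactor: 1.5 },
};

let flags = { ...defaults };

async function loadFlags() {
  try {
    const response = await fetch('/api/feature-flags'); // hypothetical endpoint
    flags = { ...defaults, ...(await response.json()) };
  } catch {
    // Fall back to defaults if the flag service is unreachable.
  }
}

function allocateBitrate(currentKbps) {
  const f = flags.newBitrateAllocator;
  if (!f.enabled) return currentKbps; // old behavior stays the default
  return Math.min(currentKbps * f.rampUpFactor, f.maxBitrateKbps);
}
```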

Identify and address disruptive technologies

At one company, with significant investment in, and a portfolio of, traditional VoIP equipment and systems, I started looking at how emerging technology could disrupt what we did. The company was selling soft-switches to enterprises to control and monitor all voice (and video) traffic in and out. With peer-to-peer encrypted WebRTC, such a proposition became challenging. I invented and demonstrated new ways to accommodate control and monitoring of such traffic for enterprises - so that when the situation demanded it in the future, we would be ready.

At another company, while creating a video conferencing product, I identified a need to be able to watch third-party videos together in a synchronous meeting, with participant- or moderator-controlled synchronized playback of the shared view. After I created an initial prototype for YouTube videos as well as locally uploaded ones, the feature became quite popular. Later, I worked on creating other controlled content sharing for slides and documents as well.

At yet another company, there was an identified need to incorporate background blur and virtual backgrounds in video meetings. After creating the initial proof-of-concept using a third-party AI engine and model, we were not satisfied with the quality and performance of its background detection logic. I worked with an engineer from another group to address the issue - by creating our own model and incorporating many tweaks to the parameters of the engine. It took more time and effort than expected, but the result was quite noticeable, and satisfying!

Thinking ahead, beyond the daily, weekly or even quarterly development and implementation efforts, is at the core of what a good architect does. The decision whether to address the actionable items immediately depends, however, on other factors, including resource availability, and flexibility of the work situation. 

Although many companies do not expect the architect to keep up with technology advancements in the outside world, I feel it is quite important to do so, especially when working on cutting edge technologies such as WebRTC and AI. As an example, periodically looking at upcoming browser changes in the Chromium release notes is useful - to know which deprecated APIs will be removed soon, or what will change in how the quality stats are collected.

Many times, due to business or other reasons, product improvements are driven by competitors or other market leaders, e.g., Zoom added virtual background, so we must add it soon; or Hangouts does not show a flexible layout, so we do not need to. This can force the architect into a follower mode, worrying only about problems that are already solved elsewhere. I find this approach quite demotivating. My suggestion to architects is to also look beyond what other companies are doing, to try new ideas, and to not be afraid of failure. It is better to try and fail quickly than to wait and lose the opportunity, even if there is only a small chance of success.

Conclusions

I talked about many aspects of an architect's role. In summary, software architects are technical owners. They must have domain expertise, enough experience and knowledge, and must be able to drive experimentation and analysis of trade-offs when making informed decisions. In doing so, they often look ahead, beyond the daily standups, bi-weekly sprints, or quarterly updates, identify areas to improve, and strive to solve hard problems. They share relevant knowledge with others, and tailor their communication to the right audience in the right format. They end up becoming the go-to person for any technical needs of the piece of software or technology they own.

As a software architect, I thrive in an environment that promotes flexibility, innovation, and recognition for hard work and for solving challenging problems. I suspect other architects have their own preferred environments they would rather work in.

In the past several years, most of my time as an architect has been spent wearing three hats - software developer, technology expert, and innovator. With the software developer hat, I created many production-grade software pieces, performed routine debugging and bug fixes, and took part in software engineering processes. With the technology expert hat, I kept myself up to date with what was happening in the outside world, and provided guidance and help to others in becoming familiar with new technology, investigating issues related to it, and solving hard implementation problems involving complex state machines and robustness constraints. With the innovator hat, I researched, analyzed, and presented my findings about new technologies and new ideas, and created numerous proof-of-concept implementations to showcase many emerging ideas.

One topic that I briefly alluded to earlier is the issue of time and deadlines. Many work situations are under time pressure, some real and some artificial. Once a deadline or estimate is proposed, the pressure starts building up. Then comes the struggle to fit the effort within the time frame, often by cutting corners or creating sloppy work. Some of this sloppy work stays in the product forever. My recommendation is to drive the deadlines, if really needed, based on informed estimates. Secondly, avoid artificially imposed due dates by being flexible - and do not propose a deadline or estimate without real analysis, just because everyone expects it or the process demands it. For early stage software with many unknowns, do not be afraid to say, "it will be ready when it is ready." Finally, employ competent developers who create high quality software at high speed.

As a closing remark - a software architect should have the responsibility and power to influence technical decisions, should use domain expertise and experimentation when needed to analyze trade-offs, should think ahead beyond the short term goals, and should communicate effectively about those technical decisions to others.
