The Broken Mentality Around The Software Development Process
Building software is not like running an assembly line
I spent many years being frustrated by the process at many companies. At the same time, I struggled to put that frustration into words. I knew a lot of the practices being used were worthless, but they sounded good and I could not craft the right counterarguments.
I have since realized why these practices sound good, but fail when used in software development:
Those practices were copied from manufacturing and are good practices in that context.
Running a factory requires the ability to create tens of thousands of the exact same thing. The process has to be 100% repeatable, because even a 1% error rate can have a massive impact on margins.
For some reason, the idea that software development is analogous to running a factory took hold and is prevalent today. The problem is that building software looks nothing like creating a good in a factory. You would not need software developers if it were. You would only need people who knew how to copy/paste code and generate 10,000 copies of the exact same code. That does not happen. New software being built is almost always bespoke.
A single factory does not generate 10,000 bespoke goods.
Let's look at how copying the manufacturing process has negatively impacted software development.
Quality Assurance
Software development is one of the few professions where it seems acceptable to have quality checked by a separate team that does not share that profession. Is there a separate QA team for marketing? Or do other marketers act as a sounding board for any marketing plan? Does an executive have a QA team to review an acquisition deal? Or are other executives consulted? Do scientists have their research reviewed by a QA team? Or is it peer reviewed by other scientists?
Every company I have been involved with that had a manual QA team separate from the development team has built lower quality software, and at a pace at least an order of magnitude slower. I have not worked at a statistically significant number of companies, but the correlation is 1 for the ~20 companies I've been involved with. The qualifier I added is manual QA teams. None of the points below apply to automated QA teams that focus on creating test infrastructure and patterns for the development teams. Unlike manual QA teams, automated QA teams are also staffed with software developers.
The first reason for this is that most QA testing should be automated. Modern software applications tend to have too many common use cases, let alone edge cases, for a team of humans to manually go through in a reasonable amount of time. To mitigate this in manual testing, compromises are made to define a subset of areas to test. Not only is there the possibility of human error in the testing itself, but now we're adding the possibility of human error in choosing that subset.
With automated tests, we can just run all of them regardless of whether we think some of the tests make sense or not. The automated tests I have for our internal systems execute in under three minutes. It would take me a week to go through all of them manually.
Some things are also impossible to test manually with any confidence in the results. Concurrency is a perfect example. Adding concurrency to a system means something can work perfectly fine in development and QA environments, but be completely broken in production. You know that saying that the definition of insanity is doing the same thing repeatedly and expecting different results? That actually happens with concurrency. Before I saw the light, I tested concurrent code by repeating the test case at least a dozen times and hoping I didn't get a different result. Even that was no guarantee.
Preventing these race conditions requires a review of the code by developers familiar with that codebase. No one other than a developer has the knowledge to perform that validation. In many cases, automated tests can be created to make the execution order of the code deterministic and test various combinations of that execution order.
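To make that concrete, here is a minimal sketch of pinning down one interleaving in a test. The Account/withdraw example and the promise-based pause points are made up for illustration; the point is that the test, not the scheduler, decides the order in which the two operations proceed.

    // Illustrative only: a read-modify-write race reproduced deterministically.
    class Account {
      constructor(public balance: number) {}
    }

    // Reads the balance, waits at a controllable pause point, then writes back
    // a value based on the (possibly stale) read.
    async function withdraw(account: Account, amount: number, pause: Promise<void>) {
      const current = account.balance;    // read
      await pause;                        // another call can interleave here
      account.balance = current - amount; // write based on the earlier read
    }

    async function testLostUpdate() {
      const account = new Account(100);

      let releaseA!: () => void;
      let releaseB!: () => void;
      const pauseA = new Promise<void>((resolve) => (releaseA = resolve));
      const pauseB = new Promise<void>((resolve) => (releaseB = resolve));

      // Both calls run up to their pause point and read balance = 100.
      const a = withdraw(account, 30, pauseA);
      const b = withdraw(account, 50, pauseB);

      // The test chooses the order: A writes first, then B clobbers it.
      releaseA();
      releaseB();
      await Promise.all([a, b]);

      // 50 instead of the correct 20 -- the lost update shows up on every run.
      console.assert(account.balance === 50, `lost update: balance is ${account.balance}`);
    }

    testLostUpdate();

A test suite can enumerate the other release orders the same way, which is what testing the various combinations of execution order looks like in practice.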
The creation of automated tests also results in better code. Developers have to structure code differently to make it easier to test. There is a great talk on this topic here.
One example is that applications with tests are more readable because they use an appropriate amount of abstraction. Those abstraction layers can get out of hand otherwise. A company I consulted for had an onboarding process for new developers that took months, because that's how long it took to unravel all the abstraction layers. I tried creating automated tests for that system and ended up with 600-700 lines of setup code for a single test.
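A small sketch of what "structured differently" can look like in practice (the overdue-loan check is made up for the example): the testable version takes its clock as a parameter instead of reaching for it internally, so a test can pin time to a fixed instant.

    // Harder to test: the clock is hidden inside the function.
    function isLoanOverdueHidden(dueDate: Date): boolean {
      return dueDate.getTime() < Date.now();
    }

    // Easier to test: the clock is an explicit dependency with a sensible default.
    function isLoanOverdue(dueDate: Date, now: () => number = Date.now): boolean {
      return dueDate.getTime() < now();
    }

    // The test pins "now" to a fixed instant instead of depending on when it runs.
    const fixedNow = () => new Date("2024-01-15").getTime();
    console.assert(isLoanOverdue(new Date("2024-01-01"), fixedNow) === true);
    console.assert(isLoanOverdue(new Date("2024-02-01"), fixedNow) === false);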
Developers who are responsible for the quality of their software often write better quality code as well. Many developers in companies with manual QA teams will act (and sometimes say out loud) like bugs in the software are not their fault. QA is responsible for catching them after all! Your first thought may be that these are just bad developers. This is a trait everyone in every profession shares though. Humans in general lose incentive to do something when the responsibility is on someone else.
Lastly, how do we actually define quality? Is the definition that software performs exactly as specified in the requirements? What if the requirements don't serve the end user? Is software quality still high if the requirements are met, but users prefer to use spreadsheets or pen & paper? Most manual QA teams are dedicated to testing that everything was built to spec. They don't test whether the right things were put into those requirements. The latter is the one area where manual testing is essential.
At a company with an automated testing team, we ended up in a conversation about how we weren't doing manual testing and it wasn't necessary. One of the product managers responded "People do manual testing here. They're called product managers."
Product & Project Management
The testing done by product managers is a critical part of creating high quality software. The problem is that this testing is done at the very end of a development cycle after all the development time is spent. I wrote a separate post on this exact problem.
This linear process makes sense when you have a factory creating 10,000 copies of the same item. It makes less sense when the only real test for requirements is to show working software to users. Before a factory commits to making 10,000 copies of an item, work is done to make sure that people actually want that item. Many bespoke prototypes are made and shown to potential customers. Creating software that people want is not that different from creating a physical item that people want. People need to see and interact with it in order to provide usable feedback.
Making time to create those prototypes is difficult with a development process copied from manufacturing. Tickets/tasks are made. Estimates are attached to those tickets. The tickets are assigned to people. Completion progress is tracked. All of this produces a lot of metrics that can be measured as development goes on. Those metrics get misused in many ways, such as tying them directly to performance reviews.
The result of prototyping is confidence that users will want the software. That confidence is difficult to measure. When confronted with a useful metric that's hard to measure versus a potentially misleading metric that's easy to measure, people will gravitate to the latter most of the time. It's easier to create a performance review from the easy-to-measure metrics. Everyone involved is now incentivized to focus on those metrics rather than on creating prototypes.
Team Silos
One of the more important concepts for running a factory is the division of labor. Divide the work into distinct tasks and have each person focus on one of those tasks. For repeatable tasks, this results in higher quality and higher efficiency.
Software development is often broken down this way as well: product managers, UX designers, frontend developers, backend developers, SREs, QA testers, etc. This breakdown gives us the exact opposite result when building software: lower quality and lower efficiency.
By leaving developers out of product management, you close off possibilities. I have lost count of the number of times I've made a suggestion to a product manager and the response I received was "Oh! I didn't think that was possible. That's way better." It helps to have dedicated product managers focused on product and user research, but it also helps to involve developers in that process. A deeper understanding of the technology provides a different perspective that results in different ideas.
Separating UX designers, frontend developers, and backend developers has its own problems. Business logic can exist in the frontend, the backend, or be duplicated in both. Business logic in the frontend can be a security issue, but often it isn't. Having the logic there gives users quicker responses and reduces load on the application servers. An example is the amortization tables I generate for loans. The tables are generated from user input. Someone could intercept the API calls and make the table look different, but they'd only be hurting themselves, and the damage can be easily undone by another user: they just need to save the loan again.
In cases where a user modifying the request is a problem, duplicating the logic can still provide that same quick user experience. The backend then validates the result asynchronously to ensure the data hasn't been manipulated.
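Here is a rough sketch of that duplicate-and-validate pattern, using the loan example above. The LoanInput shape, the payment formula, and the repairLoanPayment helper are illustrative, not the real system.

    // Shared logic: the same payment calculation runs in the browser for a fast
    // UI and on the server for validation.
    interface LoanInput {
      principal: number;
      annualRate: number; // e.g. 0.06 for 6%
      termMonths: number;
    }

    function monthlyPayment({ principal, annualRate, termMonths }: LoanInput): number {
      const r = annualRate / 12;
      if (r === 0) return principal / termMonths;
      return (principal * r) / (1 - Math.pow(1 + r, -termMonths));
    }

    // Backend: accept the save immediately, then verify out of band (e.g. from a
    // queue worker) and repair the record if the client's value was tampered with.
    async function validateSavedLoan(loanId: string, saved: LoanInput & { payment: number }) {
      const expected = monthlyPayment(saved);
      if (Math.abs(expected - saved.payment) > 0.01) {
        await repairLoanPayment(loanId, expected);
      }
    }

    // Illustrative persistence helper; a real system would update the stored loan.
    async function repairLoanPayment(loanId: string, payment: number): Promise<void> {
      console.log(`loan ${loanId}: payment corrected to ${payment.toFixed(2)}`);
    }

The user sees the table instantly, and the stored data still ends up correct even if someone tampers with the request.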
Meanwhile, the logic for online purchases should only happen on the backend. You would not want to show someone that their credit card was charged successfully unless it actually was. Accuracy is more important than speed in those cases.
A team that is responsible for a product as a whole is incentivized to optimize for what is best for the product. A set of silo-ed teams are incentivized to make it look like they did their part and the problems are with someone else.
Silo-ing developers from SREs has a different set of problems. There are a lot of reasons why I think developers should be involved in managing production infrastructure, but the big one is this: developers who write bad code that results in an SRE waking up at 2am are going to be very very apologetic and promise it won't happen again. Developers who write bad code that results in themselves waking up at 2am are going to learn to write better code real fast.
----------------------
These are just a handful of the issues with the predominant mentality around the software development process. What works for performing a repeatable task does not work for creating bespoke goods. The fact that building software has been made to look like running a factory is based on a fundamental misunderstanding of what is involved in building that software. Engineering leaders, myself included, have maintained this process because it is how things have been done for decades; we default to running things this way instead of rethinking the process from scratch.
We all need to think really hard about what practices we follow because of this default and which are actually useful. Software development is a lot more analogous to deciding what the factory should build rather than operating the factory after that decision is made.
I have been trying to tackle this exact problem with my team. I'm hanging on to the idea that the "three legs of the stool" are Product, UX, and Engineering, and that all three of those need to be involved at every stage of the "Lean Startup" philosophy/framework that I think our company uses. Still too many problems with that, so I doubt it is the right solution.