Kategorien
Coding Tinkering

Writing my own x86_64 Operating System

tl;dr Repo: https://github.com/jbreu/jos

During mid of last year I dug myself into the OS development rabbit hole on Youtube and in my fall vacation I had some slack allowing me to hands-on hack together my own Operating System called JOS (Jakob’s OS). This was the starting point of a pretty exciting journey. Till today, there is no particular purpose or goal other than learning more about low level basics. Nonetheless, I enjoyed this exercise so much that I can even grant it some therapeutic effect during very stressful business life 🙂 Of course, as a family father and engineering manager, there is not much time and I often found myself pondering in my mind while doing chores how to solve the next stack corruption puzzle, instead of actually coding/debugging. In this article, I will tell you a bit about my key learnings/technologies this exposed me to.

The starting point was a brief video series by David Callanan on how to write a bootloader for an OS till printing a Hello World to a console. I started to fork his repo and worked my way from there. First thing was to rewrite the whole OS code (besides the bootloader) in Rust, as this was another thing I wanted to learn. It took me a while to get used to Rust, but eventually came to a working Hello World. The next steps were minor adaptions to the console printing ecosystem. The first real big step digging into the intrinsics of x86_64 CPU instruction set was the implementation of interrupts. This took a great amount of debugging, trial and erroring and reading of many specifications and internet resources. From the latter, I want to highlight the AMD64 Architecture Programmer’s Manual, Volume 2, and Philipp Oppermann’s Writing an OS in Rust. The latter is really good for the first steps with one caveat: It is based in big parts on the author’s library, which abstracts all x86_64 basics away and is in many parts more a tutorial to use the library. I didnt want to use this library because my impression was that it masked too much of the interesting stuff how things actually work on the low level. Hence, I implement everything on my own terms. After adopting keyboard and time interrupts I used the latter to display the clock time on the console in the top right corner.

A huge help was that I could use the qemu emulator and the gdb debugger in combination (via VS Code’s extension native-debug). It was the first time using qemu and gdb as a developer and after getting beyond the initial learning curve I genuinely enjoyed debugging with it. As additional tools I also used Ghidra (disassembler) and ElfViewer (inspect executables).

A really huge next step for me was to introduce userspace programs and later on multiprocessing (running multiple userspace programs „in parallel“ with a simple round robin scheduler). After studying basic patterns for this, I chose the hard path to implement it mostly by myself. This is the part which I am most proud of so far, because this required me to derive many inner works from the specs directly and it has cost me a long time to get right. For weeks and months I fought with sporadic stack corruptions and CPU exceptions. As userspace program I implemented a simple Hello World program, which interacted with the kernel via syscalls. Due to lack of a file system, which to date I was for some reasons not eager to implement, this userspace progam is stored inside the os executable and loaded from there – a hack which is actually really cool. Then I added a vga mode which enabled the userspace programs to print colored pixels to the screen (one line for each of the 4 userspace programs):

OK so here we were, but what you gonna do with an operating system which has keyboard input, can run user programs, has vga output? Yes, you are correct – we run Doom on it 🙂

You probably heard of people getting to run Doom on all sorts of awkward devices, engineers made it to run on potato batteries and toothbrushes:

So if they can run it on such devices, there must be a way to run it on JOS, right? And yes, its possible. It required me to write a small C wrapper around PureDOOM, which in turn also made me translate my Rust-based libc to C. After adding some additional syscalls, fighting with the 6 bit color maps, malloc implementation I finally made it. So here we stand today, JOS runs DOOM.

Kategorien
Allgemein

Remote target doesn’t support qGetTIBAddr packet

I had a weird issue today connecting via the VS Code extension Native Debug to a qemu instance. It was giving:

Visual Studio Code
Failed to attach: Remote target doesn't support qGetTIBAddr packet (from target-select remote :1234)

[Open 'launch.json'] [Cancel]

That sounded like gdb/native debug is expecting a feature qemu is not offering; however, just the day before it ran successfully – so what happened? Unfortunately and coincidentally, I did some housekeeping a few hours before, so my suspicion was that I somehow uninstalled some facilities, like Windows SDK or so. After 2 hours trying to reproduce my earlier setup, checking older versions of qemu, gdb and native debug I almost gave up, when I stumbled upon this via Google:
https://github.com/abaire/nxdk_pgraph_tests/blob/main/README.md?plain=1#L187

NOTE: If you see a failure due to "Remote target doesn't support qGetTIBAddr packet", check the GDB output to make sure
that the `.gdbinit` file was successfully loaded.

Now, of course I checked the gdb output before, but besides some warnings nothing appeared suspicious. The link made me re-check, and indeed there was this:

undefinedBFD: reopening /cygdrive/c/Users/Jakob/Documents/workspace/os-series/C:\Users\Jakob\Documents\workspace\os-series\dist\x86_64\kernel.bin: No such file or directory

That appeared new on second thought, and I removed following line from my VS Code’s launch.json

"executable": "dist/x86_64/kernel.bin",

That made it work again. At least partially, of course now the info to the executable is missing, but I have the feeling this is a minor thing to fix.

Addition (Nov 11th):

I have two additions to make. First it seems reader Piotr found the actual proper solution to this, see his comment below. After setting „set osabi none“ in .gdbinit I can indeed use gdb as before. This setting makes gdb not assuming any operating system ABI, which makes sense when you code your own OS from scratch. Thank you so much Piotr!

Second, just fyi and in case someone has similar issue but the aforementioned solution doesn’t work for some reasons, here is tzhe workaround I used till Piotr came for the rescue. As written above, removing the „executable“ line from the launch.json made gdb work, but of course now the executable and its debug symbols are missing, so setting breakpoints from the UI didnt work. After much tinkering I realized that adding the executable later during the debugging session helped. So what I did was adding a hardcoded breakpoint in the very beginning of my test object. When this breakpoint was hit, some commands were executed, of which adding the kernel as symbol file is the most important one. Also I had to add another breakpoint inside this command, which made gdb reload all other breakpoints from the UI, too.

b *0x100018
commands
  add-symbol-file dist/x86_64/kernel.bin
  b isr_common_stub
end

This worked with the caveat, that each debugging session halted at this first hardcoded breakpoint and I had to manually continue once. It was an acceptable workaround, but I am happy today Piotr gave a proper solution.

I still have no clue what exactly made this issue pop up; as mentioned I blame my „housekeeping“ activities, which were too many to reproduce the exact root cause.

Kategorien
Book Culture

Thoughts on “Implementing Lean Software Development”

Reading and summarizing books on lean software development, so you dont have to. Part 3 (see Part 1 and Part 2).

“Implementing Lean Software Development” written by Mary and Tom Poppendieck and published 2007 at Addison-Wesley. The Poppendiecks are quite famous in the lean-agile software development community, as they published the constitutive book „Lean Software Development: An Agile Toolkit“ in 2003, the first (recognized) book about bringing the lean principles to the software development space. The book reviewed here is a successor book aimed at delivering more practical advice. As in the last parts, my review will not focus on re-iterating lean and agile fundamentals, but rather focus on novelty aspects, ideas, and noteworthy pieces.

In the foreword, Jeff Sutherland (co-founder of the Scrum framework) introduces the Japanese terms of Muri (properly loading a system), Mura (never stressing a person, system or process) and Muda (waste):

Yet many managers want to load developers at 110 percent. They desperately want to create a greater sense of “urgency” so developers will “work harder.” They want to micromanage teams, which stifles
self-organization. These ill-conceived notions often introduce wait time, churn, death marches, burnout, and failed projects.
When I ask technical managers whether they load the CPU on their laptop to 110 percent they laugh and say, “Of course not. My computer would stop running!” Yet by overloading teams, projects are often late, software is brittle and hard to maintain, and things gradually get worse, not better.

page xix

In their historical review the authors bring a very interesting statistics which should resonate with many of my peers:

Both Toyodas had brilliantly perceived that the game to be played was
not economies of scale, but conquering complexity. Economies of scale will reduce costs about 15 percent to 25 percent per unit when volume doubles. But costs go up by 20 percent to 35 percent every time variety doubles. Just-in-Time flow drives out major contributors to the cost of variety. In fact, it is the only industrial model we have that effectively manages complexity.

page 5

As evidence, two papers are given: „Time -The Next Source of Competitive Advantage“ by George Stalk and „Lean or Sigma?“ by Freddy and Michael Balle. Managers and engineers increasingly become aware about the not-so-visible cost of complexity, typically by experiencing project failure or long-term product degradation.

For the aspect of inventory, the authors provide a quite good methaphor:

Inventory is the water level in a stream, and when the water level is high, a lot of big rocks lurking under the water are hidden. If you lower the water level, the big rocks begin to surface. At that point, you have to clear the rock out of the way, or your boat will crash into them. As the big rocks are removed, you can lower inventory level some more, find more rocks, clear them out of the stream, and keep on going until there are just pebbles left.

page 8

That adoption of lean practices and mindset is not straightforward and many organizations struggle or fail to do so is explained by the authors by pointing at a „cherrypicking“ approach. Hence, only some activities of the lean domain are adopted in isolation, like just-in-time or stop-the-line. Instead, they a classic:

The truly lean plant […] transfers the maximum number of tasks and responsibilities to those workers actually adding value to the car on the line, and it has in place a system for detecting defects that quickly
traces every problem, once discovered, to its ultimate source.

Womack, Jones, Roos: The machine that changed the world, page 99

I think this cannot be underestimated. To seldom I have seen organizations and management really focussing on the „value creators“ and the impediments those are facing.

In earlier blog posts I already wrote about the differences and similarities in the lean manufacturing and lean development. The Poppendiecks provide a table putting both side-by-side (page 14):

Later, in a footnote, the authors refer to a paper by Kajko-Mattsson et al. on the cost of software maintenance. The paper’s sources vary a lot, however its obvious that considering a typical big software project it becomes clear that this ratio quickly translates to millions of Euro/Dollar.

The published numbers point out that maintenance costs between 40% to 90% […]. There are very few publications reporting on the cost of each individual maintenance category. The reported ones are the
following: (1) corrective maintenance – 16-22% […] (2) perfective maintenance – 55% […], and (3) adaptive maintenance – 25% […].

Kajko-Mattsson et al: Taxonomy of problem management activities, page 1

On the lean principle of waste, the Poppendiecks make a simple but revelating statement:

To eliminate waste, you first have to recognize it. Since waste is anything
that does not add value, the first step to eliminating waste is to develop a keen sense of what value really is. There is no substitute for developing a deep understanding of what customers will actually value once they start using the software. In our industry, value has a habit of changing because, quite often, customers don’t really know what they want. In addition, once they see new software in action, their idea of what they want will invariably shift. Nevertheless, great software development organizations develop a deep sense of customer value and continually delight their customers.

page 23

Too often have I experienced software development projects who dont know what their product and the value they provide actually is. Of course, everyone has a vague feeling about what it could be, but putting it in clear words is seldom attempted and easily ends in conflict (a conflict which can be constructive if facilitated well).

On the second principle „Build Quality In“, there are some interesting distinctions on defects and the relation to „inspection“:

According to Shigeo Shingo, there are two kinds of inspection: inspection after defects occur and inspection to prevent defects.10 If you really want quality, you don’t inspect after the fact, you control conditions so as not to allow defects in the first place. If this is not possible, then you inspect the product after each small step, so that defects are caught immediately after they occur. When a defect is found, you stop-the-line, find its cause, and fix it immediately.
Defect tracking systems are queues of partially done work, queues of rework if you will. Too often we think that just because a defect is in a queue, it’s OK, we won’t lose track of it. But in the lean paradigm, queues are collection points for waste. The goal is to have no defects in the queue, in fact, the ultimate goal is to eliminate the defect tracking queue altogether. If you find this impossible to imagine, consider Nancy Van Schooenderwoert’s experience on a three-year project that developed complex and often-changing embedded software. Over
the three-year period there were a total of 51 defects after unit testing with a maximum of two defects open at once. Who needs a defect tracking system for two defects?

page 27

The authors are citing two papers by Nancy Van Schooenderwoert („Taming the Embedded Tiger – Agile Test Techniques for Embedded
Software
“ and „Embedded Agile Project by the Numbers With Newbies„). This resonates well with me, because accumulating too many defect (tickets) is very expensive waste. Its a kind of inventory with the worst properties. To break out of this is not straightforward, I have attempted and failed multiple times to establish a „zero defect policy“ (i.e. as long as there is a defect no further feature development happens). In that context let me at two more quotes from the book:

The job of tests, and the people that develop and runs tests, is to prevent defects, not to find them.

page 28

“Do it right the first time,” has been interpreted to mean that once code is written, it should never have to be changed. This interpretation encourages developers to use some of the worst known practices for the design and development of complex systems. It is a dangerous myth to think that software should not have to be changed once it is written.

page 29

On the fifth principle of „Deliver Fast“ a very important statement is made:

Caution: Don’t equate high speed with hacking. They are worlds apart. A fast-moving development team must have excellent reflexes and a disciplined, stop-the-line culture. The reason for this is clear: You can’t sustain high speed unless you build quality in.

page 35

Very often I observe a dire need for speed. Of course everyone wants to be faster in the software industry. Competition doesnt sleep. However, similar to unclear definitions of value and products, I have barely ever seen a clear definition of speed in a software project. Or, probably more correct: there were competing definitions of speed on people’s and especially decision maker’s minds. Its a huge difference to beat your team to „push out features now“ and grind to a halt when quality activities are started, or to maintain a sustainable pace:

When you measure cycle time, you should not measure the shortest time through the system. It is a bad idea to measure how good you are at expediting, because in a lean environment, expediting should be neither necessary nor acceptable. The question is not how fast can you deliver, but how fast do you repeatedly and reliably deliver a new capability or respond to a customer request.

page 238

The Poppendiecks are summarizing those effects in two vicious cycles (page 38):

For all the lean principles, the Poppendiecks also discuss myths originating from mis-interpreting the principles or applying them wrongly. One which caught my attention was the myth „Optimize by decomposition“. Its about the proliferation of metrics once an organization starts to apply the benefits of visual management. All of a sudden, there are tens if not hundreds of dashboards, graphs, KPIs, and such flying around. Their recommendation:

When a measurement system has too many measurements the real goal of the effort gets lost among too many surrogates, and there is no guidance for making tradeoffs among them. The solution is to “Measure UP” that is, raise the measurement one level and decrease the number of measurements. Find a higher-level measurement that will drive the right results for the lower level metrics and establish a basis for making trade-offs.

page 40

Speaking about myths, they encourage readers to check which myths apply to their situation – certainly a worthwile exercise also for you 🙂

Early specification reduces waste
The job of testing is to find defects
Predictions create predictability
Planning is commitment
Haste makes waste
There is one best way
Optimize by decomposition

page 42

Coming back to the notion of value, the authors are asking the fundamental question how great products are conceived and developed. They write:

In 1991, Clark and Fujimoto’s book Product Development Performance presented strong evidence that great products are the result of excellent, detailed information flow. The customers‘ perception of the product is determined by the quality of the flow of information between the marketplace and the development team. The technical integrity of the product is determined by the quality of the information flow among upstream and downstream technical team members. There are two steps you can take to facilitate this information flow: 1) provide leadership, and 2) empower a complete team.

page 52

The book has an extensive chapter on waste with many insightful aspects. I dont want to repeat all of them, and instead just provide some examples. For example I found this statement on the relationship of automation and waste/complexity very inspiring.

We are not helping our customers if we simply automate a complex or messy process; we would simply be encasing a process filled with waste in a straight jacket of software complexity. Any process that is a candidate for automation should first be clarified and simplified, possibly even removing existing automation. Only then can the process be clearly understood and the leverage points for effective automation identified.

page 72

In my current position, automation is a key activity, and we try to automate everything in an endeavour to increase speed, quality and convenience. The quote points out, that automation can hide or defer complexity. I can confirm this. Even though my team automated the complexity of product variants in the build process, our customers (e.g. manual testers) dont have a chance to test all the build we produce. Hence, even made with best intentions, our automation is overloading the whole.

Another good comparison between traditional manufacturing and software development is the following table, putting the seven waste equivalents side-by-side (page 74):

On architectural foresight, I like the following statement:

Creating an architectural capability to add features later rather than sooner is good. Extracting a reusable services „framework“ for the enterprise has often proven to be a good idea. Creating a speculative application framework that can be configured to do just about anything has a track record of failure. Understand the difference.

page 76

While discussing Value Streams, the authors dig into effectiveness and efficiency. They are of the opinion that

chasing the phantom of full utilization creates long queues that take far more effort to maintain than they are worth-and actually decreases effective utilization.

page 88

This opinion is not speculation, they provide a good analogy to road traffic and computer utilization:

High utilization is another thing that makes systems unstable. This is obvious to anyone who has ever been caught in a traffic jam. Once the utilization of the road goes above about 80 percent, the speed of the traffic starts to slow down. Add a few more cars and pretty soon you are moving at a crawl. When operations managers see their servers running at 80 percent capacity at peak times, they know that response time is beginning to suffer, and they quickly get more servers. […]

Most operations managers would get fired for trying to get maximum utilization out of each server, because it’s common knowledge that high utilization slows servers to a crawl. Why is it that when development managers see a report saying that 90 percent of their available hours were used last month, their reaction is, „Oh look! We have time for another project!“

pages 101f

I think in daily work, management typically does not pay enough attention to those basics. It is not that this is not known that too high utilization of resouces is bad, quite the opposite is the case in my experience. However, the root causes and the remedies are often not considered. Instead, there is a sentiment of capitulation: „Yes I know our team is stressed and overloaded, but we have to get faster nevertheless.“

In order to reduce cycle times, the authors refer to queuing theory, which provides several approaches:

Even out the arrival of work

Minimize the number of things in process

Minimize the size of things in process

Establish a regular cadence

Limit work to capacity

Use pull scheduling

page 103

In the chapter „People“, there is a lot of reference to William Edwards Deming, a pioneer of quality management. Its an iron of history, that this American actually was teaching the fundamentals of what leater became Lean in post-war Japan, while he was „discovered“ only in the 1980s by the US (industrial) public. Deming formulated a what he called „System of Profound Knowledge“:

  1. Appreciation of a System: A business is a system. Action in one part of the system will have effects in the other parts. We often call these “unintended consequences.” By learning about systems we can better avoid these unintended consequences and optimize the whole system.
  2. Knowledge of Variation: One goal of quality is to reduce variation. Managers who do not understand variation frequently increase variation by their actions. Critical to this is understanding the two types of variation — Common cause which is variation from the system and Special cause which variation from outside the system
  3. Theory of Knowledge: There is no knowledge without theory. Understanding the difference between theory and experience prevents shallow change. Theory requires prediction, not just explanation. While you can never prove that a theory is right, there must exist the possibility of proving it wrong by testing its predictions.
  4. Understanding of Psychology: To understand the interaction between work systems and people, leaders must seek to answer questions such as: How do people learn? How do people relate to change? What motivates people?
https://medium.com/10x-curiosity/system-of-profound-knowledge-ce8cd368ca62

When pursuing change and transformation, it is very important to take the staff on board. This is easier said than done, because the employees have a very fine sense. They realize very quickly, if for example a certain change in mindset is requested from them, but not exercised by their supervisors. In engineering projects, the demands and expectations of decision makers are often antagonistic to their communicated strategies and visions. Just consider if in your organization „quality“ is an essential part of your long-term goals, and totally overriden by daily task force death marches.

The challenge to achieve quality is handled in another dedicated chapter. The authors point out the importance of „superb, detailed discipline“ to achieve high quality. Here come the famous „5 S’s“ into play. The book’s authors transfer them also to the software space:

Sort (Seiri): Sort through the stuff on the team workstations and servers, and find the old versions of software and old files and reports that will never be used any more. Back them up if you must, then delete them.

Systematize (Seiton): Desktop layouts and file structures are important. They should be crafted so that things are logically organized and easy to find. Any workspace that is used by more than one person should conform to a common team layout so people can find what they need every place they log in.

Shine (Seiso): Whew, that was a lot of work. Time to throw out the pop cans and coffee cups, clean the fingerprints off the monitor screens, and pick up all that paper. Clean up the whiteboards after taking pictures of the important designs that are sketched there.

Standardize (Seiketsu): Put some automation and standards in place to make sure that every workstation always has the latest version of the tools, backups occur regularly, and miscellaneous junk doesn’t accumulate.

Sustain (Shitsuke): Now you just have to keep up the discipline.

page 191

I really enjoyed reading this book and can absolutely recommend reading it. It contains a lot of gems, and is probably one of those book you want to read every other year again to re-discover aspects and connect them to new experience.

Kategorien
Allgemein

Technology Radar #29: Automotive SW perspective

As written before, I really like the regular updates provided by Thoughtworks in their Technology Radar. My focus is on the applicability of techniques, tools, platforms and languages for automotive software, with a further focus on embedded in-car software. Hence, I am ignoring pure web-development and machine learning/data analytics stuff which usually makes a huge portion of the whole report. Recently, its volume 29 has been published. Let’s have a look!

In the techniques sector, the topic lightweight approach to RFCs has made it to the adopt area, meaning there is a strong recommendation to apply it. During my time at MBition, Berlin, I became exposed to a talk by Paul Adams on YouTube via my colleague Johan Thelin, which Paul later also gave during an all-hands event of our project:

Hence, the RFC thing very well resonates with me. It has been my style of creating documents about strategies, concepts, plans and very early requesting feedback from peers to check if the general direction is correct, and to finalize it later. Much what software engineers are used to do in Pull Requests, such scheme can and should be applied to more areas in a systematic manner. Architecture is one obvious area, but it can also be applied in many other areas. Confluence and similar collaboration platforms offer great inline commenting capabilities to discuss about any controversial aspects of a document and sort them out.

2.5 years ago I wrote about Dependency updates with Renovate. In the blip automatic merging of dependency update PRs the authors argue in favor of automatic merging of bot-generated dependency-updates. What can I say, makes total sense. Till today I have manually merged the pull requests created by the bot, but now I just let it automatically do that – of course only after a successful pipeline run. With renovate its as simple as adding "automerge": true to the renovate.json in each repo.

In tracking health over debt the authors describe a notion to focus more on the health of a sw system than tracking its (technical) debt. Its a worthwile approach, since focussing on debt means tracking an often ever-growing list. In my experience, some debt often gets obsolete, and some debt which was fiercely discussed when it was „implemented“ later is turning out significantly worse or better. Instead, tracking the health of the system as primary measure where to act at any time may yield better results in the overall long game.

In the tools sector, Ruff is recommended as a successor to the famous Python linter Flake8. Implemented in Rust, it seems to offer superior performance while still providing similar rule coverage:

A quite untypical entry (at least to my knowledge) is the mention of DevEx 360, a survey tool focussed on identifying impediments and potential improvements among the dev team.

Our platform engineering teams have used DX DevEx 360 successfully to understand developer sentiment and identify friction points to inform the platform roadmap. Unlike similar tools, with DX DevEx 360 we’ve received a response rate of 90% or above, often with detailed comments from developers on problems and ideas for improvements. We also appreciate that the tool makes results transparent to engineers in the company instead of just managers and that it enables team-by-team breakdown to enable continuous improvement for each team’s context.

https://www.thoughtworks.com/de-de/radar/tools/summary/dx-devex-360

This was it already for „my“ scope of the tech radar. This time around, the tech radar contained a looot of new entries and updates in the AI area, around GPT and LLMs. Certainly interesting, but nothing I have much experience and applications (yet).

Kategorien
Book Culture

Thoughts on „Lean Software Development in Action“

Reading and summarizing books on lean software development, so you dont have to. Part 2 (see Part 1).

“Lean Software Development in Action” written by Andrea Janes and Giancarlo Succi and published 2014 at Springer. The authors are scientists at the University of Bolzano, and the book clearly has a more scientific approach than the last one.

As the last – and probably every book – on the matter of lean, agile and software engineering this book starts with an introduction on each of those aspects. Again, I will not reiterate on what lean and agile are, but focus on interesting observations and perspectives exposing new angles on „known stuff“.

The first noteworthy piece is about „tame and wicked projects“. This section is referring to work by Rittel and Webber who came up with a discution between tame and wicked problems (see also). Poppendieck and Poppendieck extended this to projects, and in this book they are described and identified by following ten points:

  1. Wicked projects cannot provide a definitive, analytical formulation of the problem they target. Formulating the project and the solution is essentially the same task. Each time you attempt to create a solution, you get a new, hopefully better, understanding of the project.
  2. Wicked projects have no a stopping rule telling when the problem they target has been solved. Since you cannot define the problem, it is almost impossible to tell when it has been resolved. The problem-solving process proceeds iteratively and ends when resources are depleted and/or stakeholders lose interest in a further refinement of the currently proposed solution.
  3. Solutions to problems in wicked projects are not true or false, but good or bad. Since there are no unambiguous criteria for deciding if the project is resolved, getting all stakeholders to agree that a resolution is “good enough” can be a challenge.
  4. There is no immediate or ultimate test of a solution to the targeted problem in a wicked project. Solutions to such projects generate waves of consequences, and it is impossible to know how these waves will eventually play out.
  5. Each solution to the problem targeted by a wicked project has irreversible consequences. Care must be placed in managing assumed solutions. Once the website is published or the new customer service package goes live, you cannot take back what was online or revert to the former customer database.
  6. Wicked projects do not have a well-described, widely accepted set of potential solutions. The various stakeholders may have differing views of what are acceptable solutions. It is a matter of judgment as to when enough potential solutions have emerged and which should be pursued.
  7. Each wicked project is essentially unique. There are no well-defined “classes” of solutions that can be applied to a specific case. It is not easy to find analogous projects, previously solved and well documented, so that their solution could be duplicated.
  8. The problem targeted by a wicked project can be considered a symptom of another problem. A wicked project deals with a set of interlocking issues and constraints that change over time, embedded in a dynamic and evolving context.
  9. The causes of a problem targeted by a wicked project can be explained in several ways. There are several stakeholders who have various and changing ideas about what is the project, its nature, its causes, and the associated solution.
  10. The project must not go wrong. Mistake is not an option here. Despite the inability to express the project solution analytically, it is not allowed to fail the project.
page 9

These are some interesting observations which resonate with my experience. One might say „many of those points do not apply to our project as we have a quite clear understanding of our product delivery (like an ECU for a car, providing a certain set of functions/features for the customers)“. However I think its not that simple. Many projects of non-trivial complexity I have been involved in do not only have the goal of releasing a product to the market, but there are other, interlinked objectives which give the project in total a semi- or non-defined goal. Besides delivering a good, innnovative product these objectives may consist of e.g. financial goals (better return on investment), efficiency gains, usage of new technologies and approaches and in- or outsourcing of activities. While project I know typically start with those defined on a high-level and people are onboarded or recruited referring to those motivating goals, as soon as the project enters the death march the project’s goals are gradually getting more fuzzy, unbalanced and volatile. I don’t know silver bullets to such situations (yet), but the notion of wicked projects resonates with those and other observations. However, isn’t awareness the first step to improvement?

[…] the quality of the final product is seen as a result of the process pro-
ducing them. This assumption creates a high attention for a high-quality production process according to the credo “prevention is better than healing”

page 41

This lean wisdom sounds trivial, however, I have never seen it realized. I have yet to understand why so many managers ignore the efficiency and effectiveness gains by a proper process and instead decide to continously apply brute force which cost them more money, time, energy, motivation and subsequently of course quality. And, of course, by a „proper process“ I dont mean an perfect considers-everything-process which is both unreachable and undesirable.

„The role of standardization“, page 43

I like this illustration showing standardization as a wheel chock for the plan-do-study/check-act cycle. Similar to the paragraph above it strikes me how many projects and managers re-invent the wheel (haha) and then start the whole process from the bottom. Of course no one wants and says this, but this is what often happens.

The result is a development approach in which requirements are not “refined down to an implementation,” i.e., taken as the starting point to develop an implementation that represents those requirements, but where the business objectives are mapped to the capabilities of the technical platform to “equally consider and adjust business goals and technical aspects to come to an optimal solution corresponding to the current situation

page 63

This is another insightful depiction. It provided me a new perspective on requirements, as they help to close the gap between business goals and technical capabilities. This can be a good approach helping in situations in which a project is lacking a good „feeling“ on the right amount of requirements, between over-specification and under-specification.

page 80

Referring to studies conducted by Herzberg this shows how motivators and hygiene factors influence the motivation of the staff, and how agile methods very well support/complement those.

Later the authors come to write about the „dark side of agile“, on which they also published a paper. As observed by others before, agile statements can be easily twisted and thwarted in good or bad faith to yield an abomination which leads to the opposite and extreme positions of the original intention. Citing Rakitin’s old paper the agile manifesto can be translated as

  • Individuals and interactions over processes and tools: “Talking to people instead of using a process gives us the freedom to do whatever we want.”
  • Working software over comprehensive documentation: “We want to spend all our time coding. Remember, real programmers don’t write documentation.”
  • Customer collaboration over contract negotiation: “Haggling over the details is merely a distraction from the real work of coding. We’ll work out the details once we deliver something.”
  • Responding to change over following a plan: “Following a plan implies we have to think about the problem and how we might actually solve it. Why would we want to do that when we could be coding?”
page 111

The above is one way to twist the agile manifesto in favor of what the authors a few paragraphs later call a „cowboy coder“. This reminds of the „Programming, Motherfucker“ webpage (thanks, Kris). While such sentiment exists, I cant often blame the engineers on mocking the agile manifesto and similar approaches that way. Very often, such engineer’s reactions are preceeded by even more unfaithful perversions pushed by all sorts of management. Just to bring one example to the table: Who has not witnessed a manager throwing new (vague) requirements at the development team every other week claiming this is what agile is about and, of course, everyone has to be faster in reacting. Because agile is faster. I could go on.

In chapter 6 the authors start to synthesise and bring lean and software development together. They start with citing the seminal book of Poppendieck and Poppendieck „Implementing Lean Software Development From Concept to Cash“. I couldnt read it yet as it wasnt available in print when I tried to get a hold of it. So they provide 7 principles for lean software development:

  1. Eliminate waste;
  2. Build quality—we used the terms “autonomation” and “standardization”;
  3. Create knowledge;
  4. Defer commitment—we used the term “just-in-time”;
  5. Deliver fast—get frequent feedback from the customer and increase learning through frequent deployments;
  6. Respect people—we used the term “worker involvement”;
  7. Optimize the whole—we used the term “constant improvement.”
page 131

Nowadays, those principles may sound obvious. However, the Poppendieck book was published in 2006 and I think at that time many if not all of those principles where both not clear nor any best practices and tooling was available back then to realize them.

In a break-out box a comparison between lean and agile is given

  • Agile Methods aim to achieve Agility, i.e., the ability to adapt to the needs of the stakeholders.
  • Lean production aims to achieve efficiency, i.e., the ability to produce what the stakeholders need with the least amount of resources possible.
page 144

After this section which gives some more references to earlier work the book enters its less interesting but extensive part. Janes and Succi present three methods which shall support lean software development. They all have their merits, but I have to admit I dont catch fire for any of them.

Let me sketch those methods in short: First they introduce the „Goal Question Metric (plus)“, short GQM+. GQM+ is based on GQM, which is a methodology to derive crisp business goals. While I find some of the leading questions worthwhile, the overall concept strikes me as overly complex and hard to grasp.

After this, the authors present the „Experience Factory“. This is essentially an extension of the classic plan-do-study/check-act cycle with additional steps and a „sub-cycle“. Its a semi-interesting read, but doesnt convince me in its current form.

Finally, the concept of „Non-invasive Measurement“ is laid out. The goal of this approach is to collect data without distracting the engineers. While such non-invasiveness is indeed desirable, the proposal seems overly complex to me. I mean, there are so many ways of analyzing process flows, code quality, efficiency, etc. Why do the authors describe a database scheme for a concrete solution.

All-in-all the book „Lean Software Development in Action“ didn’t convince me. Its best part is where lean and agile are described, and the book offers a few interesting new perspectives on them to an already somewhat informed reader. Those aspects I have mostly covered above. The part where the authors bring in their ideas for methodologies which augment existing known approaches is rather weak, probably because its about academic ideas with little (not none!) exposure to real project life.

Kategorien
Book Culture

Thoughts on „Lean-Agile Software Development“

Reading and summarizing books on lean software development, so you dont have to. Part 1.

Besides agile philosophy, practices, processes & methods, lean becomes an increasingly recognized topic around software development. At least I can say that about my peer group. After an initial training on the matter in which I learned about the „general lean“ practices from the industrial production area, I had a lot of questions and doubts about its applicability in software development. Of course there are obvious connections and transfers which one could try, but I was wondering about existing experience, studies, research and best practices. So I was checking out the available books and found three. Two of them I already read, and today I want to start with the first (which I actually read second). Please note: I will not provide introductions and details on neither Lean nor Agile, as for such there is a myriad of online resources available, and I assume my readers know at least the agile part very well. Also, as usual in my book reviews, I am less focused on how well-written a book is. My focus is on new thoughts, inspiring ideas, surprising perspectives and generally speaking everything which deserves an application in my professional life (and the ones around me).

„Lean-Agile Software Development – Achieveing Enterprise Agility“ written by Alan Shalloway, Guy Beaver and James R. Trott was published in 2010, which means that in the fast-paced software industry its already quite old. For this book this was not a downside, as I could compare their takes against current state-of-the-art.

In the foreword Alan Shalloway makes an interesting observation:

Too long, this industry has suffered from a seemingly endless swing of
the pendulum from no process to too much process and then back to no process: from heavyweight methods focused on enterprise control to disciplined teams focused on the project at hand.

page xviii

I can confirm this from various scales. On a grand scale this has been true when you look at the software development history, starting decades ago. Enterprise processes, which followed the wild-west of the early days of computing, were replaced by agile practices. Even more, even within an organization down to project-level and individuals, the continued conflict about the „right amount of process“ is probably the biggest philosophical debate around in software development. Shalloway claims that lean principles can „guide us in this“ and „provides the way“. Lets see.

On page xxxviii the authors summarize „core beliefs of Lean“, preceeded by core beliefs of Agile and Waterfall. As all of those are not taken from „canonical“ sources, let me share the lean ones here, as they are a first cood summary:

Even when applied to software development, Lean is not limited to software development teams alone. On page 7 a tables lines out the contributions from all parties:

This is easier said than done. In many organizations both business and management are focused on pushing and tracking the delivery team, but spend too less time on their contributions. Another thing noteworthy here is the notion of the „delivery team“. This is not a team supporting, testing, integrating and generally taking care of delivery, no this is actually the software development team. Hence, this seems a synonym to more widely used terms like „feature teams“. I like the term delivery team, and could think about combinding both in „feature and delivery team“. Each term focuses on one aspect, the former more about the product, the latter more about the activity. In modern software development, I think its crucial to combine both in one team. Diminishing on of both aspects will inevitably lead to a suboptimal efficiency because essential parts are outsourced to other teams.

Lean principles suggest focusing on shortening time-to-market by removing delays in the development process; using JIT methods to do this is more important than keeping everyone busy

page 8

This is very valuable statement. Too often I see engineers getting dragged into task forces just because in the moment they are not overloaded. As a consequence, this leads to a culture in which everyone wants to be perceived or actually be busy all of the time. Continuous busy-ness is not sustainable and leads to growing organizational and technical debt. The cited statement instead clarifies that a lean and efficient process doesnt correspond to a process in which everyone is busy all of the time. Essentially we are talking about different dimensions.

Eliminating waste is the primary guideline for the Lean practitioner. Waste is code that is more complex than it needs to be. Waste occurs when defects are created. Waste is non-value-added effort required to create a product. Wherever there is waste, the Lean practitioner looks to the system to see how to eliminate it because it is likely that an error will continue to repeat itself, in one form or another, until we fix the system that contributed to

page 10

While reading this, it comes to my mind that everything which is not automated which can be automated is also a waste. Manual execution is inherently more error-prone in any software process.

Developers tend to take one of two approaches when forced to handle some design issue on which they are unclear. One approach is to do the simplest thing possible without doing anything to handle future requirements. The other is to anticipate what may happen and build hooks into the system for those possibilities. Both of these approaches have different challenges. The first results in code that is hard to change. […] The second results in code that is more complex than necessary. […]

An alternative approach to both of these is called “Emergent Design.” Emergent Design in software incorporates three disciplines:

  • Using the thought process of design patterns to create application architectures that are resilient and flexible
  • Limiting the implementation of design patterns to only those features that are current
  • Writing automated acceptance- and unit-tests before writing code, both to improve the thought process and to create a test harness

Using design patterns makes the code easy to change. Limiting writing to what you currently need keeps code less complex. Automated testing both improves the design and makes it safe to change. These features of emergent design, taken together, allow you to defer the commitment of a particular implementation until you understand what you actually need to do.

Page 11f

Conflicts around the aforementioned two approaches are indeed quite common. Both sides typically are able to throw business needs into the ring (pragmatism vs. sustainability). Even more, I often observe conflicted parties which did take the opposite position in the last conflict. Hence, emergent design sounds like a promising middle ground. I already have ideas in which conflicts I may bring it to the table.

Table 1.2 lists a good transfer of the industrial production costs and risks to the software world, something my first training on lean was missing out on:

On assigning persons to multiple projects at the same time, the authors cite an interesting study by Aral, Brynjolfsson and Van Alstyne. This study showed that the overall productivity of on person is reduced by 20% for the second and third parallel project, each. This is huge, also considering that often the better engineers are pulled/pushed into multiple projects/teams to rescue them. As a result, the best engineer’s capacity is reduced and thinned.

In the chapter about „Going beyond Scrum“, there is a good summary of Misunderstandings, Inaccurate Beliefs, and Limitations of Scrum:

Misunderstandings commonly held by new Scrum practitioners

  • There is no planning before starting your first Sprint.
  • There is no documentation in Scrum.
  • There is no architecture in Scrum.

Scrum beliefs we think are incorrect

  • Scrum succeeds largely because the people doing the work define how to do the work.
  • Teams need to be protected from management.
  • The product owner is the “one wring-able neck” for what the product should be.
  • When deciding what to build, start with stories: Release planning is a process of selecting stories to include in your release.
  • Teams should be comprised of generalists.
  • Inspect-and-adapt is sufficient.

Limitations of Scrum that must be transcended

  • Self-organizing teams, alone, will improve their processes beyond the team.
  • Every sprint needs to deliver value to the customer.
  • Never plan beyond the current sprint.
  • You can use Scrum-of-Scrums to coordinate interrelated teams working on different products.
  • You can use Scrum without automated acceptance testing or up-front unit tests
page 84

I will not comment on each point. The first two sections I would confirm entirely. The last section is pointing at some „missing“ aspects in Scrum, but I think just because e.g. test-driven development is missing, its not a limitation. Scrum is not claiming to be describe every aspect of software development.

In general: tables. This book really contains some nice side-by-side comparisons in tabular form. Table 5.1 compares „Scrum and Lean Perspectives“:

The book is also strong in naming typical anti-patterns in agile execution, especially when those anti-patterns clash with lean mindset.

Some common anti-patterns for Scrum teams are

  • Stories are not completed in an iteration.
  • Stories are too big.
  • Stories are not really prioritized.
  • Teams work on too many things at once.
  • Acceptance tests are not written before coding starts.
  • Quality Assurance/Testing is far behind the developers.

Here are questions we always try to use.

  • Does the team’s workload exceed its capacity?
  • When was the last time you checked your actual work process against the standard process?
  • When was the last time you changed the standard process?
  • Where are the delays in your process?
  • Is all of that WIP necessary?
  • How are you managing your WIP?
  • Are developers and testers in sync?
  • Does the storyboard really help the team keep to its work-flow?
  • Are resources properly associated with the open stories?
  • How much will limited resources affect the team’s work?
  • What resource constraints are you experiencing?
  • Can these constraints be resolved with cross-training or are they something to live with?
  • Does the storyboard reflect constraints and help the team manage them?
  • What needs to be more visible to management?
  • How will you manage your dependencies
page 95f

The authors clearly are not satisfied with the amount of guidance provided by and the reality around Scrum. „Going beyond Scrum“, they present their own extended Scrum called „Scrum#“ in two pages. They also introduce Kanban as a simpler framework. A key concept already mentioned in the last citation, even more relevant for Kanban, are „work in progress limits“. WIP limits was a known concept to me since some years, learned from former colleagues. The relationship to lean, however, was new to me and it makes total sense. Focus is soooo important, it cant be overrated. In my own experience it would say around 50% of all issues in software projects originate from lack of focus and too many things going on in parallel. Finally in its comparison of process frameworks, the authors do not forget about Extreme Programming.

Before you write a line of code, set up the following:

  • The product
  • The team
  • The environment
  • The architecture
page 109

This sounds simple, still I never experienced a software project in which more than one of those points was clear to a basic degree. Too often, organisations spawn projects with a very fuzzy project idea, an undefined team, unknown environment and a notion of „architecture will be clarified along the way“. The book goes on to provide guidance on how to set up each points before the first development iterations are started. On page 140 the authors present a template for how to draft a product vision statement.

The book also spends a chapter on „The Role of Quality Assurance in Lean-Agile Software Development“. It has, in essence, one key recommendation which is Test Driven Development (TDD). They claim „The role of testers must be one of preventing defects, not finding them“. While TDD has its merits, I think this statement is too simple. It is a bit far from reality to expect testers to have tests ready before every implementation, especially in projects in which even technological basics are not remotely clear. On the other hand, I am not saying TDD is not recommended wherever it can be applied.

If the customer cannot or will not confirm that you have delivered what they want, you should simply state that you believe that the customer does not value the feature; that it is not a high priority. If it were valuable, they would make the effort to specify the tests. Moreover, you should tell your management team that you recommend not building the functionality. If you are required to build it anyway, go ahead, but know that it could well turn out to be a waste of time.

page 164

This is quite some radical take, and it its a bit hard to match to a typical setup in which customers typically do not care about the test cases. However, I can imagine that pushing for a project setup in which tests have such a crucial position that both customers and developer give them utmost priority can nothing but benefit the efficiency of the resulting project. The authors recommend to always keep the question „How will I know I’ve done that?“ in mind, which according to them is an ultimate tool tom avoid waste.

After focusing on the team-scale, the book goes on to widen the scope to full enterprises. Also here they provide some anti-patterns (excerpts):

  • Teams are not well formed.
  • Large batches of unprioritized requirements are pushed through the organization.
  • There is no mechanism to limit work to resource capacity.
  • Program managers and business sponsors compete for resources rather than working together to maximize the return on them.
  • Automated acceptance testing is not being done. Test-driven development also is not being done. Testing is initiated too late in the development cycle.
  • Code quality is left up to programmers’ personal beliefs.
  • Finding and removing the root causes of problems is not pursued aggressively. Bugs are tolerated as a way of life in the software world. In fact, many organizations utilize bug tracking as status for release readiness.
  • Continuous process improvement is not practiced or valued. Most companies are so busy trying to fix the latest crisis that there is no time to focus on process improvement to avoid causing the next one.
Page 171f

This is a good list, of course its rather examples than extensive or complete. The second item from the top is re-iterated on page 182 when the authors state „For example, it is common for management to track the number of unfixed bugs. It seems like natural approach to assess how a team is doing. Lean-Agile thinking uses a different approach: Instead of worrying about fixing the bugs, we should concern ourselves with what is causing them.“

In my opinion, all or most of the above points originate from a lack of discipline in the management team, leading to evasive activities with the above symptoms. For examples it is much simpler to track bug lists than to solve root causes in the organization. It is mentally simpler to run from fire to fire than to reflect about fundamental improvements for the process. It is simple to request new reports in every escalation meeting than to use existing ones continuously to create a sustainable frame for the development team. Who is to blame? I think its a management culture based on 100% meetings giving almost no time to reflect and short-sighted office politics. It speaks for the authors, when Alan Shalloway write:

Some people are natural managers; I am not one of them. Historically, I have always micromanaged. Because I am good in a crisis (often creating and then solving them), when one occurred I would tend to jump in and tell my team how fix it. I knew that this behavior was inhibiting the team’s growth, so I tried delegating—letting the team figure out how to do things on their own—often with very poor results.

I was really abdicating via delegation. I needed to find a way to let the team figure out the solution but remain involved enough to ensure that it would be a good one. Fortunately, Lean management provides a way to do this. With visual controls, I can see the team’s process—I can see how the team is doing at any time—and I can see the team’s outcomes.

If the team gets into trouble, I can actively coach them to improve their results without telling them what to do. Lean gives me a way to become a better manager without resorting to old habits.

page 190

Already earlier, we have touched on software architecture. In a separate chapter the authors dive more into the question how to find the sweet spot between too much and too less architectural work.

Build only what you need at the moment and build it in a way that allows for it to be changed readily as you discover new issues.

page 204

and

The purpose of software design is not to build a framework within which all things can fit nicely. It is to define the relationships between the major concepts of the system so that when they change or new requirements emerge, the impact of the change s required is limited to local modifications

page 208

Almost at the end the book comes to speak about the origins of lean at Toyota. Interestingly, they are highlighting that

One of the brilliant insights at Toyota was that Lean principles are implemented differently in manufacturing than they are in product development.

This gave rise to another great example of Lean: the Toyota Product Development System, which is a better example for us in software development.

page 215

Now this is one revelation. So far all trainings were about Toyota’s Production System, not the Product Development System. This makes me wonder if our sources are the right ones.

With that let me close this review. This book was a good read, it explained the existing frameworks, showed their flaws and issues both in theory and practice, made concrete recommendations. All in all this book is a recommendation if you want to read how agile and lean can be combined in state-of-the-art software development processes.

Kategorien
Coding Culture

Technology Radar #27: Automotive SW perspective

As written before, I really like the regular updates provided by Thoughtworks in their Technology Radar. My focus is on the applicability of techniques, tools, platforms and languages for automotive software, with a further focus on embedded in-car software. Hence, I am ignoring pure web-development and machine learning/data analytics stuff which usually makes a huge portion of the whole report. Recently, its volume 27 has been published. Let’s have a look!

As usual, lets start with a dive in the „Technologies“ sector and its „Adopt“ perimeter. The first entry we can find is about „path-to-production mapping„. Its as familiar as it sounds – many of my readers will have heard about the Value Stream Mapping or similar process mapping approaches. Thoughtworks state by themselves that this one is so obvious, still they didnt cover it in their reports yet. Sometimes, the simple ideas are the powerful ones. I can confirm from my own experience that a value stream map laying out all the process steps and inefficiencies in an easy to digest manner is a good eye opener and can help to focus on the real problems instead of beating around the bush.

Something very interesting for all the operating systems and platform plans in Automotive is the notion of an „incremental developer platform„. The underlying observation that „teams shooting for too much of that platform vision too fast“ is something I can confirm from own experience. Engineers love to develop sustainable platforms, but underestimate all the efforts required for it, and management with its impatience is further undermining platform plans. Following the book Team Topologies‘ concept of a „thinnest viable platform“ makes sense here. Not shooting too far in the first step, but also treating a platform product as an incremental endeavour.

Another one which strikes me is „observability in CI/CD pipelines„. With the increasing amount and complexity of CI/CD pipelines in one project, let alone a whole organization, many operational questions arise. And operations always benefit from clear data and overview. Recently, a then-student and now colleague and me designed and realized a tool which enables CI/CD monitoring for more than one repo, but for a graph of repos. I hope we can publish/open this project anytime soon.

In the platforms sector, Backstage entered the „adopt“ perimeter. The project is actively developing forward, and indeed could be an interesting tool for building an internal sw engineering community.

Looking at the tools sector, I liked Hadolint for finding common issues in Dockerfiles.

Kategorien
Culture

Developing a CI/CD maturity model for ECU SW engineering

Continuous Integration in Automotive is like teenage sex:

everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…

(after Dan Anely)

When talking about software integration and its modern CI/CT/CD patterns, there is a lot of confusion and misunderstanding. I was already in multiple rounds in which projects showed their process maturity in shiny colors, however when observing the daily practice it was far away from the presented state. Claiming to do „CI/CD“ and really living it is a huge difference. As we seek objectivity in comparisons, we went ahead and started to develop a CI/CD maturity model, applied to ECU software engineering.

As input we took our own experience and a lot of research papers. Of course we could hardly avoid the canonical work Accelerate from Forsgren et al. But even that one cannot be applied in a straightforward manner. For us, it is important to focus on one ECU project, and not the overall carline project. The latter is of course desirable, however far out of our scope and reach.

Without further ado, here is our current state for discussion, review and comments. Feel free to share constructive criticism. At the bottom you will find our references (the ones we already incporporated and the ones we still have on our bucket list).

Unfortunately the social intranet crops the table. You can still read it by clicking the middle mouse button and dragging it.

KPIDescriptionL1: InitialAd-hocL2: ManagedManaged by humansL3: DefinedExecuted by toolsL4: Quantitatively managedEnforced by toolsL5: OptimizingOptimized by tools
Lead time for changesadapted to ECU project:Product delivery lead time as the time it takes to go from code committed to code successfully running in release customers (carline, BROP orga)One month or longerBetween one week and one monthBetween one day and one weekless than one dayless than one hour
Deployment frequencyadapted to ECU project:The frequency of code deployment to internal customers (carline, BROP orga). This can include bug fixes, improved capabilities and new features.Fewer than once every six monthsBetween  once per month and once every six monthsBetween once per week and once per monthBetween once per day and once per weekOn demand, once per day or more often
Mean Time to Restore (MTTR)How quickly can a formerly working feature/capability be restored?(not: How quickly can a bug be fixed!)More than one monthLess than one monthLess than one weekLess than one dayLess than one hour
Change Fail PercentageApplicable ?
Pipeline knowledge bus factorHow many people are able to fix broken pipelines within reasonable time and/or add new pipeline features with significant complexity (doesnt include rocket science)?≤ 1233 + people in various teamsat least one in most project teams
DevOpsAre pipelines managed by the software developers or not?Pipelines are (almost) exclusively managed by dev-team-external staffPipelines are maintained by the developers; new capabilities are added by externalsPipelines are maintained and improved by the developers in almost all casesPipelines are maintained and improved by the developers in all cases
AccessibilityCan all contributors, also from suppliers, access whatever they need for their jobs?Developers can access only fragments of the system, and additional access is hardly possible (only with escalation/new paperwork)Developers can get access to most parts of the system, and for the other parts there is some replacement (libraries)Developers dont have full access, however they can trigger a system build with their changes and get full resultsDevelopers have access to the full project SW and are able to do a full system build.
Review CultureWhat (code) review culture does the project have?Optional reviews without guidelines and unclarity who is actually supposed to give reviews; regular flamewarsMandatory reviews with defined reviewers; defined guidelines; flamewars happen seldomTools conduct essential parts of review (code style, consistency); barely any flamewars
Communication
Release Cycle DurationTime needed from last functional contribution until release is officially deliveredmonthsweeksdayshoursminutes
Delayed deliveriesHow often do delayed contributions lead to special activities during a release cycleEvery timeOften (>50%)Seldom (<25%)Rarely (<10%)Never
Release timelinessHow often are releases not coming in timeOften (>50%)Seldom (<25%)Rarely (<10%)NeverNo release timelines needed
Release scopeAre given timelines determining the planned scope for the next release (excluding catastrophic surprise events)Planned scope is mostly considered unfeasible; priority discussions are ongoing at any time80% of planned scope seems feasible; priority discussions are coming up during the implementation cycle repeatedly100% of planned scope usually is feasiblePlanned scope doesnt fill available capacity, leaving room for refactoring, technical debt reductionThere is no scope from outside the dev team defined, things are delivered when they are done (team is trusted)
Release artifact collectionHow are release artifacts gathered, combined, put togetherEverything manualSome SW manual, some SW automaticall SW automatic, documentation manualSW + documentation automatic
TracabilityConsistency between configurations, requirements, architecture, code, test cases and deliveriesNo consistency/tracability targetedIncomplete manual activitiesMostly ensured by toolsUntracable/inconsistent elements are syntactically not possibleUntracable/inconsistent elements are semantically not possible
DeliveryHow is delivery happening?Ad-hoc without appropriate tooling (mail, usb stick, network drives)Systematic with appriopriate tooling (artifactory, MIC)Automatic delivery from development to customer with manual triggerAutomatic delivery for every release candidate
Ad-hoc customer/management demoability
Feature Toggles
Test automationNo or a bit of exploratory testingTest plan is executed by human testers for all test activitiesAll test activities except for special UX testing are automatedNo contributions can avoid passing the obligatory automated testsAautomatically deriving new test cases
Test activitiesA bit of exploratory testingManual, systematic black box testingStatic code analysisUnit TestingIntegration TestingSystem TestingE2E Acceptance TestingChaos engineering, Fuzzing and Mutation testing
Virtual Targets
Quality CriteriaSW Quality management does not exist/not known/not clearSW Quality management is a side activity and first standards are „in progress“SW quality is someone’s 100% occupation and a quality standard existsSW quality is measured every daySW quality measured gaps are actively closed with highest prio
RegressionsHow often do regressions occur (regression = loss of once working functionality)Regressions happen with every releaseRegressions happen often (>50%)Regressions happen seldom (<25%) and are known before the release is deliveredRegressions happen rarely (<10%) and are known before the released software is assembled
ReportingVideowall, dashboards, KPIsNo or ever changing KPIsDefined KPIs, manually measuredSome KPIs are automatically measured, but its meaningfulness is debated, therefore doesnt play a huge role in decision-makingAutomatically measured and live displayed KPI data, used as significant input by decision-makersLive data KPIs are main source of planning and priorization
A/B Testing
Release CandidatesHow often are release candidates made availableNo release candidates existingBefore a release, release candidates are regularly identified and sharedDaily release candidatesEvery change leads to a release candidate
Master branchWhen do developers contribute to master?Irregularly, feature branches are alive for weeks or moreRegularly, feature branches exist for daysChanges are usually merged to master every day
DevOpsAre pipelines managed by the software developers or not?Pipelines are (almost) exclusively managed by dev-team-external staffPipelines are maintained by the developers; new capabilities are added by externalsPipelines are maintained and improved by the developers in almost all casesPipelines are maintained and improved by the developers in all cases
SW Configuration ManagementHow are product variants and generations managed?No systematic variant management, some variants are just ignored yetsystematic manual management of variants, variants are partially covered by branchesAll variants are managed via configurations and pipelines on same branchSoftware is reused over generations and variants
Reproducability
Customer FeedbackCustomers can be internal, too; not only end-usersCustomer not known/existing, no customer feedback availableInternal proxy customer, providing feedback which doesn’t play a big roleExternal customer is available and his/her feedback plays a relevant roleEnd-users‘ feedback is available to the developers and a relevant input for design, planning and priorizationEnd-users‘ feedback is main input for design, planning and priorization
HeartbeatIs a regular heartbeat existing?No regular heartbeat (e.g. time between reelases)Regular heartbeat, but often exceptions happenRegular heartbeat without exceptions
IT reliabilitySW Factory is regularly breaking; developers have local alternatives in place to not get blockedSW Factory is often broken..
Developer FeedbackHow quickly do developers get (latest) feedback from respective test activitiesMonths after the respective implementationweeks after the respective implementationdays after the respective implementationhours after the respective implementationminutes after the respective implementation
DevSecOps Stuff
Security Scanning
ISO26262?

Incorporated references:

Kategorien
Coding Culture

Technology Radar #26: Automotive SW perspective

As written before, I really like the regular updates provided by Thoughtworks in their Technology Radar. Since the new version #26 was released a few weeks back, I found now the time to put down my notes. My focus is on the applicability of techniques, tools, platforms and languages for automotive software, with a further focus on embedded in-car software. Hence, I am ignoring pure web-development and machine learning/data analytics stuff which usually makes a huge portion of the whole report. Let’s go!

In the techniques section in the „adopt“ circle we initially have „single team remote wall“. In a nutshell I think they mean having a dashboard showing the essential data, kpis and tasks for a remote development team. I think the trick here is the „single“ as I assume that most remote teams have dashboards, however usually multiple ones loosely coupled. In my current team, our Scrum Master has created a great Jira dashboard showing some essential data which could give hints at the team’s performance.

The second noteworthy technique is „documentation quadrants“. Referring to documentation.divio.com/ this provides a nice taxonomy of different documentation types. This is very relatable, as I very often experience a fuzzy mixture of all those types scattered in many places. Certainly this is something I will bring to my work network’s attention.

Third, we have „rethinking remote standups“. This follows a general observation that conducting remote daily standups in the same duration and content like the were recommended in former times (e.g. the typical 15 min Scrum daily) does not provide the same amount of alignment within a development team. This is not necessarily because of the the meeting itself, but because other casual sync occasions during the day are happening less in remote setups. In the radar, its recommended to try an extension to one hour, and of course the goal is to decrease the overall meeting load by this. I am thorn on this one, as I was always a fan of crisp daily meetings, avoiding random rambling on topics concerning only parts of the team. Blocking 1 hour for everyone every day sounds like an overshoot approach.

Next there is again the „software bill of materials“ topic. This is currently a huge topic in the software industry, there have been very concrete examples recently (e.g. the Log4Shell or NPM package events you probably read about). Tool support to transparently and consistently managing the used software in a bigger project is really needed. While in the web and cloud business there is a growing number of tools, in the embedded world there are only some puzzle pieces. I can currently think of some Yocto support for this, however this covers only Linux parts in usually more complex multi-os automotive ECUs.

„Transitional Architecture“ sounds like a promising thing, even though the radar’s description stays a bit vague. Luckily there is an extensive article by Thoughtworks‘ Martin Fowler on this approach. In my opinion, managing legacy software in complex setups is one of the key challenges in the whole software industry, even more so in automotive embedded software, which is characterized by the co-existence of decades old technologies with state of the art approaches. Formalizing the transition from on older architecture to a newer makes sense, as usually this transition is often not architecturally covered as extensively as a the target architecture. This leads to misunderstandings, hacky workarounds and other unwanted side effects to a sustainable development.

Going one circle to the outside, in the „assess“ perimeter, we first find CUPID. Aimed at replacing the SOLID rules with instead a set of properties of „joyful code“, it has some interesting observations and paradigms. Currently I only skipped over it, I think this deserves more time and maybe a dedicated article. However, I can recommend to check out the well written original blogpost by Dan North.

In the „hold“ perimeter we see „miscellaneous platform teams“. In contrast to „platform engineering product teams“ described earlier in the radar, this is kind of a degradation form. If a platform team fails to define a clear product goal and identify its customers, usually the scope becomes (or is) very fuzzy, leading to a unclear platform system. Hence, its strongly recommended to avoid this by achieving clarity of what actually is the scope of the team.

In the platforms sector, I could only identify one relevant blip „GitLab CI/CD“. Recently I see a lot of discouragement of using Jenkins, and of course if you use already Gitlab for its other elements (code hosting, code review, issue tracking) you may as well use it for CI/CD pipelines. For sure its better integrated in the overall Gitlab experience. However its just another vendor-specific DSL, so I wonder if there will be practical standardization on the pipeline definition soon.

Looking at the „Tools“ sector, I found the reference to the two code search tools Comby and Sourcegraph. Besides offering code search and browsing using abstract syntax tree analysis, they are also offering semi-semantical batch changes, enabling „large scale changes“. Comby is an open source tool, while sourcegraph is commercial. I think I will try at least one of them soon.

Kategorien
Coding Tinkering

Simple CI/CD for embedded devices

After some years of private tinkering with CI/CD workflows for web development, and a good load of professional exposure to embedded projects working hard to get CI/CD in a scaling, fast and reliable setup, I wanted to combine both. Earlier, I did some trial-and-error-project leveraging the great NodeMCU boards, but it was without any automated testing and no ci pipelines used. So it was time to make a step further in my private endeavors and setup a CI/CD pipeline with automated flashing and testing of new embedded code. An important requirements was that the automated testing should test the complete embedded device consisting of its hardware and software in completeness (black box test). Hence, only the „official“ outside interfaces like serial interface and physical output (LED!) should be used for automated tests. Of course, the used hardware, software and complexity in no way match what our projects‘ engineering teams handle every day, and I don’t intend to compete with the engineers. Its an exercise for myself to learn.

Without further ado, lets have a look at the setup:

We see the following: My vserver-hosted Jenkins is the same as usual. However, as local node connected to the target device (NodeMCU), I am using a Raspberry Pi Zero 2 W. I connected it as a node (formerly called slave) to my Jenkins master. The NodeMCU target is connected via a USB cable to that Raspberry Pi. The Raspi is able to build the code, flash it to the NodeMCU and run some tests written with Python. The NodeMCU target has an LED (with resistor) connected, which is controlled by the embedded software. To close the circle, the Raspi also has a BH1750 light detector board connected. The idea is: Whenever a software change is built and flashed, the Raspi can automaticall test if the LED is correctly lit, and if not, fail the test, hence the overall pipeline.

Eventually, with some vacation-breaks I made this work. Yay! There were, however, some impediments to overcome. You can find them here in case you would ever have a similar endeavor 🙂

  • To build the software on the Raspi, I had to install on it the according framework. After great experience with it on the desktop VSCode extension, I learned in the docs that PlatformIO also has a headless cli client. How to get it onto the Raspi? First I thought using a docker container would be a good choice, to have a reproducible environment. However, it was really hard to find any working, up-to-date docker for that. Hence, I finally decided to got with PlatformIO’s super-handy installer script.
  • Getting the LED to work with some simple code was not as straightfoward as I hoped after studying the docs initially. I could not make the LED light up at 50% no matter what I did to the values or wire connections. It was always at full power (which would be generally fine but not with the „product“ I had in mind, more on that in a later blog post I guess). I even used my simple osciloscope. Finally, it turned out that all the tutorials which explained this dead-simple setup had one snag: the provided value range had beed changed as a breaking change quite recently. After adapting to the new range it worked!
  • The BH1750 light sensor is connected with an I2C connection to the Raspi, which was my first own engineering exposure to I2C ever. In my first attempt I could connect it to the NodeMCU successfully, but after some travel and trying the same via the Raspi failed miserably. Again a lot of trial-and-error, until some random guys with the same issues on the interner hinted towards the pin connection on the BH1750. Indeed, it was extremely sensitive and the pins were always a bit off. I finally soldered it together and then it worked like a charm.
  • Using a Raspberry Pi Zero 2 to build the embedded software doesn’t sound very proper, and indeed a clean build takes a while and mostly blocks the complete machine for other things. The good news is that PlatformIO offers a simple out-of-the-box build cache solution, which requires not more than configuring a build cache directory, which is subsequently used. Of course, caches should always used with care and for reproducible build they should probably be turned off. Again, in my very simple setup it does its job.

So, as usual I spent most time debugging unexpected snags than on the actual concept or missing pieces. Still learned a lot.

I still have some open to-dos and followups:

  • Of course, my current „product“ is dead simple. Having the basics in place I want to extend the functionality gradually. With this minimal CI setup I can add/modify functionality without introducing regressions on the existing functionality.
  • Improve the pipeline duration. The average pipeline takes about 1 min, of which the flashing consumes the majority (30-40 secs) of time. Also there is a significant waiting time in the beginning which Jenkins does not include in the pipeline duration. I need to investigate what that is about.
  • At the moment the software is updated via USB cable. The NodeMCU framework also offers software update over the air (OTA). Leveraging that would some more flexibility in placing the device, and of course enabling Continuous Deployment on „production targets“. A first step I am considering is adding OTA as a second flash procedure, followed by another test run.
  • Exploring more options of NodeMCU’s and PlatformIO’s frameworks/tools. While setting up the above, I saw a lot of interesting things worth to investigate further.

Here is the (shortened) Jenkins log for reference:

Push event to branch master
Looking up repository jakob/NodeCI
Querying the current revision of branch master...
[...]
[Pipeline] Start of Pipeline
[Pipeline] node
Running on rpizero in /home/jakob/jenkinsnode/workspace/Jakob_NodeCI_master
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Declarative: Checkout SCM)
[Pipeline] checkout
The recommended git tool is: NONE
using credential [...]
Fetching changes from the remote Git repository
Fetching without tags
[...]
Commit message: "added acceptace test for physical led"
 > /usr/bin/git config core.sparsecheckout # timeout=10
 > /usr/bin/git checkout -f
[...]
[Gitea] Notifying branch build status: PENDING Build started...
[Gitea] Notified
[Pipeline] }
[Pipeline] // stage
[Pipeline] withEnv
[Pipeline] {
[Pipeline] stage
[Pipeline] { (build)
[Pipeline] sh
+ /home/jakob/.platformio/penv/bin/pio run
Processing nodemcuv2 (platform: espressif8266; board: nodemcuv2; framework: arduino)
--------------------------------------------------------------------------------
Verbose mode can be enabled via `-v, --verbose` option
CONFIGURATION: https://docs.platformio.org/page/boards/espressif8266/nodemcuv2.html
PLATFORM: Espressif 8266 (3.2.0) > NodeMCU 1.0 (ESP-12E Module)
HARDWARE: ESP8266 80MHz, 80KB RAM, 4MB Flash
PACKAGES: 
 - framework-arduinoespressif8266 3.30002.0 (3.0.2) 
 - tool-esptool 1.413.0 (4.13) 
 - tool-esptoolpy 1.30000.201119 (3.0.0) 
 - toolchain-xtensa 2.100300.210717 (10.3.0)
LDF: Library Dependency Finder -> https://bit.ly/configure-pio-ldf
LDF Modes: Finder ~ chain, Compatibility ~ soft
Found 36 compatible libraries
Scanning dependencies...
No dependencies
Building in release mode
Retrieving maximum program size .pio/build/nodemcuv2/firmware.elf
Checking size .pio/build/nodemcuv2/firmware.elf
Advanced Memory Usage is available via "PlatformIO Home > Project Inspect"
RAM:   [===       ]  34.3% (used 28088 bytes from 81920 bytes)
Flash: [===       ]  25.7% (used 268009 bytes from 1044464 bytes)
========================= [SUCCESS] Took 7.07 seconds =========================
[Pipeline] }
[Pipeline] // stage
[Pipeline] stage
[Pipeline] { (flash)
[Pipeline] sh
+ /home/jakob/.platformio/penv/bin/pio run --target upload
Processing nodemcuv2 (platform: espressif8266; board: nodemcuv2; framework: arduino)
--------------------------------------------------------------------------------
Verbose mode can be enabled via `-v, --verbose` option
CONFIGURATION: https://docs.platformio.org/page/boards/espressif8266/nodemcuv2.html
PLATFORM: Espressif 8266 (3.2.0) > NodeMCU 1.0 (ESP-12E Module)
HARDWARE: ESP8266 80MHz, 80KB RAM, 4MB Flash
PACKAGES: 
 - framework-arduinoespressif8266 3.30002.0 (3.0.2) 
 - tool-esptool 1.413.0 (4.13) 
 - tool-esptoolpy 1.30000.201119 (3.0.0) 
 - tool-mklittlefs 1.203.210628 (2.3) 
 - tool-mkspiffs 1.200.0 (2.0) 
 - toolchain-xtensa 2.100300.210717 (10.3.0)
LDF: Library Dependency Finder -> https://bit.ly/configure-pio-ldf
LDF Modes: Finder ~ chain, Compatibility ~ soft
Found 36 compatible libraries
Scanning dependencies...
No dependencies
Building in release mode
Retrieving maximum program size .pio/build/nodemcuv2/firmware.elf
Checking size .pio/build/nodemcuv2/firmware.elf
Advanced Memory Usage is available via "PlatformIO Home > Project Inspect"
RAM:   [===       ]  34.3% (used 28088 bytes from 81920 bytes)
Flash: [===       ]  25.7% (used 268009 bytes from 1044464 bytes)
Configuring upload protocol...
AVAILABLE: espota, esptool
CURRENT: upload_protocol = esptool
Looking for upload port...

Warning! Please install `99-platformio-udev.rules`. 
More details: https://docs.platformio.org/page/faq.html#platformio-udev-rules

Auto-detected: /dev/ttyUSB0
Uploading .pio/build/nodemcuv2/firmware.bin
esptool.py v3.0
Serial port /dev/ttyUSB0
Connecting....
Chip is ESP8266EX
Features: WiFi
Crystal is 26MHz
MAC: e0:98:06:85:d6:23
Uploading stub...
Running stub...
Stub running...
Configuring flash size...
Compressed 272160 bytes to 199933...
Writing at 0x00000000... (7 %)
Writing at 0x00004000... (15 %)
Writing at 0x00008000... (23 %)
Writing at 0x0000c000... (30 %)
Writing at 0x00010000... (38 %)
Writing at 0x00014000... (46 %)
Writing at 0x00018000... (53 %)
Writing at 0x0001c000... (61 %)
Writing at 0x00020000... (69 %)
Writing at 0x00024000... (76 %)
Writing at 0x00028000... (84 %)
Writing at 0x0002c000... (92 %)
Writing at 0x00030000... (100 %)
Wrote 272160 bytes (199933 compressed) at 0x00000000 in 18.0 seconds (effective 121.2 kbit/s)...
Hash of data verified.

Leaving...
Hard resetting via RTS pin...
========================= [SUCCESS] Took 27.74 seconds =========================
[Pipeline] }
[Pipeline] // stage
[Pipeline] stage
[Pipeline] { (test)
[Pipeline] sh
+ python test/acceptance.py -v
testHelloWorld (__main__.AcceptanceTests) ... ok
testLEDIsTurnedOn (__main__.AcceptanceTests) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.743s

OK
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
[Gitea] Notifying branch build status: SUCCESS This commit looks good
[Gitea] Notified
Finished: SUCCESS