At some point, your co-worker (peer or otherwise), or even an entire team, will massively screw up. Or you will. Perhaps you already have. If the latter is true, congratulations – you’re in a great place to make a positive difference.
I’m going to assume, perhaps barring certain egregious transgressions, that you already manage failure from a place of empathy and growth, not through shaming or punitive means. If that’s not the case, you’ll first have to look elsewhere for help to make that change.
The top players in any game will have messed up on several occasions by the time they win the championship. I’m delivering zero percent news here. People are going to make mistakes, and we need to build the handling of those mistakes into our plans, processes and controls. In an effectively managed engineering environment, there’s little reason for a mistake to be a surprise or even to be treated as much of an exception. It’s a natural condition that will arise sooner or later, and we can be ready for it with some foresight.
After stopping any bleeding with remedies you already have in place on the technical and process levels, the need to manage the people aspect of the event will quickly bubble up to the surface. Depending on the issue, this may happen before a long-term solution can be implemented.
I have my own disaster story ready for when I encounter an engineer with the recognizable look of someone who pressed the enter key a bit too soon.
It’s 1999, two days before Super Bowl XXXIII. I knew nothing about American football other than it’s big business. I still know nothing about American football but that’s not important right now. I was working at a dot-com building out ecommerce sites for clients on our proprietary platform. (The same platform also powered personal websites five years before Myspace was launched. Props to our founder.)
Our source control system consisted of us yelling filenames to each other across the room. Our QA process began and ended when an engineer got their code running. Our deployment process entailed each engineer manually uploading said code to the right directory on the production server via FTP. Well, that was best case; many times we’d just SSH into the server and start vim.
We were all hands on deck getting ready to launch a website for a client that sold football merch. As you can imagine, the Super Bowl was to them what the holiday season is to most retail businesses. We’d had extra staff on site for weeks populating the production inventory database with SKUs and uploading photos. We’d been writing lots of code under time pressure and made alterations to the database schema to match. It was crunch time and we know how that goes.
And go, it did.
As part of my development testing, I had a SQL script that I used in each iteration to clean up and re-seed the test data. It was safe and surgical up until the point I ran into some issue that drove me to take a shortcut in the interest of time. I added a DROP DATABASE statement at the top of my script and just rebuilt the entire schema each time. In other words, instead of tracking down the root problem with the surgical clean-up method, I dropped a nuke from orbit with every script execution. It allowed me to focus on the remaining feature development in hopes of delivering this whole thing before the deadline.
Problem solved. I was now fast again. It’s Friday afternoon and the Super Bowl is on Sunday. Feature complete. Ship it. I emailed (!) the SQL script to the company’s database administrator, a fantastic, trusted – and trusting – teammate of mine. I can still hear our go/no-go conversation in my head:
“Ready to go, Juan?”
I was standing on the other side of his desk when our DBA ran the script in the production environment. The satisfying sound of the big red button being pressed reached my ears.
Then, a great disturbance in the Force, as if megabytes of data cried out in terror and were suddenly silenced.
I knew what we had just done. The DBA knew that we had done something very inappropriate because it was pretty clear I was trying to figure out how to spawn a wormhole, jump into it and never be seen by human eyes again.
At this point, an impressive variety of expletives saturated the soundscape, eventually dwindling to simple repetition of “no” and then all was quiet.
Our database administrator tried to be kind and place the blame squarely on himself for executing a script without reviewing it. He may have been ultimately responsible for that database, sure, but I felt it was mostly my mistake. Looking back, the truth is that this screw-up was on both of us. It happened because we’d grown reliant on heroic behavior, developed bad habits due to time pressures and had become overly trusting of each other (in the bad way) on a tight-knit team.
Heroics don’t scale along any axis. Good habits need to be maintained through consistent practice or it all snowballs. Trust (the good kind) is knowing my teammate will check my work and push back when required, and vice versa, as a safety mechanism for the team and product.
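One way to turn that kind of trust into a mechanism is a pre-flight check that flags destructive statements before a script is run against production. What follows is a minimal sketch in Python, not a description of anything we had in 1999. The pattern list, the environment names and the `reviewed_by` parameter are all illustrative assumptions, and a line-based regex scan is a blunt tool that will miss statements split across lines.

```python
import re

# Statement patterns treated as destructive. This list is an illustrative
# assumption, not an exhaustive catalogue of dangerous SQL.
DESTRUCTIVE_PATTERNS = [
    r"\bDROP\s+(DATABASE|SCHEMA|TABLE)\b",
    r"\bTRUNCATE\s+TABLE\b",
]


def find_destructive_statements(sql_text):
    """Return (line_number, line) pairs for lines matching a destructive pattern."""
    findings = []
    for line_number, line in enumerate(sql_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("--"):
            continue  # skip full-line SQL comments
        for pattern in DESTRUCTIVE_PATTERNS:
            if re.search(pattern, stripped, flags=re.IGNORECASE):
                findings.append((line_number, stripped))
                break  # one finding per line is enough
    return findings


def preflight(sql_text, environment, reviewed_by=None):
    """Raise if a destructive script targets production without a named reviewer.

    `environment` and `reviewed_by` are hypothetical parameters for this sketch.
    """
    findings = find_destructive_statements(sql_text)
    if environment == "production" and findings and not reviewed_by:
        details = "; ".join(f"line {n}: {text}" for n, text in findings)
        raise RuntimeError(f"Refusing to run unreviewed destructive SQL ({details})")
    return findings
```

Calling `preflight(script_text, "production")` on a script that begins with a DROP DATABASE statement raises an error, while passing a reviewer's name lets it through. The point is not the regex. The point is that a second person has to put their name on the decision.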
Oh well, restore from backup, right? Not so fast. The day before, our DBA had disabled logging in the Oracle database for performance reasons. You know, getting ready for scale, 1990s-style. The unfortunate and unexpected side effect was that the primary method of rebuilding the data was no longer available to us. There was no other backup. (We did hire an infrastructure guy some time after this, having come to the realization that such a person might be good to employ.)
At this point, everyone except for the two of us had left the office for the day, what with all the development completed and such. The ghostly shells of our former rock star selves were just floating around some space that was vaguely reminiscent of our workplace. We had to call the boss and instruct him to fire us.
He didn’t. He got in his car on a Friday evening and drove back to work. In utter fear and shame, we explained what had transpired and he listened calmly. Leading by example, he aptly proved to us that what’s important isn’t eradicating mistakes (it can’t be done) or destroying a person after they mess up. What matters is what the team does after the mistake has been made, how we grow from it and how we improve the way we work together in terms of processes, controls and communication.
I hope you’ll never have to experience a 27-hour workday, but that’s what the three of us did. We managed to catch the tail end of Oracle Support US’s shift to open a high-priority ticket. We then spent eight hours on the phone with Oracle Support Australia, trying various solutions to no avail. Eventually, Earth had spun enough in space during our workday that we had to be transferred to a third continent and Oracle Support Europe. The support engineer there walked us through slicing up and piecing together some binary files in some obscure Oracle directory which somehow resulted in us getting our data back, minus some minor pieces. My memory is blurry as to the details after more than two decades, but the lesson is etched in stone.
Our client’s ecommerce site went live with less than 24 hours to spare, but it did go live. We had stuck together as a team and as human beings. As the weeks went by, the feelings of fatigue, shame and failure faded and our boss and the rest of us were able to laugh about it. All of us learned several huge lessons that day. More importantly, we learned many lessons about the period of time preceding it: the habits, processes and team dynamics that had emerged as a result of the environment and parameters around the project. We were OK. We were fine. We moved on.
This is my disaster story. I have shorter versions for the TikTok crowd. The point is, you should consider having your own story ready because someone, some day, will need to hear it.
The entire event taught me the value of caution and discipline in engineering, and I’m happy to say I haven’t blown away any production databases since. I have mental checks in place that may appear alien to those who shoot from the hip, but they’ve served me well. Equally, in the midst of disaster, my boss showed me – without ever saying a word about it – how to deal with mistakes in a manner that serves to educate, grow and improve people and teams over the long run.
He was in a great place to make a positive difference.