By Dr. Jamie Weiss, Senior Consultant, Kepner-Tregoe, Inc.

Walk through a one-day troubleshooting session with Dr. Jamie Weiss. He describes how he used the KT rational process approach to guide the logical use of knowledge, data, observation, and teamwork to resolve a serious manufacturing problem that had defied resolution for eight months.

The premise of Kepner-Tregoe's critical thinking processes is that spending a little of your time can save you a lot of your money. One of the more spectacular instances of this recently occurred at a large pharmaceutical company. This example has been blinded to protect client confidentiality. Some details have been changed or obscured to protect the client's identity.

The call that initially came in to our office sounded desperate:

"We have a serious problem that's costing us a ton of money and, frankly, we're at our wits' end. A number of us took one of your three-day Introduction to Problem Solving and Decision Making workshops, and we have tried to apply what we learned to the current problem, but without success. I don't know if it's a lack of depth in our understanding of the process, or maybe just the politics of the situation. It's gotten to the point where you can't even ask certain questions without getting into trouble. Everyone has a pet theory – facts be damned – and we can't seem to get past it. I remembered that our instructor said that in cases like this KT can send in a consultant to facilitate a problem solving session themselves – could we have someone come here for two or three days to work with us on this? We'll gladly pay whatever it costs – it can't be a tenth of what this has cost us so far."

A few days later I was on-site, sitting in front of a group of 12 people who didn't look happy to be there. Arms were crossed, faces were scowling, bodies were slumped down in their seats. This is understandable – no one likes to admit that they have come up short in a situation like this.

I asked them to briefly describe the issue they were dealing with. They made vaccines of different types, and all of them required a "buffer" of some sort. It turns out that most vaccines are only a few percent vaccine, with the rest being some neutral buffer or solvent. They made this liquid buffer in three large tanks – let's call them Tank A, Tank B, and Tank C. After each tank, and before the next, was a filter – Filter 1, Filter 2, and Filter 3. The problem was that they had found rust in Filter 2, and since rust was not supposed to be part of the buffer liquid, each batch with rust in it had to be discarded, at considerable cost.

But it was worse than that. The simple fact was: no vaccine buffer, no vaccine. As a result, they had been unable to sell any of one specific vaccine for over eight months, at a cost of millions of dollars, and were quickly approaching a back-order situation. The company was understandably concerned about the financial implications, but, to their credit, they were also concerned about the impact on patients and physicians.

I asked what attempts they had made to solve the problem. This seems like an innocuous question, but usually, by the time someone calls us for help, they have put layers and layers of short-term fixes on top of the problem. One of our first tasks is to strip away all these layers of band-aids so we can get to the original state when the symptoms first began to appear. Listening to them, it sure sounded like they had tried everything.

* They had changed Filter 2 after every failed batch, and had gotten rust every time after.

* They had swapped Filter 1 for 2 and 2 for 3 and 3 for 1, rotating them, and had gotten rust in the former Filter 1 when it was in Filter 2's position. In fact, whichever filter they put in Filter 2's position showed up with rust in it.

* They had changed the model of Filter 2 – they still got rust.

* They had changed the brand of all three filters – they still got rust.

* They had cleaned and re-coated all of the tanks – they still got rust.

* They were even fiddling with the SOP – turning the temperature to the maximum or the minimum of the specified range; moving the hold time to the maximum or minimum of its range; varying the pressure within specifications. In all cases – rust, rust, and rust.

They were so frustrated they might have considered just dropping the product from their line, except they knew it was a crucial vaccine for a serious condition with no ready substitutes, and they felt an obligation to keep making it. They felt they were letting patients down.

At that point, someone entered the room pushing a hand-truck piled with six overflowing cartons of paperwork. "Can I help you?" I asked. "What's all this?"

"Oh," he answered, "these are all the files we figured we would need to help find the cause." If we were going to need all that data, two or three days wouldn't suffice – we would need two or three months. But I just smiled and pointed toward the corner.

Before we dove into the KT Problem Analysis process, I tried to set some expectations. "We're going to spend pretty much all of today just specifying the symptoms of the problem, filling out the problem space, as it were. Before we begin, let me warn you of two things. First, there are going to be questions for which you don't have the answers," I said, glancing over at the pile of boxes teetering in the corner. "That's no criticism – it's just the way it is. And second, it's going to be a little frustrating. We're not going to discuss possible causes at all today, just symptoms. I know you already have some pet causes in mind, and I can tell you're all bright enough to generate a handful of them on the spot, whenever we're ready. But we won't be ready until we describe the symptoms of what's happening."

I looked around; they were not happy. "Think of it this way," I said. "If you were a patient and I were a physician and you came in with some physical malady, you wouldn't want me to start prescribing medication or physical therapy or surgery before I took the time to hear what the malady was, right? Does it only happen after you eat? Is the pain in the lower right quadrant or along the mid-line? Is it a 2 on the old 1-to-10 scale, or is it a 9?"

I saw a few grudging nods. "I can see you're disappointed, but think of it this way: You've been coming up with causes for, what, eight months now? One more day of delay isn't so bad, relatively speaking. And by the end of today, or tomorrow at the latest, depending on how much missing data there is, we should be able to eliminate most of those pet causes and narrow down to the root cause."

They calmed down and we began to dive down into the data. The way the KT process works, you specify the What – Where – When – and Extent of the problem. What object has the defect or deviation? What deviation does it have? Where are the defective objects? Where on the object is the defect? When was the defect first observed? When was it observed since then? When in the process or life cycle was it seen to be defective? How many objects have the problem? How many problems are there? What trends do you see?

But as we specify this "IS" about the problem, we also specify what we call the "IS NOT". What other similar things could have the same problem but do not? Where did it not show up where it might have? When did it not show up? How much of it could there be but is not? If you think about it, any hypothesis that is the true cause has to explain when and where the problem shows up, and also when and where it doesn't.

Our first pass looked like this: [NOTE: Again, some details have been changed or obscured.]

In our experience, a good 60 to 70 percent of the time, a pattern will emerge from this IS / IS NOT Specification, and a cause can quickly be formulated and tested. This did not fall into that scenario. Nothing jumped out at us, even when we drilled down deeper into the IS / IS NOT to specify which batches, which filters, which maintenance folks had done the installations, or which lab analysts had done the testing. At this point we broke for the day and agreed to pick it back up the next morning.

In cases like this, we take two additional steps. We ask, for each IS / IS NOT pair, "What is distinctive, odd, unusual, special, different, or unique about each IS compared to its corresponding IS NOT?" Where we find a distinction we ask, "What has changed on, around, or about this distinction?" The reason is simple. Causes come from Distinctions and Changes. The IS / IS NOT can weed out the ones not related to the symptoms of the problem, and can later also be used as a test of any hypothesis, since any cause, to be valid, has to explain why it happened here and not there, at this time and not that time, to this degree and not to that degree. Any cause that cannot explain the basic facts of the case cannot logically be the cause. The final step is to confirm the most likely cause – through checking assumptions, observation, testing, or installing a fix and seeing if the problem goes away.
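For readers who think in code, the idea of testing a hypothesis against the IS / IS NOT can be sketched roughly as follows. This is only an illustration, not KT's own tooling: the `Observation` class, the unit names, and the two hypotheses are invented for the example, and it assumes that the filters in positions 1 and 3 stayed rust-free both before and after the rotation described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    filter_unit: str  # which physical filter was installed (hypothetical names)
    position: int     # which slot it occupied: 1, 2, or 3
    rust: bool        # True = IS (rust found), False = IS NOT


# Before and after rotating the filters, as described in the account above.
# Assumption: positions 1 and 3 never showed rust.
observations = [
    Observation("unit_a", 1, False),
    Observation("unit_b", 2, True),
    Observation("unit_c", 3, False),
    # after rotation: unit_a moves to slot 2, unit_b to slot 3, unit_c to slot 1
    Observation("unit_a", 2, True),
    Observation("unit_b", 3, False),
    Observation("unit_c", 1, False),
]


def explains(hypothesis, facts):
    """A hypothesis survives only if it predicts every IS and every IS NOT."""
    return all(hypothesis(o) == o.rust for o in facts)


# Each hypothesis is a function that predicts whether rust should appear.
hypotheses = {
    "unit_b is defective": lambda o: o.filter_unit == "unit_b",
    "something about position 2": lambda o: o.position == 2,
}

surviving = [name for name, h in hypotheses.items() if explains(h, observations)]
print(surviving)
```

The "defective unit" theory fails because it cannot explain rust in a different unit once that unit sat in the second position; only the theory tied to the position accounts for both the IS and the IS NOT.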

In this case, no distinctions were immediately obvious—the filters were all the same model of the same brand, installed at the same time by the same maintenance person. When I asked about the filters and the tanks, the response I got was, "They're functionally identical." But something nagged at me, poking at the edge of my awareness. Maybe it was their tone of voice, maybe their body language. During a break, I asked, "Would it be possible to go and actually see these tanks and filters?" They were glad to take a break, so we headed down to the area where this all occurred.

After gowning up and entering the air-lock, four of us walked in. Looking at the whole set-up, nothing was immediately obvious, but then I looked closer.

On closer inspection, Tank B was further from Tank C than it was from Tank A. The reason was soon obvious – a large steel I-beam separated Tanks B and C, so they had added some extra tubing between Tank B and Filter 2. We moved in for a closer look. This tubing was different from the other tubing. It was plastic, not steel, and it looked yellowed, old, worn. Hmmm . . .

As we ungowned and left the room, I saw one of our team members taking notes. When he saw the inquiring look on my face, he muttered, "I'm just making a note to get a sample of that plastic tubing tested, to see what kind of plastic it is." I said nothing, and he continued. "It's just that, whenever we get the rust, we also see a small amount of stearate at the same time. We never made much of it. We don't have a spec for how much stearate we will allow in the buffer – it's just never come up. And even if we did, it was always in minuscule quantities. But I wonder . . ."

"Remind me," I said. "It's been a long time since high-school chemistry class, but stearate is a plasticizer, isn't it?"

He nodded. "Yes, it is."

We now had a thread, an area of distinction, and it was time to pull on it and see what unraveled. When we got back to our temporary war-room, I tried to sum things up. "So we see a distinction between the three filters – downstream of plastic tubing versus not. And that's suggestive. But it does not explain why this started all of a sudden, eight months ago. Let me ask: How long have the tanks and filters been there? How long has that tubing been there?"

They shook their heads vigorously, side-to-side. "Since forever," one woman said. "At least as long as I've been there, and I think as long as we've made this product."

Someone else chimed in, "Yeah. It's been there since the beginning."

So we began a search for related changes that had occurred around eight months ago. In many cases, the cause is an interaction effect – there is a distinction that has always been in place, but some change is introduced which selectively affects the IS only. This finally led us into the six boxes of files in the corner. Each batch of buffer had a number of components to it, and we were looking for some change in one of those components at around the same time as the problem hit, or just before then (but not just afterward – that can't be the cause, unless you accept travelling backwards in time to be a reality). A change in suppliers, a change in formulation, a change in testing – it could be anything like that. An hour of careful digging turned it up.

One of the components had changed its manufacturing site. It had formerly been manufactured in South America, and now it was being made in Europe. And the change in sites had happened right before the first batch that went bad. This is what we call an "area of sharp contrast" – a fact that jumps out at you. After all, how many companies in the early 21st century are shifting production from the 'third world' to the 'first world'? It's quite rare; usually the shift goes in the opposite direction.
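The timing rule we applied while digging through the files (a change can only be a candidate if it happened at, or shortly before, the first bad batch) can be sketched in a few lines. Everything numeric here is a placeholder: the day offsets and the 60-day window are invented for illustration, and `candidate_changes` is a hypothetical helper, not part of any KT tool.

```python
# Each change is recorded as (description, days relative to the first bad batch).
# Negative offsets are before the problem began, positive offsets are after.
# All offsets below are hypothetical placeholders, not figures from the case.
changes = [
    ("component manufacturing site moved", -14),
    ("filter brand changed", 60),             # a fix attempt, so after onset
    ("tanks cleaned and re-coated", 90),      # a fix attempt, so after onset
    ("much older change (placeholder)", -400),
]


def candidate_changes(change_log, window_days=60):
    """Keep only changes that fall at or shortly before the onset.

    Changes made after the onset cannot have caused it, and changes made long
    before it are set aside because the process ran cleanly in between.
    """
    return [name for name, offset in change_log if -window_days <= offset <= 0]


candidates = candidate_changes(changes)
print(candidates)
```

With these placeholder offsets, only the site move survives the filter; the fix attempts drop out because they came after the symptoms began.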

We dug into the piles of paperwork one more time to look for data on this supplier, and we quickly found what we needed – the certificate of analysis (C of A), a document every supplier has to provide that proves that what they supplied is what was specified. On the bottom of the page submitted by the South American Plant, it said, in very small type, "May contain trace amounts of Cu, Zn, Na", or copper, zinc, and sodium, for the non-chemists. In the European Plant's C of A it said, "May contain trace amounts of Cu, Zn, Na, Fe".

Fe is iron.

What had happened was that the stearate in the plastic tubing had precipitated out the iron in the raw material from Europe. The same solution had gone through all three tanks and all three filters, but only the tubing before Filter 2 had the stearate in it, and hence the rust.
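The interaction effect can be checked against the facts with a short sketch. It encodes what the account states (only Filter 2 sits downstream of the plastic tubing, and only the European certificate of analysis lists Fe) and adds one assumption of my own for the illustration: that the South American material carried no iron at all. The dictionary and hypothesis names are invented for the example.

```python
# Distinction: which filter positions sit downstream of the stearate-bearing tubing.
STEARATE_UPSTREAM = {1: False, 2: True, 3: False}

# Change: which supplier site's material may contain iron.
# Assumption: "not listed on the C of A" is treated as "no iron present".
IRON_IN_MATERIAL = {"South America": False, "Europe": True}

# Observed rust by (filter position, supplier site): True = IS, False = IS NOT.
observed = {
    (1, "South America"): False,
    (2, "South America"): False,
    (3, "South America"): False,
    (1, "Europe"): False,
    (2, "Europe"): True,
    (3, "Europe"): False,
}

# Each hypothesis predicts whether rust should appear for a position and site.
hypotheses = {
    "plastic tubing alone": lambda pos, site: STEARATE_UPSTREAM[pos],
    "iron in raw material alone": lambda pos, site: IRON_IN_MATERIAL[site],
    "tubing and iron together": lambda pos, site: (
        STEARATE_UPSTREAM[pos] and IRON_IN_MATERIAL[site]
    ),
}


def explains_all(predict):
    """True only if the hypothesis matches every IS and every IS NOT."""
    return all(predict(pos, site) == rust for (pos, site), rust in observed.items())


consistent = [name for name, predict in hypotheses.items() if explains_all(predict)]
print(consistent)
```

Neither factor alone fits: the tubing had always been there without rust, and the European material passed through Filters 1 and 3 without rust. Only the combination explains both where and when the problem appeared.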

Within a few days the cause was confirmed by both direct observation and by experiment. The plastic tubing was replaced by stainless steel, the supplier was required to filter out any iron, even trace amounts, and the client went back into production, back into making money, back into saving lives.