So here’s another occasion where a small kludge turned a complicated design into something really easy, though it required a different perspective on the problem. This is from my work on the CPU debug team at Intel, circa 2007.
Engineers in my group were trying to resolve a problem. Our tester, a Sapphire from Schlumberger, had really slow access to its output data. We could run a test a million and one times, but we didn’t have the bandwidth to efficiently dump the entire output to check the results after each test. If we stopped and dumped out the pattern memory after every run, it would actually take hundreds to thousands of times longer. We wanted to run tests really fast so we could run lots of them. We were doing debug testing, running the same test over and over again under different conditions.
They were talking about this problem in the cube next to me. They were trying to figure out ways to optimize the tester software, or ways to optimize the pattern memory layout. I joined the discussion and pointed out that what they were trying to do is compare a lot of data with a lot of other data. Well, a tester is actually really good at doing this quickly. You have expected data, you compare it to the actual data, and a tester can easily inform you when there’s a difference.
For our characterization of the silicon, we wanted to find the failures as we changed the power, temperature, and clock frequency. Scan chains provided us an internal snapshot of the microprocessor, and a scan chain output could be checked as it comes out. However, the current way of doing things was to run the test, capture the scan chain output in pattern memory, then dump the tester pattern memory for post-processing with software that checked bit by bit for failures. The hit to characterization speed was dumping the tester pattern memory.
I suggested adding two or three logic gates on the tester loadboard. These gates compare the scan output to a reference output stored by the tester, so we only have a single bit to look for. This tester did have a ‘fail pin’ feature: as soon as the pin goes to “1” we know we have a failure, and we are just looking for pass/fail. We know the clock cycle at which that failure occurs. Our debug tools can provide the insight into all the failures after the characterization has been completed.
So instead of a hardware problem turned into a data-transfer problem turned into a software problem, a different perspective treats it as a hardware problem with a hardware solution that requires just a couple of logic gates.
Manufacturing test is all about pass/fail, so you don’t have to add logic gates to do it. Debug test has a different focus. We were running in a scenario in which we wanted to find when the failures happened. We want it to fail, and we want to know the details of how it fails, so that we can deduce whether it’s a design issue, a silicon issue, or a test plan issue. Most of the time it’s a design issue: the sizing of gates or the timing of delays has to be adjusted. And because you need to know exactly where the issue is, it’s important to identify at what point in the scan chain it failed.
So we could either grab the full chain output, compare it against the known good, find out when the failure happened (the clock cycle), and use that to inform our process. OR we could have this one bit that flags a failure, immediately know at which clock cycle the failure happened, and have the tester report, at full speed, the failing clock cycle of every single iteration of the test. That removes the entire software and data-transfer step from the process.
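The contrast between the two approaches can be sketched in a few lines of Python. This is a toy simulation, not Joe’s actual loadboard design: the function names and the eight-bit example pattern are illustrative. The “hardware” version models what a couple of gates do, an XOR of the actual and expected bit each clock cycle, raising a fail pin on any mismatch.

```python
def failing_cycles_software(actual: list[int], expected: list[int]) -> list[int]:
    """Dump-and-post-process approach: compare everything bit by bit in software."""
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]

def failing_cycles_hardware(actual: list[int], expected: list[int]) -> list[int]:
    """Loadboard-style approach: a per-cycle XOR compare.
    The 'fail pin' goes to 1 on exactly the cycles where the bits differ."""
    fails = []
    for cycle, (a, e) in enumerate(zip(actual, expected)):
        fail_pin = a ^ e  # XOR: 1 exactly when actual != expected
        if fail_pin:
            fails.append(cycle)
    return fails

# Hypothetical 8-cycle scan chain output with mismatches at cycles 3 and 7.
expected = [1, 0, 1, 1, 0, 0, 1, 0]
actual   = [1, 0, 1, 0, 0, 0, 1, 1]
print(failing_cycles_hardware(actual, expected))  # [3, 7]
```

Both functions return the same failing cycles; the point of the hack is that the hardware version needs no pattern-memory dump at all, only a one-bit flag and a cycle counter.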
I flipped it on its head by adding a little hardware on the loadboard, using the tester’s innate capability to flag the failure and return the failing cycle number. I think a lot of what I have done throughout my work, in all areas, is knowing when to use a simpler hardware solution to solve a hard software problem, and when to use a simpler software solution for a hard hardware problem.
Have a Productive Day,
Dear Reader, please share your comments and stories that are sparked by this piece. Have you brought a different perspective to a problem? Did you overhear engineers debating a problem and chime in with your thoughts? See Contribute for how you can share a story at The Engineers’ Daughter.
The Automatic Test Equipment industry has gone through a consolidation. At the time Joe was working with the Sapphire, Schlumberger had probably already spun off its ATE operations. As cited here: “In 2003, the Automated Test Equipment group, part of the 1979 Fairchild Semiconductor acquisition, was spun off to NPTest Holding, which later sold it to Credence.” The Sapphire tester had been designed for characterization but not debug.
The engineers in Joe’s team focused on debug and characterization. A common tool for comprehending the behavior of an electronic device over a set of parameters is called a Shmoo plot.
Additional transcript from Anne Meixner and Joe FitzPatrick discussing this story.
Anne: Would you call that logic thing kind of a kludge?
Joe: It’s definitely a kludge but I like to look at it like a hack.
Anne: Hack may be the new term for kludge (a generational choice of words).
Joe: Right. When I say a hack, I mean a creative use of resources in a way they weren’t intended, to deliver results.
Anne: Which is different than a kludge? I recall an engineer looking at a board in the lab and exclaiming “this is such a kludge.” Kludge has a negative connotation.
Joe: When I use the word hack, it’s a cool little solution to be proud of. A kludge is like, “yeah, this is ugly, but this is how we had to do it.” So I should have another writing prompt: an “Engineering Hack.” Different fields may have different interpretations of the word hack.
Anne: It could also be a generational thing, kludge vs. hack. I like your definition; it’s very precise.