The True Nature of I/O Timing Failures: Tales from the Intel I/O Test Road Map


Measurements have errors due to the accuracy and precision of the equipment. In semiconductor testing, when setting a pass/fail limit one can set it such that it fails a few good parts (dubbed overkill) or passes a few bad parts (dubbed underkill, aka escapes). One errs on the side of protecting one's customer, i.e., failing some good parts to ensure that the bad parts fail. To guarantee no escapes, guard-banding your limits to account for equipment accuracy tolerances is often required. Using this approach for I/O timings, called tight timings, had motivated Mike Tripp to move from Intel's design organization to Sort Test Technology Development.
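Guard-banding amounts to simple arithmetic: tighten the spec limit by the equipment's accuracy tolerance so that no bad part can sneak through. A minimal sketch, with hypothetical numbers (the function name and values are mine, not from the P6S program):

```python
# Hypothetical guard-banding sketch: the spec limit is tightened by the
# tester's timing accuracy so that no bad part can pass (no escapes).
def guard_banded_limit(spec_limit_ns, tester_accuracy_ns):
    """Return the pass/fail test limit after subtracting the equipment tolerance."""
    return spec_limit_ns - tester_accuracy_ns

# Example: a 2.0 ns max-delay spec on a tester with +/-0.3 ns timing accuracy;
# parts must now measure 1.7 ns or better to pass.
test_limit = guard_banded_limit(2.0, 0.3)
```

The cost of that tightened limit is exactly the overkill the figures below illustrate: some good parts between 1.7 ns and 2.0 ns now fail.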

A sequence of figures can illustrate overkill versus underkill and the impact that tight timings had on I/O design specifications.


Figure 1–Loose timing enables underkill, aka escapes

Figure 2–Tighten the Timings to avoid underkill, and now you have overkill

Figure 3–To avoid overkill, I/O timing performance has to be tighter than the specification

To understand the true nature of I/O timing failures, Mike worked with Spass Stointschewsky, a product development engineer, to investigate P6S (Pentium Pro on X process) fails. As a side benefit, they measured quite accurately the number of good and bad parts that failed. Such numbers can come in handy when introducing a replacement test method.

The investigation consisted of characterizing some 5000 P6S parts that failed the I/O timing test. First they repeated the manufacturing test; Spass reported that approximately 190 parts passed. Not unexpected, as timing accuracy varies from tester to tester and pin to pin. The manufacturing tests had been run on groups of pins at the same time. Hence, all one knew from the manufacturing test was that one of the pins failed.

The characterization test would identify the specific I/O pin or pins that failed and by how much they failed a specific timing test. Time-consuming in nature, this data had previously been collected on first silicon to validate that the design met the timing specifications. Four timing specifications were checked on the CMOS GTL I/O circuits. A common clock on the system board provided the reference for the I/O timings. The two output timing specifications, TCO Min and TCO Max (Time from Clock to Output), provided a window for valid output data. By guaranteeing a window of timings, the system interconnect performance could be guaranteed for the part receiving the output data. For input timings, the Setup and Hold specifications assured that data sent from the other part would be captured if it met those timings.
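The four specs above can be sketched as two simple window checks, all relative to the common clock. This is an illustration only; the function names and spec values are hypothetical, not the actual P6S numbers:

```python
# Illustrative checks of the four GTL I/O timing specs, referenced to the common clock.
def output_timing_ok(tco_ns, tco_min_ns, tco_max_ns):
    """Output data must appear within the [TCO Min, TCO Max] window after the clock edge."""
    return tco_min_ns <= tco_ns <= tco_max_ns

def input_timing_ok(setup_ns, hold_ns, setup_spec_ns, hold_spec_ns):
    """Input data must be stable at least `setup_spec` before and `hold_spec` after the clock edge."""
    return setup_ns >= setup_spec_ns and hold_ns >= hold_spec_ns
```

A part passes the I/O timing test only when every I/O pin satisfies both checks.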

Back to the experiment. The test program incrementally changed the timings to determine the pass/fail timing of each failed part. As a second-hand source, I recall the following about the trials. They observed an inconsistent variation in the measurements across the I/Os; it turned out to relate to die self-heating. With activity, the part's temperature increased later in the characterization test program (which takes time), and that impacted the circuit performance. Once they identified the cause, they adapted the test program to mitigate the effect. Spass informed me that after ninety seconds, die self-heating had an impact. He solved it by using coarser increments as he searched for the pass/fail point for each timing.
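The coarse-increment search can be sketched as a simple sweep: step the timing in increments until the part first fails, keeping the step large enough that the whole search finishes before self-heating (roughly ninety seconds in) shifts the result. The details below are my assumptions, not Spass's actual program:

```python
# Sketch of a coarse-increment pass/fail search (details assumed).
def find_fail_point(passes_at, start_ns, stop_ns, step_ns):
    """Sweep the timing from start to stop; return the first value where the part fails."""
    t = start_ns
    while t <= stop_ns:
        if not passes_at(t):
            return t          # first failing timing found
        t += step_ns
    return None               # part passed across the whole sweep

# Example with a fake part that passes only up to 1.35 ns.
fail_at = find_fail_point(lambda t: t <= 1.35, start_ns=1.0, stop_ns=2.0, step_ns=0.2)
```

A coarser `step_ns` trades resolution on the pass/fail point for a shorter search, which is exactly the trade Spass made to stay ahead of die self-heating.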

They discovered that approximately two-thirds of the failing population turned out to be good parts: a ratio of two good parts for every one bad part. Now, before one becomes alarmed about the yield impact, if the overall bad-part population is on the order of 1,000 units per million parts, that would mean about 2,000 good parts were failed. Not too bad an impact in order to prevent a customer's system from failing (note: Intel sells on the order of a hundred million microprocessors a year). They also discovered that the majority of the parts failed on a single I/O, which meant a defect impacted only that I/O. A small number of timing failures involved a group of I/Os. In those cases a defect impacted the on-die clock network feeding the internal "common clock" signal that drives a set of I/O signals.
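The yield arithmetic above is worth making explicit. A back-of-the-envelope calculation using the round numbers from the post:

```python
# Back-of-the-envelope overkill impact from the numbers in the post.
bad_parts_per_million = 1000   # true failures, order of magnitude (defects per million)
overkill_ratio = 2             # good parts failed per true failure (the 2:1 finding)

# Good parts sacrificed per million shipped to guarantee the bad ones are caught.
good_parts_failed_per_million = overkill_ratio * bad_parts_per_million
```

At roughly 2,000 overkilled parts per million, the cost of protecting the customer is small relative to total volume.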

For developing the new test methodology, the empirical investigation yielded three key findings:

  • Single-pin failures dominated. Knowing that one pin had significantly different timings than the others would lead to the approach taken with AC I/O Loopback (link to first).
  • The overkill was two good units for every true failure. As long as the new test method did no worse than that, no engineering team could complain.
  • Output timing failures dominated. Comparing the silicon area of the driver circuitry to that of the receiver circuitry, this makes sense: the larger driver presents a bigger target for defects.

While Spass and Mike empirically collected data on failures, I simulated the impact defects would have on the I/O circuit timings. Some of my findings confirmed the empirical results, and the simulations provided some unexpected insights. That story will be shared very soon.

Have a Productive Day,

Anne Meixner

Dear Reader, What comment or question does this piece spark in you? Have you had to puzzle out unexpected results? Please share your comments or stories below. You, too, can write for the Engineers’ Daughter–See Contribute for more information.
