SLAC-PUB-2418 October 1979 CONF-791037--7

# EXPERIENCE USING THE 168/E MICROPROCESSOR FOR OFF-LINE DATA ANALYSIS\*

Paul F. Kunz, Richard M. Fall, Michael F. Gravina, John H. Halperin, Lorne J. Levinson, Gerard J. Oxoby, and Quang H. Trang

Stanford Linear Accelerator Center, Stanford University, Stanford, California, 94305

### ABSTRACT

The 168/E is a SLAC developed microprocessor which emulates the IBM 360/370 computers with an execution speed of about one half of an IBM 370/168. These processors are used in parallel for the track finding [...]. The processors are controlled by a PDP-11 minicomputer via a three port interface which we call the Bermuda Triangle. The tape handling and downloading is controlled by one of SLAC's IBM computers via a SLAC built interface [...] a system of 6 168/E's which should be able to give six times the production capacity that can be attained by running production jobs on the SLAC Triplex system. The cost of the system, including the channel interface, [...] the computer power of 3 IBM 370/168's. Hence, this system is an extremely cost-effective method for off-line data analysis.

### INTRODUCTION

In recent years, we have seen the construction of many large spectrometers at High Energy Physics laboratories. These spectrometers are capable of taking data at such a rate that the amount of computing [...].

The goal of this project has been to add to the computer center inexpensive hardware that can execute identically the same program and get the same results as the large computer. This led to the development of the 168/E microprocessor [1,2]. It "emulates" those IBM 370 instructions that are generated by the IBM FORTRAN compiler, and its speed is about one half [...].

### GENERAL FEATURES OF PROCESSOR

The 168/E consists of an integer CPU, a floating point processor, memory, and an interface.
They are all built on boards measuring about 12 by 16 inches, which are identical to those used by DEC in their VAX computers.

#### Integer CPU

The integer CPU circuit is based on the 2901, which is an LSI bit slice microprocessor chip introduced by Advanced Micro Devices in the summer of 1975. This board handles the following types of [...].

#### Floating Point

The floating point processor consists of two circuit boards. It is entirely MSI logic but uses the "new" MSI circuits which have been introduced to support the LSI components. The processor handles all IBM 360/370 single precision floating point instructions with [...].

#### Memory

The memory for the 168/E is in two parts, one for the program and the other for the data. Both are based on the Intel 2147 memory i.c. The Intel 2147 has become the industry standard circuit although only [...] companies have just announced they are also producing it. This circuit contains 4,096 words by 1 bit with a 70 nsec access and cycle time.
It has a unique feature in that when the memory is not being addressed, it powers down [...], and considerable power is saved in the system. [...] containing 24 memory circuits for program. That is, one memory board contains exactly 16 K bytes of data and [...].

#### Interface

The 168/E is not capable, as currently designed, of doing any input or output instructions. It can only [...] start the processor.

<sup>\*</sup> Work supported by the Department of Energy under contract number DE-AC03-76SF00515.

### OBJECT CODE TRANSLATOR AND LINKER

The 168/E microprocessor does not execute IBM 360/370 instructions directly [...].

Another program, which we call the "linker", does two jobs. First, it does the job of the IBM Linkage-Editor by reading the relocatable microinstruction object modules and linking them together.
Second, it forms an absolute memory image which can be loaded into the 168/E memories [...].

### CAN A MICROPROCESSOR DO THE BIG JOB?

Having built at very low cost a microprocessor that can be programmed in FORTRAN and has a speed which is no worse than twice as slow as a 370/168 is a fine achievement. But due to the design choices that have been made, it is still fair to ask the question: can it do the real number crunching job that we have with the LASS production code?

First of all, to be useful it must do a significant fraction of the time consuming part of the job. With the LASS production code, well over half of the CPU time is spent in the [...] subroutines. Thus the 168/E must be able to execute these subroutines [...].

The next question is whether the subset of 360/370 instructions that the 168/E can emulate is sufficient. In this part of the code, we found two types of FORTRAN statements which lead to IBM instructions that cannot be emulated by the processor. These are statements of the form GO TO (10,20), N and statements using one byte logical variables. It turns out, however, that their elimination is a good idea anyway. The computed GO TO statement is itself [...].

Thus, the 168/E processor could be used to take the most time consuming part of the production code [...]. [...] part of the code is very large; much larger than the original raw input tape data.
This is because the first part of the code unpacks the raw integer data such as wire numbers, widths, etc., into banks of floating point coordinates [...] into the processor's memory.

Once overlays were necessary, it was easy to extend this technique to that code which [...] the time consuming part. The choice was to do overlays or increase memory size. Since memory is the most expensive component of the 168/E, and overlay time would be only 10% of the execute time, we chose to do overlays. The net result was the decision to execute all of the production program [...] data tape.

### DEFINING THE OVERLAY STRUCTURE

To define an overlay structure for a program takes knowledge of the program's structure and flow. The overlays for the 168/E were defined in the following way:

- Each overlay should be called only once per event to prevent losing real time in doing the overlay.
- The size of the overlay is determined by the largest piece of code which satisfies the above restriction after one has tried to break the code up into the smallest pieces. In the case of the LASS production code, the solenoid track finding mentioned above is the largest overlay.
- The number of overlays is determined by fitting the rest of the code into pieces whose size is determined by the criteria above.

Defining the overlays for the LASS production code was relatively simple, since the code proceeds from unpacking to result formatting serially in several logically separate parts. The overlays for the LASS production code are as follows:

1. Unpacking raw coordinates into corrected floating point banks.
2. Counting the number of match points (or space points) in order to kill the event if there are too many, and finding beam tracks.
3. Finding tracks in the downstream spectrometer and following these tracks through the dipole to the region between the dipole and the solenoid.
4. Following these downstream tracks through the solenoid up to the target.
5. Finding tracks in the solenoid starting with points in the plane and cylindrical chambers.
6. Fitting all tracks found to a 5 parameter helix.
7. Following the tracks found in the solenoid downstream to the Cerenkov and Time-of-Flight counters.
8. Doing the vertex reconstruction on all found tracks including the beam track.
9. Formatting the result record, and accumulating statistics on chamber efficiencies, etc.

With each overlay, the executable code is translated and saved as a 168/E program overlay. Unlike overlays on most real computers, subroutines which appear in more than one overlay, such as SIN, COS, SQRT, etc., are duplicated.

When an overlay is executed on the 168/E, all of the processor's program memory will be [...] which contains all the [...] of the subroutines. We call this the "local memory", and it may be defined as all the data space a program uses which is not in a COMMON block. The local memory also needs to be loaded into the 168/E data memory when the program memory is loaded with an overlay. With the LASS production code, [...] by an overlay.

With the overlays described above, the 168/E can handle programs much larger than can fit into its [...]. Still larger programs can be handled by further overlaying the remaining data memory, which contains the program's COMMON blocks. In order to do this, additional knowledge of the program is needed.
One would like to know exactly in which overlays a COMMON is needed, in which overlays data is stored into the COMMON, and in which overlays data is fetched from the COMMON [...].

A method has been developed to study the whole program in this level of detail [3]. When each subroutine is compiled and its object code loaded into a load module library, a data set is created which contains a summary of the COMMON block usage. [...] The line contains the name of the COMMON block, the variable name, its offset from the start of the COMMON block, its length, and the length of the COMMON. It also contains the Store, Fetch, and other flags generated by the FORTRAN compiler. Collecting these individual data sets into one master index file [...] into the load module library, since only the entries of the master file pertaining to that subroutine are changed. This master index file, along with a file which states which subroutines are to be used in each of the overlays, can then be used as the data base for the program.

It was quickly realized that COMMON blocks could be put into one of three categories:

- Constant. These are COMMON blocks in which all the variables never change in the course of processing [...] phase of the production program by BLOCK DATA statements, reading disk files, and/or by calculation in subroutines called once per job.
- Variable. These are COMMON blocks in which all the variables are generated and used on an event by event basis.
- Mixed. These are COMMON blocks which contain both constants and variables in the sense defined above.

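The three categories above can be illustrated with a minimal sketch. It assumes a simplified master index in which each entry records one variable reference by one subroutine, and it treats a variable as event-dependent if any subroutine belonging to an overlay stores into it. The record layout, the names `IndexEntry` and `classify_commons`, and the example subroutine and COMMON names are hypothetical; the actual format of the index described in [3] is not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One hypothetical master-index line: a variable reference in a subroutine."""
    subroutine: str
    common: str
    variable: str
    offset: int   # bytes from the start of the COMMON block
    length: int   # bytes
    store: bool   # subroutine stores into the variable
    fetch: bool   # subroutine fetches from the variable


def classify_commons(entries, overlay_subroutines):
    """Label each COMMON block 'constant', 'variable', or 'mixed'.

    Assumption: a variable changes per event if a subroutine that belongs
    to any overlay stores into it. Stores made by once-per-job code that is
    outside every overlay do not count.
    """
    event_subs = set().union(*overlay_subroutines.values())
    all_vars = {}
    stored_vars = {}
    for e in entries:
        all_vars.setdefault(e.common, set()).add(e.variable)
        if e.store and e.subroutine in event_subs:
            stored_vars.setdefault(e.common, set()).add(e.variable)

    categories = {}
    for common, variables in all_vars.items():
        stored = stored_vars.get(common, set())
        if not stored:
            categories[common] = "constant"
        elif stored == variables:
            categories[common] = "variable"
        else:
            categories[common] = "mixed"
    return categories


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    entries = [
        IndexEntry("INIT", "GEOM", "ZPLANE", 0, 400, store=True, fetch=False),
        IndexEntry("UNPACK", "GEOM", "ZPLANE", 0, 400, store=False, fetch=True),
        IndexEntry("UNPACK", "COORD", "X", 0, 4000, store=True, fetch=False),
        IndexEntry("TRKFND", "COORD", "X", 0, 4000, store=False, fetch=True),
        IndexEntry("TRKFND", "RUNPAR", "NEVENT", 0, 4, store=True, fetch=True),
        IndexEntry("TRKFND", "RUNPAR", "BFIELD", 4, 4, store=False, fetch=True),
    ]
    overlays = {1: {"UNPACK"}, 2: {"TRKFND"}}
    print(classify_commons(entries, overlays))
```

In this sketch a block that turns out "mixed" is only flagged. Splitting it into a constant part and a variable part would be a source-level change.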
Since constant COMMONs never change their contents, they can be easily written into the 168/E data memory as required for a particular overlay. In a sense, they are logically similar to the local memory of the subroutines, which is rewritten into data memory as required. We have chosen to do this even for constant COMMONs which are used in more than one overlay. Except for the large banks of constant COMMONs used in the unpacking [...].

The contents of the variable COMMON blocks are created by the 168/E in the course of processing an event. For practical reasons, once a block has been created it remains in the 168/E data memory for as many overlays as it is needed; then it may be overwritten by other COMMON blocks, either constant or variable, in succeeding overlays.

Mixed COMMON blocks could be handled in another way, but [...].

Figure 1: Data Memory Overlay Load Maps

With the master index as a data base, software tools have been developed to generate data memory load maps for all the overlays. An example is given in figure 1. The left hand vertical scale is data memory location expressed in bytes, and the nine columns are the nine overlays. Note that one first loads the local memory (LM01 through LM09) for the low addresses of the processor, then the constant COMMON blocks [...] until the end of processing the event.
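The residency rules described above can be sketched in a few lines: local memory and constant COMMONs are rewritten for each overlay that needs them, while a variable COMMON stays resident from the overlay that creates it through the last overlay that needs it. The sketch below only totals the bytes resident in each overlay and compares the peak with loading everything at once. It does not assign addresses, so it is not the load map generator itself. All names, sizes, and the `LMnn` labels are made up for illustration.

```python
def resident_blocks(local_sizes, constants, variables):
    """Return one {block_name: bytes} dict per overlay.

    local_sizes: bytes of local memory for each overlay, in overlay order
    constants:   {name: (bytes, set of overlay indices that use the block)}
    variables:   {name: (bytes, first overlay index, last overlay index)}
    """
    maps = []
    for ov, local in enumerate(local_sizes):
        resident = {f"LM{ov + 1:02d}": local}  # hypothetical local-memory label
        for name, (size, used_in) in constants.items():
            if ov in used_in:                  # rewritten only where needed
                resident[name] = size
        for name, (size, first, last) in variables.items():
            if first <= ov <= last:            # kept while still needed
                resident[name] = size
        maps.append(resident)
    return maps


def peak_and_total(maps):
    """Peak bytes resident in any one overlay, and bytes if nothing were overlaid."""
    peak = max(sum(m.values()) for m in maps)
    everything = {}
    for m in maps:
        everything.update(m)
    return peak, sum(everything.values())


if __name__ == "__main__":
    # Three made-up overlays; sizes are illustrative, not LASS numbers.
    local_sizes = [1000, 2000, 1500]
    constants = {"GEOM": (4000, {0}), "FIELD": (8000, {1, 2})}
    variables = {"COORD": (6000, 0, 1), "TRACKS": (3000, 1, 2)}
    maps = resident_blocks(local_sizes, constants, variables)
    peak, total = peak_and_total(maps)
    print(peak, total)
```

A real load map must also give each variable block a fixed address for its whole lifetime, so the memory actually required can exceed the peak computed here.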
The net effect of the data memory overlaying is a substantial saving in memory required by the 168/E processor. Since memory is the most expensive part of the processor, [...] saved to add more processors to the system. If all the COMMON were loaded into the [...] is required. On the program side, if all the code was loaded into the program memory at one time it would require over 120 K microinstruction words, while with the overlays less than 20 K microinstructions are needed. [...] the time spent overlaying is 90 msec per event. This is less than 10% of the average event [...].

### BERMUDA TRIANGLE SYSTEM

The Bermuda Triangle system, shown in figure 2, is our method of overlaying the 168/E memory. The Bermuda Triangle is a three way interface with I/O ports to a large buffer memory, a PDP-11 UNIBUS, and a bus to the 168/E processors. Data may be transferred bidirectionally between any two ports. Two Bermuda Triangles are used, one for the program memory and one for the data memory.

The first port of the Bermuda Triangle is to the buffer memories. The program buffer memory, with 128 K [...] the program to be executed. The data buffer memory, with 64 K words by 32 bits (256 K bytes), is large enough to hold all the local memory and copies of the constant COMMON blocks. [...] The memories are implemented with general purpose memory cards purchased from Mostek Memory Systems. The data memory is two cards depopulated to 64 K words of 16 bits.
The cycle time is 500 nsec with an access time of 375 nsec. We have used the buffering and chassis that Mostek provides [...].

The second port of the Triangle is the bus to the processors. It is a 50 line flat cable with TTL Tri-state drivers and receivers. The transfer uses a protocol which is essentially identical to the one being developed by the FASTBUS committee [4]. A 24 bit address field and a 32 bit data field are used. They are time multiplexed on [...] the processor. The rate of transfer on this bus is one word in 70 [...]; thus the transfer rate on the data side is the equivalent of nearly 3 M bytes per second of IBM object code.

The third port of the Triangle is a PDP-11 UNIBUS.
A PDP-11/04 with 40 K bytes of memory is used as the control [...].

Figure 2: The Bermuda Triangle System

The buffer memories are loaded from the UNIBUS. An 8 K byte portion of the buffer memory appears as an 8 K byte portion of UNIBUS address. Both [...] 8 K byte portion of UNIBUS address, but only one is enabled at a time by a bit in their page register. Each Triangle has a 15 bit page register which is shifted left 8 bits and added to the offset from the start of the UNIBUS address [...].

The processors are normally loaded from the buffer memory. From the UNIBUS port, the PDP-11 loads an address register for the buffer memory, an address register for the processor bus, and a word count register. When the word count register is loaded, the Bermuda Triangle transfers the data until the word count is exhausted. It then causes an interrupt on the UNIBUS port. The results from the processor are normally [...] memory in the same fashion. A [...] the direction of transfer.

The PDP-11 gets access to the control registers of the 168/E processor by a 1 word window of the Bermuda Triangle from the control port to the processor bus. [...] address register mentioned above. The double use of this address register is not a problem because one never attempts to gain access to the control registers of the processor while transferring data to or from it. One can also gain access to the processor control registers through either the program or data Bermuda Triangle.

### CHANNEL INTERFACE

With the 168/E's and the Bermuda Triangle, the system needs a source for the raw data and a sink for the results. [...] or a disk to the IBM computer.
This means that ordinary batch jobs can transfer data to and from the Bermuda Triangle system. The FORTRAN programmer gets access to the system by a simple FORTRAN callable subroutine.

Thus the IBM 360/370 reads the raw data from tape, sends it to the PDP-11 to be processed by the 168/E Bermuda Triangle system, [...] the PDP-11 and buffer memories.

To synchronize the PDP-11 and 370 software, the 370 always attempts a read from the PDP-11 before a write. When the IBM computer reads results from the PDP-11, [...] a write can then always be done. For normal event transfers, the control unit transfers directly [...] the Bermuda Triangle, with the PDP-11 setting up the appropriate address and page registers.

If the IBM computer attempts a read when no data is ready in the [...] simply queues the read command without causing an interrupt to the [...] a request for service to the IBM channel. This request signal wakes up the channel and the transfer is started. This is standard operating procedure for devices on an IBM 360/370 channel. [...] The IBM CPU is free to work on other jobs from the time it issues the Start [...] to complete.

### PDP-11 SOFTWARE

The PDP-11/04 computer has the job of controlling the 168/E overlays, the transfer of event data to and from the 168/E, and the transfer of data to and from the control unit. The job is divided into a number of software tasks, corresponding to the non-shareable hardware resources.
There is a task for each processor, [...] each of the processor busses. As was mentioned earlier, the Bermuda Triangle was designed so that the hardware resources could easily be assigned to specific software tasks. We have chosen a small multi-tasking executive called SPEX [5] which allows all the tasks to be resident in memory, and hence no disk is required on the PDP-11. It has been used as the data acquisition system in several experiments at Fermilab, Brookhaven, and CERN.

Each of these tasks is "driven" by a queue of work to do. The channel interface task receives raw event data and queues it to the processor work queue. When a processor becomes available, its task will take an event from the work queue, supervise its transfer to the processor [...] to the channel interface work queue, and start working on another event from the processor work queue. Meanwhile, the channel interface task initiates the transfer of [...] in the buffer memory and queues it to the processor work queue.

[...] started by a task which supervises the initial transfer of the overlays and constant COMMON blocks into the buffer memory. It also fills the remainder of the buffer [...] the two processor busses. Since the busses can only perform one operation at a time, the processor tasks queue their requests to the individual tasks which are assigned to the busses. The executive, SPEX, handles all queuing and synchronization of the tasks.

The PDP-11 itself is loaded via the channel.
The IBM computer can send a hardware "Boot" command to the PDP-11 so that each job on the IBM computer completely re-initializes the whole system. The PDP software is written using a cross-assembler which is run on the IBM computers; thus there is no need for any permanent storage devices on the PDP-11 such as disks, tapes, etc. This absence of any peripherals, other than a terminal, should help make the system very reliable and reduce maintenance. The D.E.C. program, ODT-11, is loaded into the PDP-11 with the executive and associated tasks to aid in debugging. [...]

### SUMMARY

In October 1979 this system came into operation with one processor. It executes all of the LASS production program with essentially the same results as an IBM 370 computer. That is, identically the same [...] parameters except for [...] of the tracks which are very poorly defined.

We feel the importance of emulation cannot be overemphasized. The LASS production code is nearly 20,000 lines of FORTRAN. It would be extremely time consuming to have re-written this code in assembly language, no less microcode. Before one could have finished such a project, the FORTRAN source would have undoubtedly been [...] to be changed but also verified [...] set of events with the results from the same events run on the IBM computer. [...] the LASS code on the system [...] results which were obviously wrong. [...] The cause of one of the errors was forgetting to load one of the constant COMMON blocks [...].
[...] of the source code which is used both for the IBM computer and the 168/E processor. In fact, the input to the translator is the link-edited load modules which are used to run the program on the IBM computer. When the production program [...] retranslates the IBM load modules.

We are now preparing to run many thousands of events on the system and on the IBM computer. This will check for pathological events to a level of one in ten thousand or better. We are also preparing additional 168/E processors and expect to have 6 processors on the system by the end of 1979. We will thus be able to start analyzing our 50 million events with a system nearly equivalent to 3 dedicated IBM 370/168's running 24 hours a day.

### ACKNOWLEDGEMENTS

We would like to thank D.W.G.S. Leith for his support and encouragement on this project. Special thanks must go to Manoch Bardana of the Weizmann Institute, who designed the 168/E microprocessor during his visit to SLAC. The first real test of the processor was done with the efforts of Rafi Yaari, also of the Weizmann Institute, who brought us the track reconstruction code from the TASSO spectrometer at DESY to run on the processor in February [...]. Thanks also go to George Aiken and Al Kilert of SLAC, who designed the chassis for the processor.

### BIBLIOGRAPHY

- [1] Paul F. Kunz, The LASS Hardware Processor, Nuc. Instr. Meth. 9 (1976) p. 435.
- [2] Paul F. Kunz et al., The LASS Hardware Processor, Proc.
11th Annual Microprogramming Workshop, SIGMICRO Newsletter 9 (1978) p. 25.
- [3] Roger B. Chafee, GOODGNUS, CGTM No. 198, Stanford Linear Accelerator Center, Stanford, Calif., 94305.
- [4] B. Wadsworth, FASTBUS: An Emerging Laboratory Standard, a paper in this volume.
- [5] J.I. Fassimo, S. Nelson, L.J. Levinson, SPEX: A Spectrometer Executive, General Structure, Programmers Manual, and Users Manual, Brown University High Energy Physics Group Internal Reports 123, 124, 125 (1972).