By: Kirk Weedman

www.hdlexpress.com

Below is my progress on the new Out of Order CPU architecture.

 

Mar. 2, 2016: In researching how these branch methods may apply in OoOE processors, I've gotten sidetracked today with a new algorithm I came up with for OoOE that is easier to implement than Tomasulo's. I've begun creating a new OoOE CPU to see if it's any good.

Mar. 9, 2016: Modified ISA for both KPU55 and KPU_OoOE. Working on creating KPU_OoOE CPU with Branch Method 3. Modified the DECODE stage so that it can easily be changed to use microcode instead of brute force decoding for every instruction. This would also facilitate changing over to a different ISA, such as an ARM Cortex. The microcode method will reduce the number of logic gates in DECODE and thus reduce delays in the stage. Might be nice to show this CPU running on an ARM instruction set. Anyone have any suggestions?

Mar. 17, 2016: Decided to go with the ARMv7 ISA for this new Out of Order design.... I know. Even though my OoOE internals are very different from ARM's, I guess I can't release the CPU with this ISA. I also have some of the basic modules created and they finally, tonight, compile without errors. Currently setting up a test bench and top level to integrate all the modules and begin debugging. First test will be just to get the basic flow working for all the Data Processing type instructions and see that my new style of logic works. I also set up IAR tools for ARM to generate binary code to use in my Verilog simulations.

Mar 22, 2016: Just found out during simulation that my new OoOE architecture has a design problem. So back to the drawing board on that idea. Still working on a new architecture based on the ARM7-TDMI ISA.

April 2, 2016: Created a disassembler (for a portion of the ISA) to display the assembly language instructions during ModelSim simulations. Also looking at another OoOE architecture idea. However, I now have to do some other work for the next 2 - 3 months before I can continue on this project.

June 24, 2016: Slowly starting to work on the new ARM OoOE architecture idea I have, as time permits. Just completed writing a new module (a linked list type of Reservation Station). I will need to make sure it is synthesizable and then simulate it. I finally found a patent (owned by Qualcomm) that may have some similarity to my design, but it doesn't have the features my design method does.

July 8, 2016: Worked more on a module that will be the equivalent of a reservation station for the new ARM OoOE CPU.

July 22, 2016: Finished initial design of the LLRS (Linked List Reservation Stations) module (llrs.v) and have most of the Reg. Rename module (reg_rename.v) written. It will do the register renaming logic for the new ARM OoOE CPU.

// Example: N = 4 wide instruction fetch

0 1 2 3 ---> Decode & Tag #0 ---> Reg. Rename #0 ---> LLRS #0 ---> Issue/Execute    
                         
          Decode & Tag #1 ---> Reg. Rename #1 ---> LLRS #1 --->   ROB/commit
                       
          Decode & Tag #2 ---> Reg. Rename #2 ---> LLRS #2 ---> ----->
                       
          Decode & Tag #3 ---> Reg. Rename #3 ---> LLRS #3 --->  
                           
          ...   ...   ...        

Decode & Tag - Instruction decoding and "tag" creation. The tag "magic" happens in LLRS and ROB/commit.

Reg. Rename - The number of physical registers is parameterized and can be changed to get the best performance per application. Currently it's set to 64. reg_rename.v uses a linked list to keep track of which physical registers currently hold the architectural registers defined by the ISA.
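To make the renaming idea concrete, here is a minimal behavioural sketch in Python. It uses a conventional map table plus free list, which is an assumption for illustration only and not the linked-list scheme inside reg_rename.v. The class name, method names, and the choice of 16 architectural registers are all hypothetical.

```python
class RenameTable:
    """Conventional map-table renamer (illustrative; not the reg_rename.v design)."""

    def __init__(self, num_arch=16, num_phys=64):
        # Assumption: architectural register i starts out in physical register i.
        self.map = list(range(num_arch))
        # Every other physical register starts out free.
        self.free = list(range(num_arch, num_phys))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (new_phys_dest, phys_srcs, old_phys_dest)."""
        # Sources are looked up before the destination mapping changes.
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register; a real pipeline would stall")
        new_phys = self.free.pop(0)
        old_phys = self.map[dest]
        self.map[dest] = new_phys
        return new_phys, phys_srcs, old_phys

    def release(self, phys):
        """Return a physical register to the free list (e.g. at commit)."""
        self.free.append(phys)


table = RenameTable()
print(table.rename(0, [1, 2]))   # r0 = r1 op r2
print(table.rename(0, [0, 3]))   # r0 = r0 op r3, reads the renamed r0
```

The second call reads the physical register allocated by the first, which is how renaming keeps the true dependency while removing the WAW conflict on r0.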

LLRS - a new type of Linked List Reservation Station. The first version will not be superscalar as shown above, but will be designed & parameterized so that it can become a superscalar. There are two linked lists that use a common pool of Reservation Stations. One linked list is "used" pool items and the other linked list is the "unused" pool items. Only 1 clock cycle is needed by the module due to the way the linked lists are maintained. The "used" linked list keeps ready instructions in order from oldest to youngest so that the oldest can be passed on to Issue/Execute first.
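The oldest-first selection described above can be sketched in a few lines of Python. This is only a behavioural illustration under assumed names (select_oldest_ready, used_order, ready); the hardware does this with linked list pointers in one clock rather than a loop.

```python
def select_oldest_ready(used_order, ready):
    """Return the oldest ready entry, or None if nothing is ready.

    used_order: entry ids from oldest to youngest (the order of the "used" list).
    ready: set of entry ids whose operands are all available.
    """
    for entry in used_order:
        if entry in ready:
            return entry
    return None


used_order = [5, 2, 7, 0]                        # entry 5 is the oldest
print(select_oldest_ready(used_order, {7, 2}))   # entry 2 is older than entry 7
```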

Issue/Execute - contains a pool of execution resources (non floating point for the first version) usable by all available instructions issued per clock cycle from the LLRS module(s). Multiple instructions (more than N) can be passed per clock to the ROB/Commit.

ROB/Commit can commit multiple (more than N) ready instructions per clock.

August 24, 2016: Spent several hours (out of the last month since last reporting) working on the Register Rename block above, but discovered a new method that looks promising. For now I'm calling it Hazard Control. Like register renaming methods, it eliminates WAW & WAR hazards. It keeps track of instruction dependencies but doesn't rename registers and/or allocate a new set of physical registers. So far, the logic appears to be simpler than typical renaming techniques. The goal is something simpler (in terms of logic levels which determine min. clock cycle widths) because I want to eventually make the issue width much larger than 4. I hope to be able to devote most of my time on this starting by the end of October. For the last 4 months I've had other obligations. :(
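For readers less familiar with the hazard names, the following Python sketch classifies the RAW, WAR, and WAW relationships between an older and a younger instruction. It only shows what the three hazards are; it does not attempt to show how Hazard Control resolves WAW and WAR without renaming, since that logic is not described here. The function name and the (dest, srcs) tuple format are assumptions.

```python
def classify_hazards(older, younger):
    """Return the set of hazards the younger instruction has on the older one.

    Each instruction is a (dest, srcs) tuple; dest may be None if nothing is written.
    """
    o_dest, o_srcs = older
    y_dest, y_srcs = younger
    hazards = set()
    if o_dest is not None and o_dest in y_srcs:
        hazards.add("RAW")   # younger reads what older writes (true dependency)
    if y_dest is not None and y_dest in o_srcs:
        hazards.add("WAR")   # younger overwrites a register older still has to read
    if y_dest is not None and y_dest == o_dest:
        hazards.add("WAW")   # both write the same register
    return hazards


add_inst = ("r1", ("r2", "r3"))   # add r1, r2, r3
sub_inst = ("r2", ("r1", "r4"))   # sub r2, r1, r4
print(sorted(classify_hazards(add_inst, sub_inst)))
```

Only RAW is a true data dependency; WAR and WAW come from register reuse, which is why both renaming and the approach described above can remove them.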

0 1 2 3 ---> Decode & Tag #0 ---> Hazard Control ---> LLRS #0 ---> Issue/Execute    
                       
          Decode & Tag #1 ---> ---> LLRS #1 --->   ROB/commit
                     
          Decode & Tag #2 ---> ---> LLRS #2 ---> ----->
                     
          Decode & Tag #3 ---> ---> LLRS #3 --->  
              ...   ...    
              ---> LLRS #M --->    
          ...   ...   ...        

 

October 5, 2016: It's started raining here in Oregon so I've been inside the last couple of days and made good progress and improvements on the Fetch, Decode&Tag, Hazard/Control, and LLRS sections shown above. I may rename some sections as this is a very different architecture for Out of Order type CPUs. One important change to the above diagram is that there are now more LLRS blocks than the number of instructions Fetched per clock (Decodes too). The ROB/Commit section will be able to retire more instructions per clock (if available and can be done) than there will be instructions Fetched per clock. This will help improve IPC as some instructions (e.g. Loads, Stores, etc.) may have to wait a while in the LLRS blocks before they are ready to be executed.

October 9, 2016: Got all but the ROB/commit module (not designed yet) to compile without errors. Next I'll start making testbenches for individual modules and the whole design.

October 13, 2016: Various code mods. Renamed Hazard Control to Dependency Control.

October 18, 2016: For debugging purposes I am optionally passing the full 32 bit instruction through each stage. I'm creating a Verilog module that can be used to disassemble the instructions at various locations and display the disassembled instructions in ModelSim as an ASCII string in the waveforms. This will greatly help in watching the data flow.
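As a rough illustration of what such a disassembler does, here is a Python sketch that decodes only a tiny subset of the ARM encoding: data-processing instructions with an unshifted register operand. The field positions follow the standard ARM 32-bit encoding, but the function name and output formatting are assumptions, and this is not the author's Verilog module.

```python
DP_MNEMONICS = ["and", "eor", "sub", "rsb", "add", "adc", "sbc", "rsc",
                "tst", "teq", "cmp", "cmn", "orr", "mov", "bic", "mvn"]
COND_SUFFIX = ["eq", "ne", "cs", "cc", "mi", "pl", "vs", "vc",
               "hi", "ls", "ge", "lt", "gt", "le", ""]   # index 14 = always


def disasm_dp_reg(word):
    """Disassemble a data-processing instruction with a plain register operand.

    Returns None for anything outside this small subset.
    """
    cond = (word >> 28) & 0xF
    if cond == 0xF:
        return None                      # unconditional instruction space
    if ((word >> 25) & 0x7) != 0:
        return None                      # not the register form
    if ((word >> 4) & 0xFF) != 0:
        return None                      # shifted operand, multiply, etc.
    opcode = (word >> 21) & 0xF
    s_bit = (word >> 20) & 1
    rn = (word >> 16) & 0xF
    rd = (word >> 12) & 0xF
    rm = word & 0xF
    name = DP_MNEMONICS[opcode]
    suffix = COND_SUFFIX[cond]
    if name in ("tst", "teq", "cmp", "cmn"):
        if not s_bit:
            return None                  # S=0 here encodes other instructions
        return f"{name}{suffix} r{rn}, r{rm}"
    flags = "s" if s_bit else ""
    if name in ("mov", "mvn"):
        return f"{name}{flags}{suffix} r{rd}, r{rm}"
    return f"{name}{flags}{suffix} r{rd}, r{rn}, r{rm}"


for word in (0xE0810002, 0xE1A00002, 0xE1510002):
    print(f"{word:08x}  {disasm_dp_reg(word)}")
```

In the simulation the equivalent string would be attached to a signal so the waveform viewer can show it next to the pipeline stage that holds the instruction.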

October 26, 2016: Worked on the ALU logic used in alu_functional_unit.v, the microcode ROM/RAM table and logic, and decode.v logic updates.

October 31, 2016: Added new ALU control bits to the microcode ROM/RAM table and logic, and added Multipliers and logic in alu_functional_unit.v. The logic for the ARM Data Processing instruction flow from Fetch through Issue/Execute is complete and ready for debugging once I create a testbench.

November 1, 2016: Wrote commit.v, updated gpr.v and kpu_ooe.v

November 2, 2016: Wrote simple testbench top_tb1.v and started the debugging process. Typical bugs like port size mismatches between modules, wire vs. reg usage on port signals, etc... Still more to debug before I can actually get the simulation (waveforms) running so I can debug the logic.

November 4, 2016: Currently debugging debug_asm.v, fetch.v, etc.. Right click on the pic below and save it so you can see the full hi res pic. This shows the beginning of debugging instructions using the debug_asm.v module in a ModelSim simulation. If anyone is interested in helping debug this module and using it, contact me. You will need ModelSim and a tool such as IAR Workbench to compile code to generate assembly language to compare against ModelSim, and binary code (in hex format) that can be placed in a file that the simulation can read in.

November 7, 2016: Good debugging progress up to DC (Dependency Control). Also updates/fixes to debug_asm.v, which is renamed to armv7_disasm.v. Currently the parameter determining the number of instructions per clock is set to 4. Almost everything is parameterized so it can be easily changed: things such as IPC, the number of LLRS units, instruction window length, the number of different types of functional units, etc. This will make it easy to try different CPU design configurations later on.

November 9, 2016: Working on DC - LLRS flow/control logic. Snapshot below (open in another tab to see full size) of the Dependency Control unit working on 4 instructions per clock. This can easily be changed to 8 ... 64... by changing 1 parameter. Similar changes to other modules have to be made.

November 10, 2016: Fixes to llrs.v logic. Looking much better now.

November 14, 2016: llrs.v is beginning to work. Different test cases still need simulating. For the picture below, I created some special Verilog code just for debugging the linked lists. Look at the "used_str" and "unused_str" signals. These directly correlate to two singly linked lists called "used" and "unused". Both lists use a single array, but have pointers that separate which entries are in which list (used or unused). The special Verilog code goes through both lists and creates ASCII strings that can be displayed to show the linking order of both lists during simulation. This was a great way to debug the actual RTL code and check that my linked lists were working properly and that all links are updated correctly. Each time a CPU instruction is saved it takes an "unused" entry and links it into the "used" linked list. Not shown, and yet to be tested, is the case when an instruction is removed and a "used" entry goes back to the "unused" list. No data is physically moved; only the linked list pointers (values) get updated. While debugging the architecture, the original 32 bit instructions are passed to each stage along with microcode and other signals. What is not shown here is all the microcode and various signals saved in the linked lists corresponding to the instructions that are being used. For the final version (RTL only) none of the debug and instruction passing code will be compiled into the design. This particular simulation display shows 1 of the 6 llrs.v modules, each having an instruction queue depth of 8 (the parameter can easily be changed to 64, 256 or whatever is needed).
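The two-lists-in-one-array idea can be modelled behaviourally in Python as below. This is a sketch under assumptions: the class and method names are hypothetical, freed entries are pushed on the head of the "unused" list, and removal walks the list to find the predecessor, which is a software convenience and not how single-cycle hardware would do it. The used_str and unused_str methods mimic the debug strings mentioned above.

```python
class LinkedPool:
    """Two singly linked lists ("used" and "unused") sharing one array of entries."""

    def __init__(self, depth=8):
        self.next = [i + 1 for i in range(depth)]   # initially all entries chained as unused
        self.next[-1] = None
        self.payload = [None] * depth
        self.unused_head = 0
        self.used_head = None
        self.used_tail = None

    def insert(self, item):
        """Take an unused entry and link it at the young end of the used list."""
        idx = self.unused_head
        if idx is None:
            raise RuntimeError("pool full")
        self.unused_head = self.next[idx]
        self.payload[idx] = item
        self.next[idx] = None
        if self.used_tail is None:
            self.used_head = idx
        else:
            self.next[self.used_tail] = idx
        self.used_tail = idx
        return idx

    def remove(self, idx):
        """Unlink a used entry and return it to the unused list; payload is not moved."""
        prev, cur = None, self.used_head
        while cur is not None and cur != idx:
            prev, cur = cur, self.next[cur]
        if cur is None:
            raise KeyError(idx)
        if prev is None:
            self.used_head = self.next[idx]
        else:
            self.next[prev] = self.next[idx]
        if self.used_tail == idx:
            self.used_tail = prev
        self.next[idx] = self.unused_head
        self.unused_head = idx

    def _walk(self, head):
        order = []
        while head is not None:
            order.append(head)
            head = self.next[head]
        return order

    def used_str(self):
        return "->".join(str(i) for i in self._walk(self.used_head))

    def unused_str(self):
        return "->".join(str(i) for i in self._walk(self.unused_head))


pool = LinkedPool(depth=4)
for name in ("add", "sub", "mov"):
    pool.insert(name)
pool.remove(1)
print("used:", pool.used_str(), " unused:", pool.unused_str())
```

Only the next pointers and head/tail values change on insert and remove, which matches the point that no instruction data is physically moved.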

right click on pic and either save or view in a new tab to see the full resolution

November 16, 2016: more debugging of llrs.v (problems with adding/removing data in linked lists), dependency_control.v and just about to start debugging llrs.v - commit.v interactions.

November 17, 2016: Started creating br_functional_unit.v so that branches don't get stuck in the llrs.v units while debugging. For now I'll just bring them into br_functional_unit.v and pass them on to commit.v so they can be retired and not hold up the data processing instructions I'm currently debugging.

November 18, 2016: Another fix to llrs.v to correct what happens when a transfer into and a transfer out of the module occur at the same time. Also some minor fixes to commit.v and kpu_oooe.v. Today was the first day that the first instruction made it all the way through the architecture. Unfortunately nothing else did... Debugging is going well.

November 20, 2016: Ran into a dependency algorithm problem. May take some time to figure this one out.

November 24, 2016: Found out why the CPU appeared to be going very slow - no forwarding logic. Added one (of 3 I need to add) level of forwarding from the ALU Functional Units back to the LLRS stage. The commit.v now has a Reorder Buffer and then commit logic before data is written back to the architectural registers. Thus forwarding logic will get added for the ROB and the COMMIT stages once I get other debugging done.
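The effect of forwarding on waiting instructions can be sketched as a tag match: when a functional unit produces a result, any waiting operand with the same tag captures the value immediately instead of waiting for the register write. The Python below is a generic illustration of that idea; the dictionary layout, tag numbering, and function name are assumptions and do not reflect the actual signals between alu_functional_unit.v and llrs.v.

```python
def forward_result(entries, result_tag, result_value):
    """Broadcast one result to waiting entries; return names of entries now fully ready."""
    ready_names = []
    for entry in entries:
        for operand in entry["operands"]:
            # An operand that is still waiting captures the value on a tag match.
            if not operand["ready"] and operand["tag"] == result_tag:
                operand["ready"] = True
                operand["value"] = result_value
        if all(operand["ready"] for operand in entry["operands"]):
            ready_names.append(entry["name"])
    return ready_names


waiting = [
    {"name": "sub", "operands": [{"ready": False, "tag": 3}, {"ready": True, "value": 7}]},
    {"name": "orr", "operands": [{"ready": False, "tag": 4}]},
]
print(forward_result(waiting, result_tag=3, result_value=42))
```

Without this path, "sub" would have to wait until the result passed through the ROB and commit stages and reached the architectural registers, which is consistent with the slowdown described above.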

November 27, 2016: Bugs found and fixed. Got first 100 instructions executed in 31 clocks. Still needs more forwarding logic. The following pic shows the instruction flow in the ROB/Commit section using my ARM disassembler in the simulation. Right click on the pic and save or open in a new tab to see the full resolution.

November 29 - Dec 5, 2016: Microcode table data fixes and improvements to the microcode state machine/instructions. Feeding various instructions to armv7_disasm.v and debugging them. I also use the results to check decode.v functionality. This will take a lot of time.

Right click on the pic and save or open in a new tab to see the full resolution.

Dec 6, 2016: Changed armv7_disasm.v. Changed the disassembler code to use SystemVerilog "string"s instead of a method with regular Verilog. It shortened the code a bit. Spent the day adding new instructions and debugging them. Still need to add Media instructions. Being able to SEE the instruction flow during simulation is a big help.

Dec 8, 2016: Added most of the ARM 32 bit Media Instructions to armv7_disasm.v.

Dec 11, 2016: Completed adding ARM 32 bit Media Instructions to armv7_disasm.v. Currently the architecture does not yet incorporate any THUMB instructions and probably won't for a while, as the real goal is to show that the new Out of Order microarchitecture works. Debugged 200+ different instructions of various types (Data Processing Register, Data Processing Register-shifted Register, Misc., Data Processing Immediate, Halfword multiply and multiply accumulate, multiply and multiply accumulate, Sync. Primitives, Extra Load/Store, Branch, etc.).

Dec 13, 2016: Started working on the Load/Store Functional Unit and microcode control logic. Fixed a couple issues with decode.v.

Dec 14, 2016: After months of random boot issues after power-on, my desktop computer finally died today :(. After trying to figure out the problem, I discovered the Power Supply voltages were all very low! Drove in the afternoon snowstorm to Fry's in Wilsonville, OR to get a new supply and put it in. It's working again and not having the boot issue. Started creating a PowerPoint presentation about this new Out of Order Execution microarchitecture.

Right click on the pic and save or open in a new tab to see the full resolution.

Dec 20, 2016: Worked on microcode table and logic for ARM "Extra load/store" instructions. Continued PPT presentation - 45 slides so far.

Dec 22, 2016: Continuation of decode, microcode and disassembler fixes. See bottom signals in waveform below.
Right click on the pic and save or open in a new tab to see the full resolution.

Dec 23, 2016: Worked on the load/store functional unit (ls_functional_unit.v) and related logic in the whole system. Still have a chunk of logic related to Load/Store to go into the final Commit stage. Updated microcode table in arm_micro.v related to load/store instructions.

Dec 26, 2016: Created a Xilinx ISE project and compiled the Verilog modules to see if I had any RTL problems. A few things had to be changed to be synthesizable. Created a new simplified block diagram that is now in my PPT presentation.

Dec 28, 2016: Added more ARM Media instruction microcode, debugged various microcode. Fixes to transfer logic from LLRS to Functional Units. Added simulation debug code to arm_micro.v to abort sim if a certain invalid type of microcode is used (pulled from table). Found a few instances of this problem and corrected logic.

Dec 30, 2016: Overhauling the integer multiplier section of the ALU to make better use of resources and fix problems. Removed one of the multipliers. It now uses only one 16x16 unsigned multiplier and one 32x32 unsigned multiplier. I may post a block diagram here when finished.
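One reason unsigned multipliers can be enough is that a signed product can be derived from an unsigned one with a simple correction. The Python sketch below shows that well-known identity for a 32x32 multiply with a 64-bit result. Whether the ALU described here uses this exact correction is an assumption; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def umul32(a, b):
    """Model of a 32x32 unsigned multiplier with a 64-bit result."""
    return (a & MASK32) * (b & MASK32)


def smul32_via_unsigned(a, b):
    """Signed 32x32 -> 64-bit product (two's complement) using only umul32."""
    a &= MASK32
    b &= MASK32
    product = umul32(a, b)
    # If an operand is negative, its unsigned value is 2**32 too large,
    # so subtract the other operand shifted into the upper half.
    if a & 0x80000000:
        product -= b << 32
    if b & 0x80000000:
        product -= a << 32
    return product & MASK64


print(hex(smul32_via_unsigned(-3, 5)))   # two's complement encoding of -15
```

The same unsigned array can then serve both the unsigned and signed long multiply instructions, with the correction applied only for the signed case.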

Jan 2, 2017: Basically finished with the new integer multiplier section of the ALU. Here's a simple block diagram of what was implemented.
ARMv7 integer multiplier (updated Jan 4). Right click on the pic and save or open in a new tab to see the full resolution.

Jan 3, 2017: Register forwarding updates/fixes. Microcode table updates/fixes. However I'm only getting 258 instructions in 111 clocks = IPC of 2.3 for a 4 issue wide CPU. I can see where the problem lies. Just need to figure it out.

Jan 4, 2017: Found that I had a parameter that was wrong and some of the LLRS units were not getting data. Fixed it and now get 256 instructions in 73 clocks = IPC of 3.5 for a 4 issue wide CPU. There still appears to be another problem slowing it down, but things are getting better. The simulation test used is a random wide variety of instructions with no Load/Store delay penalty and no branch delay penalty, so the IPC should be VERY close to 4 for this 4 issue wide CPU simulation setup. Currently the L/S and branch instructions are decoded and processed up to the Functional Unit related to them, but they are basically discarded until I have time to work on them.

Jan 5, 2017: Spent time debugging/fixing commit.v logic and related ALU problems.

Jan 6, 2017: Spending time determining how to issue multiple micro-ops due to a single CPU instruction.

Jan 10, 2017: To facilitate multiple micro-ops due to a single instruction, the microarchitecture has been changed to the following.

New Out of Order CPU microarchitecture. Right click on the pic and save or open in a new tab to see the full resolution.

Jan 16, 2017: Updates/fixes to forwarding logic have greatly improved IPC. For the test being used (186 instructions completed in 47 clocks, hardware ignoring branches & load/store instructions), the IPC is close to 4 for a 4 instruction wide issue. The LLRS queues are only averaging 1 or 2 instructions in them up until the point where the test has a lot of back to back load/store type instructions. This backup is at least partially due to there being no forwarding logic for load/store instructions yet. There may be some other instructions doing this too. Anyway, the forwarding logic currently implemented seems to be working much better. More work on the PowerPoint presentation about this design.
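The IPC figures quoted in the January entries are just instructions divided by clocks. A small Python check of the three reported runs follows; the helper name and the list layout are only for illustration.

```python
def ipc(instructions, clocks):
    """Instructions per clock for one simulation run."""
    return instructions / clocks


ISSUE_WIDTH = 4   # the 4 issue wide configuration used in these tests

runs = [
    ("Jan 3", 258, 111),
    ("Jan 4", 256, 73),
    ("Jan 16", 186, 47),
]

for label, instructions, clocks in runs:
    value = ipc(instructions, clocks)
    share = 100 * value / ISSUE_WIDTH
    print(f"{label}: IPC = {value:.2f} ({share:.0f}% of the issue width)")
```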

Jan 29, 2017: Updated the PPTX presentation. Worked on a new Load/Store logic. This will soon be integrated into the last stage of the design.

Feb 7, 2017: Nearing completion of writing code for the Load/Store logic. Need to create an L1 cache that will interface to the Load/Store logic.

Feb 12, 2017: Decided to change all code to use SystemVerilog features. Found a couple of coding bugs doing this. Working to get simulations working again. (They worked before converting to SystemVerilog.)

Feb 16, 2017: Conversion to SystemVerilog mostly complete and simulations working again. Separated the ROB and COMMIT code and fixed a couple of bugs caused by both being together in the same module.

Right click on the pic and save or open in a new tab to see the full resolution.