Welcome to HDL Express, the personal webpages of Kirk Weedman
HDL stands for Hardware Description Language.
This website also contains information on various Verilog/FPGA tutorials, Alternative Energy projects, and my progress on new CPU architectures that I'm designing.
I'm an electronic design engineer specializing in contract Verilog RTL FPGA design, functional verification, simulation and debug. I have a varied background in other disciplines too.
Currently available for new FPGA design contract work.
Below is my progress on the new CPU architectures I'm working on.
1. Nov. 2015: Working on a patent "A Novel Concept to Eliminate Branch MisPredictions in Pipelined CPUs".
So far I can't find an existing patent that describes a method like this. This method is very practical for FPGAs, and maybe even for modern CPUs. It should improve CPU performance, since branch mispredictions can cause modern CPUs to lose significant performance. I'm currently designing a new 5 stage pipeline CPU that will demonstrate the method and show how smooth CPU branches could be in the future. A PowerPoint presentation describing two methods is finished, and I'm going over it again before working on documentation and maybe filing a patent. Whether I decide to get a patent will depend on feedback from other engineers and engineering professors I know.
The new logic/pipeline smoothly transitions to a branch-taken or branch_not_taken at the end of the Execute stage, without stalls (caused by the branch instruction) or flushing the pipeline. It will smoothly switch even if there are multiple back-to-back branches in the pipeline or even if there is a branch-taken in the current pipeline sequence followed by an immediate branch-taken in the next instruction sequence. It's a unique design as far as I can tell.
The method is applicable to deeper CPU pipelines as well. These two methods are mostly hardware with a minimum of one new CPU instruction. There are no software techniques used or needed such as loop unrolling, or other branch misprediction optimization techniques.
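To put a rough number on why mispredictions matter, the usual textbook estimate is: effective CPI = base CPI + (fraction of branches) x (misprediction rate) x (flush penalty in cycles). The sketch below just evaluates that formula. The function name and all the numbers are illustrative assumptions, not measurements from my design.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Textbook estimate of cycles per instruction when some branches
    are mispredicted and each misprediction costs a pipeline flush."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty


# Assumed workload: 20% of instructions are branches, 10% of those mispredict.
short_pipe = effective_cpi(1.0, 0.20, 0.10, flush_penalty=3)   # shallow pipeline
deep_pipe = effective_cpi(1.0, 0.20, 0.10, flush_penalty=15)   # deep pipeline

print(f"shallow pipeline CPI: {short_pipe:.2f}")
print(f"deep pipeline CPI:    {deep_pipe:.2f}")
```

With these assumed numbers the penalty term grows in proportion to pipeline depth, which is why removing the flush altogether is attractive for deeper pipelines.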
Jan 4, 2016: First working RTL simulations.
Jan 8, 2016: Continued testing/debugging in Verilog/ModelSim simulation. Methods 1 and 2 seem to be working well so far. I'm also investigating a new technique (Method 3) that could be used to redesign pipelined processors and would also work with existing compiled code (no need for the new instruction that Methods 1 & 2 use). It looks promising. It appears I have one main hurdle, but I think it's doable.
Jan 10, 2016: Preparing a PowerPoint presentation to explain Methods 1 & 2 & show working simulation to 3 engineering university professors in WA State to get their input/feedback.
Jan. 17, 2016: Working on implementing Method 3 (no new instruction needed as in previous methods). Working on Verilog code for new pipelined CPU with Method 3 as well as a simulation testbench.
Jan. 25, 2016: I will admit that under certain conditions there may be a stall of a few clocks, but nothing like flushing a pipeline due to a misprediction. There are no mispredictions in this design. There are also certain conditions, depending on the type of CPU, where some instructions that follow a branch can execute faster than normal. (I'll have to explain that one day.) Currently debugging Method 3 in simulation.
Jan. 28, 2016: First working simulation of Method 3 in a new 5 stage pipeline (no extra CPU instruction like the one used in Methods 1 & 2). This doesn't mean it's all debugged yet. The simulation included a loop and some IF/ELSE statements inside the loop. I will need to simulate different tests. I discovered that the CPU as designed should really have another stage between Fetch and Decode. Method 3 adds some logic with an adder that currently sits in the Decode stage, and the Decode stage needs the results of that adder, so it lengthens the Decode delay in an FPGA/ASIC. Moving the adder into a new stage before Decode would greatly reduce that delay. Putting it in Fetch is not good either, since it would lengthen the Fetch stage delay. To save time in showing that Method 3 works, I haven't done this, but hope to someday. I'm more concerned about getting the method to work than designing a new CPU right now.
Feb 9, 2016: Modifying/debugging ISA. Most of the instructions appear to be working. Still slower than it should be due to no forwarding logic yet. Also fixes/changes/additions to the C compiler that generates the code. These changes will allow me to write more C code for testing. Pipeline still fixed at 5 stages for now. Began working on some forwarding logic.
Feb. 15, 2016: Added register forwarding. Will work on adding other types of forwarding. Started creating technical drawings. Made a Visio drawing of the pipeline showing the new logic for branching as well as forwarding logic and control signals used in the pipeline.
Feb. 22, 2016: Fixes/changes to KPU55 ISA and KCC. Updates to PPT presentation. Created a 29 page Word document (from the PPT presentation) explaining Methods 1 & 2. Still need to add Method 3 to PPT and Word docs. Word doc is long because it has LOTS of diagrams/illustrations.
Mar. 1, 2016: Updated the Word document and the PowerPoint presentation to include Method 3. Researching how these methods may apply to OoOE processors.
Mar. 2, 2016: In researching how these branch methods may apply in OoOE processors, I've gotten sidetracked today with a new algorithm I came up with for OoOE that is easier to implement than Tomasulo's. I've begun creating a new OoOE CPU to see if it's any good.
Mar. 9, 2016: Modified ISA for both KPU55 and KPU_OoOE. Working on creating KPU_OoOE CPU with Branch Method 3. Modifications to the DECODE stage so that it may be easy to change it to use microcode instead of brute force decoding for every instruction. It would also facilitate changing over to a different ISA, such as an ARM Cortex. The microcode method will reduce the number of logic gates in the DECODE stage and thus reduce delays in the stage. Might be nice to show this CPU running on an ARM instruction set. Anyone have any suggestions?
Mar. 17, 2016: Decided to go with the ARM7-TDMI ISA for this new Out of Order design.... I know. Even though my OoOE internals are very different from ARM's, I guess I can't release the CPU with this ISA. I also have some of the basic modules created and they finally, tonight, compile without errors. Currently setting up a test bench and top level to integrate all the modules and begin debugging. First test will be just to get the basic flow working for all the Data Processing type instructions and see that my new style of tags & Reorder logic works. I also set up IAR tools for ARM to generate binary code to use in my Verilog simulations.
Mar 22, 2016: Just found out during simulation that my new OoOE architecture has a design problem. So back to the drawing board on that idea. Still working on a new architecture based on the ARM7-TDMI ISA.
April 2, 2016: Created a disassembler (for a portion of the ISA) to display the assembly language instructions during ModelSim simulations. Also looking at another OoOE architecture idea. However, I now have to do some other work for the next 2 - 3 months before I can continue on this project.
June 24, 2016: Slowly starting to work on the new ARM OoOE architecture idea I have as time permits. Just completed writing a new module (a linked list type of Reservation Station) that I will need to make sure is synthesizable and then simulate. I finally found a patent (owned by Qualcomm) that may have some similarity to my design, but it doesn't have features my design method does.
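As a rough idea of what disassembling "a portion of the ISA" involves, here is a minimal Python sketch for the ARM Data Processing encoding (condition in bits 31-28, immediate flag in bit 25, opcode in bits 24-21, S in bit 20, Rn/Rd in bits 19-12). It is not the disassembler used in my simulations; the function name is made up for illustration, and it deliberately gives up (returns None) on shifted-register operands, multiplies and anything outside this encoding.

```python
# Data Processing mnemonics indexed by opcode (bits 24..21).
OPCODES = ["AND", "EOR", "SUB", "RSB", "ADD", "ADC", "SBC", "RSC",
           "TST", "TEQ", "CMP", "CMN", "ORR", "MOV", "BIC", "MVN"]
# Condition suffixes indexed by bits 31..28 (AL prints as no suffix).
CONDS = ["EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
         "HI", "LS", "GE", "LT", "GT", "LE", "", "NV"]


def disassemble_dp(word):
    """Return assembly text for a simple Data Processing instruction,
    or None if the word is not one this sketch handles."""
    if (word >> 26) & 0x3 != 0:
        return None                      # not the Data Processing space
    cond = CONDS[(word >> 28) & 0xF]
    opcode = (word >> 21) & 0xF
    s_bit = (word >> 20) & 1
    rn = (word >> 16) & 0xF
    rd = (word >> 12) & 0xF
    if (word >> 25) & 1:                 # immediate: 8 bits rotated right
        rot = ((word >> 8) & 0xF) * 2
        imm8 = word & 0xFF
        value = ((imm8 >> rot) | (imm8 << (32 - rot))) & 0xFFFFFFFF
        op2 = f"#{value}"
    else:
        if word & 0xFF0:
            return None                  # shifts, multiplies, etc. not handled
        op2 = f"r{word & 0xF}"
    name = OPCODES[opcode]
    if 8 <= opcode <= 11:                # TST/TEQ/CMP/CMN: no Rd, S implied
        if not s_bit:
            return None                  # these encodings are not compares
        return f"{name}{cond} r{rn}, {op2}"
    suffix = cond + ("S" if s_bit else "")   # pre-UAL order, e.g. SUBNES
    if opcode in (13, 15):               # MOV/MVN ignore Rn
        return f"{name}{suffix} r{rd}, {op2}"
    return f"{name}{suffix} r{rd}, r{rn}, {op2}"


for w in (0xE0810002, 0xE3A00005, 0xE1510002):
    print(f"{w:08X}  {disassemble_dp(w)}")
```

In a simulation flow, a table like this would be applied to each fetched instruction word so the waveform or log shows mnemonics instead of raw hex.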
July 8, 2016: Worked more on a module that will be the equivalent of a reservation station for the new ARM OoOE cpu.
July 22, 2016: Finished initial design of the LLRS (Linked List Reservation Stations) module (llrs.v) and have most of the Reg. Rename (reg_rename.v) module written that will do register renaming logic for the new ARM OoOE cpu.
// Example: N = 4 wide instruction fetch
| 0 | 1 | 2 | 3 | ---> | Decode & Tag #0 | ---> | Reg. Rename #0 | ---> | LLRS #0 | ---> | Issue/Execute |
                       | Decode & Tag #1 | ---> | Reg. Rename #1 | ---> | LLRS #1 | ---> | ROB/commit    |
                       | Decode & Tag #2 | ---> | Reg. Rename #2 | ---> | LLRS #2 | ---> | ----->        |
                       | Decode & Tag #3 | ---> | Reg. Rename #3 | ---> | LLRS #3 | --->
Decode & Tag - Tags are assigned a simple incrementing number (i.e. 0-255, 0- ...) The Tag "magic" happens in LLRS and ROB/commit.
Reg. Rename - The number of physical registers is parameterized and can be changed to get the best performance per application. Currently it's set to 64. reg_rename.v uses a linked list to keep track of which physical registers currently hold the architectural registers defined by the ISA.
LLRS - a new type of Linked List Reservation Station. The first version will not be superscalar as shown above, but will be designed & parameterized so that it can become superscalar. There are two linked lists that share a common pool of Reservation Stations. One linked list holds the "used" pool items and the other holds the "unused" pool items. Only 1 clock cycle is needed by the module due to the way the linked lists are maintained. The "used" linked list keeps ready instructions in order from oldest to youngest so that the oldest can be passed on to Issue/Execute first.
Issue/Execute - contains a pool of execution resources (non floating point for the first version) usable by all available instructions issued per clock cycle from the LLRS module(s). Multiple instructions (more than N) can be passed per clock to the ROB/Commit.
ROB/Commit can commit multiple (more than N) ready instructions per clock.
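The two-list idea behind the LLRS can be sketched in software. The Python model below is only a behavioral illustration under my own assumptions (class and method names are made up, and it is not llrs.v): a fixed pool of slots is threaded onto a "used" list kept in age order and an "unused" (free) list, and the oldest ready entry is issued first. The loops here walk the list sequentially; the hardware does its bookkeeping in a single clock.

```python
class LinkedListRS:
    """Behavioral sketch of a reservation-station pool kept on two linked lists."""

    def __init__(self, size):
        assert size >= 1
        self.next = list(range(1, size)) + [None]  # all slots start on the free list
        self.free_head = 0
        self.used_head = None       # oldest entry
        self.used_tail = None       # youngest entry
        self.tag = [None] * size
        self.ready = [False] * size

    def allocate(self, tag, ready=False):
        """Move a slot from the free list to the tail of the used list."""
        if self.free_head is None:
            return False            # pool is full
        slot = self.free_head
        self.free_head = self.next[slot]
        self.tag[slot], self.ready[slot] = tag, ready
        self.next[slot] = None
        if self.used_tail is None:
            self.used_head = slot
        else:
            self.next[self.used_tail] = slot
        self.used_tail = slot
        return True

    def mark_ready(self, tag):
        """Flag the entry with this tag as having all operands available."""
        slot = self.used_head
        while slot is not None:
            if self.tag[slot] == tag:
                self.ready[slot] = True
                return
            slot = self.next[slot]

    def issue_oldest_ready(self):
        """Unlink the oldest ready entry, return its tag, and free its slot."""
        prev, slot = None, self.used_head
        while slot is not None:
            if self.ready[slot]:
                if prev is None:
                    self.used_head = self.next[slot]
                else:
                    self.next[prev] = self.next[slot]
                if slot == self.used_tail:
                    self.used_tail = prev
                self.next[slot] = self.free_head
                self.free_head = slot
                return self.tag[slot]
            prev, slot = slot, self.next[slot]
        return None
```

Because freed slots go straight back onto the free list, the pool never needs compaction, and age order falls out of always appending new entries at the tail.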
August 24, 2016: Spent several hours (over the month since my last report) working on the Register Rename block above, but discovered a new method that looks promising. For now I'm calling it Hazard Control. Like register renaming methods, it eliminates WAW & WAR hazards. It keeps track of instruction dependencies but doesn't rename registers and/or allocate a new set of physical registers. So far, the logic appears to be simpler than typical renaming techniques. The goal is something simpler (in terms of logic levels, which determine the minimum clock period) because I want to eventually make the issue width much larger than 4. I hope to be able to devote most of my time to this starting by the end of October. For the last 4 months I've had other obligations. :(
| 0 | 1 | 2 | 3 | ---> | Decode & Tag #0 | ---> | Hazard Control | ---> | LLRS #0 | ---> | Issue/Execute |
                       | Decode & Tag #1 | ---> |                | ---> | LLRS #1 | ---> | ROB/commit    |
                       | Decode & Tag #2 | ---> |                | ---> | LLRS #2 | ---> | ----->        |
                       | Decode & Tag #3 | ---> |                | ---> | LLRS #3 | --->
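For readers less familiar with the hazard names used above, the short Python sketch below classifies the hazards between an older and a younger instruction from their destination and source registers. It only illustrates the standard definitions of RAW, WAR and WAW; it is not the Hazard Control logic itself, and the `Instr` type and `hazards` function are names made up for this example.

```python
from collections import namedtuple

# dst: destination register name (or None), srcs: tuple of source register names
Instr = namedtuple("Instr", ["dst", "srcs"])


def hazards(older, younger):
    """Return the set of hazards the younger instruction has on the older one."""
    found = set()
    if older.dst is not None and older.dst in younger.srcs:
        found.add("RAW")   # true dependency: younger reads what older writes
    if younger.dst is not None and younger.dst in older.srcs:
        found.add("WAR")   # name dependency: younger overwrites an older source
    if older.dst is not None and older.dst == younger.dst:
        found.add("WAW")   # name dependency: both write the same register
    return found


add = Instr(dst="r1", srcs=("r2", "r3"))   # r1 = r2 + r3
sub = Instr(dst="r2", srcs=("r1", "r4"))   # r2 = r1 - r4
mov = Instr(dst="r1", srcs=("r5",))        # r1 = r5

print(sorted(hazards(add, sub)))   # RAW on r1, WAR on r2
print(sorted(hazards(add, mov)))   # WAW on r1
```

Only RAW is a real data dependency. WAR and WAW come from reusing register names, which is why either renaming or a scheme like Hazard Control can remove them without changing program results.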
2. Researching/designing/learning to create a new C compiler to target brand new CPUs (one of which I'm also designing) for embedded FPGA designs.