gencode.c Details




This file is currently rather large because of everything it does, and it may need to be divided into multiple files. The purpose of the code in this file is to create CPU-independent code that will eventually be turned into CPU-specific code.



This file, like others, contains a lot of debugging statements that can be used to debug various sections of the code. Shown below is a setup where some are turned on and some off. You should search through the code looking for where each of these different DEBUG_ printf() statements occur.



Source Code - Overview

After the DEBUG_ statements you will see some global variables used in the compiler as well as some local function prototypes.



After that, the function gen_parsed(struct exp_tree *tptr) occurs. This function is passed a pointer to the AST tree created by parser.y. The entire essence of the program parsed should be contained in the AST.



gen_parsed() basically does the following:

1. It creates a variable called fwd_data (forward data) that is a structure of various data that can be used throughout the gen_code() function. More about that later.

2. It makes 3 passes at present. This is so branching addresses can be determined if the CPU allows both short/relative branching and absolute branching.

3. It calls gen_code() which produces all the machine code and/or assembly language code needed for the CPU being targeted.

4. It closes the C source file.

5. It frees up the memory used by the AST - FreeExp(tptr);

6. It frees up the memory used by the Scope Block structures - FreeScopeBlockSymbTab();
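
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual gencode.c source: the struct fields, the NUM_PASSES name and the stubbed gen_code_stub() are made up for the example. Only the three-pass idea and the order of the steps come from the description above.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PASSES 3   /* hypothetical name; the text only says "3 passes" */

struct exp_tree { int node_type; };           /* stand-in for the real AST node */
struct fwd_data { int pass; int num_regs; };  /* hypothetical fields */

/* Stub standing in for the real, recursive gen_code(). */
static void gen_code_stub(struct fwd_data *fd, struct exp_tree *tptr)
{
    (void)tptr;
    fd->num_regs = 0;   /* per-node state in fwd_data gets reset and reused */
}

/* Runs every pass over the tree and returns how many passes were made. */
int gen_parsed_sketch(struct exp_tree *tptr)
{
    struct fwd_data fd = {0, 0};
    int passes_done = 0;

    assert(tptr != NULL);
    for (fd.pass = 1; fd.pass <= NUM_PASSES; fd.pass++) {
        gen_code_stub(&fd, tptr);   /* the same tree is walked on each pass */
        passes_done++;
    }
    /* The real function would now close the file and free the AST and
       the Scope Block structures (FreeExp, FreeScopeBlockSymbTab). */
    return passes_done;
}
```

Walking the same tree more than once lets a later pass use branch target addresses that were only discovered during an earlier pass.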


Further down is the gen_code() function. This function is recursive and is used to traverse the AST (Abstract Syntax Tree) created by parser.y. Notice that this function is passed "fwd_data" and "tptr" and returns data of type "struct gen_data". Notice also that gen_code() is basically one big switch statement, executing the code for a particular kind of AST node. Each switch case (e.g. NODE_CAST, NODE_OP, etc.) is defined in kcc.h, and the node type is saved in each node by the functions in node_functions.c.
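
To make the "recursive function built around a switch" idea concrete, here is a minimal sketch. It is not the real gen_code(): the struct layouts are reduced to one or two hypothetical fields, only two node kinds are handled, and the value 2 stored for an operand is just a placeholder.

```c
#include <assert.h>
#include <stddef.h>

/* Two of the node kinds named in the text; the enum values are illustrative. */
enum node_kind { NODE_OPERAND, NODE_ASSIGN };

struct exp_tree {
    enum node_kind kind;
    struct exp_tree *rh_node;   /* right hand node, NULL for a leaf */
};

struct gen_data { int num_regs; };   /* hypothetical subset of the real struct */
struct fwd_data { int num_regs; };   /* hypothetical subset of the real struct */

struct gen_data gen_code_sketch(struct fwd_data *fd, struct exp_tree *tptr)
{
    struct gen_data gd = {0};

    assert(fd != NULL);
    if (tptr == NULL)
        return gd;

    switch (tptr->kind) {
    case NODE_OPERAND:
        gd.num_regs = 2;             /* placeholder: a leaf reports its registers */
        break;
    case NODE_ASSIGN: {
        struct gen_data gx;
        fd->num_regs = 0;
        /* This case is suspended here until the right hand side is done. */
        gx = gen_code_sketch(fd, tptr->rh_node);
        gd = gx;                     /* result of the child flows back up */
        break;
    }
    default:
        break;
    }
    return gd;
}
```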


Line 179: line_num is declared static so that its value persists while this function is being recursively called. line_num holds the current line number of the C code being processed. Look at node_operand() and other functions in node_functions.c to see when the line number gets stored. Then go back further into parser.y to find this too.


Line 180: gd is a structure of gen_data type and is used throughout this function, as is fwd_data.



Line 182: tc. This variable is defined in parser.y, where it is also used. It's referred to as the Tree Counter and is used in debugging to know what "level" of gen_code() we're at. Each time gen_code() is recursively called, tc increments.


The value of tc can be seen during debug by turning on (#define) DEBUG_GEN_CODE. Below is an example.



For debugging purposes a dgb_gen_code() statement occurs once at the beginning of a NODE_ and once at the end of the NODE_ code. It may occur in the middle of the code as well but should NOT contain the "begin:" or "end" wording. Below is an example.


tc is #1 when gen_code() first starts because NODE_TRANS_EXT is the very top level of the AST. At the end of a PASS, tc should be back to #1 or else some problem has occurred. In the code above, notice the nesting level goes up to #6, comes back down to #2, then goes up to level #7...
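
The behaviour of a nesting counter like tc can be illustrated with a small sketch. This is an assumption-laden toy, not the compiler's code: the node struct and the names tc_level, tc_deepest and walk_sketch are hypothetical, and the counter here starts at 0 before the walk rather than being printed as #1 at the top node.

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *left, *right; };   /* toy tree, not struct exp_tree */

int tc_level = 0;     /* current nesting level, playing the role of tc */
int tc_deepest = 0;   /* deepest level seen so far */

void walk_sketch(const struct node *n)
{
    if (n == NULL)
        return;
    tc_level++;                       /* one level deeper on every call */
    if (tc_level > tc_deepest)
        tc_deepest = tc_level;
    walk_sketch(n->left);
    walk_sketch(n->right);
    tc_level--;                       /* and back up again on the way out */
    assert(tc_level >= 0);
}
```

If the counter does not return to its starting value after a full walk, some code path returned without undoing its increment, which is the kind of problem the check at the end of a pass is meant to reveal.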



In order to understand more of gencode.c, we first need to understand more of the idea behind how this compiler processes variables, etc..

Source Code - Variables/Data Types

C has simple data types of various sizes such as CHAR, INT, SHORT, LONG, LONG LONG, as well as signed and unsigned, etc. Since this compiler may target different CPUs, each of the simple data types may be CPU specific, thus they are defined in cpu_defs.h. For the KPU55 CPU they are defined as follows.



Thus, this C compiler must always use these defines when needing info about these simple data types. The values can then be changed when a different CPU is used, and the cross compiler rebuilt for that CPU.


Similarly there are other CPU specific details defined in cpu_defs.h that the compiler will always use to determine how to use memory, stack, branching, etc.. See the explanation of kcc.h for more details.


Defining CHAR_SZ, etc. is fine, but C uses variables that may differ in size from an actual CPU register. For example, the KPU55 has 16 bit registers (#define REG_WIDTH 16), so how will the generated machine code move a value from one LONG (4 byte) memory location to another LONG (4 byte) memory location, and do it properly for different CPUs? Below is the example C code from test2.c that we'll look at. Ignore the fact that B hasn't been initialized.



Based upon cpu_defs.h, external memory is 2 bytes wide (RAM_BYTE_WIDTH) and a CPU register is 2 bytes wide (REG_WIDTH_BYTES). Memory is also byte addressable (RAM_BYTE_ADDRESSABLE). Thus, based upon the defines, the compiler determines that A will be stored in memory locations 0-3 and B in memory locations 4-7. This requires the CPU to write a 16 bit register value at addresses 0 and 2 for variable A and at addresses 4 and 6 for variable B. Depending on the definitions in cpu_defs.h, the generated code can vary greatly.
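
The address arithmetic described above can be checked with a short sketch. The function name write_addresses() and its interface are hypothetical; only REG_WIDTH_BYTES and the 2 byte register width come from the text.

```c
#include <assert.h>
#include <stddef.h>

#define REG_WIDTH_BYTES 2   /* from the text: a CPU register is 2 bytes wide */

/* Fills addrs[] with the address of each register-wide write needed for a
   variable of byte_cnt bytes that starts at address start.
   Returns the number of writes, or 0 if addrs[] is too small. */
size_t write_addresses(unsigned start, unsigned byte_cnt,
                       unsigned *addrs, size_t max)
{
    size_t n = (byte_cnt + REG_WIDTH_BYTES - 1) / REG_WIDTH_BYTES;
    size_t i;

    assert(addrs != NULL);
    if (n > max)
        return 0;
    for (i = 0; i < n; i++)
        addrs[i] = start + (unsigned)i * REG_WIDTH_BYTES;
    return n;
}
```

For a 4 byte variable at address 0 this yields writes at 0 and 2, and for one at address 4 it yields writes at 4 and 6, matching A and B above. With a different REG_WIDTH_BYTES the same variables would need a different number of writes.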


For this simple program let's look at what happens by looking at the very end of the compile with DEBUG_GEN_CODE turned on. First look at the highlighted line with A = B. Notice that NODE_ASSIGN is called. This is where we'll go in some detail through the code to see how data is handled.


-Exhibit 1-



Line 394 obtains the C source code line number from the node information, and line 396 prints it along with tc. Lines 398 and 399 determine, based on the last source line printed on the Cygwin console, that line 4 of the source (A = B) needs to be printed, as highlighted above.


tptr for NODE_ASSIGN contains info for a left hand node (variable A), an op node (=) and a right hand node (variable B). See node_assign in node_functions.c


Line 401 looks for variable A in the symbol table. Line 402 checks whether find_symbol() was able to find the name or number. In this case it's a name, so if the name is not found the code prints out an error and aborts. Otherwise, for a number, it will save it.


Line 409 gets the starting address for variable A saved earlier in the declaration of variables A and B. In this case the value will be 0.
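
As a rough picture of the lookup and address retrieval just described, consider the sketch below. It does not show the real find_symbol() or the real symbol table layout, neither of which is reproduced here; struct symbol, its fields and lookup_symbol_sketch() are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical symbol table entry: a name plus where the variable lives. */
struct symbol {
    const char *name;
    unsigned addr;       /* starting address assigned at declaration */
    unsigned byte_cnt;   /* size of the variable in bytes */
};

/* Linear search by name. Returns the entry, or NULL when the name is unknown. */
const struct symbol *lookup_symbol_sketch(const struct symbol *tab, size_t n,
                                          const char *name)
{
    size_t i;

    assert(tab != NULL && name != NULL);
    for (i = 0; i < n; i++)
        if (strcmp(tab[i].name, name) == 0)
            return &tab[i];
    return NULL;   /* the caller decides whether this is a fatal error */
}
```

With a table holding A at address 0 and B at address 4, looking up "A" returns an entry whose addr is 0, the value mentioned for line 409.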


In line 410 a value of 0 is stored to num_regs in fwd_data. num_regs will be used/changed to determine the number of CPU registers used to do this operation.


Line 412 checks to see what kind of assignment operator is being used (=, +=, -=, *=, ...). In this case it's just a simple assignment (i.e. =), so on line 420 it saves st_ptr to variable ga.st_ptr. st_ptr is the pointer to the symbol table entry that contains information about A.




On line 421, function gd_reg_alloc(&ga) is called. This function determines how many and which physical CPU registers will be used in this operation. The KPU55 CPU has 8 general purpose registers (DEF_REGS in cpu_defs.h) that can be used. For variable A in this simple C program, it will allocate registers 0 and 1 (R0 and R1). It will also determine that num_regs = 2 (2 writes of 16 bits) and that reg_size = 2. See the printout in the Cygwin window:

gd_reg_alloc: byte_cnt = 4, reg_size = 2, num_regs = 2

gd_reg_alloc: name = A, byte_cnt = 4, num_regs = 2, reg_size = 2
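
The numbers in this printout follow from simple arithmetic: a 4 byte variable split over 2 byte registers needs 2 registers. The sketch below shows one way such an allocator could work. It is not the real gd_reg_alloc(): struct reg_alloc, reg_busy[] and both function names are hypothetical, and DEF_REGS is assumed to hold the register count of 8.

```c
#include <assert.h>
#include <stddef.h>

#define DEF_REGS 8          /* assumed value: 8 general purpose registers */
#define REG_WIDTH_BYTES 2   /* from the text */

struct reg_alloc {          /* hypothetical result record */
    int num_regs;
    int reg_size;
    int reg[DEF_REGS];
};

unsigned char reg_busy[DEF_REGS];   /* 1 while a register is in use */

/* Claims the lowest free registers for byte_cnt bytes. Returns 0 or -1. */
int reg_alloc_sketch(unsigned byte_cnt, struct reg_alloc *out)
{
    unsigned need = (byte_cnt + REG_WIDTH_BYTES - 1) / REG_WIDTH_BYTES;
    unsigned free_cnt = 0;
    int r, got = 0;

    assert(out != NULL);
    for (r = 0; r < DEF_REGS; r++)
        if (!reg_busy[r])
            free_cnt++;
    if (need > free_cnt)
        return -1;                       /* not enough free registers */
    for (r = 0; r < DEF_REGS && (unsigned)got < need; r++) {
        if (!reg_busy[r]) {
            reg_busy[r] = 1;
            out->reg[got++] = r;
        }
    }
    out->num_regs = got;
    out->reg_size = REG_WIDTH_BYTES;
    return 0;
}

/* Marks the registers of a previous allocation as free again. */
void reg_free_sketch(const struct reg_alloc *a)
{
    int i;

    assert(a != NULL);
    for (i = 0; i < a->num_regs; i++)
        reg_busy[a->reg[i]] = 0;
}
```

Freeing the registers right after recording the sizes means a later allocation for another variable can receive R0 and R1 again.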


Lines 421 - 423 save just the info needed and, for now, free up the registers that were allocated.

Lines 426-427 then save this info in fwd_data for use during another node.


Line 430 below then recursively calls gen_code() passing the fwd_data using the Right Hand Node (->rh_node). The right hand node is just variable B and thus the operation of NODE_ASSIGN is suspended here at line 430 until we process the rh_node which is NODE_OPERAND. Notice that we will get return data in gx which we need before we can continue this NODE_ASSIGN.



Now the NODE_OPERAND node gets processed. As you have time, study parser.y to see why A = B creates the node tree in this order.



Line 254 prints the following on the Cygwin console:

gen_code: #8: NODE_OPERAND: begin: line_num = 4


Notice that tc incremented to level 8.


Line 256 will determine that show_line_source_code() doesn't need to be called because we're still on line 4 of the source which has already been printed out.
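
The "print a source line only once" decision can be pictured with the sketch below. The real check around show_line_source_code() may well be more involved, for example printing every line since the last one shown. The names last_shown_line and should_show_line() are hypothetical.

```c
#include <assert.h>

int last_shown_line = 0;   /* 0 means no source line has been printed yet */

/* Returns 1 when line_num has not been shown yet, and remembers it.
   Returns 0 when the same line was already shown. */
int should_show_line(int line_num)
{
    assert(line_num > 0);
    if (line_num == last_shown_line)
        return 0;
    last_shown_line = line_num;
    return 1;
}
```

In the A = B example, NODE_ASSIGN would get 1 for line 4 and the nested NODE_OPERAND would get 0, since it is still on line 4.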


Line 261 will cause the following to get printed on the Cygwin console:

gen_code: #8: NODE_OPERAND: name = B


Because variable B has been defined in the Line 1 declaration of the C program, line 266 will skip execution of lines 267 - 318. As an exercise, determine and see what would happen if B was not declared in the program and line 4 replaced B with a number.


Next, line 319 will print out lvalue on the console. For this example, it's not useful.


Line 323 saves the symbol table pointer for variable B and line 324 saves the is_signed information in gd. gd gets returned when this NODE_OPERAND code completes and gen_code() returns to its caller, which is the suspended NODE_ASSIGN code in gen_code().


Line 325 now allocates CPU registers that will be used to access variable B. These will be register R0 and R1 since all registers are "free" up to this point.


Line 326 prints out more info on the Cygwin console:

gen_code: #8: NODE_OPERAND: lvalue = 4



Line 328 checks st in the symbol table for variable B to see if this is a LABEL, CONSTANT, FUNCTION, MEMORY, etc. These enumeration constants can be found in kcc.h. In this case, it will be ST_MEMORY and lines 337-338 are executed before ending.


Line 337 creates a comment string with the name of the variable B.

Line 338 passes all the information about B to function genREAD. genREAD generates the machine/assembly code that reads the contents of memory location B into two CPU registers (R0 and R1) so that NODE_ASSIGN can use them. At this point NODE_OPERAND is done and gen_code() returns the info to its caller, which is the gen_code() invocation that was in the midst of NODE_ASSIGN.


Line 433 is where we now begin execution. More debug info is printed out on the console window.

gen_code: #7: NODE_ASSIGN: gx.num_regs = 2, gx.reg_size = 2, gx.reg[0] = 0


Lines 435-448 look forward in the AST to see if the rh_node was a NODE_POST_INC or NODE_POST_DEC. In this example it would not be, and we would skip lines 437-440 and lines 444-447. These lines are for when you have a statement like A++ = B or A-- = B, ...





Line 450 checks to see what type of op was used. In this example, case ASSIGN on line 533 is the match. Specific machine code/assembly is produced for the CPU that will write the contents of R0 and R1 (containing B) to the memory locations used to hold variable A.
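
The net effect of the generated code for A = B, a read of B into R0 and R1 followed by a write of R0 and R1 to A, can be simulated in a few lines. This sketch models the behaviour only and generates no code. The word-array memory model, the array sizes and all function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define REG_WIDTH_BYTES 2
#define NUM_REGS 8
#define MEM_BYTES 16   /* tiny demo memory, hypothetical size */

uint16_t mem_words[MEM_BYTES / REG_WIDTH_BYTES];   /* 2 byte wide memory */
uint16_t cpu_reg[NUM_REGS];                        /* 16 bit registers */

static void check_args(unsigned addr, int first_reg, int num_regs)
{
    assert(num_regs >= 0 && first_reg >= 0);
    assert(first_reg + num_regs <= NUM_REGS);
    assert(addr % REG_WIDTH_BYTES == 0);
    assert(addr + (unsigned)num_regs * REG_WIDTH_BYTES <= MEM_BYTES);
}

/* Models the effect of the read: memory words into consecutive registers. */
void read_sketch(unsigned addr, int first_reg, int num_regs)
{
    int i;
    check_args(addr, first_reg, num_regs);
    for (i = 0; i < num_regs; i++)
        cpu_reg[first_reg + i] = mem_words[addr / REG_WIDTH_BYTES + i];
}

/* Models the effect of the write: consecutive registers into memory words. */
void write_sketch(unsigned addr, int first_reg, int num_regs)
{
    int i;
    check_args(addr, first_reg, num_regs);
    for (i = 0; i < num_regs; i++)
        mem_words[addr / REG_WIDTH_BYTES + i] = cpu_reg[first_reg + i];
}

/* dst = src for a variable that occupies num_regs registers. */
void assign_sketch(unsigned dst, unsigned src, int num_regs)
{
    read_sketch(src, 0, num_regs);    /* right hand side first */
    write_sketch(dst, 0, num_regs);   /* then store to the left hand side */
}
```

Calling assign_sketch(0, 4, 2) copies the two words of B at addresses 4 and 6 to the two words of A at addresses 0 and 2.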




The info returned in gd by gen_code() is a copy of gx, which was set earlier in the code.


Where do we go from here?

Throughout gencode.c you will find calls to functions that begin with the prefix "gen", such as genREAD, genWRITE, genLD_REG, genAND, genBOR, genTST_BZ_LABEL, etc., which generate CPU specific code. The idea behind these functions is to have non-CPU specific functions that create code by eventually calling CPU specific functions defined by the user. gencode.c contains other functions called from gen_code() that should be studied. The genXXXX functions can be found in genINSTR.c, which is the next code to look at.
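
One common way to structure "non-CPU specific functions that call user-defined CPU specific functions" is a table of function pointers. The sketch below illustrates that general idea only. It is not taken from genINSTR.c, and struct cpu_backend, demo_emit_read(), gen_read_sketch() and the made-up "READ" output format are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-CPU hook table supplied by whoever ports the compiler. */
struct cpu_backend {
    int (*emit_read)(char *buf, size_t n, int reg, unsigned addr);
};

/* Demo backend: writes a made-up, non-KPU55 instruction as text. */
int demo_emit_read(char *buf, size_t n, int reg, unsigned addr)
{
    return snprintf(buf, n, "READ r%d <- mem[%u]", reg, addr);
}

/* CPU independent entry point: validates, then defers to the backend. */
int gen_read_sketch(const struct cpu_backend *cpu, char *buf, size_t n,
                    int reg, unsigned addr)
{
    assert(cpu != NULL && cpu->emit_read != NULL);
    assert(buf != NULL && n > 0);
    return cpu->emit_read(buf, n, reg, addr);
}
```

With this arrangement the tree-walking code never needs to know the target's instruction set. Retargeting means supplying a different table.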