This is a short tutorial on how to run the UCLA_DRE tool. It shows how to run an example with different netlist formats (SPICE and ASX) and gives some evaluation examples for different design rules (DRs) and layout styles.

STEP 1 - SETUP
==============
You need to set some environment variables as indicated below:
LD_LIBRARY_PATH  <OA Installation Directory>/lib/linux_rhel30_64/opt
PATH  <OA Installation Directory>/bin/linux_rhel30_64/opt
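
A minimal sketch of these settings for a bash shell is shown below. OA_HOME is a hypothetical variable name and /opt/oa is a placeholder path; replace it with your own OpenAccess installation directory.

```shell
# OA_HOME is a hypothetical variable; point it at your OpenAccess installation.
OA_HOME="${OA_HOME:-/opt/oa}"

# Prepend the OpenAccess directories, keeping any entries that are already set.
export LD_LIBRARY_PATH="$OA_HOME/lib/linux_rhel30_64/opt${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PATH="$OA_HOME/bin/linux_rhel30_64/opt:$PATH"
```

Placing these lines in your shell startup file avoids repeating them in every session.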

You need to set up the input files required by the tool, including one file that contains all design rules and process parameters and another file that contains the number of instances of every cell in the design. Sample files are /example/dr/freePDK45_LI_finfet and /example/spice/spice_instances. We recommend using the same files and replacing the DR values with the ones corresponding to the process technology that you would like to test. In any case, the exact syntax and order of the DRs must be followed.
Please refer to the SPICE EXAMPLE section of this tutorial for an example, or to the runDRE script in the exec folder.
If the netlist is from a previous technology generation, you can scale down all transistor sizes by setting the "Lib Scaling factor" in the DRs file. Note that only the transistor width is scaled, since the transistor length is defined by the gate poly width rule (rule 2.1.1).

STEP 2 - RUN
============
The tool runs with the following command:
$ ./UCLA_DRE [OPTIONS] config_filename inst_filename DRs_filename oaLibraryName netlist_filename(s)

If you set the option to output the GDS file of the cells used in the evaluation (with --gds), then use the command "$ ./stream oaLibraryName gds_filename" to have the output GDS file in "exec/gds_output/gds_filename.gds".

All arguments are required and must be entered in the order shown in the command line above.

config_filename:   contains configuration information about the netlist and other parameters, including:
                   netlist_format (spice or asx), vdd/gnd names, pmos/nmos models (to determine transistor
                   type), the order in which terminals appear in the transistor definition, unit, the
                   maximum number of chaining iterations (8 is recommended but can be increased if you
                   notice the chaining solution is not optimal for some cells), and parameters that model
                   intra-cell routing efficiency (precomputed for Nangate open cell-library cells, please
                   do not modify). More details can be found in the sample config files
                   /example/spice/config and /example/asx/config_asx. Please follow the same syntax, i.e.:
                   parameter : value #(one per line)

inst_filename:     contains the number of instances of each cell type. You can enter an empty (or non-existing)
                   filename to set all instance counts to 1. Otherwise, the syntax of the file
                   /example/spice/spice_instances has to be followed exactly, i.e.:
                   Cellname : num_of_instances (one cellname per line in any order)
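
As an illustration of the instance-file syntax, the sketch below writes a small file and counts the lines that do not match "Cellname : integer". The cell names and counts are made-up examples, and the pattern is only an approximation of what the tool accepts.

```shell
# Write a small instance-count file; cell names and counts are made-up examples.
inst_file="$(mktemp)"
cat > "$inst_file" <<'EOF'
INVX1 : 120
NAND2X1 : 45
OAI211X4 : 3
EOF

# Count lines that do not look like "Cellname : integer" (0 means well formed).
# grep -c exits non-zero when the count is 0, hence the "|| true".
bad_lines=$(grep -c -v -E '^[^[:space:]:]+[[:space:]]*:[[:space:]]*[0-9]+[[:space:]]*$' "$inst_file" || true)
echo "malformed lines: $bad_lines"
```

The cell names must match the cell names in the netlist files passed to the tool.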

DRs_filename:      contains the list of design rules, process parameters, and layout styles. The syntax of file 
                   /example/dr/freePDK45_LI_finfet has to be followed exactly, i.e.:
                   DR_name = DR_value (one DR per line in the SAME order as in the example)

netlist_filename:  hierarchical circuit definitions and extracted netlists are NOT supported. Make sure to 
                   include details on the netlist format in /example/spice/config for SPICE format and 
                   /example/asx/config_asx for ASX file.

                   Connectivity between two transistors is specified by having exactly the same node
                   name appear in both transistors.
                   
                   In case of SPICE format, the syntax is standard, i.e.:
                   .subckt CELLNAME etc...
                   m0 t1 t2 t3 t4 fet_model w=value l=value 
                   m1 t5 t2 t6 t7 fet_model w=value l=value //(Here m0 and m1 are connected through node t2)
 
                   .subckt and CELLNAME are required to determine the name of the cell. The default order of 
                   terminals is drain gate source bulk, however this is configurable with the config file.

                   In case of ASX format, the syntax should be as follows: 
                   p0 = model x (t1,t2,t3,t4)(...pl=value, pw=value,etc...) //separators between terminals
                   can be ',' or '-'. To determine the cell name, the keyword "cellname: <name>" must precede
                   the ELEMENTS definition, either inline or in a comment.

                   In both formats, single or multiple netlist files can be used for different cells (an entire
                   cell library can be in a single file), but the cell name must precede the definition of the
                   cell elements as described above.

                   The unit can be specified right after the "value" (attached or separated with white spaces).
                   This will set the unit of the parameter on top of the unit specified in the "config" file
                   ("um" in netlist and "m" in config would result in "um" as the final unit). In case the unit
                   is not specified in the netlist, the final unit would be set to the one specified in the 
                   "config" file. 

[OPTIONS]

 -h  --help             Print this help
 -v  --version          Print version number
     --verbose          Run tool in verbose mode
 -r  --runtime          Display runtime information
     --m1explore        Explore M1 half pitch
     --gdexplore        Explore gate-to-diffusion spacing (GD) rule
     --gcexplore        Explore gate-to-contact spacing (GC) rule
     --gds              Output layout (in OpenAccess format)
 -e  --elec             Estimate delay of cells
     --chip             Do chip area estimation
 

You can use the command "./UCLA_DRE -h" or "./UCLA_DRE --help" to display usage information at any time.

LAYER MAP for Output GDS file
=============================
The layer map below identifies the layers in the output GDS file of DRE.
RX      1
PC      9
V0      10
M1PIN   11
M2PIN   13
RXFIN   21
CA      22
CB      23

NOTE
=====
Chip-level estimation and .lib generation features are still under development. Results may be unreliable, especially for FinFETs.
The config file has a parameter called n_Cores_per_die (number of cores per die). It is treated as the number of copies of the given design that reside on the die. It is needed because, for relatively small benchmarks, the yield values are almost 100% and the effect of design rules on realistic yield values is not visible. We therefore assume that the design is replicated n_Cores_per_die times. If your benchmark is already large enough, you can set n_Cores_per_die=1.

SPICE EXAMPLE
=============
You can run the tool from the exec folder with the following commands (FROM BASH SHELL):

$ ./UCLA_DRE --gds --lef ../example/spice/config ../example/spice/spice_instances ../example/dr/freePDK45_LI_finfet DesignLib ../example/spice/invx1.sp ../example/spice/nand2x1.sp ../example/spice/oai211x4.sp
$ ./stream DesignLib gds_filename

Output gds will be in "exec/gds_output/gds_filename.gds".

Also, if writing the GDS file is not desired, you can use the same command without --gds option (./stream is not necessary in this case).

ASX EXAMPLE
===========
You can run the tool from the exec folder with the following commands (FROM BASH SHELL):

$ ./UCLA_DRE --gds ../example/asx/config_asx ../example/asx/asx_instances ../example/dr/freePDK45_LI_finfet DesignLib ../example/asx/asx_inv_x1 ../example/asx/asx_nand2_x1
$ ./stream DesignLib gds_filename

Output gds will be in "exec/gds_output/gds_filename.gds".

Also, if writing the GDS file is not desired, you can use the same command without --gds option (./stream is not necessary in this case).

HOW TO GENERATE .lib
====================
.lib generation works by scaling a reference .lib after estimating delay. The scaling factors are the ratios of the delay values computed for the design rules under test to the delay values computed for the reference design rules, for which a reference .lib is available. Note that this approach is acceptable for the purpose of relative assessment of design rules.
1- Run DRE with the required cells and with the -e (--elec) option, without the --lib option. This run should be performed with a design rules file that is considered a reference. It should be a good match to the reference .lib file.
2- DRE will output a "DelayReport" file in the project directory.
3- Rename this file to DelayReportOrig and keep it in the same place.
4- Change the reference .lib file entry in the config file as follows:
   Liberty_File_Path : <write path to reference .lib here>
5- Run DRE again for the cells you want with the --lib option.
6- The generated .lib file will be DRE_out.lib.
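
The .lib generation steps above can be sketched as a shell function. This is only an outline under several assumptions: DRE, CFG, INST, LIB and NETLISTS are hypothetical shell variables, DelayReport is assumed to be written to the current working directory, and the second run is assumed to use the design rules under test. Whether the second run also needs --elec is not stated in this tutorial.

```shell
# DRE and the argument variables are hypothetical; set them to match your setup.
DRE="${DRE:-./UCLA_DRE}"

generate_lib() {
    # $1 = reference DRs file, $2 = DRs file under test.
    # Reference run with delay estimation; DRE is expected to write DelayReport.
    # NETLISTS is left unquoted on purpose so it splits into several filenames.
    "$DRE" --elec "$CFG" "$INST" "$1" "$LIB" $NETLISTS || return 1
    # Keep the reference delays under the name used for the scaling step.
    mv DelayReport DelayReportOrig || return 1
    # Editing the config is manual; only check that the entry is present.
    grep -q 'Liberty_File_Path' "$CFG" || {
        echo "Liberty_File_Path missing in $CFG" >&2
        return 1
    }
    # Second run with --lib; the result is expected in DRE_out.lib.
    "$DRE" --lib "$CFG" "$INST" "$2" "$LIB" $NETLISTS
}
```

A call would look like "generate_lib ref_DRs test_DRs" after setting CFG, INST, LIB and NETLISTS, where both DRs filenames are placeholders.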

RUNNING EVALUATIONS
====================
 - EXAMPLE DIFFUSION VS. M1 POWER STRAPS
-----------------------------------------
Perform a first run with rule 8.2.1 in /example/dr/freePDK45_LI_finfet set to 0 for M1 power straps. Store the results displayed by the tool. Now, perform a second run with rule 8.2.1 in /example/dr/freePDK45_LI_finfet set to 1 for diffusion power straps. Store the results and compare with results of first run.

Most other evaluations including 1D vs 2D poly, fixed gate pitch vs not, and different design rule values are performed in a similar fashion.
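
A two-run comparison of this kind can be sketched as follows. The sketch assumes that you have already prepared two copies of the DRs file by hand (one with rule 8.2.1 set to 0, one with it set to 1) and that the tool prints its results to standard output. DRE, CFG, INST, LIB and NETLISTS are hypothetical shell variables.

```shell
# DRE and the argument variables are hypothetical; set them to match your setup.
DRE="${DRE:-./UCLA_DRE}"

run_and_log() {
    # $1 = label for the log file, $2 = DRs file to evaluate.
    # NETLISTS is left unquoted on purpose so it splits into several filenames.
    "$DRE" "$CFG" "$INST" "$2" "$LIB" $NETLISTS > "results_$1.log" 2>&1
}

compare_styles() {
    # $1 = DRs file with rule 8.2.1 set to 0 (M1 power straps),
    # $2 = DRs file with rule 8.2.1 set to 1 (diffusion power straps).
    run_and_log m1_straps "$1" || return 1
    run_and_log diff_straps "$2" || return 1
    # diff exits with status 1 when the two result logs differ.
    diff results_m1_straps.log results_diff_straps.log
}
```

The same pattern applies to any other pair of DRs files that differ in a single rule.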

 - LIST OF LAYOUT STYLES THAT CAN BE EVALUATED WITH CURRENT VERSION OF DRE
-----------------------------------------------------------------------
These can be set using rules 7.1.1 to 11.5 in the DRE input rules file.
Some layer definitions:
CB = LI layer in horizontal direction to perform poly-to-poly routing
CA = LI layer in vertical direction to connect fins and possibly connect to power rail
RX = diffusion/active
PC = Poly
V0 = contact to CB and CA
 
7.1.1 to 7.1.8: 	power rails width on M1, M2, CB, and/or RX
7.1.9: 			enabling/disabling non-shared power rail (i.e. on M2 and over RX) -- currently inactive and being coded into DRE.
7.2a/b/c: 		specify cell height
7.4.1/2 and 8.7.1/2: 	can set pins with width larger than min metal width
8.1.1 to 8.1.4: 	poly layout styles: fixed pitch, 1D/2D, limited routing just to connect direct neighboring polys, and enabling/disabling CB.
8.1.5   		enabling/disabling multiple fin spacing, i.e. dummy fins may have a different fin pitch than active fins. If set to 1, you need to provide max/min/step size for the fin pitches (rules 1.1.2b/c/d)
8.1.6   		enabling/disabling poly/diffusion tuck under
8.2.1/2   		Power-straps with diffusion/CA or with M1
8.2.4   		No need for CA for single-finger devices
8.2.5   		No need for CA for stacked devices
8.3.1/2 		set the vertical location of PMOS/NMOS: 0 -> aligned toward top of the cell, 1 -> aligned to center of P/N networks, 2 -> aligned to the p/n interface or center of cell
8.4     		Extend poly line-end extension if room is available
8.5     		Discrete cell width
8.6, 8.6.1/2    	enable/disable redundant contacts
8.8.1   		enable/disable CA routing from nmos to pmos (only straight vertical connections) -- currently does not optimize transistor locations to maximize this
8.9.1   		V0 connects directly to poly (CB needed to connect to PC)
8.9.2   		Trench contacts for poly (no need for poly pad around poly contact)
9.1.1   		Allow pairing of transistors with different gate signals (imperfect pairing)
9.1.2   		Allow poly cross coupling (special case of imperfect pairing)
10.1.1  		FinFET or MOSFET
10.1.2  		Tapered devices, i.e. jogs on RX (0->forbidden, 1->allowed, 2->forbidden but perform W increase to level the RX)
11.1.1  		Cell right/left boundaries (0->cell starts/ends at half poly space, 1->at center of poly line)
11.1.2  		Importance of placing power straps at cell boundary (w.r.t. abutment) -- currently inactive
11.2    		No contacts in power rail (no V0)
11.3    		Pins on M1/M2 layers (0->on M1, 1->on M2, 2->on either)
11.4    		Vertical metal on track
11.5    		Maximize pin length if room allows -- currently inactive

One evaluation that requires a slightly different procedure is the evaluation of redundant contacts. The current evaluation is single contact versus multiple contacts (2 for contact doubling). Currently, we assume that redundant contacts can always fit. Since this is not always true for diffusion contacts (transistors might be too small to fit 2 contacts), we increase the size of transistors in the first run of this test to make sure contacts can fit. To perform this evaluation, perform a first run with rule 8.6 in /example/dr/freePDK45_LI_finfet set to 1 and rules 8.6.1 (number of diffusion contacts) and/or 8.6.2 (number of poly contacts) set to 1. Store the results displayed by the tool. Now, perform a second run with rule 8.6 again set to 1 and rules 8.6.1 and/or 8.6.2 set to 2 (or more). Store the results and compare them with the results of the first run. If the cell area increases between the two runs, the increase is caused by the higher M1 congestion from the redundant contacts.


RUNNING EXPLORATIONS
====================
Currently, the tool supports the exploration of the gate poly-to-diff spacing (rule 2.3.1) using the gdexplore switch.

For the exploration, run UCLA_DRE with the gdexplore switch and you will be asked to provide the maximum and minimum values of the DR under exploration as well as the step between the explored values. For example (FROM BASH SHELL),

$ ./UCLA_DRE --gdexplore ../example/spice/config ../example/spice/spice_instances ../example/dr/freePDK45_LI_finfet DesignLib ../example/spice/invx1.sp ../example/spice/nand2x1.sp ../example/spice/oai211x4.sp
  

FITTING OF PARAMETERS OF THE CONGESTION ESTIMATION METHOD
=========================================================

Current values are fitted to model routing efficiency in Nangate open library using FreePDK 45nm process DRs. The fit was performed to minimize the average error of area estimates for four designs (described in the ICCAD'09 paper). You can repeat the fit to suit your type of cells/process-DRs using the script /misc/cong_fitting.sh. A smaller set of cells that are representative of the library can be used instead of the entire library. Details on how to use the script and change the range and precision of the parameters can be found in the script itself.    


IMPORTANT NOTES ON SOME RULES
=============================
If rule 8.7.1 is set to 1 and cell-input pins require M1 pads wider than the minimum M1 width rule, one neighboring M1 track will be blocked for the length of the pad. This results in pessimistic M1 congestion, because cell-input pads are fitted after intra-cell routing and most likely do not require an increase of cell area. As a result, it is recommended to keep this rule set to 0 for typical applications.

If rule 8.7.2 is set to 1 and cell-input pins require M1 pads wider than the minimum M1 width rule, one neighboring M1 track will be blocked for the length of the *wire connecting the output pins* (vertical wire connecting n/p transistors connected to final cell-outputs).


KNOWN ISSUES
============
1-Hierarchical circuit definitions and extracted netlists are NOT supported. Only schematic and LVS netlists are allowed.

2-Runtime is relatively slow and memory consumption is high when running cells with very large drive current (except inverters and buffers, which run very fast; please see the EXPECTED RUNTIME section for details). This might cause the tool to crash due to memory shortage. The same can occur when running a netlist with a very large number of cells (more than 100). In this case, it is recommended to split the netlist file into smaller files and run them separately.
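
One way to split a large netlist file is sketched below for the SPICE format. It assumes that every cell starts with a ".subckt" line, and it drops any lines that appear before the first ".subckt" (such as header comments). The function name split_netlist and the output naming scheme are made up for this example.

```shell
split_netlist() {
    # $1 = flat SPICE library file, $2 = cells per output file, $3 = output prefix.
    # Writes <prefix>_001.sp, <prefix>_002.sp, ... with at most $2 cells each.
    awk -v n="$2" -v prefix="$3" '
        tolower($1) == ".subckt" {
            # Start a new output file every n cells.
            if (cells % n == 0) {
                if (out != "") close(out)
                chunk++
                out = sprintf("%s_%03d.sp", prefix, chunk)
            }
            cells++
        }
        # Lines before the first .subckt are skipped because out is still empty.
        out != "" { print > out }
    ' "$1"
}
```

For example, "split_netlist library.sp 50 part" would produce files with at most 50 cells each, which can then be passed to the tool in separate runs. The filename library.sp is a placeholder.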


EXPECTED RUNTIME
================
UCLA_DRE was tested on the entire Nangate open standard cell-library, which includes 104 cells with relatively small transistors. The runtime is between 20 minutes and 1 hour depending on the maximum number of chaining iterations set in the config file (20 min with 8 iterations and 1 hour with 20 iterations) on a single processor with a 2GHz clock speed and 2MB cache.

UCLA_DRE was also tested on the entire Synopsys 90nm Generic Library, which has 251 cells with relatively large transistor sizes. For small cells, the tool runs extremely fast. For larger cells, the runtime increases but remains tolerable. However, the runtime is slow (it can take up to 40 minutes) for cells with a large number of fingers and very large drive current (except inverters and buffers, which run fast). This is because transistor chaining during topology generation is exponential in the number of folds (or fingers).

We allow a tradeoff between the quality of chaining solutions and runtime with the "chaining_it" parameter in the config file. This limits the number of iterations of the chaining algorithm to the specified number. It can be set to 6 without much loss of quality of results. If the runtime is tolerable and better quality is needed, this number should be increased. Any number between 8 and 20 is recommended.
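
Changing the iteration limit can be scripted, for example when comparing runtimes for several values. The sketch below assumes that the config file holds the setting on a line of the form "chaining_it : <number>", following the "parameter : value" syntax described earlier. The function name set_chaining_it is made up for this example.

```shell
set_chaining_it() {
    # $1 = config file, $2 = new maximum number of chaining iterations.
    # Rewrites only the number after "chaining_it :"; trailing comments are kept.
    tmp="$(mktemp)" || return 1
    sed -E "s/^([[:space:]]*chaining_it[[:space:]]*:[[:space:]]*)[0-9]+/\1$2/" "$1" > "$tmp" || return 1
    # Copy the result back so the original file keeps its permissions.
    cat "$tmp" > "$1" && rm -f "$tmp"
}
```

It is safest to work on a copy of the sample config file so that the original values remain available.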


GETTING HELP
============
For quick help, you can run the command "./UCLA_DRE -h" or "./UCLA_DRE --help" to display usage information. Useful information on the methods used by the tool and what's supported and what's not can be found in the documentation folder. In case further assistance is needed, please feel free to contact me at rani@ee.ucla.edu.
