|
| 1 | +<<< |
| 2 | +:sectnums: |
| 3 | +=== Custom Functions Unit (CFU) |
| 4 | + |
| 5 | +The Custom Functions Unit is the central part of the <<_zxcfu_custom_instructions_extension_cfu>> and represents |
| 6 | +the actual hardware module, which is used to implement _custom RISC-V instructions_. The concept of the NEORV32 |
| 7 | +CFU was strongly inspired by https://github.com/google/CFU-Playground[Google's CFU-Playground]. |
| 8 | + |
| 9 | +The CFU is intended for operations that are inefficient in terms of performance, latency, energy consumption or |
| 10 | +program memory requirements when implemented in pure software. Some potential application fields and exemplary |
| 11 | +use-cases might include: |
| 12 | + |
| 13 | +* **AI:** sub-word / vector / SIMD operations like adding all four bytes of a 32-bit data word |
| 14 | +* **Cryptographic:** bit substitution and permutation |
| 15 | +* **Communication:** conversions like binary to gray-code |
| 16 | +* **Image processing:** look-up-tables for color space transformations |
| 17 | +* implementing instructions from other RISC-V ISA extensions that are not yet supported by the NEORV32 |
| 18 | +
|
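As a point of reference, two of the use-cases listed above can be written in plain C as follows. This is only a software baseline sketch; the function names `sum_bytes` and `bin_to_gray` are hypothetical and not part of the NEORV32 software framework. Operations like these are candidates for being moved into a single custom instruction.

```c
#include <stdint.h>

// Hypothetical software baseline: add all four bytes of a 32-bit data word.
static uint32_t sum_bytes(uint32_t x) {
    return (x & 0xffu) + ((x >> 8) & 0xffu) + ((x >> 16) & 0xffu) + (x >> 24);
}

// Hypothetical software baseline: convert a binary value to Gray code.
static uint32_t bin_to_gray(uint32_t x) {
    return x ^ (x >> 1);
}
```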
| 19 | +[NOTE] |
| 20 | +The CFU is not intended for complex and autonomous functional units that implement complete accelerators |
| 21 | +(like block-based AES en-/decryption). Such accelerators can be implemented within the <<_custom_functions_subsystem_cfs>>. |
| 22 | +A comparison of all chip-internal hardware extension options is provided in the user guide section |
| 23 | +https://stnolting.github.io/neorv32/ug/#_adding_custom_hardware_modules[Adding Custom Hardware Modules]. |
| 24 | +
|
| 25 | +
|
| 26 | +:sectnums: |
| 27 | +==== Custom CFU Instructions - General |
| 28 | +
|
| 29 | +The custom instructions utilize a specific instruction space that has been explicitly reserved for user-defined |
| 30 | +extensions by the RISC-V specifications ("_Guaranteed Non-Standard Encoding Space_"). The NEORV32 CFU uses the |
| 31 | +_CUSTOM0_ opcode to identify custom instructions. The binary encoding of this opcode is `0001011`. |
| 32 | + |
| 33 | +The custom instructions processed by the CFU use the 32-bit **R2-type** RISC-V instruction format, which consists |
| 34 | +of six bit-fields: |
| 35 | + |
| 36 | +* `funct7`: 7-bit immediate |
| 37 | +* `rs2`: address of second source register |
| 38 | +* `rs1`: address of first source register |
| 39 | +* `funct3`: 3-bit immediate |
| 40 | +* `rd`: address of destination register |
| 41 | +* `opcode`: always `0001011` to identify custom instructions |
| 42 | +
|
| 43 | +.CFU instruction format (RISC-V R2-type) |
| 44 | +image::cfu_r2type_instruction.png[align=center] |
| 45 | + |
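The field list above can be illustrated by assembling an instruction word by hand. The following is a minimal sketch, assuming the six bit-fields are packed at the standard RISC-V R-type bit positions in the order listed (`funct7` in the most significant bits, `opcode` in the least significant bits). The helper name `cfu_encode` and the macro `CFU_OPCODE_CUSTOM0` are hypothetical and only serve this illustration.

```c
#include <stdint.h>

// CUSTOM0 opcode (binary 0001011); the macro name is an assumption of this sketch.
#define CFU_OPCODE_CUSTOM0 0x0bu

// Hypothetical helper: pack the six bit-fields into a 32-bit instruction word.
// Assumed layout: funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]
static uint32_t cfu_encode(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                           uint32_t funct3, uint32_t rd) {
    return ((funct7 & 0x7fu) << 25) |
           ((rs2    & 0x1fu) << 20) |
           ((rs1    & 0x1fu) << 15) |
           ((funct3 & 0x07u) << 12) |
           ((rd     & 0x1fu) <<  7) |
           CFU_OPCODE_CUSTOM0;
}
```

For example, `cfu_encode(0x27, 12, 11, 6, 10)` describes an instruction with `funct7 = 0b0100111` and `funct3 = 0b110` that reads registers `x11` and `x12` and writes `x10`.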
| 46 | +[NOTE] |
| 47 | +Obviously, all bit-fields including the immediates have to be static at compile time. |
| 48 | + |
| 49 | +.Custom Instructions - Exceptions |
| 50 | +[NOTE] |
| 51 | +The CPU control logic can only use the _CUSTOM0_ opcode of a custom instruction to determine whether the |
| 52 | +instruction word is valid. It cannot check the `funct3` and `funct7` bit-fields since they are |
| 53 | +implementation-defined. Hence, a custom CFU instruction never raises an illegal instruction exception as long as |
| 54 | +the CFU is implemented. However, custom instructions will raise an illegal instruction exception if the CFU is |
| 55 | +not enabled/implemented (i.e. the `Zxcfu` ISA extension is not enabled). |
| 56 | + |
| 57 | +The CFU operates on the two source operands and returns the processing result to the destination register. |
| 58 | +The actual instruction to be performed can be defined by using the `funct7` and `funct3` bit-fields. |
| 59 | +These immediate bit-fields can also be used to pass additional data to the CFU like offsets, look-up table |
| 60 | +addresses or shift amounts. However, the actual functionality is completely user-defined. |
| 61 | + |
| 62 | + |
| 63 | +:sectnums: |
| 64 | +==== Using Custom Instructions in Software |
| 65 | + |
| 66 | +The custom instructions provided by the CFU are included in plain C code by using **intrinsics**. Intrinsics |
| 67 | +behave like "normal" functions but under the hood they are a set of macros that hide the complexity of inline assembly. |
| 68 | +Using such intrinsics removes the need to modify the compiler, built-in libraries and the assembler when including custom |
| 69 | +instructions. |
| 70 | + |
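To give an idea of what such an intrinsic can hide, the sketch below wraps one custom instruction in inline assembly using the GNU assembler's `.insn r` directive. This is an illustration only and not the macro implementation from the NEORV32 software framework; the function name `my_cfu_cmd6_f0` is hypothetical. The `funct3` (6) and `funct7` (0) values are written directly into the assembly string, which is why they have to be compile-time constants. On non-RISC-V hosts the sketch falls back to an arbitrary placeholder operation (XOR) so that it still compiles.

```c
#include <stdint.h>

// Hypothetical intrinsic-style wrapper: CUSTOM0 opcode, funct3 = 6, funct7 = 0.
static inline uint32_t my_cfu_cmd6_f0(uint32_t rs1, uint32_t rs2) {
#if defined(__riscv)
    uint32_t rd;
    // GNU as syntax: .insn r opcode, funct3, funct7, rd, rs1, rs2
    __asm__ volatile (".insn r 0x0b, 6, 0, %0, %1, %2"
                      : "=r" (rd)
                      : "r" (rs1), "r" (rs2));
    return rd;
#else
    // Host-only placeholder (assumption): stands in for the user-defined CFU logic.
    return rs1 ^ rs2;
#endif
}
```

The compiler handles register allocation for `rs1`, `rs2` and the result, so the wrapper can be called like a normal function with variables or literals.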
| 71 | +The NEORV32 software framework provides 8 pre-defined custom instruction macros, which are defined in |
| 72 | +`sw/lib/include/neorv32_cpu_cfu.h`. Each intrinsic provides an implicit definition of the instruction word's |
| 73 | +`funct3` bit-field: |
| 74 | + |
| 75 | +.CFU instruction prototypes |
| 76 | +[source,c] |
| 77 | +---- |
| 78 | +neorv32_cfu_cmd0(funct7, rs1, rs2) // funct3 = 000 |
| 79 | +neorv32_cfu_cmd1(funct7, rs1, rs2) // funct3 = 001 |
| 80 | +neorv32_cfu_cmd2(funct7, rs1, rs2) // funct3 = 010 |
| 81 | +neorv32_cfu_cmd3(funct7, rs1, rs2) // funct3 = 011 |
| 82 | +neorv32_cfu_cmd4(funct7, rs1, rs2) // funct3 = 100 |
| 83 | +neorv32_cfu_cmd5(funct7, rs1, rs2) // funct3 = 101 |
| 84 | +neorv32_cfu_cmd6(funct7, rs1, rs2) // funct3 = 110 |
| 85 | +neorv32_cfu_cmd7(funct7, rs1, rs2) // funct3 = 111 |
| 86 | +---- |
| 87 | + |
| 88 | +Each intrinsic function always returns a 32-bit value (the processing result). Furthermore, |
| 89 | +each intrinsic function requires three arguments: |
| 90 | + |
| 91 | +* `funct7` - 7-bit immediate |
| 92 | +* `rs1` - source operand 1, 32-bit |
| 93 | +* `rs2` - source operand 2, 32-bit |
| 94 | +
|
| 95 | +The `funct7` bit-field is used to pass a 7-bit literal to the CFU. The `rs1` and `rs2` arguments are used to pass the |
| 96 | +actual data to the CFU. These arguments can be populated with variables or literals. The following example |
| 97 | +shows how to pass arguments when executing `neorv32_cfu_cmd6`: `funct7` is set to all-zero, `rs1` is given |
| 98 | +the literal _2751_ and `rs2` is given a variable that contains the return value from `some_function()`. |
| 99 | + |
| 100 | +.CFU instruction usage example |
| 101 | +[source,c] |
| 102 | +---- |
| 103 | +uint32_t opb = some_function(); |
| 104 | +uint32_t res = neorv32_cfu_cmd6(0b0000000, 2751, opb); |
| 105 | +---- |
| 106 | + |
| 107 | +.CFU Example Program |
| 108 | +[TIP] |
| 109 | +There is a simple example program for the CFU, which shows how to use the _default_ CFU hardware module. |
| 110 | +The example program is located in `sw/example/demo_cfu`. |
| 111 | + |
| 112 | + |
| 113 | +:sectnums: |
| 114 | +==== Custom Instructions Hardware |
| 115 | + |
| 116 | +The actual functionality of the CFU's custom instructions is defined by the logic in the CFU itself. |
| 117 | +It is the responsibility of the designer to implement this logic within the CFU hardware module |
| 118 | +`rtl/core/neorv32_cpu_cp_cfu.vhd`. |
| 119 | + |
| 120 | +The CFU hardware module receives the data from the instruction word's immediate bit-fields and also |
| 121 | +the operation data, which is fetched from the CPU's register file. |
| 122 | + |
| 123 | +.CFU instruction data passing example |
| 124 | +[source,c] |
| 125 | +---- |
| 126 | +uint32_t opb = 0x12345678; |
| 127 | +uint32_t res = neorv32_cfu_cmd6(0b0100111, 0x00cafe00, opb); |
| 128 | +---- |
| 129 | + |
| 130 | +In this example the CFU hardware module receives the two source operands as 32-bit signals |
| 131 | +and the immediate values as 7-bit and 3-bit signals: |
| 132 | + |
| 133 | +* `rs1_i` (32-bit) contains the data from the `rs1` register (here = `0x00cafe00`) |
| 134 | +* `rs2_i` (32-bit) contains the data from the `rs2` register (here = `0x12345678`) |
| 135 | +* `control.funct3` (3-bit) contains the immediate value from the `funct3` bit-field (here = `0b110`; "cmd6") |
| 136 | +* `control.funct7` (7-bit) contains the immediate value from the `funct7` bit-field (here = `0b0100111`) |
| 137 | +
|
| 138 | +The CFU executes the corresponding instruction (selected, for example, by the `control.funct3` signal) |
| 139 | +and provides the operation result in the 32-bit `control.result` signal. The processing can be entirely |
| 140 | +combinatorial, so the result is available at the end of the current clock cycle. Processing can also |
| 141 | +take several clock cycles and may also include internal states and memories. As soon as the CFU has |
| 142 | +completed its operation, it sets the `control.done` signal high. |
| 143 | + |
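The selection scheme described above can be sketched as a behavioral reference model. The actual CFU is written in VHDL; the C function below only models a purely combinatorial dispatch and can serve as a golden reference when testing a custom design. The name `cfu_model` and the three operations are hypothetical and do not reproduce the instructions of the default CFU module.

```c
#include <stdint.h>

// Hypothetical behavioral model of a combinatorial CFU:
// funct3 selects the operation, funct7 carries extra immediate data.
static uint32_t cfu_model(uint32_t funct3, uint32_t funct7,
                          uint32_t rs1, uint32_t rs2) {
    switch (funct3 & 0x7u) {
        case 0: // bit-wise XOR of both operands
            return rs1 ^ rs2;
        case 1: // binary to Gray code conversion of rs1 (rs2 unused)
            return rs1 ^ (rs1 >> 1);
        case 6: // shift rs1 right by the lowest 5 bits of funct7, then add rs2
            return (rs1 >> (funct7 & 0x1fu)) + rs2;
        default: // unimplemented funct3 values return zero in this model
            return 0;
    }
}
```

In this model, `cfu_model(6, 4, 0x100, 1)` shifts `0x100` right by four bits and adds one, giving `0x11`.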
| 144 | +.CFU Hardware Example & More Details |
| 145 | +[TIP] |
| 146 | +The default CFU module already implements some exemplary instructions that are used for illustration |
| 147 | +by the CFU example program. See the CFU's VHDL source file (`rtl/core/neorv32_cpu_cp_cfu.vhd`), which |
| 148 | +is highly commented to explain the available signals and the handshake with the CPU pipeline. |
| 149 | + |
| 150 | +.CFU Execution Time |
| 151 | +[NOTE] |
| 152 | +The CFU is not required to finish processing within a bounded time. |
| 153 | +However, the designer should keep in mind that the CPU is **stalled** until the CFU has finished processing. |
| 154 | +This also means the CPU cannot react to pending interrupts. Nevertheless, interrupt requests will still be queued. |