One of the first decisions when considering the hardware implementation of an AES design is the datapath bit-width. This dictates how much of the State is processed at a time: 8, 32, or the full 128 bits per clock cycle. Designs with 16-bit and 64-bit datapaths can also be considered, but are practically nonexistent.
- 8-bit datapaths [6, 13, 25] require fewer resources, but also the highest number of iterations (160 or more cycles), and consequently yield the lowest throughput.
- 128-bit datapaths [2, 4, 10] can process more data in a single cycle (with one or more cycles per round), thus allowing for higher throughputs. However, replicating the computation units so that they operate in parallel also imposes higher resource usage.
- 32-bit datapaths [5, 20, 23] are often considered the most balanced compromise between performance and resource usage, yielding higher efficiency (throughput/resources).
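The trade-off above can be made concrete with a rough cycle count. The following is a minimal sketch, not a model of any cited design: it assumes AES-128 (10 rounds), one datapath-wide slice of the State processed per clock cycle, and no overhead for key loading or the initial AddRoundKey. The function names and the 100 MHz clock in the usage note are hypothetical.

```python
STATE_BITS = 128       # size of the AES State
ROUNDS_AES128 = 10     # assumption: AES-128, i.e. 10 rounds


def min_cycles(datapath_bits, rounds=ROUNDS_AES128):
    """Lower bound on clock cycles per block for a given datapath width.

    Assumes one datapath-wide slice of the State per cycle and ignores
    any control or key-schedule overhead.
    """
    if STATE_BITS % datapath_bits != 0:
        raise ValueError("datapath width must divide the 128-bit State")
    slices_per_round = STATE_BITS // datapath_bits
    return slices_per_round * rounds


def throughput_mbps(datapath_bits, clock_mhz, rounds=ROUNDS_AES128):
    """Idealized throughput in Mbit/s at the given clock frequency."""
    return STATE_BITS * clock_mhz / min_cycles(datapath_bits, rounds)


def efficiency(throughput, resources):
    """Efficiency metric used in the text: throughput per resource unit."""
    return throughput / resources
```

Under these assumptions, `min_cycles(8)` gives the 160 cycles quoted for 8-bit datapaths, while `min_cycles(32)` and `min_cycles(128)` give 40 and 10 cycles. At a hypothetical 100 MHz clock, `throughput_mbps(128, 100.0)` is 1280 Mbit/s; real designs will differ because the achievable clock frequency also depends on the datapath width.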
(Inv)ShiftRows Implementations: Routing, Multiplexing, and Memory Based
Fig. 1.4 The SRL16 (previous Xilinx FPGAs) and SRL32 (current Xilinx FPGAs) LUT modes
As explained in Sect. 1.2.2, the ShiftRows operation requires the shifting of the second to fourth rows of the State matrix. From an implementation point of view, this simply requires that each of the 16 bytes is properly routed to its respective position. On FPGAs, signal routing is performed by dedicated routing switches, typically not requiring any additional functional logic components. This specific routing is fixed when mapping, placing, and routing the structure onto the FPGA. However, ShiftRows and InvShiftRows (used in encryption and decryption, respectively) have opposite shifting directions. Thus, the routing path of each operation cannot be shared.
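Viewed as pure routing, each operation is a fixed permutation of the 16 State bytes. The sketch below illustrates this, assuming the usual column-major byte ordering of the State (byte index = row + 4 * column); the function names are hypothetical.

```python
def shift_rows_routing():
    """Source byte index for each output byte under ShiftRows.

    Row r is rotated left by r positions, so output (r, c) is wired to
    input (r, (c + r) mod 4). Bytes are indexed as row + 4 * column.
    """
    return [r + 4 * ((c + r) % 4) for c in range(4) for r in range(4)]


def inv_shift_rows_routing():
    """Source byte index for each output byte under InvShiftRows.

    Row r is rotated right by r positions: output (r, c) is wired to
    input (r, (c - r) mod 4).
    """
    return [r + 4 * ((c - r) % 4) for c in range(4) for r in range(4)]


def apply_routing(routing, state):
    """Apply a fixed wiring pattern to a 16-byte State."""
    return [state[src] for src in routing]
```

For a State holding the bytes 0 to 15, `apply_routing(shift_rows_routing(), list(range(16)))` yields `[0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11]`. The two wiring patterns are different, which is why a single fixed routing serves only one ciphering direction.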
Performing the (Inv)ShiftRows operation through routing is often the preferred choice in proposed 128-bit datapaths, such as those of Bulens et al. and Liu et al. However, this implies that a particular implementation can only handle one ciphering mode. With this approach, two AES cores need to be deployed to support both encryption and decryption, as is the case with the HELION Standard and HELION Fast AES cores. In order to support both encryption and decryption in a single AES design, both routing options need to coexist. If properly designed, and given the similarity of the remaining computations, only minimal multiplexing logic is needed, as presented in Chaves et al.
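One way to see why the added multiplexing can be small is to compare the two routings byte by byte. The sketch below is only an illustration under the assumption of column-major State ordering (byte index = row + 4 * column); it is not taken from any of the cited designs, and all names are hypothetical.

```python
def routing_map(inverse=False):
    """Source index of each output byte for ShiftRows or InvShiftRows."""
    sign = -1 if inverse else 1
    return [r + 4 * ((c + sign * r) % 4) for c in range(4) for r in range(4)]


def route_state(state, decrypt):
    """Model of a shared datapath: 'decrypt' acts as the multiplexer select."""
    routing = routing_map(inverse=decrypt)
    return [state[src] for src in routing]


def bytes_needing_mux():
    """Output byte positions whose source differs between the two modes."""
    enc = routing_map(inverse=False)
    dec = routing_map(inverse=True)
    return [i for i in range(16) if enc[i] != dec[i]]
```

In this model, `bytes_needing_mux()` returns only the eight positions in rows 1 and 3. Row 0 is never shifted, and rotating row 2 by two positions gives the same result in either direction, so those eight bytes can keep a single fixed route.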
In smaller datapaths of 32-bit and 8-bit widths, performing the (Inv)ShiftRows through routing is not viable, since the 16 bytes of the State are not available at the same time. The predominant state-of-the-art solution for the (Inv)ShiftRows in compact FPGA structures is to use addressable memory, as introduced in Chodowiec and Gaj. These authors show how a RAM can be used to temporarily store the State matrix between rounds, and to perform either the ShiftRows or InvShiftRows by properly addressing the write and read operations of the consecutive 32-bit columns, or 8-bit cells, of the State [8, 11]. The authors further optimize this byte shift operation by eliminating the need to specify the write address. This approach is optimized on Xilinx FPGAs using particular LUTs. On these devices, several LUTs have an operational mode called SRL32 (SRL16 in older versions). This mode allows a single LUT to work as a 32-bit deep shift register with an addressable read port, resulting in improved resource usage efficiency, as depicted in Fig. 1.4. This approach can be found in 32-bit [5, 20, 23] and 8-bit [6, 25] AES designs.
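The memory-based approach can be sketched behaviourally as follows. This is a simplified model, not the structure of any cited design: it assumes one byte-wide addressable shift register per State row, writes all four columns before reading, and ignores the overlap of reads and writes that a real pipelined design has to handle. The class and function names are hypothetical.

```python
from collections import deque


class AddressableShiftRegister:
    """Simplified model of an SRL-style shift register.

    Writing needs no address: each new value is shifted in at position 0.
    Reading is addressable: address a returns the value shifted in a
    steps before the most recent one.
    """

    def __init__(self, depth=32):
        self._cells = deque([0] * depth, maxlen=depth)

    def shift_in(self, value):
        # The oldest value falls off the far end once the register is full.
        self._cells.appendleft(value)

    def read(self, addr):
        return self._cells[addr]


def shift_rows_via_memory(columns, inverse=False):
    """Perform (Inv)ShiftRows purely through read addressing.

    'columns' is a list of four 4-byte State columns, column 0 first.
    """
    rows = [AddressableShiftRegister(depth=4) for _ in range(4)]

    # Write phase: columns arrive one at a time, no write address needed.
    for column in columns:
        for r in range(4):
            rows[r].shift_in(column[r])

    # Read phase: column k now sits at address 3 - k in every row register.
    result = []
    for c in range(4):
        out_column = []
        for r in range(4):
            src_col = (c - r) % 4 if inverse else (c + r) % 4
            out_column.append(rows[r].read(3 - src_col))
        result.append(out_column)
    return result
```

Switching between ShiftRows and InvShiftRows only changes the read address sequence, so in this model a single storage structure serves both ciphering directions.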