FPGA Architectures for AES
While the previous section covered state-of-the-art techniques for implementing the individual AES operations, this section details the architectural options for scheduling those operations. At this level, design decisions focus mostly on rolling or unrolling the round loop, and on the location and number of pipeline stages employed.
Rolled Versus Unrolled Rounds
One of the most direct ways to trade area for throughput is to roll or unroll the rounds.
When unrolling the round computation, multiple rounds of the algorithm are executed in parallel, with an independent pipeline stage assigned to each cipher round, as depicted in Fig. 1.15. For this computation to be efficient, data has to be streamed into the pipeline; the more pipeline stages are placed, the faster the overall circuit should run, as described below. These approaches impose higher area demands but allow for higher throughputs. However, given the data dependency between AES rounds, they only provide good results if multiple independent data blocks are ciphered at the same time. When ciphering in feedback modes (such as CBC), where each block depends on the previous one, these throughput improvements cannot be achieved.
Fig. 1.15 A pipelined unrolled round AES structure
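The effect of block dependencies on an unrolled pipeline can be illustrated with a simple cycle-count model. The following is a minimal sketch under idealized assumptions: AES-128 with 10 rounds, one new block accepted per clock cycle when blocks are independent, and no cycles spent on I/O or on the initial key addition. The function names and parameters are hypothetical and do not come from any of the cited designs.

```python
def cycles_streaming(num_blocks, rounds=10, cycles_per_round=1):
    """Cycles to cipher independent blocks streamed into an unrolled pipeline."""
    if num_blocks == 0:
        return 0
    depth = rounds * cycles_per_round  # latency of one block, in cycles
    # The first block needs `depth` cycles; each later block
    # leaves the pipeline one cycle after its predecessor.
    return depth + (num_blocks - 1)


def cycles_feedback(num_blocks, rounds=10, cycles_per_round=1):
    """Cycles when each block must wait for the previous ciphertext (e.g. CBC)."""
    depth = rounds * cycles_per_round
    # Only one block is in flight at a time, so the pipeline never fills.
    return num_blocks * depth


if __name__ == "__main__":
    for n in (1, 100, 1000):
        print(n, cycles_streaming(n), cycles_feedback(n))
```

Under these assumptions, 1000 independent blocks take 1009 cycles, whereas 1000 chained blocks take 10000 cycles. In feedback modes the unrolled structure therefore pays the area cost without the throughput gain.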
Jarvinen et al. proposed a fully unrolled pipelined architecture targeting a Xilinx Virtex-II 2000. This solution uses a logic-based implementation requiring four clock cycles to complete each round stage. Later, Hodjat and Verbauwhede also designed a logic-based pipeline structure with four cycles per round, for the Xilinx Virtex-II Pro. However, these authors also presented a second design that uses a memory-based implementation for the first five rounds (two cycles per stage) and a logic-based implementation for the remaining ones (four cycles per stage).
Regardless of pipeline placement, the average throughput when encrypting a data stream is not directly affected by the number of pipeline registers in the structure, but by the higher clock frequency that the additional registers allow. An example is the pair of unrolled structures presented by Chaves et al. on the Xilinx Virtex-II Pro. As briefly mentioned in Sect. 1.3.7, both use a BRAM-based TBox implementation for all rounds. One structure takes one clock cycle per round, while the second, with a deeper pipeline, takes three cycles per round. The latter achieves a higher clock frequency and a higher throughput.
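The relation between pipeline depth, clock frequency, and throughput can be made concrete with a small calculation. This is a sketch assuming a fully unrolled AES-128 pipeline that accepts one 128-bit block per clock cycle. The clock frequencies used are made-up illustrative values, not figures reported by the cited authors, and the function names are hypothetical.

```python
BLOCK_BITS = 128  # AES block size in bits


def throughput_mbps(f_clk_mhz):
    """Steady-state throughput in Mbit/s: one block leaves the pipeline per cycle."""
    return BLOCK_BITS * f_clk_mhz


def latency_ns(f_clk_mhz, rounds=10, cycles_per_round=1):
    """Time for a single block to traverse the whole pipeline, in nanoseconds."""
    depth = rounds * cycles_per_round  # number of pipeline stages
    return depth * 1000 / f_clk_mhz


if __name__ == "__main__":
    # Hypothetical frequencies: the deeper pipeline is assumed to clock faster.
    print("1 cycle/round :", throughput_mbps(100), latency_ns(100, cycles_per_round=1))
    print("3 cycles/round:", throughput_mbps(200), latency_ns(200, cycles_per_round=3))
```

In this model the pipeline depth appears only in the latency. The streaming throughput depends solely on the clock frequency, which is why a deeper pipeline pays off only to the extent that it shortens the critical path.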
The structure presented by Liu et al. brought the unrolled AES structure to the more modern Xilinx Virtex 5, 6, and 7 series. The technological upgrade allowed the authors to use a LUT-based SBOX solution and reduce the pipeline to two cycles per round, while also increasing the clock frequency.
When rolling the architecture, hardware requirements are lower, since only the logic for one round is needed. This round structure processes all rounds iteratively, taking one or more cycles for each round. In 32-bit and 8-bit datapaths, the deployed logic is only able to compute part of a round in each clock cycle. Such datapaths typically allow for relatively small structures, at the cost of lower throughputs.
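The cost of a narrower datapath in a rolled structure can be estimated with a similar model. The sketch below assumes an idealized rolled AES-128 core in which a datapath of `width_bits` processes one slice of the 128-bit state per clock cycle, so a round takes `128 / width_bits` cycles. Real 32-bit and 8-bit designs may need additional cycles. The function names and the 100 MHz clock are hypothetical.

```python
def cycles_per_block(width_bits, rounds=10, state_bits=128):
    """Cycles to cipher one block when each cycle handles `width_bits` of the state."""
    if width_bits <= 0 or state_bits % width_bits != 0:
        raise ValueError("datapath width must evenly divide the state size")
    cycles_per_round = state_bits // width_bits
    return rounds * cycles_per_round


def rolled_throughput_mbps(width_bits, f_clk_mhz, rounds=10, state_bits=128):
    """Throughput in Mbit/s of a rolled core that ciphers one block at a time."""
    return state_bits * f_clk_mhz / cycles_per_block(width_bits, rounds, state_bits)


if __name__ == "__main__":
    for width in (128, 32, 8):
        print(width, cycles_per_block(width), rolled_throughput_mbps(width, 100))
```

At the same assumed clock frequency, going from a 128-bit to an 8-bit datapath multiplies the cycle count per block by 16 in this model, which is the throughput price paid for the smaller structure.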