FPGAs are great devices for performance and functionality. FPGAs are primarily made up of flip flops, LUTs (Look Up Tables) and interconnect matrix. Additional to this there are often block rams, DSP blocks, fast adders, etc.
LUTs themselves however offer some often unrealized design features. In most Xilinx FPGAs, the LUT is a 16x1 configuration. i.e. there are four lines in and 1 out. Internally there are 16 elements, each representing the output state for a specific input combination. Many people leave it at that and let the synthesis tools decide how best to use them. However upon closer inspection, the LUT can operate as a shift register through all 16bits, with a carry output and a programmable tap.
-- see http://toolbox.xilinx.com/docsan/xilinx7/books/data/docs/lib/lib0370_356.html
generic(INIT : bit_vector := X"0000");
port (D : in std_logic;
CE : in std_logic;
CLK: in std_logic;
A0 : in std_logic;
A1 : in std_logic;
A2 : in std_logic;
A3 : in std_logic;
Q : out std_logic);
Encapsulated in modules I have made LUTs into the following
* Small efficient fast counters: A single LUT can count (cycle) through 16 states. If the overall output of this is used to enable a second LUT counter, then overall one can count through 256 states. Three combined can do 1024, etc. This uses a fewer resources than if the counter were implemented in flip flops and also has the advantage that the fan in/out is less so can operate faster. To implement an arbitrary counter, a recursive module is used which takes k - the delay length and breaks it into k1*16+r1. k1 is a shorter counter that cycles 16 times, and when that is done triggers the final difference r1 (a short shift register). K1 then has the same algorithm applied to generate k2*16+r2, etc until kn is 0.
* Delay paths: Delay paths are simply shift registers. Depending on the delay path, a counting implementation above may be used if only one thing can be in the path at a time.
* FIFOs: the LUT shift registers form the basis of FIFOs. The FIFO length is the four address lines for the LUT
* Control logic: With various control logic, the timing interdependencies can be programmed into a shift register. These implement very efficiently in a LUT. The shift register can be cyclic so it repeats forever. This has been used in an I2C controller where two parallel shift registers were used, one for the data and one for the start and stop bits. The cycle was initiated by a LUT based counter. Similar control logic has also been used for SDRAM timing which requires low latency and fan in/out.
This LUT functionality can be hidden inside a module and if the code is moved to an architecture which doesn't support the LUTs being used as shift registers, a different implementation can be used