FPGAs

Discussion in 'Programming & Software Development' started by mikeyyy, Apr 13, 2009.

  1. mikeyyy

    mikeyyy Member

    Joined:
    Apr 7, 2005
    Messages:
    590
    Location:
    Sydney
    I don't think I've seen much talk at all about FPGAs and VHDL/Verilog, so I figured I'd start a thread. :)

    I'm in my final year of Computer Engineering and we've been using VHDL to build random stuff for group projects on a Digilent Nexys board (Spartan 3 FPGA). For those who don't know what an FPGA is, it's basically reconfigurable hardware: you can essentially program digital circuits onto the chip and make it do whatever you want. The FPGA also interfaces to I/O pins on the board and other peripherals like DRAM, LEDs, a 7-segment display, flash memory, etc.


    Last semester we built a temperature and humidity sensor on the breadboard attachment; the board could sample sensor data at 50Hz and store it in memory for later retrieval by downloading it to the PC. This semester we're doing motion detection: we have a PAL camera and have to output VGA to a screen. The goal is to stream the camera image along with some form of virtual object that can be manipulated by the detected motion.

    I find this low-level stuff lots of fun, and we're only using a 100MHz clock, none of that GHz processor crap. :)

    Is anyone else working with FPGAs? If you ever get the chance at uni, you should try it. It's pretty frustrating when things don't work (printf debugging consists of 8 LEDs and a 7-segment display), but there's something rewarding about getting the board to display a bitmap from DRAM that I can't quite put my finger on. Best of all, one of these development boards is only about $100; the back of the box it comes in even brags that it's cheaper than some textbooks.
     
  2. oupimiquo

    oupimiquo Member

    Joined:
    Sep 20, 2007
    Messages:
    520
    I've got the Digilent 3E starter board (http://www.digilentinc.com/Products/Detail.cfm?Prod=S3EBOARD), which is somewhat similar to the Nexys, but with a few more bells and whistles. Most of the stuff I do for myself is related to exploring and pushing the limits of the FPGA (go the 10 picosecond resolution TDC :) ) as opposed to "real" projects. The small amount of work I've done in a getting-paid-for-it sense has been just fairly simple timing stuff (a 30-channel 10 ns resolution TDC stuffing the results out the serial port).

    I've also played around with developing my own instruction set and CPU core, but I've been distracted by something related to the Apollo Guidance Computer thread (specifically: determining the best instruction set if you're trying to minimise the number of 3-input NOR gates).
     
  3. GooSE

    GooSE New Member

    Joined:
    Jun 26, 2001
    Messages:
    6,679
    Location:
    Sydney
    You might get more interest on the topic over in the Electronics forum.

    I personally had a ball using CPLDs and then FPGAs at uni. I guess I'll probably never get to use one again during my career though...
     
  4. SLATYE

    SLATYE SLATYE, not SLAYTE

    Joined:
    Nov 11, 2002
    Messages:
    26,857
    Location:
    Canberra
    I'm just learning how to use them with Verilog. My current project is implementing a 16-bit floating point ALU. I've got a Nexys2 board with an XC3S1200E chip, but unfortunately the project has to fit on the old Pegasus boards (XC2S50). The Pegasus boards have VGA out, but no external RAM (just a few kilobytes in the FPGA) - so I suspect that getting video output working without filling the entire chip will be difficult.

    Still, I'm having a wonderful time. It really puts modern computing in perspective. I've just spent quite a few hours figuring out a decent way to do floating-point addition; and yet in Java/C/Matlab you just go "a + b" and it all works perfectly! Who would have thought that it was so complicated once you get down to the basics?

    The big problem I have is that I'm working with Java at the same time - and it's quite painful to switch between Java and Verilog. In Java the aim is to make code that runs fast; in Verilog the aim is to make code that doesn't take up much hardware (speed isn't an issue at this point).

    EDIT: If you want to make debugging fun, put a really big clock divider in there (slow it down to ~1Hz). Then you can watch it go through each stage on the LEDs, if you've connected them to appropriate places.
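
    A minimal sketch of what I mean, assuming a 50MHz board clock (the counter width and constant only suit that rate, so adjust to taste):

    Code:
    // Divide a 50 MHz board clock down to a ~1 Hz square wave for LED debugging.
    module clkdiv(input clk, output reg slow_clk);
        reg [25:0] count = 0;          // 26 bits is enough to count to 25 million
        initial slow_clk = 1'b0;
        always @(posedge clk) begin
            if (count == 26'd24_999_999) begin
                count <= 0;
                slow_clk <= ~slow_clk; // toggles every 0.5s -> ~1 Hz square wave
            end else begin
                count <= count + 1;
            end
        end
    endmodule
    
    (In a real design you'd use this as a clock enable rather than clocking logic off the divided signal, but for watching LEDs it doesn't matter.)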

    I'm considering trying an asynchronous version, but I can't find much information on asynchronous design (the textbook we've got basically says "don't even think about it").
     
    Last edited: Apr 14, 2009
  5. OP
    OP
    mikeyyy

    mikeyyy Member

    Joined:
    Apr 7, 2005
    Messages:
    590
    Location:
    Sydney
    Yeah, I'm not sure where in industry they'd be used. They're really useful for prototyping though, so someone has to be using them. That, or go into research perhaps? I remember reading a paper where some guys were trying out a new concept: associating permissions with memory at the granularity of a single word, with an OS built to support it. They instantiated a MIPS soft-core on an FPGA (or some form of configurable logic) and then added a permissions cache to it, similar to the TLB, to get performance back into the system.

    Yeah we have a 16Mbit Micron DRAM chip on our board. My previous code accessed the RAM in async mode, which was too slow to keep up with the VGA. So I had to rewrite it for burst access to get decent throughput.

    If you're using Xilinx ISE, you can instantiate a lot of common circuits optimally on the FPGA, but I guess that defeats the purpose if you're learning and having fun. :)

    For me right now, the aim is to write manageable VHDL that actually works. I had a horrible coding style with VHDL that made life really difficult, until I emailed the lecturer and he told me how to do things properly. I spent 2-3 days writing the burst access DRAM module, and ended up rewriting it after the lecturer gave me tips; the rewrite took only a couple of hours to get working. I guess it's always the case that doing things properly and understanding how the code translates into digital circuits makes things easier.

    We were never really taught how to use VHDL properly; it was really a digital circuits course with VHDL thrown in on the side. To be fair, it was the first time that course had been run. I guess we're learning now what we should've learnt previously. Better late than never, I suppose.

    Just a quick question on flip flops, it's been a year or so since we learnt them in lectures. We never went into detail of the standard D flip flop, but we did look at the master-slave configuration. From what I understood (if it is positive edge-triggered), on the rising edge of the clock the master component would latch the value on the D line, and store it in the master circuit. On the falling edge of the clock, the slave circuit would latch the value from the master, and then output its previous value.

    So I saw it as: rising edge = latch new value, falling edge = output old value. And I thought signals that needed "remembering" in VHDL directly corresponded to a flip-flop. The lecturer has told me that the output of a signal synced to the rising edge of the clock changes soon after the triggering edge; I'm guessing he just means the propagation delay. That means what I thought was wrong, so anyone keen to enlighten me? :p If not, I might just send a few more emails to my lecturer.

    I ended up using the push buttons to cycle through the words I read from memory to verify them on the LEDs. Then after I got sick of that, I grabbed our C# program from the previous semester that interfaced to the board (it could access the RAM via async mode), hacked it up real quick to do reading and writing at the touch of a button, and then I could read/write memory from C# to verify my burst access code was working.

    What do you mean by asynchronous version? Of what?
     
    Last edited: Apr 14, 2009
  6. dakiller

    dakiller (Oscillating & Impeding)

    Joined:
    Jun 27, 2001
    Messages:
    8,347
    Location:
    3844
    It's a bit of a shame that my course (mechatronics) misses out on all the FPGA fun. I've got friends doing electronic systems who are always bitching about it, and I just tell them I wish I was doing it.

    So I've been left with trying to learn it all myself as I want to build a digital audio processor to perform on the fly filtering to create crossovers as well as some other fancy and crazy audiophile stuff at the same time.

    I bought a development board to try to learn on, but so far all I have is a switch turning an LED on and off, and straight wiring of inputs to outputs on the development platform I'm making.

    Doesn't help that I haven't learnt any DSP theory either, so I'm being held back a bit by that as well

    Here is the platform for my DAC. S/PDIF comes in at the top, gets decoded and fed to the FPGA (the decoding could be migrated into the FPGA later, but I don't want to be dealing with that yet). DSP processing in the FPGA does 4 bands of filtering; the output is buffered and synchronously reclocked with VCXOs into a clean clock fed from the DAC board, then sent to the 8 channels of DACs and the I/V analogue stage after that. An AVR micro does a bit of configuration and clock control of the system; it could also be migrated, but there's less to learn in getting it up and running this way for now.

    The whole project has been put aside. I started over the summer, but other things came up and it's on hold during the semester. I'm trying to learn a whole other language for uni (Ada, which I hate so far), so I don't need to be adding Verilog into the mix as well.
     
  7. SLATYE

    SLATYE SLATYE, not SLAYTE

    Joined:
    Nov 11, 2002
    Messages:
    26,857
    Location:
    Canberra
    I don't think it can handle floating-point, although I'll have to check that. It'd be interesting to see the "right" way to do it.

    It doesn't really matter anyway. For this project, we're not even allowed to use +, -, *, <<, or >>. Using pre-written code would definitely be bad.

    Yes, my coding style is pretty chaotic at the moment. I tried the 'normal' method of writing a state machine, but that confused me and I couldn't see any easy way to do some of the things that I wanted.

    My current method (putting everything in a huge "always @(posedge clk)" block) seems to work, even if it's a bit hard to understand.

    Yes, that's what I've found too. Once you understand what's actually going to be created, it all makes more sense. However, I do have a few issues with the Xilinx ISE doing its own "optimisation" when I didn't really want it. For example, it found one of my state machines and decided that one-hot encoding was better than normal binary encoding. This might save space, but it means that I can't see what would have happened using binary encoding!
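
    From what I've read you can override that per-register with an XST synthesis attribute; something like this (the attribute name and values are from memory, so check the XST user guide for your ISE version):

    Code:
    // Assumed XST syntax: request plain binary ("sequential") state encoding
    // instead of letting XST pick one-hot.
    (* fsm_encoding = "sequential" *) reg [1:0] state;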


    The course I'm doing is mostly based around Verilog, but we haven't hit it in great detail yet. So far we've just been writing really simple things (4-bit adder displaying on the LEDs, for example).

    Isn't it "rising edge = new value in first element; falling edge = that value appears on the output"?

    I'm not sure about VHDL, but I don't think that this is the case in Verilog. It'll correspond to a flip-flop if Xilinx ISE thinks that's a good idea.

    I strongly suspect that ISE has been using the block RAM to store this data, because I didn't instantiate any of it and yet ISE reported that the block RAM was 100% utilised.

    If it just creates a single D flip-flop (not a master-slave system), then propagation delay would make sense there.


    I was considering an asynchronous version of the 16-bit floating-point ALU. From what I've been told, this would save space on the chip and increase performance. I gather that it's more difficult to understand, and at this point I can't see how I'd make it work.
     
  8. OP
    OP
    mikeyyy

    mikeyyy Member

    Joined:
    Apr 7, 2005
    Messages:
    590
    Location:
    Sydney
    That's exactly what I've been doing: one huge positive edge-triggered process block. I've since changed to putting FSMs into their own process, and then just having a state <= next_state in the clocked process. So the FSM transitions on the clock, but the outputs of the FSM are purely functions of the inputs, 100% combinatorial. This way, the outputs I set in an FSM state are the outputs that occur in that state. Doing it all in the clocked process, you need to set the outputs you want for the next state one clock cycle early, which is sometimes one state before, and that can get confusing.
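
    In Verilog terms (since that's what you're using), the split looks something like this; the module, state names and signals are made up for illustration:

    Code:
    // Two-process FSM: the state register is the only clocked element;
    // next-state and outputs are pure functions of the current state + inputs.
    module read_fsm(input clk, input rst, input go, input done,
                    output reg start_read);
        reg [1:0] state, next_state;
        localparam ST_IDLE = 2'd0, ST_READ = 2'd1, ST_DONE = 2'd2;
    
        always @(posedge clk)            // sequential: just the state register
            state <= rst ? ST_IDLE : next_state;
    
        always @(*) begin                // combinational: next state + outputs
            next_state = state;          // defaults avoid inferred latches
            start_read = 1'b0;
            case (state)
                ST_IDLE: if (go)   next_state = ST_READ;
                ST_READ: begin
                             start_read = 1'b1;
                             if (done) next_state = ST_DONE;
                         end
                ST_DONE:           next_state = ST_IDLE;
            endcase
        end
    endmodule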

    I guess for a master-slave it is, because that's how it works?

    So the edge-triggered D flip-flop seen here http://upload.wikimedia.org/wikiped...flop.png/180px-Edge_triggered_D_flip-flop.png actually operates differently to the master-slave configuration? I remember tracing the signals entering that circuit a while ago; maybe I should do it again to relearn how it works. I thought it "acted" the same, just using fewer gates.

    Ah cool. I guess you'd need to build async circuits for all your ALU components, integer adders, multipliers, etc. It sounds like a lot of work. What does your ALU need exactly?

    edit: Just traced the D flip-flop circuit, and it looks like it changes right after the triggering edge. It's a bit confusing to follow with the feedback loops and taking propagation delays into account. It seems that when the clock goes from 0 to 1, the previous outputs of the 4-NAND array help change the Q output, but because the clock has changed, the outputs of the 4-NAND array then also change soon after (prop delay), such that a change in D after this does not change anything. D needs to be held stable while this 2-stage process is happening, until the new 4-NAND array outputs have arrived to lock things up. I guess this is the hold time you have to honour, and the setup time is right before the clock triggers, so that D is steady when it does. I think it all makes sense now.
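
    edit 2: If anyone wants to poke at it in a simulator, here's my attempt at that same circuit as a gate-level Verilog sketch (my own net names, with unit delays so the two-stage settling is visible; trace it against the schematic before trusting it):

    Code:
    // Classic 6-NAND positive-edge-triggered D flip-flop.
    // g1-g4 are the "4 NAND array" that generates the /S and /R pulses,
    // g5/g6 are the output SR latch.
    module dff_nand(input d, input clk, output q, output qn);
        wire n1, n2, n3, n4;
        nand #1 g1 (n1, n4, n2);
        nand #1 g2 (n2, n1, clk);      // n2 (/S) goes low at the edge when D=1
        nand #1 g3 (n3, n2, clk, n4);  // n3 (/R) goes low at the edge when D=0
        nand #1 g4 (n4, n3, d);
        nand #1 g5 (q,  n2, qn);
        nand #1 g6 (qn, q,  n3);
    endmodule
    
    Wiggling D right after the rising edge in simulation shows the hold-time behaviour: once n2/n3 have settled, changes on D don't make it through until the next edge.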
     
    Last edited: Apr 14, 2009
  9. SLATYE

    SLATYE SLATYE, not SLAYTE

    Joined:
    Nov 11, 2002
    Messages:
    26,857
    Location:
    Canberra
    The problem I had was with a bit of code like this:

    Code:
    always @(*) begin
    	case (state)
    		1'b0: 	begin
    				// Do something
    			end
    		1'b1: 	begin
    				reg0 <= (reg0 << 1);
    				next_state <= 1'b0;
    			end
    	endcase
    end
    
    If I put that in a sequential always block (ie replace "*" with "posedge clk") it's fine. In a combinational block it fails, because the block may run several times (if the inputs change) before the next clock cycle. As a result, reg0 gets shifted several times before the new state occurs.

    I can build a shift register module (operating at posedge clk) to handle this (so the shift doesn't occur within the combinational block) - but it seems like a lot of extra code when I can just put the whole thing in a big sequential block instead.
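
    For the record, the module itself wouldn't be big (widths and names made up); it's wiring it into everything else that feels like the extra code:

    Code:
    // Clocked shift register: the shift happens exactly once per clock edge,
    // and only when the (combinational) FSM asks for it via shift_en.
    module shifter(input clk, input load, input shift_en,
                   input [31:0] din, output reg [31:0] q);
        always @(posedge clk) begin
            if (load)          q <= din;
            else if (shift_en) q <= q << 1;
        end
    endmodule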


    I'll have to go through the circuit schematic again; I haven't looked at these in great detail.

    It's the standard add/subtract/multiply/divide functions. At some point it also needs to handle binary-to-decimal conversion; I'm still not sure exactly how I'll do that.
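
    The only approach I've seen mentioned is the "double dabble" (shift-and-add-3) algorithm. A rough 8-bit combinational sketch of what I think it looks like (untested; widths just for illustration):

    Code:
    // Double dabble: shift the binary value in from the right; before each
    // shift, add 3 to any BCD digit that is 5 or more.
    module bin2bcd(input [7:0] bin, output reg [11:0] bcd);
        integer i;
        always @(*) begin
            bcd = 12'd0;
            for (i = 7; i >= 0; i = i - 1) begin
                if (bcd[3:0]  >= 4'd5) bcd[3:0]  = bcd[3:0]  + 4'd3;
                if (bcd[7:4]  >= 4'd5) bcd[7:4]  = bcd[7:4]  + 4'd3;
                if (bcd[11:8] >= 4'd5) bcd[11:8] = bcd[11:8] + 4'd3;
                bcd = {bcd[10:0], bin[i]};  // shift in the next binary bit
            end
        end
    endmodule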

    The big problem with asynchronous circuits is that there's no clock signal to force it to continue. In a clocked one, you just wait until the next clock edge and that puts the system into a new state. In asynchronous, what can you use to trigger that?
     
  10. OP
    OP
    mikeyyy

    mikeyyy Member

    Joined:
    Apr 7, 2005
    Messages:
    590
    Location:
    Sydney
    I guess some things belong in the clocked process. :p My scenario was a bit different: I needed to honour the DRAM burst read/write timing diagrams, so I basically created a state for each clock-stage of the operation, where a state was determined by a state variable plus a counter. That saved me from needing st_read0 through st_read32, as opposed to just st_read0 and st_read1 with a counter in st_read1.

    The FSM that controlled it was purely combinatorial; however, it had a few signals that a clocked process would monitor to increment/reset the counter and transition the FSM. I guess it's similar to (or is) a multi-cycle datapath with a control unit. Each state would set something like next_state <= st_idle or next_state <= st_read0, and the clocked control process would do state <= next_state to sync the FSM transitions to the positive edge of the clock.

    An async circuit would probably just operate off ripple effects. For example, you can chain something like 31 full-adders and 1 half-adder (around those numbers) to make a 32-bit adder. Then you just select the add function via a mux and load up the inputs, wait a certain amount of time (the propagation delay through the 32 adders, since they're connected in series), and your output should be ready.
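
    A cut-down structural version (4 bits for brevity, in Verilog since that's what you're using; the half-adder is just a full adder with its carry-in tied to 0):

    Code:
    // Ripple-carry adder: bit 0 has carry-in 0, and every later stage waits
    // on the carry from the stage before it, so the delay adds up in series.
    module full_adder(input a, b, cin, output sum, cout);
        assign sum  = a ^ b ^ cin;
        assign cout = (a & b) | (cin & (a ^ b));
    endmodule
    
    module ripple_adder4(input [3:0] a, b, output [3:0] sum, output cout);
        wire c1, c2, c3;
        full_adder fa0 (a[0], b[0], 1'b0, sum[0], c1);
        full_adder fa1 (a[1], b[1], c1,   sum[1], c2);
        full_adder fa2 (a[2], b[2], c2,   sum[2], c3);
        full_adder fa3 (a[3], b[3], c3,   sum[3], cout);
    endmodule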

    They're purely combinatorial circuits. In VHDL this might look like:

    Code:
    
    type functions is (
       f_add,
       f_shiftright
    );
    
    -- concurrent (async) code, not in a process block
    -- & is the concatenation operator in the shift code
    -- ("out" is a reserved word in VHDL, hence "result")
    
    result <= in0 + in1 when func = f_add else
              '0' & in0(31 downto 1) when func = f_shiftright else
              (others => '0');
    
    
    It's pretty much a case switch on the mux/func select. If they're making you structurally build each function unit instead of describing it behaviourally like above, you can still do the same thing, it's just a little clunkier.
     
  11. SLATYE

    SLATYE SLATYE, not SLAYTE

    Joined:
    Nov 11, 2002
    Messages:
    26,857
    Location:
    Canberra
    That sounds like a good way to do it. I'll have to try both and see what takes less space on the chip; I'd like to keep this one as small as possible.

    How do you wait a certain amount of time, unless you have a clock to provide a constant time period?

    I did try to do a design where the final bit of the adder being set would cause the next iteration to start, but ISE didn't like that at all. I'll have to try it again and see if I can make it work.

    That's a good way of doing it for something like integer addition that can be described combinationally (without a really huge amount of hardware). I've been using lots of that in my code; the Verilog equivalent is along the lines of:
    Code:
    assign out = (func == f_add) ? in0 + in1 : ((func == f_shiftright) ? (in0 >> 1) : 0);
    
    The problem I have is that the floating-point adder has lots of stages. At the moment:
    • Read in values
    • Compare exponents
    • Take the mantissa associated with the smaller exponent. Shift it right and increment the exponent until the exponents are equal
    • Add the two mantissa values
    • Shift the result left and decrement the exponent until there's a 1 in the left-most bit.
    • Save result
    There's a little bit more in there to handle denormalised values that are required to properly represent zero, but it doesn't matter for now.
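
    To make the steps concrete, here's a toy walkthrough (made-up short mantissas, just to show the flow):

    Code:
    adding 1.5 + 0.25:
      1.5  = 1.100 x 2^0
      0.25 = 1.000 x 2^-2
    exponents differ by 2, so shift the smaller mantissa right twice:
      0.25 = 0.010 x 2^0
    add the mantissas:
      1.100 + 0.010 = 1.110
    leading bit is already 1, so no normalisation shift needed:
      1.110 x 2^0 = 1.75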

    The issue is that once the values get read in the first stage, what makes it progress to the next stage? In a clocked system you can just go "next_state <= S02" and it'll go there on the next clock edge; but here there's no clock edge.

    Just going "state <= S02" (setting the state directly, without a "next_state" reg) didn't work last time I tried. This isn't exactly unexpected (if "state" gets updated in each stage and every update causes the block to run again, it'll try to run the entire block in zero time) - but I can't see a better way to do it without getting a clock involved.
     
  12. oupimiquo

    oupimiquo Member

    Joined:
    Sep 20, 2007
    Messages:
    520
    You need to think hardware, not software :) How would you expect the above block to get implemented in hardware? For example, what would the "state" variable be? Although the code above doesn't quite make sense ("next_state" = "state"?), I think what you're aiming for would end up being a ring oscillator.

    What I *think* you want is that the input to whatever code is in the state==0 case is shifted to the left by 1 if some other input is 1. In that case, describe it as such.

    As mikeyyy mentioned, it's basically the input changes propagating through to the outputs. The big problem with doing async circuits in FPGAs is that the lower bounds of propagation time are not well defined and vary widely. This results in an uneven "propagation front" through the logic, causing glitches on the outputs. Since async logic quite often has combinatorial loops by design, these glitches cause huge problems.

    In a custom ASIC, you can, through careful design, have a relatively well defined propagation front through the logic so you can be sure that glitches won't occur.

    Also, in FPGAs, FFs are essentially free since every LUT has an associated FF. The performance gain from async logic is pretty small for the amount of work required, since most (all?) vendor synthesis tools support automatic pipelining. All you need to do is clock the output through 4 registers and the tools will automatically split the logic up into 4 stages.
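
    The pattern is just a register chain on the output, something like this sketch (you need register balancing / retiming enabled in the synthesis options for the tools to actually spread the registers through the logic):

    Code:
    // Describe the whole function combinationally, then clock the result
    // through several registers; retiming can push them back into the logic.
    module pipelined_mult(input clk, input [15:0] a, b, output reg [31:0] p);
        wire [31:0] product = a * b;   // one big combinational blob
        reg  [31:0] p1, p2, p3;
        always @(posedge clk) begin
            p1 <= product;
            p2 <= p1;
            p3 <= p2;
            p  <= p3;                  // 4 registers -> up to 4 pipeline stages
        end
    endmodule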

    That said, async logic does have its place. For a topical example, the Spartan 3s have no hardware support for the delays required to interface to DDR RAM (specifically, the DQS phase shift). So the Xilinx MIG essentially sets up a DLL using LUTs, all driven asynchronously. Then you've got more metastability problems, because you're trying to interface logic running at the same speed but on different-phase clocks. It all gets very complicated very quickly, and is why the higher-end FPGAs have specific hardware to address this on some of the outputs (the more you pay, the more DQS skew logic you get, so the more memory channels you can use).

    Personally, I find the new Spartan 6's pretty interesting. Xilinx have finally added the required hardware to do PCI-e to the (admittedly higher end only) Spartan line, and also added hardware to handle a lot of the low-level details for everything up to and including DDR3. If they price it right, it's going to make developing PCI-e add-in cards way cheaper.
     
  13. spinvector

    spinvector Member

    Joined:
    Apr 3, 2009
    Messages:
    23
    Please keep us updated! Sounds awesome...
     
  14. dakiller

    dakiller (Oscillating & Impeding)

    Joined:
    Jun 27, 2001
    Messages:
    8,347
    Location:
    3844
    I've got a bit of a worklog over on another forum with a bit more detail, and updates will be posted there. Don't expect any updates soon; as tempting as it is to start playing with it again over the uni break, I really should be working on more important things.
     
  15. SLATYE

    SLATYE SLATYE, not SLAYTE

    Joined:
    Nov 11, 2002
    Messages:
    26,857
    Location:
    Canberra
    "next_state" would be another reg. The full code would be:
    Code:
    always @(posedge clk) begin
    	state <= next_state;
    end
    
    always @(*) begin
    	case (state)
    		1'b0: 	begin
    				// Do something.
    			end
    		1'b1: 	begin
    				reg0 <= (reg0 << 1);
    				next_state <= 1'b0;
    			end
    	endcase
    end
    
    That way, it switches to the next state on the next positive clock edge.

    I know it's going to oscillate with this code; but I can't see a good way to prevent that without putting it all in a sequential block!

    I've been trying to think about it in terms of hardware, but I don't know enough about the hardware to see exactly what will be created.

    I just want it shifted left by 1 regardless of inputs, and then the state should go back to 0. How I'd use this is with a serial adder:
    • State 0: 1-bit half-adder takes the left-most bit of reg0, adds it to a stored carry value, and saves it to an output somewhere. State is updated to state 1.
    • State 1: reg0 gets shifted left by 1 space, making a new bit the left-most one in reg0. Then we return to state 0 for another addition operation.

    Basically it'd form a loop where each cycle shifts reg0 by one space and then adds one bit of it to the carry bit. In a sequential block it's fine; in a combinational block it tries to do the entire loop in zero time.
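
    Done sequentially, the whole loop collapses into one clocked block anyway; a sketch (shifting LSB-first, since the carry has to start from the bottom; widths made up, and it uses + and >> for brevity where my project rules would want explicit gates):

    Code:
    // Bit-serial adder: one result bit per clock, so 16 clocks for 16 bits.
    module serial_add(input clk, input start, input [15:0] a, b,
                      output reg [15:0] sum, output reg carry);
        reg [15:0] sa, sb;
        reg [4:0]  count;
        wire [1:0] s = sa[0] + sb[0] + carry;  // 1-bit full add: {cout, bit}
        always @(posedge clk) begin
            if (start) begin
                sa <= a; sb <= b; carry <= 1'b0; count <= 5'd16;
            end else if (count != 0) begin
                sum   <= {s[0], sum[15:1]};    // new bit shifts in from the top
                carry <= s[1];
                sa <= sa >> 1; sb <= sb >> 1;
                count <= count - 1;
            end
        end
    endmodule
    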
    Sounds like async might be more trouble than it's worth, at least for my purposes.
    This is one of the things that I have a lot of trouble with. I don't know just how smart the synthesis tools are. For example, if I write "a + b + 1" is it smart enough to recognise that this is a full adder with an initial carry input? Or will it go and make two adders?

    If I write "a + 1" will that result in a full adder, or will the synthesis tool recognise that a half adder is adequate?

    I'm surprised that flip-flops are almost free; I must be doing something wrong. I took out a single ~30-bit reg from the code and that saved about ten slices (in the XC3S1200E). No other code changed.

    Interesting. It sounds like analysing such a system would be really painful.
    That would be interesting. You could write a GPU that reports itself as a standard Intel/ATI/Nvidia one, and see if the drivers will accept it. No chance of fitting a complex GPU on there, but something like Intel GMA900 might work.
     
  16. oupimiquo

    oupimiquo Member

    Joined:
    Sep 20, 2007
    Messages:
    520
    The word you're looking for here is barrel shifter :) And it's the reason why FPGAs are generally slow at high-precision FP math unless you're willing to burn a lot of space. You can also opt for an iterative design, which I think is what you were aiming for in the first place. The downside, of course, is that the iterative design takes multiple (and not a fixed number of) cycles. There is, of course, a range in between, and with some cunning you can do a 24-bit barrel shifter (ie: single precision FP) in 4 cycles using a 4-way mux (which is fairly efficient on most FPGAs):
    Code:
    next_out <= mux(unshifted_input, prev_out << 1, prev_out << 5, prev_out << 8)
    For the first few steps of FP addition, I'd do things slightly more conventionally. First, I'd make sure that the number with the larger exponent was always on the "left", for some definition of left. Then, use a saturating subtract to get the required shift to match the two exponents up. Put this number, plus the right hand mantissa, into some form of barrel shifter.

    There's a quite useful post about FP addition in FPGAs at: http://www.fpgacpu.org/usenet/fp.html


    I quickly hacked up a single-precision pseudo-adder over breakfast using the above method. By pseudo-adder I mean that it can only add non-negative numbers together, doesn't handle weird numbers like INFs, and doesn't do rounding. But it adds :)

    Code:
    module fpadd(ExpLHS, ExpRHS, MantLHS, MantRHS, ExpOut, MantOut, Strobe, Rdy, Clk, Rst, Done);
      input wire [7:0] ExpLHS;
      input wire [7:0] ExpRHS;
      input wire [22:0] MantLHS;
      input wire [22:0] MantRHS;
      output reg [7:0] ExpOut;
      output reg [22:0] MantOut;
      input wire Strobe;
      output reg Rdy;
      input wire Clk;
      input wire Rst;
      output reg Done;
    
    // - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    // This is the combinatorial part of the input side to the first stage (swapping
    // and calculating the shift).
    // - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    
    // Make the exponent on the left the larger of the two.
    wire [7:0] ExpLHS_Prep;
    wire [7:0] ExpRHS_Prep;
    wire [22:0] MantLHS_Prep;
    wire [22:0] MantRHS_Prep;
    wire SwapInputs;
    
    assign SwapInputs = (ExpRHS > ExpLHS) ? 1'b1 : 1'b0;
    assign ExpLHS_Prep = SwapInputs ? ExpRHS : ExpLHS;
    assign ExpRHS_Prep = SwapInputs ? ExpLHS : ExpRHS;
    assign MantLHS_Prep = SwapInputs ? MantRHS : MantLHS;
    assign MantRHS_Prep = SwapInputs ? MantLHS : MantRHS;
    
    // Calculate the required shift.
    wire [7:0] ReqShift8;
    assign ReqShift8 = ExpLHS_Prep - ExpRHS_Prep;
    
    // Saturate the required shift to 5 bits.
    wire SaturateShift;
    wire [4:0] SaturateMask;
    wire [4:0] StrobeMask;
    wire [4:0] ReqShift;
    
    assign SaturateShift = |ReqShift8[7:5];
    assign SaturateMask = {5{SaturateShift}};
    assign StrobeMask = {5{Strobe}};
    assign ReqShift = (ReqShift8[4:0] | SaturateMask) & StrobeMask;
    
    // - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    // This is the clocked part of the first stage (shifting the RHS mantissa).
    // - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    
    reg [4:0] CurShift;
    reg [23:0] MantRHS_Shifted;
    reg [22:0] MantLHS_Latched;
    reg [7:0] ExpIn_Latched;
    reg Stage1_Active;
    
    // Shift selector.
    reg [1:0] ShiftSel;
    always @(*)
    begin
      case(CurShift)
        5'b00000: ShiftSel = 2'b00;
        5'b00001: ShiftSel = 2'b01;
        5'b00010: ShiftSel = 2'b01;
        5'b00011: ShiftSel = 2'b01;
        5'b00100: ShiftSel = 2'b01;
        5'b00101: ShiftSel = 2'b10;
        5'b00110: ShiftSel = 2'b10;
        5'b00111: ShiftSel = 2'b10;
        5'b01000: ShiftSel = 2'b11;
        5'b01001: ShiftSel = 2'b11;
        5'b01010: ShiftSel = 2'b10;
        5'b01011: ShiftSel = 2'b10;
        5'b01100: ShiftSel = 2'b10;
        5'b01101: ShiftSel = 2'b11;
        5'b01110: ShiftSel = 2'b11;
        5'b01111: ShiftSel = 2'b10;
        5'b10000: ShiftSel = 2'b11;
        5'b10001: ShiftSel = 2'b11;
        5'b10010: ShiftSel = 2'b11;
        5'b10011: ShiftSel = 2'b11;
        5'b10100: ShiftSel = 2'b10;
        5'b10101: ShiftSel = 2'b11;
        5'b10110: ShiftSel = 2'b11;
        5'b10111: ShiftSel = 2'b11;
        5'b11000: ShiftSel = 2'b11;
        5'b11001: ShiftSel = 2'b11;
        5'b11010: ShiftSel = 2'b11;
        default:  ShiftSel = 2'b11; // Shouldn't get here.
      endcase
    end
    
    always @(posedge Clk)
    begin
      if (Rst)
      begin
        CurShift <= 0;
        MantRHS_Shifted <= 0;
        MantLHS_Latched <= 0;
        ExpIn_Latched <= 0;
        Stage1_Active <= 0;
      end
      else
      begin
        // Do the shift
        case(ShiftSel)
          2'b00: MantRHS_Shifted <= {1'b1, MantRHS_Prep};
          2'b01: MantRHS_Shifted <= MantRHS_Shifted >> 1;
          2'b10: MantRHS_Shifted <= MantRHS_Shifted >> 5;
          2'b11: MantRHS_Shifted <= MantRHS_Shifted >> 8;
        endcase
      
        // Adjust the counter.
        case(ShiftSel)
          2'b00: CurShift <= ReqShift;
          2'b01: CurShift <= CurShift - 1;
          2'b10: CurShift <= CurShift - 5;
          2'b11: CurShift <= CurShift - 8;
        endcase
      
        // Handle latching and active indicator.
        if (Strobe)
        begin
          MantLHS_Latched <= MantLHS_Prep;
          ExpIn_Latched <= ExpLHS_Prep;
          Stage1_Active <= 1;
        end
        else
        begin
          Stage1_Active <= |ShiftSel;
        end
      end
    end
    
    wire Stage1Done;
    assign Stage1Done = Stage1_Active && (ShiftSel == 2'b00);
    
    // Handle the ready line.
    always @(*)
    begin
      if (ShiftSel == 2'b00)
      begin
        if (Strobe)
        begin
          if (ReqShift == 5'b00000)
            Rdy <= 1'b1;
          else
            Rdy <= 1'b0;
        end
        else
          Rdy <= 1'b1;
      end
      else
        Rdy <= 1'b0;
    end
    
    // - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    // Second stage: add the mantissas (combinatorial).
    // - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    
    wire [24:0] Mant_Stage2;
    assign Mant_Stage2 = {2'b01, MantLHS_Latched} + {1'b0, MantRHS_Shifted};
    
    // - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    // Third stage: calculate the new exponent and shift (combinatorial).
    // - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    
    wire [22:0] Mant_Stage3;
    wire [7:0] Exp_Stage3;
    assign Mant_Stage3 = Mant_Stage2[24] ? Mant_Stage2[23:1] : Mant_Stage2[22:0];
    assign Exp_Stage3 = Mant_Stage2[24] ? (ExpIn_Latched + 1) : ExpIn_Latched;
    
    // - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    // Fourth stage: have the synthesis tool pipeline the above, and then register
    // the output.
    // - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    
    reg [7:0] ExpOut_p1;
    reg [22:0] MantOut_p1;
    reg Done_p1;
    
    always @(posedge Clk)
    begin
      if (Rst)
      begin
        ExpOut <=0;
        MantOut <= 0;
        Done <= 0;
        ExpOut_p1 <=0;
        MantOut_p1 <= 0;
        Done_p1 <= 0;
      end
      else
      begin
        ExpOut <= ExpOut_p1;
        MantOut <= MantOut_p1;
        Done <= Done_p1;
        ExpOut_p1 <= Exp_Stage3;
        MantOut_p1 <= Mant_Stage3;
        Done_p1 <= Stage1Done;
      end
    end
    
    endmodule
    
    A quick analysis with XST shows it just scraping in at 200 MHz on a Spartan 3E (-5 speed grade). Without the pipelining in the last stage (24-bit adder) it tops out somewhere around 170 MHz, and it's still the adder bottlenecking it at 200 MHz. The limiting part from a throughput point of view is the shifter, which takes up to 5 cycles (1 to clock in, 4 to shift), so comes out at 40 MFlops. Resource usage is 107 slices (210 LUTs, 93 FFs), which would obviously go up once you add things like sign support, etc. Latency is 2-3 cycles for the swap and shift calculation (the 1 cycle difference comes from how much slack you've got in the logic feeding the adder), 4 cycles for the shift, and 2 for the add, so 8 cycles = 25 MFlops non-pipelined.

    Though you should probably add a cycle or two, since the coffee hasn't fully kicked in yet :)

    To up the throughput you'd have to burn more resources on the shifter. Simply chaining 4 of the current shifters together would obviously work and get you up to 200 MFlops at the price of significantly increased space. Going to an 8-way mux may or may not help (left as an exercise for the reader).

    How do you know where the next bit is? Basically, what would happen when you set state to 1 is that there'd be a brief blip at the output, which would last until the loop closed. This is actually fine in an ASIC, where you can control the speeds (so you'd match the loop feedback time to the data rate) so that the blip is the right length. In fact, this is half the idea behind the self-resetting domino logic of Northwood fame.

    On a Xilinx FPGA, it'll result in a de-facto full adder, because it's free. There's essentially a dedicated half-adder per LUT4/FF, so combined with the LUT4 it acts as a full adder.

    That said, XST has a bit of a reputation for sometimes choosing to use the carry chain when going with simple LUTs would be faster, and vice versa. So if you're trying to squeeze the last bit of performance out of the chip, you need to keep a bit of an eye on it.

    I should have been a bit more specific - flip-flops come free with logic :) If you're doing something like a FIFO, where there's no logic between the FFs, then they obviously cost you space. But for pipelining, where in most cases you have logic between the stages, they're essentially free.
     
    Last edited: Apr 15, 2009
  17. nudge

    nudge Member

    Joined:
    Sep 5, 2001
    Messages:
    860
    Location:
    Amsterdam NL
    Ah FPGAs, so much fun! I was deeply involved with FPGAs in my first job straight out of university, where unfortunately there wasn't much in the way of FPGAs on the syllabus (I studied BE(Elec)/BSc(CompSci)(Maths)). I prototyped Canon's fourth generation raster image processor ASIC core (http://www.canon.com/technology/interview/irc/irc_p6.html) on multiple Virtex-4 FPGAs hosted on a PCI-X interface and interfacing to 1GB of DDR2 RAM. This was so we could get an early start on software integration and also speed up regression testing and verification of the core (in the end there was about a 10,000x speedup over software HDL simulation a la ModelSim). The job involved writing SRAM models to use BlockRAM resources, various bus adapters (to the QL PCI controller and to various other proprietary busses), a DDR2 memory interface, debugging blocks (e.g. pseudo-random latency generators), serdes interfaces and clocking/reset sequencing logic. There were something like 5 clock domains in the design, and the core was quite large (multi-million ASIC gates), so it had to be partitioned across multiple FPGAs, which meant design partitioning, clock distribution and signal serialisation/deserialisation. One of the most difficult parts of the prototype (on the hardware side) was the DDR2 controller, but as oupimiquo points out, the Virtex series has hardware support to make life easier: 90 degree phase shift PLLs, IO blocks with DDR registers, and adjustable delay lines for the DQS signals.

    I also developed a Linux kernel driver for the prototype and an API library, which was also a lot of fun! I ended up automating the thing so tests could be queued and sent to the board (loaded onto the on-board 1GB of DDR2 RAM), processed, then the results dumped back into a database for comparison against software and golden models, with a nice web interface.

    The most difficult part on the software side was dealing with buggy tools (they also get exponentially slower when you try to push for 100% utilisation :). As a result, the design was synthesized in a multi-vendor environment: I used a combination of Xilinx ISE, Synopsys DC-FPGA and Mentor Precision to successfully synthesize the chip. I hit quite a few bugs in Xilinx ISE (7.x), and also one in the otherwise awesome Synopsys DC-FPGA (now discontinued).

    The prototype board: 2x Virtex-4 LX200, 1x Virtex-4 FX100, 1GB DDR2 SODIMM, and a PCI-X interface via a QuickLogic QL5064 PCI bridge.




    I miss playing with FPGAs... I've changed jobs and now write Java code for a living :/ Maybe I should invest in one of those Spartan boards.
     
    Last edited: Apr 15, 2009
