Instruction scheduling in multiple-issue RISCV-processor

  • #1
bremenfallturm
68
11
Homework Statement
Exercise: Assume we have a VLIW processor
that can issue two instructions in the same clock
cycle. For the following code, schedule the
instructions into issue slots. Assume that no
hazards occurs.
Code:
addi s1, zero, 10
L1: lw t0, 0(s2)
ori t1, t0, 7
sw t1, 0(s2)
addi s2, s2, 4
addi s1, s1, -1
bne s1, zero, L1
Relevant Equations
N/A
Hello!

(this problem considers a multiple-issue RISCV-processor)

My attempt at a solution is identical to the solution key (screenshoted below), but I scheduled the sw instruction in slot 1 and the bne instruction in slot 2:
1735316769239.png

Since it is not given in the question that slot 1 and slot 2 are dedicated for different kinds of instructions (for example ALU operations), I would assume that it is completely arbitrary that the branch instruction is in slot 1 and the sw instruction is in slot 2? Am I correct or have I missed something?

Bonus question: If we look at clock cycle 5, then the bne and the sw instuctions will execute in parallell. I would assume that when the bne evaluates to true, it will update the PC to jump to L1:. From what I am used to, this will happen in the Execute stage or in the Decode stage, i.e. before the result from the sw would be written to memory. A properly functional multiple-issue processor would handle this, but I am a little confused about how this process works. Does the whole pipeline need to stall, and wait until the sw result is ready before the branch is performed? This does not match what I've seen in other questons where they also ask me to calculate IPC, considering control hazards. But if the branch is performed in slot 1 before slot 2 is finished, how does the processor handle the fact that flushing the pipeline in the decode stage, would, from what I assume, also flush the sw instruction, ending up in it not saving the data?
 
Physics news on Phys.org
  • #2
bremenfallturm said:
Since it is not given in the question that slot 1 and slot 2 are dedicated for different kinds of instructions (for example ALU operations), I would assume that it is completely arbitrary that the branch instruction is in slot 1 and the sw instruction is in slot 2? Am I correct or have I missed something?
This is implementation dependent: do your course materials refer to a specific implementation, or define this specifically?

bremenfallturm said:
Bonus question: If we look at clock cycle 5, then the bne and the sw instuctions will execute in parallell. I would assume that when the bne evaluates to true, it will update the PC to jump to L1:. From what I am used to, this will happen in the Execute stage or in the Decode stage, i.e. before the result from the sw would be written to memory.
This is implementation dependent, however one would normally expect that any instruction updating the Program Counter would not take effect until the fetch at the start of the next full cycle i.e. following completion of the full current fetch-decode-execute-memory-writeback cycle so in this question the PC is updated in the execute stage of Cycle 5 and the result from the sw is written to memory in the writeback stage of Cycle 5. At the start of Cycle 6 ##\equiv ## Cycle 1 the next instruction is fetched from [PC] = L1.
 
Last edited:
  • Like
Likes FactChecker and berkeman
  • #3
pbuk said:
This is implementation dependent: do your course materials refer to a specific implementation, or define this specifically?
No, it is not mentioned anywhere. In other assignments in the course where no restriction on what each slot can do has been done, branch instruction has sometimes ended up in slot 2. But the exercise question doesn't define it and we've not gone deeper than talked about "multiple-issue processors" as a general term. So no specific implementation.
pbuk said:
in this question the PC is updated in the execute stage of Cycle 5 and the result from the sw is written to memory in the writeback stage of Cycle 5
I see. I think something quite crucial which is missing from my conceptual understanding is how long a full F-D-E-M-W cycle takes. So assuming no stalls, going from F->W takes 1 cycle, and hence, the next instruction will not be fetched until all stages in cycle 5 has been executed?
 
  • #4
bremenfallturm said:
No, it is not mentioned anywhere. In other assignments in the course where no restriction on what each slot can do has been done, branch instruction has sometimes ended up in slot 2.
Then I think you can assume that your answer with the BNE in slot 2 is also correct.

bremenfallturm said:
I see. I think something quite crucial which is missing from my conceptual understanding is how long a full F-D-E-M-W cycle takes.
I think you are looking for tricks where there are none: a full cycle takes exactly one cycle.

bremenfallturm said:
So assuming no stalls, going from F->W takes 1 cycle, and hence, the next instruction will not be fetched until all stages in cycle 5 has been executed?
Yes.
 
  • #5
Thanks for your help! Just one last question to make sure I understand, since I read more about pipelining and CPI!
pbuk said:
Yes.
From my understanding what I have read online (after posting here), in a F-D-E-M-W cycle, I've understood it as it takes 5 cycles from one particular instruction to go from the F to the W stage. For example, executing "addi t0, s0, 3" would in cycle 1 be in F, cycle 2 be in D, and so on (assuming no stalls).

Wikipedia writes this about CPI:

> With pipelining, a new instruction is fetched every clock cycle by exploiting instruction-level parallelism, therefore, since one could theoretically have five instructions in the five pipeline stages at once (one instruction per stage), a different instruction would complete stage 5 in every clock cycle and on average the number of clock cycles it takes to execute an instruction is 1 (CPI = 1)

(source: https://en.wikipedia.org/wiki/Cycles_per_instruction)


If I understand your answer correctly, you say that it instead takes 1 cycle for an instruction to go through F-D-E-M-W. Did I misunderstand? (I also understand that this is implementation-dependent!)
 
  • #6
Follow-up question: why can not sw t1, 0(s) and ori t1, t0, 7 be in the same slot?

I would assume that the result of ori could be forwarded to the beginning of the M stage of the sw instruction? Trying to illustate this with a little arrow:
...........| ...........
F D E V M W #sw
F D E . M W # ori


Or can forwarding not happen within the same cycle on multiple-issue processors?
 
Last edited:
Back
Top