L4: Pipelining
L-4.1: Pipelining with real life example| Need of Pipelining | COA
Section titled “L-4.1: Pipelining with real life example| Need of Pipelining | COA”L-4.2: Pipelining Introduction and structure | Computer Organisation
Section titled “L-4.2: Pipelining Introduction and structure | Computer Organisation”Definition of Pipelining
- Pipelining is the Process of arrangement of hardware elements of CPU such that Hs Overall Performance is increased
- Simultaneous execution of more than one instructor takes place in pipelined processor
-
In pipelining multiple instructions are overlapped in execution
- Example – Reduced Instruction Set (RISC) Pipeline
Stage/Segment Interface Register (IR) / /Input / R1 R2 / R3 R4 ┌─────┐ ┌─┐ ┌─────┐ ┌─┐ ┌─────┐ ┌─┐ ┌─────┐ ┌─┐->│ S1 │-│ │->│ S2 │-│ │->│ S3 │-│ │->│ Sn │-│ │-> └─────┘ └─┘ └─────┘ └─┘ └─────┘ └─┘ └─────┘ └─┘ └──────┴──────┴─────┴──────┴─────┴──────┴─────┴ Clock-
IR (Interface Register): Latches placed after each stage to store intermediate results between stages.
- They isolate stages so each can work in parallel during the same clock cycle.
- Common in RISC architectures to achieve high instruction throughput.
-
Clock Pulse: Synchronizes all pipeline stages so each stage starts its operation at the same time. On each clock edge, IRs transfer stored data to the next stage, enabling overlapped execution of multiple instructions.
Standard:
- In a pipeline, the registers (pipeline registers) are placed after each stage, not before.
- S1 → R1 → S2 → R2 → S3 → R3 → Sn → Rn
- Each R stores the output of the previous stage so that the next stage can work on it in the next clock cycle.
- This ensures all stages operate in parallel without interfering.
Common Pipeline Stage Configurations ⭐
- 3-stage pipeline (simple microcontrollers, early ARM)
- IF – Instruction Fetch
- ID/EX – Instruction Decode + Execute (combined)
- WB – Write Back
- 4-stage pipeline (basic processors)
- IF – Instruction Fetch
- ID – Instruction Decode & Register Fetch
- EX – Execute / Memory Address Calculation ➕
- WB – Write Back
- 5-stage pipeline (classic RISC)
- IF – Instruction Fetch
- ID – Instruction Decode / Register Fetch
- EX – Execute / Address Calculation
- MEM – Memory Access ➕
- WB – Write Back
- 6-stage pipeline (higher clock speed) ⭐
- IF – Instruction Fetch
- ID – Instruction Decode
- OF – Operand Fetch ➕
- EX – Execute
- MEM – Memory Access
- WB – Write Back
- More than 6 stages – Used in high-performance CPUs by splitting stages further (e.g., multiple fetch, decode, and execute phases).
L-4.3: Pipelining Vs Non-Pipelining | Instruction Execution | Speedup, Efficiency, Utilization | COA
Section titled “L-4.3: Pipelining Vs Non-Pipelining | Instruction Execution | Speedup, Efficiency, Utilization | COA”5 Stages Pipeline (ex: RISE) :
S1- (IF) Instruction FetchS2- (ID) Instruction DecodeS3- (EX) ExecuteS4- (MEM) MemoryS5- (WB) Write Back
Given: Program with 8 Instructions (I1 to I8)
If for each Stage, time is not given , Consider Uniform Clock Cycle (Each Stage take 1 CC)
Non Pipelined Architecture
┌───┐| └──
1 Stage -> Take 1 CC (Clock Cycle)
1 Instruction -> 5 Stages -> Take 5 CCP = 8I , I1-I8
S1 S2 S3 S4 S5 S6 S7 S8┌───┬───┬───┬───┬───┬───┬───┬────┐|I1 |I1 |I1 |I1 |I1 |I1 |I1 |I1 |└───┴───┴───┴───┴───┴───┴───┴────┴→ Time 1 2 3 4 5 6 7 8
S1 S2 S3 S4 S5 S6 S7 S8┌───┬───┬───┬───┬───┬───┬───┬────┐|I2 |I2 |I2 |I2 |I2 |I2 |I2 |I2 |└───┴───┴───┴───┴───┴───┴───┴────┴→ Time 9 10 11 12 13 14 15 16
... and so I3, I4, I5, I6, I7,
S1 S2 S3 S4 S5 S6 S7 S8┌───┬───┬───┬───┬───┬───┬───┬────┐|I8 |I8 |I8 |I8 |I8 |I8 |I8 |I8 |└───┴───┴───┴───┴───┴───┴───┴────┴→ Time 33 34 35 36 37 38 39 40
Total Time = 40 CCFormula:
No. of Instructions (N) = 8No. of Stages (K) = 5Time per Stage (T) = 1CC (Uniform)
Total Time = N x K x T⭐ = 8 x 5 x 1 = 40 CCPipelined Architecture
Phase (Space) Time Diagram
P = 8I , I1-I8
Stage ↑ ├───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐S1 |I1 |I2 |I3 |I4 |I5 |I6 |I7 |I8 | | | | | ├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤S2 | |I1 |I2 |I3 |I4 |I5 |I6 |I7 |I8 | | | | ├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤S3 | | |I1 |I2 |I3 |I4 |I5 |I6 |I7 |I8 | | | ├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤S4 | | | |I1 |I2 |I3 |I4 |I5 |I6 |I7 |I8 | | ├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤S5 | | | | |I1 |I2 |I3 |I4 |I5 |I6 |I7 |I8 | └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴─→ Time 1 2 3 4 5 6 7 8 9 10 11 12
╰-──-───-──-──-───-╯╰-─╯╰-─╯╰-─╯╰-─╯╰-─╯╰-─╯╰-─╯ I1 I2 I3 I4 I5 I6 I7 I8
Total Time = 12 CC
I1 -> takes 5CCI2 to I7 -> takes 1CC Each
So,1 Instruction take -> All stage x 1CCN-1 Instruction take -> One Stage x 1 CC EachFormula:No. of Instructions (N) = 8No. of Stages (K) = 5Time per Stage (T) = 1CC (Uniform)
Total Time = (1.K + (N-1).1) = (K + N - 1) = (5 + 8 - 1) = 12 CCSpeedup & Efficiency
Speedup = Non Pipeline / Pipeline = 40 CC / 12 CC = 3 (aprrox 3 x times)
Efficiency = Cells Utilised / Totan no. of Cells = 40 / 60 = 2/3Goal of Pipeline: To make Cycles Per Instruction (CPI) ≈ 1
CPI formula:
$$CPI=\frac{\text{Total CPU Cycles}}{\text{Number of Instructions}}$$
- It’s practically not possible to achieve exactly 1 due to hazards and stalls.
- But theoretically, for ideal pipelining:
- Example 1: No. of instructions =
8, No. of stages =12→ Total cycles =12 + (8 - 1)=19
→ CPI =19/8≈2.3❓ → CPI >> 1 because pipeline fill time dominates. ⭐
- Example 1: No. of instructions =
-
- Example 2: No. of instructions =
1000, No. of stages =5→ Total cycles =5 + (1000 - 1)=1004≈1000→ CPI =1004/1000≈1.004→ CPI ≈ 1 for large instruction counts.
- Example 2: No. of instructions =
L-4.4: Stage Delay in Pipeline | Previous Year GATE Question | Computer Organisation & Architecture
Section titled “L-4.4: Stage Delay in Pipeline | Previous Year GATE Question | Computer Organisation & Architecture”Ques. A 4-stage pipeline has stage delays as 150, 120, 160, and 140 ns. Registers are used between stages and have a delay of 5 ns each. Assuming Const Clock rate, the total time taken to process 1000 data items on this pipeline will be?
**Theory **
- This is 4 Stage pipeline
- IR - If there is no uniform stage delay, then it is required to store Intermediate Result in Interface Register, because if one stage take more time than other, than there is speed mismatch and so we need the buffer
- Clock Pulse - Clock pusle is provide to each stage for work. Here we will asume constant clock rate, so will provide single clock pusle to all. so that all frequency will be same.
S1 R1 S2 R2 S3 R3 S4 R4 ┌─────┐ ┌─┐ ┌─────┐ ┌─┐ ┌─────┐ ┌─┐ ┌─────┐ ┌─┐ ->│ 150 │-│5│->│ 120 │-│5│->│ 160 │-│5│->│ 140 │-│5│-> └─────┘ └─┘ └─────┘ └─┘ └─────┘ └─┘ └─────┘ └─┘ └──────┴──────┴─────┴──────┴─────┴──────┴─────┴ Clock- But here each stages require different time (delay) to perform
- So we need to choose Maximum Stage Delay because
Solve
Max Stage Delay = 160 + 5 ns = 165 ns
so here, 1 Clock Cycle(CC) = 165 ns
Also, frequencey = 1/t = 1/165
here instructions/data to process = 1000
Pipeline:Total time = (K + N - 1) CC = (4 + 1000 - 1) CC = 1003 CC = 1003 x 165 ns = 165495 ns = 165.495 us