Computer Architecture Lecture Notes - Chapter 4: The Processor
The Processor
Agenda
1. Introduction
2. Logic Design Convention
3. Building a Datapath
4. A Simple Implementation Scheme
5. An Overview of Pipelining
6. Pipelined Datapath and Control
7. Data Hazards: Forwarding versus Stalling
8. Control Hazards
9. Exception
- CE Logic Design Convention To discuss the design of a computer, we must decide how the logic implementing the computer will operate and how the computer is clocked. This section reviews a few key ideas in digital logic that will be used extensively in this chapter. . Combinational elements: elements that operate on data values (e.g., the ALU). . State elements (sequential): elements that contain state, i.e., that have some internal storage (the instruction and data memories, and the registers). The following terms are used in this subject: - Asserted (assert): the signal is logically high, or true. - Deasserted (deassert): the signal is logically low, or false. 11
- CE Logic Design Convention Clocking Methodology A clocking methodology defines when signals can be read or written. This approach is used to determine when data is valid and stable relative to the clock. Edge-triggered clocking is a clocking scheme in which all state changes occur on a clock edge. That means any values stored in a sequential logic element are updated only on a clock edge. Fig.3 Combinational logic, state elements, and the clock are closely related. The time necessary for the signals to reach state element 2 defines the length of the clock cycle. 12
- CE Logic Design Convention Clocking Methodology Control signal: a signal used for multiplexor selection or for directing the operation of a functional unit. Data signal: a signal that contains information operated on by a functional unit. Bus: a signal wider than 1 bit, drawn with thicker lines. Several buses can combine to form a wider bus; for example, a 32-bit bus can be obtained by combining two 16-bit buses. 13
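The bus-combining example above can be sketched with plain bit operations. This is an illustrative Python sketch (the function name is ours, not from the lecture):

```python
def combine_buses(upper16: int, lower16: int) -> int:
    """Concatenate two 16-bit bus values into one 32-bit bus value."""
    assert 0 <= upper16 < (1 << 16) and 0 <= lower16 < (1 << 16)
    return (upper16 << 16) | lower16

print(hex(combine_buses(0xDEAD, 0xBEEF)))  # 0xdeadbeef
```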
- CE Building a Datapath Datapath element: a unit used to operate on or hold data within a processor. In the MIPS implementation, the datapath elements include the instruction and data memories, the register file, the ALU, and adders. Program Counter (PC): the register containing the address of the instruction in the program being executed. Register file: a state element that consists of a set of registers that can be read and written by supplying a register number to be accessed. Fig.4 Two state elements are needed to store and access instructions, and an adder is needed to compute the next instruction address. 15
- CE Building a Datapath 1. Fetching Instruction Fig.5 A portion of the datapath used for fetching instructions and incrementing the program counter (PC). 16
- CE Building a Datapath 2. R-format instruction (arithmetic-logical instruction) (add, sub, AND, OR and slt) Example: add $t1, $t2, $t3 Fig.6 The two elements needed to implement R-format ALU operations are the register file and the ALU. 17
- CE Building a Datapath 3. Load word and store word instruction Example: lw $t1, offset_value($t2) sw $t1, offset_value($t2) Fig.7 The two units needed to implement loads and stores, in addition to the register file and ALU of Fig.6, are the data memory unit and the sign extension unit. 18
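The address computation for lw/sw (base register plus sign-extended 16-bit offset, as the sign-extension unit in Fig.7 implements) can be sketched in Python; the function names are illustrative, not from the lecture:

```python
def sign_extend_16(imm16: int) -> int:
    """Sign-extend a 16-bit value to a Python int (two's complement)."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def effective_address(base: int, offset16: int) -> int:
    """lw/sw effective address: base register + sign-extended offset."""
    return (base + sign_extend_16(offset16)) & 0xFFFFFFFF

print(hex(effective_address(0x1000, 0xFFFC)))  # 0xffc (offset is -4)
```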
- CE Building a Datapath 4. Branch instruction (beq) Example: beq $t1, $t2, offset Fig.8 The datapath for a branch uses the ALU to evaluate the branch condition and a separate adder to compute the branch target as the sum of the incremented PC and the sign-extended lower 16 bits of the instruction (the branch displacement), shifted left 2 bits. 19
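The branch-target computation described above can be sketched as follows (an illustrative Python sketch; the function name is ours):

```python
def branch_target(pc: int, offset16: int) -> int:
    """Branch target = (PC + 4) + (sign-extended 16-bit offset << 2)."""
    off = offset16 & 0xFFFF
    if off & 0x8000:                 # negative displacement
        off -= 0x10000
    return (pc + 4 + (off << 2)) & 0xFFFFFFFF

print(hex(branch_target(0x1000, 3)))       # 0x1010
print(hex(branch_target(0x1000, 0xFFFF)))  # 0x1000 (offset is -1)
```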
- CE Building a Datapath Example 1: Fig.9 The datapath for the memory instructions and the R-type instructions. 20
- CE Building a Datapath Example 2: Fig.10 The simple datapath for the MIPS architecture combines the elements required by different instruction classes. The components come from Figs. 5, 8, and 9. This datapath can execute the basic instructions (load-store word, ALU operation, and branches) in a single clock cycle. 21
- CE A Simple Implementation Scheme The ALU Control The MIPS ALU defines 6 combinations of its 4 control inputs. Depending on the instruction class, the ALU will need to perform one of the first five functions (NOR is needed for other parts of the processor). For load word and store word instructions, the ALU computes the memory address by addition. For the R-type instructions, the ALU needs to perform one of the five actions (AND, OR, subtract, add, or set on less than), depending on the value of the 6-bit funct (or function) field in the low-order bits of the instruction. For branch equal, the ALU must perform a subtraction. 23
- CE A Simple Implementation Scheme The ALU Control To generate the 4-bit ALU control input, we can use a small control unit that has as inputs the function field of the instruction and a 2-bit control field called ALUOp. ALUOp indicates the operation to be performed by the ALU: add (00) for loads and stores, subtract (01) for beq, or an operation determined by the funct field (10). The 4-bit output of the ALU control unit directly controls the operation of the ALU. Fig.11 How the ALU control bits are set depends on the ALUOp control bits and the different function codes for the R-type instruction. 24
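The ALUOp/funct mapping of Fig.11 can be sketched in Python. The funct encodings used here are the standard MIPS ones (add 0x20, sub 0x22, and 0x24, or 0x25, slt 0x2A), which we assume match the lecture's figure:

```python
def alu_control(alu_op: int, funct: int) -> int:
    """4-bit ALU control from the 2-bit ALUOp and the 6-bit funct field."""
    if alu_op == 0b00:      # lw/sw: compute the address
        return 0b0010       # add
    if alu_op == 0b01:      # beq
        return 0b0110       # subtract
    # alu_op == 0b10: R-type, decode the funct field
    return {0x20: 0b0010,   # add
            0x22: 0b0110,   # sub
            0x24: 0b0000,   # AND
            0x25: 0b0001,   # OR
            0x2A: 0b0111,   # slt
           }[funct]
```

For example, `alu_control(0b10, 0x2A)` returns `0b0111`, the set-on-less-than operation.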
- CE A Simple Implementation Scheme The ALU Control Truth table: from logic, a representation of a logical operation obtained by listing all values of the inputs and, in each case, showing what the resulting outputs should be. Fig.12 The truth table for the ALU control bits (called Operation) Don't-care term: an element of a logical function in which the output does not depend on the values of all the inputs. Don't-care terms may be specified in different ways. 25
- CE A Simple Implementation Scheme Design the Main Control Unit Fig.13 The three instruction classes use two different instruction formats. 26
- CE A Simple Implementation Scheme There are several major observations about this instruction format that we will rely on: The op field, also called the opcode, is always contained in bits 31:26. We will refer to this field as Op[5:0]. The two registers to be read are always specified by the rs and rt fields, at positions 25:21 and 20:16. This is true for the R-type instructions, branch equal, and store. The base register for load and store instructions is always in bit positions 25:21 (rs). The 16-bit offset for branch equal, load, and store is always in positions 15:0. The destination register is in one of two places. For a load, it is in bit positions 20:16 (rt), while for an R-type instruction it is in bit positions 15:11 (rd). Thus, we will need to add a multiplexor to select which field of the instruction is used to indicate the register number to be written. 27
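The fixed field positions listed above can be sketched as simple bit slicing (an illustrative Python sketch, not part of the lecture):

```python
def decode_fields(instr: int) -> dict:
    """Slice the fixed MIPS fields out of a 32-bit instruction word."""
    return {
        "op":    (instr >> 26) & 0x3F,    # bits 31:26
        "rs":    (instr >> 21) & 0x1F,    # bits 25:21
        "rt":    (instr >> 16) & 0x1F,    # bits 20:16
        "rd":    (instr >> 11) & 0x1F,    # bits 15:11
        "imm":    instr        & 0xFFFF,  # bits 15:0
        "funct":  instr        & 0x3F,    # bits 5:0
    }

# add $t1, $t2, $t3 encodes rs=10 ($t2), rt=11 ($t3), rd=9 ($t1), funct=0x20
print(decode_fields(0x014B4820))
```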
- CE A Simple Implementation Scheme Design the Main Control Unit Fig.14 The datapath of Fig.10 with all necessary multiplexors and all control lines identified. 28
- CE A Simple Implementation Scheme Design the Main Control Unit Fig.15 The effect of each of the seven control signals. 29
- CE A Simple Implementation Scheme Operation of the Datapath Fig.16 Simple datapath with the control unit. 30
- CE A Simple Implementation Scheme Operation of the Datapath Fig.17 The setting of the control lines is completely determined by the opcode fields of the instruction 31
- CE A Simple Implementation Scheme Operation of the Datapath Fig.18 The datapath in operation for an R-type instruction, such as add $t1, $t2, $t3. 32
- CE A Simple Implementation Scheme Operation of the Datapath Fig.19 The datapath in operation for a load instruction. 33
- CE A Simple Implementation Scheme Operation of the Datapath Fig.20 The datapath in operation for a branch-on-equal instruction. 34
- CE A Simple Implementation Scheme Finalizing Control Fig.21 The control function for the simple single-cycle implementation is completely specified by this truth table. Single-cycle implementation: Also called single clock cycle implementation. An implementation in which an instruction is executed in one clock cycle. 35
- CE A Simple Implementation Scheme Implementing JUMP instruction Fig.22 Instruction format for the jump instruction (opcode = 2) The 32-bit target PC address of the jump instruction is computed as follows: - The low-order 2 bits are always 00two, as in a branch instruction. - The next lower 26 bits come from the 26-bit immediate field in the instruction. - The upper 4 bits come from the upper 4 bits of the current PC+4 (bits 31:28). 36
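The three-part jump-address composition above can be sketched in one line of bit operations (an illustrative Python sketch):

```python
def jump_target(pc: int, target26: int) -> int:
    """Jump target: upper 4 bits of PC+4, the 26-bit field, then two 0 bits."""
    return ((pc + 4) & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)

print(hex(jump_target(0x40000000, 0x10)))  # 0x40000040
```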
- CE A Simple Implementation Scheme Implementing JUMP instruction Fig.23 The simple control and datapath are extended to handle the jump instruction. 37
- CE A Simple Implementation Scheme Why is a single-cycle implementation not used today? In this single-cycle design the clock cycle must have the same length for every instruction, so the clock cycle is determined by the longest possible path in the processor (the load instruction, which uses 5 functional units in series: the instruction memory, the register file, the ALU, the data memory, and the register file again). The clock cycle is too long, so performance is poor. Note: a single-cycle design might still be considered acceptable for a small instruction set. 38
- CE An Overview of Pipelining Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Let's start with the non-pipelined approach to doing laundry: 1. Place one dirty load of clothes in the washer. 2. When the washer is finished, place the wet load in the dryer. 3. When the dryer is finished, place the dry load on a table and fold. 4. When folding is finished, ask your roommate to put the clothes away. When your roommate is done, start over with the next dirty load. Fig.24 The laundry analogy for the non-pipelined approach. 40
- CE An Overview of Pipelining For the pipelined approach to laundry: Fig.25 The laundry analogy for the pipelined approach. - The pipelined approach takes much less time than the non-pipelined one for many loads because everything works in parallel, so more loads are finished per hour. - Pipelining does not decrease the time to complete one load of laundry, but when we have many loads to do, the improvement in throughput decreases the total time to complete the work. 41
- CE An Overview of Pipelining Fig.26 The laundry analogy for non-pipelining and pipelining. Stage: one step of the pipeline. 42
- CE An Overview of Pipelining The same principles apply to processors, where we pipeline instruction execution. MIPS instructions classically take five steps: 1. Fetch the instruction from memory. 2. Read registers while decoding the instruction. The regular format of MIPS instructions allows reading and decoding to occur simultaneously. 3. Execute the operation or calculate an address. 4. Access an operand in data memory. 5. Write the result into a register. So the MIPS pipeline we explore in this chapter has five stages. 43
- CE An Overview of Pipelining Example: - Let’s create a pipeline of a processor that only supports 8 instructions: load word (lw), store word (sw), add (add), subtract (sub), AND (and), OR (or), set less than (slt), and branch on equal (beq). - Compare the average time between instructions of a single-cycle implementation (all instructions take one clock cycle) and a pipelined implementation. - The operation times for the major functional units in this example are 200 ps for memory access, 200 ps for ALU operation, and 100 ps for register file read or write. 44
- CE An Overview of Pipelining Answer: Fig.27 Total time for each instruction calculated from the time for each component. The single-cycle design must allow for the slowest instruction (in Fig.27 it is lw), so the time required for every instruction is 800 ps. 45
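A small sketch confirming this arithmetic: with the component delays given in the example (200 ps memory, 200 ps ALU, 100 ps register read or write), the longest series path belongs to lw and sets the 800 ps cycle. The unit names and per-class path lists below are our illustrative assumptions:

```python
# Delays in picoseconds for each functional unit, as given in the example.
UNIT = {"imem": 200, "reg_read": 100, "alu": 200, "dmem": 200, "reg_write": 100}

# Units each instruction class uses in series (illustrative path lists).
PATHS = {
    "R-type": ["imem", "reg_read", "alu", "reg_write"],
    "lw":     ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "sw":     ["imem", "reg_read", "alu", "dmem"],
    "beq":    ["imem", "reg_read", "alu"],
}

times = {name: sum(UNIT[u] for u in path) for name, path in PATHS.items()}
cycle = max(times.values())
print(times)  # lw has the longest path
print(cycle)  # 800
```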
- CE An Overview of Pipelining Answer: Fig.28 Single-cycle, non-pipelined execution on top versus pipelined execution on the bottom. The time between the first and fourth instructions in the non-pipelined design is 3 x 800 = 2400 ps, while in the pipelined design it is 3 x 200 = 600 ps. 46
- CE An Overview of Pipelining The pipelining speed-up Under ideal conditions, with perfectly balanced stages, the time between instructions on the pipelined processor is Time_pipelined = Time_nonpipelined / 5 = 800 / 5 = 160 ps per clock cycle, and the speed-up over the non-pipelined design equals the number of pipeline stages. Under real conditions, the stages are imperfectly balanced and pipelining involves some overhead (in the previous example, the time for each stage is 200 ps), so the time per instruction in the pipelined processor exceeds this minimum. In the previous example: Speed-up ≈ Time_nonpipelined / Time_pipelined = 800 / 200 = 4 < 5 (the number of pipeline stages). So the speed-up is less than the number of pipeline stages. 47
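The speed-up arithmetic above, restated as a short sketch using the example's numbers:

```python
t_nonpipelined = 800                # ps per instruction in the single-cycle design
stages = 5
t_ideal = t_nonpipelined / stages   # perfectly balanced stages
t_actual = 200                      # ps: the slowest stage dominates in practice
speedup = t_nonpipelined / t_actual
print(t_ideal)   # 160.0 ps per clock cycle, the ideal case
print(speedup)   # 4.0, less than the 5 pipeline stages
```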
- CE An Overview of Pipelining The Pipelining speed-up Pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time of an individual instruction. Increasing throughput is the important metric because real programs execute billions of instructions. 48
- CE An Overview of Pipelining Designing Instruction Sets for Pipelining First, all MIPS instructions are the same length. This restriction makes it much easier to fetch instructions in the first pipeline stage and to decode them in the second stage. Second, MIPS has only a few instruction formats, with the source register fields being located in the same place in each instruction. This symmetry means that the second stage can begin reading the register file at the same time that the hardware is determining what type of instruction was fetched. If MIPS instruction formats were not symmetric, we would need to split stage 2, resulting in six pipeline stages. Third, memory operands only appear in loads or stores in MIPS. This restriction means we can use the execute stage to calculate the memory address and then access memory in the following stage. If we could operate on operands in memory, as in the x86, stages 3 and 4 would expand to an address stage, a memory stage, and then an execute stage. Fourth, operands must be aligned in memory in MIPS (Chapter 2). So we need not worry about a single data transfer instruction requiring two data memory accesses; the requested data can be transferred between processor and memory in a single pipeline stage. 49
- CE An Overview of Pipelining Pipeline Hazards Hazards: situations in pipelining in which the next instruction cannot execute in the following clock cycle. There are three different types of hazards: structural, data, and control hazards. Structural hazard: when a planned instruction cannot execute in the proper clock cycle because the hardware does not support the combination of instructions that are set to execute. Data hazard (pipeline data hazard): when a planned instruction cannot execute in the proper clock cycle because the data needed to execute the instruction is not yet available. Control hazard (branch hazard): when the proper instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected. 50
- CE An Overview of Pipelining Structural Hazard Suppose that we have a single memory instead of two memories (instruction and data memories). If the above pipeline had a fourth instruction, we would see that in the same clock cycle the first instruction is accessing data from memory while the fourth instruction is fetching an instruction from that same memory. Without two memories, our pipeline could have a structural hazard. 51
- CE An Overview of Pipelining Data Hazard Suppose we have an add instruction followed immediately by a subtract instruction, as follows. The add instruction does not write its result until the fifth stage, meaning that we would have two wasted clock cycles, as the illustration below shows. 52
- CE An Overview of Pipelining Data Hazard The primary solution is based on the observation that we do not need to wait for the instruction to complete before trying to resolve the data hazard. For the code in the previous slide, as soon as the ALU creates the sum for the add, we can supply it as an input for the subtract. Adding extra hardware to retrieve the missing item early from the internal resources is called forwarding (bypassing). Forwarding (bypassing): a method of resolving a data hazard by retrieving the missing data element from internal buffers rather than waiting for it to arrive from programmer-visible registers or memory. Fig.29 Graphical representation of forwarding. 53
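The forwarding condition described above can be sketched as a register-match test. This is an illustrative simplification (real hardware compares pipeline-register fields and also checks that the producer actually writes a register), with names of our own choosing:

```python
def needs_forwarding(prev_dest: str, cur_sources: list) -> bool:
    """Forward the previous ALU result when its destination register
    is a source of the current instruction ($zero is never forwarded)."""
    return prev_dest != "$zero" and prev_dest in cur_sources

# add $s0, $t0, $t1  followed by  sub $t2, $s0, $t3
print(needs_forwarding("$s0", ["$s0", "$t3"]))  # True
```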
- CE An Overview of Pipelining Data Hazard The forwarding paths are valid only if the destination stage is later in time than the source stage. For example, there cannot be a valid forwarding path from the output of the memory access stage in the first instruction to the input of the execution stage of the following one. Forwarding works very well; however, it cannot prevent all pipeline stalls (bubbles), stalls initiated in order to resolve a hazard. For example: lw $s0, 20($t1) sub $t2, $s0, $t3 For the lw instruction, the desired data is available only after the fourth stage of the first instruction in the dependence, which is too late for the input of the third stage of sub (the second instruction). Even with forwarding, we would have to stall one stage for a load-use data hazard, a specific form of data hazard in which the data being loaded by a load instruction has not yet become available when it is needed by another instruction. 54
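The load-use case above can be sketched as a simple check on two adjacent instructions. The (mnemonic, destination, sources) tuple model is our illustrative assumption, not the lecture's notation:

```python
def load_use_hazard(first, second) -> bool:
    """True if `first` is a load whose destination is needed
    immediately by `second` (one stall even with forwarding)."""
    op1, dest1, _ = first
    _, _, srcs2 = second
    return op1 == "lw" and dest1 in srcs2

lw_i  = ("lw",  "$s0", ["$t1"])          # lw  $s0, 20($t1)
sub_i = ("sub", "$t2", ["$s0", "$t3"])   # sub $t2, $s0, $t3
print(load_use_hazard(lw_i, sub_i))  # True
```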
- CE An Overview of Pipelining Control Hazard Some instructions in MIPS (branches, jumps) create control hazards. Fig.30 Pipeline showing stalling on every conditional branch as a solution to the control hazard. - If the branch cannot be resolved in the second stage, as is often the case for longer pipelines, stalling on branches results in an even larger slowdown. - The cost of this option is too high for most computers to use, which motivates a second solution to the control hazard. 55
- CE An Overview of Pipelining Control Hazard Fig.31 Predicting that branches are not taken as a solution to the control hazard. The top drawing shows the pipeline when the branch is not taken, and the bottom when the branch is taken. - Branch prediction: a method of resolving a branch hazard that assumes a given outcome for the branch and proceeds from that assumption rather than waiting to ascertain the actual outcome. - Latency (pipeline): the number of stages in the pipeline, or the number of stages between two instructions during execution. Control hazards will be studied in depth in Section 8 of this chapter. 56