Computer Architecture Lecture Notes - Chapter 9: Parallel Architectures - Nguyễn Kim Khánh

9.1. Classification of computer architectures

Classification of computer architectures (Michael Flynn, 1966):

- SISD - Single Instruction Stream, Single Data Stream
- SIMD - Single Instruction Stream, Multiple Data Stream
- MISD - Multiple Instruction Stream, Single Data Stream
- MIMD - Multiple Instruction Stream, Multiple Data Stream
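The difference between the SISD and SIMD categories can be sketched in plain Python. This is an illustrative simulation of the programming models, not real hardware: a SISD machine applies one instruction to one data item per step, while a SIMD machine applies the same instruction to every element of a data vector.

```python
# SISD: one instruction stream operating on one data item at a time.
def sisd_add(a, b):
    result = []
    for i in range(len(a)):          # each step handles a single pair
        result.append(a[i] + b[i])
    return result

# SIMD: one instruction stream applied to all data elements at once
# (simulated with a comprehension; real SIMD hardware uses parallel
# lanes within a single instruction).
def simd_add(a, b):
    return [x + y for x, y in zip(a, b)]

print(sisd_add([1, 2, 3], [4, 5, 6]))   # [5, 7, 9]
print(simd_add([1, 2, 3], [4, 5, 6]))   # [5, 7, 9]
```

Both produce the same result; the difference is how many data elements one instruction touches.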


Each core includes digital sensors for high-accuracy die temperature measurements, and each core can be defined as an independent thermal zone.

[Figure 18.9 Intel Core Duo Block Diagram: two cores, each with its own thermal control, APIC, 32-kB L1 caches, execution resources, and architectural state; shared power management logic; a 2-MB shared L2 cache; and a bus interface to the front-side bus]
2017 Kiến trúc máy tính 499
NKK-HUST
Intel Core i7-990X
The general structure of the Intel Core i7-990X is shown in Figure 18.10. Each core has its own dedicated L2 cache, and the six cores share a 12-MB L3 cache. One mechanism Intel uses to make its caches more effective is prefetching, in which the hardware examines memory access patterns and attempts to fill the caches speculatively with data that is likely to be requested soon. It is interesting to compare the performance of this three-level on-chip cache organization with a comparable two-level organization from Intel. Table 18.1 shows the cache access latency, in clock cycles, for two Intel multicore systems running at the same clock frequency. The Core 2 Quad has a shared L2 cache, similar to the Core Duo. The Core i7 improves on L2 cache performance with the use of dedicated L2 caches, and provides relatively high-speed access to the L3 cache.

The Core i7-990X chip supports two forms of external communication to other chips. The DDR3 memory controller brings the memory controller for the DDR main memory onto the chip. The interface supports three channels that are 8 bytes wide, for a total bus width of 192 bits and an aggregate data rate of up to 32 GB/s. With the memory controller on the chip, the Front Side Bus is eliminated.
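The 32-GB/s figure follows directly from the channel parameters, as a back-of-the-envelope check (1 GT/s taken as 10^9 transfers/s):

```python
channels = 3
bytes_per_transfer = 8        # each channel is 8 bytes (64 bits) wide
transfer_rate = 1.333e9       # DDR3-1333: about 1.333 GT/s per channel

# Aggregate bandwidth across all three channels
bandwidth = channels * bytes_per_transfer * transfer_rate
print(bandwidth / 1e9)        # ~32 GB/s
```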
[Figure 18.10 Intel Core i7-990X Block Diagram: six cores (Core 0 to Core 5), each with 32-kB L1-I and 32-kB L1-D caches and a dedicated 256-kB L2 cache; a shared 12-MB L3 cache; DDR3 memory controllers (3 × 8B @ 1.33 GT/s); QuickPath Interconnect (4 × 20B @ 6.4 GT/s)]
Table 18.1 Cache Latency (in clock cycles)

CPU           Clock Frequency   L1 Cache   L2 Cache    L3 Cache
Core 2 Quad   2.66 GHz          3 cycles   15 cycles   —
Core i7       2.66 GHz          4 cycles   11 cycles   39 cycles

(The DDR synchronous RAM memory is discussed in Chapter 5.)
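The latencies in Table 18.1 can be combined into an average memory access time. The sketch below uses the Core i7 latencies from the table; the miss rates and the 100-cycle DRAM latency are illustrative assumptions, not measured values.

```python
def amat(l1, l2, l3, mem, m1, m2, m3):
    # Average memory access time: each slower level's latency is paid
    # only on the fraction of accesses that miss all faster levels.
    return l1 + m1 * (l2 + m2 * (l3 + m3 * mem))

# Core i7: L1 = 4, L2 = 11, L3 = 39 cycles (Table 18.1);
# assumed miss rates 5% (L1), 30% (L2), 20% (L3); assumed 100-cycle DRAM.
print(amat(4, 11, 39, 100, 0.05, 0.30, 0.20))   # ~5.4 cycles
```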
9.3. Distributed-memory multiprocessing

- Large-scale computers (Warehouse Scale Computers, or Massively Parallel Processors - MPP)
- Cluster computers (clusters)
As a consequence of these and other factors, there is a great deal of interest in building and using parallel computers in which each CPU has its own private memory, not directly accessible to any other CPU. These are the multicomputers. Programs on multicomputer CPUs interact using primitives like send and receive to explicitly pass messages, because they cannot get at each other's memory with LOAD and STORE instructions. This difference completely changes the programming model.

Each node in a multicomputer consists of one or a few CPUs, some RAM (conceivably shared among the CPUs at that node only), a disk and/or other I/O devices, and a communication processor. The communication processors are connected by a high-speed interconnection network of the types we discussed in Sec. 8.3.3. Many different topologies, switching schemes, and routing algorithms are used. What all multicomputers have in common is that when an application program executes the send primitive, the communication processor is notified and transmits a block of user data to the destination machine (possibly after first asking for and getting permission). A generic multicomputer is shown in Fig. 8-36.
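The send/receive programming model can be sketched with ordinary Python threads and queues standing in for nodes and their communication processors. This is an illustrative sketch of the model only; real multicomputers use message-passing libraries such as MPI.

```python
import threading, queue

# One queue per node plays the role of that node's communication processor.
mailbox = {0: queue.Queue(), 1: queue.Queue()}

def send(dest, data):
    mailbox[dest].put(data)        # explicit message; no shared LOAD/STORE

def receive(node):
    return mailbox[node].get()     # blocks until a message arrives

def node1():
    msg = receive(1)               # wait for work from node 0
    send(0, msg * 2)               # send back a transformed value

t = threading.Thread(target=node1)
t.start()
send(1, 21)
reply = receive(0)
t.join()
print(reply)                       # 42
```

Note that the two "nodes" never touch each other's variables; all interaction goes through explicit messages, which is exactly what distinguishes the multicomputer model.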
[Figure 8-36. A generic multicomputer: each node contains one or more CPUs, memory, disk and/or other I/O, and a communication processor on a local interconnect; the communication processors are joined by a high-performance interconnection network]
8.4.1 Interconnection Networks

In Fig. 8-36 we see that multicomputers are held together by interconnection networks. Now it is time to look more closely at these interconnection networks. Interestingly enough, multiprocessors and multicomputers are surprisingly similar in this respect, because multiprocessors often have multiple memory modules that must also be interconnected with one another and with the CPUs. Thus the material in this section frequently applies to both kinds of systems.

The fundamental reason why multiprocessor and multicomputer interconnection networks are similar is that at the very bottom both of them use message passing.
Interconnection networks
[Figure 8-37. Various topologies. The heavy dots represent switches; the CPUs and memories are not shown. (a) A star. (b) A complete interconnect. (c) A tree. (d) A ring. (e) A grid. (f) A double torus. (g) A cube. (h) A 4D hypercube.]
Interconnection networks can be characterized by their dimensionality. For our purposes, the dimensionality is determined by the number of choices there are to get from the source to the destination. If there is never any choice (i.e., there is only one path from each source to each destination), the network is zero-dimensional. If there is one dimension in which a choice can be made (for example, whether to go clockwise or counterclockwise around a ring), the network is one-dimensional.
Massively Parallel Processors

- Large-scale systems
- Expensive: many millions of USD
- Used for scientific computing and for problems with very large operation counts and data sets
- Supercomputers
IBM Blue Gene/P
The hardware maintains coherency between the L1 caches on the four CPUs. Thus, when a shared piece of memory resides in more than one cache, accesses to that storage by one processor will be immediately visible to the other three processors. A memory reference that misses in the L1 cache but hits in the L2 cache takes about 11 clock cycles. A miss on L2 that hits on L3 takes about 28 cycles. Finally, a miss on L3 that has to go to the main DRAM takes about 75 cycles.

The four CPUs are connected via a high-bandwidth bus to a 3D torus network, which requires six connections: up, down, north, south, east, and west. In addition, each processor has a port to the collective network, used for broadcasting data to all processors. The barrier port is used to speed up synchronization operations, giving each processor fast access to a specialized synchronization network.

At the next level up, IBM designed a custom card that holds one of the chips shown in Fig. 8-38 along with 2 GB of DDR2 DRAM. The chip and the card are shown in Fig. 8-39(a)-(b), respectively.
Chip:     4 processors, 8-MB L3 cache
Card:     1 chip (4 CPUs), 2-GB DDR2 DRAM
Board:    32 cards, 32 chips, 128 CPUs, 64 GB
Cabinet:  32 boards, 1024 cards, 1024 chips, 4096 CPUs, 2 TB
System:   72 cabinets, 73728 cards, 73728 chips, 294912 CPUs, 144 TB

Figure 8-39. The BlueGene/P: (a) chip. (b) card. (c) board. (d) cabinet. (e) system.
The cards are mounted on plug-in boards, with 32 cards per board for a total of 32 chips (and thus 128 CPUs) per board. Since each card contains 2 GB of DRAM, the boards contain 64 GB apiece. One board is illustrated in Fig. 8-39(c). At the next level, 32 of these boards are plugged into a cabinet, packing 4096 CPUs into a single cabinet. A cabinet is illustrated in Fig. 8-39(d). Finally, a full system, consisting of up to 72 cabinets with 294,912 CPUs, is depicted in Fig. 8-39(e). A PowerPC 450 can issue up to 6 instructions/cycle.
Cluster

- Many computers connected by a high-speed interconnection network (~ Gbps)
- Each computer can work independently (a PC or an SMP)
- Each computer is called a node
- The computers can be managed to work in parallel as a group (cluster)
- The whole system can be regarded as a single parallel computer
- High availability
- High fault tolerance
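The high availability of a cluster comes from replication: if each node is independently up with probability a, then at least one of n replicas is up with probability 1 - (1 - a)^n. A small illustrative calculation (the 99% single-node figure is an assumption, not a measured value):

```python
def system_availability(a, n):
    # Probability that at least one of n independent replicas is up:
    # the complement of all n being down simultaneously.
    return 1 - (1 - a) ** n

# A single node that is up 99% of the time...
print(system_availability(0.99, 1))   # ~0.99
# ...replicated across 3 nodes:
print(system_availability(0.99, 3))   # ~0.999999
```

Three-way replication turns roughly 3.7 days of downtime per year into about half a minute, which is why clusters tolerate individual node failures so well.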
Google's PC cluster

Not all racks hold exactly 80 PCs, and switches can be larger or smaller than 128 ports; these are just typical values for a Google cluster.
[Figure 8-44. A typical Google cluster: 80-PC racks, each connected by two gigabit Ethernet links to each of two 128-port Gigabit Ethernet switches, which connect to the outside world over OC-12 and OC-48 fiber]
Power density is also a key issue. A typical PC burns about 120 watts, or about 10 kW per rack. A rack needs about 3 m² so that maintenance personnel can install and remove PCs and so that the air conditioning can function. These parameters give a power density of over 3000 watts/m². Most data centers are designed for 600-1200 watts/m², so special measures are required to cool the racks.

Google has learned three key things about running massive Web servers that bear repeating:

1. Components will fail, so plan for it.
2. Replicate everything for throughput and availability.
3. Optimize price/performance.
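The power-density figures quoted above check out arithmetically:

```python
pcs_per_rack = 80
watts_per_pc = 120
rack_area_m2 = 3

rack_power = pcs_per_rack * watts_per_pc    # watts per rack (~10 kW)
density = rack_power / rack_area_m2         # watts per square meter

print(rack_power, density)                  # 9600 3200.0
```

3200 W/m² is indeed well above the 600-1200 W/m² that most data centers are designed for, hence the need for special cooling.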
9.4. General-purpose graphics processing units

- SIMD architecture
- Derived from the GPU (Graphics Processing Unit), which supports 2D and 3D graphics: parallel data processing
- GPGPU - General Purpose Graphics Processing Unit
- Hybrid CPU/GPGPU systems:
  - The CPU is the host: executes sequentially
  - The GPGPU: performs parallel computation
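The host/device division of labor can be sketched as follows. This is a plain-Python simulation of the programming model, with made-up function names; on real hardware the "device" part would run as thousands of GPU threads.

```python
# Device (GPGPU): the same kernel runs for every data element,
# one logical thread per element.
def scale_kernel(thread_id, data):
    return data[thread_id] * 2

def launch_kernel(kernel, data):
    # Simulated launch: a real GPU executes these iterations in parallel.
    return [kernel(tid, data) for tid in range(len(data))]

# Host (CPU): sequential control flow - prepare data, launch the
# parallel kernel, then do a sequential reduction on the results.
def host_program(data):
    results = launch_kernel(scale_kernel, data)
    return sum(results)

print(host_program([1, 2, 3, 4]))   # 20
```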
Graphics processors in a computer
GPGPU: NVIDIA Tesla

- Streaming multiprocessor
- 8 × streaming processors
GPGPU: NVIDIA Fermi
Hardware Execution 
CUDA’s hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes 
one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; 
and CUDA cores and other execution units in the SM execute threads. The SM executes 
threads in groups of 32 threads called a warp. While programmers can generally ignore warp 
execution for functional correctness and think of programming one thread, they can greatly 
improve performance by having threads in a warp execute the same code path and access 
memory in nearby addresses. 
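Warp grouping is simple arithmetic: a block of N threads executes as ceil(N/32) warps, and thread i runs in warp i // 32, lane i % 32. A sketch (helper names are ours):

```python
WARP_SIZE = 32

def num_warps(threads_per_block):
    # Each group of 32 consecutive threads forms one warp;
    # round up so a partial group still occupies a full warp.
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE

def warp_and_lane(thread_id):
    return thread_id // WARP_SIZE, thread_id % WARP_SIZE

print(num_warps(256))      # 8 warps
print(num_warps(100))      # 4 warps (the last one only partially filled)
print(warp_and_lane(70))   # (2, 6): warp 2, lane 6
```

The partially filled warp in the second case is one reason block sizes are usually chosen as multiples of 32.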
An Overview of the Fermi Architecture 
The first Fermi based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA 
cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 
512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory 
partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM 
memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread 
global scheduler distributes thread blocks to SM thread schedulers. 
Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).
NKK-HUST
NVIDIA Fermi
Third Generation Streaming Multiprocessor

The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient.
512 High Performance CUDA Cores

Each SM features 32 CUDA processors, a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.
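The rounding difference between FMA and a separate multiply-add can be demonstrated numerically. Python floats are IEEE 754 doubles, so a fused result can be emulated by computing a*b + c exactly with Fraction and rounding once (an illustrative emulation; hardware FMA does this in a single instruction).

```python
from fractions import Fraction

def fma_emulated(a, b, c):
    # Compute a*b + c exactly, then round once to double precision -
    # the same single-rounding behavior a hardware FMA provides.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

# Separate multiply and add: a*b (exactly 1 - 2**-54) rounds to 1.0,
# so the true answer is lost in the intermediate rounding.
print(a * b + c)               # 0.0
# Fused multiply-add: the single rounding keeps the small residual.
print(fma_emulated(a, b, c))   # -2**-54, about -5.55e-17
```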
In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations. Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.
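Several of the listed integer operations are easy to express in software for comparison. These are Python equivalents of the hardware primitives, for illustration only:

```python
def popcount(x):
    # Population count: number of set bits in x.
    return bin(x).count("1")

def bitfield_extract(x, pos, length):
    # Extract `length` bits of x starting at bit position `pos`.
    return (x >> pos) & ((1 << length) - 1)

def bit_reverse(x, width=32):
    # Reverse the bit order within a fixed-width word.
    return int(format(x, f"0{width}b")[::-1], 2)

print(popcount(0b10110001))           # 4
print(bitfield_extract(0xABCD, 4, 8)) # 0xBC = 188
print(bit_reverse(1))                 # 0x80000000 = 2147483648
```

On Fermi each of these is a single instruction, which is exactly the point of adding them to the integer ALU.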
16 Load/Store Units

Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock. Supporting units load and store the data at each address to cache or DRAM.
[Fermi Streaming Multiprocessor (SM): instruction cache; two warp schedulers, each with its own dispatch unit; a 32,768 × 32-bit register file; 32 CUDA cores, each containing a dispatch port, operand collector, FP unit, INT unit, and result queue; 16 load/store (LD/ST) units; 4 special function units (SFU); interconnect network; 64 KB shared memory / L1 cache; uniform cache]
- 16 Streaming Multiprocessors (SMs)
- Each SM has 32 CUDA cores
- Each CUDA (Compute Unified Device Architecture) core has one FPU and one integer unit (IU)
The End
