IBM's POWER10 Processor on Samsung 7nm (10:00am PT)

June 2024 · 6 minute read

01:08PM EDT - Time for Power10! Bill Starke and Brian Thompto

01:09PM EDT - Bill is chief architect of POWER10

01:09PM EDT - Brian is chief core architect

01:09PM EDT - Power roadmap - Power is about the enterprise

01:09PM EDT - It's the building block for the world's most powerful supercomputers

01:10PM EDT - Financial systems, commercial, healthcare, governments

01:10PM EDT - Power10 is made smarter for everyone

01:10PM EDT - First hardware back in the labs

01:10PM EDT - On track to deliver systems in 12 months

01:10PM EDT - New abilities, ground-up rearchitecting for power efficiency

01:10PM EDT - maturing AI landscape

01:11PM EDT - AI acceleration in the processor core

01:11PM EDT - Integrating into enterprise workflows

01:11PM EDT - 18B transistors on Samsung 7nm, 602B transistors

01:11PM EDT - Two versions of the core: SMT4 and SMT8. This chip is the SMT8 version

01:12PM EDT - 16 physical cores, but 15 will be enabled. Improves economics of yield

01:12PM EDT - High bandwidth PHYs, OMI and PowerAXON and PCIe G5

01:12PM EDT - Two packaging options: Single and Dual chip modules

01:12PM EDT - SCM allows for 16-socket, DCM is 4-socket

01:12PM EDT - Dual chip module is two 602 mm2 chips into one package

01:13PM EDT - 16-socket for big iron systems

01:13PM EDT - PowerAXON and OMI support 1TB/sec each

01:13PM EDT - 150 micron bumps

01:13PM EDT - optimized placement for packaging

01:14PM EDT - PowerAXON is for chip-to-chip connectivity

01:14PM EDT - Several new scaling capabilities

01:14PM EDT - OMI is OpenCAPI Memory Interface

01:14PM EDT - Grandchild of Centaur memory

01:15PM EDT - Tech agnostic - supports any media with OMI buffer

01:15PM EDT - Supports DDR4 at 410 GB/sec bandwidth per Power10 CPU

01:15PM EDT - Will support DDR5 when DDR5 is ready - no new system, just need new OMI buffer chip

01:15PM EDT - Also supports GDDR for up to 800 GB/sec

01:16PM EDT - Also supports storage class memory up to 2 TB

01:16PM EDT - PowerAXON supports direct attach SCM or ASIC/FPGA

01:17PM EDT - Memory Inception comes to Power10 - access memory from any socket in the cluster

01:17PM EDT - Full hardware load/store access to other server memory

01:17PM EDT - Only +150ns compared to accessing far memory within the same server

01:18PM EDT - Supports up to 2 PB of memory

01:18PM EDT - Connect multiple 16-socket systems with Memory Inception

01:18PM EDT - Or servers without memory borrowing from a big server

01:19PM EDT - Paging tables as routing tables

01:19PM EDT - Robust virtual channel management

01:19PM EDT - Allows 1000s of nodes to access memory across the whole system

01:19PM EDT - Pod-level memory resource pooling with extra gear

01:19PM EDT - Memory disaggregation becomes a reality.

01:19PM EDT - Also 64 lanes of PCIe G5

01:20PM EDT - 2.2-4.4x socket performance compared to Power9

01:20PM EDT - *602 mm2 die size, not 602B transistors; correction from earlier

01:21PM EDT - Up to 8 threads per core

01:21PM EDT - +30% average perf against POWER9, +20% in ST

01:21PM EDT - 2.6x perf/watt improvement

01:21PM EDT - DCM is more efficient

01:22PM EDT - In SMT8 mode, 15 cores per chip. In SMT4 mode, 30 cores per chip
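A quick check of the thread math implied by the two core configurations, using only figures quoted in the talk (15 SMT8 cores or 30 SMT4 cores per chip, 16 sockets with single-chip modules, 4 sockets with dual-chip modules). The helper name `hardware_threads` is made up for illustration, and the system totals assume every socket is populated with the SMT8 part.

```python
# Thread counts implied by the figures quoted in the talk.
# hardware_threads() is a hypothetical helper, not an IBM tool.

def hardware_threads(cores: int, threads_per_core: int) -> int:
    """Total hardware threads for a given core count and SMT width."""
    return cores * threads_per_core

smt8_chip = hardware_threads(cores=15, threads_per_core=8)
smt4_chip = hardware_threads(cores=30, threads_per_core=4)

# Assumption: fully populated systems built from the SMT8 chip.
# 16 sockets x 1 chip (SCM) versus 4 sockets x 2 chips (DCM).
scm_system = 16 * 1 * smt8_chip
dcm_system = 4 * 2 * smt8_chip

print(f"SMT8 chip: {smt8_chip} threads, SMT4 chip: {smt4_chip} threads")
print(f"16-socket SCM: {scm_system} threads, 4-socket DCM: {dcm_system} threads")
```

Both core configurations land on the same thread count per chip, which fits the "core is modular" framing: the SMT8 core behaves like two SMT4 halves.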

01:22PM EDT - Core is modular

01:22PM EDT - Container based stack support over PowerVM hypervisor

01:23PM EDT - High performance nested hypervisors with enhanced security

01:23PM EDT - Power ISA 3.1

01:23PM EDT - 64-bit prefix instructions in a RISC-friendly way

01:23PM EDT - New op-code space for instructions
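A rough sketch of what a 64-bit prefixed instruction means for a decoder. As I understand Power ISA 3.1, a prefixed instruction is a 32-bit prefix word followed by a 32-bit suffix word, and the prefix is recognized by primary opcode 1 in its top six bits. That encoding detail was not on the slides, so treat it as an assumption; the function names and the sample word stream are made up for illustration.

```python
# Sketch: measuring instruction lengths in a stream of 32-bit words.
# Assumption (not from the talk): a prefix word has primary opcode 1.

PREFIX_PRIMARY_OPCODE = 1

def primary_opcode(word: int) -> int:
    """Return the top six bits of a 32-bit instruction word."""
    return (word >> 26) & 0x3F

def instruction_lengths(words):
    """Walk the words and return the byte length of each instruction."""
    lengths = []
    i = 0
    while i < len(words):
        if primary_opcode(words[i]) == PREFIX_PRIMARY_OPCODE:
            lengths.append(8)  # prefix word plus suffix word
            i += 2
        else:
            lengths.append(4)  # ordinary fixed-width instruction
            i += 1
    return lengths

# Hypothetical stream: a normal word, a prefix + suffix pair, a normal word.
stream = [0x38600001, 0x06000000, 0x38600001, 0x4E800020]
print(instruction_lengths(stream))
```

The length is known from the first word alone, which is presumably what "RISC-friendly" is getting at: the decoder never has to scan byte by byte.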

01:24PM EDT - Optimizations for memory tiers

01:24PM EDT - Security and isolation

01:24PM EDT - Crypto perf for future algorithms already accelerated

01:24PM EDT - Secure containers supported at hardware and virtualization layers

01:24PM EDT - Full memory encryption

01:25PM EDT - Active management for enhanced performance and side-channel avoidance

01:25PM EDT - Here's a core diagram - this is half an SMT8 core

01:25PM EDT - Each SMT4 segment can do 2x512b and 4x128b per cycle

01:26PM EDT - 4x in mixed math acceleration

01:26PM EDT - 1.5x L1-cache, 4x L2, 4x TLB

01:26PM EDT - 1000 instructions in flight per SMT8 core

01:26PM EDT - L2 latency is 13.5 cycles

01:26PM EDT - L3 latency is 27.5 cycles

01:26PM EDT - New tag predictors

01:26PM EDT - Branch execution has been improved

01:27PM EDT - New instruction fusion opportunities

01:27PM EDT - Eliminates dependencies

01:27PM EDT - Fuse consecutive load/store instructions, double wide load/store bw

01:27PM EDT - Improved clock gating

01:27PM EDT - each design element was redesigned for performance and efficiency

01:28PM EDT - Redesigned major structures such as queues

01:28PM EDT - 1.3x perf at 0.5x power vs Power9

01:28PM EDT - = 2.6x perf/watt overall at the core level
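The core-level number is just the two quoted ratios divided. A minimal check, with variable names of my own choosing:

```python
# Perf/watt from the two ratios quoted against Power9.
perf_ratio = 1.3   # 1.3x performance
power_ratio = 0.5  # at 0.5x power

perf_per_watt = perf_ratio / power_ratio
print(f"{perf_per_watt:.1f}x perf/watt at the core level")
```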

01:28PM EDT - 3x perf/watt at socket level

01:29PM EDT - Also improved memory bandwidth

01:29PM EDT - 2x bytes from all sources: L1, L2, L3, OMI

01:29PM EDT - 4x 32B loads, 2x 32B stores per SMT8 core (Fusion required)
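For a sense of scale, the load/store figures translate to bytes per cycle as below. No clock speed was quoted in the talk, so `ASSUMED_CLOCK_GHZ` is purely a placeholder to show the conversion, not a Power10 specification.

```python
# Per-core L1 load/store width from the quoted "4x 32B loads, 2x 32B stores".
LOADS_PER_CYCLE = 4
STORES_PER_CYCLE = 2
ACCESS_BYTES = 32

load_bytes_per_cycle = LOADS_PER_CYCLE * ACCESS_BYTES
store_bytes_per_cycle = STORES_PER_CYCLE * ACCESS_BYTES

# ASSUMPTION: placeholder clock, not a figure from the talk.
ASSUMED_CLOCK_GHZ = 4.0

def peak_gb_per_s(bytes_per_cycle: int, clock_ghz: float) -> float:
    """Bytes/cycle times cycles/ns gives GB/s."""
    return bytes_per_cycle * clock_ghz

print(f"Loads:  {load_bytes_per_cycle} B/cycle")
print(f"Stores: {store_bytes_per_cycle} B/cycle")
print(f"At an assumed {ASSUMED_CLOCK_GHZ} GHz: "
      f"{peak_gb_per_s(load_bytes_per_cycle, ASSUMED_CLOCK_GHZ):.0f} GB/s load, "
      f"{peak_gb_per_s(store_bytes_per_cycle, ASSUMED_CLOCK_GHZ):.0f} GB/s store")
```

These are peak figures per SMT8 core and, per the slide, only reachable when load/store fusion applies.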

01:29PM EDT - OMI to one core - 256 GB/sec peak, 120 GB/s sustained, 3x L3 prefetch and mem prefetch extensions

01:30PM EDT - 8 SIMD 128-bit engines per SMT8 core

01:30PM EDT - supports fixed, float, permute

01:30PM EDT - 4 512b engines per SMT8 core

01:30PM EDT - supports FP64, FP32, FP16, BF16, INT16, INT8, INT4
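To put the format list in context, here is how many elements of each type fit in 512 bits. This is only a bit-width calculation from the formats named on the slide; it says nothing about actual per-cycle throughput, which was not broken out by format.

```python
# Elements of each listed format that fit in one 512-bit engine width.
ENGINE_BITS = 512

FORMAT_BITS = {
    "FP64": 64,
    "FP32": 32,
    "FP16": 16,
    "BF16": 16,
    "INT16": 16,
    "INT8": 8,
    "INT4": 4,
}

elements_per_engine = {
    name: ENGINE_BITS // bits for name, bits in FORMAT_BITS.items()
}

for name, count in elements_per_engine.items():
    print(f"{name:>5}: {count} elements per 512 bits")
```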

01:31PM EDT - New MMA enhanced inference acceleration

01:32PM EDT - Simple library update needed in most cases

01:32PM EDT - Implements data-reuse efficiency
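A sketch of where the data reuse comes from, assuming the MMA engines work as outer-product accumulators (my reading of Power ISA 3.1, not something spelled out in the talk). A 512-bit accumulator holds a 4x4 tile of FP32 values, and each step combines two 4-element inputs into 16 multiply-adds. The function names here are hypothetical and this is plain Python, not the hardware intrinsics.

```python
# Sketch: matrix multiply as a series of rank-1 (outer-product) updates.
# Assumption: a 4x4 FP32 tile models one 512-bit accumulator.

def outer_product_accumulate(acc, x, y):
    """acc[i][j] += x[i] * y[j] for a 4x4 accumulator tile."""
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            acc[i][j] += xi * yj
    return acc

def matmul_by_outer_products(a, b):
    """Multiply a (4 x k) by b (k x 4) using k outer-product steps."""
    k = len(b)
    acc = [[0.0] * 4 for _ in range(4)]
    for step in range(k):
        a_column = [a_row[step] for a_row in a]  # 4 values from A
        b_row = b[step]                          # 4 values from B
        outer_product_accumulate(acc, a_column, b_row)
    return acc

identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
b = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
result = matmul_by_outer_products(identity, b)
print(result)
```

Each step reads 8 input values and performs 16 multiply-adds, so every loaded value is used four times while the running sums stay in the accumulator.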

01:32PM EDT - 3x inference latency reduction

01:32PM EDT - Improvements over POWER9

01:33PM EDT - Time scale for Power10 is that initial systems for IBM partners will be available Q4 2021

01:33PM EDT - (IBM usually does this - announce a core/product 12 months in advance)

01:33PM EDT - To allow for customers and developers to adjust

01:34PM EDT - Q&A time

01:35PM EDT - Q: PCIe Gen6? Will future Power10 enable this? A: No talk about our future products. We're glad that PCIe is speeding up, we always look at market conditions to create chips.

01:36PM EDT - Q: Read latency increase with OMI DIMM? A: less than +10ns

01:37PM EDT - Q: Did power delivery get upgraded, or still on-die LDOs? A: Will go into detail at ISSCC. Still a similar delivery platform to Power9

01:39PM EDT - Q: Do POWER and z work together? A: Yes, all the time. We peer review each other. We get questions about arch differences - each product is suited to its client base. Extremely justified. We do the peer review, so we become experts in both. We share IP as well, like OMI, as well as other features. Also physical design etc. Lots of synergy, but also lots of differences

01:39PM EDT - That's a wrap. Next talk is ThunderX3 from Marvell
