# VIPS: SIMPLE, EFFICIENT, AND SCALABLE CACHE COHERENCE

Alberto Ros<sup>1</sup> Stefanos Kaxiras<sup>2</sup> Kostis Sagonas<sup>2</sup> Mahdad Davari<sup>2</sup> Magnus Norgren<sup>2</sup> David Klaftenegger<sup>2</sup>



<sup>1</sup>Universidad de Murcia aros@ditec.um.es

<sup>2</sup>Uppsala University

Dec 17, 2015





- Cache coherence protocols ease programming
- Coherence overhead is an important issue
- But, coherence is sporadically needed
  - $\rightarrow$  Why pay always?
- Our goal  $\rightarrow$  Simplify coherence
  - And enforce it only when needed
- How? VIPS family of cache coherence protocols
  - Simple, Efficient, Scalable



| Alberto Ros | BSC, Spain | Dec 17, 2015 | 3 / 54 |
|-------------|------------|--------------|--------|

## OUTLINE

- Write-through protocols are simple
  - Only Valid and Invalid states
  - But they are not efficient because of write misses
- Which write misses?
  - Private data in a write-back policy
    - $\rightarrow$  evicted due to capacity/conflict misses
  - Shared data in a write-back policy
    - $\rightarrow~$  evicted due to capacity/conflict/coherence misses
- Mostly private data misses  $\approx$  90%



### Dynamic write policy in the L1s (private caches, in general)

- Write-back for Private blocks
  - Simple (no coherence required) as in uniprocessors
  - Efficient  $\rightarrow$  no extra misses
- Write-through for Shared blocks
  - Simple (only two states, VI)
  - Efficient $\rightarrow$ coherence misses
- VIPS: Valid/Invalid Private/Shared

- Classify data (cache blocks) into private and shared
  - A-priori: Before issuing the coherence transaction we know if it is for a private or for a shared block
    - i.e., OS/TLB, compiler, application
- Page-level classification using the OS and the TLBs
  - Both page table and TLB entries have a P/S bit
  - The first TLB miss by a core sets the page to P
  - Subsequent TLB misses set the page to S
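The first-touch rule above can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the authors' OS implementation: `PageTable`, `tlb_miss`, and `write_policy` are hypothetical names introduced here, only a TLB miss by a *different* core is assumed to turn a page shared, and transitions back to private are ignored.

```python
class PageTable:
    """Toy page table that keeps a private/shared (P/S) bit per page."""

    def __init__(self):
        # page number -> (state, first_core); state is "P" or "S"
        self.entries = {}

    def tlb_miss(self, core, page):
        """Handle a TLB miss by `core` on `page`; return the P/S bit for the TLB entry."""
        if page not in self.entries:
            # First TLB miss by any core: the page starts out private.
            self.entries[page] = ("P", core)
        else:
            state, first_core = self.entries[page]
            # Assumption: only a miss by another core makes the page shared.
            if state == "P" and core != first_core:
                self.entries[page] = ("S", first_core)
        return self.entries[page][0]


def write_policy(ps_bit):
    """Dynamic policy from the slides: write-back for private, write-through for shared."""
    return "write-back" if ps_bit == "P" else "write-through"


pt = PageTable()
first = pt.tlb_miss(core=0, page=42)
second = pt.tlb_miss(core=1, page=42)
print(first, write_policy(first))
print(second, write_policy(second))
```

A real system would also have to update the first core's TLB entry when the page moves from P to S; that step is left out of the sketch.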

- Simplifies the protocol to just two states (VI)
- Write-throughs eliminate the need to track writers at the directory
  - $\rightarrow$  Area reduction
- No indirection for read misses
  - $\rightarrow$ Correct shared data always at the LLC
- Supports sequential consistency for every application
  - Same consistency model as the more complex MESI
- But we still have invalidations and directory blocking...

- We provide sequential consistency for DRF programs
- Self-Invalidation of shared data from L1s
  - Selective Flush (SF) upon synchronization points
  - We eliminate invalidations
  - The directory is gone!
- Multiple writers allowed for shared data
  - Self-downgrade
  - No need to request write permission
  - Write-through of diffs
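Why writing through diffs lets several writers coexist can be illustrated with a small model. The sketch below assumes a per-word dirty mask in each private copy and a 4-word block; `L1Block`, `BLOCK_WORDS`, and `self_downgrade` are illustrative names, not taken from the talk.

```python
BLOCK_WORDS = 4  # words per cache block (assumed size for the sketch)


class L1Block:
    """Private copy of a shared block with a per-word dirty mask."""

    def __init__(self, data):
        self.data = list(data)
        self.dirty = [False] * BLOCK_WORDS

    def write(self, word, value):
        self.data[word] = value
        self.dirty[word] = True

    def self_downgrade(self, llc_block):
        """Write through only the dirty words (the diff) and clear the mask."""
        for i in range(BLOCK_WORDS):
            if self.dirty[i]:
                llc_block[i] = self.data[i]
                self.dirty[i] = False


llc = [0, 0, 0, 0]
a, b = L1Block(llc), L1Block(llc)
a.write(0, 11)          # core A writes word 0
b.write(3, 33)          # core B writes word 3 of the same block
a.self_downgrade(llc)
b.self_downgrade(llc)
print(llc)  # [11, 0, 0, 33]
```

If each core wrote back its whole block instead, the second write-through would overwrite word 0 with B's stale copy. Merging per-word diffs at the LLC avoids that for data-race-free code, where two cores do not write the same word without synchronization.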

- Selective flushing eliminates the need to track readers at the directory
  - No need to send invalidations
  - The directory is gone!
- Indirection completely removed
- Private and DRF protocols practically the same
  - $\rightarrow$ They differ only in when data is written back to the LLC
- Provides correct semantics for synchronization instructions
- Supports sequential consistency for DRF programs

## **EXECUTION TIME**

- Hammer increases execution time w.r.t. MESI, and the performance of a WT policy is prohibitive
- VIPS performs similarly to MESI
- VIPS-M improves on MESI by 4.8%, on average

#### NORMALIZED EXECUTION TIME W.R.T. DIRECTORY



### **ENERGY CONSUMPTION**

- Hammer and WT consumption is undesirable
- VIPS consumes similar energy to MESI
- VIPS-M reduces consumption by 14.2% mainly due to its lower traffic requirements

#### NORMALIZED ENERGY CONSUMPTION W.R.T. DIRECTORY



## OUTLINE

- What is virtual-cache coherence?
  - Keeping cache coherence in a system with virtual caches
- Coherence is maintained for physical addresses (e.g., shared cache)



#### CONTRIBUTION

Simple and efficient approach that supports virtual caches in a cache-coherent multicore system, thus saving most of the energy consumed by the TLBs

## VIRTUAL VS. PHYSICAL CACHES

- Simple: Physically-indexed, physically-tagged (PIPT) caches
  - Address translation before accessing the cache
  - BUT: high latency and high energy consumption due to TLB accesses
- Performance: Virtually-indexed, physically-tagged (VIPT) caches
  - Translation before comparing the tags
  - TLB and cache accessed in parallel $\rightarrow$ latency OK
  - BUT STILL: high energy consumption
- Efficient: Virtually-indexed, virtually-tagged (VIVT) caches
  - No TLB translation required on cache hits
  - NO extra latency or energy on cache hits
  - Larger TLBs, shared TLBs
  - Problem: synonyms





- Synonyms: Different virtual addresses mapping to the same physical address
  - Address mapping changes or sharing among processes
- IN VIVT CACHES: Multiple copies of the same (physical) block in cache → inconsistency
  - Hardware solutions (complex, expensive): Upon a miss check if there are synonyms
    - Cache search: Looks in all possible sets
- IN MULTIPROCESSORS: Reverse translation for messages going from the physical to the virtual domain
  - Reverse map (R-tag memory) [Goodman, ASPLOS'87]
    - Hardware and memory requirements, and design complexity
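The synonym problem in a VIVT cache can be reproduced with a toy model. The sketch assumes a write-back cache indexed and tagged purely by virtual address; the page numbers, `translate`, `load`, and `store` are made up for illustration.

```python
PAGE = 4096

# Two virtual pages mapped to the same physical page (a synonym).
page_table = {0x10: 0x80, 0x20: 0x80}  # virtual page -> physical page

memory = {}      # physical address -> value
vivt_cache = {}  # virtual address -> value (tagged by virtual address only)


def translate(vaddr):
    """Virtual-to-physical translation, needed only on a cache miss."""
    return page_table[vaddr // PAGE] * PAGE + vaddr % PAGE


def load(vaddr):
    if vaddr not in vivt_cache:          # miss: translate and fetch from memory
        vivt_cache[vaddr] = memory.get(translate(vaddr), 0)
    return vivt_cache[vaddr]             # hit: no translation needed


def store(vaddr, value):
    load(vaddr)                          # write-allocate
    vivt_cache[vaddr] = value            # write-back: memory is not updated yet


va1 = 0x10 * PAGE + 8
va2 = 0x20 * PAGE + 8
assert translate(va1) == translate(va2)  # same physical location

load(va2)        # a copy is cached under the second virtual address
store(va1, 99)   # the update goes through the first virtual address
print(load(va1), load(va2))  # the two cached copies now disagree
```

The same physical word is cached twice and the copies diverge, which is the inconsistency that cache search or a reverse map is meant to prevent.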

- We address this problem by focusing on the coherence protocol
- When is reverse translation performed?
  - For every coherence message sent from the physical domain (shared cache) to the virtual domain (private cache)
  - In traditional coherence protocols:
    - Invalidations, downgrades, and forwardings: Not expected by the cache controller (no MSHR entry)
    - Data and acks: expected by the cache controller (MSHR entry)

Virtual-cache coherence without reverse translations is possible with a protocol that does not have invalidations, downgrades, or forwardings toward the L1s

- Can coherence protocols satisfy the previous condition while being efficient?
  - Yes, VIPS-M!

- Self-Invalidation eliminates directory invalidations
  - No invalidations issued from the LLC to the L1s
- Write-throughs keep data clean in the L1 caches
  - No downgrades issued from the LLC to the L1s
- Write-throughs keep data up to date in the LLC
  - No forwardings issued from the LLC to the L1s
  - Indirection completely removed

VIPS-M can work with virtual caches without requiring reverse translation and in the presence of synonyms


- CT1NL: Physically-tagged L1 caches
- C1TNL: Virtual L1 caches, private TLBs
- C1NTL: Virtual L1 caches, shared TLBs


### **ENERGY CONSUMPTION**



- Around 17% energy reduction thanks to the use of virtual caches, mainly because of fewer TLB lookups
- VIPS-M keeps its advantage w.r.t. MESI (savings of 20% in total)
- The VIPS-M configurations with virtual caches consume similar energy



## **EXECUTION TIME**

- Reverse translation is a problem for MESI protocols, especially for the shared TLB configuration
- VIPS-M obtains improvements by sharing the TLB
- VIPS-M with virtual caches improves execution time by 5.4% w.r.t. MESI with physical caches



## CONCLUSIONS

- Virtual cache coherence can be implemented without reverse translations and without increasing complexity
- Our approach obtains execution time, energy, and area improvements w.r.t. MESI



## OUTLINE

- Need for simple, scalable, and efficient cache coherence
  - Many-core systems, GPUs?, accelerators??
- Traditional directory protocols
  - Explicit invalidation/downgrades on writes/reads ⇒ Complex
  - Directory to track copies ⇒ Non-scalable
  - Indirection $\Rightarrow$ Inefficient

- Simple cache coherence: VIPS-M [Ros & Kaxiras, PACT'12]
  - Strictly request-response ⇒ Simple
  - Coherence distributed across cores ⇒ Scalable
  - No directory  $\Rightarrow$  Simple and scalable
  - Efficient for data-race-free code
- How? Self-invalidation and self-downgrade (SISD)
  - Synchronization exposed to the protocol
- Release: Self-downgrade
  - $\Rightarrow$ Write-through dirty blocks
- Acquire: Self-invalidation
  - $\Rightarrow$ Empty the cache

#### EXAMPLE OF DRF CODE

`/* Initially X = 0 */`

| Thread 1        | Thread 2           |
|-----------------|--------------------|
| `X = 1;`        | `WAIT(cond);` (SI) |
| `SIGNAL(cond);` | `$r1 = X;`         |

Sequential consistency (SC) for data-race-free (DRF) code
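The release/acquire behaviour can be sketched as a toy private cache. This is a minimal model assuming a dictionary stands in for the LLC and that an acquire first writes through any dirty data before emptying the cache; `SisdCache` and its methods are hypothetical names, not the protocol's actual interface.

```python
class SisdCache:
    """Toy private cache for shared data with self-invalidation and self-downgrade."""

    def __init__(self, llc):
        self.llc = llc        # dict shared by all cores, stands in for the LLC
        self.lines = {}       # address -> cached value
        self.dirty = set()    # addresses written since the last release

    def load(self, addr):
        if addr not in self.lines:               # miss: fetch from the LLC
            self.lines[addr] = self.llc.get(addr, 0)
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def release(self):
        """Self-downgrade: write dirty data through to the LLC."""
        for addr in self.dirty:
            self.llc[addr] = self.lines[addr]
        self.dirty.clear()

    def acquire(self):
        """Self-invalidation: empty the cache so later loads refetch."""
        self.release()        # assumption: dirty data is never silently dropped
        self.lines.clear()


llc = {"X": 0}
writer, reader = SisdCache(llc), SisdCache(llc)
stale = reader.load("X")      # reader caches X = 0

writer.store("X", 1)
writer.release()              # SIGNAL(cond): self-downgrade

reader.acquire()              # WAIT(cond) returns: self-invalidate
r1 = reader.load("X")         # refetched from the LLC
print(stale, r1)
```

No invalidation is ever sent to the reader: it sees the new value only because its own acquire discarded the stale copy.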

- Even DRF applications contain races!
  - Synchronization is inherently racy
  - Implemented with spin-waiting
- Spin-waiting is not efficient under SISD
  - Writes require fast propagation
    - Write-through and repeated self-invalidation
  - Repeated self-invalidation ⇒ spin on last level cache (LLC)
    - Increases network traffic and LLC accesses ⇒ energy
- VIPS-M solution
  - ⇒ Exponential back-off
    - Advantage: reduces SI, network traffic, and LLC accesses
    - Drawback: slows down propagation of writes

#### EXAMPLE OF DRF CODE

`/* Initially X = 0 */`

| Thread 1    | Thread 2         |
|-------------|------------------|
| `X = 1;`    | `while(!cond);`  |
| `cond = 1;` | `$r1 = X;`       |

#### Energy-performance trade-off!
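The trade-off can be made concrete with a small deterministic model. The sketch assumes abstract "cycles", one LLC access per poll, and a delay that is multiplied by a constant factor after each unsuccessful poll; `spin` and its parameters are illustrative, not the evaluated implementation.

```python
def spin(write_time, initial_delay, factor, max_delay=1024):
    """Return (llc_accesses, detection_time) for a spinner polling the LLC.

    The flag becomes visible in the LLC at `write_time`. Each poll
    self-invalidates and re-reads the flag from the LLC, then waits `delay`
    cycles before polling again; the delay grows by `factor` after every
    unsuccessful poll, capped at `max_delay`.
    """
    now, delay, accesses = 0, initial_delay, 0
    while True:
        accesses += 1                      # one LLC access per poll
        if now >= write_time:
            return accesses, now
        now += delay
        delay = min(delay * factor, max_delay)


print(spin(1000, 1, 1))  # constant delay: (1001, 1000)
print(spin(1000, 1, 2))  # exponential back-off: (11, 1023)
```

In this model, back-off cuts LLC accesses from 1001 to 11 but notices the write 23 cycles late, which is the energy-performance trade-off named above.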


#### THE CHALLENGE

Fast and efficient write propagation...

- without explicit invalidations/downgrades
- keeping a simple request-response protocol

- A mechanism with a directory just for races involved in spin-waiting
  - Only special loads (or atomics) called LOAD\_CALLBACK (LD\_CB) can allocate an entry in the directory
- An LD\_CB is similar to a load instruction, but
  - Bypasses the private caches
  - May block at the shared cache waiting for a write to happen
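One way to picture a callback directory is the following sketch. It is an assumption-laden model: the slides only state that an LD\_CB bypasses the private caches and may block at the shared cache until a write happens. The per-core version bookkeeping used here to decide *when* to block, and the names `CallbackDirectory`, `ld_cb`, and `store`, are introduced purely for illustration.

```python
class CallbackDirectory:
    """Toy callback directory kept at the shared cache (LLC)."""

    def __init__(self):
        self.value = {}     # address -> current value in the LLC
        self.version = {}   # address -> number of writes so far
        self.seen = {}      # (core, address) -> version last returned to that core
        self.blocked = {}   # address -> cores whose LD_CB waits for a write

    def ld_cb(self, core, addr):
        """Return the value if it is new to `core`; otherwise block (None)."""
        version = self.version.get(addr, 0)
        if self.seen.get((core, addr), -1) < version:
            self.seen[(core, addr)] = version
            return self.value.get(addr, 0)
        self.blocked.setdefault(addr, []).append(core)   # allocate a callback
        return None

    def store(self, addr, value):
        """Write the value and answer every blocked LD_CB; returns {core: value}."""
        self.value[addr] = value
        self.version[addr] = self.version.get(addr, 0) + 1
        woken = self.blocked.pop(addr, [])
        for core in woken:
            self.seen[(core, addr)] = self.version[addr]
        return {core: value for core in woken}


d = CallbackDirectory()
first = d.ld_cb(core=1, addr="cond")    # first read returns the current value
second = d.ld_cb(core=1, addr="cond")   # nothing new: the load blocks
answers = d.store("cond", 1)            # the write answers the blocked load
print(first, second, answers)
```

The spinner issues one request and receives one response when the write arrives, so the protocol stays request-response and no invalidation is sent to any L1.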

## CALLBACK EXAMPLE




### Execution time

- As good as the best BACKOFF case
- 5% better than BACKOFF-10 (VIPS-M)
- 11% better than INVALIDATION

### Energy consumption

- INVALIDATION spins in L1
- BACKOFF-0 spins in the LLC
- CALLBACKS removes spinning (40% and 5% reduction)



## TATAS vs. CLH



- T&T&S + Callbacks allows only one of the threads to race for acquiring the lock
- T&T&S + Callbacks provides fairness



- CALLBACKS: special loads for races in spin-waiting
  - Requires a very small directory
- Simpler and more efficient than explicit invalidation!
- Transparent to the coherence protocol
- Makes simple synchronization algorithms, such as T&T&S, efficient

## OUTLINE

## MOTIVATION

- Clustered cache hierarchies are a natural strategy for reducing the overhead introduced by cache coherence protocols (e.g., storage and traffic)<sup>1</sup>
- But clustered cache hierarchies bring another problem
  - ⇒ Design complexity: Keep the SWMR invariant in a clustered cache hierarchy
    - A root node sends invalidations and waits for acks
    - A leaf node receives invalidations and answers with acks
    - An intermediate node in a hierarchy performs both actions ⇒ cross-product of states!

(E.g., MOESI in GEMS: L1  $\rightarrow$  16; L2  $\rightarrow$  59; memory  $\rightarrow$  13)

<sup>1</sup> Martin, Hill, and Sorin. "Why on-chip cache coherence is here to stay", CACM, 2012.

- Simplify the source of complexity: invalidation/downgrade
  - No write-invalidation ⇒ self-invalidation (SI) on synchronization points
  - No read-downgrade ⇒ self-downgrade (SD) on synchronization points
  - Provide sequential consistency for data-race-free (SC for DRF) applications

- Our approach for simplifying the protocol is SI/SD
- A naïve implementation has to SI/SD all the data in the cache hierarchy
  - Not efficient!
- A new approach for restricting SI/SD in a clustered cache hierarchy is required

## SOLUTION

- We solve this problem by introducing the concept of hierarchical P/S classification
  - A block can be shared inside a cluster but be private outside
  - The level where this transition happens is the common sharing level (CSL)
  - Restrict SI/SD to shared blocks within a cluster
- Result
  - The protocol remains simple  $\Rightarrow$  NO hierarchical complexity
  - Hierarchical complexity transferred to classification
  - In this paper we do classification at page level by adding information to the page tables
    - So all complexity is transferred to software



- SI/SD only for blocks in shared pages
- Page table entry (global hierarchy knowledge) stores:
  - First requester of a page ($\log_2 N$ bits)
  - CSL ($\lceil \log_2 \lceil \log_2 N / \log_2 d \rceil \rceil$ bits): Root of the cluster containing all sharers
- TLB entry (local hierarchy knowledge) stores the CSL of the page
  - CSL is known before the cache miss takes place (a-priori)
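How a page-table entry could track the common sharing level is sketched below. The model makes several assumptions that are not in the slides: cores are numbered so that integer division by the cluster degree `d` yields the enclosing cluster at each level, level 0 means "private to one core", and `N` is a power of `d`. The function names are hypothetical.

```python
import math


def csl(core_a, core_b, degree):
    """Lowest hierarchy level whose cluster contains both cores (0 = same core)."""
    level = 0
    while core_a != core_b:
        core_a //= degree      # move both cores up to their parent cluster
        core_b //= degree
        level += 1
    return level


def update_csl(current_csl, first_requester, requester, degree):
    """New CSL of a page after `requester` misses in its TLB."""
    return max(current_csl, csl(first_requester, requester, degree))


def csl_bits(num_cores, degree):
    """Bits to encode a level: ceil(log2(ceil(log2 N / log2 d)))."""
    levels = math.ceil(math.log2(num_cores) / math.log2(degree))
    return math.ceil(math.log2(levels))


# 64 cores in clusters of 4: a page first touched by core 0.
level = 0
level = update_csl(level, 0, 3, 4)    # core 3 is in the same 4-core cluster
level = update_csl(level, 0, 5, 4)    # core 5 is in the neighbouring cluster
print(level, csl_bits(64, 4))
```

Because clusters are nested, comparing each new requester against the stored first requester is enough: the highest level seen so far is the root of the smallest cluster containing all sharers.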

- H-MOESI: Hierarchical full-map
- VIPS-H: P/S bit

### NUMBER OF STATES AND EXTRA BITS REQUIRED (16X4)

| Controller | H-MOESI states (total/base) | H-MOESI bitmap bits | H-MOESI total bits | VIPS-H states (total/base) | VIPS-H P/S bits | VIPS-H total bits |
|------------|-----------------------------|---------------------|--------------------|----------------------------|-----------------|-------------------|
| L1 cache   | 16/5                        | 0                   | 3                  | 9/3                        | 1               | 3                 |
| L2 cache   | 59/13                       | 16                  | 20                 | 5/3                        | 1               | 3                 |
| L3 cache   | 13/4                        | 4                   | 6                  | 4/3                        | 1               | 3                 |
| Total cost | 844KB                       |                     |                    | 204KB                      |                 |                   |

76% memory reduction compared to H-MOESI
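The numbers in the table can be checked with a short calculation. The sketch assumes that the "total bits" column is the bits needed to encode the base states plus the bitmap or P/S bits, and it reads the total-cost row as 844KB for H-MOESI and 204KB for VIPS-H; both readings are inferred from the table rather than stated in the talk.

```python
import math

# (base states, extra bits per cache line) taken from the table
h_moesi = {"L1": (5, 0), "L2": (13, 16), "L3": (4, 4)}
vips_h = {"L1": (3, 1), "L2": (3, 1), "L3": (3, 1)}


def total_bits(base_states, extra_bits):
    """State-encoding bits plus sharer-bitmap or P/S bits."""
    return math.ceil(math.log2(base_states)) + extra_bits


h_bits = {level: total_bits(*v) for level, v in h_moesi.items()}
v_bits = {level: total_bits(*v) for level, v in vips_h.items()}
reduction = 1 - 204 / 844   # total cost in KB

print(h_bits, v_bits, round(100 * reduction))
```

Under these assumptions the per-level totals (3, 20, 6 versus 3, 3, 3) and the 76% reduction both match the slide.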

## COMPARISON TO H-MOESI AND VIPS-M: TIME

- Flat VIPS-M degrades performance by 10% w.r.t. H-MOESI for 4x4, but gets similar performance for 16x4
- VIPS-H improves execution time by about 11% for 16x4
- VIPS scales better than H-MOESI in time



## COMPARISON TO H-MOESI AND VIPS-M: TRAFFIC

- VIPS increases *Response\_data* ⇒ more cache misses
- But less control traffic ⇒ no invalidation acks
- It scales better than H-MOESI in traffic (5%-7% for 16x4)



## CONCLUSIONS

- Simple and efficient cache coherence for clustered cache architectures
- Keys:
  - Self-invalidation and self-downgrade and the assumption of SC for DRF semantics
  - Hierarchical private/shared classification
- Results:
  - Simpler than H-MOESI
    - Fewer states (from 94 to 18)
    - Less memory overhead (76%)
  - Better performance (11%, on average for 16x4)
  - Reduced network traffic (7%, on average for 16x4)
  - Better scalability

## OUTLINE

### VIPS-M ⇒ Self-Invalidation & Self-Downgrade

- VIPS coherence is truly distributed.
- Coherence decisions are taken independently without any inter-core interaction
  - ⇒ Simplifies whole system design
- Request-Response from the L1s to the LLC
  - No requests from LLC to L1s
  - No traffic among L1s, only L1 ⇔ LLC

Can this be the answer to distributed coherence?


#### **CPU, DRAM and Network Trends**

### VIPS-DSM for distributed systems

- User-space implementation
- Runs Pthreads (DRF programs)
  - Small porting effort to fully exploit new synchronization system and optimize synchronization performance
- Page-based DSM (uses virtual memory faults for misses)
- Pages have a home node (limitation: naïve distribution)
- MPI is the "network layer" (limitation: only need RDMA)

## COMPONENTS OF ARGO



- CARINA: VIPS-DSM coherence
- PYXIS: Classification directories
- VELA: Hierarchical Queue Delegation Locking system

## CARINA & PYXIS: COHERENCE & DIRECTORIES

- Modified VIPS: SI & SD
  - Strictly request-response for DRF accesses
- Pyxis classification directories cached at nodes
  - NO message handlers to classify pages and propagate classification changes
  - Requestors are responsible for updating the classification at remote nodes (P→S, requestor updates private owner)

### Classification:

- Only for Global shared memory (Gmalloc'ed)
- Adds classification for writers
- Private, Shared-NW (No Writers), Shared-SW (Single Writer), Shared-MW (Multiple Writers)
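The four classes can be read as a function of which nodes have read and written a page. The sketch below is one plausible interpretation of the names only; the slides do not spell out the transition rules, and `classify` is a hypothetical helper rather than part of Argo's API.

```python
def classify(readers, writers):
    """Map the sets of nodes that read and wrote a page to a classification."""
    sharers = readers | writers
    if len(sharers) <= 1:
        return "Private"        # at most one node has touched the page
    if not writers:
        return "Shared-NW"      # several readers, no writers
    if len(writers) == 1:
        return "Shared-SW"      # exactly one writing node
    return "Shared-MW"          # several writing nodes


print(classify({0}, {0}))
print(classify({0, 1}, set()))
print(classify({0, 1}, {1}))
print(classify({0, 1}, {0, 1}))
```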



## VELA: ARGO'S SYNCHRONIZATION SYSTEM

- The trouble with distributed critical section (CS) execution: Serialized execution that migrates from node to node!
  - Forces data accessed in CS to migrate too
  - Must SI on Lock, SD on Unlock
- Solution: Queue Delegation Locking [SPAA'14, EuroPar'14]
  - Delegate the execution of the CS to the current holder of the lock (up to a point)
- Hierarchical QDL: Delegate only locally



## RESULTS



### **BLACKSCHOLES**



### EP CLASS D



### CG CLASS C



## OUTLINE


## RECAP

- In this talk:
  - VIPS-M: Simple Request-Response Protocols [PACT'12]
  - VIPS-V: Virtual Cache Coherence [ISCA'13]
  - VIPS-H: Clustered Hierarchies [HPCA'15]
  - Callbacks: Efficient Spin-Waiting [ISCA'15]
  - Argo: Distributed Shared Memory [HPDC'15]
- Other VIPS works:
  - VIPS-B: Bus coherence [SoCC'12]
  - Fast&Furious: Data-Race Detector [PARMA-DITAM'15]
  - VIPS-GC: Generational Coherence [TACO'15]
  - Dir<sub>1</sub>-SISD: Self-Contained Directories [PACT'15]
  - VIPS-G: CPU-GPU Coherence [TACO'16]

# VIPS: SIMPLE, EFFICIENT, AND SCALABLE CACHE COHERENCE

Alberto Ros<sup>1</sup> Stefanos Kaxiras<sup>2</sup> Kostis Sagonas<sup>2</sup> Mahdad Davari<sup>2</sup> Magnus Norgren<sup>2</sup> David Klaftenegger<sup>2</sup>

<sup>1</sup>Universidad de Murcia aros@ditec.um.es

<sup>2</sup>Uppsala University

Dec 17, 2015





### SIMULATION ENVIRONMENT

- SIMICS (functional simulation) + GEMS (memory timing) + GARNET (network)
- CACTI 6.5 for 32nm technology
- Simulated a 16-tile multicore
  - 32KB 4-way I&D L1s, 8MB (512KB/bank) 16-way L2 (LLC)
  - 16-entry MSHRs with 1000-cycle timeout
- SPLASH-2, scientific, and PARSEC benchmarks.

| Protocol      | Invalidations | Directory | Indirection           | L1 base states |
|---------------|---------------|-----------|-----------------------|----------------|
| Hammer        | Broadcast     | None      | Yes                   | 5 (MOESI)      |
| Directory     | Multicast     | Full-map  | Yes                   | 4 (MESI)       |
| Write-Through | Multicast     | Full-map  | Only write misses     | 2 (VI)         |
| VIPS          | Multicast     | Full-map  | Only for write misses | 2 (VI)         |
| VIPS-M        | None          | None      | No                    | 2 (VI)         |

### EVALUATION L1 SENSITIVITY ANALYSIS

### PERFORMANCE 16KB-64KB L1



### ENERGY 16KB-64KB L1






- Cold-cap-conf misses decrease due to the lack of write misses for DRF blocks
- Misses due to write throughs are not significant



### Works very well for small critical sections



- Exponential back-off is required for power reasons for large critical sections
- Considering hardware synchronization, all protocols will be reduced to request-response transactions
