
# NON-SPECULATIVE REORDERING OF MEMORY OPERATIONS WITH STRONG CONSISTENCY

### **Alberto Ros**

Universidad de Murcia

November 29th, 2017


## OUTLINE

- **1** MEMORY CONSISTENCY AND PROGRAM ORDER
- **2** RELAXING PROGRAM ORDER WITH A STORE BUFFER
- **3** KEEPING PROGRAM ORDER VIA SPECULATION
- **4** A NON-SPECULATIVE SOLUTION: WRITERSBLOCK
- **5** EVALUATION RESULTS
- **6** CONCLUSIONS




- Programmer intuition: instructions execute in the order they appear in the program

| THREAD 1                                 | THREAD 2                             |
|------------------------------------------|--------------------------------------|
| \$r0 = X; // load<br>\$r1 = Y; // load | Y = 1; // store<br>X = 1; // store |

- What happens if the core/memory changes this order?



| Alberto Ros | Multicore Day, Kista | November 29th, 2017 | 4 / 37 |
|-------------|----------------------|---------------------|--------|
|             |                      |                     |        |


## POSSIBLE RESULTS ASSUMING PROGRAM ORDER

INITIALLY X=0, Y=0

| THREAD 1       | THREAD 2   |
|----------------|------------|
| lx: \$r0 = X; | sy: Y = 1; |
| ly: \$r1 = Y; | sx: X = 1; |

#### SIX POSSIBLE INTERLEAVINGS AND VALUES FOR (\$R0, \$R1)

| lx<br>ly<br>sy<br>sx | lx<br>sy<br>ly<br>sx | lx<br>sy<br>sx<br>ly | sy<br>lx<br>ly<br>sx | sy<br>lx<br>sx<br>ly | sy<br>sx<br>lx<br>ly |
|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
| (0,0)                | (0,1)                | (0,1)                | (0,1)                | (0,1)                | (1,1)                |

- (1,0) is not possible if operations execute in program order
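The six interleavings can be reproduced mechanically. The following is a minimal sketch, assuming a single shared memory where each operation takes effect atomically; the labels `lx`, `ly`, `sy`, `sx` come from the slide, while the helper names `interleavings` and `run` are hypothetical.

```python
from itertools import combinations

# Thread programs from the slide: (label, kind, address, register-or-value).
THREAD_1 = [("lx", "load", "X", "r0"), ("ly", "load", "Y", "r1")]
THREAD_2 = [("sy", "store", "Y", 1), ("sx", "store", "X", 1)]


def interleavings(t1, t2):
    """Yield every merge of t1 and t2 that keeps each thread's own order."""
    total = len(t1) + len(t2)
    for slots in combinations(range(total), len(t1)):
        it1, it2 = iter(t1), iter(t2)
        yield [next(it1) if i in slots else next(it2) for i in range(total)]


def run(schedule):
    """Execute one interleaving against a shared memory; return ($r0, $r1)."""
    memory = {"X": 0, "Y": 0}
    regs = {}
    for _label, kind, addr, arg in schedule:
        if kind == "load":
            regs[arg] = memory[addr]
        else:
            memory[addr] = arg
    return regs["r0"], regs["r1"]


# Map each interleaving (as a string of labels) to its outcome.
results = {}
for schedule in interleavings(THREAD_1, THREAD_2):
    order = " ".join(op[0] for op in schedule)
    results[order] = run(schedule)

for order, outcome in results.items():
    print(order, outcome)
```

Under these assumptions the enumeration yields exactly six schedules, and none of them produces (1,0).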


## RELAXING PROGRAM ORDER (LOADS)

INITIALLY X=0, Y=0

| THREAD 1       | THREAD 2   |
|----------------|------------|
| lx: \$r0 = X; | sy: Y = 1; |
| ly: \$r1 = Y; | sx: X = 1; |

#### SIX POSSIBLE INTERLEAVINGS AND VALUES FOR (\$R0, \$R1)

| ly<br>lx<br>sy<br>sx | ly<br>sy<br>lx<br>sx | ly<br>sy<br>sx<br>lx | sy<br>ly<br>lx<br>sx | sy<br>ly<br>sx<br>lx | sy<br>sx<br>ly<br>lx |
|----------------------|----------------------|----------------------|----------------------|----------------------|----------------------|
| (0,0)                | (0,0)                | (1,0)                | (0,1)                | (1,1)                | (1,1)                |

- (1,0) is possible by relaxing the order in which loads execute
  - The same result can be achieved by relaxing the stores
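The claim about relaxing the stores is not shown in the table, so here is a small sketch that checks it. It assumes that "relaxed" means no ordering constraint at all between the two operations of a thread (the slide's table instead fixes `ly` before `lx`), and the function name `outcomes` is hypothetical.

```python
from itertools import permutations

# Operations from the slide: label -> (kind, address, register-or-value).
OPS = {
    "lx": ("load", "X", "r0"),
    "ly": ("load", "Y", "r1"),
    "sy": ("store", "Y", 1),
    "sx": ("store", "X", 1),
}


def outcomes(ordered_pairs):
    """Return every ($r0, $r1) reachable over all global orders that
    respect each (earlier, later) pair in ordered_pairs."""
    seen = set()
    for order in permutations(OPS):
        pos = {label: i for i, label in enumerate(order)}
        if any(pos[a] > pos[b] for a, b in ordered_pairs):
            continue  # this global order violates a required ordering
        memory, regs = {"X": 0, "Y": 0}, {}
        for label in order:
            kind, addr, arg = OPS[label]
            if kind == "load":
                regs[arg] = memory[addr]
            else:
                memory[addr] = arg
        seen.add((regs["r0"], regs["r1"]))
    return seen


program_order = outcomes([("lx", "ly"), ("sy", "sx")])
loads_relaxed = outcomes([("sy", "sx")])   # only store order enforced
stores_relaxed = outcomes([("lx", "ly")])  # only load order enforced

print("program order :", sorted(program_order))
print("loads relaxed :", sorted(loads_relaxed))
print("stores relaxed:", sorted(stores_relaxed))
```

In this model, (1,0) appears as soon as either pair is left unordered, for example with the global order `sx lx ly sy` when only the stores are relaxed.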

## THE MEMORY CONSISTENCY MODEL

- The memory consistency model defines the allowed behavior of programs
  - In particular, the behavior of their memory operations: loads and stores




## CORRECTNESS/PERFORMANCE ISSUE

- Correctness
  - The programmer's intuition is program order
- Performance
  - Waiting for each memory operation to finish before starting the next one is very inefficient
  - Processors execute multiple memory operations simultaneously
    - Memory-level parallelism
  - Operations can be reordered by the memory hierarchy, or even be issued out of order
  - This is correct for single-core processors, but not for multicores
- Solution: store buffer and speculation

## THE STORE BUFFER

- A store needs write permission for its cache block before it can perform
- Write permission request
  - Handled by the cache coherence protocol
  - Unique copy: may require invalidating other copies
  - A long-latency operation
- Solution implemented in x86 processors (Intel, AMD)
  - ⇒ The store buffer

## THE STORE BUFFER BREAKS STORE→LOAD

#### Program

$st_1$ followed by $ld_2$

- The store buffer breaks the store→load rule: while a store waits in the store buffer for write permission, a later load in program order can already perform


## TOTAL STORE ORDER (TSO)

- x86 processors (Intel, AMD) provide a Total Store Order (TSO) memory consistency model

- TSO does not enforce store→load
- Performance over programmer intuition
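To make the relaxed store→load order concrete, here is a toy model of per-core FIFO store buffers. It is a sketch under stated assumptions and not a description of any real x86 implementation: the two-core program is the classic store-buffering litmus test (it does not appear in the slides), and the name `reachable_outcomes` is hypothetical.

```python
# Each instruction is (kind, address, value-or-register).
CORE_0 = [("store", "X", 1), ("load", "Y", "r0")]
CORE_1 = [("store", "Y", 1), ("load", "X", "r1")]


def reachable_outcomes(programs, use_store_buffer):
    """Exhaustively explore all schedules; return the set of final (r0, r1)."""
    outcomes = set()

    def step(pcs, buffers, memory, regs):
        progressed = False
        for core, program in enumerate(programs):
            # Option 1: drain this core's oldest buffered store to memory.
            if buffers[core]:
                progressed = True
                (addr, value), rest = buffers[core][0], buffers[core][1:]
                drained = buffers[:core] + (rest,) + buffers[core + 1:]
                step(pcs, drained, {**memory, addr: value}, regs)
            # Option 2: execute this core's next instruction in program order.
            if pcs[core] < len(program):
                progressed = True
                kind, addr, arg = program[pcs[core]]
                new_pcs = pcs[:core] + (pcs[core] + 1,) + pcs[core + 1:]
                if kind == "store" and use_store_buffer:
                    grown = buffers[core] + ((addr, arg),)
                    step(new_pcs, buffers[:core] + (grown,) + buffers[core + 1:],
                         memory, regs)
                elif kind == "store":
                    step(new_pcs, buffers, {**memory, addr: arg}, regs)
                else:
                    # A load forwards from its own buffer if possible, else memory.
                    pending = [v for a, v in buffers[core] if a == addr]
                    value = pending[-1] if pending else memory[addr]
                    step(new_pcs, buffers, memory, {**regs, arg: value})
        if not progressed:  # all programs finished and all buffers empty
            outcomes.add((regs["r0"], regs["r1"]))

    n = len(programs)
    step((0,) * n, ((),) * n, {"X": 0, "Y": 0}, {})
    return outcomes


with_buffer = reachable_outcomes([CORE_0, CORE_1], use_store_buffer=True)
without_buffer = reachable_outcomes([CORE_0, CORE_1], use_store_buffer=False)
print("with store buffer   :", sorted(with_buffer))
print("without store buffer:", sorted(without_buffer))
```

In this model, (0,0) is reachable only with the store buffer: both loads read memory while both stores are still buffered, which is the store→load reordering that TSO permits.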


## THE STORE BUFFER: CONSEQUENCES

- store→load
  - ⇒ Relaxed
- load→store
  - ⇒ No need to execute stores before earlier loads, since stores are off the critical path
- store→store<sup>1</sup>
  - ⇒ Less critical than without a store buffer, unless the store buffer fills
- load→load
  - ⇒ It is now the **bottleneck**

<sup>1</sup> A. Ros and S. Kaxiras, "Racer: TSO Consistency via Race Detection". MICRO, 2016.


## LOAD→LOAD REORDERING

- In multicore processors, reordering loads can affect the expected result
  - But always?

| POSSIBLE EXECUTION |
|--------------------|
| \$r0 = Y;<br>\$r1 = X;<br>Y = 1;<br>X = 1;<br>/\* (0, 0) allowed \*/ |


| POSSIBLE EXECUTION |
|--------------------|
| \$r0 = Y;<br>\$r1 = X;<br>/\* (1, 0) not allowed \*/ |

- Not always: the result is unaffected if the other cores do not see the reordering

## LOAD→LOAD SPECULATION

- Solution: allow speculative load→load reordering
- Some definitions<sup>2</sup>: performed, ordered, source of speculation (SoS)


## SQUASH AND RE-EXECUTE UPON INVALIDATION

- Current multicores avoid incorrect results
  - With the help of the cache coherence protocol



| Alberto Ros | Multicore Day, Kista | November 29th, 2017 |
|-------------|----------------------|---------------------|


| Consistency | Store Buffer | Speculation<br>○○○●○○ | WritersBlock | Results | Conclusions<br>O |
|-------------|--------------|-----------------------|--------------|---------|------------------|
| SQUASH AND RE-EXECUTE UPON INVALIDATION | | | | | |

- Current multicores avoid incorrect results
  - With the help of the cache coherence protocol
  - By squashing and re-executing loads upon remote writes (invalidations)
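The squash-and-re-execute policy can be sketched with a toy load queue. This is a minimal illustration, not the hardware mechanism: `Load`, `LoadQueue`, `is_m_spec` and `on_invalidation` are hypothetical names, and a load is assumed to be M-speculative when it has performed while an older load is still pending.

```python
from dataclasses import dataclass, field


@dataclass
class Load:
    addr: str
    performed: bool = False  # True once the load has obtained its value


@dataclass
class LoadQueue:
    loads: list = field(default_factory=list)  # program order, oldest first

    def is_m_spec(self, i):
        """A performed load is M-speculative while any older load is unperformed."""
        return self.loads[i].performed and any(
            not older.performed for older in self.loads[:i]
        )

    def on_invalidation(self, addr):
        """Squash the oldest M-spec load to `addr` and every younger load."""
        for i, load in enumerate(self.loads):
            if load.addr == addr and self.is_m_spec(i):
                squashed = list(range(i, len(self.loads)))
                for j in squashed:
                    self.loads[j].performed = False  # must re-execute
                return squashed
        return []


lq = LoadQueue([Load("X"), Load("Y")])
lq.loads[1].performed = True    # ld Y performs before the older ld X
print(lq.on_invalidation("Y"))  # remote write to Y -> [1]
```

If both loads had already performed, the invalidation would find no M-spec load and nothing would be squashed.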



| Alberto Ros | Multicore Day, Kista | November 29th, 2017 | 18 / 37 |
|-------------|----------------------|---------------------|---------|

| Consistency | Store Buffer | Speculation<br>○○○○●○ | WritersBlock | Results | Conclusions<br>O |
|-------------|--------------|-----------------------|--------------|---------|------------------|
| SQUASH AND RE-EXECUTE UPON EVICTIONS | | | | | |

- What happens when a cache block loaded by an M-spec load is evicted?
  - If the directory stops tracking the block, the M-spec load will not receive an invalidation
- Solution: Squash and re-execute upon evictions
- This impacts the performance of sequential applications!
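A minimal sketch of why eviction squashes also hurt single-threaded code: the squash below fires without any remote write. The function name and the `(line, performed)` tuple layout are assumptions made for illustration.

```python
def loads_to_squash_on_eviction(load_queue, evicted_line):
    """Return the indices squashed when `evicted_line` leaves the cache.

    load_queue: list of (line, performed) tuples in program order, oldest first.
    A performed load is M-spec while an older load is still unperformed.
    """
    for i, (line, performed) in enumerate(load_queue):
        older_pending = any(not p for _, p in load_queue[:i])
        if line == evicted_line and performed and older_pending:
            # Conservative: the invalidation might never arrive, so squash.
            return list(range(i, len(load_queue)))
    return []


# One core, no other writers: ld X is pending, ld Y and ld Z already performed.
lq = [("X", False), ("Y", True), ("Z", True)]
print(loads_to_squash_on_eviction(lq, "Y"))  # [1, 2]
print(loads_to_squash_on_eviction(lq, "X"))  # []
```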

| Consistency | Store Buffer | Speculation<br>○○○○○● | WritersBlock | Results | Conclusions<br>O |
|-------------|--------------|-----------------------|--------------|---------|------------------|
| PROBLEMS OF SPECULATION | | | | | |

- Memory-related speculation is the current solution for obtaining both memory-level parallelism (MLP) and load→load ordering
- Why is it good?
  - Squashing is not frequent!
- Why is it bad?
  - Speculative loads hold critical resources (LQ, RoB)
  - The processor must continuously maintain the rollback path

#### QUESTION

Can we execute loads out of order, non-speculatively, while still guaranteeing $load \rightarrow load$?



| Consistency | Store Buffer | Speculation | WritersBlock<br>●○○○○○○ | Results | Conclusions<br>O |
|-------------|--------------|-------------|-------------------------|---------|------------------|
| WRITERSBLOCK IN A NUTSHELL<sup>2</sup> | | | | | |

- What?
  - Multiple loads executing simultaneously
  - Load $\rightarrow$ load
  - Without memory-related speculation
- How?
  - Blocking write requests
  - With the help of the cache coherence protocol

<sup>2</sup> A. Ros, T. E. Carlson, M. Alipour, and S. Kaxiras, "Non-Speculative Load-Load Reordering in TSO". ISCA, 2017.


| Consistency | Store Buffer | Speculation | WritersBlock<br>○●○○○○○ | Results | Conclusions<br>O |
|-------------|--------------|-------------|-------------------------|---------|------------------|
| How?        |              |             |                         |         |                  |

- With the help of the cache coherence protocol
  - Blocking and delaying the remote write (WritersBlock)
  - Until when? Until the load stops being M-spec
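The blocking idea can be sketched as a core that withholds its invalidation acknowledgement instead of squashing. This is a simplified model under stated assumptions: `Core`, `on_invalidation`, `perform` and the dict layout are hypothetical, and the remote write is assumed to complete only after the withheld ack is released.

```python
class Core:
    def __init__(self, loads):
        # loads: list of {"line": str, "performed": bool}, oldest first
        self.loads = loads
        self.deferred_acks = []  # lines whose invalidation ack is withheld

    def _m_spec_lines(self):
        """Lines read by loads that performed while an older load is pending."""
        lines, older_pending = set(), False
        for ld in self.loads:
            if ld["performed"] and older_pending:
                lines.add(ld["line"])
            if not ld["performed"]:
                older_pending = True
        return lines

    def on_invalidation(self, line):
        """Return True if acked at once, False if the remote write is blocked."""
        if line in self._m_spec_lines():
            self.deferred_acks.append(line)
            return False
        return True

    def perform(self, index):
        """Perform load `index` and return the acks released as a result."""
        self.loads[index]["performed"] = True
        still_spec = self._m_spec_lines()
        released = [l for l in self.deferred_acks if l not in still_spec]
        self.deferred_acks = [l for l in self.deferred_acks if l in still_spec]
        return released


core = Core([{"line": "X", "performed": False}, {"line": "Y", "performed": True}])
print(core.on_invalidation("Y"))  # False: the write to Y is blocked, no squash
print(core.perform(0))            # ['Y']: ld X performed, the ack is released
```

The load to Y keeps the value it already read; because the write is delayed, that value is still the one an in-order execution would have observed.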




| Consistency | Store Buffer | Speculation | WritersBlock<br>○○●○○○○ | Results | Conclusions<br>○ |
|-------------|--------------|-------------|-------------------------|---------|------------------|
| EVICTIONS | | | | | |

- What happens upon an eviction? Do we squash loads?
  - No, we only need to guarantee that the invalidation will arrive upon a remote write
- Solution:
  - Clean blocks are evicted silently<sup>3</sup>
  - Dirty blocks write back their data, but the directory keeps tracking them
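Why silent evictions are enough can be sketched with a toy sharer list. The `Directory` class below is an assumption for illustration only; it models just the tracking information, not data or protocol states.

```python
class Directory:
    def __init__(self):
        self.sharers = {}  # line -> set of cores the directory believes hold it

    def read(self, core, line):
        self.sharers.setdefault(line, set()).add(core)

    def evict(self, core, line, silent):
        # A silent eviction sends no message, so the core stays tracked.
        if not silent:
            self.sharers.get(line, set()).discard(core)

    def write(self, writer, line):
        """Return the cores that receive an invalidation for this write."""
        targets = self.sharers.get(line, set()) - {writer}
        self.sharers[line] = {writer}
        return sorted(targets)


silent = Directory()
silent.read(0, "Y")
silent.evict(0, "Y", silent=True)
print(silent.write(1, "Y"))  # [0]: core 0 still gets the invalidation

notified = Directory()
notified.read(0, "Y")
notified.evict(0, "Y", silent=False)
print(notified.write(1, "Y"))  # []: core 0 would miss the remote write
```

In the first case the M-spec load on core 0 still observes the remote write, so no squash is needed at eviction time.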



<sup>3</sup> R. Fernandez-Pascual, A. Ros, and M. E. Acacio, "To Be Silent or Not: On the Impact of Evictions of Clean Data in Cache-Coherent Multicores", Journal of Supercomputing, 2017.


| Consistency | Store Buffer | Speculation | WritersBlock<br>○○○●○○○ | Results | Conclusions<br>o |
|-------------|--------------|-------------|-------------------------|---------|------------------|
| DEADLOCK | | | | | |

- Blocking writes can cause deadlocks
  - If x and y are two words within the same cache line
  - Solution: Blocked writes allow reads to be resolved
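One plausible reading of this scenario, since the slide's figure did not survive extraction, is a cycle in a wait-for graph: the write waits for the M-spec load to y, that load waits for the older load to x, and the load to x waits behind the blocked write to the same line. The node names and edges below are assumptions; the cycle check itself is a standard depth-first search.

```python
def has_cycle(waits_for):
    """Detect a cycle in a wait-for graph {node: [nodes it waits for]}."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in waits_for.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY or (state == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in waits_for)


# Reads stall behind the blocked write: every party waits on another one.
blocking = {"st y": ["ld y"], "ld y": ["ld x"], "ld x": ["st y"]}
# Blocked writes let reads be resolved: ld x no longer waits for the write.
resolved = {"st y": ["ld y"], "ld y": ["ld x"], "ld x": []}

print(has_cycle(blocking), has_cycle(resolved))  # True False
```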




| Consistency | Store Buffer | Speculation | WritersBlock<br>○○○○●○○ | Results | Conclusions<br>o |
|-------------|--------------|-------------|-------------------------|---------|------------------|
| LIVELOCK | | | | | |

- Resolving reads while blocking writes can cause livelock
  - Resolving a read once the data has been invalidated will cause a second invalidation
  - Blocked<sub>i</sub>, Read<sub>j</sub>, Unblock<sub>i</sub>, Invalidate<sub>j</sub>, Blocked<sub>j</sub>, ...
- Solution
  - Reads resolved through WritersBlock are non-cacheable
    - $\Rightarrow$  No invalidations needed
  - and cannot resolve M-spec loads
    - $\Rightarrow$  No invalidation will be received
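The effect of making these reads non-cacheable can be sketched as follows. `TrackedLine` and its methods are hypothetical; the only point is that a non-cacheable read leaves no copy behind, so the blocked write needs no further invalidation once it is unblocked.

```python
class TrackedLine:
    def __init__(self, value=0):
        self.value = value
        self.sharers = set()  # cores holding a cached copy of the line

    def read(self, core, cacheable=True):
        if cacheable:
            self.sharers.add(core)  # a cached copy must be invalidated later
        return self.value           # a non-cacheable read uses the value once

    def pending_invalidations(self, writer):
        """Cores the writer must still invalidate before it can complete."""
        return sorted(self.sharers - {writer})


# Core 1's write is blocked; core 2 reads the line in the meantime.
cached = TrackedLine()
cached.read(2, cacheable=True)
print(cached.pending_invalidations(1))    # [2]: a second invalidation is needed

uncached = TrackedLine()
uncached.read(2, cacheable=False)
print(uncached.pending_invalidations(1))  # []: the write can proceed
```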

| Consistency        | Store Buffer | Speculation | WritersBlock<br>○○○○○●○ | Results | Conclusions<br>O |
|--------------------|--------------|-------------|-------------------------|---------|------------------|
| DEADLOCK AVOIDANCE |              |             |                         |         |                  |

- WRITERSBLOCK causes writes to be blocked
  - Until a load stops being M-speculative
- Deadlock-free condition:
  - $\Rightarrow$ Loads are not stopped by pending write misses
- Other blocking causes and solutions:
  - MSHR address occupied by write miss
    - $\Rightarrow$ Duplicate read-write MSHR allocation
  - Full directory/LLC
    - $\Rightarrow$ Non-cacheable loads
  - Atomic Read-Modify-Write
    - $\Rightarrow$ Non-speculative (ordered)

| Consistency | Store Buffer | Speculation | WritersBlock<br>○○○○○● | Results | Conclusions<br>O |
|-------------|--------------|-------------|------------------------|---------|------------------|
| CASE OF USE: OUT-OF-ORDER COMMIT | | | | | |

- Out-of-order commit<sup>4</sup> allows the processor to retire instructions from the reorder buffer (RoB) even if they are not at the head
- However, it cannot retire instructions that may still be squashed
- WRITERSBLOCK allows the retirement of out-of-order loads
- Better RoB/LQ usage
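A rough sketch of the retirement condition, assuming a squash discards the offending load and everything younger, and ignoring other commit conditions such as unresolved branches and exceptions. The function name and RoB entry layout are hypothetical.

```python
def retirable(rob, loads_can_be_squashed):
    """Return indices of RoB entries that out-of-order commit may retire.

    rob: list of {"op": "load" | "alu", "done": bool}, oldest first.
    loads_can_be_squashed: True models squash-based speculation, where a
    completed load remains squashable while an older load is incomplete.
    """
    result, older_load_pending, squashable = [], False, False
    for i, entry in enumerate(rob):
        is_load = entry["op"] == "load"
        if is_load and entry["done"] and older_load_pending and loads_can_be_squashed:
            squashable = True  # this load and everything younger may be squashed
        if entry["done"] and not squashable:
            result.append(i)
        if is_load and not entry["done"]:
            older_load_pending = True
    return result


rob = [
    {"op": "alu", "done": True},
    {"op": "load", "done": False},  # long-latency miss
    {"op": "load", "done": True},   # executed out of order
    {"op": "alu", "done": True},
]
print(retirable(rob, loads_can_be_squashed=True))   # [0]
print(retirable(rob, loads_can_be_squashed=False))  # [0, 2, 3]
```

With non-squashable loads, entries 2 and 3 free their RoB and LQ slots while the miss at entry 1 is still outstanding.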



<sup>4</sup> G. B. Bell and M. H. Lipasti, "Deconstructing Commit", ISPASS, 2004.



| Consistency | Store Buffer | Speculation | WritersBlock | Results<br>●○○○○ | Conclusions<br>O |
|-------------|--------------|-------------|--------------|------------------|------------------|
| SIMULATION ENVIRONMENT | | | | | |

- Simulator: GEMS + OoO processor (TSO)
- 16-core multicore
- Silvermont (32-entry RoB), Nehalem (128-entry RoB), and Haswell (192-entry RoB)
- Benchmarks: Splash-3 <sup>5</sup> and Parsec-3.0
- Protocols
  - DIRECTORY: Directory-based MESI protocol
  - WRITERSBLOCK: Extensions to DIRECTORY
- Commit technique
  - INORDERCOMMIT
  - OOOCOMMIT

<sup>5</sup> C. Sakalis, C. Leonardsson, S. Kaxiras, and A. Ros, "Splash-3: A Properly Synchronized Benchmark Suite for Contemporary Research", ISPASS, 2016.


| Consistency | Store Buffer | Speculation | WritersBlock | Results<br>○●○○○ | Conclusions<br>O |
|-------------|--------------|-------------|--------------|------------------|------------------|
| WRITERSBLOCK: BLOCKED WRITES | | | | | |

- Results for INORDERCOMMIT
- Normalized to DIRECTORY
- The larger the RoB, the more loads executed out-of-order, and the more blocked writes
- Fewer than 5 blocked writes per 10,000 stores, on average
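As a quick sanity check on the scale of this bound (plain arithmetic on the figure quoted above, not an additional measurement):

$$\frac{5}{10{,}000} = 5 \times 10^{-4} = 0.05\%$$

That is, fewer than one blocked write per 2,000 stores.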



| Consistency | Store Buffer | Speculation | WritersBlock | Results<br>○○●○○ | Conclusions<br>O |
|-------------|--------------|-------------|--------------|------------------|------------------|
| WRITERSBLOCK: NON-CACHEABLE DATA | | | | | |

- Results for INORDERCOMMIT
- Normalized to DIRECTORY
- The larger the RoB, the more writes blocked, and the more non-cacheable data
- $\approx$ 1 non-cacheable read per 100,000 loads, on average
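Expressed as a fraction of all loads (plain arithmetic on the approximate figure quoted above):

$$\frac{1}{100{,}000} = 10^{-5} = 0.001\%$$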



| Consistency | Store Buffer | Speculation | WritersBlock | Results<br>○○○●○ | Conclusions<br>O |
|-------------|--------------|-------------|--------------|------------------|------------------|
| OUT-OF-ORDER COMMIT: PROCESSOR STALLS | | | | | |

- Normalized to DIRECTORY + INORDERCOMMIT
- INORDERCOMMIT
  - WRITERSBLOCK does not increase SQ stalls
- OOOCOMMIT
  - WRITERSBLOCK reduces RoB and LQ stalls on average with respect to DIRECTORY



| Consistency | Store Buffer | Speculation | WritersBlock | Results<br>○○○○● | Conclusions<br>O |
|-------------|--------------|-------------|--------------|------------------|------------------|
|             |              |             | ECUTION 7    |                  |                  |

- Normalized to DIRECTORY + INORDERCOMMIT
- INORDERCOMMIT
  - On average, WRITERSBLOCK does not harm performance with respect to DIRECTORY
- OOOCOMMIT
  - WRITERSBLOCK improves performance by 11% on average with respect to DIRECTORY







| Consistency | Store Buffer | Speculation | WritersBlock | Results | Conclusions<br>• |
|-------------|--------------|-------------|--------------|---------|------------------|
| CONCLUSIONS |              |             |              |         |                  |

- With the help of the cache coherence protocol, loads can execute out of order without speculation and without harming performance, while producing the same results as if they had executed in order $(LOAD \rightarrow LOAD)$
- Non-speculative load reordering improves the performance of out-of-order commit by 11% on average
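To make "as if the loads were executed in order" concrete, the following is a minimal sketch that enumerates every interleaving of a two-thread litmus test and collects the observable register values. The reader thread follows the `$r0 = X; $r1 = Y;` example from the start of the talk; the writer thread (`Y = 1; X = 1`) and all function names are assumptions introduced here for illustration. The sketch models only observable outcomes, not the WritersBlock mechanism or the coherence protocol.

```python
# Hypothetical litmus test (illustrative names, not taken from the slides):
#   reader: r0 = X; r1 = Y        writer: Y = 1; X = 1
READER = [("load", "X", "r0"), ("load", "Y", "r1")]
WRITER = [("store", "Y", 1), ("store", "X", 1)]


def interleavings(a, b):
    """Yield every merge of a and b that preserves the order within each."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(reader, writer):
    """Return the set of (r0, r1) values over all interleavings."""
    results = set()
    for schedule in interleavings(reader, writer):
        memory = {"X": 0, "Y": 0}  # both locations start at 0
        regs = {}
        for op, addr, arg in schedule:
            if op == "load":
                regs[arg] = memory[addr]   # arg is the destination register
            else:
                memory[addr] = arg         # arg is the value to store
        results.add((regs["r0"], regs["r1"]))
    return results


in_order = outcomes(READER, WRITER)        # loads performed in program order
reordered = outcomes(READER[::-1], WRITER)  # loads performed in swapped order

print(sorted(in_order))
print(sorted(reordered))
```

In this model, the outcome `(r0, r1) = (1, 0)` (new `X` but old `Y`) appears only when the two loads are performed in swapped order. A scheme that reorders loads while preserving LOAD → LOAD must therefore ensure this outcome is never observed by the program.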

