Monday, November 4, 2024

Thread Safe Computers

 

The invention redefines thread safe. 

For comparison, two current definitions are shown below.

Due to blog incompatibility, the current version is located here.

 



 

Obsolete

 Thread Safe Computers

Overview

Quote from Discover Magazine Jan 18, 2024


Ehrlich discovered that bacteria had different properties.

There were different types that he could identify and treat differently.

This led to antibiotics.


The computer also has different types of data. 

Optimally, different types can be handled differently. 

Program logic determines how data is used and software optimizes processing.

But hardware does not know how data is used, resulting in inefficiencies.

As a consequence, the hardware must use an algorithm different from the software.

Software is thread safe and minimizes the transfer of shared information.

Hardware is not and cannot be.


One inefficiency occurs in how computers synchronize their caches. 

The synchronization is known as cache coherence. (Fig. 1 Current) 

Another inefficiency is related to the memory bus which is controlled by the Bus Arbiter. 

The bus arbiter is serial since it processes one request at a time. 

The current algorithm shown in Fig. 1 is simple. Each time a processor stores a change in its cache, the coherence unit must check all caches prior to changing main memory.


The cache coherence algorithm used by the hardware preserves integrity but is complicated.

Physical connections are required to synchronize hardware. 

This complicates adding processors and memory buses.

The proposed thread safe computer is simpler. (Fig. 2 Proposed)

 

In contrast, the current software thread safe algorithm reduces the amount of data requiring synchronization to just one memory address (Fig. 3 Thread Safe). Memory is synchronized by serializing control through one atomic swap instruction or mutex. The mutex is logical; it is virtual.
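This single-address serialization can be sketched in Java (the language the later posts invoke). An AtomicBoolean stands in for the one swap address, and compareAndSet plays the role of the atomic swap; the class name, thread counts, and field names are illustrative.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class OneAddressMutex {
    // The single swap address: the only memory that requires synchronization.
    private static final AtomicBoolean lock = new AtomicBoolean(false);
    private static int shared = 0; // protected by the swap address above

    static void increment() {
        while (!lock.compareAndSet(false, true)) { } // atomic swap: acquire
        shared++;                                     // finalize under the mutex
        lock.set(false);                              // release
    }

    static int run(int threads, int perThread) {
        shared = 0;
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> { for (int j = 0; j < perThread; j++) increment(); });
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10000)); // 40000: every update serialized through one address
    }
}
```

However many threads contend, they synchronize through exactly one address; everything else is private computation.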

 

The thread safe computer in Fig. 2 implements the current software thread safe memory logic.

The Bus Arbiter provides serialization and the mutex is executed in the memory controller. 

The computer is thread safe. 

This compares to the current mutex, which is executed in each cache, resulting in the need for coherence.

 

This design requires an allocation instruction that enables hardware to differentiate private data from shared data. This supplies the information whose absence prevents the hardware from being thread safe. The hardware needs to know whether data requires synchronization.
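A minimal sketch of what such an allocation instruction would tell the hardware, modeled in Java. The allocator, tag names, and routing strings are hypothetical; the point is only that a per-allocation tag lets the memory system send private data to the cache and shared data straight to main memory.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the proposed allocation instruction's contract:
// each allocation is tagged PRIVATE or SHARED, and the (simulated) memory
// controller consults the tag to decide where an access goes.
public class TaggedAllocator {
    enum Kind { PRIVATE, SHARED }

    private static final Map<Integer, Kind> tags = new HashMap<>();
    private static int next = 0;

    static int alloc(Kind kind) {   // the proposed allocation instruction
        tags.put(next, kind);
        return next++;
    }

    static String route(int addr) { // what the hardware now knows
        return tags.get(addr) == Kind.PRIVATE
            ? "cache"               // private: cacheable, no coherence needed
            : "main-memory";        // shared: read-through and write-through
    }

    public static void main(String[] args) {
        int p = alloc(Kind.PRIVATE);
        int s = alloc(Kind.SHARED);
        System.out.println(route(p) + " " + route(s)); // cache main-memory
    }
}
```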
 

Benefits:

This new architecture extends the software's existing algorithm to the hardware, making it thread safe. 

This eliminates many hardware inefficiencies which duplicate function that the software already provides.

The entire coherence unit is redundant.
In particular, this design facilitates adding additional processors and memory buses. 

It uses the existing software algorithm, so it can be applied to any multitasking computer system. 

 

Current multiprocessors were designed before the mutex. They should be redesigned for the mutex. 

 

 

 Frank Yang

 FrankYang43338 at acm dot org

 

 

Monday, August 5, 2024

Mutex

1 - Software serializes all shared updates through a mutex.
2 - The serialization provides each process with update integrity.
3 - Current multiprocessor hardware does not serialize the mutex. The mutex is executed in each cache.
4 - The non-serialized implementation of the mutex requires binomial coherence to synchronize caches.
5 - Serializing the mutex synchronizes the processors and enables a parallel multi-core architecture.
 
The software functions because it expects the mutex to be serialized. The result is Thread Safe Hardware.
 
Those who study Java concurrency or thread safety will understand 1 & 2.
Those who study computer architecture will understand 3 & 4.

This is an algorithm. With a few other changes, it extends the benefit of the mutex to the hardware.
It modifies memory management but not software; therefore it works for any multitasking system.
This computer design does not need to be modeled to be understood.

It requires two things:
1 - The mutex must be truly atomic; it is currently only pseudo-atomic.
2 - The hardware must be able to identify shared data.
 
The result is: 
A cache containing private data does not require coherence.
Shared data that is read-through and write-through does not require coherence.
The mutex is execute-through and does not require coherence.
 
 



Tuesday, July 30, 2024

Snoopy Protocol

 

Current snoopy protocol (performed on every instruction for ALL data, circa 1983):
on read - cache hit - read from cache.
on read - cache miss - read from memory and store in cache.
on write - write-through and invalidate in EVERY cache.

Overhead increases as a binomial with the number of caches.
Invalidation occurs on writes for ALL data (because shared data is not identified).


Thread Safe Protocol:
Private data (mutex not required) - private cache - no protocol.
Shared data - read-through and write-through - no protocol.
Swap data - execute-through - serialized in memory.
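The overhead difference between the two protocols can be illustrated with a small model. Under the snoopy protocol each write invalidates the line in every other cache; under the thread safe protocol, shared writes go straight through to memory and invalidate nothing. The counts below are illustrative, not measurements.

```java
// Illustrative model (not a measurement) of invalidation traffic
// generated by shared-data writes under each protocol.
public class WriteOverhead {
    // Snoopy: every write invalidates the line in every OTHER cache.
    static int snoopyInvalidations(int caches, int writes) {
        return writes * (caches - 1);
    }

    // Thread safe protocol: shared data is write-through and never cached,
    // so there is nothing to invalidate.
    static int threadSafeInvalidations(int caches, int writes) {
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(snoopyInvalidations(4, 1000));     // 3000 invalidations
        System.out.println(threadSafeInvalidations(4, 1000)); // 0
    }
}
```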

Saturday, July 27, 2024

Thread Safe Hardware

Two algorithms that synergize:

First algorithm (software):
The thread safety algorithm enables multiple software processes to share data.
The same algorithm could enable multiple processors with caches to share memory.
There are three types of data needed by thread safety - private data, shared data, and swap data. 
1 - Thread safety allocates private data so it will not be altered by other processes. This is data that can be changed without a mutex.
2 - Thread safety finalizes shared data changes while the data is under mutex control.
3 - Thread safety serializes updates to a swap address with an atomic swap that performs a mutex, data swap (counter), or pointer swap.
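The three uses of the atomic swap in point 3 can be sketched with Java's atomic classes, which are software stand-ins for the hardware swap; the names and values are illustrative.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

public class SwapUses {
    static String demo() {
        // 1 - mutex: swap false -> true to acquire the lock
        AtomicBoolean mutex = new AtomicBoolean(false);
        boolean acquired = mutex.compareAndSet(false, true);

        // 2 - data swap (counter): atomic read-modify-write of a number
        AtomicInteger counter = new AtomicInteger(0);
        int after = counter.incrementAndGet();

        // 3 - pointer swap: publish a new structure in one atomic step
        AtomicReference<String> head = new AtomicReference<>("old");
        String previous = head.getAndSet("new");

        return acquired + " " + after + " " + previous + " " + head.get();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true 1 old new
    }
}
```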
 
First Approach:
1 - Thread safety requires the ability to recognize private data. This could be done at allocation. The data can be placed in a virtual cache for identification. This data is not shared and does not require coherence.
2 - The current software logic finalizes shared data prior to mutex release. However, finalizing must occur in main memory and not in a cache; finalizing in a cache requires coherence.
3 - The current atomic swap is pseudo-atomic because it occurs in a cache. The swap must serialize in main memory. This creates a true atomic swap.
This approach requires only one software change: an allocation instruction to identify private data. This change is for performance only. The approach requires changes to hardware instructions, which must access cache memory or main memory depending upon whether data is private or shared. Results must be stored at the corresponding output memory address. The atomic swap must be execute-through; it must be atomic and serialized by single-threading through the swap address.
 
Second algorithm (hardware):
Cache coherence provides the consistency of data; the software delivers logical consistency.
Cache coherence merely delivers the values that the software expects: hardware consistency.
Cache coherence provides no software logical consistency.
When the hardware is thread safe, it provides hardware consistency, and cache coherence disappears. 
 
Current Snoopy Protocol (performed on every instruction for ALL data, circa 1983):
on read - cache hit - read from cache.
on read - cache miss - read from memory and store in cache.
on write - write-through and invalidate in EVERY cache.

Overhead increases as a binomial with the number of caches.
Invalidation occurs on writes for ALL data (because shared data is not identified).

Proposed Thread Safe Hardware Protocol:
Private data (updated without mutex) - private cache - no protocol.
Shared data (protected by mutex) - read-through and write-through - no protocol.
Swap data (used by mutex) - execute-through - serialized in memory.
 
A) The software serializes all shared updates through a mutex. B) Coherence synchronizes the multiple caches that could store the data. C) If the hardware stores all data in only one location, then the mutex will serialize all updates without coherence. 

All Java concurrency programmers know A. All computer architects know B. C requires 1) an allocation instruction that identifies shared data and 2) a mutex that serializes in main memory without coherence.
 
No cache for shared data means no cache coherence.
All shared updates must serialize through a swap to ensure currency.
 
 

 

An Alternative to Coherence

Different Algorithms for the Same Problem

 



Wednesday, July 24, 2024

The Same Problem

The Same Problem (detailed)
The algorithms are different so it is not apparent that the problem is the same.
 
Cache Coherence (current definition):
"Cache coherence is the regularity or consistency of data stored in cache memory. Maintaining cache and memory consistency is imperative for multiprocessors."

Cache Coherence problem restated:
Enables 1) multiple core processors 2) using cache copies to 3) update shared data when 4) other processors have different cache copies of the same data.
 
Part 4 is explicit in the current definition. Parts 1-3 are implicit. In particular, the hardware has no way to determine what is a shared data resource (3). The restatement is identical to the thread safety problem below.
 
Thread safety problem:
Enables 1) multiple processes 2) to use copies of data to 3) update a shared resource when 4) other processes have different copies of the same resource.

The coherence algorithm is to invalidate other caches. This algorithm increases as a binomial with the number of caches. This algorithm resolves (4) which is the current definition of cache coherence.

The thread safety algorithm is to use an atomic uninterruptible swap to ensure update currency. This algorithm increases linearly with the number of processes. This algorithm renders (4) irrelevant. The software resolves the data race in (3).

Thread safe software provides update integrity for multiple records, either via a single thread or via a mutex. Coherence synchronizes the caches with invalidation. However, if you bypass the cache for shared data, then you do not need the cache invalidation that is coherence. Thread safe hardware requires bypassing the cache and performing atomic instructions in memory instead of in the cache. However, this removes the cache. To create a cache that does not require coherence, merely designate at allocation that data IS private. The private cache does not require coherence, and it does not even require write-through.

Coherence has many issues that thread safety resolves. The most important is that coherence is binomial with the number of processors, while thread safety has linear contention for one record. Binomial overhead is why current multi-core is restricted to four. (*The fifth processor increases coherence overhead by 67%.) Hardware thread safety cannot be implemented today because the uninterruptible swap occurs in a cache. It is pseudo-atomic.

The thread safety algorithm requires that the uninterruptible swap be single-threaded. This is achieved by executing at a single location, the shared resource address, to ensure update currency. A lock will suffice, because the lock is an uninterruptible swap. The software single-threads through one shared resource location, but the hardware cannot. All current instructions are executed in the cache.

The change is to the hardware implementation of instructions. The software does not change.


* Each conversation is a conduit for invalidation. The fifth processor adds 4 conversations to the system while a 4 processor system has 6 conversations.
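The footnote's arithmetic in a few lines of Java: with n caches, every pair is a conduit, giving n(n-1)/2 conversations.

```java
// With n caches, every pair of caches is a conduit for invalidation,
// so there are n*(n-1)/2 conversations (the binomial coefficient C(n,2)).
public class Conversations {
    static int conversations(int caches) {
        return caches * (caches - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(conversations(4)); // 6 conversations with 4 processors
        System.out.println(conversations(5)); // 10: the 5th adds 4, a 4/6 ~ 67% increase
    }
}
```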

 


 


Monday, July 22, 2024

JVM

Each JVM is thread-safe.
Thread safety allows multiple JVMs to share system resources, such as memory allocation, on a uniprocessor.
Thread safety uses an atomic instruction to synchronize those processes on a uniprocessor.
To synchronize those processes on a multiprocessor requires coherence.

Paradox:
Multiple simultaneous processes on different processors create a data race.
Think of coherence as a processor race to invalidation.
Thread safety resolves any data race. So why coherence?
 
Why Coherence?
The current implementation of the atomic instruction is not thread-safe.
Currently all instructions execute in the cache.
The swap occurs in the cache and the caches require coherence.
If the swap bypasses the cache and occurs in main memory then, with some additional changes, coherence vanishes. If you bypass the cache, there is no cache coherence.

This is why any number of JVMs can execute concurrently while multiprocessors are limited to four: JVMs are thread-safe, but current hardware is not.

Thursday, July 18, 2024

An Alternative to Coherence

 

Thread-Safe Hardware
 
The current Thread-Safe software algorithm:
The atomic swap (CMPXCHG) finalizes associated changes.
Before the swap is executed, the software finalizes all data protected by that swap.
Thread-safe algorithm: Lock, finalize all changes, then unlock.
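The lock/finalize/unlock pattern can be sketched in Java, with compareAndSet standing in for the CMPXCHG-style swap. The two balance fields and the transfer amount are illustrative; the point is that all protected changes are finalized between the swap that locks and the store that unlocks.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class LockFinalizeUnlock {
    private static final AtomicBoolean lock = new AtomicBoolean(false);
    private static int balanceA = 100, balanceB = 0; // invariant: total is 100

    static void transfer(int amount) {
        while (!lock.compareAndSet(false, true)) { } // lock (atomic swap)
        balanceA -= amount;                           // finalize ALL protected
        balanceB += amount;                           // changes under the lock
        lock.set(false);                              // unlock
    }

    static int total() { return balanceA + balanceB; }

    public static void main(String[] args) {
        transfer(40);
        System.out.println(total()); // 100: the invariant survives the transfer
    }
}
```

No other thread can observe the state between the two field updates, because both are finalized before the unlock.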
 
Thread-safe hardware part 1 (update timing):
Problem 1: Write-through occurs prior to execution of the swap.
Explanation 1: Multitasking software is thread-safe. The software resolves the data race. It resolves all timing issues. Therefore write-through can occur prior to the atomic swap that finalizes the change. The software prevents partial changes from being overwritten by another process.
Result: Current write-through of protected data preserves thread-safe software logic.
 
Thread-safe hardware part 2 (coherence):
Problem 2: Currently writes in a multi-core processor require binomial invalidation of all other caches.  
Solution 2: This can be prevented if reads of protected shared data bypass the cache.
Result: Read-through of protected data eliminates coherence.
 
Read-through eliminates coherence because thread-safe software resolves timing issues.
However there are four hardware issues and one software issue.
Issue 1 - The computer can not identify protected shared data. This is data protected by thread safety from concurrent updates by multiple processes. It is data protected by an atomic swap.
Issue 2 - Currently all instructions, including the swap, occur in the cache. The swap is pseudo-atomic.
Issue 3 - Cache line currency.
Issue 4 - Software update currency.
Issue 5 - Performance impact.

Solution to 1 - A) Allocate shared data separately. Make all shared data bypass the cache. This eliminates the need to invalidate other caches. Since re-accessing shared data has unpredictable results, storing shared data in a cache is of no benefit. It could only benefit another process using the same cache. Because that process is also thread-safe, it resolves any timing issues.
Solution to 2 - B) Modify atomic swaps to serialize in main memory. The combination of (A)&(B) eliminates cache invalidation. No shared data resides in a cache.
Solution to 3 - Step 1 (A) enables a private cache that neither writes through nor transmits nor receives invalidation. Step 2 (B) requires that cache lines only write through changes. Reads of shared data fetch for one instruction.
Solution to 4 - Current software is thread-safe. Thread-safe resolves timing issues without coherence.
Solution to 5 - See (A). Re-accessing shared data has unpredictable results, so storing shared data in a cache is of no benefit to each process.

Thread-safe hardware part 3 (atomic):
Problem 3: Currently the pseudo-atomic swap is executed in the cache.
Solution 3: Serialize the swap in main memory. This makes the hardware thread-safe.
Result: Execute-through of the atomic swap preserves main memory integrity without coherence.

Thread-Safe Computer:
1 - Differentiate shared from private data at allocation.
2 - Private data cache uses neither write-through nor invalidation.
3 - Shared data cache is pass-through.
4 - Serialize the atomic swap in main memory. It is execute-through.
 
If you bypass the cache, there is no cache coherence.
Only place private data in the cache.
 
Benefit 1 - No coherence. No processor limit.
Benefit 2 - Private data requires neither write-through, nor cast-out, nor giving invalidation, nor receiving invalidation.
Benefit 3 - Shared data becomes contiguous, eliminating write-through for private data.
Benefit 4 - Runs existing multitasking software without any modification, albeit without a cache.
Benefit 5 - Enables simultaneous execution of the concurrent multitasking queue, reducing elapsed time.
Benefit 6 - Cores access only main memory, eliminating physical connections between caches.
Benefit 7 - GPUs can come with their own core processor. Cores are independent. The caches collaborate.
Benefit 8 - The operating system can delegate to multitasking multi-cores with a pointer swap.
Benefit 9 - Many applications contain only private data. (This should be the default allocation.)

Requirement 1 - Software allocation instruction that differentiates shared data from private data.
Requirement 2 - Either (A) place swap capability on each main memory chip or (B) serialize atomic swap instructions. Usage of atomic swaps is currently minimized; the software avoids contention for one resource. The swap often eliminates the lock by serializing through one virtual data address. Current software permits the swap to occur at any main memory location because it may reside in a contiguous record. 
 
Option (C) would be another instruction to allocate swap addresses and contiguous locations; however, this would require programming. Software currently does not distinguish swap areas from shared areas.
However, it already distinguishes shared from private. The software treats them differently; the hardware should too.

