Tuesday, July 30, 2024

Snoopy Protocol

 

Current Snoopy Protocol (performed on every instruction for ALL data, circa 1983)
on read - cache-hit - read from cache.
on read - cache miss - read from memory and store in cache.
on write - write-through and invalidate in EVERY cache.

Overhead grows binomially with the number of caches.
Invalidation occurs on writes for ALL data (because shared data is not identified).


Thread Safe Protocol:
Private data (mutex not required) - private cache - no protocol.
Shared data - read-through and write-through - no protocol.
Swap data - execute-through - serialized in memory

Saturday, July 27, 2024

Thread Safe Hardware

Two algorithms that synergize:

First algorithm (software):
The thread safety algorithm enables multiple software processes to share data.
The same algorithm could enable multiple processors with caches to share memory.
There are three types of data needed by thread safety - private data, shared data, and swap data. 
1 - Thread safety allocates private data so it will not be altered by other processes. This is data that can be changed without a mutex.
2 - Thread safety finalizes shared data changes while it is under mutex control.
3 - Thread safety serializes updates to a swap address with an atomic swap that performs a mutex, data swap (counter), or pointer swap.
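The three swap uses map directly onto standard Java concurrency primitives. This is an illustrative sketch of the software side only, not the proposed hardware; all class and field names are invented for the example:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.ReentrantLock;

public class SwapUses {
    static final ReentrantLock mutex = new ReentrantLock();   // use 1: mutex
    static final AtomicLong counter = new AtomicLong(0);      // use 2: data swap (counter)
    static final AtomicReference<int[]> head =
            new AtomicReference<>(new int[] {0});             // use 3: pointer swap

    public static void main(String[] args) {
        // Mutex: serialize a critical section.
        mutex.lock();
        try { /* finalize shared changes here */ } finally { mutex.unlock(); }

        // Data swap: retry a compare-and-swap until the update is current.
        long old;
        do { old = counter.get(); } while (!counter.compareAndSet(old, old + 1));

        // Pointer swap: publish a new record by swapping one reference.
        int[] oldRec, newRec;
        do {
            oldRec = head.get();
            newRec = new int[] {oldRec[0] + 1};
        } while (!head.compareAndSet(oldRec, newRec));

        System.out.println(counter.get() + " " + head.get()[0]); // 1 1
    }
}
```

All three uses funnel every shared update through one atomic operation, which is the serialization the posts below rely on.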
 
First Approach:
1 - Thread safety requires the ability to recognize private data. This could be done at allocation. The data can be placed in a virtual cache for identification. This data is not shared and does not require coherence.
2 - The current software logic finalizes shared data prior to mutex release. However, finalizing must occur in main memory, not in a cache; finalizing in a cache requires coherence.
3 - The current atomic swap is pseudo-atomic because it occurs in a cache. The swap must serialize in main memory. This creates a true atomic swap.
This approach requires only one software change: an allocation instruction to identify private data, and that change is for performance only. It also requires changes to hardware instructions, which must access cache memory or main memory depending on whether data is private or shared; results must be stored at the corresponding output memory address. The atomic swap must be execute-through: atomic, and serialized by single-threading through the swap address.
 
Second algorithm (hardware):
Cache coherence provides hardware consistency of data; the software delivers logical consistency.
Cache coherence merely delivers the values that the software expects: hardware consistency.
Cache coherence provides no software logical consistency.
When the hardware is thread safe, it provides hardware consistency itself and cache coherence disappears.
 
Current Snoopy Protocol (performed on every instruction for ALL data, circa 1983)
on read - cache-hit - read from cache.
on read - cache miss - read from memory and store in cache.
on write - write-through and invalidate in EVERY cache.

Overhead grows binomially with the number of caches.
Invalidation occurs on writes for ALL data (because shared data is not identified).

Proposed Thread Safe Hardware Protocol:
Private data (updated without mutex) - private cache - no protocol.
Shared data (protected by mutex) - read-through and write-through - no protocol.
Swap data (used by mutex) - execute-through - serialized in memory.
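The proposed protocol can be modeled as a simple dispatch on data class. This is a toy model for illustration; the names (DataClass, route) are invented, not a hardware specification:

```java
// Toy model of the proposed protocol: route each access by data class.
public class ProtocolModel {
    enum DataClass { PRIVATE, SHARED, SWAP }

    static String route(DataClass d, boolean write) {
        switch (d) {
            case PRIVATE: return "cache";             // private cache, no protocol
            case SHARED:  return write ? "write-through" : "read-through";
            default:      return "execute-through";   // serialized at the swap address
        }
    }

    public static void main(String[] args) {
        System.out.println(route(DataClass.PRIVATE, false)); // cache
        System.out.println(route(DataClass.SHARED, true));   // write-through
        System.out.println(route(DataClass.SWAP, true));     // execute-through
    }
}
```

The point of the model: no branch ever consults or invalidates another cache, which is why no coherence protocol appears.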
 
A) The software serializes all shared updates through a mutex.
B) Coherence synchronizes the multiple caches that could store the data.
C) If the hardware stores all data in only one location, then the mutex will serialize all updates without coherence.

All Java concurrency programmers know A. All computer architects know B. C requires 1) an allocation instruction that identifies shared data and 2) a mutex that serializes in main memory without coherence.
 
No cache for shared data means no cache coherence.
All shared updates must serialize through a swap to ensure currency.
 
 

 

An Alternative to Coherence

Different Algorithms for the Same Problem

 



Wednesday, July 24, 2024

The Same Problem

The Same Problem (detailed)
The algorithms are different so it is not apparent that the problem is the same.
 
Cache Coherence (current definition):
"Cache coherence is the regularity or consistency of data stored in cache memory. Maintaining cache and memory consistency is imperative for multiprocessors."

Cache Coherence problem restated:
Enables 1) multiple core processors 2) using cache copies to 3) update shared data when 4) other processors have different cache copies of the same data.
 
Part 4 is explicit in the current definition. Parts 1-3 are implicit. In particular, the hardware has no way to determine what is a shared data resource (3). The restatement is identical to the Thread safety problem below.
 
Thread safety problem:
Enables 1) multiple processes 2) to use copies of data to 3) update a shared resource when 4) other processes have different copies of the same resource.

The coherence algorithm is to invalidate other caches. Its overhead grows binomially with the number of caches. This algorithm resolves (4), which is the current definition of cache coherence.

The thread safety algorithm is to use an atomic uninterruptible swap to ensure update currency. Its overhead grows linearly with the number of processes. This algorithm renders (4) irrelevant. The software resolves the data race in (3).
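The single-location serialization the thread-safe algorithm relies on can be sketched with a compare-and-swap retry loop in Java. This illustrates the software algorithm only, not a hardware design:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CurrencyTest {
    static final AtomicInteger shared = new AtomicInteger(0);

    // Every task funnels its updates through the one shared location.
    static void bump(int times) {
        for (int i = 0; i < times; i++) {
            int seen;
            do {
                seen = shared.get();                         // read the current value
            } while (!shared.compareAndSet(seen, seen + 1)); // retry if no longer current
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> bump(1000));
        Thread b = new Thread(() -> bump(1000));
        a.start(); b.start(); a.join(); b.join();
        System.out.println(shared.get()); // 2000: no update was lost
    }
}
```

Contention here is per record: only tasks racing on the same address retry, which is the linear behavior contrasted with binomial invalidation.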

Thread-safe software provides update integrity for multiple records via either a single thread or a mutex. Coherence synchronizes the caches with invalidation. However, if shared data bypasses the cache, then the cache invalidation that constitutes coherence is not needed. Thread-safe hardware requires bypassing the cache and performing atomic instructions in memory instead of in the cache. That removes the cache for shared data. To create a cache that does not require coherence, merely designate at allocation that data IS private. The private cache requires neither coherence nor write-through.

Coherence has many issues that thread safety resolves. The most important is that coherence overhead grows binomially with the number of processors, while thread safety has linear contention for one record. Binomial overhead is why current multi-core is restricted to four. (*The fifth processor increases coherence overhead by 67%.) Hardware thread safety cannot currently be implemented because the uninterruptible swap occurs in a cache; it is pseudo-atomic.

The thread safety algorithm requires that the uninterruptible swap be single-threaded. This is ensured by executing at a single location, the shared resource location, to ensure update currency. A lock will suffice, because the lock is an uninterruptible swap. The software single-threads through one shared resource location, but the hardware cannot: all current instructions are executed in the cache.

The change is to the hardware implementation of instructions. The software does not change.


* Each conversation is a conduit for invalidation. The fifth processor adds 4 conversations to the system while a 4 processor system has 6 conversations.
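The footnote's arithmetic, counting each pair of caches as one conversation (n choose 2), checks out:

```java
public class Conversations {
    // Number of pairwise conversations among n caches: n choose 2.
    static int pairs(int n) { return n * (n - 1) / 2; }

    public static void main(String[] args) {
        System.out.println(pairs(4)); // 6 conversations among 4 caches
        System.out.println(pairs(5)); // 10: the fifth processor adds 4
        System.out.println(100 * (pairs(5) - pairs(4)) / pairs(4)); // 66 (~67% increase)
    }
}
```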

 


 


Monday, July 22, 2024

JVM

Each JVM is thread-safe.
Thread safety allows multiple JVMs to share system resources, such as memory allocation, on a uniprocessor.
Thread safety uses an atomic instruction to synchronize those processes on a uniprocessor.
To synchronize those processes on a multiprocessor requires coherence.

Paradox:
Multiple simultaneous processes on different processors create a data race.
Think of coherence as a processor race to invalidation.
Thread safety resolves any data race. So why coherence?
 
Why Coherence?
The current implementation of the atomic instruction is not thread-safe.
Currently all instructions execute in the cache.
The swap occurs in the cache and the caches require coherence.
If the swap bypasses the cache and occurs in main memory then, with some additional changes, coherence vanishes. If you bypass the cache, there is no cache coherence.

This is why infinite JVMs can execute concurrently, but multiprocessors are limited to 4.
Because JVMs are thread-safe but current hardware is not.

Thursday, July 18, 2024

An Alternative to Coherence

 

Thread-Safe Hardware
 
The current Thread-Safe software algorithm:
The atomic swap (CMPXCHG) finalizes associated changes.
Before the swap is executed, the software finalizes all data protected by that swap.
Thread-safe algorithm: Lock, finalize all changes, then unlock.
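The lock, finalize, unlock sequence in illustrative Java (the field and method names are invented for the example):

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockFinalizeUnlock {
    static final ReentrantLock lock = new ReentrantLock();
    static int balanceA = 100, balanceB = 0; // a pair of shared records

    static void transfer(int amount) {
        lock.lock();                 // 1. lock
        try {
            balanceA -= amount;      // 2. finalize ALL associated changes
            balanceB += amount;      //    while the lock is held
        } finally {
            lock.unlock();           // 3. unlock
        }
    }

    public static void main(String[] args) {
        transfer(40);
        System.out.println(balanceA + " " + balanceB); // 60 40
    }
}
```

No other thread can observe the partial state between the two writes, which is what "finalize before the swap" means here.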
 
Thread-safe hardware part 1 (update timing):
Problem 1: Write-through occurs prior to execution of the swap.
Explanation 1: Multitasking software is thread-safe. The software resolves the data race. It resolves all timing issues. Therefore write-through can occur prior to the atomic swap that finalizes the change. The software prevents partial changes from being overwritten by another process.
Result: Current write-through of protected data preserves thread-safe software logic.
 
Thread-safe hardware part 2 (coherence):
Problem 2: Currently writes in a multi-core processor require binomial invalidation of all other caches.  
Solution 2: This can be prevented if reads of protected shared data bypass the cache.
Result: Read-through of protected data eliminates coherence.
 
Read-through eliminates coherence because thread-safe software resolves timing issues.
However there are four hardware issues and one software issue.
Issue 1 - The computer can not identify protected shared data. This is data protected by thread safety from concurrent updates by multiple processes. It is data protected by an atomic swap.
Issue 2 - Currently all instructions, including the swap, occur in the cache. The swap is pseudo-atomic.
Issue 3 - Cache line currency.
Issue 4 - Software update currency.
Issue 5 - Performance impact.

Solution to 1 - A) Allocate shared data separately. Make all shared data bypass the cache. This eliminates the need to invalidate other caches. Since re-accessing shared data has unpredictable results, storing shared data in a cache is of no benefit. It could only benefit another process using the same cache. Because that process is also thread-safe, it resolves any timing issues.
Solution to 2 - B) Modify atomic swaps to serialize in main memory. The combination of (A)&(B) eliminates cache invalidation. No shared data resides in a cache.
Solution to 3 - Step 1 (A) enables a private cache that neither writes through, nor sends, nor receives invalidation. Step 2 (B) requires that cache lines only write through changes. Reads of shared data fetch for one instruction.
Solution to 4 - Current software is thread-safe. Thread-safe resolves timing issues without coherence.
Solution to 5 - See (A). Re-accessing shared data has unpredictable results, so storing shared data in a cache is of no benefit to any process.

Thread-safe hardware part 3 (atomic):
Problem 3: Currently the pseudo-atomic swap is executed in the cache.
Solution 3: Serialize the swap in main memory. This makes the hardware thread-safe.
Result: Execute-through of the atomic swap preserves main memory integrity without coherence.

Thread-Safe Computer:
1 - Differentiate shared from private data at allocation.
2 - Private data cache uses neither write-through nor invalidation.
3 - Shared data cache is pass-through.
4 - Serialize the atomic swap in main memory. It is execute-through.
 
If you bypass the cache, there is no cache coherence.
Only place private data in the cache.
 
Benefit 1 - No coherence. No processor limit.
Benefit 2 - Private data requires neither write-through, nor cast-out, nor sending nor receiving invalidation.
Benefit 3 - Shared data becomes contiguous, eliminating write-through for private data.
Benefit 4 - Runs existing multitasking software without any modification, albeit without a cache.
Benefit 5 - Enables simultaneous execution of the concurrent multitasking queue, reducing elapsed time.
Benefit 6 - Cores access only main memory, eliminating physical connections between caches.
Benefit 7 - GPUs can come with their own core processor. Cores are independent. The caches collaborate.
Benefit 8 - The operating system can delegate to multitasking multi-cores with a pointer swap.
Benefit 9 - Many applications contain only private data. (This should be the default allocation.)

Requirement 1 - Software allocation instruction that differentiates shared data from private data.
Requirement 2 - Either (A) place swap capability on each main memory chip or (B) serialize atomic swap instructions. Usage of atomic swaps is currently minimized; the software avoids contention for one resource. The swap often eliminates the lock by serializing through one virtual data address. Current software permits the swap to occur at any main memory location because it may reside in a contiguous record.
 
Option (C) would be another instruction to allocate swap addresses and contiguous locations; however, this would require programming. Software currently does not distinguish swap areas from shared areas.
However, it already distinguishes shared from private. The software treats them differently; the hardware should too.


Saturday, July 13, 2024

Thread-Safe Defined

Thread-Safe Algorithm - A Single-Threaded Swap Ensures Currency

CMPXCHG is NOT atomic. It is pseudo-atomic.

This compares software requirements for thread safety with current hardware design and shows how to implement thread-safe hardware.

Thread-Safe Software Rules:
1 - Private data is protected from other tasks. (re-entrant memory allocation)
2 - Shared data is protected by data swap, pointer swap, or lock (i.e. mutex, semaphore).
3 - Swap data is the memory location that ensures currency.
4 - Shared data is either a swap data address or protected by a swap data address.
5 - The swap must occur at the swap data address or single-thread to test currency.
6 - The swap tests the currency of the proposed update and finalizes if current. 
Result - The result is no task limit.

Current Hardware scorecard:
1 - Currently not identified.
2 - Currently not identified.
3 - Identified by the swap instruction.
4 - Currently not identified.
5 - The swap currently single-threads with a cache copy and not in main memory. (pseudo-atomic)
6 - The swap currently requires coherence to ensure cache copy currency required by the swap.
Result - Coherence imposes a processor limit.
 
Thread-Safe Hardware checklist:
1 - Identified by allocation instruction. (cache)
2 - Identified by allocation instruction. (memory)
3 - Identified by the swap instruction. (swap area)
4 - Identified by allocation. (memory and swap area)
5 - The swap single threads in shared main memory. (atomic Coherence-Free Swap)
6 - The swap needs no coherence; it is atomic.
Result - The result is no processor limit.

#1, #2, #4 enable a cache.
#5 & #6 are required for thread-safe.
 
Software Phase 1: Fix the swap and run without a cache. (all allocation is for shared memory)
Software Phase 2: Modify allocation of private data to a cache that does not require coherence. Some allocations might perform faster as two allocation instructions.
 
The journal article presents the allocation first because it immediately reaps a performance benefit by reducing coherence.


 
 
 
 
Speed Comparison - Why thread-safe computers are faster even at two cores. 
 
 
 

Friday, July 12, 2024

 The Hardware Conundrum

You can have infinite concurrent processes running on a computer.
With JVM, each process could run on its own core.
However hardware has a limitation on the cores.
The limitation is due to cache synchronization which is a binomial algorithm.
Just as JVM requires thread-safe software, the cores could be made thread-safe hardware.
 
Thread-Safe requires an update with an uninterruptible instruction and for shared data to be stored in only one place.
 
There is a corollary that is not required for thread-safe, but enhances performance. The software minimizes the amount and duration of locked shared data. 
 
There are two major issues and resolution creates two caches.
1 - Software simply knows what data is shared due to program logic and treats it in a thread-safe manner.
However, the hardware has no way of identifying shared data; to the hardware it is just data. The hardware therefore cannot identify, let alone minimize, the amount of shared data.

2 - Software uses an uninterruptible instruction which is atomic on a multiprocessor.
However the implementation of the atomic instruction requires binomial coherence. The instruction is pseudo-atomic because it is implemented in the cache and the cache requires coherence. It needs to be atomic in main memory, bypassing the cache and avoiding coherence.
Step 1 from a software perspective differs from the hardware step 1 mentioned elsewhere in this blog. 
 
Software Step 1:
Remove the shared cache by making it pass-through and modifying the atomic swap to be performed in memory. The thread-safe logic still holds and ensures hardware update integrity. There are three types of data, and all are protected.
1 - Private data is protected by software logic.
2 - Swap data is protected by the swap instruction.
3 - The remainder of shared data is protected by a combination of the three swap uses: lock, data swap, or pointer swap.
 
Software Step 2:
Create a private cache and modify existing memory allocation instructions to allocate either private memory or shared memory. This can be thought of as a two-cache computer. One cache contains private data that requires neither coherence nor write-through. The other contains shared data but is virtual and completely pass-through.

Simply stated, this system has a cache that does not require coherence.


 


Friday, July 5, 2024

Thread-safe Algorithm

 
The Paradox:
The software performs logical update protection without coherence.
The hardware performs hardware update protection with coherence.
Coherence imposes a processor limit.
 
The Software Solution:
Allocate memory when needed to self-protect data from other processes.
(The remainder of this article pertains to shared data only. This excludes static or common data that has change control, such as programs. It also excludes read-only data which changes but that the process does not modify, such as timestamps. Shared data is defined as data the process modifies that is also subject to change by another process during execution.)
Logically protect shared data memory updates by coding for a data race.
An atomic Hardware Synchronization Primitive (HSP, i.e. CMPXCHG or Compare and Swap) provides logical shared memory update protection using a combination of three methods.
An HSP logically protects shared memory updates with a lock, data swap, or pointer swap. This permits programs to be thread-safe and eliminates the data race.
The software is thread-safe because it recognizes shared data and handles it differently.
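For example, the pointer-swap method: a thread finalizes a private, immutable copy, then publishes it with one atomic pointer swap. Illustrative Java, with invented names, not the author's code:

```java
import java.util.concurrent.atomic.AtomicReference;

public class PointerPublish {
    // The published record is immutable; only the pointer is ever swapped.
    static final class Config {
        final int version; final String value;
        Config(int version, String value) { this.version = version; this.value = value; }
    }

    static final AtomicReference<Config> current =
            new AtomicReference<>(new Config(1, "old"));

    static void publish(String value) {
        Config seen, next;
        do {
            seen = current.get();
            next = new Config(seen.version + 1, value);   // finalize a private copy
        } while (!current.compareAndSet(seen, next));     // the swap tests currency
    }

    public static void main(String[] args) {
        publish("new");
        System.out.println(current.get().version + " " + current.get().value); // 2 new
    }
}
```

Readers either see the old record or the new one, never a partial update: the shared state changes only at the single swap address.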
 
Thread-Safe Algorithm:
Thread-Safe requires an update with an uninterruptible instruction and for data to be stored in only one place.

The Hardware Issue:
An HSP must be uninterruptible on a uni-processor.
An HSP must be atomic on a multiprocessor.
Current HSPs are pseudo-atomic since they are implemented in the cache and multiple caches require coherence to synchronize.
 
No Benefit:
However, there is no benefit to implementing the HSP in main memory alone, because coherence is required for all shared data, not just the swap address.
So make the cache pass-through for all shared data. This eliminates the coherence requirement on shared data. And non-shared data does not require coherence.
The misconception is that the cache benefits repeat reads of shared data. It does not, because the software avoids repeat reads of shared data. Repeat reads should only occur when different processes on the same core processor read the same shared data address; software resolves this without coherence. The software is thread-safe.
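The read-once idiom the paragraph relies on, sketched in Java (the work function is invented for illustration):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ReadOnce {
    static final AtomicInteger shared = new AtomicInteger(7);

    static int work() {
        int local = shared.get();   // ONE read of the shared location
        // All further use is of the private copy; re-reading `shared`
        // could observe a different value mid-computation.
        return local * local + local;
    }

    public static void main(String[] args) {
        System.out.println(work()); // 56
    }
}
```

Since each shared address is read once per use, a cache line holding it is never re-read, which is the claimed reason caching shared data yields no benefit.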
 
The Hardware Solution:
Implement the HSP in main memory and the HSP becomes atomic and coherence is eliminated.
Instructions that update main memory with a data swap are logically protected from a data race. 
Instructions that update main memory within a lock or before a pointer swap, are logically protected by that lock or swap.
The result is hardware update protection without coherence.
Current coherence methods are write-through, so it is only the modified HSP instruction that incurs increased overhead.
The result is Thread-Safe Hardware.
Thread-safe computers eliminate coherence by using the thread-safe software algorithm.
It has no core processor limit.




 

Patent Pending

Thursday, July 4, 2024

Multitasking vs Concurrent vs Thread-safe vs Coherence

On a Uni-processor:
All computers multitask.
Multitasking is concurrent. Multitasking permits many processes to run on one core processor.
Processes that are concurrent must self-protect their data.
These processes are thread-safe.
The data race is resolved by serializing updates through one (cache) location.
 
On a Multiprocessor:
Thread-safe self-protection enables software to execute on a multiprocessor without a data race.
Coherence enables hardware to deliver the correct value on a multiprocessor.
They serve the same function: coherence solves the hardware data race.
Coherence is required when updates occur in many caches instead of at one location.

Thread-safe Computers:
Thread-safe hardware implements the thread-safe algorithm to eliminate coherence.
Thread-safe hardware enables a processor to self-protect shared data without a data race.
Coherence's data race algorithm imposes a processor limit. 
But the thread-safe algorithm has no limit; software has no limit on the number of tasks.

Obvious Benefit:
Thread-safe hardware places data into two caches. 
One cache is private and does not require coherence.
This is faster than the current system which requires coherence on ALL data.
Simply implementing a private cache reduces the coherence that limits core processors.

Definitions:

Software Issue:
Thread safety - "Thread safety is the avoidance of data races—situations in which data are set to either correct or incorrect values, depending upon the order in which multiple threads access and modify the data."

Hardware Issue:
Cache Coherence - "In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system."

Concurrency:
Concurrency - "In computing, multitasking is the concurrent execution of multiple tasks (also known as processes) over a certain period of time."
 

Tuesday, July 2, 2024

Thread safety eliminates Hardware Coherence

 

Software Issue:
Thread safety - "Thread safety is the avoidance of data races—situations in which data are set to either correct or incorrect values, depending upon the order in which multiple threads access and modify the data."

Hardware Issue:
Cache Coherence - "In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system."

The software solution to data races is to create thread-safe programs. This is known as java concurrency. Java concurrency is coded by writing a program that runs as a single process.

Java Concurrency - "Most implementations of the Java virtual machine run as a single process. In the Java programming language, concurrent programming is primarily concerned with threads (also called lightweight processes). Multiple processes can only be realized with multiple JVMs."

The JVM runs threads as a single process. It is a virtual machine that runs Java threads within one process. The Java virtual machines do not have coherence; they have Java concurrency. We can simply replace JVMs with single-process thread-safe core processors. The same algorithm that is used for JVMs can be used to make thread-safe hardware that does not require coherence.

Implementing the algorithm requires making four changes to current computer architecture.

First - You need an atomic instruction that updates main memory without interruption. Coherence exists because the current pseudo-atomic instruction is performed in the cache.

Second - You need a cache for each core processor.
 
Third - You need to identify memory that is not shared and place it in an exclusive cache.

Fourth - Memory that does not reside in the cache can be safely written to main memory because it has java concurrency. Java concurrency resolves all data race issues with an atomic instruction.  But that instruction is currently pseudo-atomic.

This method of maintaining coherent main memory is slower in only two situations.
1 - The atomic swap must occur in main memory and not in the cache.
2 - Repeat reads of shared data; however, Java concurrency does not permit this.

The benefit is that the exclusive cache does not require coherence. It also no longer needs to be write-through. In addition, writes of shared data do not trigger coherence traffic. However, the main benefit is that JVMs work in parallel. Thread-safe computers work in parallel. Just add another core processor.








