Saturday, June 29, 2024

Speed Comparison

 

Current technology has one cache and ALL data requires coherence.

This proposal has two caches.

The first cache contains private data, which does not require coherence.

Updates to this cache are faster because it does not write through to main memory. It also does not require invalidation. The cost of invalidation grows binomially with the number of processors, which limits how many a system can have. Invalidation also requires dedicated physical connections to keep its traffic off the memory bus.


The second cache is virtual: it passes writes through to main memory. A snoopy protocol is also write-through, but it is less efficient because it additionally entails invalidation.

The second cache reads from main memory. Current systems read from memory on the first access (a cache miss), so they are faster only on repeat reads. However, software does not perform repeat reads on shared data, because shared data is subject to change and the results would be unpredictable.

Repeat reads occur only when different processes on the same core read the same shared data address; software handles these repeat reads without coherence by being thread-safe. The software ensures every update is current.

The proposed coherence-free system is faster than current systems except for the atomic swap. The swap currently occurs in the cache, which entails coherence. The proposed swap must execute single-threaded in main memory, with no cache copies, eliminating cache coherence.
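The proposed main-memory swap can be sketched in software (a toy model, not the hardware design itself; all names here are invented, and a Python lock stands in for the memory controller serializing swaps):

```python
import threading

# Toy model: main memory is a dict, and the lock stands in for the
# memory controller running each swap single-threaded. No per-core
# cache copy of the location is ever made, so nothing needs coherence.
main_memory = {"flag": 0}
_controller = threading.Lock()

def atomic_swap(address, new_value):
    """Swap new_value into main memory and return the old value.
    The swap executes uninterrupted at main memory, with no cache copies."""
    with _controller:
        old = main_memory[address]
        main_memory[address] = new_value
        return old

# Two threads contend for the flag; exactly one of them wins the old 0.
results = []
threads = [threading.Thread(target=lambda: results.append(atomic_swap("flag", 1)))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every swap is serialized at memory, both contenders see a consistent old value and no update is lost, without any invalidation traffic.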

Conclusion:
Only one instruction is slightly slower, and programmers already minimize its use.

To compensate, simply add another core processor. Adding core processors shortens the multitasking queue. This reduces elapsed time but not execution time: execution becomes simultaneous instead of concurrent. Cores no longer require a physical connection to one another.

Infinite cores > four cores


Different Algorithms for the Same Problem



Friday, June 28, 2024

Why Atomic has Coherence

 Current atomic instructions have coherence.
They are pseudo-atomic.


This is because they, like every instruction, are implemented to work inside the cache.

Anything stored in the cache requires coherence because it is stored in multiple locations.

 

The solution is to implement atomic instructions that bypass the cache. The swap must be performed in main memory, or at least uninterrupted in main memory. This is a change even though current caches are write-through, because the cache retains results, which then require invalidation.

This might seem inefficient. However, current writes pass through anyway. The software also avoids repeat reads of shared data, simply because it is subject to change; it works with a snapshot copy instead. Therefore the entire shared cache can be virtual and pass-through, eliminating the cache copy.
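The snapshot idiom mentioned above might look like this in practice (an illustrative Python sketch; the names are hypothetical): the code reads shared data once, then operates only on its private copy, so the shared location is never re-read.

```python
import copy

# Shared data lives in "main memory" and may change at any time.
shared_order = {"item": "widget", "qty": 3}

def process_order(shared):
    # Take a snapshot once; all further work uses the private copy,
    # so no repeat reads of the shared location occur.
    snapshot = copy.deepcopy(shared)
    snapshot["total"] = snapshot["qty"] * 10
    return snapshot

result = process_order(shared_order)
shared_order["qty"] = 99      # a later change to the shared data...
```

The snapshot is unaffected by the later change (`result["total"]` is still 30), which is exactly why repeat reads of shared memory are unnecessary.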

The swap itself is slower because it must occur in main memory. Non-swap shared data is slower on repeat reads, but those will not exist within a process or JVM. However, eliminating coherence enables scalable core processors with no coherence overhead.

 

Therefore no coherence on shared data.

No coherence on private data.

An allocation instruction that differentiates shared data from private data.

The atomic instruction can be implemented as atomic in main memory, rather than in the cache, and therefore without coherence. (execute-through)
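The four points above can be pictured with a toy model (purely illustrative; the class and method names are invented, not part of the proposal's specification): an allocation flag routes each variable either to a private write-back cache with no coherence, or straight through a virtual, pass-through cache to main memory.

```python
class TwoCacheMemory:
    """Toy model of the proposal. Private data lives in a write-back
    cache with no coherence; shared data passes through a virtual
    cache directly to main memory."""

    def __init__(self):
        self.main_memory = {}
        self.private_cache = {}   # write-back, never invalidated

    def alloc(self, address, shared):
        # The proposed allocation instruction: the programmer states
        # up front whether the data is shared or private.
        self.main_memory[address] = None
        if not shared:
            self.private_cache[address] = None

    def write(self, address, value):
        if address in self.private_cache:
            self.private_cache[address] = value   # fast: no write-through
        else:
            self.main_memory[address] = value     # pass-through write

    def read(self, address):
        if address in self.private_cache:
            return self.private_cache[address]
        return self.main_memory[address]          # always reads memory

mem = TwoCacheMemory()
mem.alloc("x", shared=False)
mem.alloc("y", shared=True)
mem.write("x", 1)   # stays in the private cache
mem.write("y", 2)   # goes straight to main memory
```

After these writes, `x` never reaches main memory and `y` is never cached, so neither cache holds a duplicate that would need coherence.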



Thursday, June 27, 2024

Coherence = Data Race

 

Coherence and Data Race achieve the same goal.
They both make sure data has update integrity.
The software has program logic that provides logical integrity.
Coherence ensures hardware update integrity.
Thread safety solves the data race and ensures software update integrity.

However the hardware does not know what data is shared; the software does.
This distinction enables software to use an algorithm that handles an infinite number of tasks,
because software treats shared data differently.
Hardware cannot.

So a new allocation instruction is needed to differentiate shared data from private data,
because private data does not require coherence: it is thread-safe.
Creating a thread-safe cache that does not require coherence results in less coherence overhead.
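For comparison, a minimal Python sketch of what thread safety means at the software level: an unprotected read-modify-write loses updates (a data race), and a mutex restores update integrity, the same goal coherence serves in hardware. The interleaving below is replayed deterministically for illustration.

```python
import threading

# Deterministic replay of a lost update (a data race): two threads
# both read the counter before either writes it back.
counter = 0
t1_read = counter        # thread 1 reads 0
t2_read = counter        # thread 2 reads 0, interleaved before the write
counter = t1_read + 1    # thread 1 writes back 1
counter = t2_read + 1    # thread 2 writes back 1 -- thread 1's update is lost
assert counter == 1      # two increments happened, but only one "took"

# The thread-safe fix: a mutex makes the read-modify-write indivisible,
# giving software the update integrity coherence gives hardware.
counter = 0
lock = threading.Lock()

def safe_increment():
    global counter
    with lock:
        counter += 1

threads = [threading.Thread(target=safe_increment) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the mutex, all 100 increments survive, without the hardware knowing anything about which data was shared.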


 A Coherence-Free Processor Design

A Thread-Safe Computer

(Design published in Journal of Computer Science and Technology)

In the late 1960s my Dad, Prof. C.N. Yang (Nobel Prize in Physics, 1957), was teaching gauge field theory in physics when he realized that mathematical fiber bundle theory had the exact same formula. The terms and symbols were different, but it was the exact same formula. Dad joked that mathematicians had managed to solve the same problem even though they have no connection with the physical world. Coming from a theoretical physicist!

Coherence and Thread-Safe:
Coherence and thread-safe address the same issue.
However thread safety handles infinite tasks while coherence allows only four processors.

Software Issue:
Thread safety - "Thread safety is the avoidance of data races—situations in which data are set to either correct or incorrect values, depending upon the order in which multiple threads access and modify the data."

Hardware Issue:
Cache Coherence - "In computer architecture, cache coherence is the uniformity of shared resource data that ends up stored in multiple local caches. When clients in a system maintain caches of a common memory resource, problems may arise with incoherent data, which is particularly the case with CPUs in a multiprocessing system."

The solution is thread safe hardware. (overview below)

The Same Problem:
The thread safety algorithm works for multiple software processes sharing data.
The same algorithm could apply to multiple processors with caches sharing memory.
They are manifestations of the same problem. (detailed explanation)
The algorithm the software uses supports infinite tasks.
The algorithm the hardware uses becomes inefficient at five processors.

The software algorithm can be used on the hardware, entirely eliminating coherence.
This eliminates the core processor limit on multiprocessors.
It is an algorithm, so it is universal.
JVMs have thread safety, which enables Java concurrency.

The Hardware is Missing Information:
The hardware does not know what data is shared. However the software does.
Actually the software does not, but the programmer knows.
The program logic knows.
 
The Solution - Thread-Safe Design:
The solution uses current software.
The new computer design manages memory differently so it is always coherent.
The solution uses two different algorithms, creating two caches. (Albeit one cache is pass-through.)
A new memory allocation instruction specifies the cache where the data must be allocated.
The hardware now knows whether data is exclusive or shared.
One cache is private and the other cache is thread-safe.
The thread-safe cache can be made virtual by making it pass-through.
Currently the atomic swap occurs in the cache, requiring coherence. It is pseudo-atomic.
The atomic swap must be implemented as uninterruptible in main memory. This eliminates coherence. 
 
Requires:
Two new caches, a new instruction, and a modified atomic swap that is atomic and uninterruptible in main memory.
The modified swap eliminates coherence. The new instruction just enables the new caches.
 
The result - Thread-Safe computers:
Shared data bypasses the cache, so there is no cache coherence.
Private data is stored in a cache that does not perform coherence.

Poof, coherence disappears from both caches. 
Because the software algorithm does not have coherence!
The programmer knows what data is shared!
The new design runs current software because it uses current software logic.
Instructions are the same but memory is kept coherent. 
Nothing changes except for the cache. And the cache is transparent.
The new instruction is required only for performance (speed comparison).

History:
Coherence dates to 1965 and relational databases were invented in 1970.
Relational databases taught programmers to store data in one place.
Compare and Swap was implemented in 1973. 
However it was implemented in the cache and all instructions still are.
And having shared data in more than one cache requires coherence.
But coherence was required because the hardware did not know what data was shared.
Till now.

Needed:
A computer model.



Published article, the Algorithm, and the New Architecture:

A Coherence-Free Processor Design - Peer reviewed and published article

Publisher link

WIPO Patent Application 

 

Introducing Thread-Safe Computers - The 2 Step Implementation Approach. Step 1 improves performance. Step 2 eliminates coherence.
 
Mutex - In brief, how the mutex works.

Thread Safe Hardware - Thread Safe Algorithm

Thread Safe Defined - Thread Safe Design Requirements

JVM is Thread-Safe - Why a JVM is thread-safe but processors are not.
 
An Alternative to Coherence  - Detailed description of the implementation.

Speed Comparison - Why thread-safe computers are faster even at two cores. 
 
 
Further Detail:
 
The Thread-safe Algorithm - The Algorithm Enabling Updates without Coherence.

Multitasking vs Concurrent vs Thread-safe vs Coherence - How thread-safe can replace coherence.

Coherence Equals Data Race - More on why they are the same problem.

Coherence-Free Thread-Safe Hardware - Main blog page.

Why Atomic currently has coherence - Why current atomic instructions are not atomic.

Thread-Safe Java Concurrency Eliminates Coherence - How Java Virtual Machines have no coherence.

Hardware Conundrum - The current hardware has insufficient information.

Draft for wiki (not an entry due to COI)

 

Franklin Yang   杨光诺  

FrankYang43338@acm.org
FrankYang43338@gmail.com

 



 

 

 

Monday, June 3, 2024

Introducing Thread-safe Computers

(Peer reviewed and published here.)

 

Thread-safe computers run multiple threads simultaneously and in parallel with no data races.

Software is already thread-safe since it is re-entrant.
However current multi-core processors require coherence or invalidation.

These methods limit the number of threads executed simultaneously in a system.

Simultaneous execution requires one core per thread. 

Insufficient cores generate concurrency, which is multitasking.

Using thread-safe core processors eliminates both coherence and invalidation.

Using thread-safe core processors permits sufficient cores to eliminate concurrency.

Concurrency is reduced by delegating to another core. 

Concurrency increases elapsed time, though not execution time. 

The thread-safe algorithm has two major steps.


Step 1:

Allocate private data in a coherent thread-safe cache. The entire cache is coherent.

The result is less coherence, even without step 2. This is because current computers require coherence on all data. Implementation requires a new allocation instruction.


Step 2:

Modify the atomic instruction (i.e., compare and swap or cmpxchg) which updates shared memory to be thread-safe. (The current implementation requires coherence.)

Thread-safe core processors synchronize hardware updates using this thread-safe atomic instruction for physical update integrity.

Therefore a thread-safe computer has hardware update integrity, whereas existing computers require coherence to provide hardware update integrity.

Current software’s thread-safe logic works for a thread-safe computer. (The software is thread-safe.)

Therefore a thread-safe computer has both physical hardware update integrity and logical software update integrity.


Step 2 summary - A thread-safe computer uses atomic instructions modified to be thread-safe and therefore implemented without coherence. This eliminates the current limit on the number of core processors in a system.
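Step 2 can be sketched in software terms (hypothetical names; a Python lock stands in for the uninterruptible main-memory cycle): a compare-and-swap that executes only in main memory, with shared-memory updates built as standard retry loops on top of it.

```python
import threading

main_memory = {"counter": 0}
_mem_lock = threading.Lock()   # models the uninterruptible memory cycle

def compare_and_swap(address, expected, new):
    """Sketch of the modified atomic instruction: it executes
    atomically in main memory, never in a cache, so there are no
    copies to keep coherent."""
    with _mem_lock:
        if main_memory[address] == expected:
            main_memory[address] = new
            return True
        return False

def increment(address):
    # Standard CAS retry loop: re-read and retry until the swap wins.
    while True:
        old = main_memory[address]   # pass-through read, no cached copy
        if compare_and_swap(address, old, old + 1):
            return

threads = [threading.Thread(target=increment, args=("counter",))
           for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

All 50 increments land despite contention, because every swap is decided in one place: main memory.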

Needed:

I am in need of a writer to explain the impact of the first architectural change to multitasking computers in the last 60 years.

It is essential for AI because it permits multiple simultaneous processes.

In current architecture, coherence limits the number of cores, thus limiting the number of simultaneous threads.


Keywords

Multiprocessor, Multitasking, Thread-safe, Cache, Coherence, Atomic, Snoopy


Journal of Computer Science and Technology article

Draft for wiki (not an entry due to COI)

Other explanations (blog)


Frank Yang

FrankYang43338@acm.org



 


