The Problem

Currently, the cost barrier to training state-of-the-art language models is extremely high. GPT-4 is suspected to have cost more than $100 million to train, whilst Anthropic’s CEO predicts that training SOTA models will cost $1 billion this year and $10 billion the year after. This means that only a small oligarchy of well-funded tech giants has the ability to train these models.

As these models grow more intelligent and have more impact on our daily lives, so does the power of their owners. They end up deciding how these models should be censored and whose values they should incorporate. In effect, this means we get to be governed by AI trained on a constitution we never voted for.

Blockchain, the decentralised movement and, more specifically, Bittensor have proved that they can provide alternatives to this centralised approach by incentivising the masses to pool their resources to carry out useful work. As Const often mentions, the collective amount of compute that goes into mining Bitcoin far exceeds the compute of any Google, Microsoft, OpenAI or Anthropic data centre.

Granted, machine learning requires a different type of compute. But if a decentralised mechanism can incentivise that specific type of compute in a similar way, whilst accurately validating it, then in theory it can gain access to a similar, if not larger, amount of compute to train an extremely large single model.

The Solution

Our proposed solution is a subnetwork that incentivises compute, bandwidth and latency. The compute powers the training of each miner’s local version of a model, while the bandwidth and latency power the averaging of each miner’s local gradients using an operation called butterfly all-reduce. Once this process completes successfully, each miner holds a unified, globally averaged gradient that it can use to update its model weights.
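To make this concrete, here is a minimal PyTorch sketch of one round from a single miner’s perspective, assuming local training and gradient averaging alternate. The model, optimizer and the average_gradients callable are illustrative placeholders, not the subnet’s actual code; a simulation of the averaging itself follows under the All Reduce Synapse section.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins -- the real subnet has its own model, data pipeline
# and networked all-reduce; none of the names below come from its codebase.
model = nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def local_train_step(inputs, targets):
    """Train synapse (compute): gradients are computed on local data only."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    return loss.item()

def all_reduce_step(average_gradients):
    """All-reduce synapse (bandwidth/latency): swap the local gradients for the
    group-averaged ones, then update the weights so every miner stays in sync."""
    local_grads = [p.grad.detach().clone() for p in model.parameters()]
    for p, g in zip(model.parameters(), average_gradients(local_grads)):
        p.grad = g
    optimizer.step()
```

In this sketch a miner would call local_train_step for a number of batches, then call all_reduce_step once per averaging round, passing in whatever routine performs the butterfly all-reduce.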

Train Synapse:

[Figure: Train synapse]

All Reduce Synapse:

This particular synapse has required the bulk of our attention for the past few months. It is, in our opinion, the most unstable part of distributed training. It requires each miner to split its gradients into n batches, where n is the number of miners in the all-reduce group, share those batches with the other miners, and average the gradients it receives back.
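As a rough illustration of that reduce-scatter / all-gather pattern, here is a small in-process simulation. Real miners exchange these chunks over the network; the flat-tensor representation and even chunk sizes are assumptions made for the example.

```python
import torch

def butterfly_all_reduce(local_grads: list[torch.Tensor]) -> list[torch.Tensor]:
    """Simulated butterfly all-reduce: local_grads[i] is miner i's flattened
    gradient. Miner j is responsible for averaging chunk j (reduce-scatter),
    then every miner collects all averaged chunks (all-gather)."""
    n = len(local_grads)
    # Step 1: every miner splits its gradient into n chunks
    # (assumes the gradient length divides evenly for simplicity).
    chunks = [torch.chunk(g, n) for g in local_grads]
    # Step 2 (reduce-scatter): miner j receives chunk j from every peer
    # and averages it.
    averaged_chunks = [
        torch.mean(torch.stack([chunks[i][j] for i in range(n)]), dim=0)
        for j in range(n)
    ]
    # Step 3 (all-gather): every miner receives all averaged chunks and
    # reassembles the full averaged gradient.
    full = torch.cat(averaged_chunks)
    return [full.clone() for _ in range(n)]

# Example: 4 miners, each holding a random 8-element gradient.
grads = [torch.randn(8) for _ in range(4)]
results = butterfly_all_reduce(grads)
assert all(torch.allclose(r, torch.stack(grads).mean(dim=0)) for r in results)
```

The point of the butterfly layout is that each miner only has to reduce one chunk, so the communication and averaging work is spread evenly across the group instead of funnelling through a single node.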

If some miners fail to send their averaged gradients, then local gradients are used instead.

If some miners send gradients for the wrong model, then they are banned and their gradients are disregarded.

If miners are too slow to send their gradients, then they are also banned and local gradients are used instead.
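The following is a hedged sketch of how these three rules might be applied when a round resolves. The PeerUpdate structure, the timeout constant and the model-hash check are assumptions for the example, not the subnet’s actual implementation.

```python
from dataclasses import dataclass

ALLREDUCE_TIMEOUT_S = 120.0  # assumed timeout; the subnet sets its own value

@dataclass
class PeerUpdate:
    uid: int                  # peer miner's UID on the subnet
    model_hash: str | None    # identifies which model the gradients belong to
    gradients: object | None  # averaged gradient chunk (None if never received)
    elapsed_s: float          # how long the peer took to respond

def resolve_round(updates: list[PeerUpdate], expected_model_hash: str,
                  local_gradients, banned: set[int]):
    """Decide which peer gradients enter the average, applying the rules above.
    Falls back to purely local gradients when no peer contribution survives."""
    pool = [local_gradients]          # the miner's own gradients always count
    for u in updates:
        if u.gradients is None:
            continue                  # rule 1: never arrived -> use local instead
        if u.model_hash != expected_model_hash:
            banned.add(u.uid)         # rule 2: wrong model -> ban and disregard
            continue
        if u.elapsed_s > ALLREDUCE_TIMEOUT_S:
            banned.add(u.uid)         # rule 3: too slow -> ban, use local instead
            continue
        pool.append(u.gradients)
    return pool                       # caller averages whatever survived
```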