<aside> 📘 TL;DR;

</aside>

Introduction

When I recently released my central CPU infrastructure code, I realized that while it provided a technical foundation, it didn’t fully capture the essence of how to mine Bittensor effectively. At its core, Bittensor mining is a deceptively simple concept — provide the best possible digital commodities as defined by the validation metrics. By excelling in this task, miners can earn the highest incentives and rewards, ultimately benefiting the validators, front-end users, and clients.

To truly understand the challenges and complexities of Bittensor mining, it’s essential to take a closer look at the state of mining and validation during my most active period, roughly five months ago. At that time, validators were primarily seeking high-quality, large-scale inference from fine-tuned text-to-text, encoder-decoder, and other machine learning models. As we’ve progressed over the past year, it’s become increasingly clear that this commodity is extremely valuable. Nearly every company is now integrating intelligent, fine-tuned, or RAG-capable models trained on their specific corpus to reduce costs, headcount, resources, and overhead. However, validating the quality of this information is a formidable challenge, one that Bittensor, and specifically Subnet 1, is still actively working to solve.

Five months ago, Subnet 1 employed a weighted reward stack of evaluation models to assess the quality of the outputs provided by the network’s miners. The final score of each miner was determined by a composite formula: ((0.6 * RLHF score) + (0.4 * DPO score)) * (Diversity score) * (Binary relevance result) * (Binary NSFW result). To fully grasp the intricacies of this system, it’s crucial to delve into the inner workings of each model.
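Written out as code, the stack looks roughly like this — an illustrative sketch of the scoring logic as described above, not the validator’s actual implementation; the function and variable names are mine:

```python
def combined_reward(rlhf_score: float, dpo_score: float, diversity_score: float,
                    is_relevant: bool, is_sfw: bool) -> float:
    """Illustrative recreation of the Subnet 1 reward stack described above.

    The RLHF and DPO scores are blended 60/40, the result is scaled by the
    diversity score, and the whole thing is zeroed out if either binary
    filter (relevance, NSFW) rejects the completion.
    """
    weighted = 0.6 * rlhf_score + 0.4 * dpo_score
    return weighted * diversity_score * float(is_relevant) * float(is_sfw)


# A completion that passes both filters keeps its blended score;
# failing either filter drops it straight to zero.
print(combined_reward(0.82, 0.74, 0.95, is_relevant=True, is_sfw=True))   # ~0.75
print(combined_reward(0.82, 0.74, 0.95, is_relevant=True, is_sfw=False))  # 0.0
```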

As we explore the details of these reward models and the strategies I employed to optimize my mining operation, it will become clear that the path to success in Bittensor mining is far from straightforward. The challenges are numerous, from the constant need for fine-tuning and adaptation to the ever-present threat of external attacks. However, by sharing my experiences and insights, I hope to shed light on the complexities of this ecosystem and contribute to the ongoing conversation about how we can create a more robust, effective, and innovative validation system for the future of blockchain technology.

RLHF:

https://github.com/opentensor/validators/blob/main/openvalidators/reward/open_assistant.py

Overview:

The RLHF model leveraged the OpenAssistant/reward-model-deberta-v3-large-v2 model, a pre-trained DeBERTa-v3-large model fine-tuned specifically for assigning rewards in a Reinforcement Learning from Human Feedback (RLHF) setting.
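For reference, here is roughly how that reward model can be queried directly with the transformers library — a minimal sketch of the general pattern; the validator’s own wrapper lives at the link above, and the prompt/completion pair is just an example:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

prompt = "Explain the difference between a list and a tuple in Python."
completion = "A list is mutable and can be changed after creation, while a tuple is immutable."

# The model scores a (prompt, completion) pair; a higher logit means the
# completion is judged to be a better answer to the prompt.
inputs = tokenizer(prompt, completion, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits[0].item()

print(f"RLHF reward score: {score:.3f}")
```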

Intention:

The primary purpose of the OpenAssistantRewardModel was to evaluate the quality and relevance of the completions generated by Bittensor miners for a given prompt. By utilizing the pre-trained reward model, it could assign scores to the generated outputs based on their coherence, fluency, and alignment with the desired task. These reward scores served as a signal to guide the miners toward generating high-quality and relevant outputs, encouraging them to produce completions that the reward model deemed valuable.

Downsides:

This model was the most abstract and the most difficult to fine-tune for. The training datasets weren’t great, and in my opinion, the model wasn’t optimized for the specific task we were solving. The RLHF model had a hard time determining which of two answers was better when they came from models trained on different data. This caused some models to inherently perform better than others according to this specific reward model, even if their answers weren’t objectively better.

How to optimize:

Because the model tended to favor a specific tone and wording, you had to either fine-tune your model on the “accepted” column of its training datasets, or find a fine-tuned model that already performed well on this metric. The best option was a combination of both. Here’s what I did: I took a sample of one thousand synthetic prompts from the validator’s wandb logs and stored them in a CSV file.

https://github.com/surcyf123/dataset_enrichment/blob/main/dataset/only_prompts.json
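With those prompts in hand, the natural next step is to benchmark any candidate model against the same metric the validators used. Below is a rough sketch of that loop, assuming the linked file has been downloaded locally and is a flat JSON list of prompt strings, with `generate_completion` standing in for whatever candidate model is being evaluated:

```python
import json

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

RM_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"
rm_tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
rm_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME)
rm_model.eval()


def rlhf_score(prompt: str, completion: str) -> float:
    """Score a (prompt, completion) pair with the RLHF reward model."""
    inputs = rm_tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm_model(**inputs).logits[0].item()


def generate_completion(prompt: str) -> str:
    """Placeholder for the candidate model being benchmarked; plug in your own inference call."""
    raise NotImplementedError


# Assumes only_prompts.json sits next to this script and is a flat list of strings.
with open("only_prompts.json") as f:
    prompts = json.load(f)

scores = [rlhf_score(p, generate_completion(p)) for p in prompts[:100]]
print(f"Mean RLHF score over {len(scores)} prompts: {sum(scores) / len(scores):.3f}")
```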

After that, I wrote a scraper to collect all of TheBloke’s 6B GPTQ-quantized fine-tuned models.
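The same list can be pulled today with the huggingface_hub client. The sketch below is not the scraper I actually ran, and its 6B/GPTQ filter is a naive match on repository names rather than a proper metadata query:

```python
from huggingface_hub import HfApi

api = HfApi()

# List every public model repo under TheBloke's account that mentions GPTQ,
# then keep the ones whose names look like 6B variants.
gptq_repos = api.list_models(author="TheBloke", search="GPTQ")
candidates = [m.id for m in gptq_repos if "6b" in m.id.lower()]

print(f"Found {len(candidates)} 6B GPTQ repos")
for repo_id in sorted(candidates):
    print(repo_id)
```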