
🧠 Devlog: Teaching LLaMA to Speak Zarian

Craft The Future
Author
May 23, 2025
4 min read

✨ Before We Begin — Zine 001 Presale Is Live

We’re shipping Zine 001: SHIP: Don’t Shill — a print + digital artifact for builders who reflect as hard as they ship.

🧠 Take the 60-second survey and you’ll be entered to win:

  • 🎟️ A Builder-tier presale slot ($12+)

  • 📖 Your name in the digital credits

  • 📦 Early access to the drop

👉 Take the survey →
Help shape the stack — before it ships.


1. The Spark

I didn’t plan to train a language model.

I said I wanted to do AI research — and that was already enough of a stretch. If you’ve been in this space, you know: AI is an expensive hobby. A $20/month subscription to ChatGPT is one thing. Training open models locally? That’s a $2K GPU and a prayer.

Then my roommate Tony — who speaks Twi, one of Ghana’s major languages — told me he wanted to do a research project in his native tongue. That hit different. We live together. We constantly swap phrases. And most of the time, I don’t understand what he’s saying.

So this became a challenge:
What if I trained a model to help bridge that gap?

And what if I learned something in the process?


2. The Descent into the Rabbit Hole

I thought my RTX 3060 with 12GB VRAM was enough. I’d already run LLaMA 3.1 locally before.
But inference isn’t training — and my machine crashed hard.

I pivoted to Google Colab: 100 free compute units, GPU access, and flexible enough to run Unsloth. It saved me.

But before I started working with real Twi data, I had to build the pipeline. That’s where Zarian came in — a fictional language I made up to test grammar, structure, and translation logic.

Why? Because you don’t ask someone to clean the data if you don’t even know how to structure it.

I found myself deconstructing English: suffixes, tenses, vowels, grammar rules. All the stuff you think you know — until you try to teach it to a machine.

Training a model like this doesn’t just test your compute.
It tests your assumptions.


3. The Stack

I chose LLaMA 3.1 because it’s just powerful enough. It understands human concepts and conversational logic. It hallucinates, yes — but the base model has enough pattern recognition to build from.

Here’s the stack:

  • Model: LLaMA 3.1 via Hugging Face

  • 🧠 Framework: Unsloth fine-tuning with Google Colab

  • 🗂️ Dataset: JSON pairs for Zarian ↔ English translation

  • 🔁 Training Goal: Concept-to-concept pattern mapping

Example data:

```json
{
  "zarian": "Mira vos kaleth",
  "english": "Hello and welcome",
  "type": "greeting",
  "structure": "light-with-peace"
}
```

Each entry encoded linguistic structure: greeting types, emotional tones, pluralization, even blessing formats. Zarian wasn’t real — but the logic behind it was.

That logic is what I needed the model to learn.
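Before fine-tuning, each JSON entry has to be flattened into a prompt/response pair the trainer can consume. Here’s a minimal sketch of that step — the field names come from the example entry above, but the prompt template itself is an assumption, not the exact format I used:

```python
import json

def to_training_pair(entry: dict) -> dict:
    """Turn one Zarian record into an instruction-style pair.

    The prompt/response template here is illustrative -- any
    consistent format works, as long as every record follows it.
    """
    prompt = f'Translate to English: "{entry["zarian"]}"'
    response = entry["english"]
    return {"prompt": prompt, "response": response}

record = {
    "zarian": "Mira vos kaleth",
    "english": "Hello and welcome",
    "type": "greeting",
    "structure": "light-with-peace",
}

pair = to_training_pair(record)
print(json.dumps(pair, ensure_ascii=False))
```

The `type` and `structure` fields aren’t dropped in practice — they can be folded into the prompt so the model sees the linguistic metadata too.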


4. The Language That Doesn’t Exist (Yet)

Zarian is fictional — but not pointless.

It’s a testing ground for something that is real: Twi.
Before I touched that dataset, I needed to make sure the structure worked.

Zarian helped me define grammar patterns, token logic, and word classes. It helped me break down how language can be formalized — without stripping its nuance.

No, it’s not a production language. But it was the perfect place to fail fast, and learn faster.
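“Fail fast” also meant catching malformed entries before they ever reached a training run. A small validator along these lines does the job — the required keys match the example entry earlier, but the specific rules are my illustration, not a fixed schema:

```python
REQUIRED_KEYS = {"zarian", "english", "type", "structure"}

def validate_entry(entry: dict) -> list:
    """Return a list of problems with one dataset entry (empty = valid)."""
    problems = []
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    # Every present required field must be a non-empty string.
    for key in REQUIRED_KEYS & entry.keys():
        value = entry[key]
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key!r} must be a non-empty string")
    return problems

good = {"zarian": "Mira vos kaleth", "english": "Hello and welcome",
        "type": "greeting", "structure": "light-with-peace"}
bad = {"zarian": "", "english": "Hi"}

print(validate_entry(good))  # []
print(validate_entry(bad))
```

Running every record through a check like this is cheap insurance: a handful of empty or mistyped fields can quietly poison a fine-tune.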


5. The Build Ethic

When I showed Tony the early results, he said, “This is lifetime work.”

And I believe him. There’s no simple end to this.
Building for underrepresented languages isn’t a side project — it’s infrastructure work.

There are barely any datasets. Few validated grammars. Almost no accessible tooling.
But that doesn’t mean we can’t begin.

This project isn’t just about LLaMA or Colab. It’s about building tools where they don’t yet exist — starting from wherever you are.


6. The Broken Outputs

Some results were beautiful. Others were… less so.

Prompt:
Translate to Zarian: "How are you"

Response:
Senth na vira — repeated over and over, like a chant stuck in a recursive loop.
Too verbose. No break. If I didn’t cap the token count and drop the temperature, it would spiral.
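Besides capping tokens and lowering temperature at generation time, you can also guard after the fact: detect when the output starts chanting and cut it at the first repeat. A rough sketch — the phrase lengths and repeat threshold are assumptions for illustration, not what I actually shipped:

```python
def truncate_repetition(text: str, max_repeats: int = 2) -> str:
    """Cut output once any phrase repeats more than max_repeats times in a row.

    Scans phrase lengths from 1 to 6 words; keeps the allowed repeats
    and drops the rest. Crude, but enough to stop a chanting loop.
    """
    words = text.split()
    for size in range(1, 7):
        i = 0
        while i + size * (max_repeats + 1) <= len(words):
            chunk = words[i:i + size]
            repeats = 1
            j = i + size
            while words[j:j + size] == chunk:
                repeats += 1
                j += size
            if repeats > max_repeats:
                # Keep the allowed repeats, drop everything after.
                return " ".join(words[:i + size * max_repeats])
            i += 1
    return text

looped = "Senth na vira Senth na vira Senth na vira Senth na vira"
print(truncate_repetition(looped))  # Senth na vira Senth na vira
```

It’s a band-aid, not a fix — the real fix is better data — but it stops a runaway loop from eating your token budget.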

Another case:

Prompt:
What does “Zitira vos makaleth” mean in English?

Response:
Mira kel senga vos makathos.
Complete miss. Not just a wrong answer — a mutation.

It reminded me: models don’t know.
They guess, pattern, stretch.

If your data isn’t right, neither are they.


7. The Next Chapter

The next step is real Twi. With Tony. With care. With structure.

Because he's tired of ChatGPT butchering his language.
Tired of trying to use AI tools that don’t recognize his voice.
Tired of having to translate himself to machines built for someone else.

We’re not done.
We’re just setting up the scaffolding.

If you want to join me:
→ DM me
→ Research with me
→ Share your language, your framework, your failures

This is only the beginning.


🔗 Support the Work

| 🧪 Follow the Research | For builders training models in their kitchens | Subscribe now |
| 📖 Zine 001 Presale — SHIP: Don’t Shill | Get early access + digital credit listing | Take the survey → |
| 🛠️ Build With Us | Future training drops, language workflows, and devlogs | craftthefuture.xyz |

/build stays learning.
Craft The Future