Optimise indexing costs

The indexing phase is the most expensive part of a GRAIL project's lifecycle. It makes at least one LLM call per corpus chunk, and twice that with the default gleaning pass (max_gleanings: 1). For 200 PDFs of 30 pages each, that's thousands of calls.

This guide gives you the eight most effective levers to lower cost.

1. Use a cheap model for extraction

Entity and relationship extraction doesn't need a frontier model. Models such as gpt-4o-mini, claude-3-5-haiku, or google/gemma-4-26B-A4B-it (on DeepInfra) do the job well at a fraction of the cost of gpt-4o or claude-3-5-sonnet.

# grail.yaml
llm:
  endpoint: deepinfra
  model: google/gemma-4-26B-A4B-it
  extra_pricing:
    "deepinfra|google/gemma-4-26B-A4B-it": [0.07, 0.34]

Reserve expensive models for agent / global modes at query time, where reasoning matters more.

search:
  agent_search_endpoint: anthropic
  agent_search_model: claude-3-5-sonnet-20241022
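
To get a feel for what a cheap extraction model means for the bill, here is a rough back-of-the-envelope sketch. It assumes the extra_pricing pair shown earlier is [input price, output price] in USD per million tokens, which this guide does not state explicitly, and the per-call token counts are hypothetical placeholders rather than measured values.

```python
def extraction_cost(n_calls, input_tokens_per_call, output_tokens_per_call,
                    price_in_per_m, price_out_per_m):
    """Estimate the cost in USD of n_calls extraction calls.

    Prices are assumed to be USD per million tokens (an assumption about
    how the extra_pricing pair is expressed).
    """
    input_cost = n_calls * input_tokens_per_call * price_in_per_m / 1_000_000
    output_cost = n_calls * output_tokens_per_call * price_out_per_m / 1_000_000
    return input_cost + output_cost


# Hypothetical workload: 5,000 calls, ~2,000 input and ~400 output tokens each,
# priced with the [0.07, 0.34] pair from the extraction config.
cost = extraction_cost(5000, 2000, 400, 0.07, 0.34)
print(f"Estimated extraction cost: ${cost:.2f}")
```

The figure covers extraction calls only. Swap in the prices of a more expensive model to see how quickly the same workload grows.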

2. Enable LLM cache

If you'll re-index (because you tweaked a prompt, a config, or a model), identical calls shouldn't cost twice:

llm:
  cache_enabled: true
  cache_dir: ./cache/llm  # optional, default is <root>/cache/llm

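The cache is deterministic: the same prompt with the same parameters always hits the same entry, so a re-index only pays for the calls your change actually altered. This is especially useful in iterative development.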
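
As an illustration of what a deterministic cache key can look like, the sketch below hashes the model, prompt, and parameters together. This is not GRAIL's actual key scheme, which this guide does not describe. The function name and the choice of fields are assumptions made for the example.

```python
import hashlib
import json


def cache_key(prompt, model, params):
    """Build a stable key: identical inputs always map to the same digest."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,        # key order must not change the digest
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


same_a = cache_key("Extract entities from: ...", "some-model",
                   {"temperature": 0, "max_tokens": 1000})
same_b = cache_key("Extract entities from: ...", "some-model",
                   {"max_tokens": 1000, "temperature": 0})
changed = cache_key("Extract entities and dates from: ...", "some-model",
                    {"temperature": 0, "max_tokens": 1000})

print(same_a == same_b)   # identical call, parameters in a different order
print(same_a == changed)  # tweaked prompt
```

Under a scheme like this, any change to the prompt, the model, or a parameter produces a new key, so the changed calls are paid for again while the untouched ones are served from disk.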

3. Tune chunk size

Larger chunks = fewer calls, but also:

More tokens per call (can push total cost up).
Less granularity in extraction.

indexing:
  chunk_size: 1500           # default 2000
  chunk_overlap: 100         # default 100

Rule of thumb: if your model handles 8K input comfortably, push to chunk_size: 1500–2000. If you use a small model with a 4K window, drop to chunk_size: 800.
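
The effect of chunk_size on the number of calls can be estimated with a little arithmetic. The sketch below assumes chunk_size and chunk_overlap are measured in tokens and that chunks are cut with a fixed stride of chunk_size minus chunk_overlap. Both are assumptions about the chunker, not documented behaviour.

```python
import math


def chunk_count(total_tokens, chunk_size, chunk_overlap):
    """Number of chunks needed to cover total_tokens with a sliding window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # new tokens covered by each extra chunk
    return math.ceil((total_tokens - chunk_overlap) / stride)


# Hypothetical 10,000-token document at three chunk sizes.
for size in (2000, 1500, 800):
    print(size, chunk_count(10_000, size, 100))
```

Each chunk is one extraction call, and each call also carries the fixed extraction prompt, so smaller chunks pay for that prompt more often.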

4. Reduce `max_gleanings`

By default, GRAIL runs a "second pass" over each chunk to find entities missed in the first. This doubles the number of extraction calls, and on many corpora the improvement is marginal.

indexing:
  max_gleanings: 0   # default 1. Set 0 to disable the second pass.

Keep max_gleanings: 1 for technical corpora with many dense entities (legal, medical). Set max_gleanings: 0 for more narrative corpora, or while you're iterating fast.
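
The call count follows directly from the setting. A minimal sketch, assuming every gleaning pass costs exactly one extra call per chunk (treat the result as an upper bound if passes can stop early):

```python
def extraction_calls(n_chunks, max_gleanings):
    """Upper bound on extraction calls: one first pass plus the gleaning passes."""
    return n_chunks * (1 + max_gleanings)


# Hypothetical corpus of 5,000 chunks.
print(extraction_calls(5000, max_gleanings=1))
print(extraction_calls(5000, max_gleanings=0))
```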

5. Index a sample first

Before unleashing 1000 PDFs:

# Copy a small representative sample (here: PDFs whose names start with a digit;
# adjust the pattern so it selects about 10 files)
mkdir -p my-kb-sample/input
cp my-kb/input/[0-9]*.pdf my-kb-sample/input/

# Index the sample
grail index ./my-kb-sample

# Look at the cost
grail status ./my-kb-sample

Multiply the sample cost by the ratio of full corpus to sample (×100 here, for 10 of 1,000 PDFs) and you have your estimate. If the figure is reasonable, index the full corpus. If not, tune the levers above and re-sample.
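
The extrapolation is a simple proportion. The sketch below is one way to write it down. The function name is hypothetical and the sample cost is a placeholder, not a real measurement.

```python
def extrapolate_cost(sample_cost, sample_units, corpus_units):
    """Scale a measured sample cost to the full corpus.

    Units can be files, pages, or bytes, as long as both counts use the same one.
    """
    if sample_units <= 0:
        raise ValueError("sample_units must be positive")
    return sample_cost * corpus_units / sample_units


# Placeholder: the 10-file sample cost $0.42 and the corpus has 1,000 files.
estimate = extrapolate_cost(0.42, sample_units=10, corpus_units=1000)
print(f"Estimated full-corpus cost: ${estimate:.2f}")
```

Scaling by file count works when files are similar in length. If sizes vary a lot, scaling by pages or bytes is likely to give a closer estimate.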

6. Keep concurrency moderate

More concurrency = more speed but also more chance of hitting rate limits and costly retries.

llm:
  concurrent_requests: 8     # default 8
  max_retries: 3

If you see lots of retries in the logs, drop to 4. If your provider allows much more, push to 16 or 24.
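
To make the trade-off concrete, here is a generic sketch of bounded concurrency with retries using asyncio. It is not GRAIL's implementation. The helper names, the backoff schedule, and the simulated failures are all assumptions for illustration, with RuntimeError standing in for a provider's rate-limit error.

```python
import asyncio


async def call_with_retries(task, semaphore, max_retries=3, base_delay=0.01):
    """Run one request under the shared concurrency limit, retrying on failure."""
    async with semaphore:  # at most `concurrent_requests` tasks are inside at once
        for attempt in range(max_retries + 1):
            try:
                return await task()
            except RuntimeError:  # stand-in for a rate-limit error
                if attempt == max_retries:
                    raise
                # Exponential backoff: 1x, 2x, 4x ... the base delay.
                await asyncio.sleep(base_delay * 2 ** attempt)


async def run_all(tasks, concurrent_requests=8, max_retries=3):
    semaphore = asyncio.Semaphore(concurrent_requests)
    return await asyncio.gather(
        *(call_with_retries(t, semaphore, max_retries) for t in tasks)
    )


def make_flaky(value, failures):
    """Hypothetical request that fails `failures` times, then returns `value`."""
    remaining = {"count": failures}

    async def task():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RuntimeError("simulated rate limit")
        return value

    return task


tasks = [make_flaky(i, failures=i % 2) for i in range(5)]
results = asyncio.run(run_all(tasks, concurrent_requests=2))
print(results)  # [0, 1, 2, 3, 4]
```

Every retry is a paid call that also holds a concurrency slot while it backs off, which is why a lower limit can end up both cheaper and no slower when the provider is throttling.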

7. Reranker only when it helps

The reranker improves quality but costs one extra call per query. For queries where precision isn't critical, leave it off:

reranker:
  enabled: false

Override per-query with --rerank / --no-rerank.

8. Memory mode is essentially $0 writes

If your use case is closer to agentic memory than to a knowledge base, writes are free: the agent declares entities itself, so no LLM extraction runs. Only consolidate performs reflection, and in GRAIL that is pure structural analysis with no LLM involved.

You only pay when you query (cascade, local, global, agent).

Typical savings table

For a 200-PDF corpus (~5K chunks):

Optimisation	Approximate cost	Savings vs default
Default (gpt-4o + reranker + gleanings=1)	$50–80	—
`gpt-4o-mini` replaces `gpt-4o`	$8–15	80%
+ `max_gleanings: 0`	$5–9	90%
+ cache enabled (re-runs)	depends	up to 100%

When NOT to skimp

Embedding model: if you drop quality a lot, the whole retrieval system degrades. Keep a decent model (Qwen3-Embedding-8B, text-embedding-3-small, or similar).
Community reports: these are the basis of global mode. Very small models produce bad reports.

Next steps

Honest cost tracking — how to read cost reports.
Search modes — which mode fits each question type.
KB quickstart — to run the sample.
