Thinking tokens count against num_predict. At 4096 the model was
running out mid-response after spending ~3000 tokens on thinking.
Raising it to 16384 leaves enough headroom for both the thinking and
the full response.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
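A minimal sketch of what this change amounts to, assuming the options dict is passed to the Ollama API; the env var name OLLAMA_NUM_PREDICT is taken from the later commit, and the helper name is hypothetical:

```python
import os

def ollama_options() -> dict:
    # Thinking tokens count against num_predict, so the budget must
    # cover thinking (~3000 tokens observed) plus the actual response.
    # Default raised from 4096 to 16384 for that headroom.
    return {
        "num_predict": int(os.environ.get("OLLAMA_NUM_PREDICT", "16384")),
    }
```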
Empty documents caused the model to spin in its thinking loop and waste
all tokens. Now raises a clear job error before the Ollama call.
Also adds an OLLAMA_THINK env var (default false) to control whether
the model uses extended thinking; disabling it avoids runaway thinking
loops on ambiguous inputs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
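The guard and the new flag can be sketched as follows; the function and exception names are hypothetical, and the request shape is assumed from the commit message:

```python
import os

class JobError(Exception):
    """Raised before the Ollama call when the input cannot be processed."""

def prepare_request(document: str) -> dict:
    # Empty documents made the model spin in its thinking loop and burn
    # the whole token budget, so fail fast with a clear job error.
    if not document.strip():
        raise JobError("document is empty; refusing to call Ollama")
    # OLLAMA_THINK (default false) toggles extended thinking.
    think = os.environ.get("OLLAMA_THINK", "false").lower() in ("1", "true", "yes")
    return {"prompt": document, "think": think}
```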
Adds repeat_penalty=1.15 and repeat_last_n=128 to suppress token
repetition loops (e.g. "tragen" -> "tragen" -> ...). Also caps output
via num_predict (default 4096, configurable via OLLAMA_NUM_PREDICT env
var) as a hard stop in case the model still gets stuck.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
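These sampling options can be sketched as below; repeat_penalty, repeat_last_n, and num_predict are standard Ollama option names, while the helper name and env-var fallback shape are assumptions:

```python
import os

def sampling_options() -> dict:
    return {
        "repeat_penalty": 1.15,  # penalize recently repeated tokens
        "repeat_last_n": 128,    # window the penalty looks back over
        # Hard stop on output length in case the model still gets stuck.
        "num_predict": int(os.environ.get("OLLAMA_NUM_PREDICT", "4096")),
    }
```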