Just to make things clear: API access to most models is charged per input tokens + output tokens. It means that the longer your conversation is, the more you pay for every new answer.
Single prompt with no context and 100 tokens of answer is cheap.
Single prompt with 100k tokens of context and 100 tokens of answer is NOT cheap.
Extremely long conversations with most expensive top of the line models can absolutely demolish your budget.
does it give the full history to the LLM each time?
Last time I tried implementing something like this, it suggested to have a rolling window of history so that it takes into account your last X messages but not the entire conversation.
(I guess this is what ollama calls “context length”?)
You send the entire history for that conversation every time and likely more if its getting info from tools. If its not in the context the model dose not see it unless you have a memory system that dose something like feeding in summaries of past conversations that also takes up tokens and context. Rolling drops old messages to not reach context limits but you can lose important info or get odd results. If the history gets bigger than the context things break or slow way down.
presumably this is why Claude periodically writes its conclusions so far into a text file that it can read later instead of having to remember everything. Sounds like an interesting approach.
Just to make things clear: API access to most models is charged per input tokens + output tokens. It means that the longer your conversation is, the more you pay for every new answer. Single prompt with no context and 100 tokens of answer is cheap. Single prompt with 100k tokens of context and 100 tokens of answer is NOT cheap.
Extremely long conversations with most expensive top of the line models can absolutely demolish your budget.
does it give the full history to the LLM each time?
Last time I tried implementing something like this, it suggested to have a rolling window of history so that it takes into account your last X messages but not the entire conversation.
(I guess this is what ollama calls “context length”?)
Most agent harnesses do something called “compaction.” For example, here’s how Pi does compaction
You send the entire history for that conversation every time and likely more if its getting info from tools. If its not in the context the model dose not see it unless you have a memory system that dose something like feeding in summaries of past conversations that also takes up tokens and context. Rolling drops old messages to not reach context limits but you can lose important info or get odd results. If the history gets bigger than the context things break or slow way down.
presumably this is why Claude periodically writes its conclusions so far into a text file that it can read later instead of having to remember everything. Sounds like an interesting approach.