The /api/v1/ai router powers everything related to the in-app assistant. It exposes exactly two routes today: a public health probe and a unified chat endpoint that can serve both traditional JSON responses and server-sent event (SSE) streams.
| Flow | Endpoint | Description |
| --- | --- | --- |
| Health | GET /api/v1/ai/health | Returns provider, model, and capability metadata for monitoring. |
| Chat | POST /api/v1/ai/chat | Sends a user message with optional attachments and returns the assistant reply (streaming or buffered). |

Authentication

  • Registered users send Authorization: Bearer <Auth0 access token> (same token issued by the Auth service).
  • Guest devices omit the bearer token and instead send:
    • x-device-id (required): stable per installation.
    • x-platform: android, ios, or web.
    • x-user-id, x-user-email, x-user-phone (all optional hints that help map the device to an existing profile faster).
Both paths go through the same middleware so downstream services always see userId and an isGuest flag.
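As a minimal illustration of the two auth paths above, a client might assemble request headers like this. The header names come from this doc; the helper function itself is hypothetical, not part of any official SDK:

```python
def build_auth_headers(access_token=None, device_id=None, platform=None,
                       user_email=None):
    """Assemble headers for /api/v1/ai requests (illustrative helper).

    Registered users pass an Auth0 access token; guest devices pass
    x-device-id plus optional hints.
    """
    if access_token:
        return {"Authorization": f"Bearer {access_token}"}
    if not device_id:
        raise ValueError("guest requests require x-device-id")
    headers = {"x-device-id": device_id}
    if platform:
        headers["x-platform"] = platform  # android, ios, or web
    if user_email:
        headers["x-user-email"] = user_email  # optional profile-mapping hint
    return headers
```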

Rate limits & quotas

  • Hard rate limit: 20 chat requests per minute per user/device. Exceeding this returns 429 with the standard error envelope plus code RATE_LIMIT_EXCEEDED.
  • Soft quota: usageLimitMiddleware checks each message against FREE_MESSAGE_THRESHOLD. Crossing the free allowance yields 429 + FREE_LIMIT_EXCEEDED with details.requires_signup set for guests so clients can prompt for signup.
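Because both limits surface as HTTP 429, clients must branch on error.code rather than the status alone. A sketch of that branching, assuming the envelope fields described in this doc (the helper name and return labels are illustrative):

```python
import json

def classify_429(body_text):
    """Distinguish hard rate limiting from free-tier depletion (illustrative)."""
    envelope = json.loads(body_text)
    code = envelope["error"]["code"]
    if code == "RATE_LIMIT_EXCEEDED":
        return "retry_later"          # hard limit: back off and retry
    if code == "FREE_LIMIT_EXCEEDED":
        details = envelope["error"].get("details", {})
        if details.get("requires_signup"):
            return "prompt_signup"    # guest exhausted the free allowance
        return "upgrade_required"
    return "unknown"
```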

Request payload

{
  "message": "Explain debt mutual funds vs equity",
  "conversation_id": "665f5e6fcb6e4c73dc6dca01",
  "model": "gpt-4o-mini",
  "attachments": [
    {
      "fileId": "665f5e6fcb6e4c73dc6dcaff"
    },
    {
      "data": "<base64-bytes>",
      "mimeType": "application/pdf",
      "filename": "portfolio.pdf"
    }
  ]
}
  • message is required (1–2000 chars).
  • conversation_id (optional) must be a valid MongoDB ObjectId; omit to start a new thread.
  • model (optional) overrides the default configured model. The backend still enforces provider allowlists.
  • attachments accept up to 20MB each. Either pass a previously uploaded fileId or inline data + mimeType + filename.
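The constraints above can be checked client-side before sending, saving a round trip. A minimal pre-flight sketch assuming the documented limits (the function name is hypothetical):

```python
import re

OBJECT_ID_RE = re.compile(r"^[0-9a-fA-F]{24}$")  # MongoDB ObjectId shape

def validate_chat_payload(payload):
    """Mirror the documented request constraints (illustrative pre-flight)."""
    message = payload.get("message", "")
    if not 1 <= len(message) <= 2000:
        raise ValueError("message must be 1-2000 characters")
    conv = payload.get("conversation_id")
    if conv is not None and not OBJECT_ID_RE.match(conv):
        raise ValueError("conversation_id must be a valid MongoDB ObjectId")
    for att in payload.get("attachments", []):
        # Each attachment is either a reference or a full inline upload.
        if "fileId" not in att and not {"data", "mimeType", "filename"} <= att.keys():
            raise ValueError("attachment needs fileId or data+mimeType+filename")
    return payload
```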

Attachments

Inline attachments are decoded, size-checked (20MB), and converted into the AI SDK’s multimodal format before the LLM call. Referenced files go through the file-upload service and are re-hydrated via GCS if necessary. Non-image files (such as PDFs) are sent as type: 'file' parts so models like Claude Sonnet and GPT-4o can read them.
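Building the inline shape from raw bytes can be sketched as follows; the early 20MB check mirrors the server-side limit so an oversized upload fails fast (the helper name is illustrative):

```python
import base64

MAX_ATTACHMENT_BYTES = 20 * 1024 * 1024  # 20MB per-attachment limit

def inline_attachment(raw_bytes, mime_type, filename):
    """Base64-encode file bytes into the inline attachment shape (illustrative)."""
    if len(raw_bytes) > MAX_ATTACHMENT_BYTES:
        raise ValueError("attachment exceeds the 20MB limit")
    return {
        "data": base64.b64encode(raw_bytes).decode("ascii"),
        "mimeType": mime_type,
        "filename": filename,
    }
```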

Streaming vs non-streaming

  • Set X-Stream-Response: true or Stream: true to receive SSE chunks in text/event-stream format. Chunks arrive as JSON objects keyed by type (token, tool_call, tool_result, done, error).
  • Omit the header (or set it to false) to get the regular JSON envelope with the full assistant message, token count, and tool call metadata.
  • You can reuse the same endpoint for both behaviors, which simplifies client routing.
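A consumer of the streaming mode needs to decode the SSE frames into those typed chunks. A minimal parser sketch, assuming the standard `data:`-prefixed SSE line framing (the chunk `type` values come from this doc):

```python
import json

def parse_sse_events(stream_lines):
    """Yield parsed chunk objects from a text/event-stream body (illustrative).

    Chunks are JSON objects keyed by `type` (token, tool_call,
    tool_result, done, error).
    """
    for line in stream_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and comment lines
        chunk = json.loads(line[len("data:"):].strip())
        yield chunk
        if chunk.get("type") in ("done", "error"):
            break  # terminal chunk ends the stream
```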

Example curl (buffered)

curl -X POST https://api.handauncle.com/api/v1/ai/chat \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{ "message": "How do ELSS funds work?" }'

Example curl (streaming guest)

curl -N -X POST https://api.handauncle.com/api/v1/ai/chat \
  -H "x-device-id: ios-device-3f92" \
  -H "x-platform: ios" \
  -H "X-Stream-Response: true" \
  -H "Content-Type: application/json" \
  -d '{ "message": "List retirement planning steps" }'

Error handling

All responses use the shared { success, data|error, meta } envelope. Expect:
  • 401 when neither bearer auth nor device headers are present.
  • 429 for both rate limiting and free-tier depletion (check error.code).
  • 500 when upstream providers (LLM, RAG, storage) fail. The controller still attempts to send a friendly fallback message when the LLM output is empty.
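Putting the envelope and status codes together, a client-side unwrap might look like this sketch (status-to-meaning mapping follows the list above; the helper name and chosen exception types are illustrative):

```python
def unwrap_envelope(status, envelope):
    """Map the shared { success, data|error, meta } envelope to a result (illustrative)."""
    if envelope.get("success"):
        return envelope["data"]
    code = envelope.get("error", {}).get("code")
    if status == 401:
        raise PermissionError("missing bearer token or device headers")
    if status == 429:
        raise RuntimeError(f"throttled: {code}")  # inspect error.code to branch
    raise RuntimeError(f"upstream failure ({status}): {code}")
```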

Retrieval Augmented Generation (RAG)

When enabled via config, the chat service automatically classifies each question, executes a Groq-powered query generator, and injects high-scoring snippets from the knowledge base into the system prompt. Conversational small talk skips RAG to keep latency low. No extra headers are required to toggle this behavior—clients simply send the regular chat request and the backend decides whether to enrich it.
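The gating decision above lives entirely server-side; the following sketch only illustrates its shape. `classify` and `retrieve` are hypothetical stand-ins for the backend's Groq-powered classifier/query generator and knowledge-base search, and the score threshold is an assumed placeholder:

```python
def maybe_enrich_prompt(question, system_prompt, classify, retrieve):
    """Illustrative gate for the RAG flow: small talk bypasses retrieval."""
    if classify(question) == "small_talk":
        return system_prompt  # skip RAG to keep latency low
    # Keep only high-scoring snippets (0.7 is an assumed threshold).
    snippets = [s for s in retrieve(question) if s["score"] >= 0.7]
    if not snippets:
        return system_prompt
    context = "\n\n".join(s["text"] for s in snippets)
    return f"{system_prompt}\n\nRelevant knowledge:\n{context}"
```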