Use case

RAG LLM API for retrieval, embeddings and generation

RAG systems need more than a single chat model. A production retrieval-augmented generation stack usually needs embedding models, generation models, fallback choices, cost controls and a way to test model changes without rewriting the application. PEKPIK LLM gives RAG teams one OpenAI-compatible gateway for those model decisions.

Primary query

RAG LLM API

Why teams search for this

Use embedding and generation models through one gateway workflow.

Compare retrieval quality separately from final answer quality.

Route RAG generation to different models by cost, latency and answer risk.

Keep fallbacks available when a model struggles with long context or citations.
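The fallback idea in the last point can be sketched in a few lines. This is a minimal illustration, not PEKPIK's routing logic: `call_model` is a hypothetical callable you would supply (for example a thin wrapper around a chat completion request), and the model names in the usage note are placeholders.

```python
from typing import Callable, Sequence

Message = dict[str, str]


def generate_with_fallback(
    call_model: Callable[[str, list[Message]], str],  # hypothetical: (model, messages) -> answer text
    models: Sequence[str],
    messages: list[Message],
) -> tuple[str, str]:
    """Try each model in order; return (model_used, answer) from the first success."""
    errors: list[str] = []
    for model in models:
        try:
            answer = call_model(model, messages)
        except Exception as exc:  # broad on purpose in a sketch; narrow it in real code
            errors.append(f"{model}: {exc}")
            continue
        if answer.strip():
            return model, answer
        # An empty answer is treated as a failure so the next model gets a chance.
        errors.append(f"{model}: empty answer")
    raise RuntimeError("all candidate models failed: " + "; ".join(errors))
```

Called as `generate_with_fallback(call_model, ["primary-model", "fallback-model"], messages)`, it returns the name of the model that actually answered, which is worth logging when comparing cost and answer quality later.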

Where PEKPIK fits

Good fit

Document search, help center assistants and internal knowledge copilots.
Teams testing embeddings, reranking and chat models together.
Products that need to control RAG cost as document volume grows.

Check first

Changing embedding models can require re-indexing vectors.
RAG quality depends on chunking, metadata and retrieval strategy, not only the LLM.
Citation behavior and hallucination risk should be evaluated with real documents.
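The re-indexing caveat can be turned into a guard that runs before queries hit the index. The sketch below assumes you store the embedding model name and vector dimensions alongside your index. `IndexMeta` and `needs_reindex` are hypothetical names for illustration, not part of any PEKPIK or vector database API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMeta:
    """Metadata recorded when the index was built (hypothetical structure)."""
    embedding_model: str
    dimensions: int


def needs_reindex(index: IndexMeta, query_model: str, query_vector: list[float]) -> bool:
    """Return True when query vectors are not comparable with the stored vectors.

    A different model name means a different vector space even if the
    dimensions happen to match, so either mismatch triggers a re-index.
    """
    if index.embedding_model != query_model:
        return True
    return index.dimensions != len(query_vector)
```

Checking the model name as well as the dimension count matters because two models can emit vectors of the same length that are still not comparable.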

OpenAI-compatible example

base_url swap

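from openai import OpenAI

# Point the standard OpenAI client at the PEKPIK gateway by swapping base_url.
client = OpenAI(
    base_url="https://aiapiv2.pekpik.com/v1",
    api_key="sk-...",
)

# Send a chat completion request exactly as with any OpenAI-compatible endpoint.
response = client.chat.completions.create(
    model="claude-opus-4-7",
    messages=[{"role": "user", "content": "Summarize this for a product team."}],
)

# Print the text of the first returned choice.
print(response.choices[0].message.content)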
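For a RAG request, the user message is usually combined with retrieved passages before it reaches the chat model. The following is one possible way to build such a prompt, assuming each retrieved chunk is a dict with `source` and `text` keys. That shape and the function name `build_grounded_messages` are illustrative assumptions, not a PEKPIK format.

```python
def build_grounded_messages(
    question: str,
    chunks: list[dict[str, str]],  # assumed shape: {"source": ..., "text": ...}
) -> list[dict[str, str]]:
    """Build chat messages that ask the model to answer only from numbered sources."""
    sources = "\n\n".join(
        f"[{i}] ({chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    system = (
        "Answer using only the numbered sources below. "
        "Cite sources as [n]. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The returned list can be passed as `messages` to `client.chat.completions.create(...)`. Numbering the sources makes it easier to check citation behavior against the documents that were actually retrieved.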

Suggested rollout

01 Choose an embedding model and confirm vector dimensions before indexing.

02 Build a query set from real user questions and expected source documents.

03 Evaluate retrieval quality before judging the generation model.

04 Compare final answers across candidate chat models and route by workload risk.
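Steps 02 and 03 can be made concrete with a small retrieval metric. The sketch below computes recall@k over a query set, assuming each query is paired with the set of document IDs expected to answer it. `retrieve` is a hypothetical callable standing in for your own search layer, and recall@k is only one of several reasonable metrics.

```python
from typing import Callable, Iterable


def recall_at_k(retrieved_ids: list[str], expected_ids: set[str], k: int) -> float:
    """Fraction of expected source documents that appear in the top-k results."""
    if not expected_ids:
        raise ValueError("each query needs at least one expected source document")
    hits = expected_ids & set(retrieved_ids[:k])
    return len(hits) / len(expected_ids)


def mean_recall_at_k(
    query_set: Iterable[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],  # hypothetical: query -> ranked document IDs
    k: int = 5,
) -> float:
    """Average recall@k across all (query, expected_ids) pairs."""
    scores = [recall_at_k(retrieve(query), expected, k) for query, expected in query_set]
    if not scores:
        raise ValueError("query set is empty")
    return sum(scores) / len(scores)
```

Because this score depends only on retrieval, it can be tracked while swapping embedding models or chunking strategies, before any generation model is involved.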

FAQ

Does PEKPIK replace a vector database for RAG?

No. PEKPIK provides model access through a gateway. You still need your own retrieval layer, vector database or search system.

Should RAG use the same model for every answer?

Not always. Many teams use lower-cost models for simple grounded answers and stronger models for complex synthesis or high-risk responses.
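A simple version of that split can be written as a routing function. This is a sketch under stated assumptions: the `RagRequest` fields, the source-count threshold and the model names `low-cost-model` and `strong-model` are all placeholders to be replaced with values from your own evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagRequest:
    """Inputs to the routing decision (hypothetical fields)."""
    question: str
    source_count: int      # number of retrieved chunks that will be sent to the model
    high_risk_topic: bool  # e.g. flagged by your own policy rules


def choose_model(
    request: RagRequest,
    cheap_model: str = "low-cost-model",  # placeholder name
    strong_model: str = "strong-model",   # placeholder name
    max_simple_sources: int = 2,          # assumed threshold; tune on real traffic
) -> str:
    """Send high-risk or multi-source synthesis to the stronger model."""
    if request.high_risk_topic or request.source_count > max_simple_sources:
        return strong_model
    return cheap_model
```

Keeping the rule in one small function makes it easy to change thresholds after comparing answers across candidate models.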