Langfuse Platform Architecture - High-Level Overview
Langfuse's infrastructure continuously evolves to support increasing scale and new product features. We started on Vercel and Supabase with Next.js and Postgres, and have evolved to the distributed architecture described below. As our product and scale requirements grow, we'll continue to mature our infrastructure to meet those needs.
Infrastructure Components
Application Layer
- Web container (Next.js): Serves the UI application and all APIs.
- Worker container (Express): Processes ingestion events in the background and executes async tasks (e.g. exports, LLM-as-a-Judge evaluations).
Storage Layer
- PostgreSQL: Stores transactional data (users, organizations, projects, API keys, prompts, datasets, LLM-as-a-Judge settings).
- ClickHouse: Stores tracing data (traces, observations, scores). We use it to run dashboards, metrics, and render tables in the UI.
- Redis: Stores event queue (BullMQ) and caching layer (API keys, prompts).
- S3: Stores raw ingestion events and multi-modal attachments (images, audio).
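The split above can be summarized as a routing rule. Here is a minimal sketch (the function and record-type names are illustrative, not part of the Langfuse codebase) of which store each kind of record lands in:

```python
def storage_for(record_type: str) -> str:
    """Route a record type to the store described above (illustrative only)."""
    transactional = {"user", "organization", "project", "api_key", "prompt", "dataset"}
    tracing = {"trace", "observation", "score"}
    if record_type in transactional:
        return "postgresql"   # relational, low-volume, strongly consistent
    if record_type in tracing:
        return "clickhouse"   # high-volume, analytical
    if record_type in {"raw_event", "attachment"}:
        return "s3"           # raw payloads and multi-modal blobs
    raise ValueError(f"unknown record type: {record_type}")

print(storage_for("trace"))   # clickhouse
print(storage_for("prompt"))  # postgresql
```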
Why do we need an OLAP database (ClickHouse) for observability data?
- We built Langfuse initially on Postgres and eventually migrated to ClickHouse. We always knew that Postgres wouldn't be the best fit for our observability data.
- OLAP databases use a columnar storage layout, so the database scans only the data required to answer analytical queries (e.g. LLM cost over time).
- We needed a multi-node database to scale our insert throughput.
- As we are an open source product, we required a database that is available under an open source license.
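The columnar argument can be made concrete with a toy example. In a row store, aggregating cost means reading every field of every row, including wide payloads; in a column store, the aggregate touches only the cost column (the field names below are illustrative):

```python
# Toy rows with a wide "input" payload alongside a small "cost" field.
rows = [
    {"trace_id": "t1", "model": "gpt-4", "cost": 0.12, "input": "..."},
    {"trace_id": "t2", "model": "gpt-4", "cost": 0.05, "input": "..."},
    {"trace_id": "t3", "model": "claude", "cost": 0.08, "input": "..."},
]

# Row-oriented: the query iterates whole rows, so all columns are read.
row_total = sum(r["cost"] for r in rows)

# Column-oriented: each column is stored contiguously; the aggregate scans
# just the "cost" array and never touches the wide "input" payloads.
columns = {key: [r[key] for r in rows] for key in rows[0]}
col_total = sum(columns["cost"])

assert row_total == col_total  # same answer, far less data scanned columnar-side
```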
Production environments
Our production infrastructure is deployed across multiple AWS regions with a fully automated CI/CD pipeline. All infrastructure is managed using Terraform. Cloudflare WAF (Web Application Firewall) serves as a central proxy in front of AWS.
---
config:
flowchart:
subGraphTitleMargin:
bottom: 30
---
graph TB
MR["Turborepo Monorepository (web/worker/shared code)"]
Python["Python SDK"]
JS["JS/TS SDK"]
NPM["NPM Registry"]
PYPI["PyPI Registry"]
JS --> NPM
Python --> PYPI
GHA["GitHub Actions"]
MR --> GHA
TF["Terraform <br/>(private repository)"]
TF --> ECR
TF --> VPN
TF --> Observability
subgraph CD["CI/CD Pipeline"]
GHA -->|Build & Push| ECR[AWS ECR]
GHA -->|Build & Push| DockerHub["Docker Hub<br/>(OSS releases)"]
end
ECR -->|Deploy| VPN
subgraph CF["Cloudflare"]
WAF["WAF<br/>(Web Application Firewall)"]
Proxy["Central Proxy"]
end
CF -->|Traffic| VPN
subgraph VPN["<span style='display:block; width:100%; text-align:left; white-space: nowrap;'><i>Langfuse Cloud (US, EU, HIPAA)</i></span>"]
ECS["AWS ECS Fargate (Web + Worker)"]
ECS --> ElastiCache["ElastiCache Redis (Clustered)"]
ECS --> S3[S3 Buckets]
ECS --> Aurora[Aurora PostgreSQL]
ECS --> CH[ClickHouse Cloud]
end
subgraph OSS["Managed OSS Deployment Templates"]
direction LR
DockerCompose["Docker Compose"]
AWSCustomer["AWS"]
AzureCustomer["Azure"]
GCPCustomer["GCP"]
HelmCharts["Helm Charts"]
end
DockerHub -->|Deploy| OSS
subgraph "Observability"
DataDog
Sentry
Pagerduty
DataDog --> Pagerduty
Sentry --> Pagerduty
end
VPN --> Observability
Data Ingestion from SDKs
- SDKs: SDKs instrument the applications of our users. We built our own Python/JS SDKs which use OpenTelemetry under the hood.
- API: SDKs send data to our API, which uploads the data to S3 and queues it for processing by the worker.
- Redis queue: Decouples ingestion from processing. We only pass S3 references through Redis.
- Worker processing: Asynchronously processes ingestion events, enriches events, flushes to ClickHouse.
- Dual database: ClickHouse for analytical queries, Postgres for transactional data.
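The steps above can be sketched end to end with in-memory stand-ins for S3, the Redis queue, and ClickHouse (all names and the enrichment step are illustrative, not the actual implementation):

```python
from collections import deque
import json
import uuid

s3, queue, clickhouse = {}, deque(), []  # fake stores for the sketch

def api_ingest(event: dict) -> str:
    """API: persist the raw event to S3 and enqueue only its reference."""
    key = f"events/{uuid.uuid4()}.json"
    s3[key] = json.dumps(event)  # raw payload goes to object storage
    queue.append(key)            # only the small S3 reference passes through Redis
    return key

def worker_step() -> None:
    """Worker: pop a reference, load the raw event, enrich it, flush to ClickHouse."""
    key = queue.popleft()
    event = json.loads(s3[key])
    event["enriched"] = True     # e.g. cost attribution, token counts
    clickhouse.append(event)

api_ingest({"type": "trace", "name": "chat-completion"})
worker_step()
print(clickhouse[0]["enriched"])  # True
```

Keeping payloads out of Redis keeps the queue small and fast, while S3 retains the raw events for replay if processing fails.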