Together AI Review

About Together AI

Together AI is a cloud-based platform that specializes in open-source artificial intelligence models and infrastructure. The platform provides developers and organizations with tools to train, fine-tune, and deploy various AI models without requiring extensive machine learning infrastructure expertise.

The platform offers access to a wide range of open-source models including large language models, image generation models, and other AI capabilities. Users can fine-tune these models on their own data or use pre-trained models through API endpoints. Together AI handles the underlying infrastructure, including GPU clusters and scaling requirements.

The service targets developers, AI researchers, and companies looking to integrate AI capabilities into their applications without building infrastructure from scratch. It competes with other AI platform providers by focusing specifically on open-source models rather than proprietary solutions.

Together AI offers both API access for inference and training capabilities for custom model development. The platform aims to make open-source AI models more accessible by providing managed infrastructure and simplified deployment options.

Features

AI

Fine-Tuning
Fine-tunes open-source models for production workloads using the latest research techniques to improve accuracy, reduce hallucinations, and control behavior without managing training infrastructure.
Together Kernel Collection
A collection of GPU kernels that enables up to 90% faster pre-training and optimized performance across compute workloads.

Core

Accelerated Compute
Scales from self-serve instant clusters to thousands of GPUs, optimized for better performance using the Together Kernel Collection.
Batch Inference
Processes massive workloads asynchronously at scale up to 30 billion tokens per model with any serverless model or private deployment.
Dedicated Container Inference
Provides GPU infrastructure purpose-built for generative media workloads, supporting video, audio, and image model deployment with performance acceleration.
Dedicated Model Inference
Deploys models on dedicated infrastructure purpose-built for teams who need speed, control, and optimized economics.
Managed Storage
Offers high-performance managed object storage and parallel filesystems optimized for AI-native workloads with zero egress fees.
Sandbox
Provides fast, secure code sandboxes at scale for setting up full-scale development environments for AI apps and agents.
Serverless Inference
Runs open-source models on demand with no infrastructure to manage and no long-term commitments, powered by cutting-edge inference research.

Customization

Workload-Specific Optimization
Applies workload-specific optimizations to reduce infrastructure costs by up to 60% compared to standard deployments.

Preview

Pricing Plans

Popular

Serverless Inference

Free

Pay-per-token API access to hosted models. Most teams start here.

Chat, vision, image, audio, video, transcription, embeddings, rerank, moderation
Text models from $0.02–$1.25 per 1M input tokens
Image generation from $0.0006–$0.134 per image
Video generation from $0.14–$3.20 per video
Audio TTS from $0.0015–$65.00 per 1M characters
Batch API pricing available for select models

Dedicated Inference

$4/hourly

Single-tenant GPU instances for teams needing guaranteed performance and custom models.

Guaranteed performance with no resource sharing
Support for custom models
Autoscaling and traffic spike handling
1x H100 80GB at $3.99/hr, 1x H200 141GB at $5.49/hr, 1x B200 180GB at $9.95/hr

GPU Clusters – On-Demand

$3/hourly

Pay-as-you-go GPU cluster capacity billed hourly.

NVIDIA HGX H100 at $3.49/hr
NVIDIA HGX H200 at $4.19/hr
NVIDIA HGX B200 at $7.49/hr
No long-term commitment required

GPU Clusters – Reserved

$3/weekly

Reserved GPU cluster capacity for 6+ days with discounted rates.

NVIDIA HGX H100 from $2.99/hr (1 week) to $2.55/hr (4–6 months)
NVIDIA HGX H200 from $3.49/hr (1 week) to $2.89/hr (4–6 months)
NVIDIA HGX B200 from $7.15/hr (1 week) to $6.39/hr (4–6 months)
NVIDIA GB200 NVL72 and GB300 NVL72: contact for pricing
Minimum reservation of 6 days

Code Sandbox

$0/per session

VM sandboxes and secure code interpreter for LLM-generated code execution.

Code Interpreter: $0.03 per 60-minute session
Per vCPU compute: $0.0446/hr
Per GiB RAM: $0.0149/hr
Shared filesystem storage: $0.16/GiB/month

Fine-Tuning – Standard

$0/per 1M tokens

Train open-source models up to 100B parameters using LoRA or full fine-tuning.

Supervised Fine-Tuning LoRA: $0.48–$2.90 per 1M tokens by model size
Supervised Fine-Tuning Full: $0.54–$3.20 per 1M tokens by model size
Direct Preference Optimization LoRA: $1.20–$7.25 per 1M tokens
Direct Preference Optimization Full: $1.35–$8.00 per 1M tokens
Supports models up to 100B parameters

Fine-Tuning – Specialized

Contact sales

Fine-tuning for large specialized models like DeepSeek, Llama 4, Qwen3, Kimi K2, and others.

DeepSeek-R1/V3 SFT LoRA: $10/1M tokens, min $20 charge
Llama 4 Maverick SFT LoRA: $8/1M tokens, min $16 charge
Kimi K2 SFT LoRA: $15/1M tokens, min $60 charge
GLM-5 SFT LoRA: $40/1M tokens, min $60 charge
Minimum charges vary by model

AI Panel Reviews

The Decision Maker

Strategic bet, vendor viability, timing, adoption approval

7.9/10

Together AI is the open-source inference cloud the board can sign off on without long explanations.

“Vipul Ved Prakash sold Topsy to Apple in 2013 and is back with $534M raised — NVIDIA, Salesforce Ventures, and Kleiner Perkins on the cap table. The vendor question's settled; the harder call is whether you bet your inference stack on a $3.3B startup or the hyperscaler your CFO already pays.”

Open-source AI infrastructure is the layer where the cloud margins shift over the next decade. Together is the cleanest pure-play, and Vipul Ved Prakash is the right founder for it — Topsy, acquired by Apple in 2013.

The runway math holds. $305M Series B at $3.3B in February 2025, with NVIDIA and Salesforce Ventures on the cap table. The product depth follows — Serverless Inference, Dedicated Inference at $3.99/hr per H100, Fine-Tuning down to $0.48 per million tokens. That's a real stack, not a thin API.

The catch: AWS Bedrock and Azure AI Foundry sit inside the cloud commit your CFO already signed. Together's defense is speed — Llama 4 and DeepSeek-R1 ship there before any hyperscaler catalogs them, and that lead matters this year. Pilot it where open-source freshness is the requirement. Don't standardize the org until renewal.

Competitive Positioning7.7

Differentiated against AWS Bedrock and Azure AI Foundry on open-model freshness, but narrower scope than a full hyperscaler.

Reputation Risk8.0

NVIDIA, Salesforce Ventures, and Kleiner Perkins on the cap table makes the vendor easy to defend in a board review.

Speed to Value8.2

Serverless Inference at $0.02 per million input tokens for some models means a pilot can be wired up in days, not weeks.

Strategic Fit7.8

Pure-play open-source inference advances open-model strategy; less of a fit if the company has already standardized on a single hyperscaler stack.

Vendor Viability7.5

$534M raised across four years and Series B at $3.3B in February 2025 funds at least 24 months of runway, but it is still a startup competing with three hyperscalers.

Pros

NVIDIA, Salesforce Ventures, and Kleiner Perkins on the cap table de-risks the vendor question for the board.
Full inference stack from Serverless Inference to Dedicated H100 to Fine-Tuning under one contract.
Newest open-source models like Llama 4 and DeepSeek-R1 ship faster than on hyperscalers.
$305M Series B at $3.3B in February 2025 funds at least 24 months of execution.

Cons

AWS Bedrock and Azure AI Foundry are bundled inside cloud commits most enterprises already signed.
Open-source moat narrows if Meta or DeepSeek slow their open-weights release cadence.

Right for

Teams running multi-model open-source inference who need an alternative to AWS Bedrock.

Avoid if

Buyers committed to a single hyperscaler with cloud spend already signed.

The Domain Strategist

Craft and strategy in the product's domain — adapts identity per category, same lens

8.3/10

Together AI bet on the kernel layer rather than the API, and that is the right architectural call.

“Together Kernel Collection is the substrate worth studying — vendor-owned GPU kernels that compound across inference, fine-tuning, and training. The full-stack pricing ladder from $0.02-per-million-token serverless to $3.49/hr H100 clusters lets teams scale without a vendor change.”

Together's positioning is the inference-and-training cloud for open-source models, and the architecture follows. Together Kernel Collection — GPU kernels claiming up to 90% faster pre-training — is the layer that compounds across inference, fine-tuning, and dedicated clusters. Fireworks AI counters with FireAttention. Anyscale leans on Ray.

Pricing reflects the full-stack ambition. H100 on-demand at $3.49/hr, Llama-class inference from $0.02 per million tokens, Batch Inference scaling to 30 billion tokens per model. Teams can graduate from serverless to dedicated to reserved capacity without changing vendors — the shape an AI platform group actually wants.

The catch is open-source dependence. The catalog rides the Llama, Qwen, and DeepSeek release cadence; if Meta's open-weights commitment narrows, the moat thins to the kernel work alone. The $305M Series B at a $3.3B valuation in 2025 buys runway, but durability lives in the substrate, not the model selection.

Category Positioning8.5

Clear top-tier in the open-source AI cloud segment alongside Fireworks AI and Replicate, with the deepest research-team lineage.

Domain Fit8.5

Serverless, dedicated, and reserved-cluster tiers map cleanly to how senior AI platform teams actually graduate workloads.

Integration Surface8.0

OpenAI-compatible API, standard endpoints, Hugging Face catalog integration, and Batch API support across serverless and private deployments.

Long-term Implications7.8

Open-source model dependence is a real 3-year constraint; the catalog's relevance rides Meta, Mistral, and DeepSeek release cadence.

Strategic Depth8.5

Together Kernel Collection is real substrate work — vendor-owned GPU kernels claiming up to 90% pre-training speedup, not an API skin.

Pros

Together Kernel Collection delivers vendor-owned GPU optimization that compounds across inference, fine-tuning, and training.
Full-stack pricing ladder lets teams graduate from $0.02-per-million-token serverless to reserved H100 clusters without changing vendors.
Batch Inference scales to 30 billion tokens per model — handles workloads that break most serverless APIs.
Founder lineage from Stanford and ETH Zurich shows in the architectural choices, not just the marketing.

Cons

Open-source model dependence ties the catalog's relevance to Meta, Mistral, and DeepSeek release cadence.
No proprietary frontier model means buyers needing GPT-5 or Claude-tier reasoning still pair Together with another vendor.
Specialized fine-tuning for models like Kimi K2 climbs to $15 per 1M tokens — premium territory for teams expecting commodity rates.

Right for

AI platform teams who run open-source models in production.

Avoid if

Buyers who need a single proprietary frontier model with vendor-managed safety guarantees.

The Finance Lead

Money, total cost of ownership, contracts, procurement math

8.3/10

H100 reserved drops from $3.49 to $2.55/hr if you commit four months — Together's discount curve is honest.

“Together publishes every tier on its pricing page, from $0.02 per million tokens for inference to $2.55/hr for an H100 on a 4-6 month reservation. The catch is Specialized Fine-Tuning — minimums up to $60 per job mean small experiments aren't free.”

Pricing is fully published. Every tier, every GPU, every per-token rate. Inference starts at $0.02 per 1M input tokens. H100 on-demand at $3.49/hr — same shape as Lambda Labs, cheaper than AWS Bedrock provisioned throughput. Procurement won't push back.

The reserved-GPU curve is where the math gets honest. H100 drops to $2.99/hr at one week, $2.55/hr at 4-6 months. Six-day minimum. One H100 reserved four months runs about $7,300 — versus $10,000 on-demand. 27% saving compounds across a fleet, but the 6-day floor punishes spiky workloads.

Two line items matter. Specialized Fine-Tuning carries minimum charges — $20 for DeepSeek-R1 LoRA, up to $60 for GLM-5. Small experiments aren't free. However, Managed Storage charges zero egress, which offsets a year of S3 transfer for inference-heavy teams. Read the contract, not the marketing.

Billing & Procurement8.3

Usage-based with web checkout for serverless removes most procurement friction.

Contract Flexibility8.0

On-demand and reserved are both available; only the 6-day reserved minimum limits flexibility.

Pricing Transparency9.0

Every tier and GPU rate is published; serverless and on-demand require no sales call.

ROI Clarity7.9

Per-token and per-hour rates make inference ROI directly measurable; the 60% optimization claim is unverified.

Total Cost of Ownership7.8

Modeling is feasible across compute, storage, and fine-tuning, though minimum charges complicate small jobs.

Pros

Every pricing tier is published on the website with no sales call required for serverless or on-demand.
Reserved GPU pricing drops 27% from on-demand when you commit to a 4-6 month term.
Managed Storage charges zero egress fees, which is unusual in cloud infrastructure.
Per-token inference starts at $0.02 per 1M input tokens, competitive with hyperscaler list rates.

Cons

Specialized Fine-Tuning carries minimum charges from $20 to $60 per job, so small experiments are not cheap.
Reserved GPU clusters require a six-day minimum reservation, which punishes intermittent workloads.
The published 60% cost-reduction claim from workload-specific optimization is not independently verifiable.

Right for

Teams who run mixed-model inference and want every rate public before signing.

Avoid if

Teams who need spiky GPU access without committing to a six-day reservation.

The Domain Practitioner

Daily hands-on reality in the product's domain — adapts identity per category, same lens

8.1/10

Point your OpenAI client at api.together.ai/v1 and DeepSeek-R1 answers — multi-model inference without the SDK juggling.

“Together AI runs as an OpenAI-compatible endpoint with hundreds of open-weight models behind one base URL, plus serverless Batch Inference and dedicated H100 clusters when you outgrow shared inference. The catch is the long tail — provider-specific features like reasoning traces or strict tool calling don't always pass through cleanly.”

The integration test is a one-liner. Set OPENAI_API_BASE to https://api.together.ai/v1, prefix the model with deepseek-ai/ or meta-llama/, keep the OpenAI client. Compare wiring up Replicate's prediction-polling API — Together is the lowest-effort swap for a Python codebase already on OpenAI.

Batch Inference is the sleeper feature. Asynchronous, 30 billion tokens per model, priced below interactive — for embeddings backfills or eval sweeps it's the right shape. Dedicated Inference at $3.99/hr for an H100 80GB undercuts the ops cost of self-hosting vLLM once you count engineer time. Together Kernel Collection gets cited as the moat.

The friction is the long tail. Reasoning-mode toggles for DeepSeek-R1, prompt caching, structured-output strict mode — features the OpenAI Chat Completions surface doesn't always express, and the docs lag a release behind. However, for the 80% case of multi-model inference across open weights, this is the path of least resistance.

Day-3 Reality8.0

Once integrated it disappears from the stack — daily friction shows up only at provider-specific feature edges.

Documentation Practitioner-Fit7.5

Docs cover the platform broadly but lag new model launches and the Together Kernel Collection details by a release.

Friction Surface7.5

Reasoning-mode and strict tool-call semantics don't always express cleanly through the Chat Completions surface.

Power-User Depth8.5

Batch, Dedicated Inference, fine-tuning to 100B parameters, and GPU clusters all sit on the same control plane.

Workflow Integration8.5

OpenAI-compatible base URL means existing clients, retry logic, and observability tooling work unchanged.

Pros

OpenAI-compatible base URL means existing clients work with a one-line change.
Batch Inference scales to 30 billion tokens per model for asynchronous workloads.
Dedicated H100 80GB at $3.99/hr undercuts self-hosting vLLM once ops time is counted.
Fine-tuning supports open-weight models up to 100B parameters with LoRA or full SFT.

Cons

Provider-specific features like reasoning traces or strict tool calling don't always pass through the OpenAI surface cleanly.
Documentation lags model and feature releases by a release or two.
Specialized fine-tuning on DeepSeek-R1 LoRA starts at $10 per 1M tokens with a $20 minimum charge.

Right for

Backend engineers who run open-weight models behind an OpenAI-compatible client.

Avoid if

Teams who need only proprietary frontier models like GPT-5 or Claude Opus.

The Power User

Daily human experience, onboarding, polish, learning curve, reliability

8.0/10

Together's pricing page lists every number on one screen, and that small thing tells you a lot.

“The playground works without a signup, the pricing page lists every number, and the OpenAI-shaped endpoint means your client code just works. The catch is the docs lag the model catalog by a release.”

The playground at api.together.xyz/playground is the small thing the team got right. No credit card to sign in, 200+ open-source models in a dropdown, paste a prompt and watch tokens stream. Hugging Face Inference Endpoints makes you wire up a deployment first.

The pricing page is where the team earns trust. H100 on-demand at $3.49 an hour, Llama-class inference from $0.02 per million input tokens, Code Interpreter at $0.03 per 60-minute session — every number on one page. Modal makes you log in to see GPU hourly.

But the docs lag a release behind the model catalog. A new model lands Tuesday, the structured-output flag for it shows up in the docs the following week. Worth it for a $305M Series B cloud where NVIDIA is on the cap table. Painful if you're chasing a model that dropped this morning.

Daily Polish8.0

Pricing page consolidates every number on one screen and the playground works without a signup.

Learning Curve7.5

First ten minutes are fast, but the docs trail the catalog by a release which slows the day-thirty fight.

Mobile Parity7.5

Mobile is essentially read-only, but for a dev-infrastructure API this is category norm.

Onboarding Experience8.2

No credit card to reach the playground, OpenAI-compatible base_url means existing code runs in minutes.

Reliability Feel7.8

Full-stack ambition with autoscaling and dedicated GPUs at $3.99 an hour, but uptime depends on open-source model release cadence.

Pros

Playground at api.together.xyz/playground works without a signup or credit card.
OpenAI-compatible base_url lets existing client code call 200+ open-source models without rewrites.
Pricing page lists every number on one screen — H100 at $3.49/hr, inference from $0.02/M tokens, Code Interpreter at $0.03 per session.
$305M Series B at a $3.3B valuation in 2025 with NVIDIA on the cap table signals real durability.

Cons

Docs trail the model catalog by a release — new models land in the dropdown before the reference page.
Mobile is essentially read-only; the playground renders on a phone but nobody would write code there.
Catalog depends on open-source release cadence from Meta, DeepSeek, and Qwen — a narrowing of open weights would thin the offering.

Right for

Developers who want to swap between open-source models without changing their stack.

Avoid if

Teams who need polished docs the same day a new model launches.

The Skeptic

Contrarian. Watch-outs, deal-breakers, broken promises, category patterns

7.6/10

OctoAI got swallowed by NVIDIA in September 2024 — same category, same cap table, fewer survivors.

“Together AI is the largest pure-play left in open-source inference, and the $305M Series B in February 2025 buys real time. The yellow flag is the category's body count — OctoAI absorbed by NVIDIA, MosaicML by Databricks, and Together has NVIDIA on its own cap table.”

Two acquisitions in eighteen months. OctoAI absorbed by NVIDIA around $165M in September 2024, commercial service off by October 31. MosaicML by Databricks at $1.3B in 2023. Same category, fewer survivors. Together is the largest pure-play left.

The evidence holds up. $305M Series B at $3.3B in February 2025, led by General Catalyst and Prosperity7. Serverless Inference from $0.02 per million input tokens. H100 reserved at $2.55/hr. Together Kernel Collection is the moat the docs actually try to defend. Real product.

But NVIDIA is on the cap table. So is Salesforce Ventures. Both have absorbed peers in this category — NVIDIA bought OctoAI, Lepton AI followed. The graveyard pattern doesn't predict Together's outcome. It does mean the question shifts: durable cloud, or attractive tuck-in once the kernel work matures.

Competitive Differentiation7.5

Together Kernel Collection is real engineering work but Fireworks AI and Anyscale chase the same kernel-layer moat.

Exit Portability8.2

OpenAI-compatible API at api.together.ai/v1 means migration off looks mechanical, not catastrophic.

Long-term Viability7.5

$305M Series B at $3.3B in February 2025 buys runway; NVIDIA on the cap table is double-edged.

Marketing Honesty8.0

Pricing page lists every GPU tier, per-token rate, and minimum charge — claims are quantified, not aspirational.

Track Record Match6.8

Open-source inference cloud category has visible failures — OctoAI absorbed by NVIDIA, MosaicML by Databricks.

Pros

$305M Series B at $3.3B in February 2025 led by General Catalyst — runway is real.
Pricing fully published — H100 reserved drops to $2.55/hr, inference from $0.02 per million tokens.
OpenAI-compatible API at api.together.ai/v1 means migration off looks mechanical, not catastrophic.
Together Kernel Collection is a real engineering moat, not a thin API wrapper.

Cons

Open-source inference category has a visible graveyard — OctoAI absorbed by NVIDIA, MosaicML by Databricks.
NVIDIA sits on Together's cap table and has already acquired two peers in the same category.
Catalog rides Llama and DeepSeek release cadence — open-weights momentum is the substrate.

Right for

Teams who need the largest open-source inference pure-play with real Series B runway.

Avoid if

Buyers who already have an AWS commit covering Bedrock open-weights inference.

Buyer Questions

Common questions answered by our AI research team

Pricing

What is the price difference between LoRA and Full Fine-Tuning for models up to 16B parameters, and is there a minimum charge?

For models up to 16B, Supervised Fine-Tuning costs $0.48/1M tokens for LoRA vs $0.54/1M tokens for Full Fine-Tuning, and Direct Preference Optimization costs $1.20/1M tokens for LoRA vs $1.35/1M tokens for Full Fine-Tuning. The standard pricing table for up to 16B models does not list a minimum charge; minimum charges appear only in the Specialized pricing section for specific models.

Features

Can I deploy video, audio, and image models on Dedicated Container Inference, and what GPU hardware options are available per hour?

Yes, Dedicated Container Inference is described as 'GPU infrastructure purpose-built for generative media workloads' that supports deploying 'video, audio, and image models with performance acceleration powered by Together Research.' However, the pricing page only lists hourly hardware options under Dedicated Inference (not Dedicated Container Inference specifically): 1x H100 80GB at $3.99/hr, 1x H200 141GB at $5.49/hr, and 1x B200 180GB at $9.95/hr.

Security

Does the Code Sandbox use isolated, single-tenant environments, and how is pricing structured for vCPU and RAM usage?

The content describes Code Sandboxes as 'fast, secure code sandboxes' but does not specify single-tenant isolation. Pricing is structured as $0.0446 per vCPU/hour and $0.0149 per GiB RAM/hour for compute costs, plus a Code Interpreter option priced at $0.03 per 60-minute session.

Setup

How do I get started with Serverless Inference — is there any infrastructure to manage or long-term commitment required?

Serverless Inference is described as 'the fastest way to run open-source models on demand' with 'no infrastructure to manage, no long-term commitments.' You can get started immediately through the platform without any setup or commitment requirements.

Integration

Can I use the Batch Inference API with privately deployed models, and what is the token scale limit per model?

Yes, Batch Inference explicitly supports 'any serverless model or private deployment' and can 'scale to 30 billion tokens per model.'

Product Information

Company
Together AI
Founded
2022
Pricing
From $0/mo
Free Trial
Available
Free Plan
Available

Platforms

web

Visit Website See Pricing

Panel Scores

Decision Maker7.9

Domain Strategist8.3

Finance Lead8.3

Domain Practitioner8.1

Power User8.0

Skeptic7.6

About Together AI

Build what's next on the AI Native Cloud. Full-stack AI platform for inference, fine-tuning, and GPU clusters — powered by cutting-edge research.

Resources

Documentation

API

Blog

About Together AI

Features

AI

Core

Customization

Preview

Pricing Plans

Serverless Inference

Dedicated Inference

GPU Clusters – On-Demand

GPU Clusters – Reserved

Code Sandbox

Fine-Tuning – Standard

Fine-Tuning – Specialized

AI Panel Reviews

The Decision Maker

Pros

Cons

Right for

Avoid if

The Domain Strategist

Pros

Cons

Right for

Avoid if

The Finance Lead

Pros

Cons

Right for

Avoid if

The Domain Practitioner

Pros

Cons

Right for

Avoid if

The Power User

Pros

Cons

Right for

Avoid if

The Skeptic

Pros

Cons

Right for

Avoid if

Buyer Questions

What is the price difference between LoRA and Full Fine-Tuning for models up to 16B parameters, and is there a minimum charge?

Can I deploy video, audio, and image models on Dedicated Container Inference, and what GPU hardware options are available per hour?

Does the Code Sandbox use isolated, single-tenant environments, and how is pricing structured for vCPU and RAM usage?

How do I get started with Serverless Inference — is there any infrastructure to manage or long-term commitment required?

Can I use the Batch Inference API with privately deployed models, and what is the token scale limit per model?

Product Information

Platforms

Panel Scores

About Together AI

Resources

Categories

Also in LLM Platforms