pytorch/serve

Serve, optimize and scale PyTorch models in production

pytorch/serve.json
{
  "createdAt": "2019-10-03T03:17:43Z",
  "defaultBranch": "master",
  "description": "Serve, optimize and scale PyTorch models in production",
  "fullName": "pytorch/serve",
  "homepage": "https://pytorch.org/serve/",
  "language": "Java",
  "name": "serve",
  "pushedAt": "2025-08-06T19:17:08Z",
  "stargazersCount": 4353,
  "topics": [
    "cpu",
    "deep-learning",
    "docker",
    "gpu",
    "kubernetes",
    "machine-learning",
    "metrics",
    "mlops",
    "optimization",
    "pytorch",
    "serving"
  ],
  "updatedAt": "2025-11-26T01:09:25Z",
  "url": "https://github.com/pytorch/serve"
}

⚠️ Notice: Limited Maintenance

This project is no longer actively maintained. While existing releases remain available, there are no planned updates, bug fixes, new features, or security patches. Users should be aware that vulnerabilities may not be addressed.

TorchServe now enables token authorization and disables model API control by default. These security features address the concern of unauthorized API calls and help prevent potentially malicious code from being introduced to the model server. Refer to the following documentation for more information: Token Authorization, Model API control
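With token authorization on by default, clients need to attach a key to each call. The following is a minimal sketch of how a client might build that header in Python. It assumes the server writes a JSON key file with per-API entries (here `management` and `inference`, each holding a `key` field) and that the key is sent as `Authorization: Bearer <key>`. The file layout, the helper name `bearer_headers`, and the sample keys are illustrative assumptions; check the Token Authorization documentation for the actual format.

```python
import json


def bearer_headers(key_file_text, kind="inference"):
    """Build an Authorization header from the text of a key file.

    Assumed layout (hypothetical, verify against the Token Authorization docs):
    {"management": {"key": "..."}, "inference": {"key": "..."}}
    """
    keys = json.loads(key_file_text)
    return {"Authorization": f"Bearer {keys[kind]['key']}"}


# Sample key file content with made-up keys, for illustration only.
sample_key_file = json.dumps(
    {"management": {"key": "mgmt-key"}, "inference": {"key": "infer-key"}}
)

headers = bearer_headers(sample_key_file)
print(headers)  # {'Authorization': 'Bearer infer-key'}
```

The quick-start commands further down pass `--disable_token_auth`, which avoids this step for local experiments.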

TorchServe is a flexible and easy-to-use tool for serving and scaling PyTorch models in production.

Requires Python >= 3.8

curl http://127.0.0.1:8080/predictions/bert -T input.txt
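For callers that prefer Python over `curl`, the same request can be sketched with the standard library. `curl -T` uploads the file body with an HTTP PUT, so the sketch below builds a PUT request against `/predictions/<model_name>`. The helper name `build_prediction_request`, the optional `version` path segment, and the sample payload are assumptions for illustration; the request is only constructed here, not sent.

```python
import urllib.request


def build_prediction_request(model_name, payload, host="http://127.0.0.1:8080", version=None):
    """Build (but do not send) a PUT request to a prediction endpoint."""
    path = f"/predictions/{model_name}"
    if version is not None:
        # Assumed convention: an optional model version as an extra path segment.
        path += f"/{version}"
    return urllib.request.Request(host + path, data=payload, method="PUT")


# In practice the payload would be the bytes of input.txt.
request = build_prediction_request("bert", b"sample input text")
print(request.get_method(), request.full_url)
# PUT http://127.0.0.1:8080/predictions/bert
```

Sending it with `urllib.request.urlopen(request)` requires a running TorchServe instance with a model registered as `bert`, plus an authorization header when token authorization is enabled.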

Quick start with TorchServe (pip)

# Install dependencies
python ./ts_scripts/install_dependencies.py
# Include dependencies for accelerator support with the relevant optional flags
python ./ts_scripts/install_dependencies.py --rocm=rocm61
python ./ts_scripts/install_dependencies.py --cuda=cu121
# Latest release
pip install torchserve torch-model-archiver torch-workflow-archiver
# Nightly build
pip install torchserve-nightly torch-model-archiver-nightly torch-workflow-archiver-nightly

Quick start with TorchServe (conda)

# Install dependencies
python ./ts_scripts/install_dependencies.py
# Include dependencies for accelerator support with the relevant optional flags
python ./ts_scripts/install_dependencies.py --rocm=rocm61
python ./ts_scripts/install_dependencies.py --cuda=cu121
# Latest release
conda install -c pytorch torchserve torch-model-archiver torch-workflow-archiver
# Nightly build
conda install -c pytorch-nightly torchserve torch-model-archiver torch-workflow-archiver

[Getting started guide](docs/getting_started.md)

Quick start with Docker

# Latest release
docker pull pytorch/torchserve
# Nightly build
docker pull pytorch/torchserve-nightly

Refer to [torchserve docker](docker/README.md) for details.

Quick start LLM deployment

# Make sure to install torchserve with pip or conda as described above and log in with `huggingface-cli login`
python -m ts.llm_launcher --model_id meta-llama/Llama-3.2-3B-Instruct --disable_token_auth
# Try it out
curl -X POST -d '{"model":"meta-llama/Llama-3.2-3B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model/1.0/v1/completions"
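The `curl` call above can also be expressed as a small Python sketch that builds the same JSON body and POST request with the standard library. The helper name `build_completion_request` is hypothetical, and the default URL simply mirrors the one in the command above; the request is constructed but not sent, since sending it needs the launcher to be running.

```python
import json
import urllib.request

COMPLETIONS_URL = "http://localhost:8080/predictions/model/1.0/v1/completions"


def build_completion_request(model_id, prompt, max_tokens, url=COMPLETIONS_URL):
    """Build (but do not send) a JSON POST request for a text completion."""
    payload = {"model": model_id, "prompt": prompt, "max_tokens": max_tokens}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request(
    "meta-llama/Llama-3.2-3B-Instruct", "Hello, my name is", 200
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` and decoding the response as JSON would complete the round trip once the server is up.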

Quick start LLM deployment (TRT-LLM engine)

# Make sure to install torchserve with python venv as described above and log in with `huggingface-cli login`
# pip install -U --use-deprecated=legacy-resolver -r requirements/trt_llm.txt
python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct --engine trt_llm --disable_token_auth
# Try it out
curl -X POST -d '{"prompt":"count from 1 to 9 in french ", "max_tokens": 100}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"

🚢 Quick Start LLM Deployment with Docker

#export token=<HUGGINGFACE_HUB_TOKEN>
docker build --pull . -f docker/Dockerfile.vllm -t ts/vllm
docker run --rm -ti --shm-size 10g --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:8080 -v data:/data ts/vllm --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token_auth
# Try it out
curl -X POST -d '{"model":"meta-llama/Meta-Llama-3-8B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model/1.0/v1/completions"

Refer to [LLM deployment](docs/llm_deployment.md) for details and other methods.

  • Write once, run anywhere: on-prem or in the cloud, with inference on CPUs, GPUs, AWS Inf1/Inf2/Trn1, Google Cloud TPUs, and [Nvidia MPS](docs/nvidia_mps.md)
  • [Model Management API](docs/management_api.md): multi-model management with optimized worker-to-model allocation
  • [Inference API](docs/inference_api.md): REST and gRPC support for batched inference
  • [TorchServe Workflows](examples/Workflows/README.md): deploy complex DAGs with multiple interdependent models
  • Default way to serve PyTorch models in
    • SageMaker
    • Vertex AI
    • [Kubernetes](kubernetes), with support for [autoscaling](kubernetes#session-affinity-with-multiple-torchserve-pods), session affinity, and monitoring using Grafana; works on-prem and on AWS EKS, Google GKE, and Azure AKS
    • KServe: supports both v1 and v2 API, with [autoscaling and canary deployments](kubernetes/kserve/README.md#autoscaling) for A/B testing
    • Kubeflow
    • MLflow
  • Export your model for optimized inference: TorchScript out of the box, [PyTorch Compiler](examples/pt2/README.md) preview, ORT and ONNX, IPEX, TensorRT, FasterTransformer, FlashAttention (Better Transformers)
  • [Performance Guide](docs/performance_guide.md): built-in support to optimize, benchmark, and profile PyTorch and TorchServe performance
  • [Expressive handlers](CONTRIBUTING.md): an expressive handler architecture that makes it trivial to support inference for your use case, with many handlers supported out of the box
  • [Metrics API](docs/metrics.md): out-of-the-box support for system-level metrics with Prometheus exports and custom metrics
  • [Large Model Inference Guide](docs/large_model_inference.md): support for GenAI and LLMs, including
    • SOTA GenAI performance using torch.compile
    • Fast kernels with FlashAttention v2, continuous batching, and streaming response
    • PyTorch [Tensor Parallel](examples/large_models/tp_llama) preview, [Pipeline Parallel](examples/large_models/Huggingface_pippy)
    • Microsoft [DeepSpeed](examples/large_models/deepspeed), [DeepSpeed-Mii](examples/large_models/deepspeed_mii)
    • Hugging Face [Accelerate](examples/large_models/Huggingface_accelerate), [Diffusers](examples/diffusers)
    • Running large models on AWS SageMaker and Inferentia2
    • Running [Meta Llama Chatbot locally on Mac](examples/LLM/llama)
  • Monitoring using Grafana and Datadog
  • [Model Server for PyTorch Documentation](docs/README.md): Full documentation
  • [TorchServe internals](docs/internals.md): How TorchServe was built
  • [Contributing guide](CONTRIBUTING.md): How to contribute to TorchServe
  • [Serving Meta Llama with TorchServe](examples/LLM/llama/README.md)
  • [Chatbot with Meta Llama on Mac 🦙💬](examples/LLM/llama/chat_app)
  • [🤗 HuggingFace Transformers](examples/Huggingface_Transformers) with a [Better Transformer Integration / Flash Attention & Xformer Memory Efficient](examples/Huggingface_Transformers#Speed-up-inference-with-Better-Transformer)
  • [Stable Diffusion](examples/diffusers)
  • [Model parallel inference](examples/Huggingface_Transformers#model-parallelism)
  • MultiModal models with MMF combining text, audio and video
  • [Dual Neural Machine Translation](examples/Workflows/nmt_transformers_pipeline) for a complex workflow DAG
  • [TorchServe Integrations](examples/README.md#torchserve-integrations)
  • [TorchServe Internals](examples/README.md#torchserve-internals)
  • [TorchServe UseCases](examples/README.md#usecases)
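
The handler architecture mentioned in the feature list can be illustrated with a small custom handler. This is a minimal sketch, not TorchServe's built-in handler code. It assumes the custom-handler convention of a module-level `handle(data, context)` entry point that receives a batch as a list of dicts (payload under `data` or `body`) and returns one result per request. The upper-casing "model" is a hypothetical stand-in for real inference, so the file runs without PyTorch installed.

```python
def preprocess(batch):
    """Extract text from each request in the batch."""
    texts = []
    for row in batch:
        # Assumed convention: payload arrives under "data" or "body".
        payload = row.get("data") or row.get("body") or b""
        if isinstance(payload, (bytes, bytearray)):
            payload = payload.decode("utf-8")
        texts.append(str(payload))
    return texts


def inference(texts):
    """Placeholder for a real model call: upper-case each input."""
    return [text.upper() for text in texts]


def postprocess(outputs):
    """Return exactly one response entry per request in the batch."""
    return [{"result": output} for output in outputs]


def handle(data, context):
    """Module-level entry point: preprocess, run inference, postprocess."""
    if data is None:
        return None
    return postprocess(inference(preprocess(data)))


# Local smoke test with a fake batch; no server is needed.
print(handle([{"body": b"hello"}, {"data": "world"}], context=None))
# [{'result': 'HELLO'}, {'result': 'WORLD'}]
```

Keeping the three stages as separate functions mirrors the preprocess, inference, postprocess split that handler code commonly follows, which makes it easy to swap the placeholder for a real model call.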

For [more examples](examples/README.md)

[SECURITY.md](SECURITY.md)

https://pytorch.org/serve

We welcome all contributions!

To learn more about how to contribute, see the contributor guide (CONTRIBUTING.md).

This repository is jointly operated and maintained by Amazon, Meta and a number of individual contributors listed in the CONTRIBUTORS file. For questions directed at Meta, please send an email to opensource@fb.com. For questions directed at Amazon, please send an email to torchserve@amazon.com. For all other questions, please open an issue in this repository.

TorchServe acknowledges the Multi Model Server (MMS) project, from which it was derived.