From 263884705a3f5dcab922bd90b4f51b88ff236dea Mon Sep 17 00:00:00 2001
From: Geeta Chauhan <4461127+chauhang@users.noreply.github.com>
Date: Sat, 21 Oct 2023 18:45:43 -0700
Subject: [PATCH] Readme updates (#2729)

* Readme updates for new features and blogs
* Updates for what's new and readmes
* Linting fixes
---
 README.md                               | 27 +++++++++++++++----
 docs/index.rst                          |  6 +++++
 .../tp_llama/{REAME.md => README.md}    |  0
 ts_scripts/spellcheck_conf/wordlist.txt | 10 +++++++
 4 files changed, 38 insertions(+), 5 deletions(-)
 rename examples/large_models/tp_llama/{REAME.md => README.md} (100%)

diff --git a/README.md b/README.md
index 76cd0100ee..c72b1a4320 100644
--- a/README.md
+++ b/README.md
@@ -55,19 +55,29 @@ docker pull pytorch/torchserve-nightly
Refer to [torchserve docker](docker/README.md) for details.

## ⚡ Why TorchServe
+* Write once, run anywhere: on-prem or on-cloud; supports inference on CPUs, GPUs, AWS Inf1/Inf2/Trn1, Google Cloud TPUs, and [Nvidia MPS](docs/nvidia_mps.md)
* [Model Management API](docs/management_api.md): multi model management with optimized worker to model allocation
* [Inference API](docs/inference_api.md): REST and gRPC support for batched inference
* [TorchServe Workflows](examples/Workflows/README.md): deploy complex DAGs with multiple interdependent models
* Default way to serve PyTorch models in
- * [Kubeflow](https://v0-5.kubeflow.org/docs/components/pytorchserving/)
- * [MLflow](https://github.com/mlflow/mlflow-torchserve)
 * [Sagemaker](https://aws.amazon.com/blogs/machine-learning/serving-pytorch-models-in-production-with-the-amazon-sagemaker-native-torchserve-integration/)
- * [Kserve](https://kserve.github.io/website/0.8/modelserving/v1beta1/torchserve/): Supports both v1 and v2 API
 * [Vertex AI](https://cloud.google.com/blog/topics/developers-practitioners/pytorch-google-cloud-how-deploy-pytorch-models-vertex-ai)
-* Export your model for optimized inference. Torchscript out of the box, [ORT and ONNX](https://github.com/pytorch/serve/blob/master/docs/performance_guide.md), [IPEX](https://github.com/pytorch/serve/tree/master/examples/intel_extension_for_pytorch), [TensorRT](https://github.com/pytorch/serve/blob/master/docs/performance_guide.md), [FasterTransformer](https://github.com/pytorch/serve/tree/master/examples/FasterTransformer_HuggingFace_Bert)
+ * [Kubernetes](kubernetes) with support for [autoscaling](kubernetes#session-affinity-with-multiple-torchserve-pods), session affinity, and monitoring using Grafana; works on-prem and on AWS EKS, Google GKE, and Azure AKS
+ * [Kserve](https://kserve.github.io/website/0.8/modelserving/v1beta1/torchserve/): supports both v1 and v2 APIs, with [autoscaling and canary deployments](kubernetes/kserve/README.md#autoscaling) for A/B testing
+ * [Kubeflow](https://v0-5.kubeflow.org/docs/components/pytorchserving/)
+ * [MLflow](https://github.com/mlflow/mlflow-torchserve)
+* Export your model for optimized inference. TorchScript out of the box, [PyTorch Compiler](examples/pt2/README.md) preview, [ORT and ONNX](https://github.com/pytorch/serve/blob/master/docs/performance_guide.md), [IPEX](https://github.com/pytorch/serve/tree/master/examples/intel_extension_for_pytorch), [TensorRT](https://github.com/pytorch/serve/blob/master/docs/performance_guide.md), [FasterTransformer](https://github.com/pytorch/serve/tree/master/examples/FasterTransformer_HuggingFace_Bert), and FlashAttention (Better Transformer)
* [Performance Guide](docs/performance_guide.md): builtin support to optimize, benchmark and profile PyTorch and TorchServe performance
* [Expressive handlers](CONTRIBUTING.md): An expressive handler architecture that makes it trivial to support inferencing for your usecase with [many supported out of the box](https://github.com/pytorch/serve/tree/master/ts/torch_handler)
-* [Metrics API](docs/metrics.md): out of box support for system level metrics with [Prometheus exports](https://github.com/pytorch/serve/tree/master/examples/custom_metrics), custom metrics and PyTorch profiler support
+* [Metrics API](docs/metrics.md): out-of-the-box support for system-level metrics with [Prometheus exports](https://github.com/pytorch/serve/tree/master/examples/custom_metrics) and custom metrics
+* [Large Model Inference Guide](docs/large_model_inference.md): with support for GenAI and LLMs, including
+ * Fast kernels with FlashAttention v2, continuous batching, and streaming responses
+ * PyTorch [Tensor Parallel](examples/large_models/tp_llama) preview, [Pipeline Parallel](examples/large_models/Huggingface_pippy)
+ * Microsoft [DeepSpeed](examples/large_models/deepspeed), [DeepSpeed-Mii](examples/large_models/deepspeed_mii)
+ * Hugging Face [Accelerate](examples/large_models/Huggingface_accelerate), [Diffusers](examples/diffusers)
+ * Running large models on AWS [Sagemaker](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-tutorials-torchserve.html) and [Inferentia2](https://pytorch.org/blog/high-performance-llama/)
+ * Running [Llama 2 Chatbot locally on Mac](examples/LLM/llama2)
+* Monitoring using Grafana and [Datadog](https://www.datadoghq.com/blog/ai-integrations/#model-serving-and-deployment-vertex-ai-amazon-sagemaker-torchserve)

## 🤔 How does TorchServe work

@@ -80,6 +90,7 @@ Refer to [torchserve docker](docker/README.md) for details.
* [Serving Llama 2 with TorchServe](examples/LLM/llama2/README.md)
* [Chatbot with Llama 2 on Mac 🦙💬](examples/LLM/llama2/chat_app)
* [🤗 HuggingFace Transformers](examples/Huggingface_Transformers) with a [Better Transformer Integration/ Flash Attention & Xformer Memory Efficient ](examples/Huggingface_Transformers#Speed-up-inference-with-Better-Transformer)
+* [Stable Diffusion](examples/diffusers)
* [Model parallel inference](examples/Huggingface_Transformers#model-parallelism)
* [MultiModal models with MMF](https://github.com/pytorch/serve/tree/master/examples/MMF-activity-recognition) combining text, audio and video
* [Dual Neural Machine Translation](examples/Workflows/nmt_transformers_pipeline) for a complex workflow DAG
@@ -100,6 +111,12 @@ We welcome all contributions!
To learn more about how to contribute, see the contributor guide [here](https://github.com/pytorch/serve/blob/master/CONTRIBUTING.md).
## 📰 News
+* [High performance Llama 2 deployments with AWS Inferentia2 using TorchServe](https://pytorch.org/blog/high-performance-llama/)
+* [Naver Case Study: Transition From High-Cost GPUs to Intel CPUs and oneAPI powered Software with performance](https://pytorch.org/blog/ml-model-server-resource-saving/)
+* [Run multiple generative AI models on GPU using Amazon SageMaker multi-model endpoints with TorchServe and save up to 75% in inference costs](https://aws.amazon.com/blogs/machine-learning/run-multiple-generative-ai-models-on-gpu-using-amazon-sagemaker-multi-model-endpoints-with-torchserve-and-save-up-to-75-in-inference-costs/)
+* [Deploying your Generative AI model in only four steps with Vertex AI and PyTorch](https://cloud.google.com/blog/products/ai-machine-learning/get-your-genai-model-going-in-four-easy-steps)
+* [PyTorch Model Serving on Google Cloud TPU v5](https://cloud.google.com/tpu/docs/v5e-inference#pytorch-model-inference-and-serving)
+* [Monitoring using Datadog](https://www.datadoghq.com/blog/ai-integrations/#model-serving-and-deployment-vertex-ai-amazon-sagemaker-torchserve)
* [Torchserve Performance Tuning, Animated Drawings Case-Study](https://pytorch.org/blog/torchserve-performance-tuning/)
* [Walmart Search: Serving Models at a Scale on TorchServe](https://medium.com/walmartglobaltech/search-model-serving-using-pytorch-and-torchserve-6caf9d1c5f4d)
* [🎥 Scaling inference on CPU with TorchServe](https://www.youtube.com/watch?v=066_Jd6cwZg)

diff --git a/docs/index.rst b/docs/index.rst
index f16037417e..06a36018fc 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -9,6 +9,12 @@ TorchServe is a performant, flexible and easy to use tool for serving PyTorch mo
What's going on in TorchServe?

+* `High performance Llama 2 deployments with AWS Inferentia2 using TorchServe <https://pytorch.org/blog/high-performance-llama/>`__
+* `Naver Case Study: Transition From High-Cost GPUs to Intel CPUs and oneAPI powered Software with performance <https://pytorch.org/blog/ml-model-server-resource-saving/>`__
+* `Run multiple generative AI models on GPU using Amazon SageMaker multi-model endpoints with TorchServe and save up to 75% in inference costs <https://aws.amazon.com/blogs/machine-learning/run-multiple-generative-ai-models-on-gpu-using-amazon-sagemaker-multi-model-endpoints-with-torchserve-and-save-up-to-75-in-inference-costs/>`__
+* `Deploying your Generative AI model in only four steps with Vertex AI and PyTorch <https://cloud.google.com/blog/products/ai-machine-learning/get-your-genai-model-going-in-four-easy-steps>`__
+* `PyTorch Model Serving on Google Cloud TPUv5 <https://cloud.google.com/tpu/docs/v5e-inference#pytorch-model-inference-and-serving>`__
+* `Monitoring using Datadog <https://www.datadoghq.com/blog/ai-integrations/#model-serving-and-deployment-vertex-ai-amazon-sagemaker-torchserve>`__
* `Torchserve Performance Tuning, Animated Drawings Case-Study <https://pytorch.org/blog/torchserve-performance-tuning/>`__
* `Walmart Search: Serving Models at a Scale on TorchServe <https://medium.com/walmartglobaltech/search-model-serving-using-pytorch-and-torchserve-6caf9d1c5f4d>`__
* `Scaling inference on CPU with TorchServe <https://www.youtube.com/watch?v=066_Jd6cwZg>`__

diff --git a/examples/large_models/tp_llama/REAME.md b/examples/large_models/tp_llama/README.md
similarity index 100%
rename from examples/large_models/tp_llama/REAME.md
rename to examples/large_models/tp_llama/README.md

diff --git a/ts_scripts/spellcheck_conf/wordlist.txt b/ts_scripts/spellcheck_conf/wordlist.txt
index f8fe15e126..b4fb8bc4a6 100644
--- a/ts_scripts/spellcheck_conf/wordlist.txt
+++ b/ts_scripts/spellcheck_conf/wordlist.txt
@@ -162,7 +162,10 @@ CN
CORS
EventLoopGroup
EventLoops
+CPUs
GPUs
+TPU
+TPUs
JVM
MaxDirectMemorySize
OU
@@ -1118,3 +1121,10 @@ quantized
Chatbot
LLM
bitsandbytes
+Datadog
+Trn
+oneAPI
+Naver
+FlashAttention
+GenAI
+prem
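
For reviewers skimming this patch, the Inference API and Metrics API bullets added to the README correspond to plain HTTP calls against a running TorchServe instance. The snippet below is a minimal illustrative sketch (not part of the diff): it assumes TorchServe is running locally with its default ports (8080 for inference, 8082 for metrics) and that a model has been registered under the hypothetical name `resnet-18` with a sample input file `kitten.jpg`.

```python
import requests

# Minimal sketch, assuming a local TorchServe with default ports and a model
# registered under the hypothetical name "resnet-18".

# Inference API (default port 8080): POST the raw request body to /predictions/<model_name>.
with open("kitten.jpg", "rb") as f:  # assumed sample input
    prediction = requests.post("http://127.0.0.1:8080/predictions/resnet-18", data=f)
print(prediction.json())  # handler output, e.g. class probabilities

# Metrics API (default port 8082): system-level and custom metrics in Prometheus text format.
metrics = requests.get("http://127.0.0.1:8082/metrics")
print(metrics.text)
```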