LLM Observability Dashboard


Large Language Models (LLMs) are an integral part of AI applications that boost productivity and efficiency. These LLM-based applications are becoming more advanced and sophisticated because of massive model sizes, intricate architectures, and non-deterministic outputs. Running them in production therefore poses challenges such as:

  • High computational requirements: Training and running LLMs is resource-intensive and can therefore be expensive.
  • Scale and cost: Running LLMs at scale is costly, and the large volumes of data that LLM applications generate must be managed to detect issues such as model drift.
  • Performance issues: Because LLMs are complex, identifying and troubleshooting the root cause of issues such as request errors or latency bottlenecks is challenging.
  • Quality and accuracy of outputs: LLMs might struggle to provide accurate responses for domain-specific tasks and require fine-tuning.
  • Bias and hallucinations: LLMs are trained on data that can contain social biases, stereotypes, and prejudices, which can cause these models to generate biased or harmful content.

The LLM Observability Dashboard helps you address these challenges by providing real-time insights into LLM performance and behavior. You can visualize and analyze this data to monitor the operational performance of your LLM applications.


Before you begin

Make sure that your LLM applications are instrumented with OpenTelemetry or OpenLLMetry so that they emit trace and metric data for analysis.
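For example, a Python application can be instrumented with the OpenLLMetry SDK (Traceloop). The following is a minimal sketch; the application name and collector endpoint are illustrative placeholders, not required values.

    # Minimal sketch: instrumenting a Python LLM application with
    # OpenLLMetry (the Traceloop SDK). Traceloop.init() auto-instruments
    # supported LLM client libraries and exports OpenTelemetry traces.
    import os

    from traceloop.sdk import Traceloop

    # Point the exporter at your OpenTelemetry collector
    # (illustrative local OTLP/HTTP endpoint).
    os.environ.setdefault("TRACELOOP_BASE_URL", "http://localhost:4318")

    Traceloop.init(app_name="my-llm-app")  # "my-llm-app" is a placeholder
    # From this point, calls made through supported LLM SDKs are traced.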


To view the dashboard

  1. From the navigation menu, click Dashboards.
  2. Search for the AIOps Observability folder and select it.
  3. Click LLM Observability Dashboard.
    The dashboard is displayed.

Metrics in the LLM Observability Dashboard

The dashboard provides the following categories of metrics to help you optimize model performance:

  • Evaluation metrics to assess the quality of generated outputs
  • Training metrics to assess model quality and training efficacy
  • System metrics to monitor GPU performance and memory usage

Evaluation metrics

Monitor and analyze the following evaluation metrics to assess the quality of the model output.


  • eval/bleu: Shows the BLEU score, which compares the generated text against reference outputs. It is commonly used to evaluate translations.
  • eval/loss: Shows the difference between the predictions made by the LLM and the actual target values (labels). A low loss value indicates that the model is making predictions closer to the true values.
  • eval/perplexity: Measures how confidently the LLM predicts the next token in a sequence. A lower perplexity value indicates better predictions.
  • eval/rouge1: Displays the score that measures the overlap of unigrams (single words) between the generated text and the reference text.
  • eval/rouge2: Measures the overlap of bigrams (two consecutive words) between the generated text and the reference text. This metric is useful for evaluating tasks such as summarization, where the context and relationships between consecutive words matter.
  • eval/rougeL: Displays the score, based on the longest common subsequence, that is used for evaluating longer sequences of text generated by the LLM, such as summaries or paraphrases.
  • eval/rougeLsum: Displays the score used for evaluating generated text summaries. It measures how much the model's summary overlaps with a reference summary, focusing on the longest matching sequence of words.
  • eval/runtime: Indicates how quickly the LLM processes inputs and generates results. This metric measures the efficiency and performance of the LLM in terms of execution speed, such as when generating responses or making predictions.
  • eval/samples_per_second: Indicates the number of input samples, such as tokens or queries, that the LLM processes per second. This performance metric evaluates how efficiently the LLM handles tasks such as generating responses, making predictions, or processing queries.
  • eval/steps_per_second: Indicates the number of computational steps the LLM performs per second while processing input and generating output.
  • eval/valid_mean_token_accuracy: Indicates how often the LLM produces valid tokens. This metric is used to evaluate output quality, ensuring that the generated text is syntactically and semantically valid.
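These panels mirror scores computed during evaluation runs. As a hypothetical illustration, the BLEU and ROUGE scores can be computed with the Hugging Face evaluate library, and eval/perplexity follows from eval/loss as exp(loss); the strings and loss value below are example data only.

    # Hypothetical sketch: computing BLEU/ROUGE scores like those shown
    # in the evaluation panels, using the Hugging Face `evaluate` library.
    # The prediction/reference strings are example data only.
    import math

    import evaluate

    predictions = ["the cat sat on the mat"]
    references = ["the cat is on the mat"]

    bleu = evaluate.load("bleu").compute(
        predictions=predictions, references=[[r] for r in references]
    )
    rouge = evaluate.load("rouge").compute(
        predictions=predictions, references=references
    )
    print(bleu["bleu"])  # corresponds to eval/bleu
    print(rouge["rouge1"], rouge["rouge2"],
          rouge["rougeL"], rouge["rougeLsum"])

    # eval/perplexity is the exponential of the mean cross-entropy loss:
    eval_loss = 2.0             # example eval/loss value
    print(math.exp(eval_loss))  # eval/perplexity, here ~7.39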

Training metrics

Monitor and analyze the following training metrics to identify strengths and weaknesses, optimize training, and make sure that the model meets quality standards for production applications.


  • train/grad_norm: Measures the magnitude of the gradients while training the LLM. By using this metric, you can tweak the training parameters to make sure that the model is trained effectively without instability.
  • train/epoch: Tracks how many complete passes (epochs) the model has made through the training data set. This metric helps you assess learning progress.
  • train/global_step: Tracks the count of updates made to the LLM parameters during training. It is used for tracking training progress, tweaking learning rates, and monitoring other training aspects, such as gradient clipping.
  • train/loss: Tracks model performance during training by measuring the difference between the model predictions and the target tokens. It is used for optimizing model performance and improving model learning.
  • train/learning_rate: Shows the value of the learning rate used while training the LLM. It is used for model optimization.
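As a hypothetical illustration of how such series are produced, a training loop can log them with a tool such as the Weights & Biases client, whose train/* naming convention these panels resemble; the project name and metric values below are placeholders.

    # Hypothetical sketch: emitting the train/* series with the
    # Weights & Biases client. The project name and metric values are
    # placeholders, not output from a real training run.
    import wandb

    run = wandb.init(project="llm-observability-demo")  # placeholder project
    total_steps, steps_per_epoch = 100, 20
    for step in range(1, total_steps + 1):
        run.log({
            "train/loss": 1.0 / step,      # placeholder training loss
            "train/grad_norm": 0.5,        # placeholder gradient norm
            "train/learning_rate": 5e-5,   # placeholder learning rate
            "train/epoch": step / steps_per_epoch,
            "train/global_step": step,
        })
    run.finish()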

System metrics

Monitor and analyze the following metrics related to system memory and GPU usage.


  • system/memory: Shows the amount of memory used during model training or inference. This metric measures system memory (RAM) or video memory (GPU VRAM) usage.
  • Process GPU Power Usage (W): Displays the power usage (in watts) of the GPU by the process while running training or inference workloads for the LLM.
  • Process GPU Power Usage (%): Displays the GPU power usage by the process as a percentage of the GPU's power capacity during model training or inference.
  • Process GPU Memory Allocated (Bytes): Displays the amount of GPU memory (in bytes) allocated by the process for model training or inference.
  • Process GPU Memory Allocated (%): Displays the percentage of total GPU memory allocated by the process for model training or inference.
  • Process GPU Time Spent Accessing Memory (%): Displays the percentage of time that the GPU spent reading from or writing to memory during model training or inference.
  • Process GPU Temperature: Displays the current temperature (in degrees Celsius) of the GPU while running training or inference workloads.
  • Process GPU Utilization (%): Displays the percentage of time that the GPU spends processing computations during model training or inference.
  • GPU Power Usage (W): Displays the current power usage (in watts) of the GPU while performing training or inference tasks.
  • GPU Power Usage (%): Displays the GPU power usage as a percentage of the GPU's power capacity during model training or inference.
  • GPU Memory Allocated (Bytes): Displays the amount of GPU memory (in bytes) allocated by the model.
  • GPU Memory Allocated (%): Displays the percentage of GPU memory allocated by the model during training or inference.
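As a hypothetical illustration, GPU readings like these can be collected from NVIDIA GPUs through NVML, for example with the pynvml bindings; this sketch shows one possible collection path, not necessarily how the dashboard's agent sources its data.

    # Hypothetical sketch: reading GPU metrics like the panels above via
    # NVML (pynvml). One possible collection path, not necessarily the
    # dashboard's own agent.
    import pynvml

    pynvml.nvmlInit()
    h = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

    power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000  # milliwatts -> watts
    power_cap_w = pynvml.nvmlDeviceGetPowerManagementLimit(h) / 1000
    mem = pynvml.nvmlDeviceGetMemoryInfo(h)
    util = pynvml.nvmlDeviceGetUtilizationRates(h)
    temp_c = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)

    print(f"GPU Power Usage (W): {power_w:.1f}")
    print(f"GPU Power Usage (%): {100 * power_w / power_cap_w:.1f}")
    print(f"GPU Memory Allocated (Bytes): {mem.used}")
    print(f"GPU Memory Allocated (%): {100 * mem.used / mem.total:.1f}")
    print(f"GPU Utilization (%): {util.gpu}")   # time spent on compute
    print(f"GPU Time Spent Accessing Memory (%): {util.memory}")
    print(f"GPU Temperature (C): {temp_c}")

    pynvml.nvmlShutdown()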
