LLM Observability Dashboard


Large Language Models (LLMs) are an integral part of AI applications that boost productivity and efficiency. These LLM-based applications are becoming more advanced and sophisticated because of massive model sizes, intricate architectures, and non-deterministic outputs. Running them in production therefore poses challenges such as:

  • High computational requirements: Training and running LLMs is resource-intensive and can therefore be expensive.
  • Scale and cost: Running LLMs at scale is costly, and the large volumes of data that LLM applications generate must be managed to detect issues such as model drift.
  • Performance issues: Because LLMs are complex, identifying and troubleshooting the root cause of issues such as request errors or latency bottlenecks is challenging.
  • Quality and accuracy of outputs: LLMs might struggle to provide accurate responses for domain-specific tasks and require fine-tuning.
  • Bias and hallucinations: LLMs are trained on data that can contain social biases, stereotypes, and prejudices, which can cause these models to generate biased or harmful content.

The LLM Observability Dashboard helps you address these challenges by providing real-time insights into LLM performance and behavior. You can visualize and analyze this data to monitor the operational performance of your LLM applications.


Before you begin

Make sure that your LLM applications are instrumented with OpenTelemetry or OpenLLMetry so that they emit trace and metric data for analysis.
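For example, a Python application can be instrumented with the OpenLLMetry SDK (Traceloop). The following is a minimal sketch; the application name and collector endpoint are illustrative placeholders, not required values.

    # Minimal sketch: instrumenting a Python LLM application with
    # OpenLLMetry (the Traceloop SDK). Traceloop.init() auto-instruments
    # supported LLM client libraries and exports OpenTelemetry traces.
    import os

    from traceloop.sdk import Traceloop

    # Point the exporter at your OpenTelemetry collector
    # (illustrative local OTLP/HTTP endpoint).
    os.environ.setdefault("TRACELOOP_BASE_URL", "http://localhost:4318")

    Traceloop.init(app_name="my-llm-app")  # "my-llm-app" is a placeholder
    # From this point, calls made through supported LLM SDKs are traced.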


To view the dashboard

  1. From the navigation menu, click Dashboards.
  2. Search for the AIOps Observability folder and select it.
  3. Click LLM Observability Dashboard.
    The dashboard is displayed.

Metrics in the LLM Observability Dashboard

The dashboard provides the following categories of metrics to help you optimize model performance:

  • Evaluation metrics to assess the quality of generated outputs
  • Training metrics to assess model quality and training efficacy
  • System metrics to monitor GPU performance and memory usage

Evaluation metrics

Monitor and analyze the following evaluation metrics to assess the quality of the model output.


  • eval/bleu: Shows the BLEU score, which compares the generated text against reference outputs. It is commonly used to evaluate translations.
  • eval/loss: Shows the difference between the predictions made by the LLM and the actual target values (labels). A low loss value indicates that the model is making predictions closer to the true values.
  • eval/perplexity: Measures how confidently the LLM predicts the next token in a sequence. A lower perplexity value indicates better predictions.
  • eval/rouge1: Displays the score that measures the overlap of unigrams (single words) between the generated text and the reference text.
  • eval/rouge2: Measures the overlap of bigrams (two consecutive words) between the generated text and the reference text. This metric is useful for evaluating tasks such as summarization, where the context and relationships between consecutive words matter.
  • eval/rougeL: Displays the score, based on the longest common subsequence, that is used for evaluating longer sequences of text generated by the LLM, such as summaries or paraphrases.
  • eval/rougeLsum: Displays the score used for evaluating generated text summaries. It measures how much the model's summary overlaps with a reference summary, focusing on the longest matching sequence of words.
  • eval/runtime: Indicates how quickly the LLM processes inputs and generates results. This metric measures the efficiency and performance of the LLM in terms of execution speed, such as when generating responses or making predictions.
  • eval/samples_per_second: Indicates the number of input samples, such as tokens or queries, that the LLM processes per second. This performance metric evaluates how efficiently the LLM handles tasks such as generating responses, making predictions, or processing queries.
  • eval/steps_per_second: Indicates the number of computational steps the LLM performs per second while processing input and generating output.
  • eval/valid_mean_token_accuracy: Indicates how often the LLM produces valid tokens. This metric is used to evaluate output quality, ensuring that the generated text is syntactically and semantically valid.
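These panels mirror scores computed during evaluation runs. As a hypothetical illustration, the BLEU and ROUGE scores can be computed with the Hugging Face evaluate library, and eval/perplexity follows from eval/loss as exp(loss); the strings and loss value below are example data only.

    # Hypothetical sketch: computing BLEU/ROUGE scores like those shown
    # in the evaluation panels, using the Hugging Face `evaluate` library.
    # The prediction/reference strings are example data only.
    import math

    import evaluate

    predictions = ["the cat sat on the mat"]
    references = ["the cat is on the mat"]

    bleu = evaluate.load("bleu").compute(
        predictions=predictions, references=[[r] for r in references]
    )
    rouge = evaluate.load("rouge").compute(
        predictions=predictions, references=references
    )
    print(bleu["bleu"])  # corresponds to eval/bleu
    print(rouge["rouge1"], rouge["rouge2"],
          rouge["rougeL"], rouge["rougeLsum"])

    # eval/perplexity is the exponential of the mean cross-entropy loss:
    eval_loss = 2.0             # example eval/loss value
    print(math.exp(eval_loss))  # eval/perplexity, here ~7.39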

Training metrics

Monitor and analyze the following training metrics to identify strengths and weaknesses, optimize training, and make sure that the model meets quality standards for production applications.


  • train/grad_norm: Measures the magnitude of the gradients while training the LLM. By using this metric, you can tweak the training parameters to make sure that the model is trained effectively without instability.
  • train/epoch: Tracks how many complete passes (epochs) the model has made through the training data set. This metric helps you assess learning progress.
  • train/global_step: Tracks the count of updates made to the LLM parameters during training. It is used for tracking training progress, tweaking learning rates, and monitoring other training aspects, such as gradient clipping.
  • train/loss: Tracks model performance during training by measuring the difference between the model predictions and the target tokens. It is used for optimizing model performance and improving model learning.
  • train/learning_rate: Shows the value of the learning rate used while training the LLM. It is used for model optimization.
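As a hypothetical illustration of how such series are produced, a training loop can log them with a tool such as the Weights & Biases client, whose train/* naming convention these panels resemble; the project name and metric values below are placeholders.

    # Hypothetical sketch: emitting the train/* series with the
    # Weights & Biases client. The project name and metric values are
    # placeholders, not output from a real training run.
    import wandb

    run = wandb.init(project="llm-observability-demo")  # placeholder project
    total_steps, steps_per_epoch = 100, 20
    for step in range(1, total_steps + 1):
        run.log({
            "train/loss": 1.0 / step,      # placeholder training loss
            "train/grad_norm": 0.5,        # placeholder gradient norm
            "train/learning_rate": 5e-5,   # placeholder learning rate
            "train/epoch": step / steps_per_epoch,
            "train/global_step": step,
        })
    run.finish()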

System metrics

Monitor and analyze the following metrics related to system memory and GPU usage.


  • system/memory: Shows the amount of memory used during model training or inference. This metric measures system memory (RAM) or video memory (GPU VRAM) usage.
  • Process GPU Power Usage (W): Displays the power usage (in watts) of the GPU by the process while running training or inference workloads for the LLM.
  • Process GPU Power Usage (%): Displays the GPU power usage by the process as a percentage of the GPU's power capacity during model training or inference.
  • Process GPU Memory Allocated (Bytes): Displays the amount of GPU memory (in bytes) allocated by the process for model training or inference.
  • Process GPU Memory Allocated (%): Displays the percentage of total GPU memory allocated by the process for model training or inference.
  • Process GPU Time Spent Accessing Memory (%): Displays the percentage of time that the GPU spent reading from or writing to memory during model training or inference.
  • Process GPU Temperature: Displays the current temperature (in degrees Celsius) of the GPU while running training or inference workloads.
  • Process GPU Utilization (%): Displays the percentage of time that the GPU spends processing computations during model training or inference.
  • GPU Power Usage (W): Displays the current power usage (in watts) of the GPU while performing training or inference tasks.
  • GPU Power Usage (%): Displays the GPU power usage as a percentage of the GPU's power capacity during model training or inference.
  • GPU Memory Allocated (Bytes): Displays the amount of GPU memory (in bytes) allocated by the model.
  • GPU Memory Allocated (%): Displays the percentage of GPU memory allocated by the model during training or inference.
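As a hypothetical illustration, GPU readings like these can be collected from NVIDIA GPUs through NVML, for example with the pynvml bindings; this sketch shows one possible collection path, not necessarily how the dashboard's agent sources its data.

    # Hypothetical sketch: reading GPU metrics like the panels above via
    # NVML (pynvml). One possible collection path, not necessarily the
    # dashboard's own agent.
    import pynvml

    pynvml.nvmlInit()
    h = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

    power_w = pynvml.nvmlDeviceGetPowerUsage(h) / 1000  # milliwatts -> watts
    power_cap_w = pynvml.nvmlDeviceGetPowerManagementLimit(h) / 1000
    mem = pynvml.nvmlDeviceGetMemoryInfo(h)
    util = pynvml.nvmlDeviceGetUtilizationRates(h)
    temp_c = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)

    print(f"GPU Power Usage (W): {power_w:.1f}")
    print(f"GPU Power Usage (%): {100 * power_w / power_cap_w:.1f}")
    print(f"GPU Memory Allocated (Bytes): {mem.used}")
    print(f"GPU Memory Allocated (%): {100 * mem.used / mem.total:.1f}")
    print(f"GPU Utilization (%): {util.gpu}")   # time spent on compute
    print(f"GPU Time Spent Accessing Memory (%): {util.memory}")
    print(f"GPU Temperature (C): {temp_c}")

    pynvml.nvmlShutdown()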
