Comparative Analysis of AI-Powered Lightweight Language Model Deployment Across Cloud Platforms
Noble Dhas, Nisha Jhansi (2025)
All rights reserved. This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-202505028662
Abstract
The increasing adoption of Large Language Models (LLMs) creates a demand for cost-effective cloud deployment solutions, particularly for lightweight models. While cloud free tiers offer initial accessibility, their suitability for hosting even resource-modest LLMs is not well-established. This thesis presents an empirical comparison of deploying the DistilGPT2 lightweight LLM on the free-tier CPU-based virtual machine instances of AWS (t2.micro), GCP (e2-micro), and Azure (B1s). The study evaluates differences in performance (latency, throughput), resource utilization (CPU, RAM), deployment complexity, and estimated cost-effectiveness.
A standardized methodology was employed: DistilGPT2 was deployed via a Flask/Gunicorn API on Ubuntu 22.04 LTS across the specified instance types and regions. Performance under low concurrency (5 users) was measured using Locust, while resource usage was tracked via each platform's native monitoring tools (requiring platform-specific configuration). Application logs provided granular performance timings.
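For illustration, the serving stack described above might look like the following minimal sketch, assuming the standard Hugging Face transformers pipeline API. The endpoint path, payload shape, and generation parameters are assumptions for illustration, not the thesis's actual code.

```python
# app.py -- minimal DistilGPT2 inference endpoint (illustrative sketch)
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)

# Load the model once at startup; on a 1 GiB instance this load is the
# step that typically forces swap usage, as noted in the results.
generator = pipeline("text-generation", model="distilgpt2")

@app.route("/generate", methods=["POST"])  # endpoint path is assumed
def generate():
    prompt = request.get_json(force=True).get("prompt", "")
    output = generator(prompt, max_new_tokens=50, num_return_sequences=1)
    return jsonify({"generated_text": output[0]["generated_text"]})
```

In production the app would run behind Gunicorn, e.g. `gunicorn --workers 1 --bind 0.0.0.0:8000 app:app`; a single worker keeps only one copy of the model resident, which matters within a 1 GiB memory budget.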
Results indicate that while deployment is feasible, performance is severely constrained by the 1 GiB RAM limit common to these instances, necessitating swap file usage during setup and resulting in high runtime memory pressure, particularly on Azure. All platforms exhibited low throughput (0.5-0.7 RPS) and significant latency variability under load. CPU utilization patterns varied markedly, with GCP showing high reported usage (>200%) while Azure remained minimal (<1%). Qualitative differences in deployment ease were noted, especially regarding SSH access convenience (GCP) and monitoring setup complexity.
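The low-concurrency load profile behind these figures (5 concurrent users via Locust, per the methodology) can be expressed as a Locust user class along these lines; the target path and prompt are illustrative assumptions.

```python
# locustfile.py -- 5-user load profile (illustrative sketch)
from locust import HttpUser, between, task

class InferenceUser(HttpUser):
    wait_time = between(1, 3)  # per-user think time between requests

    @task
    def generate(self):
        self.client.post("/generate", json={"prompt": "Cloud computing is"})
```

Run headless with 5 concurrent users, e.g. `locust -f locustfile.py --host http://<instance-ip>:8000 --headless -u 5 -r 1`.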
The study concludes that the evaluated free-tier instances are unsuitable for reliable or scalable lightweight LLM inference beyond basic testing due to critical resource bottlenecks, primarily memory. Achieving practical performance requires transitioning to paid tiers, limiting the direct cost-effectiveness of free options for sustained use. These findings provide a crucial empirical baseline for users considering entry-level cloud resources for AI deployment.