{"id":12979,"date":"2026-04-09T18:09:27","date_gmt":"2026-04-09T22:09:27","guid":{"rendered":"https:\/\/www.daillac.com\/?p=12979"},"modified":"2026-04-09T18:37:48","modified_gmt":"2026-04-09T22:37:48","slug":"local-ai-gemma-4","status":"publish","type":"post","link":"https:\/\/www.daillac.com\/en\/blogue\/local-ai-gemma-4\/","title":{"rendered":"Local AI Gemma 4: architecture, benchmarks, deployment, and governance"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"12979\" class=\"elementor elementor-12979 elementor-12947\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"elementor-section elementor-top-section elementor-element elementor-element-a3820eb elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"a3820eb\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-074cc2d\" data-id=\"074cc2d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1268d9c elementor-widget elementor-widget-html\" data-id=\"1268d9c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"html.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n<meta charset=\"UTF-8\">\n<title>Local AI Gemma 4<\/title>\n<\/head>\n<body>\n<article class=\"dlx-article\" itemscope itemtype=\"https:\/\/schema.org\/Article\">\n  <header class=\"dlx-article__hero\">\n    <p class=\"dlx-article__eyebrow\">Local AI \u00b7 Gemma 4 <\/p>\n    <h1 itemprop=\"headline\">Local AI Gemma 4: architecture, benchmarks, deployment, and governance for running Gemma 4 offline<\/h1>\n    <p class=\"dlx-article__lead\" itemprop=\"description\">The phrase \u201cLocal AI Gemma 4\u201d refers to an architectural choice: running Gemma 4 on the user\u2019s machine (PC, phone, edge device, on\u2011prem server) rather than sending data to a cloud API. This choice generally pursues four goals: (1) reduce latency by bringing compute closer to the need, (2) strengthen privacy by reducing data exposure, (3) control recurring costs (no per\u2011token billing), and (4) gain <a href=\"https:\/\/www.daillac.com\/blogue\/souverainete-ia-au-canada\/\">operational sovereignty<\/a> (a stack deployed and controlled internally). 
This \u201cedge AI\u201d logic is now recognized as an explicit trade\u2011off between cloud, on\u2011device (including in\u2011vehicle), and hybrid execution, especially to balance latency, privacy, and cost.<\/p>\n  <\/header>\n\n  <nav class=\"dlx-toc\" aria-label=\"Table of contents\">\n    <div class=\"dlx-toc__title\">In this article<\/div>\n    <ul>\n      <li><a href=\"#resume-executif\">Executive summary<\/a><\/li>\n      <li><a href=\"#quappelle-t-on-ia-locale-et-ou-se-situe-gemma-4\">What is \u201clocal AI,\u201d and where does Gemma 4 fit?<\/a><\/li>\n      <li><a href=\"#architecture-technique-de-gemma-4\">Technical architecture of Gemma 4<\/a><\/li>\n      <li><a href=\"#performances-exigences-materielles-et-benchmarks\">Performance, hardware requirements, and benchmarks<\/a><\/li>\n      <li><a href=\"#guide-dinstallation-et-de-deploiement\">Installation and deployment guide<\/a><\/li>\n      <li><a href=\"#confidentialite-securite-et-considerations-juridiques\">Privacy, security, and legal considerations<\/a><\/li>\n      <li><a href=\"#comparaison-couts-et-implications-de-license-face-aux-concurrents-locaux\">Comparison, costs, and licensing implications versus local competitors<\/a><\/li>\n      <li><a href=\"#perspectives-et-recommandations\">Outlook and recommendations<\/a><\/li>\n      <li><a href=\"#references\">References<\/a><\/li>\n    <\/ul>\n  <\/nav>\n\n  <section id=\"resume-executif\" class=\"dlx-section dlx-reveal dlx-share-snippet\" data-dlx=\"reveal\" data-share-anchor=\"resume-executif\" data-share-title=\"Executive summary \u2014 Local AI Gemma 4\" data-share-text=\"Executive summary of Gemma 4\u2019s architecture, benefits, ecosystem, and risks when run locally.\">\n    <h2>Executive summary<\/h2>\n    <p>Gemma 4 (announced on April 2, 2026) is a family of open models designed for reasoning and <a href=\"https:\/\/www.daillac.com\/blogue\/agents-ia-en-entreprise\/\">agentic workflows<\/a>, offered in four sizes (E2B, E4B, 26B MoE, 31B Dense) and released under Apache 2.0 (commercially permissive).<br>\n    Its positioning is twofold: (a) \u201cedge\u201d models (E2B\/E4B) optimized for offline use and mobile\/IoT integration, and (b) \u201cworkstation\u201d models (26B\/31B) targeting a higher quality level, including a 26B MoE designed for latency by activating only about 3.8B parameters at inference.<\/p>\n    <p>From an industrial standpoint, the \u201cLocal AI Gemma 4\u201d ecosystem is already well tooled: local execution through apps\/servers (for example Ollama, LM Studio) and inference engines (for example LiteRT\u2011LM, llama.cpp, MLX, vLLM), including \u201cOpenAI\u2011style\u201d compatible APIs for quickly plugging in <a href=\"https:\/\/www.daillac.com\/blogue\/comment-utiliser-lia-en-entreprise-guide-complet-cas-pratiques\/\">existing applications<\/a>.<\/p>\n    <p>Finally, \u201clocal\u201d does not mean \u201crisk\u2011free.\u201d Self\u2011hosting shifts part of the risk: host machine security, supply\u2011chain vulnerabilities, prompt injection, <a href=\"https:\/\/www.daillac.com\/blogue\/securite-des-agents-ia\/\">tool control (agents)<\/a>, and misuse risks (spam\/phishing\/disinformation) when Internet\u2011exposed instances are poorly governed.<\/p>\n  <\/section>
role=\"list\">\n          <li class=\"dlx-share__item\">\n            <a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"linkedin\" aria-label=\"Share on LinkedIn\">\n              <svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M20.447 20.452h-3.554v-5.569c0-1.328-.027-3.037-1.852-3.037-1.853 0-2.136 1.445-2.136 2.939v5.667H9.351V9h3.414v1.561h.046c.477-.9 1.637-1.85 3.37-1.85 3.601 0 4.267 2.37 4.267 5.455v6.286zM5.337 7.433a2.062 2.062 0 0 1-2.063-2.065 2.064 2.064 0 1 1 2.063 2.065zm1.782 13.019H3.555V9h3.564v11.452zM22.225 0H1.771C.792 0 0 .774 0 1.729v20.542C0 23.227.792 24 1.771 24h20.451C23.2 24 24 23.227 24 22.271V1.729C24 .774 23.2 0 22.222 0h.003z\"\/><\/svg>\n              <span class=\"dlx-share__label\">LinkedIn<\/span>\n            <\/a>\n          <\/li>\n          <li class=\"dlx-share__item\">\n            <a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"x\" aria-label=\"Share on X\">\n              <svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M18.244 2.25h3.308l-7.227 8.26 8.502 11.24H16.17l-4.714-6.231-5.401 6.231H2.746l7.73-8.835L1.254 2.25H8.08l4.253 5.622 5.911-5.622zm-1.161 17.52h1.833L7.084 4.126H5.117z\"\/><\/svg>\n              <span class=\"dlx-share__label\">X<\/span>\n            <\/a>\n          <\/li>\n          <li class=\"dlx-share__item\">\n            <a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"facebook\" aria-label=\"Share on Facebook\">\n              <svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M24 12.073c0-6.627-5.373-12-12-12s-12 5.373-12 12c0 5.99 4.388 10.954 10.125 11.854v-8.385H7.078v-3.47h3.047V9.43c0-3.007 1.792-4.669 4.533-4.669 1.312 0 2.686.235 2.686.235v2.953H15.83c-1.491 0-1.956.925-1.956 1.874v2.25h3.328l-.532 3.47h-2.796v8.385C19.612 23.027 24 18.062 24 12.073z\"\/><\/svg>\n              <span class=\"dlx-share__label\">Facebook<\/span>\n            <\/a>\n          <\/li>\n          <li class=\"dlx-share__item\">\n            <a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"whatsapp\" aria-label=\"Share on WhatsApp\">\n              <svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M17.472 14.382c-.297-.149-1.758-.867-2.03-.967-.273-.099-.471-.148-.67.15-.197.297-.767.966-.94 1.164-.173.199-.347.223-.644.075-.297-.15-1.255-.463-2.39-1.475-.883-.788-1.48-1.761-1.653-2.059-.173-.297-.018-.458.13-.606.134-.133.298-.347.446-.52.149-.174.198-.298.298-.497.099-.198.05-.371-.025-.52-.075-.149-.669-1.612-.916-2.207-.242-.579-.487-.5-.669-.51-.173-.008-.371-.01-.57-.01-.198 0-.52.074-.792.372-.272.297-1.04 1.016-1.04 2.479 0 1.462 1.065 2.875 1.213 3.074.149.198 2.096 3.2 5.077 4.487.709.306 1.262.489 1.694.625.712.227 1.36.195 1.871.118.571-.085 1.758-.719 2.006-1.413.248-.694.248-1.289.173-1.413-.074-.124-.272-.198-.57-.347m-5.421 7.403h-.004a9.87 9.87 0 0 1-5.031-1.378l-.361-.214-3.741.982.998-3.648-.235-.374a9.86 9.86 0 0 1-1.51-5.26c.001-5.45 4.436-9.884 9.888-9.884 2.64 0 5.122 1.03 6.988 2.898a9.825 9.825 0 0 1 2.893 6.994c-.003 5.45-4.437 9.884-9.885 9.884m8.413-18.297A11.815 11.815 0 0 0 12.05 0C5.495 0 .16 
<section id=\"quappelle-t-on-ia-locale-et-ou-se-situe-gemma-4\" class=\"dlx-section dlx-reveal\" data-dlx=\"reveal\">\n    <h2>What is \u201clocal AI,\u201d and where does Gemma 4 fit?<\/h2>\n    <p>Local AI (or \u201con\u2011device \/ on\u2011prem\u201d) runs inference (and sometimes light fine\u2011tuning) as close to the user as possible: workstation, smartphone, industrial embedded device, internal server. This typically reduces outbound data flows and makes it possible to implement controls (encryption, network segmentation, logging, filtering, and so on) at the organizational level.<\/p>\n    <p>Data\u2011protection and cybersecurity authorities are increasingly pushing toward robust deployment models, favoring local, secure systems when possible and assessing the risk of data reuse by a provider when an external service is used.<br>\n    Strategically, edge AI is explicitly presented as an architectural choice that must balance latency, privacy, and cost, with lighter models and hardware advances making embedded execution more realistic.<\/p>\n    <p>In this context, Gemma 4 positions itself as an \u201copen + multi\u2011target\u201d answer: \u201cedge\u201d variants (E2B\/E4B) and \u201cPC\/server\u201d variants (26B\/31B), backed by an Apache 2.0 license (widely seen as more permissive than some restrictive \u201copen\u2011weight\u201d licenses).<br>\n    The stated goal is to enable deployments from billions of Android devices to workstations and accelerators, while keeping a common base and agentic capabilities (function calling, structured JSON, system instructions).<\/p>\n  <\/section>\n\n  <section id=\"architecture-technique-de-gemma-4\" class=\"dlx-section dlx-reveal\" data-dlx=\"reveal\">\n    <h2>Technical architecture of Gemma 4<\/h2>\n    <h3>Sizes, variants, and modalities<\/h3>\n    <p>Gemma 4 is a multimodal family (text + vision, and audio for the smaller variants), with two architecture families: Dense (31B) and Mixture\u2011of\u2011Experts (26B\u2011A4B).<\/p>\n    <p>Summary table of the main variants (parameters and key characteristics):<\/p>\n    <div id=\"gemma4-variantes\" class=\"dlx-share-snippet\" data-share-anchor=\"gemma4-variantes\" data-share-title=\"Gemma 4 variants table for local AI\" data-share-text=\"Comparison of the Gemma 4 E2B, E4B, 26B\u2011A4B, and 31B variants for local deployment.\">\n      <div class=\"dlx-shareable-block\">\n        <div class=\"dlx-table-wrap\">\n          <table>\n            <thead><tr>\n              
<th scope=\"col\">Variant<\/th>\n              <th scope=\"col\">Type<\/th>\n              <th scope=\"col\">Parameters (order of magnitude)<\/th>\n              <th scope=\"col\">Context window<\/th>\n              <th scope=\"col\">Main input modalities<\/th>\n              <th scope=\"col\">What it changes for local AI<\/th>\n            <\/tr><\/thead>\n            <tbody>\n              <tr>\n                <th scope=\"row\">Gemma 4 E2B<\/th>\n                <td>\u201cEdge\u201d dense<\/td>\n                <td>2.3B \u201ceffective\u201d (\u22485.1B with PLE embeddings)<\/td>\n                <td>128K<\/td>\n                <td>Text, image, audio<\/td>\n                <td>Designed for mobile\/IoT: a quality\/latency\/memory compromise<\/td>\n              <\/tr>\n              <tr>\n                <th scope=\"row\">Gemma 4 E4B<\/th>\n                <td>\u201cEdge\u201d dense<\/td>\n                <td>4.5B \u201ceffective\u201d (\u22488B with PLE embeddings)<\/td>\n                <td>128K<\/td>\n                <td>Text, image, audio<\/td>\n                <td>More headroom for reasoning, with hardware cost still contained<\/td>\n              <\/tr>\n              <tr>\n                <th scope=\"row\">Gemma 4 26B\u2011A4B<\/th>\n                <td>MoE<\/td>\n                <td>\u224825.2B total, \u22483.8B active at inference<\/td>\n                <td>256K<\/td>\n                <td>Text, image (video via frames depending on the engine)<\/td>\n                <td>Local \u201ccheat code\u201d: quality close to large models, throughput close to a smaller model<\/td>\n              <\/tr>\n              <tr>\n                <th scope=\"row\">Gemma 4 31B<\/th>\n                <td>Dense<\/td>\n                <td>\u224830.7B<\/td>\n                <td>256K<\/td>\n                <td>Text, image (video via frames depending on the engine)<\/td>\n                <td>Highest quality in the family, more demanding (VRAM\/KV cache)<\/td>\n              <\/tr>\n            <\/tbody>\n          <\/table>\n        <\/div>\n      <\/div>\n      <div class=\"dlx-share-card\" aria-label=\"Share this block\">\n        <div class=\"dlx-share\">\n          <div class=\"dlx-share__title\">Share this block<\/div>\n          <ul class=\"dlx-share__list\" role=\"list\">\n            <li class=\"dlx-share__item\">\n              <a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"linkedin\" aria-label=\"Share on LinkedIn\">\n                <svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M20.447 20.452h-3.554v-5.569c0-1.328-.027-3.037-1.852-3.037-1.853 0-2.136 1.445-2.136 2.939v5.667H9.351V9h3.414v1.561h.046c.477-.9 1.637-1.85 3.37-1.85 3.601 0 4.267 2.37 4.267 5.455v6.286zM5.337 7.433a2.062 2.062 0 0 1-2.063-2.065 2.064 2.064 0 1 1 2.063 2.065zm1.782 13.019H3.555V9h3.564v11.452zM22.225 0H1.771C.792 0 0 .774 0 1.729v20.542C0 23.227.792 24 1.771 24h20.451C23.2 24 24 23.227 24 22.271V1.729C24 .774 23.2 0 22.222 0h.003z\"\/><\/svg>\n                <span class=\"dlx-share__label\">LinkedIn<\/span>\n              <\/a>\n            <\/li>\n            <li class=\"dlx-share__item\">\n              <a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"x\" aria-label=\"Share on X\">\n                <svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path 
d=\"M18.244 2.25h3.308l-7.227 8.26 8.502 11.24H16.17l-4.714-6.231-5.401 6.231H2.746l7.73-8.835L1.254 2.25H8.08l4.253 5.622 5.911-5.622zm-1.161 17.52h1.833L7.084 4.126H5.117z\"\/><\/svg>\n                <span class=\"dlx-share__label\">X<\/span>\n              <\/a>\n            <\/li>\n            <li class=\"dlx-share__item\">\n              <a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"facebook\" aria-label=\"Share on Facebook\">\n                <svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M24 12.073c0-6.627-5.373-12-12-12s-12 5.373-12 12c0 5.99 4.388 10.954 10.125 11.854v-8.385H7.078v-3.47h3.047V9.43c0-3.007 1.792-4.669 4.533-4.669 1.312 0 2.686.235 2.686.235v2.953H15.83c-1.491 0-1.956.925-1.956 1.874v2.25h3.328l-.532 3.47h-2.796v8.385C19.612 23.027 24 18.062 24 12.073z\"\/><\/svg>\n                <span class=\"dlx-share__label\">Facebook<\/span>\n              <\/a>\n            <\/li>\n            <li class=\"dlx-share__item\">\n              <a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"whatsapp\" aria-label=\"Share on WhatsApp\">\n                <svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M17.472 14.382c-.297-.149-1.758-.867-2.03-.967-.273-.099-.471-.148-.67.15-.197.297-.767.966-.94 1.164-.173.199-.347.223-.644.075-.297-.15-1.255-.463-2.39-1.475-.883-.788-1.48-1.761-1.653-2.059-.173-.297-.018-.458.13-.606.134-.133.298-.347.446-.52.149-.174.198-.298.298-.497.099-.198.05-.371-.025-.52-.075-.149-.669-1.612-.916-2.207-.242-.579-.487-.5-.669-.51-.173-.008-.371-.01-.57-.01-.198 0-.52.074-.792.372-.272.297-1.04 1.016-1.04 2.479 0 1.462 1.065 2.875 1.213 3.074.149.198 2.096 3.2 5.077 4.487.709.306 1.262.489 1.694.625.712.227 1.36.195 1.871.118.571-.085 1.758-.719 2.006-1.413.248-.694.248-1.289.173-1.413-.074-.124-.272-.198-.57-.347m-5.421 7.403h-.004a9.87 9.87 0 0 1-5.031-1.378l-.361-.214-3.741.982.998-3.648-.235-.374a9.86 9.86 0 0 1-1.51-5.26c.001-5.45 4.436-9.884 9.888-9.884 2.64 0 5.122 1.03 6.988 2.898a9.825 9.825 0 0 1 2.893 6.994c-.003 5.45-4.437 9.884-9.885 9.884m8.413-18.297A11.815 11.815 0 0 0 12.05 0C5.495 0 .16 5.335.157 11.892c0 2.096.547 4.142 1.588 5.945L.057 24l6.305-1.654a11.882 11.882 0 0 0 5.683 1.448h.005c6.554 0 11.89-5.335 11.893-11.893a11.821 11.821 0 0 0-3.48-8.413z\"\/><\/svg>\n                <span class=\"dlx-share__label\">WhatsApp<\/span>\n              <\/a>\n            <\/li>\n            <li class=\"dlx-share__item\">\n              <button type=\"button\" class=\"dlx-share__link dlx-share__link--with-label\" data-copy-share aria-label=\"Copy the link to this block\">\n                <svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg>\n                <span class=\"dlx-share__label\">Copy<\/span>\n              <\/button>\n            <\/li>\n          <\/ul>\n        <\/div>\n      <\/div>\n    <\/div>\n\n    <p>Two structural points are decisive for \u201cLocal AI Gemma 4\u201d:<\/p>\n    <ul>\n      <li>MoE (26B\u2011A4B): the central argument is latency 
<h3>Attention mechanisms, long context, and \u201cagentic\u201d behavior<\/h3>\n    <p>Published technical integrations converge on an architecture designed for long context and multi\u2011engine compatibility:<\/p>\n    <ul>\n      <li>Hybrid attention (sliding window + global): described as a mechanism alternating local \u201csliding\u2011window\u201d attention and global attention, useful for handling long context at a reasonable cost.<\/li>\n      <li>Shared KV cache and related techniques: the goal is to improve memory\/compute efficiency on long prompts.<\/li>\n      <li>MoE on the 26B side: the vLLM documentation mentions an expert structure (128 experts, top\u20118 routing) for the MoE model, consistent with the idea of \u201clarge total, small active subset.\u201d<\/li>\n    <\/ul>\n    <p>On the \u201cagents\u201d side, Gemma 4 also highlights primitives that facilitate automation: function calling, structured JSON output, and system instructions \u2014 and, depending on the engines, a \u201cthinking \/ reasoning\u201d mode exposing a dedicated field in the API response.<\/p>
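    <p>To illustrate these primitives, here is a minimal function-calling sketch against a local OpenAI-compatible endpoint (the vLLM server shown later in this guide exposes one on port 8000). The endpoint URL, model tag, and tool schema are illustrative assumptions to adapt, not fixed values.<\/p>\n    <pre><code># Hypothetical sketch: tool calling via a local OpenAI-compatible server.\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http:\/\/localhost:8000\/v1\", api_key=\"unused-locally\")\n\ntools = [{\n    \"type\": \"function\",\n    \"function\": {\n        \"name\": \"get_inventory\",  # hypothetical local tool\n        \"description\": \"Look up stock for a SKU in the local ERP.\",\n        \"parameters\": {\n            \"type\": \"object\",\n            \"properties\": {\"sku\": {\"type\": \"string\"}},\n            \"required\": [\"sku\"],\n        },\n    },\n}]\n\nresp = client.chat.completions.create(\n    model=\"google\/gemma-4-31B-it\",  # as served in the vLLM example below\n    messages=[{\"role\": \"user\", \"content\": \"How many units of SKU-42 are left?\"}],\n    tools=tools,\n)\n\n# If the model chose to call the tool, the structured call is in tool_calls.\nfor call in resp.choices[0].message.tool_calls or []:\n    print(call.function.name, call.function.arguments)<\/code><\/pre>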
<h3>Quantization and local deployment formats<\/h3>\n    <p>Local AI almost always relies on quantization (precision reduction) to lower RAM\/VRAM usage and energy consumption:<\/p>\n    <ul>\n      <li>GGUF + quantization: frameworks such as Ollama and llama.cpp use quantized models in GGUF format to reduce compute requirements (sometimes with only moderate quality degradation; see the size estimate below).<\/li>\n      <li>2\u2011bit \/ 4\u2011bit (edge): LiteRT\u2011LM optimizations advertise 2\u2011bit\/4\u2011bit weights and memory\u2011mapping mechanisms (notably to contain memory usage on small devices).<\/li>\n      <li>NVFP4 (GPU): an NVFP4\u2011quantized variant (via NVIDIA Model Optimizer) has been released with evaluation results close to baseline on several benchmarks, along with a sample vLLM service.<\/li>\n      <li>TurboQuant (Apple Silicon): on the MLX side, the documentation mentions TurboQuant to sharply reduce active memory (\u22484\u00d7) and speed up long\u2011context inference on Apple Silicon.<\/li>\n    <\/ul>
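    <p>As an order-of-magnitude check before downloading anything: weight storage scales roughly as parameter count \u00d7 bits per weight. The sketch below deliberately ignores mixed-precision details (embeddings kept at higher precision, file metadata, runtime overhead); the BF16 result for the 31B (~57 GiB) is consistent with the \u201c1\u00d7 80GB GPU\u201d guidance quoted in the next section.<\/p>\n    <pre><code># Rough weight-file size: params x bits \/ 8 (simplified; see caveats above).\ndef weights_gib(n_params_billion, bits):\n    return n_params_billion * 1e9 * bits \/ 8 \/ 2**30\n\n# Note: for the 26B-A4B MoE, disk\/RAM still holds all ~25.2B weights,\n# even though only ~3.8B are active per token.\nfor name, billions in [(\"E2B (~5.1B with PLE)\", 5.1),\n                       (\"26B-A4B (~25.2B total)\", 25.2),\n                       (\"31B (~30.7B)\", 30.7)]:\n    print(f\"{name}: BF16 ~{weights_gib(billions, 16):.0f} GiB, \"\n          f\"INT4 ~{weights_gib(billions, 4):.0f} GiB\")<\/code><\/pre>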
<h3>Reference architecture diagram for Local AI Gemma 4<\/h3>\n    <figure id=\"schema-architecture-locale\" class=\"dlx-share-snippet\" data-share-anchor=\"schema-architecture-locale\" data-share-title=\"Local AI Gemma 4 architecture diagram\" data-share-text=\"Reference diagram for deploying Gemma 4 locally with an inference engine, guardrails, and tools.\">\n      <div class=\"dlx-shareable-block\">\n        <div class=\"dlx-mermaid\">\n          <pre class=\"mermaid\">flowchart LR\n  U[User \/ Application] -->|request| P[Preprocessing\\n(tokenizer + templates)]\n  P --> IE[Local inference engine\\n(Ollama \/ llama.cpp \/ vLLM \/ LiteRT-LM \/ MLX)]\n  IE --> M[Gemma 4\\nE2B\/E4B\/26B MoE\/31B]\n  M -->|response| IE\n  IE -->|post-processing| G[Local guardrails\\n(json schema, filters,\\npolicies, logs)]\n  G --> R[Rendered response]\n  M -->|optional tool call| T[Local tools\\n(RAG, functions, scripts)]\n  T --> IE<\/pre>\n        <\/div>\n      <\/div>\n    <\/figure>\n    <p>This diagram reflects a key point: with local AI, you become the operator (observability, security, quotas, tool isolation) \u2014 which is a strength (full control) but also a responsibility.<\/p>\n  <\/section>\n\n  <section id=\"performances-exigences-materielles-et-benchmarks\" class=\"dlx-section dlx-reveal\" data-dlx=\"reveal\">\n    <h2>Performance, hardware requirements, and benchmarks<\/h2>\n    <h3>\u201cReasoning \/ code \/ multimodal\u201d quality (public benchmarks)<\/h3>\n    <p>The Gemma 4 model card publishes a multi\u2011task table (reasoning, code, long context, vision, audio). Here is a synthetic extract (a selection of signals useful for local deployment choices):<\/p>\n    <div id=\"gemma4-benchmarks-qualite\" class=\"dlx-share-snippet\" data-share-anchor=\"gemma4-benchmarks-qualite\" data-share-title=\"Gemma 4 benchmarks \u2014 quality and reasoning\" data-share-text=\"Quick read of Gemma 4 public benchmarks for reasoning, code, and multimodal tasks.\">\n      <div class=\"dlx-shareable-block\">\n        <div class=\"dlx-table-wrap\">\n          <table>\n            <thead><tr>\n              <th scope=\"col\">Benchmark (selection)<\/th>\n              <th scope=\"col\">31B<\/th>\n              <th scope=\"col\">26B\u2011A4B<\/th>\n              <th scope=\"col\">E4B<\/th>\n              <th scope=\"col\">E2B<\/th>\n            <\/tr><\/thead>\n            <tbody>\n              <tr><th scope=\"row\">MMLU\u2011Pro<\/th><td>85.6<\/td><td>81.4<\/td><td>67.2<\/td><td>59.6<\/td><\/tr>\n              <tr><th scope=\"row\">AIME 2026 (without tools)<\/th><td>79.6<\/td><td>72.0<\/td><td>44.9<\/td><td>25.6<\/td><\/tr>\n              <tr><th scope=\"row\">LiveCodeBench v6 (pass@1)<\/th><td>69.6<\/td><td>74.5<\/td><td>40.4<\/td><td>23.4<\/td><\/tr>\n              <tr><th scope=\"row\">MMMU Pro (vision)<\/th><td>76.9<\/td><td>73.8<\/td><td>52.6<\/td><td>44.2<\/td><\/tr>\n              <tr><th scope=\"row\">MRCR v2 (8 needles, 128k)<\/th><td>66.4<\/td><td>44.1<\/td><td>25.4<\/td><td>19.1<\/td><\/tr>\n            <\/tbody>\n          <\/table>\n        <\/div>\n      <\/div>\n    <\/div>
fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M20.447 20.452h-3.554v-5.569c0-1.328-.027-3.037-1.852-3.037-1.853 0-2.136 1.445-2.136 2.939v5.667H9.351V9h3.414v1.561h.046c.477-.9 1.637-1.85 3.37-1.85 3.601 0 4.267 2.37 4.267 5.455v6.286zM5.337 7.433a2.062 2.062 0 0 1-2.063-2.065 2.064 2.064 0 1 1 2.063 2.065zm1.782 13.019H3.555V9h3.564v11.452zM22.225 0H1.771C.792 0 0 .774 0 1.729v20.542C0 23.227.792 24 1.771 24h20.451C23.2 24 24 23.227 24 22.271V1.729C24 .774 23.2 0 22.222 0h.003z\"\/><\/svg><span class=\"dlx-share__label\">LinkedIn<\/span><\/a><\/li>\n            <li class=\"dlx-share__item\"><a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"x\" aria-label=\"Share on X\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M18.244 2.25h3.308l-7.227 8.26 8.502 11.24H16.17l-4.714-6.231-5.401 6.231H2.746l7.73-8.835L1.254 2.25H8.08l4.253 5.622 5.911-5.622zm-1.161 17.52h1.833L7.084 4.126H5.117z\"\/><\/svg><span class=\"dlx-share__label\">X<\/span><\/a><\/li>\n            <li class=\"dlx-share__item\"><a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"facebook\" aria-label=\"Share on Facebook\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M24 12.073c0-6.627-5.373-12-12-12s-12 5.373-12 12c0 5.99 4.388 10.954 10.125 11.854v-8.385H7.078v-3.47h3.047V9.43c0-3.007 1.792-4.669 4.533-4.669 1.312 0 2.686.235 2.686.235v2.953H15.83c-1.491 0-1.956.925-1.956 1.874v2.25h3.328l-.532 3.47h-2.796v8.385C19.612 23.027 24 18.062 24 12.073z\"\/><\/svg><span class=\"dlx-share__label\">Facebook<\/span><\/a><\/li>\n            <li class=\"dlx-share__item\"><a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"whatsapp\" aria-label=\"Share on WhatsApp\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M17.472 14.382c-.297-.149-1.758-.867-2.03-.967-.273-.099-.471-.148-.67.15-.197.297-.767.966-.94 1.164-.173.199-.347.223-.644.075-.297-.15-1.255-.463-2.39-1.475-.883-.788-1.48-1.761-1.653-2.059-.173-.297-.018-.458.13-.606.134-.133.298-.347.446-.52.149-.174.198-.298.298-.497.099-.198.05-.371-.025-.52-.075-.149-.669-1.612-.916-2.207-.242-.579-.487-.5-.669-.51-.173-.008-.371-.01-.57-.01-.198 0-.52.074-.792.372-.272.297-1.04 1.016-1.04 2.479 0 1.462 1.065 2.875 1.213 3.074.149.198 2.096 3.2 5.077 4.487.709.306 1.262.489 1.694.625.712.227 1.36.195 1.871.118.571-.085 1.758-.719 2.006-1.413.248-.694.248-1.289.173-1.413-.074-.124-.272-.198-.57-.347m-5.421 7.403h-.004a9.87 9.87 0 0 1-5.031-1.378l-.361-.214-3.741.982.998-3.648-.235-.374a9.86 9.86 0 0 1-1.51-5.26c.001-5.45 4.436-9.884 9.888-9.884 2.64 0 5.122 1.03 6.988 2.898a9.825 9.825 0 0 1 2.893 6.994c-.003 5.45-4.437 9.884-9.885 9.884m8.413-18.297A11.815 11.815 0 0 0 12.05 0C5.495 0 .16 5.335.157 11.892c0 2.096.547 4.142 1.588 5.945L.057 24l6.305-1.654a11.882 11.882 0 0 0 5.683 1.448h.005c6.554 0 11.89-5.335 11.893-11.893a11.821 11.821 0 0 0-3.48-8.413z\"\/><\/svg><span class=\"dlx-share__label\">WhatsApp<\/span><\/a><\/li>\n            <li class=\"dlx-share__item\"><button type=\"button\" class=\"dlx-share__link dlx-share__link--with-label\" data-copy-share aria-label=\"Copy the link to this block\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" 
fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><span class=\"dlx-share__label\">Copy<\/span><\/button><\/li>\n          <\/ul>\n        <\/div>\n      <\/div>\n    <\/div>\n\n    <p>Analytical reading (for local AI):<\/p>\n    <ul>\n      <li>The 26B\u2011A4B appears to be a \u201csweet spot\u201d: very competitive (especially for code) while promising faster execution thanks to its \u201c\u22483.8B active\u201d MoE.<\/li>\n      <li>The E2B\/E4B models remain capable, but the \u201cquality vs cost\u201d slope becomes steep as soon as you target difficult math\/code tasks or highly demanding long\u2011context use cases.<\/li>\n    <\/ul>\n\n    <h3>Inference benchmarks (latency, throughput, memory) on devices<\/h3>\n    <p>For Local AI Gemma 4, the critical metrics are TTFT (time to first token), generation throughput (tokens\/s), memory cost (peak RAM\/VRAM), and stability under load.<\/p>\n    <p>LiteRT\u2011LM benchmarks provide concrete figures across several platforms (CPU\/GPU, mobile, desktop, IoT), including the E2B extract below.<\/p>\n    <div id=\"gemma4-benchmarks-appareils\" class=\"dlx-share-snippet\" data-share-anchor=\"gemma4-benchmarks-appareils\" data-share-title=\"Gemma 4 E2B benchmarks on devices\" data-share-text=\"Observed throughput, TTFT, and memory on mobile, desktop, and edge for Gemma 4 E2B.\">\n      <div class=\"dlx-shareable-block\">\n        <div class=\"dlx-table-wrap\">\n          <table>\n            <thead><tr>\n              <th scope=\"col\">Device<\/th>\n              <th scope=\"col\">Backend<\/th>\n              <th scope=\"col\">Prefill (tk\/s)<\/th>\n              <th scope=\"col\">Decode (tk\/s)<\/th>\n              <th scope=\"col\">TTFT (s)<\/th>\n              <th scope=\"col\">Peak CPU mem (MB)<\/th>\n            <\/tr><\/thead>\n            <tbody>\n              <tr><th scope=\"row\">Samsung S26 Ultra<\/th><td>CPU<\/td><td>557<\/td><td>47<\/td><td>1.8<\/td><td>1733<\/td><\/tr>\n              <tr><th scope=\"row\">Samsung S26 Ultra<\/th><td>GPU<\/td><td>3808<\/td><td>52<\/td><td>0.3<\/td><td>676<\/td><\/tr>\n              <tr><th scope=\"row\">iPhone 17 Pro<\/th><td>CPU<\/td><td>532<\/td><td>25<\/td><td>1.9<\/td><td>607<\/td><\/tr>\n              <tr><th scope=\"row\">iPhone 17 Pro<\/th><td>GPU<\/td><td>2878<\/td><td>56<\/td><td>0.3<\/td><td>1450<\/td><\/tr>\n              <tr><th scope=\"row\">MacBook Pro M4<\/th><td>GPU<\/td><td>7835<\/td><td>160<\/td><td>0.1<\/td><td>1623<\/td><\/tr>\n              <tr><th scope=\"row\">Raspberry Pi 5 (16GB)<\/th><td>CPU<\/td><td>133<\/td><td>8<\/td><td>7.8<\/td><td>1546<\/td><\/tr>\n              <tr><th scope=\"row\">Linux + GeForce RTX 4090<\/th><td>GPU<\/td><td>11234<\/td><td>143<\/td><td>0.1<\/td><td>913<\/td><\/tr>\n            <\/tbody>\n          <\/table>\n        <\/div>\n      <\/div>\n      <div class=\"dlx-share-card\" aria-label=\"Share this block\">\n        <div class=\"dlx-share\">\n          <div class=\"dlx-share__title\">Share this block<\/div>\n          <ul class=\"dlx-share__list\" role=\"list\">\n            <li class=\"dlx-share__item\"><a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"linkedin\" aria-label=\"Share on LinkedIn\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 
24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M20.447 20.452h-3.554v-5.569c0-1.328-.027-3.037-1.852-3.037-1.853 0-2.136 1.445-2.136 2.939v5.667H9.351V9h3.414v1.561h.046c.477-.9 1.637-1.85 3.37-1.85 3.601 0 4.267 2.37 4.267 5.455v6.286zM5.337 7.433a2.062 2.062 0 0 1-2.063-2.065 2.064 2.064 0 1 1 2.063 2.065zm1.782 13.019H3.555V9h3.564v11.452zM22.225 0H1.771C.792 0 0 .774 0 1.729v20.542C0 23.227.792 24 1.771 24h20.451C23.2 24 24 23.227 24 22.271V1.729C24 .774 23.2 0 22.222 0h.003z\"\/><\/svg><span class=\"dlx-share__label\">LinkedIn<\/span><\/a><\/li>\n            <li class=\"dlx-share__item\"><a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"x\" aria-label=\"Share on X\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M18.244 2.25h3.308l-7.227 8.26 8.502 11.24H16.17l-4.714-6.231-5.401 6.231H2.746l7.73-8.835L1.254 2.25H8.08l4.253 5.622 5.911-5.622zm-1.161 17.52h1.833L7.084 4.126H5.117z\"\/><\/svg><span class=\"dlx-share__label\">X<\/span><\/a><\/li>\n            <li class=\"dlx-share__item\"><a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"facebook\" aria-label=\"Share on Facebook\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M24 12.073c0-6.627-5.373-12-12-12s-12 5.373-12 12c0 5.99 4.388 10.954 10.125 11.854v-8.385H7.078v-3.47h3.047V9.43c0-3.007 1.792-4.669 4.533-4.669 1.312 0 2.686.235 2.686.235v2.953H15.83c-1.491 0-1.956.925-1.956 1.874v2.25h3.328l-.532 3.47h-2.796v8.385C19.612 23.027 24 18.062 24 12.073z\"\/><\/svg><span class=\"dlx-share__label\">Facebook<\/span><\/a><\/li>\n            <li class=\"dlx-share__item\"><a href=\"#\" class=\"dlx-share__link dlx-share__link--with-label\" data-share=\"whatsapp\" aria-label=\"Share on WhatsApp\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 24\" fill=\"currentColor\" aria-hidden=\"true\"><path d=\"M17.472 14.382c-.297-.149-1.758-.867-2.03-.967-.273-.099-.471-.148-.67.15-.197.297-.767.966-.94 1.164-.173.199-.347.223-.644.075-.297-.15-1.255-.463-2.39-1.475-.883-.788-1.48-1.761-1.653-2.059-.173-.297-.018-.458.13-.606.134-.133.298-.347.446-.52.149-.174.198-.298.298-.497.099-.198.05-.371-.025-.52-.075-.149-.669-1.612-.916-2.207-.242-.579-.487-.5-.669-.51-.173-.008-.371-.01-.57-.01-.198 0-.52.074-.792.372-.272.297-1.04 1.016-1.04 2.479 0 1.462 1.065 2.875 1.213 3.074.149.198 2.096 3.2 5.077 4.487.709.306 1.262.489 1.694.625.712.227 1.36.195 1.871.118.571-.085 1.758-.719 2.006-1.413.248-.694.248-1.289.173-1.413-.074-.124-.272-.198-.57-.347m-5.421 7.403h-.004a9.87 9.87 0 0 1-5.031-1.378l-.361-.214-3.741.982.998-3.648-.235-.374a9.86 9.86 0 0 1-1.51-5.26c.001-5.45 4.436-9.884 9.888-9.884 2.64 0 5.122 1.03 6.988 2.898a9.825 9.825 0 0 1 2.893 6.994c-.003 5.45-4.437 9.884-9.885 9.884m8.413-18.297A11.815 11.815 0 0 0 12.05 0C5.495 0 .16 5.335.157 11.892c0 2.096.547 4.142 1.588 5.945L.057 24l6.305-1.654a11.882 11.882 0 0 0 5.683 1.448h.005c6.554 0 11.89-5.335 11.893-11.893a11.821 11.821 0 0 0-3.48-8.413z\"\/><\/svg><span class=\"dlx-share__label\">WhatsApp<\/span><\/a><\/li>\n            <li class=\"dlx-share__item\"><button type=\"button\" class=\"dlx-share__link dlx-share__link--with-label\" data-copy-share aria-label=\"Copy the link to this block\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"18\" height=\"18\" viewBox=\"0 0 24 
24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\" stroke-linecap=\"round\" stroke-linejoin=\"round\" aria-hidden=\"true\"><rect x=\"9\" y=\"9\" width=\"13\" height=\"13\" rx=\"2\" ry=\"2\"><\/rect><path d=\"M5 15H4a2 2 0 0 1-2-2V4a2 2 0 0 1 2-2h9a2 2 0 0 1 2 2v1\"><\/path><\/svg><span class=\"dlx-share__label\">Copy<\/span><\/button><\/li>\n          <\/ul>\n        <\/div>\n      <\/div>\n    <\/div>\n\n    <p>Two important additions from \u201cedge\u201d communications:<\/p>\n    <ul>\n      <li>On Raspberry Pi 5, a Google AI Developers post reports \u2248133 tk\/s prefill and \u22487.6 tk\/s decode (same order of magnitude as LiteRT\u2011LM).<\/li>\n      <li>On a Qualcomm Dragonwing IQ8 platform, the same post reports \u22483700 tk\/s prefill and \u224831 tk\/s decode on NPU.<\/li>\n    <\/ul>\n\n    <h3>Hardware requirements (CPU\/GPU\/Apple Silicon\/ARM) and compatibility<\/h3>\n    <p>Requirements vary sharply depending on (a) model size, (b) precision (BF16, FP16, INT4, and so on), (c) context length, and (d) the inference engine.<\/p>\n    <p>Documented reference points:<\/p>\n    <ul>\n      <li>The 31B and 26B\u2011A4B \u201cunquantized BF16\u201d models are said to fit on 1\u00d7 80GB GPU (H100), and the vLLM documentation gives comparable minima (31B: 1\u00d7 80GB; 26B\u2011A4B: 1\u00d7 80GB in BF16).<\/li>\n      <li>vLLM also indicates \u201cdense edge\u201d minima: E2B\/E4B on 1\u00d7 NVIDIA 24GB+ GPU (in BF16) \u2014 which underlines that even \u201csmall\u201d multimodal models with long context can push VRAM when targeting BF16 + large max_len.<\/li>\n      <li>On the tooling side, LiteRT\u2011LM supports CPU\/GPU and even NPU (Android), with a \u201cbackends &amp; platforms\u201d table (Android\/iOS\/macOS\/Windows\/Linux\/IoT).<\/li>\n      <li>For Apple Silicon, MLX is presented as an \u201carray\u201d framework for machine learning on Apple silicon, with PyPI installation and CPU\/CUDA variants.<\/li>\n    <\/ul>\n\n    <h3>Recommended configurations by category<\/h3>\n    <p>Practical recommendations (focused on \u201cLocal AI Gemma 4\u201d), built from the constraints above and the published sizes\/benchmarks. 
<h3>Recommended configurations by category<\/h3>\n    <p>Practical recommendations (focused on \u201cLocal AI Gemma 4\u201d), built from the constraints above and the published sizes\/benchmarks. Real performance will depend on the engine, the context, the quantization, and the task type (text vs vision vs audio).<\/p>\n    <div class=\"dlx-table-wrap\">\n      <table>\n        <thead><tr>\n          <th scope=\"col\">Category<\/th>\n          <th scope=\"col\">Goal<\/th>\n          <th scope=\"col\">Recommended model<\/th>\n          <th scope=\"col\">Recommended stack<\/th>\n          <th scope=\"col\">\u201cSafe\u201d configuration<\/th>\n        <\/tr><\/thead>\n        <tbody>\n          <tr><th scope=\"row\">\u201cLocal AI\u201d laptop<\/th><td>Assistants, light RAG, code<\/td><td>E4B or quantized 26B\u2011A4B<\/td><td>Ollama \/ LM Studio \/ llama.cpp<\/td><td>32\u201364GB RAM; 12\u201324GB VRAM GPU (if 26B is quantized)<\/td><\/tr>\n          <tr><th scope=\"row\">Developer desktop<\/th><td>Code &amp; agents, vision<\/td><td>26B\u2011A4B (often the sweet spot)<\/td><td>vLLM (GPU), llama.cpp, Ollama<\/td><td>64GB RAM; 24GB+ VRAM GPU (quantized)<\/td><\/tr>\n          <tr><th scope=\"row\">Edge\/IoT<\/th><td>Offline, low energy<\/td><td>E2B\/E4B<\/td><td>LiteRT\u2011LM<\/td><td>ARM64, 8\u201316GB RAM depending on the device; GPU\/NPU acceleration if available<\/td><\/tr>\n          <tr><th scope=\"row\">On\u2011prem server<\/th><td>Multi-user, SLA<\/td><td>31B Dense \/ 26B\u2011A4B BF16<\/td><td>vLLM + Docker<\/td><td>1\u00d7 80GB (or multi\u2011GPU) + fast storage + logs\/monitoring<\/td><\/tr>\n        <\/tbody>\n      <\/table>\n    <\/div>\n\n    <h3>Energy (quantified approach, with explicit assumptions)<\/h3>\n    <p>Sources provide throughput (tokens\/s) but rarely a direct \u201cwatts\u201d measure for LLM inference. A useful approach is to estimate an order of magnitude:<br>energy (kWh) \u2248 power (W) \u00d7 time (h); time \u2248 tokens \/ (tokens\/s).<\/p>\n    <p>Power assumptions (\u201chardware\u201d sources):<\/p>\n    <ul>\n      <li>RTX 4090: 450W Total Graphics Power, \u201caverage gaming power 315W\u201d (a plausible lower bound outside stress).<\/li>\n      <li>Raspberry Pi 5: \u224811.6W under multi\u2011core load in a \u201cworst\u2011case\u201d scenario (technical review).<\/li>\n      <li>Apple M4 Pro: up to \u224846W (\u224840W sustained) under multi\u2011core load (review).<\/li>\n      <li>Tokens\/s throughput: LiteRT\u2011LM (E2B table).<\/li>\n    <\/ul>\n    <p>Estimate (generation of 1M decode tokens, E2B, order of magnitude):<\/p>\n    <div class=\"dlx-table-wrap\">\n      <table>\n        <thead><tr>\n          <th scope=\"col\">Platform<\/th>\n          <th scope=\"col\">Throughput (tk\/s)<\/th>\n          <th scope=\"col\">Power (W)<\/th>\n          <th scope=\"col\">Approx. energy (kWh \/ 1M tokens)<\/th>\n          <th scope=\"col\">Interpretation<\/th>\n        <\/tr><\/thead>\n        <tbody>\n          <tr><th scope=\"row\">RTX 4090<\/th><td>143<\/td><td>315\u2013450<\/td><td>~0.61 to ~0.87<\/td><td>Very fast, but high watts<\/td><\/tr>\n          <tr><th scope=\"row\">MacBook Pro M4<\/th><td>160<\/td><td>~40\u201346<\/td><td>~0.07 to ~0.08<\/td><td>Remarkable efficiency (if workload is comparable)<\/td><\/tr>\n          <tr><th scope=\"row\">Raspberry Pi 5<\/th><td>8<\/td><td>~11.6<\/td><td>~0.42<\/td><td>Slow, but energy use is not unreasonable (low power)<\/td><\/tr>\n        <\/tbody>\n      <\/table>\n    <\/div>\n    <p>These figures are estimates (real inference power may differ from a multi\u2011core CPU benchmark or a \u201cgaming\u201d measurement). The most robust takeaway is this: at the \u201celectricity\u201d level, cost per million tokens can be low; the dominant cost often becomes hardware amortization (GPU) and operating engineering (MLOps\/observability\/security).<\/p>
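    <p>The same arithmetic as the table above, as a reusable snippet; the wattage figures are the assumptions just listed, not measured inference power.<\/p>\n    <pre><code># energy (kWh) = power (W) x time (h); time = tokens \/ throughput.\ndef kwh_per_million_tokens(decode_tks, power_watts):\n    hours = 1_000_000 \/ decode_tks \/ 3600\n    return power_watts * hours \/ 1000\n\nassumptions = {  # (decode tk\/s, assumed watts) from the sources above\n    \"RTX 4090\": (143, 450),\n    \"MacBook Pro M4\": (160, 46),\n    \"Raspberry Pi 5\": (8, 11.6),\n}\nfor platform, (tks, watts) in assumptions.items():\n    print(f\"{platform}: ~{kwh_per_million_tokens(tks, watts):.2f} kWh \/ 1M tokens\")<\/code><\/pre>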
<\/section>\n\n  <section id=\"guide-dinstallation-et-de-deploiement\" class=\"dlx-section dlx-reveal\" data-dlx=\"reveal\">\n    <h2>Installation and deployment guide<\/h2>\n    <h3>\u201cZero-friction\u201d deployment with Ollama<\/h3>\n    <p>The official Gemma guide explains that Ollama (and llama.cpp) use quantized GGUF models to reduce compute requirements, and provides installation \/ pull \/ run \/ local API commands.<\/p>\n    <p>Key commands (example):<\/p>\n    <pre><code># Check installation\nollama --version\n\n# Download Gemma 4 (default tag)\nollama pull gemma4\n\n# List models\nollama list\n\n# Run a text prompt\nollama run gemma4 \"Give me a unit test plan for a REST API.\"\n\n# Tags mentioned in the docs (depending on size)\n# gemma4:e2b  gemma4:e4b  gemma4:26b  gemma4:31b<\/code><\/pre>\n    <p>Local API test (generation):<\/p>\n    <pre><code>curl http:\/\/localhost:11434\/api\/generate -d '{\n  \"model\": \"gemma4\",\n  \"prompt\": \"Summarize this text in 5 points: ...\"\n}'<\/code><\/pre>\n\n    <h3>GUI deployment + local server with LM Studio<\/h3>\n    <p>The official \u201cLM Studio\u201d guide highlights (a) in\u2011app downloading, (b) GGUF import, and (c) starting a local API server through the CLI.<\/p>\n    <pre><code># Import a GGUF\nlms import \/path\/to\/model.gguf\n\n# Load a downloaded model\nlms load &lt;model_key&gt;\n\n# Start the local API server\nlms server start<\/code><\/pre>\n    <p>On memory sizing, LM Studio gives rough orders of magnitude for required RAM (\u22484 to \u224819GB depending on the variant), useful as a first pass before fine optimization.<\/p>\n\n    <h3>Python inference with Transformers<\/h3>\n    <p>The Hugging Face post announces \u201cfirst\u2011class\u201d Transformers support and integration with bitsandbytes \/ PEFT \/ TRL, with an \u201cany\u2011to\u2011any\u201d pipeline example (text + image, and so on).<\/p>\n    <p>Minimal installation:<\/p>\n    <pre><code>pip install -U transformers<\/code><\/pre>\n    <p>Example (\u201cany\u2011to\u2011any\u201d multimodal pipeline):<\/p>\n    <pre><code>from transformers import pipeline\n\npipe = pipeline(\"any-to-any\", model=\"google\/gemma-4-e2b-it\")\n\nmessages = [{\n  \"role\": \"user\",\n  \"content\": [\n    {\"type\": \"image\", \"image\": \"https:\/\/...\/thailand.jpg\"},\n    {\"type\": \"text\", \"text\": \"Describe the scene and suggest 3 travel tips.\"}\n  ],\n}]\n\nout = pipe(messages, max_new_tokens=200, return_full_text=False)\nprint(out[0][\"generated_text\"])<\/code><\/pre>\n\n    <h3>\u201cProduction\u201d deployment as an OpenAI-compatible server with vLLM + Docker<\/h3>\n    <p>The \u201cGemma 4\u201d vLLM guide provides: (a) vllm serve commands, (b) Docker images, and (c) multi\u2011GPU examples and options (max_model_len, tool calling, thinking).<\/p>\n    <p>Docker \u201cOpenAI\u2011style server\u201d example:<\/p>\n    <pre><code>docker run -itd --name gemma4 \\\n  --ipc=host \\\n  --network host \\\n  --shm-size 16G \\\n  --gpus all \\\n  -v ~\/.cache\/huggingface:\/root\/.cache\/huggingface \\\n  vllm\/vllm-openai:gemma4 \\\n    --model google\/gemma-4-31B-it \\\n    --tensor-parallel-size 2 \\\n    --max-model-len 32768 \\\n    --gpu-memory-utilization 0.90 \\\n    --host 0.0.0.0 \\\n    --port 8000<\/code><\/pre>
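    <p>Once the container is up, any OpenAI-style client can talk to it. A minimal sketch (base_url and port match the Docker command above; the api_key value is arbitrary for a local server):<\/p>\n    <pre><code># Minimal client call against the local vLLM server started above.\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http:\/\/localhost:8000\/v1\", api_key=\"not-needed-locally\")\n\nresp = client.chat.completions.create(\n    model=\"google\/gemma-4-31B-it\",  # must match the --model flag above\n    messages=[{\"role\": \"user\", \"content\": \"Summarize this text in 5 points: ...\"}],\n    max_tokens=300,\n)\nprint(resp.choices[0].message.content)<\/code><\/pre>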
<p>\u201cThinking + Tool calling\u201d example:<\/p>\n    <pre><code>vllm serve google\/gemma-4-31B-it \\\n  --max-model-len 16384 \\\n  --enable-auto-tool-choice \\\n  --reasoning-parser gemma4 \\\n  --tool-call-parser gemma4<\/code><\/pre>\n    <p>NVFP4 quantization (published vLLM service example):<\/p>\n    <pre><code>vllm serve \/models\/gemma-4-31b-it-nvfp4 \\\n  --quantization modelopt \\\n  --tensor-parallel-size 8<\/code><\/pre>\n\n    <h3>Edge and cross-platform deployment with LiteRT\u2011LM<\/h3>\n    <p>LiteRT\u2011LM is presented as a \u201cproduction\u2011ready\u201d open-source inference framework for deploying LLMs on edge devices, with CLI\/Python\/Kotlin\/C++ support and CPU\/GPU\/NPU backends depending on the platform.<\/p>\n    <p>\u201cQuick try\u201d CLI example (from the repo):<\/p>\n    <pre><code>uv tool install litert-lm\n\nlitert-lm run \\\n  --from-huggingface-repo=litert-community\/gemma-4-E2B-it-litert-lm \\\n  gemma-4-E2B-it.litertlm \\\n  --prompt=\"What is the capital of France?\"<\/code><\/pre>\n\n    <h3>\u201cLow-level\u201d deployment + OpenAI compatibility with llama.cpp<\/h3>\n    <p>llama.cpp exposes a local HTTP server with \u201c\/v1\/chat\/completions\u201d compatible endpoints (OpenAI-style) and a benchmarking CLI (llama-bench). The Hugging Face post also gives an example of llama-server -hf ... on a GGUF checkpoint.<\/p>\n    <pre><code># Local OpenAI-style server (local GGUF)\nllama-server -m model.gguf --port 8080\n\n# Or directly from an HF repo (for example E2B)\nllama-server -hf ggml-org\/gemma-4-E2B-it-GGUF<\/code><\/pre>\n\n    <h3>Troubleshooting Local AI Gemma 4 (common issues)<\/h3>\n    <p>The dominant issues are generally:<\/p>\n    <ul>\n      <li>OOM \/ saturated VRAM: reduce --max-model-len, switch to quantized formats (GGUF INT4), reduce the vision\/audio budget, limit the number of images per prompt, or choose a smaller variant. The vLLM guide explicitly shows the use of --max-model-len and multi\u2011GPU deployments.<\/li>\n      <li>High TTFT latency: prioritize GPU\/NPU (if available), enable batch\/paged attention, reduce prefill and\/or chunking, and avoid continuously sending \u201chuge\u201d prompts. LiteRT\u2011LM metrics illustrate the major impact of the backend (CPU vs GPU); see the measurement probe below.<\/li>\n      <li>Quality degradation (quantization): accept a trade\u2011off or move up in precision (Q6\/Q8) if RAM\/VRAM allows it; the Ollama guide explicitly reminds readers of the possible quality drop when quantized.<\/li>\n      <li>Ecosystem in motion (April 2026): some engines may hit specific \u201cday\u20110\/week\u20111\u201d bugs; one public llama.cpp example mentions abnormal outputs (tokens &lt;unused24&gt;) on a Gemma 4 checkpoint, a reminder of the importance of updates and regression testing.<\/li>\n    <\/ul>
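    <p>To confirm whether TTFT is the bottleneck, measure it directly. A minimal streaming probe (shown against the llama.cpp server above on port 8080; the model name is a placeholder, and base_url should match your engine):<\/p>\n    <pre><code># Measure time-to-first-token via a streaming chat completion.\nimport time\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http:\/\/localhost:8080\/v1\", api_key=\"unused\")\n\nstart = time.perf_counter()\nstream = client.chat.completions.create(\n    model=\"gemma-4-E2B-it\",  # placeholder: use the name your server reports\n    messages=[{\"role\": \"user\", \"content\": \"Say hello.\"}],\n    stream=True,\n)\nfor chunk in stream:\n    if chunk.choices and chunk.choices[0].delta.content:\n        print(f\"TTFT: {time.perf_counter() - start:.2f} s\")\n        break<\/code><\/pre>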
<\/section>\n\n  <section id=\"confidentialite-securite-et-considerations-juridiques\" class=\"dlx-section dlx-reveal\" data-dlx=\"reveal\">\n    <h2>Privacy, security, and legal considerations<\/h2>\n    <h3>Privacy and compliance (GDPR, CNIL)<\/h3>\n    <p>Local AI is often chosen to minimize data exposure: processing happens \u201con your side,\u201d which makes minimization, network isolation, and flow control easier. The CNIL, regarding generative AI, notably recommends choosing a robust and secure deployment, favoring local systems where relevant, and analyzing data\u2011reuse conditions if a provider is involved.<\/p>\n    <p>Regarding \u201cpersonal data\u201d security, Article 32 of the GDPR requires appropriate technical and organizational measures (for example encryption\/pseudonymization, and means to ensure confidentiality\/integrity\/availability).<br>\n    Practical conclusion: Local AI Gemma 4 does not exempt you from GDPR; it mainly changes the attack surface and the responsibility model (you control more, so you must document more).<\/p>\n\n    <h3>Application security (LLM apps): main risks<\/h3>\n    <p>LLM security risks are now stable enough to be listed as a \u201cTop 10\u201d (prompt injection, insecure output handling, poisoning, DoS, supply chain, and so on). \u201cAgent\u201d risks further increase the need for governance (control\u2011by\u2011design, accountability) when the model can act on systems through tools.<\/p>\n    <p>On open source in production, research shows that Internet\u2011exposed self\u2011hosted deployments can be diverted to malicious use (spam\/phishing\/disinformation), and that guardrails are sometimes removed by operators.<\/p>\n\n    <h3>Usage policy, license, and responsibilities<\/h3>\n    <p>Gemma 4 is announced under Apache 2.0 (a permissive license) \u2014 a strong argument for commercial adoption and on\u2011prem\/edge deployment. However, Google also publishes a Prohibited Use Policy listing forbidden uses (illegal activities, fraud\/phishing\/malware, generation\/processing of sensitive data without authorization, filter bypass, and so on).<br>\n    Even if a policy is not always the same thing as a license, it should be read as a \u201cminimum\u201d governance element: in a product, these prohibitions must be translated into controls (rate limiting, filtering, refusal logic, logs, human review); see the sketch below.<\/p>
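    <p>One concrete translation of \u201ctreat model output as untrusted\u201d: validate any structured output against an allow-list and a minimal shape check before acting on it. A sketch (tool names and fields are illustrative):<\/p>\n    <pre><code># Minimal guardrail: validate a model-proposed tool call before executing it.\nimport json\n\nALLOWED_TOOLS = {\"get_inventory\", \"search_docs\"}  # explicit allow-list\n\ndef safe_parse_tool_call(raw):\n    # Reject anything that is not well-formed JSON naming an allowed tool.\n    try:\n        call = json.loads(raw)\n    except json.JSONDecodeError:\n        return None  # log and refuse; never \"best-effort\" execute\n    if call.get(\"name\") not in ALLOWED_TOOLS:\n        return None\n    if not isinstance(call.get(\"arguments\"), dict):\n        return None\n    return call\n\nprint(safe_parse_tool_call('{\"name\": \"rm_rf\", \"arguments\": {}}'))  # None\nprint(safe_parse_tool_call('{\"name\": \"search_docs\", \"arguments\": {\"q\": \"GDPR\"}}'))<\/code><\/pre>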
<\/section>\n\n  <section id=\"comparaison-couts-et-implications-de-license-face-aux-concurrents-locaux\" class=\"dlx-section dlx-reveal\" data-dlx=\"reveal\">\n    <h2>Comparison, costs, and licensing implications versus local competitors<\/h2>\n    <h3>Comparative matrix (local): Gemma vs Llama vs Mistral vs MPT vs Falcon<\/h3>\n    <p>This table compares major \u201clocal\u2011friendly\u201d families. It does not replace an \u201capples\u2011to\u2011apples\u201d benchmark (same prompts, same engine, same quantizations), but it helps with selection based on license, modalities, and ecosystem.<\/p>\n    <div id=\"gemma4-comparatif-local\" class=\"dlx-share-snippet\" data-share-anchor=\"gemma4-comparatif-local\" data-share-title=\"Gemma vs Llama vs Mistral vs MPT vs Falcon\" data-share-text=\"Local-friendly comparison of licenses, context windows, and local deployment signals.\">\n      <div class=\"dlx-shareable-block\">\n        <div class=\"dlx-table-wrap\">\n          <table>\n            <thead><tr>\n              <th scope=\"col\">Family<\/th>\n              <th scope=\"col\">Example<\/th>\n              <th scope=\"col\">License<\/th>\n              <th scope=\"col\">Modalities<\/th>\n              <th scope=\"col\">Context window<\/th>\n              <th scope=\"col\">\u201cLocal\u201d signals (highlights)<\/th>\n            <\/tr><\/thead>\n            <tbody>\n              <tr><th scope=\"row\">Gemma 4<\/th><td>26B\u2011A4B \/ 31B<\/td><td>Apache 2.0<\/td><td>Vision (all), audio (E2B\/E4B)<\/td><td>128K\/256K<\/td><td>MoE \u201c\u22483.8B active\u201d for latency, with a very broad tool ecosystem (Ollama\/LM Studio\/LiteRT\u2011LM\/MLX\/vLLM)<\/td><\/tr>\n              <tr><th scope=\"row\">Llama<\/th><td>Llama 3.1 8B\/70B\/405B<\/td><td>Community license (with conditions)<\/td><td>Text<\/td><td>128K<\/td><td>Attribution requirement + \u201c700M MAU\u201d clause; excellent ecosystem, but not Apache-style<\/td><\/tr>\n              <tr><th scope=\"row\">Mistral<\/th><td>Mistral 7B<\/td><td>Apache 2.0<\/td><td>Text<\/td><td>(depending on implementation)<\/td><td>GQA + Sliding Window Attention for faster\/less costly inference, Apache 2.0<\/td><\/tr>\n              <tr><th scope=\"row\">MPT<\/th><td>MPT\u201130B (Base)<\/td><td>Apache 2.0 (Base)<\/td><td>Text<\/td><td>8K<\/td><td>Positioned as \u201ccommercial Apache 2.0,\u201d 8k long context, but some chat variants may carry a non-commercial license<\/td><\/tr>\n              <tr><th scope=\"row\">Falcon<\/th><td>Falcon\u201140B<\/td><td>Apache 2.0<\/td><td>Text<\/td><td>(depending on implementation)<\/td><td>Inference-optimized architecture (FlashAttention + multiquery), raw model requiring fine-tuning for chat use<\/td><\/tr>\n            <\/tbody>\n          <\/table>\n        <\/div>\n      <\/div>\n    <\/div>
\n    <\/div>\n\n    <p>License and business interpretation:<\/p>\n    <ul>\n      <li>Apache 2.0 (Gemma 4, Mistral 7B, MPT\u201130B Base, Falcon\u201140B) is simpler for commercial use (less legal uncertainty) than custom licenses, which are sometimes criticized for their restrictions.<\/li>\n      <li>The Llama 3.1 license notably imposes attribution obligations and specific commercial conditions (for example an MAU threshold), which can matter in a consumer product.<\/li>\n    <\/ul>\n\n    <h3>Local AI Gemma 4 cost: a TCO model rather than an \u201cabsolute\u201d price<\/h3>\n    <p>Total \u201clocal\u201d cost can be broken down schematically as follows:<\/p>\n    <ol>\n      <li>CAPEX (GPU\/server) amortized over N months<\/li>\n      <li>Electricity OPEX (often low per token, but not zero)<\/li>\n      <li>Engineering OPEX (deployment, security, MLOps, observability)<\/li>\n      <li>Opportunity cost (latency, offline capability, compliance)<\/li>\n    <\/ol>\n    <p>\u201cCompute demand\u201d analyses emphasize that growing demand for compute and energy is a macro issue (pressure on data centers and electricity grids), which makes optimization (smaller models, quantization, edge) structural.<\/p>\n    <p>Example order of magnitude (electricity only, E2B, 1M tokens): roughly 0.07 to 0.87 kWh depending on platform and power draw, i.e. a few cents to a few tens of cents depending on the local price per kWh; the sketch below shows the arithmetic.<br>\n    In many cases the decisive question becomes: how many tokens per day, and how many concurrent users? If you serve 50 simultaneous users, planning becomes \u201cserver + batching + quotas,\u201d and vLLM or dedicated servers become more relevant than local GUIs.<\/p>
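\n    <p>The per\u2011token energy range above follows directly from power draw and sustained throughput. A minimal sketch; the wattage, tokens\/s, price, and workload figures are illustrative assumptions, not measured Gemma 4 numbers.<\/p>\n    <pre><code># Back-of-the-envelope TCO per 1M generated tokens.\n# All inputs are ILLUSTRATIVE assumptions.\nWATTS = 450.0              # e.g. a high-end desktop GPU under load\nTOKENS_PER_S = 144.0       # sustained decode throughput (assumed)\nPRICE_KWH = 0.15           # local electricity price, USD/kWh (assumed)\nGPU_COST = 2000.0          # CAPEX in USD (assumed)\nAMORT_MONTHS = 36          # amortization horizon\nTOKENS_PER_MONTH = 300e6   # monthly workload (assumed)\n\nseconds = 1e6 / TOKENS_PER_S             # time to generate 1M tokens\nkwh = WATTS * seconds / 3600.0 / 1000.0  # energy for 1M tokens\nelectricity = kwh * PRICE_KWH\ncapex = GPU_COST / AMORT_MONTHS / (TOKENS_PER_MONTH / 1e6)\n\nprint(f'{kwh:.2f} kWh per 1M tokens')            # ~0.87 kWh at 450 W\nprint(f'electricity: ${electricity:.3f} per 1M') # ~$0.13\nprint(f'amortized CAPEX: ${capex:.3f} per 1M')   # ~$0.19\n<\/code><\/pre>\n    <p>Swapping in, say, 10 W and 40 tokens\/s (a phone\u2011class figure for E2B) yields about 0.07 kWh, the bottom of the range cited above; note that under these assumptions, amortized hardware rather than electricity dominates the per\u2011token cost.<\/p>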
\n  <\/section>\n\n  <section id=\"perspectives-et-recommandations\" class=\"dlx-section dlx-reveal\" data-dlx=\"reveal\">\n    <h2>Outlook and recommendations<\/h2>\n    <h3>Likely trends (2026+)<\/h3>\n    <p>Three structuring dynamics:<\/p>\n    <ol>\n      <li>\u201cReasoned\u201d edge AI (cloud + local hybrid): more and more products explicitly decide where to run the model in order to balance latency, cost, and privacy.<\/li>\n      <li>Explosion of agents: agents plus tool calling bring more value but also more risk, hence a stronger need for control\u2011by\u2011design.<\/li>\n      <li>Industrialization of open\u2011weight models: the tooling ecosystem (quantization, runtimes, OpenAI\u2011compatible servers) is standardizing, but Internet\u2011exposed self\u2011hosting without governance remains a source of misuse.<\/li>\n    <\/ol>\n\n    <h3>Operational recommendations for \u201cLocal AI Gemma 4\u201d<\/h3>\n    <p>Quick selection rule of thumb:<\/p>\n    <ul>\n      <li>Mobile\/edge\/strict offline \u2192 E2B (or E4B if you need more reasoning) with LiteRT\u2011LM.<\/li>\n      <li>Developer workstation \/ copilot \/ local agent \u2192 prioritize 26B\u2011A4B: a good quality\/speed trade\u2011off, especially if you target code + tools.<\/li>\n      <li>Maximum quality + tuning \u2192 31B Dense, accepting the hardware cost (VRAM, context length) and a server stack (vLLM) for stability.<\/li>\n    <\/ul>\n    <p>Essential guardrails if you are building \u201clocal agentic\u201d systems (a minimal validation sketch follows this list):<\/p>\n    <ul>\n      <li>Treat model output as untrusted by default (JSON validation, allow\u2011lists, tool sandboxing, access limits).<\/li>\n      <li>Protect the host (network segmentation, secrets management, logs, patching) and avoid Internet exposure without authentication and quotas.<\/li>\n      <li>Document compliance (GDPR Art. 32, minimization, a DPIA if necessary) and align usage with the Prohibited Use Policy.<\/li>\n    <\/ul>
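\n    <p>What \u201cuntrusted by default\u201d means in practice: every model\u2011proposed tool call is validated against a closed allow\u2011list and schema before anything executes. A minimal sketch; the tool names, argument schema, and size limit are hypothetical.<\/p>\n    <pre><code># Validate a model-proposed tool call before execution.\n# Tool names, schema, and limits are HYPOTHETICAL examples.\nimport json\n\nALLOWED_TOOLS = {\n    'read_file':  {'path': str},   # argument name -> expected type\n    'web_search': {'query': str},\n}\nMAX_ARG_LEN = 500  # reject oversized arguments (crude DoS guard)\n\ndef validate_tool_call(raw):\n    call = json.loads(raw)  # malformed JSON raises a ValueError subclass\n    name, args = call.get('tool'), call.get('arguments', {})\n    schema = ALLOWED_TOOLS.get(name)\n    if schema is None:\n        raise PermissionError(f'tool not on allow-list: {name!r}')\n    if set(args) != set(schema):  # the schema is closed: no extras\n        raise ValueError('unexpected or missing arguments')\n    for key, expected in schema.items():\n        value = args[key]\n        if not isinstance(value, expected) or len(str(value)) > MAX_ARG_LEN:\n            raise ValueError(f'bad argument {key!r} for {name!r}')\n    return {'tool': name, 'arguments': args}\n\nsample = json.dumps({'tool': 'read_file',\n                     'arguments': {'path': 'notes.txt'}})\nprint(validate_tool_call(sample))  # passes; an off-list 'shell_exec' would not\n<\/code><\/pre>\n    <p>The same idea extends to sandboxing: even a validated call should run with the narrowest possible permissions (read\u2011only paths, network egress rules), so that a prompt\u2011injected instruction cannot escalate.<\/p>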
\n    <p>Explicit assumptions made in this report: \u201cper\u2011token\u201d energy figures are estimates (derived from generic hardware power figures plus published tokens\/s), and the \u201cmin VRAM\u201d figures (vLLM) should be read as BF16 requirements for server deployments; they do not necessarily reflect what is possible with GGUF\/Ollama quantization on consumer GPUs.<\/p>\n  <\/section>\n\n  <section id=\"references\" class=\"dlx-section dlx-reveal\" data-dlx=\"reveal\">\n    <h2>References<\/h2>\n    <ul>\n      <li><a href=\"https:\/\/ai.google.dev\/gemma\/docs\/core\/model_card_4\" target=\"_blank\" rel=\"noopener noreferrer\">ai.google.dev \u2014 Gemma 4 model card<\/a><\/li>\n      <li><a href=\"https:\/\/ai.google.dev\/gemma\/docs\/integrations\/ollama\" target=\"_blank\" rel=\"noopener noreferrer\">ai.google.dev \u2014 Run Gemma with Ollama<\/a><\/li>\n      <li><a href=\"https:\/\/ai.google.dev\/gemma\/docs\/integrations\/lmstudio\" target=\"_blank\" rel=\"noopener noreferrer\">ai.google.dev \u2014 Run Gemma with LM Studio<\/a><\/li>\n      <li><a href=\"https:\/\/ai.google.dev\/edge\/litert-lm\/overview\" target=\"_blank\" rel=\"noopener noreferrer\">ai.google.dev \u2014 LiteRT-LM Overview<\/a><\/li>\n      <li><a href=\"https:\/\/ai.google.dev\/gemma\/prohibited_use_policy\" target=\"_blank\" rel=\"noopener noreferrer\">ai.google.dev \u2014 Gemma Prohibited Use Policy<\/a><\/li>\n      <li><a href=\"https:\/\/blog.google\/innovation-and-ai\/technology\/developers-tools\/gemma-4\/\" target=\"_blank\" rel=\"noopener noreferrer\">blog.google \u2014 Gemma 4: Our most capable open models to date<\/a><\/li>\n      <li><a href=\"https:\/\/developers.googleblog.com\/bring-state-of-the-art-agentic-skills-to-the-edge-with-gemma-4\/\" target=\"_blank\" rel=\"noopener noreferrer\">developers.googleblog.com \u2014 Bring state-of-the-art agentic skills to the edge with Gemma 4<\/a><\/li>\n      <li><a href=\"https:\/\/huggingface.co\/blog\/gemma4\" target=\"_blank\" rel=\"noopener noreferrer\">huggingface.co \u2014 Gemma 4 blog post<\/a><\/li>\n      <li><a href=\"https:\/\/huggingface.co\/nvidia\/Gemma-4-31B-IT-NVFP4\" target=\"_blank\" rel=\"noopener noreferrer\">huggingface.co \u2014 NVIDIA Gemma-4-31B-IT-NVFP4<\/a><\/li>\n      <li><a href=\"https:\/\/huggingface.co\/meta-llama\/Llama-3.1-8B\" target=\"_blank\" rel=\"noopener noreferrer\">huggingface.co \u2014 Meta Llama 3.1 8B<\/a><\/li>\n      <li><a href=\"https:\/\/huggingface.co\/tiiuae\/falcon-40b\" target=\"_blank\" rel=\"noopener noreferrer\">huggingface.co \u2014 Falcon-40B<\/a><\/li>\n      <li><a href=\"https:\/\/docs.vllm.ai\/projects\/recipes\/en\/latest\/Google\/Gemma4.html\" target=\"_blank\" rel=\"noopener noreferrer\">docs.vllm.ai \u2014 Gemma 4 recipes<\/a><\/li>\n      <li><a href=\"https:\/\/github.com\/google-ai-edge\/LiteRT-LM\" target=\"_blank\" rel=\"noopener noreferrer\">github.com \u2014 google-ai-edge\/LiteRT-LM<\/a><\/li>\n      <li><a href=\"https:\/\/github.com\/ggml-org\/llama.cpp\" target=\"_blank\" rel=\"noopener noreferrer\">github.com \u2014 ggml-org\/llama.cpp<\/a><\/li>\n      <li><a href=\"https:\/\/github.com\/ggml-org\/llama.cpp\/issues\/21321\" target=\"_blank\" rel=\"noopener noreferrer\">github.com \u2014 llama.cpp issue #21321<\/a><\/li>\n      <li><a href=\"https:\/\/github.com\/ml-explore\/mlx\" target=\"_blank\" rel=\"noopener noreferrer\">github.com \u2014 ml-explore\/mlx<\/a><\/li>\n      <li><a 
href=\"https:\/\/www.cnil.fr\/fr\/comment-deployer-une-ia-generative-la-cnil-apporte-de-premieres-precisions\" target=\"_blank\" rel=\"noopener noreferrer\">cnil.fr \u2014 How to deploy generative AI<\/a><\/li>\n      <li><a href=\"https:\/\/eur-lex.europa.eu\/legal-content\/EN\/TXT\/HTML\/?uri=CELEX%3A02016R0679-20160504\" target=\"_blank\" rel=\"noopener noreferrer\">eur-lex.europa.eu \u2014 GDPR (Regulation 2016\/679)<\/a><\/li>\n      <li><a href=\"https:\/\/owasp.org\/www-project-top-10-for-large-language-model-applications\/\" target=\"_blank\" rel=\"noopener noreferrer\">owasp.org \u2014 OWASP Top 10 for LLM Applications<\/a><\/li>\n      <li><a href=\"https:\/\/www.theverge.com\/ai-artificial-intelligence\/906062\/googles-gemma-4-open-ai-model\" target=\"_blank\" rel=\"noopener noreferrer\">theverge.com \u2014 Google\u2019s new Gemma 4 \u201copen\u201d AI model<\/a><\/li>\n      <li><a href=\"https:\/\/venturebeat.com\/technology\/google-releases-gemma-4-under-apache-2-0-and-that-license-change-may-matter\" target=\"_blank\" rel=\"noopener noreferrer\">venturebeat.com \u2014 Google releases Gemma 4 under Apache 2.0<\/a><\/li>\n      <li><a href=\"https:\/\/www.reuters.com\/technology\/open-source-ai-models-vulnerable-criminal-misuse-researchers-warn-2026-01-29\/\" target=\"_blank\" rel=\"noopener noreferrer\">reuters.com \u2014 Open-source AI models vulnerable to criminal misuse<\/a><\/li>\n      <li><a href=\"https:\/\/www.mckinsey.com\/industries\/semiconductors\/our-insights\/the-rise-of-edge-ai-in-automotive\" target=\"_blank\" rel=\"noopener noreferrer\">mckinsey.com \u2014 The rise of edge AI in automotive<\/a><\/li>\n      <li><a href=\"https:\/\/www.bain.com\/insights\/how-can-we-meet-ais-insatiable-demand-for-compute-power-technology-report-2025\/\" target=\"_blank\" rel=\"noopener noreferrer\">bain.com \u2014 How can we meet AI\u2019s insatiable demand for compute power<\/a><\/li>\n      <li><a href=\"https:\/\/www.bcg.com\/publications\/2025\/what-happens-ai-stops-asking-permission\" target=\"_blank\" rel=\"noopener noreferrer\">bcg.com \u2014 What happens when AI stops asking permission<\/a><\/li>\n      <li><a href=\"https:\/\/www.nvidia.com\/en-us\/geforce\/graphics-cards\/40-series\/rtx-4090\/\" target=\"_blank\" rel=\"noopener noreferrer\">nvidia.com \u2014 GeForce RTX 4090<\/a><\/li>\n      <li><a href=\"https:\/\/mistral.ai\/news\/announcing-mistral-7b\" target=\"_blank\" rel=\"noopener noreferrer\">mistral.ai \u2014 Announcing Mistral 7B<\/a><\/li>\n      <li><a href=\"https:\/\/www.databricks.com\/blog\/mpt-30b\" target=\"_blank\" rel=\"noopener noreferrer\">databricks.com \u2014 MPT-30B<\/a><\/li>\n      <li><a href=\"https:\/\/bret.dk\/raspberry-pi-5-review\/\" target=\"_blank\" rel=\"noopener noreferrer\">bret.dk \u2014 Raspberry Pi 5 review<\/a><\/li>\n      <li><a href=\"https:\/\/www.notebookcheck.net\/Apple-M4-Pro-analysis-Extremely-fast-but-not-as-efficient.915270.0.html\" target=\"_blank\" rel=\"noopener noreferrer\">notebookcheck.net \u2014 Apple M4 Pro analysis<\/a><\/li>\n    <\/ul>\n  <\/section>\n\n<\/article>\n<\/body>\n<\/html>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Local AI Gemma 4 Local AI \u00b7 Gemma 4 Local AI Gemma 4: architecture, benchmarks, deployment, and governance for running Gemma 4 offline The phrase \u201cLocal AI Gemma 4\u201d refers to an architectural choice: running Gemma 4 on the user\u2019s machine 
(PC, phone, edge device, on\u2011prem server) rather than sending data to a cloud API. [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":12949,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[61],"tags":[],"class_list":["post-12979","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-non-classified"],"_links":{"self":[{"href":"https:\/\/www.daillac.com\/en\/wp-json\/wp\/v2\/posts\/12979","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.daillac.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.daillac.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.daillac.com\/en\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.daillac.com\/en\/wp-json\/wp\/v2\/comments?post=12979"}],"version-history":[{"count":4,"href":"https:\/\/www.daillac.com\/en\/wp-json\/wp\/v2\/posts\/12979\/revisions"}],"predecessor-version":[{"id":12983,"href":"https:\/\/www.daillac.com\/en\/wp-json\/wp\/v2\/posts\/12979\/revisions\/12983"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.daillac.com\/en\/wp-json\/wp\/v2\/media\/12949"}],"wp:attachment":[{"href":"https:\/\/www.daillac.com\/en\/wp-json\/wp\/v2\/media?parent=12979"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.daillac.com\/en\/wp-json\/wp\/v2\/categories?post=12979"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.daillac.com\/en\/wp-json\/wp\/v2\/tags?post=12979"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}