Brief Digest
The Qwen CLI wraps model download, chat, and serving behind a consistent terminal interface. Install steps live in the upstream GitHub README — this page covers orientation, subcommand patterns, and scripting integration. For production serving at scale, a dedicated inference stack (vLLM, TGI) is preferable to the CLI server subcommand.
What the Qwen CLI does
A functional overview of the CLI's role in the Qwen developer toolchain.
The Qwen CLI bridges the gap between the hosted chat interface and a fully custom application integration. Developers who want to interact with a locally hosted Qwen model from a terminal — without writing a Python script or spinning up a separate web application — reach for the CLI. It provides a consistent command surface for the most common model interactions: starting an interactive chat, running a one-shot query, downloading weights from Hugging Face, and launching a local API-compatible server that other tools can query.
The CLI is also useful as a diagnostic tool during integration development. When a developer is building a Python application that calls a local Qwen server, the CLI's chat and run subcommands let them verify the server is responding correctly before introducing application-layer complexity. A quick qwen run --model qwen2.5-7b "test prompt" confirms connectivity and output format without needing a separate test harness.
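A minimal sketch of that smoke-test pattern, assuming the run subcommand behaves as described above and the model weights are already pulled:

```bash
#!/usr/bin/env bash
# Smoke test: confirm the local model responds before debugging application code.
set -euo pipefail

if qwen run --model qwen2.5-7b "Reply with the single word OK." | grep -qi "ok"; then
    echo "Model responding; safe to move on to the application layer."
else
    echo "No usable response from the model; check the model setup first." >&2
    exit 1
fi
```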
For embedded engineers and DevOps practitioners who live in the terminal, the CLI is often more comfortable than a browser-based studio. Shell script integration, pipe-compatible output, and the ability to set defaults via a config file mean the CLI fits naturally into existing toolchain patterns rather than requiring a context switch to a browser.
Installation: where to find the authoritative steps
Why this reference page points to the upstream README rather than reproducing install instructions that change with each release.
The Qwen CLI install process is documented in the upstream README on the official Qwen GitHub repository. This reference page intentionally does not reproduce those steps verbatim, because install dependencies, Python version requirements, and platform-specific caveats change with each CLI release. A copy reproduced here would diverge from the authoritative upstream version within weeks of any release; the upstream README is always the correct source.
The general pattern across recent releases is to clone the repository and run pip install against the package directory, or to install a published package directly from PyPI. Some CLI releases are distributed as a standalone binary for platforms where Python dependency management is inconvenient. Check the upstream README's installation section for the current recommended path for your operating system.
The CLI's typical dependencies include a compatible Python version (3.9 or later in recent releases), the transformers library, and a backend inference library such as torch with the appropriate CUDA or MPS bindings for GPU inference. CPU-only inference is supported but slower. The README's "Quick start" section is the practical first read; the "Requirements" section covers the dependency matrix in detail.
On macOS with Apple Silicon, inference via the Metal Performance Shaders backend is available through the appropriate torch build. The CLI's device selection subcommand or configuration key lets you specify mps as the inference device once the right torch build is installed. This typically provides a meaningful speed-up over CPU inference on M-series hardware.
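As a sketch, assuming the configuration key is named device (as described in the configuration section below; verify the exact key name against your CLI release), switching an M-series Mac to the Metal backend looks like:

```bash
# Persist mps as the default inference device; assumes the "device"
# config key covered in the configuration section of this page.
qwen config set device mps

# Verify the resolved value before running inference.
qwen config list
```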
| Subcommand | Purpose | Example usage |
|---|---|---|
| chat | Start an interactive multi-turn conversation with a local Qwen model | qwen chat --model qwen2.5-7b-instruct |
| run | Execute a single non-interactive prompt and print the response to stdout | qwen run --model qwen2.5-7b "Summarise this: $(cat doc.txt)" |
| serve | Launch a local OpenAI-compatible API server backed by a Qwen model | qwen serve --model qwen2.5-14b --port 8080 |
| pull | Download model weights from Hugging Face to the local cache | qwen pull qwen2.5-7b-instruct |
| config set | Write a default value to the CLI config file | qwen config set default_model qwen2.5-7b-instruct |
| config list | Print all current config file values to stdout | qwen config list |
Configuration files and environment variables
How the Qwen CLI resolves its settings and where to look when a default is not behaving as expected.
The Qwen CLI reads its configuration from a YAML file, typically located at ~/.qwen/config.yaml on Linux and macOS, or an equivalent path under the user's home directory on Windows. The config file accepts a small set of keys that correspond to the most commonly overridden defaults: default_model, device, api_base (for pointing the CLI at a remote or alternative inference endpoint), and output_format.
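A sketch of what that file might contain, using the four keys named above; exact key spellings and accepted values should be checked against your CLI release:

```yaml
# ~/.qwen/config.yaml -- illustrative values only
default_model: qwen2.5-7b-instruct
device: cuda                         # or cpu, or mps on Apple Silicon
api_base: http://localhost:8080/v1   # assumed URL shape for an OpenAI-compatible endpoint
output_format: plain                 # matches the --format plain flag discussed below
```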
Environment variables take precedence over config file values, which makes it straightforward to override a setting in a specific shell session without modifying the file. The environment variable names follow a QWEN_ prefix convention: for example, QWEN_DEFAULT_MODEL overrides the default_model config key. This precedence order (environment variables beat the config file, which beats built-in defaults) is the standard Unix convention, and the Qwen CLI follows it consistently.
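For example, using the QWEN_DEFAULT_MODEL variable described above:

```bash
# Session-scoped override: wins over the config file for this shell only.
export QWEN_DEFAULT_MODEL=qwen2.5-14b

# Scoped to a single invocation: the variable never leaks into the session.
QWEN_DEFAULT_MODEL=qwen2.5-7b-instruct qwen run "Quick connectivity check."
```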
For teams using the CLI in CI/CD pipelines, environment-variable configuration is preferable to committing a config file to the repository. Store sensitive values like API keys in the CI platform's secret store and inject them as environment variables at runtime. The config file should contain only non-sensitive defaults that are safe to version-control if needed.
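A hedged sketch of that pattern as a GitHub Actions-style step; the QWEN_API_BASE variable name is an assumption that follows the QWEN_ prefix convention described above, not a documented name:

```yaml
# Illustrative CI step: the secret stays in the platform's store and
# reaches the CLI only as an environment variable at runtime.
- name: Summarise release notes
  env:
    QWEN_API_BASE: ${{ secrets.QWEN_API_BASE }}  # assumed variable name
  run: qwen run --model qwen2.5-7b "Summarise: $(cat CHANGELOG.md)" > summary.txt
```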
The config list subcommand prints the currently resolved configuration, including whether each value came from the config file or an environment variable override. This is the fastest diagnostic when the CLI is behaving unexpectedly — it shows exactly what values the tool sees, without requiring the developer to trace through multiple config layers manually.
Integrating the Qwen CLI into shell scripts
Practical patterns for using the CLI's stdout output in bash and zsh pipelines.
The CLI's run subcommand is designed for scripting. It accepts a prompt as a positional argument, sends it to the configured model, and writes the response to stdout. This makes it composable with standard Unix tools: pipe the output to grep, jq, sed, or any other text processor in the pipeline.
A common pattern is to use command substitution to capture the response into a shell variable: SUMMARY=$(qwen run --model qwen2.5-7b "Summarise: $(cat report.txt)"). The response is then available as $SUMMARY for use in subsequent script steps. Use the --format plain flag to suppress markdown formatting in the response, which prevents stray asterisks and backticks from confusing downstream text processing.
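Putting those pieces together, a sketch under the assumptions above (run prints to stdout, --format plain suppresses markdown):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Capture the one-shot response into a variable for later script steps.
SUMMARY=$(qwen run --model qwen2.5-7b --format plain "Summarise: $(cat report.txt)")

# The plain-text response composes with ordinary Unix tooling.
echo "$SUMMARY" | grep -i "deadline" || echo "No deadline mentioned."
```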
For batch processing — running the same prompt against a list of input files — a for loop or xargs with controlled parallelism handles the job. Be aware that concurrent CLI invocations each spawn a separate model-loading process unless the serve subcommand is used to host the model persistently and the run subcommand is pointed at that server via the --api-base flag. For any batch job larger than a handful of files, the serve-plus-run pattern is far more efficient than spawning a fresh model load per file.
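A sketch of the serve-plus-run pattern; the port, readiness wait, and paths are illustrative and should be adjusted for your setup:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Load the model once in a persistent server, then fan out cheap run calls.
qwen serve --model qwen2.5-7b --port 8080 &
SERVER_PID=$!
trap 'kill "$SERVER_PID"' EXIT

sleep 15   # crude readiness wait; poll the endpoint in production scripts

mkdir -p summaries
for f in inputs/*.txt; do
    qwen run --api-base http://localhost:8080 --model qwen2.5-7b \
        "Summarise: $(cat "$f")" > "summaries/$(basename "$f")"
done
```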
Security considerations apply to shell-script integrations that include user-supplied content in prompts. Prompt injection via file content is a real risk when the input is not sanitised before being included in the model instruction. The NIST AI security guidelines include relevant material on input handling risks for production AI integrations, and the UC Berkeley security group has published research on prompt injection patterns that is worth reviewing before deploying any CLI-based automation against untrusted input.
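As a minimal illustration (not a complete defence against deliberate prompt injection), basic input hygiene might strip control characters and bound the input size before the content reaches the prompt; the sanitise helper below is hypothetical:

```bash
# Hypothetical helper: strips control characters (keeping tab, LF, CR)
# and caps input size. This limits accidental breakage only; deliberate
# prompt injection requires application-level defences beyond sanitisation.
sanitise() {
    tr -d '\000-\010\013\014\016-\037' < "$1" | head -c 8000
}

qwen run --model qwen2.5-7b --format plain \
    "Summarise the following document: $(sanitise untrusted.txt)"
```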
"The serve-and-query pattern in the Qwen CLI finally let us integrate local inference into our firmware validation pipeline without modifying any existing toolchain. One persistent server process, the run subcommand pointing at it — clean and predictable."
Embedded Engineer · Coppermine Fabric Works · Boulder, CO