
· 5 min read
goblin

vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Developed by a team at UC Berkeley, it is a high-performance, low-latency inference and serving framework for large language models (LLMs). It is designed for large-scale production deployment, is particularly good at handling long contexts (e.g. 8k+ tokens) and highly concurrent requests, and significantly optimizes GPU memory utilization, making it one of the highest-throughput open-source LLM inference engines available. See the official website for details.

My environment

Detail       Description
CPU          64c
Memory       500 GiB
GPU          NVIDIA A100 80G * 4
Data disk    500 GiB
OS           Ubuntu 24.04

Install CUDA

  • Go to the CUDA Toolkit Archive page.

  • Choose the CUDA version that matches your driver.

  • Get the download URL for the CUDA installer.

    • CUDA Toolkit installation
        wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-ubuntu2404.pin
      sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
      wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda-repo-ubuntu2404-12-8-local_12.8.1-570.124.06-1_amd64.deb
      sudo dpkg -i cuda-repo-ubuntu2404-12-8-local_12.8.1-570.124.06-1_amd64.deb
      sudo cp /var/cuda-repo-ubuntu2404-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
      sudo apt-get update
      sudo apt-get -y install cuda-toolkit-12-8
    • Driver installation
        sudo apt-get install -y cuda-drivers
    • Configure the CUDA environment variables
        echo 'export PATH=/usr/local/cuda/bin:$PATH' | sudo tee /etc/profile.d/cuda.sh
      source /etc/profile.d/cuda.sh
    • Check that CUDA installed successfully
        nvcc -V
      # GPU information
      nvidia-smi

Install conda

Running vLLM requires a Python environment; it is recommended to create a fresh one with conda.

# (Recommended) Create a new conda environment.
conda create -n vllm python=3.12 -y
conda activate vllm

Switch into the environment:

conda activate vllm

Install vLLM

pip install vllm

Model download

Models can be downloaded directly from Hugging Face, or from a mirror site or ModelScope.

The domestic Hugging Face mirror is fully synchronized with huggingface.co, with no version lag. Models on ModelScope may differ slightly from their Hugging Face counterparts.

Install huggingface-cli

pip install --upgrade huggingface_hub
# To switch to the mirror site:
# export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download adept/fuyu-8b --cache-dir ./path/to/cache

Run the model

To serve external requests, start vLLM in serving mode. A startup example (the flag notes are given as comment lines, since a comment after a trailing backslash would break the command):

# --max-model-len: maximum context length
# --tensor-parallel-size: number of GPUs
# --gpu-memory-utilization: fraction of GPU memory used during inference (default 0.9)
# --dtype: computation precision
# --enforce-eager: disable CUDA graph optimization for better compatibility
python -m vllm.entrypoints.openai.api_server \
--model "/data/models/DeepSeek-R1" \
--served-model-name deepseek_r1 \
--host 0.0.0.0 \
--port 8080 \
--max-model-len 4096 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.95 \
--dtype float16 \
--trust-remote-code \
--enforce-eager

Example using vllm serve:

vllm serve /data/models/DeepSeek-R1 \
--served-model-name deepseek_r1 \
--host 0.0.0.0 \
--port 8080 \
--max-model-len 4096 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.95 \
--dtype float16 \
--trust-remote-code \
--enforce-eager

Functional verification

curl --location 'http://127.0.0.1:8080/v1/chat/completions' --header 'Content-Type: application/json' --data '{
"model": "deepseek_r1",
"messages": [{"role": "user", "content": "hello"}]
}'
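The same check can be made from Python using only the standard library. A minimal sketch; the URL, port, and model name match the serve flags above, and the actual call requires the server to be running:

```python
import json
from urllib import request

API_URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_payload(model: str, content: str) -> bytes:
    """Assemble an OpenAI-compatible chat-completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }).encode("utf-8")

def chat(content: str) -> dict:
    """Send a chat request to the vLLM OpenAI-compatible endpoint."""
    req = request.Request(
        API_URL,
        data=build_payload("deepseek_r1", content),
        headers={"Content-Type": "application/json"},
    )
    # Requires the vLLM server started above to be listening.
    with request.urlopen(req) as resp:
        return json.load(resp)

# Usage (with the server running):
#   chat("hello")["choices"][0]["message"]["content"]
```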

Official benchmarks

git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
  • Benchmark metric meanings

Metric                       Meaning
Avg prompt throughput        Input throughput (prompt tokens/s); 0.0 means there are no new input requests at the moment
Avg generation throughput    Generation throughput (generated tokens/s); 86.8 means the model generates 86.8 tokens per second
Running                      Number of requests currently being processed (actively generating)
Swapped                      Number of swapped-out requests (when GPU memory runs short, some requests are moved to the CPU)
Pending                      Number of waiting requests (not yet processed)
GPU KV cache usage           GPU KV cache utilization; higher values mean more GPU memory is being consumed
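These metrics all appear on a single stats line in the vLLM server log. As a rough sketch, they can be pulled out with a regex; the sample line below is an assumption, since the exact wording and ordering vary between vLLM versions:

```python
import re

# Hypothetical sample stats line; adjust to your vLLM version's actual log output.
LINE = ("Avg prompt throughput: 0.0 tokens/s, "
        "Avg generation throughput: 86.8 tokens/s, "
        "Running: 2 reqs, Swapped: 0 reqs, Pending: 1 reqs, "
        "GPU KV cache usage: 12.5%")

def parse_stats(line: str) -> dict:
    """Extract 'label: number' pairs from a vLLM stats log line."""
    stats = {}
    for label, value in re.findall(r"([A-Za-z ]+?):\s*([\d.]+)", line):
        stats[label.strip()] = float(value)
    return stats

stats = parse_stats(LINE)
print(stats["Avg generation throughput"])  # generation tokens/s
```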

Benchmark example:

CUDA_VISIBLE_DEVICES=0,1,2,3 python benchmark_throughput.py \
  --model "/data/models/deepseek-70b" \
  --backend vllm \
  --input-len 4096 \
  --output-len 10000 \
  --num-prompts 50 \
  --seed 1100 \
  --dtype float16  \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 16384 \
  --cpu-offload-gb 10 \
  --enforce-eager 

Related links

· 2 min read
goblin

Prerequisites

  • A running Kubernetes cluster
  • An Alibaba Cloud account with a DNS domain already created
  • An Alibaba Cloud AccessKey and SecretKey, used by cert-manager to configure DNS records automatically

Install cert-manager

Official documentation

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.0/cert-manager.yaml

DNS01

Official documentation. Note that HTTP-01 does not support wildcard domains, hence DNS-01 is used here.

helm repo add cert-manager-alidns-webhook https://devmachine-fr.github.io/cert-manager-alidns-webhook
helm repo update
helm install alidns-webhook cert-manager-alidns-webhook/alidns-webhook --set groupName=example.com
  • Create the Alibaba Cloud DNS access credentials
apiVersion: v1
kind: Secret
metadata:
  name: alidns-secrets
  namespace: cert-manager
stringData:
  access-key: xxx
  secret-key: xxx
  • Create a ClusterIssuer
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    email: [email protected]
    server: https://acme-v02.api.letsencrypt.org/directory # for testing, use staging (https://acme-staging-v02.api.letsencrypt.org/directory)
    privateKeySecretRef:
      name: letsencrypt
    solvers:
    - dns01:
        webhook:
          config:
            accessTokenSecretRef:
              key: access-key
              name: alidns-secrets
            regionId: cn-beijing # your Alibaba Cloud region
            secretKeySecretRef:
              key: secret-key
              name: alidns-secrets
          groupName: example.com # groupName must match the one configured on the webhook deployment (see the Helm chart's values)!
          solverName: alidns-solver
  • Create a Certificate that uses the ClusterIssuer
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-tls
spec:
  secretName: example-com-tls
  dnsNames:
  - example.com
  - "*.example.com"
  issuerRef:
    name: letsencrypt
    kind: ClusterIssuer

Configure Ingress to request certificates automatically

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt
    nginx.ingress.kubernetes.io/ssl-redirect: 'true'
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app
            port:
              number: 80
  tls:
  - hosts:
    - app.example.com
    secretName: example-com-tls

· One min read
goblin

Check cluster status

$ ETCDCTL_API=3 etcdctl --cacert=/opt/kubernetes/ssl/ca.pem --cert=/opt/kubernetes/ssl/server.pem --key=/opt/kubernetes/ssl/server-key.pem --endpoints=https://10.0.1.2:2379,https://10.0.1.3:2379,https://10.0.1.4:2379 endpoint health

https://10.0.1.2:2379 is healthy: successfully committed proposal: took = 1.698385ms
https://10.0.1.3:2379 is healthy: successfully committed proposal: took = 1.577913ms
https://10.0.1.4:2379 is healthy: successfully committed proposal: took = 5.616079ms

Get a specific key

ETCDCTL_API=3 etcdctl --cacert=/opt/kubernetes/ssl/ca.pem --cert=/opt/kubernetes/ssl/server.pem --key=/opt/kubernetes/ssl/server-key.pem --endpoints=https://10.0.1.2:2379,https://10.0.1.3:2379,https://10.0.1.4:2379 get /registry/apiregistration.k8s.io/apiservices/v1.apps

List all keys

ETCDCTL_API=3 etcdctl --cacert=/opt/kubernetes/ssl/ca.pem --cert=/opt/kubernetes/ssl/server.pem --key=/opt/kubernetes/ssl/server-key.pem --endpoints=https://10.0.1.2:2379,https://10.0.1.3:2379,https://10.0.1.4:2379 get / --prefix --keys-only

Back up with snapshot save

ETCDCTL_API=3 etcdctl --cacert=/opt/kubernetes/ssl/ca.pem --cert=/opt/kubernetes/ssl/server.pem --key=/opt/kubernetes/ssl/server-key.pem --endpoints=https://10.0.1.2:2379,https://10.0.1.3:2379,https://10.0.1.4:2379 snapshot save /data/etcd_backup/etcd-snapshot-`date +%Y%m%d`.db

Keep backups for 10 days

find /data/etcd_backup/ -name "*.db" -mtime +10 -exec rm -f {} \;
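The same retention policy can be sketched in Python, which is handy if the cleanup runs from an existing ops script. The directory path and 10-day cutoff mirror the find command above:

```python
import time
from pathlib import Path

def prune_backups(backup_dir: str, keep_days: int = 10) -> list[str]:
    """Delete *.db snapshot files older than keep_days; return removed names."""
    cutoff = time.time() - keep_days * 86400
    removed = []
    for f in Path(backup_dir).glob("*.db"):
        if f.stat().st_mtime < cutoff:   # file older than the retention window
            f.unlink()
            removed.append(f.name)
    return removed

# Usage: prune_backups("/data/etcd_backup")
```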

Restore from backup

Copy the etcd backup snapshot to the node, stop all kube-apiserver services in the cluster, and stop all etcd services, then restore:

ETCDCTL_API=3 etcdctl snapshot restore /data/etcd_backup/etcd-snapshot-20231225.db \
--name etcd-0 \
--initial-cluster "etcd-0=https://10.0.1.2:2380,etcd-1=https://10.0.1.3:2380,etcd-2=https://10.0.1.4:2380" \
--initial-cluster-token etcd-cluster \
--initial-advertise-peer-urls https://10.0.1.2:2380 \
--data-dir=/var/lib/etcd/default.etcd

· 2 min read
goblin

MySQL's information_schema database can be queried for each table's disk usage and row count.

  • TABLE_SCHEMA: database name
  • TABLE_NAME: table name
  • ENGINE: storage engine used
  • TABLE_ROWS: row count
  • DATA_LENGTH: data size
  • INDEX_LENGTH: index size

Total size of all databases

use information_schema;
select concat(round(sum(DATA_LENGTH/1024/1024),2),'MB') as data from TABLES;

Size of a specific database

select concat(round(sum(DATA_LENGTH/1024/1024),2),'MB') as data  from TABLES where table_schema='xxx';

Size of a specific table in a database

select concat(round(sum(DATA_LENGTH/1024/1024),2),'MB') as data  from TABLES where table_schema='xxx' and table_name='xxx';

Index size of a specific database

SELECT CONCAT(ROUND(SUM(index_length)/(1024*1024), 2), ' MB') AS 'Total Index Size' FROM TABLES  WHERE table_schema = 'xxx';

Index size of a specific table

SELECT CONCAT(ROUND(SUM(index_length)/(1024*1024), 2), ' MB') AS 'Total Index Size' FROM TABLES  WHERE table_schema = 'xxx' and table_name='xxx';

Per-table overview for a database

SELECT CONCAT(table_schema,'.',table_name) AS 'Table Name',
       CONCAT(ROUND(table_rows/1000000,4),'M') AS 'Number of Rows',
       CONCAT(ROUND(data_length/(1024*1024*1024),4),'G') AS 'Data Size',
       CONCAT(ROUND(index_length/(1024*1024*1024),4),'G') AS 'Index Size',
       CONCAT(ROUND((data_length+index_length)/(1024*1024*1024),4),'G') AS 'Total'
FROM information_schema.TABLES
WHERE table_schema LIKE 'xxx';
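The CONCAT(ROUND(...)) pattern in these queries just converts raw byte counts into readable units. The same conversion as a small Python helper (the function name is illustrative):

```python
def fmt_size(num_bytes: int, unit: str = "MB") -> str:
    """Mimic CONCAT(ROUND(bytes/1024/1024, 2), 'MB') from the queries above."""
    divisors = {"KB": 1024, "MB": 1024**2, "GB": 1024**3}
    return f"{round(num_bytes / divisors[unit], 2)}{unit}"

print(fmt_size(5 * 1024 * 1024))    # 5.0MB
print(fmt_size(3 * 1024**3, "GB"))  # 3.0GB
```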

Show non-sleeping connections, sorted by elapsed time descending

show full processlist lists every connection; the query below filters and orders them:

select id, db, user, host, command, time, state, info
from information_schema.processlist
where command != 'Sleep'
order by time desc;

Find threads that have been running for more than 2 minutes, and build kill statements for them

select concat('kill ', id, ';')
from information_schema.processlist
where command != 'Sleep'
and time > 2*60
order by time desc;

Quickly kill all connections

mysql -e "show full processlist;" -ss | awk '{print "KILL "$1";"}'| mysql

· 6 min read
goblin

Get the client's real IP, host, protocol, and port

proxy_set_header Host $http_host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
  • Host carries the client's original host name and port;
  • X-Forwarded-Proto indicates the client's original protocol;
  • X-Real-IP indicates the client's real IP;
  • X-Forwarded-For is similar to X-Real-IP, but with multiple proxy layers it contains the real client IP plus the IP of every intermediate proxy;

Load balancing

After max_fails failed requests within fail_timeout, the server is removed from the pool; after another fail_timeout it is put back into the live list and retried.

http {
    upstream server_name {
        server IP:Port weight=1 max_fails=2 fail_timeout=60s;
        server IP:Port weight=2 max_fails=2 fail_timeout=60s;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://server_name;
        }
    }
}

Static asset caching

location ~* \.(gif|jpg|jpeg|bmp|png|ico|txt|js|css)$ {
    expires 3d;
    add_header Static Nginx-Proxy;
}

Dynamic blacklist

  • Basic configuration
location / {
    deny 192.168.1.1;
    deny 192.168.1.0/24;
    allow 10.0.0.0/16;
    allow 2001:0db8::/32;
    deny all;
}
  • Lua + Redis dynamic blacklist (OpenResty)
yum install yum-utils
yum-config-manager --add-repo https://openresty.org/package/centos/openresty.repo
yum install openresty
yum install openresty-resty
yum --disablerepo="*" --enablerepo="openresty" list available
service openresty start

Configuration (/usr/local/openresty/nginx/conf/nginx.conf)

lua_shared_dict ip_blacklist 1m;

server {
    listen 80;

    location / {
        access_by_lua_file lua/ip_blacklist.lua;
        proxy_pass http://server_name;
    }
}

ip_blacklist.lua

local redis_host = "192.168.1.100"
local redis_port = 6379
local redis_pwd  = "123456"
local redis_db   = 1

-- connection timeout for redis in ms.
local redis_connection_timeout = 100

-- a set key for blacklist entries
local redis_key = "ip_blacklist"

-- cache lookups for this many seconds
local cache_ttl = 60

-- end configuration

local ip = ngx.var.remote_addr
local ip_blacklist = ngx.shared.ip_blacklist
local last_update_time = ip_blacklist:get("last_update_time")

-- update ip_blacklist from Redis every cache_ttl seconds:
if last_update_time == nil or last_update_time < (ngx.now() - cache_ttl) then

    local redis = require "resty.redis"
    local red = redis:new()

    red:set_timeout(redis_connection_timeout)

    local ok, err = red:connect(redis_host, redis_port)
    if not ok then
        ngx.log(ngx.ERR, "Redis connection error while connect: " .. err)
    else
        local ok, err = red:auth(redis_pwd)
        if not ok then
            ngx.log(ngx.ERR, "Redis password error while auth: " .. err)
        else
            local new_ip_blacklist, err = red:smembers(redis_key)
            if err then
                ngx.log(ngx.ERR, "Redis read error while retrieving ip_blacklist: " .. err)
            else
                ngx.log(ngx.INFO, "Fetched " .. #new_ip_blacklist .. " blacklist entries from Redis")
                -- replace the locally stored ip_blacklist with the updated values:
                ip_blacklist:flush_all()
                for index, banned_ip in ipairs(new_ip_blacklist) do
                    ip_blacklist:set(banned_ip, true)
                end
                -- update time
                ip_blacklist:set("last_update_time", ngx.now())
            end
        end
    end
end

if ip_blacklist:get(ip) then
    ngx.log(ngx.ERR, "Banned IP detected and refused access: " .. ip)
    return ngx.exit(ngx.HTTP_FORBIDDEN)
end

Websocket

map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}
upstream ws_backend {
    server IP:Port;
    keepalive 1000;
}
server {
    listen 80;
    location / {
        proxy_http_version 1.1;
        proxy_pass http://ws_backend;
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_read_timeout 3600s;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
    }
}

nginx location matching rules

location [=|~|~*|^~] /uri/ { ... }
  • = exact match; if the request matches this location, searching stops and the request is handled immediately
  • ~ case-sensitive match (regular expressions allowed)
  • ~* case-insensitive match (regular expressions allowed)
  • !~ case-sensitive negative match
  • !~* case-insensitive negative match
  • ^~ used with a plain prefix string, tells nginx not to test regular expressions if this prefix matches

nginx.conf configuration

user nginx;
worker_processes auto;
worker_rlimit_nofile 65535;

error_log /data/logs/nginx/error.log notice;
pid /var/run/nginx.pid;
events {
    worker_connections 65535;
    multi_accept on;
    use epoll;
}
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    log_format main '[$time_local] RemoteAddr:"$remote_addr" RemoteUser:"$remote_user" Host:"$host" '
                    'RequestUri:"$request" HttpStatus:"$status" BodyBytesSent:"$body_bytes_sent" '
                    'HttpReferer:"$http_referer" HttpUserAgent:"$http_user_agent" '
                    'Http_X_ForwardedFor:"$http_x_forwarded_for" UpstreamResponseTime:"$upstream_response_time" '
                    'UpstreamAddr:"$upstream_addr" RequestTime:"$request_time" --- $server_port';
    log_format json '{'
                    '"RemoteAddr":"$remote_addr",'
                    '"RemoteUser":"$remote_user",'
                    '"TimeLocal":"$time_local",'
                    '"RequestUri":"$request",'
                    '"HttpHost":"$http_host",'
                    '"HttpStatus":"$status",'
                    '"BodyBytesSent":"$body_bytes_sent",'
                    '"HttpReferer":"$http_referer",'
                    '"HttpUserAgent":"$http_user_agent",'
                    '"Http_X_ForwardedFor":"$http_x_forwarded_for",'
                    '"SslProtocol":"$ssl_protocol",'
                    '"SslCipher":"$ssl_cipher",'
                    '"UpstreamResponseTime":"$upstream_response_time",'
                    '"UpstreamAddr":"$upstream_addr",'
                    '"RequestTime":"$request_time"'
                    '}';

    access_log /data/logs/nginx/access.log main;

    access_log off;
    server_tokens off;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    send_timeout 300;
    keepalive_timeout 300;
    resolver_timeout 60;
    server_names_hash_max_size 512;
    server_names_hash_bucket_size 128;

    client_body_timeout 300;
    client_header_timeout 300;
    client_header_buffer_size 512k;
    client_max_body_size 300m;
    large_client_header_buffers 8 32k;
    client_body_buffer_size 256k;

    fastcgi_connect_timeout 300;
    fastcgi_send_timeout 300;
    fastcgi_read_timeout 300;
    fastcgi_buffer_size 128k;
    fastcgi_buffers 8 256k;
    fastcgi_busy_buffers_size 256k;
    fastcgi_temp_file_write_size 256k;
    fastcgi_temp_path /tmp/ngx_fcgi_tmp;
    fastcgi_cache_path /tmp/fcgi_cache_path levels=1:2 keys_zone=ngx_fcgi_cache:512m inactive=1d max_size=10g;

    gzip on;
    gzip_http_version 1.1;
    gzip_min_length 1k;
    gzip_buffers 4 16k;
    gzip_comp_level 9;
    gzip_types text/plain application/json application/x-javascript text/css application/xml text/javascript application/x-httpd-php image/jpeg image/gif image/png;
    gzip_vary on;
    gzip_disable "MSIE [1-6]\.";

    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_set_header Host $host;
    proxy_connect_timeout 300;
    proxy_read_timeout 300;
    proxy_send_timeout 300;
    proxy_buffering on;
    proxy_buffer_size 128k;
    proxy_buffers 8 128k;
    proxy_busy_buffers_size 256k;
    proxy_temp_file_write_size 256k;
    proxy_temp_path /tmp/proxy_temp_path;
    proxy_cache_path /tmp/proxy_cache_path levels=1:2 keys_zone=ngx_proxy_cache:512m inactive=1d max_size=10g;
    include /etc/nginx/conf.d/*.conf;
}
include /etc/nginx/stream.d/*.conf;

stream configuration

stream {
    upstream server_name {
        server IP:Port;
    }
    server {
        listen Port;
        proxy_pass server_name;
        proxy_connect_timeout 1h;
        proxy_timeout 1h;
    }
}

server configuration

server {
    listen 80;
    listen 81;
    server_name 127.0.0.1 example.com;
    index index.php index.html index.htm;
    root /data/www;
    charset utf-8;
    access_log /data/logs/example.com.log main;

    location / {
        if (!-e $request_filename) {
            rewrite ^/(.*) /index.php?$1 last;
        }
    }
    location /xxx/ {
        if ($arg_icpid = "4pd1mtsDhfe") {
            proxy_pass http://127.0.0.1:38888/test$request_uri&tbid=aldIthSBg04;
        }
        proxy_pass http://127.0.0.1:9302;
    }
    # a server block may not define the same location twice, so this variant uses a different path
    location /yyy/ {
        proxy_pass http://127.0.0.1:38888/;
    }
    location ~ \.php$ {
        fastcgi_param REMOTE_ADDR $http_x_real_ip;
        fastcgi_param LY_ADDRESS $remote_addr;
        fastcgi_pass unix:/dev/shm/php-cgi.sock;
        fastcgi_index index.php;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_param SERVERNAME $hostname;
        include fastcgi_params;
    }
    location = /favicon.ico {
        log_not_found off;
        access_log off;
    }
    location = /robots.txt {
        allow all;
        log_not_found off;
        access_log off;
    }
}

server {
    listen 80;
    listen 443 ssl;
    server_name example.com;
    charset utf-8;
    access_log /data/logs/example.com.log main;

    ssl_certificate /etc/nginx/cert/example.com.pem;
    ssl_certificate_key /etc/nginx/cert/example.com.key;
    ssl_session_timeout 5m;
    ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE:ECDH:AES:HIGH:!NULL:!aNULL:!MD5:!ADH:!RC4;
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
    ssl_prefer_server_ciphers on;

    location / {
        proxy_pass http://127.0.0.1:38888;
    }
    location = /favicon.ico {
        log_not_found off;
        access_log off;
    }
    location = /robots.txt {
        allow all;
        log_not_found off;
        access_log off;
    }
}

· One min read
goblin

HTTP/1.0

Istio uses Envoy to forward HTTP requests, and Envoy by default requires HTTP/1.1 or HTTP/2; when a client uses HTTP/1.0, it returns a 426 "low version" response.

nginx scenario

When nginx reverse-proxies with proxy_pass, it uses HTTP/1.0 by default; set proxy_http_version to 1.1:

server {
    ...
    location /xxx/ {
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}

· One min read
goblin

Import the repository

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org

rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm

List installable packages

# The ML (mainline) version is the newest; the LT (long-term) version is the stable one
yum --enablerepo="elrepo-kernel" list --showduplicates | sort -r | grep kernel-ml.x86_64

# Install the ML version
yum --enablerepo=elrepo-kernel install kernel-ml-devel kernel-ml -y

# Install the LT version
yum --enablerepo=elrepo-kernel install kernel-lt-devel kernel-lt -y

View the existing kernel boot order

awk -F\' '$1=="menuentry " {print $2}' /etc/grub2.cfg

CentOS Linux (4.4.179-1.el7.elrepo.x86_64) 7 (Core)

CentOS Linux (3.10.0-693.el7.x86_64) 7 (Core)

Set the default kernel boot index

grub2-set-default 0