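From C to Rust: From Fake Data to Real Hardware

1. The Trigger: A Performance Test

The first four Layers had been translated, all 14 integration tests passed, and an A/B comparison of safety and performance had even been completed on QEMU. Everything looked perfect. Then came a benchmark of the low-level native API: bypassing the SDAA Runtime wrapper and calling the lowest-level gmalloc/gmemcpy_to_device/gsync directly to measure DMA bandwidth.

The result was absurd.

C    (original libgdev.so):  187.82 ms  @ 340.8 MB/s     ← real DMA
Rust (libgdev_layer4):         0.02 ms  @ 3,275,435 MB/s ← physically impossible

The Rust version's DMA took only 0.02 ms. That is not DMA, that is quantum entanglement. Digging in revealed an embarrassing fact: after firing the ring buffer (notifying the hardware to start DMA), the Rust version never waited for the hardware to finish. The function returned success immediately, the later gsync timed out because the fence id did not match, and the benchmark never checked return values, so the timer recorded the "fake success" as 0.02 ms.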
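The failure mode above (a timer wrapped around a call whose return value nobody checks) can be guarded against in the benchmark itself. The following is a minimal sketch, not the project's benchmark: `timed_copy` and `bandwidth_mib_s` are hypothetical names, the closure stands in for "copy, then gsync", and the MiB unit convention is an assumption.

```rust
use std::time::{Duration, Instant};

/// Bandwidth in MiB/s for `bytes` moved in `elapsed`.
fn bandwidth_mib_s(bytes: usize, elapsed: Duration) -> f64 {
    (bytes as f64 / (1024.0 * 1024.0)) / elapsed.as_secs_f64()
}

/// Times `op` and reports a bandwidth only when `op` returns 0.
/// `op` is a hypothetical stand-in for "start the copy, then wait on the fence".
fn timed_copy<F: FnMut() -> i32>(bytes: usize, mut op: F) -> Result<f64, i32> {
    let start = Instant::now();
    let status = op();
    let elapsed = start.elapsed();
    if status != 0 {
        // A failed or timed-out transfer has no meaningful bandwidth.
        return Err(status);
    }
    Ok(bandwidth_mib_s(bytes, elapsed))
}
```

With a check like this, a gsync timeout surfaces as an error code instead of a record-breaking number.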

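The more serious problem came next: why did those 14 integration tests all see correct data? Because in the Layer 4 implementation the AI had kept a SW_DEVMEM (software device memory) cache layer. When gmemcpy_to_device sent data to the hardware ring buffer, it also wrote the data into this HashMap. And when gmemcpy_from_device read it back, it took the data straight from the HashMap if the fence had not completed. So the out[i]==3*i that the test program saw was correct, but the data never went through physical DMA; it went through a user-space memory simulation.

That is not "tests passing"; that is "data fabrication". Even though the layer was designed to validate upper-level logic without hardware, relying on a software fallback once the hardware is right in front of you is a betrayal of authenticity.

The decision: delete all SW_DEVMEM software-simulation code and cut off every route to a "fake pass". From that moment on, every bit of data had to go through /dev/aicard, through aicard.ko, and through the real PCIe DMA engine. Either the hardware returns the correct result or an error is reported. There is no third path.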

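2. Environment Trouble: A Corrupted Conversation and a Surprise from the New QEMU

2.1 DeepSeek context overflow corrupts the conversation

The first obstacle once real debugging began was not code but tooling. The debugging had produced dozens of rounds of interaction (compile errors, run logs, GDB backtraces, code changes), and the conversation context swelled to the limit of what DeepSeek could handle. After one message was sent the model stopped responding, and after a page refresh the whole conversation was corrupted: all history was gone, leaving only an empty input box.

The model's context window could not hold the entire debugging session, and DCP (the conversation compression protocol) failed as well. The only option was to start a new conversation, record the key environment details, the list of bugs found so far, and the QEMU login method and path configuration by hand in a Markdown file, and then rebuild the debugging environment in the new conversation.

This exposes a practical problem: in very long-context engineering work, toolchain robustness matters just as much as the code. If conversation management is not robust, a single context overflow can wipe out days of debugging state. So build the habit of writing the environment status and the current problem into agent.md rather than relying heavily on conversation history.

2.2 Frequent Rust crashes after switching QEMU

The old QEMU environment had in fact been broken all along. The original C launch test did not crash there, but its output was wrong too: the values in the out array were incorrect, and because the verification code was disabled with #if 0, it always printed “Test passed”. We initially assumed the C version "worked", and only learned the truth after switching to a new QEMU.

On the new QEMU, the C version's launch test produced the correct output: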

out[0]=0 in[0]=0 in2[0]=0
out[1]=3 in[1]=2 in2[1]=1
out[2]=6 in[2]=4 in2[2]=2
out[3]=9 in[3]=6 in2[3]=3
Test passed: size = 0x10000, dev_num = 0

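out[i] == 3*i: the vector addition A[i]+B[i]=C[i] really did execute on the hardware. The kernel genuinely exists and is executable.

The Rust version, however, crashed outright on the new QEMU: the whole QEMU process segfaulted. This was not a graceful exit of a user-space program; the virtual machine itself blew up. It means the ring buffer commands issued by the Rust version triggered an illegal operation at the hardware level. The old QEMU was probably lenient because of some defect, while the new QEMU validates commands strictly. "It runs" on the old environment does not mean the code is correct, only that the environment does not report errors.

3. The Hard Fight: Eleven Bugs, One by One

Each bug's symptom, root cause, fix, and lesson is recorded below, in the order they were actually found and fixed.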

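Bug 1: Wrong value written to REG_PB_PUT → gsync timeout

Symptom: the launch2 test reports gsync TIMEOUT; the fence value stays at 0xFFFFFFFF and the hardware does not respond.

Root cause: at layer2/src/backend.rs:75, fifo_push wrote pb_pos / 8 to the REG_PB_PUT register. The C code (gdev_nvidia_fifo.c:49) writes the raw value of ctx->fifo.pb_pos. REG_PB_PUT is a byte offset, but the Rust port treated it as a 64-bit word index, so the hardware believed the ring buffer position was somewhere else. It fetched commands from the wrong location and read nothing but garbage.

Fix: pb_pos / 8 → pb_pos. Only a few characters changed.

Lesson: when porting hardware register accesses, verify the semantics of every written value bit by bit. A single division by 8 decided whether the hardware received its commands at all. On the old QEMU this kind of error may pass silently (the firmware does not validate the position), but on real hardware the result is dead silence.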
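One way to keep "byte offset" and "word index" from being mixed up again is to give each unit its own type, so the register write only accepts the right one. This is a sketch of that idea, not the project's code: `ByteOffset`, `WordIndex`, and `MockRegs` are hypothetical, and the 8-byte word size comes from the "64-bit word" description above.

```rust
/// A push-buffer position measured in bytes.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ByteOffset(u32);

/// A push-buffer position measured in 64-bit command words.
#[derive(Debug, Clone, Copy, PartialEq)]
struct WordIndex(u32);

impl WordIndex {
    /// Converts a word index to the byte offset the register expects.
    fn to_bytes(self) -> ByteOffset {
        ByteOffset(self.0 * 8)
    }
}

/// Hypothetical stand-in for the memory-mapped register block.
struct MockRegs {
    pb_put: u32,
}

impl MockRegs {
    /// Accepts only a byte offset, so a word index cannot be written by accident.
    fn write_pb_put(&mut self, pos: ByteOffset) {
        self.pb_put = pos.0;
    }
}
```

Passing a bare `pb_pos / 8` to `write_pb_put` would no longer compile; the caller has to state which unit the number is in.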

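Bug 2: D2H passes a host virtual address to DMA → segfault

Symptom: data read back by D2H is all zeros, or the program segfaults.

Root cause: C's __gmemcpy_from_device_locked uses a three-step DMA bounce buffer pattern: the device DMAs into a buffer at a physical address → gsync waits → the CPU memcpys into the host destination buffer. Rust's gmemcpy_from_device passed dst_buf as u64 (a host virtual address such as 0x7f...) straight to the DMA engine. The DMA engine needs a physical address; given a user-space virtual address, it writes to a completely wrong memory location.

Fix: add the bounce buffer flow. The DMA destination is dma_phys (the physical address of the DMA buffer), and after gsync completes the CPU copies into dst_buf.

Lesson: the DMA engine knows nothing about virtual memory. Every address handed to hardware must go through address translation (gphysget); this is the most basic contract between kernel space and user space.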
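The three-step bounce buffer pattern can be illustrated with a small in-memory mock. Everything here is a stand-in: `MockDev` simulates device memory and the DMA buffer with two vectors, `sync` completes instantly, and none of the names correspond to the real driver API.

```rust
/// Mock device. `mem` plays device memory; `dma` plays the DMA bounce buffer
/// that the hardware can reach by physical address.
struct MockDev {
    mem: Vec<u8>,
    dma: Vec<u8>,
}

impl MockDev {
    /// Step 1: the "DMA engine" copies device memory into the bounce buffer.
    fn dma_to_bounce(&mut self, dev_off: usize, len: usize) {
        self.dma[..len].copy_from_slice(&self.mem[dev_off..dev_off + len]);
    }

    /// Step 2: stand-in for gsync. The mock finishes instantly.
    fn sync(&self) -> Result<(), &'static str> {
        Ok(())
    }
}

/// Reads `dst.len()` bytes from device offset `dev_off` via the bounce buffer.
fn copy_from_device(
    dev: &mut MockDev,
    dev_off: usize,
    dst: &mut [u8],
) -> Result<(), &'static str> {
    let len = dst.len();
    if len > dev.dma.len() {
        return Err("transfer larger than bounce buffer; chunk it");
    }
    if dev_off + len > dev.mem.len() {
        return Err("read past end of device memory");
    }
    dev.dma_to_bounce(dev_off, len);
    dev.sync()?;
    // Step 3: only now does the CPU copy into the caller's buffer.
    dst.copy_from_slice(&dev.dma[..len]);
    Ok(())
}
```

The caller's buffer is touched only by the CPU in step 3; the device never sees its address.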

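Bug 3: D2H missing fence_reset → false synchronization

Symptom: D2H reads back all zeros even though the data has been written to the device.

Root cause: C's gdev_memcpy has three steps: fence_reset → memcpy → fence_write. The Rust D2H path skipped fence_reset. The previous H2D had already written fence[1]=1, so when D2H read fence[1] with the same seq it found 1 already there, and gsync concluded the operation was "complete" and returned immediately. At that point the DMA had not even started.

Fix: call aidev_fence_reset(ctx, seq) before D2H starts.

Lesson: a fence is a state machine, not a flag. It must be reset before every use; otherwise what you read is always the stale value left by the previous operation.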
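The stale-fence effect is easy to reproduce with a mock fence table. This is only an illustration: `MockFences` is hypothetical, the completion value 1 follows the fence[1]=1 example above, and the pending value (all ones) is an assumption.

```rust
/// Value a slot holds while an operation is pending (assumed for this sketch).
const FENCE_PENDING: u32 = u32::MAX;
/// Value written on completion, as in the fence[1]=1 example above.
const FENCE_DONE: u32 = 1;

/// Hypothetical in-memory fence table with four slots.
struct MockFences {
    slots: [u32; 4],
}

impl MockFences {
    fn new() -> Self {
        MockFences { slots: [FENCE_PENDING; 4] }
    }

    /// What fence_reset does: mark the slot as pending.
    fn reset(&mut self, seq: usize) {
        self.slots[seq] = FENCE_PENDING;
    }

    /// What the hardware does when the fenced operation finishes.
    fn hw_complete(&mut self, seq: usize) {
        self.slots[seq] = FENCE_DONE;
    }

    /// What a gsync-style poll checks.
    fn is_done(&self, seq: usize) -> bool {
        self.slots[seq] == FENCE_DONE
    }
}
```

If a second operation reuses slot 1 without calling `reset`, `is_done(1)` is already true before the hardware has done anything, which is exactly the false synchronization described above.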

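Bug 4: ctx_from_handle resets fence.seq=0 on every call → all operations share slot 1

Symptom: fences of different operations interfere with each other.

Root cause: runtime.rs:64 hard-codes fence.seq: 0 and builds a brand-new GdevCtx on the stack. Whatever fence.seq the persistent ctx in the handle holds, every call starts from 0 and increments to 1. All operations (H2D, D2H, launch) share fence slot 1.

Fix: read the persistent seq from handle.ctx and increment it modulo GDEV_FENCE_COUNT (256). Each operation gets its own fence slot.

Lesson: state that the C code keeps across operations (such as ctx->fence.seq) must also persist in the Rust port. Creating a zero-valued copy silently breaks the synchronization logic: the compiler does not complain and tests may pass by luck, but things inevitably go wrong once multiple operations are in flight.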
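A minimal sketch of a persistent sequence counter follows. `PersistentCtx` and `next_seq` are hypothetical names, and the exact increment rule (plus one, modulo 256) is an assumption based on the fix described above; the real code may skip values or reserve slot 0.

```rust
/// Number of fence slots, as stated in the fix above.
const GDEV_FENCE_COUNT: u32 = 256;

/// Hypothetical context that lives as long as the handle does.
struct PersistentCtx {
    fence_seq: u32,
}

impl PersistentCtx {
    /// Advances the stored sequence number and returns the new slot.
    /// The state is kept in `self`, not rebuilt from zero on every call.
    fn next_seq(&mut self) -> u32 {
        self.fence_seq = (self.fence_seq + 1) % GDEV_FENCE_COUNT;
        self.fence_seq
    }
}
```

The important property is that the counter is borrowed mutably from long-lived state, so two consecutive operations cannot receive the same slot.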

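Bug 5: dma_size uninitialized → infinite loop

Symptom: gmemcpy_to_device hangs and the process stops responding.

Root cause: api.rs:155. dma_size was never assigned when the handle was initialized → dma_buf_size = 0 → chunk size chunk = 0 → while copied < size never exits.

Fix: set dma_size = 2 * 1024 * 1024 (2 MB) during gopen initialization.

Lesson: values that C initializes globally or statically are easy to miss in Rust's explicit construction flow. Rust's "explicit over implicit" principle is a double-edged sword here: it stops you from quietly relying on a compiler default, but it also makes it easier to overlook a field.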

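Bug 6: H2D does not chunk → buffer overflow

Symptom: memset2 reports out[3670012]=0 and then segfaults.

Root cause: H2D wrote all the data (for example 16 MB) into the 2 MB DMA buffer in one copy_nonoverlapping call, overflowing it and corrupting adjacent memory (fence buffer, ring buffer, and so on).

Fix: add a chunking loop, let chunk = min(size - offset, dma_buf_size), and run DMA + fence + gsync chunk by chunk.

Lesson: a buffer size is an upper bound, not a suggestion. Any copy_nonoverlapping that does not check sizes is a time bomb.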
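Bugs 5 and 6 are two sides of the same loop: the chunk size must be bounded above by the DMA buffer and must never be zero. The sketch below computes only the (offset, length) plan for a transfer; `plan_chunks` is a hypothetical helper, and the real code would issue DMA + fence + gsync for each entry.

```rust
/// Splits a transfer of `total` bytes into (offset, len) chunks,
/// each no larger than `dma_buf_size`.
fn plan_chunks(total: usize, dma_buf_size: usize) -> Result<Vec<(usize, usize)>, &'static str> {
    if dma_buf_size == 0 {
        // Guards against the Bug 5 case: a zero chunk size would loop forever.
        return Err("dma_buf_size is 0 (uninitialized?)");
    }
    let mut chunks = Vec::new();
    let mut offset = 0;
    while offset < total {
        // Guards against the Bug 6 case: never exceed the DMA buffer.
        let chunk = (total - offset).min(dma_buf_size);
        chunks.push((offset, chunk));
        offset += chunk;
    }
    Ok(chunks)
}
```

For a 16 MiB transfer through a 2 MiB buffer this yields eight chunks, the last one starting at offset 14680064.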

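Bug 7: glaunch missing fence management → no notification when the kernel finishes

Symptom: with memset2 passing, launch2 shows out[i]=i instead of the expected out[i]=3*i.

Root cause: Rust's glaunch (runtime.rs:136) only called aidev_launch, with no fence operations. C's gdev_launch (gdev_nvidia_compute.c:45) runs the full sequence: membar → fence_reset(seq) → compute.launch → fence_write(seq) → notify_intr → return seq. sdLaunch adds the returned seq to sync_list, and the later sdCtxSynchronize calls gsync on each seq. Rust returned id=0, so gsync waited on nothing, and the function returned without waiting for the hardware to finish.

Fix: complete the sequence fence_reset → aidev_launch → fence_write → return seq, and add a gsync in sd_launch.

Lesson: every hardware operation needs fence synchronization. Without a fence, the caller's gsync never sees a valid event. The software fallback (SW_DEVMEM) could "pass" only because it bypassed the fence mechanism entirely and wrote back a result computed on the CPU.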
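The fence_reset → operation → fence_write → return seq sequence can be captured in one helper so that no hardware operation is submitted without it. The sketch records commands in a vector instead of talking to a device; `Cmd`, `MockQueue`, and `submit_fenced` are hypothetical, and the membar and notify_intr steps of the C code are left out.

```rust
/// Commands as they would appear in the ring buffer (simplified).
#[derive(Debug, PartialEq)]
enum Cmd {
    FenceReset(u32),
    Launch,
    FenceWrite(u32),
}

/// Hypothetical command queue that only records what was submitted.
struct MockQueue {
    cmds: Vec<Cmd>,
}

impl MockQueue {
    /// Brackets `op` with a fence reset and a fence write on slot `seq`,
    /// then returns `seq` so the caller can wait on it.
    fn submit_fenced(&mut self, seq: u32, op: Cmd) -> u32 {
        self.cmds.push(Cmd::FenceReset(seq));
        self.cmds.push(op);
        self.cmds.push(Cmd::FenceWrite(seq));
        seq
    }
}
```

Returning `seq` rather than a constant 0 is what gives the caller's gsync something real to wait for.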

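Bug 8: Cubin parser rejects libcalculate.so → kernel never loaded

Symptom: after fixing Bug 7, launch shows code_addr=0 and the kernel does not run.

Root cause chain:

  1. C's cubin parser accepts .so files in ET_DYN (shared library) format and looks for .aitext.* sections to extract kernel code. Rust's cubin.rs:106 hard-coded a check for ET_REL + EM_NVIDIA → it rejected libcalculate.so (ET_DYN, machine=0x9906) outright.

  2. The section's code_offset used sh_offset (the file offset, 0x14430) instead of sh_addr (the section's virtual address, 0x4430) → the code_pc passed to the hardware was completely wrong.

Fix: remove the ET_REL+EM_NVIDIA check, add .aitext.* section matching, add a file_size field, and take code_offset from sh_addr.

Lesson: do not assume the C code's ELF parsing only supports the standard NVIDIA cubin format. Custom hardware has custom section names, and format checks should be lenient: skip what you do not recognize, but do not reject it. Better to let something through than to reject it wrongly.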
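The section-selection rule after the fix can be sketched over an already-parsed section table. This is not an ELF parser: `Section` is a simplified stand-in for a section header, `kernel_entries` is a hypothetical helper, and the idea that the kernel name is the suffix after ".aitext." is an assumption.

```rust
/// Simplified view of one ELF section header (only the fields used here).
#[allow(dead_code)]
struct Section<'a> {
    name: &'a str,
    sh_addr: u64,   // virtual address of the section
    sh_offset: u64, // offset of the section in the file
}

/// Returns (kernel name, code_pc) for every `.aitext.*` section.
/// Unrecognized sections are skipped rather than treated as errors,
/// and code_pc is taken from sh_addr, not sh_offset.
fn kernel_entries<'a>(sections: &[Section<'a>]) -> Vec<(&'a str, u64)> {
    sections
        .iter()
        .filter_map(|s| s.name.strip_prefix(".aitext.").map(|k| (k, s.sh_addr)))
        .collect()
}
```

Using the numbers from the root cause above, a section with sh_offset 0x14430 and sh_addr 0x4430 yields 0x4430 as its code_pc.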

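Bug 9: gload missing fence management → code freed before loading completes

Symptom: kernel output is still wrong, or the hardware accesses freed memory.

Root cause: C's gdev_load does fence_reset → compute.load → fence_write → return seq, and gload then calls gdev_poll(seq) to wait. Rust's gload only called compute.load(): no fence, no wait. The code was sent to the hardware, execution continued without waiting for completion, and the gfree that followed released that device memory while the hardware was still reading it in the background.

Fix: rewrite gload with the full sequence: get seq via ctx_from_handle → fence_reset(seq) → aidev_load → fence_write(seq) → gsync(seq).

Lesson: gload is not fire-and-forget. It moves an ELF file from device memory into the hardware's instruction memory, and that takes time. Without a fence wait, the subsequent gfree is a use-after-free, except that the "user" is the DMA engine rather than user-space code running on the CPU.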

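Bug 10: H2D/D2H direction values swapped

Symptom: no visible effect on the old QEMU; a crash on the new QEMU.

Root cause: C defines MEMCPY_FUNC_READ=1 (H2D, read device memory) and MEMCPY_FUNC_WRITE=2 (D2H, write device memory). Rust used dir=2 for H2D and dir=1 for D2H, exactly the reverse.

Fix: swap the direction values to match C's definitions.

Lesson: DMA direction codes look like plain numbers, but the hardware firmware uses them to decide data flow direction and permission checks. The old QEMU firmware does not validate the direction (write it backwards and it complies anyway), whereas the new QEMU's DMA engine enforces read/write permissions strictly by the direction bits, so a reversed direction is an illegal operation.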

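Bug 11: Virtual address passed straight to the DMA engine → QEMU crashes

Symptom: the new QEMU crashes immediately at gopen; the whole virtual machine segfaults.

Root cause: the dst parameter of gmemcpy_to_device is a device virtual address (0x6100...), which was passed directly to aidev_memcpy → ring buffer. The old QEMU firmware does not validate the address type, but the new QEMU (SW26010 firmware) has a real IOMMU/SMMU and DMA engine; accessing an untranslated virtual address → bus fault → the whole QEMU crashes.

Fix: call gphysget() to convert the virtual address to a physical address before passing it to memcpy.

Lesson: the gap between a physical machine and a virtual machine lies in the IOMMU. A virtual address is valid in user space, but to the DMA engine it is unauthorized. For every address handed to hardware, confirm: is this a physical address or a virtual one? If it is not physical, has it been translated?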
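The translation step amounts to a range lookup over the list of allocated regions. The sketch below is not the real gphysget; `MemRegion` and `virt_to_phys` are hypothetical, and the field layout is only an assumption about what such a memory list might hold (a virtual base, a physical base, and a size per region).

```rust
/// One allocated device-memory region (hypothetical layout).
struct MemRegion {
    vaddr: u64,
    phys: u64,
    size: u64,
}

/// Translates a device virtual address to a physical address by finding
/// the region that contains it. Returns None for unknown addresses
/// instead of passing them through to the hardware.
fn virt_to_phys(regions: &[MemRegion], vaddr: u64) -> Option<u64> {
    regions
        .iter()
        .find(|r| vaddr >= r.vaddr && vaddr - r.vaddr < r.size)
        .map(|r| r.phys + (vaddr - r.vaddr))
}
```

Returning `Option` forces the caller to handle the "not translated" case explicitly, so an unmapped address becomes an error in software rather than a bus fault in the device.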

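4. Final Test Results

With the eleventh bug fixed, we rebuilt, redeployed, and reran everything on the new QEMU. This time the output is not fake data hidden behind “Test passed”, but real results processed by the hardware DMA engine and read back:

memset2 PASS ✓ out[i] == 0xcccccccc compared word by word; the full H2D+D2H path is verified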

[root@qemu-ai-ep memset2]# ./user_test 
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: enumerate_devices enter
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: sd_init
DBG: dev_count=4
DBG: gopen(0)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7f6c6f64d000, token=0x1)
DBG: gphysget addr=0x7f6c6f64d000 ty=2 vas.mem_list.len()=0
DBG: gopen(0) -> 0x1
DBG: gopen(1)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7f6c6f44d000, token=0x2)
DBG: gphysget addr=0x7f6c6f44d000 ty=2 vas.mem_list.len()=0
DBG: gopen(1) -> 0x2
DBG: gopen(2)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7f6c6f24d000, token=0x3)
DBG: gphysget addr=0x7f6c6f24d000 ty=2 vas.mem_list.len()=0
DBG: gopen(2) -> 0x3
DBG: gopen(3)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7f6c6f04d000, token=0x4)
DBG: gphysget addr=0x7f6c6f04d000 ty=2 vas.mem_list.len()=0
DBG: gopen(3) -> 0x4
DBG: enumerate_devices push ctx handle=0x1 tid_raw=1572 tid_i32=1572
DBG: ctx_list_push OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: gmalloc vaddr=0x61000100c000 phy=0x7100c000 sz=0x1000000
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: memcpy handle=0x1 dc=4
DBG: H2D h=0x1 dst=0x61000100c000 src=0x7f6c6e04c010 sz=16777216
DBG: gmemcpy_to_device token=0x1 dst=0x61000100c000 src=0x7f6c6e04c010 sz=16777216
DBG: gmemcpy_to_device calling ctx_from_handle(0x1)
DBG: ctx_from_handle OK minor=0 cid=0
DBG: gmemcpy_to_device ctx.fence.map=0x7f6c71f4e000 ctx.fence.phy=0x10103a000
DBG: dma_buf=0x7f6c6f64d000 dma_phys=0x10103e000 dma_buf_size=2097152
DBG: gphysget addr=0x61000100c000 ty=1 vas.mem_list.len()=1
DBG:   mem[0] vaddr=0x61000100c000 phy=0x7100c000 sz=0x1000000
DBG: gphysget DEVICE result=0x7100c000
DBG: dev_base=0x7100c000 (physical via gphysget) dst=0x61000100c000
DBG: copying chunk=2097152 offset=0 src=0x7f6c6e04c010 -> dma=0x7f6c6f64d000
DBG: copy done
DBG: firering memcpy+reset+fence_write for chunk=2097152 seq=1
DBG: fence_reset seq=1
DBG: calling compute.memcpy dev=0x7100c000 src=0x10103e000 dst=0x7100c000 sz=2097152
DBG: memcpy done
DBG: fence_write done
DBG: calling gsync token=0x1 seq=1
DBG: gsync returned 0
DBG: copying chunk=2097152 offset=2097152 src=0x7f6c6e24c010 -> dma=0x7f6c6f64d000
DBG: copy done
DBG: firering memcpy+reset+fence_write for chunk=2097152 seq=1
DBG: fence_reset seq=1
DBG: calling compute.memcpy dev=0x7120c000 src=0x10103e000 dst=0x7120c000 sz=2097152
DBG: memcpy done
DBG: fence_write done
DBG: calling gsync token=0x1 seq=1
DBG: gsync returned 0
DBG: copying chunk=2097152 offset=4194304 src=0x7f6c6e44c010 -> dma=0x7f6c6f64d000
DBG: copy done
DBG: firering memcpy+reset+fence_write for chunk=2097152 seq=1
DBG: fence_reset seq=1
DBG: calling compute.memcpy dev=0x7140c000 src=0x10103e000 dst=0x7140c000 sz=2097152
DBG: memcpy done
DBG: fence_write done
DBG: calling gsync token=0x1 seq=1
DBG: gsync returned 0
DBG: copying chunk=2097152 offset=6291456 src=0x7f6c6e64c010 -> dma=0x7f6c6f64d000
DBG: copy done
DBG: firering memcpy+reset+fence_write for chunk=2097152 seq=1
DBG: fence_reset seq=1
DBG: calling compute.memcpy dev=0x7160c000 src=0x10103e000 dst=0x7160c000 sz=2097152
DBG: memcpy done
DBG: fence_write done
DBG: calling gsync token=0x1 seq=1
DBG: gsync returned 0
DBG: copying chunk=2097152 offset=8388608 src=0x7f6c6e84c010 -> dma=0x7f6c6f64d000
DBG: copy done
DBG: firering memcpy+reset+fence_write for chunk=2097152 seq=1
DBG: fence_reset seq=1
DBG: calling compute.memcpy dev=0x7180c000 src=0x10103e000 dst=0x7180c000 sz=2097152
DBG: memcpy done
DBG: fence_write done
DBG: calling gsync token=0x1 seq=1
DBG: gsync returned 0
DBG: copying chunk=2097152 offset=10485760 src=0x7f6c6ea4c010 -> dma=0x7f6c6f64d000
DBG: copy done
DBG: firering memcpy+reset+fence_write for chunk=2097152 seq=1
DBG: fence_reset seq=1
DBG: calling compute.memcpy dev=0x71a0c000 src=0x10103e000 dst=0x71a0c000 sz=2097152
DBG: memcpy done
DBG: fence_write done
DBG: calling gsync token=0x1 seq=1
DBG: gsync returned 0
DBG: copying chunk=2097152 offset=12582912 src=0x7f6c6ec4c010 -> dma=0x7f6c6f64d000
DBG: copy done
DBG: firering memcpy+reset+fence_write for chunk=2097152 seq=1
DBG: fence_reset seq=1
DBG: calling compute.memcpy dev=0x71c0c000 src=0x10103e000 dst=0x71c0c000 sz=2097152
DBG: memcpy done
DBG: fence_write done
DBG: calling gsync token=0x1 seq=1
DBG: gsync returned 0
DBG: copying chunk=2097152 offset=14680064 src=0x7f6c6ee4c010 -> dma=0x7f6c6f64d000
DBG: copy done
DBG: firering memcpy+reset+fence_write for chunk=2097152 seq=1
DBG: fence_reset seq=1
DBG: calling compute.memcpy dev=0x71e0c000 src=0x10103e000 dst=0x71e0c000 sz=2097152
DBG: memcpy done
DBG: fence_write done
DBG: calling gsync token=0x1 seq=1
DBG: gsync returned 0
DBG: H2D gmemcpy_to_device returned 0
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: memcpy handle=0x1 dc=4
DBG: D2H h=0x1 src=0x61000100c000 dst=0x7f6c6f84d010 sz=16777216
DBG: gphysget addr=0x61000100c000 ty=1 vas.mem_list.len()=1
DBG:   mem[0] vaddr=0x61000100c000 phy=0x7100c000 sz=0x1000000
DBG: gphysget DEVICE result=0x7100c000
DBG: D2H round offset=0 chunk=2097152 dma_phys=0x10103e000 dev_phys=0x7100c000
DBG: D2H fence_reset seq=10
DBG: D2H memcpy dst=0x10103e000 src=0x7100c000 sz=2097152 dir=1
DBG: D2H fence_write seq=10
DBG: D2H gsync returned 0
DBG: D2H copy_nonoverlapping done chunk=2097152
DBG: D2H round offset=2097152 chunk=2097152 dma_phys=0x10103e000 dev_phys=0x7120c000
DBG: D2H fence_reset seq=10
DBG: D2H memcpy dst=0x10103e000 src=0x7120c000 sz=2097152 dir=1
DBG: D2H fence_write seq=10
DBG: D2H gsync returned 0
DBG: D2H copy_nonoverlapping done chunk=2097152
DBG: D2H round offset=4194304 chunk=2097152 dma_phys=0x10103e000 dev_phys=0x7140c000
DBG: D2H fence_reset seq=10
DBG: D2H memcpy dst=0x10103e000 src=0x7140c000 sz=2097152 dir=1
DBG: D2H fence_write seq=10
DBG: D2H gsync returned 0
DBG: D2H copy_nonoverlapping done chunk=2097152
DBG: D2H round offset=6291456 chunk=2097152 dma_phys=0x10103e000 dev_phys=0x7160c000
DBG: D2H fence_reset seq=10
DBG: D2H memcpy dst=0x10103e000 src=0x7160c000 sz=2097152 dir=1
DBG: D2H fence_write seq=10
DBG: D2H gsync returned 0
DBG: D2H copy_nonoverlapping done chunk=2097152
DBG: D2H round offset=8388608 chunk=2097152 dma_phys=0x10103e000 dev_phys=0x7180c000
DBG: D2H fence_reset seq=10
DBG: D2H memcpy dst=0x10103e000 src=0x7180c000 sz=2097152 dir=1
DBG: D2H fence_write seq=10
DBG: D2H gsync returned 0
DBG: D2H copy_nonoverlapping done chunk=2097152
DBG: D2H round offset=10485760 chunk=2097152 dma_phys=0x10103e000 dev_phys=0x71a0c000
DBG: D2H fence_reset seq=10
DBG: D2H memcpy dst=0x10103e000 src=0x71a0c000 sz=2097152 dir=1
DBG: D2H fence_write seq=10
DBG: D2H gsync returned 0
DBG: D2H copy_nonoverlapping done chunk=2097152
DBG: D2H round offset=12582912 chunk=2097152 dma_phys=0x10103e000 dev_phys=0x71c0c000
DBG: D2H fence_reset seq=10
DBG: D2H memcpy dst=0x10103e000 src=0x71c0c000 sz=2097152 dir=1
DBG: D2H fence_write seq=10
DBG: D2H gsync returned 0
DBG: D2H copy_nonoverlapping done chunk=2097152
DBG: D2H round offset=14680064 chunk=2097152 dma_phys=0x10103e000 dev_phys=0x71e0c000
DBG: D2H fence_reset seq=10
DBG: D2H memcpy dst=0x10103e000 src=0x71e0c000 sz=2097152 dir=1
DBG: D2H fence_write seq=10
DBG: D2H gsync returned 0
DBG: D2H copy_nonoverlapping done chunk=2097152
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
HtoD: 92.222000
DtoH: 10.750000
Test passed

memcpy2 PASS ✓ bandwidth test; no data check, but the path is correct

[root@qemu-ai-ep memset2]# cd ../memcpy2
[root@qemu-ai-ep memcpy2]# make
g++ -o user_test -std=gnu++11 -L /usr/local/gdev/lib64 -I /usr/local/gdev/include -g main.c memcpy.c -lgdev_layer4
[root@qemu-ai-ep memcpy2]# ./user_test 
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: enumerate_devices enter
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: sd_init
DBG: dev_count=4
DBG: gopen(0)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7f7e70aaa000, token=0x1)
DBG: gphysget addr=0x7f7e70aaa000 ty=2 vas.mem_list.len()=0
DBG: gopen(0) -> 0x1
DBG: gopen(1)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7f7e708aa000, token=0x2)
DBG: gphysget addr=0x7f7e708aa000 ty=2 vas.mem_list.len()=0
DBG: gopen(1) -> 0x2
DBG: gopen(2)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7f7e706aa000, token=0x3)
DBG: gphysget addr=0x7f7e706aa000 ty=2 vas.mem_list.len()=0
DBG: gopen(2) -> 0x3
DBG: gopen(3)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7f7e704aa000, token=0x4)
DBG: gphysget addr=0x7f7e704aa000 ty=2 vas.mem_list.len()=0
DBG: gopen(3) -> 0x4
DBG: enumerate_devices push ctx handle=0x1 tid_raw=1582 tid_i32=1582
DBG: ctx_list_push OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: sd_ctx_get_current list_len=1 tid_u64=1582 tid_i32=1582
DBG:   [0] minor=0 user=1582 h=1 match=true
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: sd_ctx_get_current list_len=1 tid_u64=1582 tid_i32=1582
DBG:   [0] minor=0 user=1582 h=1 match=true
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: gmalloc vaddr=0x61000100c000 phy=0x7100c000 sz=0x80000
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: sd_ctx_get_current list_len=1 tid_u64=1582 tid_i32=1582
DBG:   [0] minor=0 user=1582 h=1 match=true
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: memcpy handle=0x1 dc=4
DBG: D2H h=0x1 src=0x61000100c000 dst=0x7f7e72318010 sz=524288
DBG: gphysget addr=0x61000100c000 ty=1 vas.mem_list.len()=1
DBG:   mem[0] vaddr=0x61000100c000 phy=0x7100c000 sz=0x80000
DBG: gphysget DEVICE result=0x7100c000
DBG: D2H round offset=0 chunk=524288 dma_phys=0x10185a000 dev_phys=0x7100c000
DBG: D2H fence_reset seq=2
DBG: D2H memcpy dst=0x10185a000 src=0x7100c000 sz=524288 dir=1
DBG: D2H fence_write seq=2
DBG: D2H gsync returned 0
DBG: D2H copy_nonoverlapping done chunk=524288
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: memcpy handle=0x1 dc=4
DBG: D2H h=0x1 src=0x61000100c000 dst=0x7f7e72318010 sz=524288
DBG: gphysget addr=0x61000100c000 ty=1 vas.mem_list.len()=1
DBG:   mem[0] vaddr=0x61000100c000 phy=0x7100c000 sz=0x80000
DBG: gphysget DEVICE result=0x7100c000
DBG: D2H round offset=0 chunk=524288 dma_phys=0x10185a000 dev_phys=0x7100c000
DBG: D2H fence_reset seq=4
DBG: D2H memcpy dst=0x10185a000 src=0x7100c000 sz=524288 dir=1
DBG: D2H fence_write seq=4
DBG: D2H gsync returned 0
DBG: D2H copy_nonoverlapping done chunk=524288
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: memcpy handle=0x1 dc=4
DBG: D2H h=0x1 src=0x61000100c000 dst=0x7f7e72318010 sz=524288
DBG: gphysget addr=0x61000100c000 ty=1 vas.mem_list.len()=1
DBG:   mem[0] vaddr=0x61000100c000 phy=0x7100c000 sz=0x80000
DBG: gphysget DEVICE result=0x7100c000
DBG: D2H round offset=0 chunk=524288 dma_phys=0x10185a000 dev_phys=0x7100c000
DBG: D2H fence_reset seq=6
DBG: D2H memcpy dst=0x10185a000 src=0x7100c000 sz=524288 dir=1
DBG: D2H fence_write seq=6
DBG: D2H gsync returned 0
DBG: D2H copy_nonoverlapping done chunk=524288
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
elapsedTimeInMs is 2.232000
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
bandwidth D to H:688.171997MB/s
Test passed

launch2 PASS ✓ out[i] == 3*i verified element by element; the kernel really executes on the hardware
To make sure the computed data is correct, two sets of source data were used.
First data set (expected out[i] == 5*i):

    in = (unsigned int *) malloc(size);
    in2 = (unsigned int *) malloc(size);
    out = (unsigned int *) malloc(size);
    for (i = 0; i < size / 4; i++) {
        in[i] = i * 2;
        in2[i] = i * 3;
        out[i] = 0;
    }

Test result:

[root@qemu-ai-ep launch2]# ./user_test 
size = 0x10000, dev_num = 0
file size =211846
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: enumerate_devices enter
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: sd_init
DBG: dev_count=4
DBG: gopen(0)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7fcbc3465000, token=0x1)
DBG: gphysget addr=0x7fcbc3465000 ty=2 vas.mem_list.len()=0
DBG: gopen(0) -> 0x1
DBG: gopen(1)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7fcbc3265000, token=0x2)
DBG: gphysget addr=0x7fcbc3265000 ty=2 vas.mem_list.len()=0
DBG: gopen(1) -> 0x2
DBG: gopen(2)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7fcbc3065000, token=0x3)
DBG: gphysget addr=0x7fcbc3065000 ty=2 vas.mem_list.len()=0
DBG: gopen(2) -> 0x3
DBG: gopen(3)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7fcbc2e65000, token=0x4)
DBG: gphysget addr=0x7fcbc2e65000 ty=2 vas.mem_list.len()=0
DBG: gopen(3) -> 0x4
DBG: enumerate_devices push ctx handle=0x1 tid_raw=1609 tid_i32=1609
DBG: ctx_list_push OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
dev count 4
--------cubin = 0x7fcbc4d25010-------
DBG: register_fat_binary enter, fat_cubin=0x7ffebb4508c0, name=unnamed
DBG: register_fat_binary locking runtime
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: register_fat_binary insert handle=0
DBG: register_fat_binary done, handle=0
DBG: register_function enter handle=0 name=add_test
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: register_function parsed 3 funcs file_size=211846
DBG: gmalloc vaddr=0x61000100c000 phy=0x7100c000 sz=0x33b86
DBG: register_function code_addr=0x61000100c000 sz=0x33b86
DBG: gmemcpy_to_device token=0x1 dst=0x61000100c000 src=0x7fcbc4ceb010 sz=211846
DBG: gmemcpy_to_device calling ctx_from_handle(0x1)
DBG: ctx_from_handle OK minor=0 cid=0
DBG: gmemcpy_to_device ctx.fence.map=0x7fcbc4d64000 ctx.fence.phy=0x102072000
DBG: dma_buf=0x7fcbc3465000 dma_phys=0x102076000 dma_buf_size=2097152
DBG: gphysget addr=0x61000100c000 ty=1 vas.mem_list.len()=1
DBG:   mem[0] vaddr=0x61000100c000 phy=0x7100c000 sz=0x33b86
DBG: gphysget DEVICE result=0x7100c000
DBG: dev_base=0x7100c000 (physical via gphysget) dst=0x61000100c000
DBG: copying chunk=211846 offset=0 src=0x7fcbc4ceb010 -> dma=0x7fcbc3465000
DBG: copy done
DBG: firering memcpy+reset+fence_write for chunk=211846 seq=1
DBG: fence_reset seq=1
DBG: calling compute.memcpy dev=0x7100c000 src=0x102076000 dst=0x7100c000 sz=211846
DBG: memcpy done
DBG: fence_write done
DBG: calling gsync token=0x1 seq=1
DBG: gsync returned 0
DBG: register_function kernel code_addr=0x0 code_pc=0x45b0 code_sz=0x240
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: gmalloc vaddr=0x61000100c000 phy=0x7100c000 sz=0x10000
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: gmalloc vaddr=0x61000101c000 phy=0x7101c000 sz=0x10000
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: gmalloc vaddr=0x61000102c000 phy=0x7102c000 sz=0x10000
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: memcpy handle=0x1 dc=4
DBG: H2D h=0x1 dst=0x61000100c000 src=0x1deb010 sz=65536
DBG: gmemcpy_to_device token=0x1 dst=0x61000100c000 src=0x1deb010 sz=65536
DBG: gmemcpy_to_device calling ctx_from_handle(0x1)
DBG: ctx_from_handle OK minor=0 cid=0
DBG: gmemcpy_to_device ctx.fence.map=0x7fcbc4d64000 ctx.fence.phy=0x102072000
DBG: dma_buf=0x7fcbc3465000 dma_phys=0x102076000 dma_buf_size=2097152
DBG: gphysget addr=0x61000100c000 ty=1 vas.mem_list.len()=3
DBG:   mem[0] vaddr=0x61000100c000 phy=0x7100c000 sz=0x10000
DBG:   mem[1] vaddr=0x61000101c000 phy=0x7101c000 sz=0x10000
DBG:   mem[2] vaddr=0x61000102c000 phy=0x7102c000 sz=0x10000
DBG: gphysget DEVICE result=0x7100c000
DBG: dev_base=0x7100c000 (physical via gphysget) dst=0x61000100c000
DBG: copying chunk=65536 offset=0 src=0x1deb010 -> dma=0x7fcbc3465000
DBG: copy done
DBG: firering memcpy+reset+fence_write for chunk=65536 seq=5
DBG: fence_reset seq=5
DBG: calling compute.memcpy dev=0x7100c000 src=0x102076000 dst=0x7100c000 sz=65536
DBG: memcpy done
DBG: fence_write done
DBG: calling gsync token=0x1 seq=5
DBG: gsync returned 0
DBG: H2D gmemcpy_to_device returned 0
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: memcpy handle=0x1 dc=4
DBG: H2D h=0x1 dst=0x61000101c000 src=0x1dfb020 sz=65536
DBG: gmemcpy_to_device token=0x1 dst=0x61000101c000 src=0x1dfb020 sz=65536
DBG: gmemcpy_to_device calling ctx_from_handle(0x1)
DBG: ctx_from_handle OK minor=0 cid=0
DBG: gmemcpy_to_device ctx.fence.map=0x7fcbc4d64000 ctx.fence.phy=0x102072000
DBG: dma_buf=0x7fcbc3465000 dma_phys=0x102076000 dma_buf_size=2097152
DBG: gphysget addr=0x61000101c000 ty=1 vas.mem_list.len()=3
DBG:   mem[0] vaddr=0x61000100c000 phy=0x7100c000 sz=0x10000
DBG:   mem[1] vaddr=0x61000101c000 phy=0x7101c000 sz=0x10000
DBG:   mem[2] vaddr=0x61000102c000 phy=0x7102c000 sz=0x10000
DBG: gphysget DEVICE result=0x7101c000
DBG: dev_base=0x7101c000 (physical via gphysget) dst=0x61000101c000
DBG: copying chunk=65536 offset=0 src=0x1dfb020 -> dma=0x7fcbc3465000
DBG: copy done
DBG: firering memcpy+reset+fence_write for chunk=65536 seq=7
DBG: fence_reset seq=7
DBG: calling compute.memcpy dev=0x7101c000 src=0x102076000 dst=0x7101c000 sz=65536
DBG: memcpy done
DBG: fence_write done
DBG: calling gsync token=0x1 seq=7
DBG: gsync returned 0
DBG: H2D gmemcpy_to_device returned 0
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: glaunch enter token=0x1 code_addr=0x0 code_pc=17840 param_size=32 name=add_test
DBG: glaunch launch_seq=9 cid=0 sid=0
DBG: glaunch params (4 x u64):
DBG:   param[0] = 0x61000100c000
DBG:   param[1] = 0x61000101c000
DBG:   param[2] = 0x61000102c000
DBG:   param[3] = 0x4000
DBG: glaunch fence_reset seq=9
DBG: glaunch calling aidev_launch
DBG: glaunch aidev_launch returned 0
DBG: glaunch fence_write seq=9
DBG: glaunch success fence_token=9
DBG: launch gsync id=9
launch_sync time: 26.203000
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: memcpy handle=0x1 dc=4
DBG: D2H h=0x1 src=0x61000102c000 dst=0x1e0b030 sz=65536
DBG: gphysget addr=0x61000102c000 ty=1 vas.mem_list.len()=3
DBG:   mem[0] vaddr=0x61000100c000 phy=0x7100c000 sz=0x10000
DBG:   mem[1] vaddr=0x61000101c000 phy=0x7101c000 sz=0x10000
DBG:   mem[2] vaddr=0x61000102c000 phy=0x7102c000 sz=0x10000
DBG: gphysget DEVICE result=0x7102c000
DBG: D2H round offset=0 chunk=65536 dma_phys=0x102076000 dev_phys=0x7102c000
DBG: D2H fence_reset seq=11
DBG: D2H memcpy dst=0x102076000 src=0x7102c000 sz=65536 dir=1
DBG: D2H fence_write seq=11
DBG: D2H gsync returned 0
DBG: D2H copy_nonoverlapping done chunk=65536
out[0]=0 in[0]=0 in2[0]=0
out[1]=5 in[1]=2 in2[1]=3
out[2]=10 in[2]=4 in2[2]=6
out[3]=15 in[3]=6 in2[3]=9
out[4]=20 in[4]=8 in2[4]=12
out[5]=25 in[5]=10 in2[5]=15
out[6]=30 in[6]=12 in2[6]=18
out[7]=35 in[7]=14 in2[7]=21
out[8]=40 in[8]=16 in2[8]=24
out[9]=45 in[9]=18 in2[9]=27
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
Test passed: size = 0x10000, dev_num = 0

Second set of source data

/* Second data set: in[i] = 2*i and in2[i] = i, so the expected sum is out[i] = 3*i. */
in = (unsigned int *) malloc(size);
in2 = (unsigned int *) malloc(size);
out = (unsigned int *) malloc(size);
for (i = 0; i < size / 4; i++) {
        in[i] = i * 2;
        in2[i] = i;
        out[i] = 0;
}

Test results:

[root@qemu-ai-ep launch2]# ./user_test 
size = 0x10000, dev_num = 0
file size =211846
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: enumerate_devices enter
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: sd_init
DBG: dev_count=4
DBG: gopen(0)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7f37fd026000, token=0x1)
DBG: gphysget addr=0x7f37fd026000 ty=2 vas.mem_list.len()=0
DBG: gopen(0) -> 0x1
DBG: gopen(1)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7f37fce26000, token=0x2)
DBG: gphysget addr=0x7f37fce26000 ty=2 vas.mem_list.len()=0
DBG: gopen(1) -> 0x2
DBG: gopen(2)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7f37fcc26000, token=0x3)
DBG: gphysget addr=0x7f37fcc26000 ty=2 vas.mem_list.len()=0
DBG: gopen(2) -> 0x3
DBG: gopen(3)
DBG: gopen DMA pre-alloc OK (2 MiB, ptr=0x7f37fca26000, token=0x4)
DBG: gphysget addr=0x7f37fca26000 ty=2 vas.mem_list.len()=0
DBG: gopen(3) -> 0x4
DBG: enumerate_devices push ctx handle=0x1 tid_raw=1623 tid_i32=1623
DBG: ctx_list_push OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
dev count 4
--------cubin = 0x7f37fe8e6010-------
DBG: register_fat_binary enter, fat_cubin=0x7ffef0d3ba40, name=unnamed
DBG: register_fat_binary locking runtime
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: register_fat_binary insert handle=0
DBG: register_fat_binary done, handle=0
DBG: register_function enter handle=0 name=add_test
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: register_function parsed 3 funcs file_size=211846
DBG: gmalloc vaddr=0x61000100c000 phy=0x7100c000 sz=0x33b86
DBG: register_function code_addr=0x61000100c000 sz=0x33b86
DBG: gmemcpy_to_device token=0x1 dst=0x61000100c000 src=0x7f37fe8ac010 sz=211846
DBG: gmemcpy_to_device calling ctx_from_handle(0x1)
DBG: ctx_from_handle OK minor=0 cid=0
DBG: gmemcpy_to_device ctx.fence.map=0x7f37fe925000 ctx.fence.phy=0x100002000
DBG: dma_buf=0x7f37fd026000 dma_phys=0x100006000 dma_buf_size=2097152
DBG: gphysget addr=0x61000100c000 ty=1 vas.mem_list.len()=1
DBG:   mem[0] vaddr=0x61000100c000 phy=0x7100c000 sz=0x33b86
DBG: gphysget DEVICE result=0x7100c000
DBG: dev_base=0x7100c000 (physical via gphysget) dst=0x61000100c000
DBG: copying chunk=211846 offset=0 src=0x7f37fe8ac010 -> dma=0x7f37fd026000
DBG: copy done
DBG: firering memcpy+reset+fence_write for chunk=211846 seq=1
DBG: fence_reset seq=1
DBG: calling compute.memcpy dev=0x7100c000 src=0x100006000 dst=0x7100c000 sz=211846
DBG: memcpy done
DBG: fence_write done
DBG: calling gsync token=0x1 seq=1
DBG: gsync returned 0
DBG: register_function kernel code_addr=0x0 code_pc=0x45b0 code_sz=0x240
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: gmalloc vaddr=0x61000100c000 phy=0x7100c000 sz=0x10000
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: gmalloc vaddr=0x61000101c000 phy=0x7101c000 sz=0x10000
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: gmalloc vaddr=0x61000102c000 phy=0x7102c000 sz=0x10000
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: memcpy handle=0x1 dc=4
DBG: H2D h=0x1 dst=0x61000100c000 src=0x1e2c010 sz=65536
DBG: gmemcpy_to_device token=0x1 dst=0x61000100c000 src=0x1e2c010 sz=65536
DBG: gmemcpy_to_device calling ctx_from_handle(0x1)
DBG: ctx_from_handle OK minor=0 cid=0
DBG: gmemcpy_to_device ctx.fence.map=0x7f37fe925000 ctx.fence.phy=0x100002000
DBG: dma_buf=0x7f37fd026000 dma_phys=0x100006000 dma_buf_size=2097152
DBG: gphysget addr=0x61000100c000 ty=1 vas.mem_list.len()=3
DBG:   mem[0] vaddr=0x61000100c000 phy=0x7100c000 sz=0x10000
DBG:   mem[1] vaddr=0x61000101c000 phy=0x7101c000 sz=0x10000
DBG:   mem[2] vaddr=0x61000102c000 phy=0x7102c000 sz=0x10000
DBG: gphysget DEVICE result=0x7100c000
DBG: dev_base=0x7100c000 (physical via gphysget) dst=0x61000100c000
DBG: copying chunk=65536 offset=0 src=0x1e2c010 -> dma=0x7f37fd026000
DBG: copy done
DBG: firering memcpy+reset+fence_write for chunk=65536 seq=5
DBG: fence_reset seq=5
DBG: calling compute.memcpy dev=0x7100c000 src=0x100006000 dst=0x7100c000 sz=65536
DBG: memcpy done
DBG: fence_write done
DBG: calling gsync token=0x1 seq=5
DBG: gsync returned 0
DBG: H2D gmemcpy_to_device returned 0
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: memcpy handle=0x1 dc=4
DBG: H2D h=0x1 dst=0x61000101c000 src=0x1e3c020 sz=65536
DBG: gmemcpy_to_device token=0x1 dst=0x61000101c000 src=0x1e3c020 sz=65536
DBG: gmemcpy_to_device calling ctx_from_handle(0x1)
DBG: ctx_from_handle OK minor=0 cid=0
DBG: gmemcpy_to_device ctx.fence.map=0x7f37fe925000 ctx.fence.phy=0x100002000
DBG: dma_buf=0x7f37fd026000 dma_phys=0x100006000 dma_buf_size=2097152
DBG: gphysget addr=0x61000101c000 ty=1 vas.mem_list.len()=3
DBG:   mem[0] vaddr=0x61000100c000 phy=0x7100c000 sz=0x10000
DBG:   mem[1] vaddr=0x61000101c000 phy=0x7101c000 sz=0x10000
DBG:   mem[2] vaddr=0x61000102c000 phy=0x7102c000 sz=0x10000
DBG: gphysget DEVICE result=0x7101c000
DBG: dev_base=0x7101c000 (physical via gphysget) dst=0x61000101c000
DBG: copying chunk=65536 offset=0 src=0x1e3c020 -> dma=0x7f37fd026000
DBG: copy done
DBG: firering memcpy+reset+fence_write for chunk=65536 seq=7
DBG: fence_reset seq=7
DBG: calling compute.memcpy dev=0x7101c000 src=0x100006000 dst=0x7101c000 sz=65536
DBG: memcpy done
DBG: fence_write done
DBG: calling gsync token=0x1 seq=7
DBG: gsync returned 0
DBG: H2D gmemcpy_to_device returned 0
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: glaunch enter token=0x1 code_addr=0x0 code_pc=17840 param_size=32 name=add_test
DBG: glaunch launch_seq=9 cid=0 sid=0
DBG: glaunch params (4 x u64):
DBG:   param[0] = 0x61000100c000
DBG:   param[1] = 0x61000101c000
DBG:   param[2] = 0x61000102c000
DBG:   param[3] = 0x4000
DBG: glaunch fence_reset seq=9
DBG: glaunch calling aidev_launch
DBG: glaunch aidev_launch returned 0
DBG: glaunch fence_write seq=9
DBG: glaunch success fence_token=9
DBG: launch gsync id=9
launch_sync time: 20.157000
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: memcpy handle=0x1 dc=4
DBG: D2H h=0x1 src=0x61000102c000 dst=0x1e4c030 sz=65536
DBG: gphysget addr=0x61000102c000 ty=1 vas.mem_list.len()=3
DBG:   mem[0] vaddr=0x61000100c000 phy=0x7100c000 sz=0x10000
DBG:   mem[1] vaddr=0x61000101c000 phy=0x7101c000 sz=0x10000
DBG:   mem[2] vaddr=0x61000102c000 phy=0x7102c000 sz=0x10000
DBG: gphysget DEVICE result=0x7102c000
DBG: D2H round offset=0 chunk=65536 dma_phys=0x100006000 dev_phys=0x7102c000
DBG: D2H fence_reset seq=11
DBG: D2H memcpy dst=0x100006000 src=0x7102c000 sz=65536 dir=1
DBG: D2H fence_write seq=11
DBG: D2H gsync returned 0
DBG: D2H copy_nonoverlapping done chunk=65536
out[0]=0 in[0]=0 in2[0]=0
out[1]=3 in[1]=2 in2[1]=1
out[2]=6 in[2]=4 in2[2]=2
out[3]=9 in[3]=6 in2[3]=3
out[4]=12 in[4]=8 in2[4]=4
out[5]=15 in[5]=10 in2[5]=5
out[6]=18 in[6]=12 in2[6]=6
out[7]=21 in[7]=14 in2[7]=7
out[8]=24 in[8]=16 in2[8]=8
out[9]=27 in[9]=18 in2[9]=9
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
DBG: lock_runtime() locking RUNTIME...
DBG: lock_runtime() locked OK
Test passed: size = 0x10000, dev_num = 0

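The out values from the two runs show that the computed results are correct even when the source data differs: out[i] = 5*i for the first data set (in[i] = 2*i, in2[i] = 3*i) and out[i] = 3*i for the second (in[i] = 2*i, in2[i] = i). The results follow the inputs, so no fake data is involved.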
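
The comparison above was done by reading the first ten printed elements. A minimal sketch of how the same check could be automated over the whole buffer is shown below. The helper name verify_vector_add is hypothetical and is not part of the original user_test source; it only assumes that the kernel computes out[i] = in[i] + in2[i], as the logs indicate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, not taken from the original test program.
 * Checks out[i] == in[i] + in2[i] for the first n elements and
 * returns the number of mismatching elements (0 means all matched).
 */
size_t verify_vector_add(const unsigned int *in,
                         const unsigned int *in2,
                         const unsigned int *out,
                         size_t n)
{
    size_t mismatches = 0;

    assert(in != NULL && in2 != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        unsigned int expected = in[i] + in2[i];
        if (out[i] != expected) {
            /* Print only the first few mismatches to keep the log readable. */
            if (mismatches < 8) {
                fprintf(stderr, "mismatch at %zu: out=%u expected=%u\n",
                        i, out[i], expected);
            }
            mismatches++;
        }
    }
    return mismatches;
}
```

In the test program this would be called with n = size / 4 after the D2H copy, and "Test passed" would be printed only when the return value is 0. Because the expected value is derived from the inputs rather than hard-coded, the same check covers both data sets.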

5. Afterword

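Between deleting all of the SW_DEVMEM software emulation code and getting all three tests to pass, the work went through a conversation rebuild forced by a context overflow, a change of QEMU environment, and eleven bugs fixed one by one. The old QEMU hid the errors by being lenient; the new QEMU, by being strict, forced us to face the truth.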

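Throughout this process, the software emulation that the AI had introduced seriously delayed the translation work. SW_DEVMEM made the tests "pass" so easily at first that I never realized the lower layer was not talking to the hardware at all. It postponed the moment the problems surfaced, and when they finally erupted I had to deal with eleven bugs at once instead of two or three. Performance test results will be added later, and the related documentation will be completed.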
