CIMFlow

Matrix–Vector Operation

Computing-in-memory matrix-vector multiplication instructions

Matrix–vector operations leverage the CIM array for efficient parallel multiplication between weight matrices and input vectors, enabling high-performance deep learning inference.


CIM_MVM

Performs matrix-vector multiplication (MVM) using the CIM array, computing the product of a weight matrix stored in CIM and an input feature vector from local memory. Supports batch operations via configurable flag bits.

| Bits  | 31:26  | 25:21      | 20:16     | 15:11       | 10:6        | 5:0   |
|-------|--------|------------|-----------|-------------|-------------|-------|
| Field | opcode | rs         | rt        | re          | rf          | flags |
| Value | 000000 | input addr | input len | weight addr | batch count | 6-bit |
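As a sanity check on the field layout above, the 32-bit instruction word can be packed from its fields as follows. This is an illustrative Python sketch, not part of the CIMFlow toolchain; the field widths and positions follow the encoding table.

```python
def encode_cim_mvm(rs: int, rt: int, re_: int, rf: int, flags: int = 0) -> int:
    """Pack CIM_MVM fields into a 32-bit word per the layout above:
    bits 31:26 opcode (000000), 25:21 rs, 20:16 rt,
    bits 15:11 re, 10:6 rf, 5:0 flags.
    """
    # Range-check each field against its bit width.
    for name, val, width in [("rs", rs, 5), ("rt", rt, 5), ("re", re_, 5),
                             ("rf", rf, 5), ("flags", flags, 6)]:
        if not 0 <= val < (1 << width):
            raise ValueError(f"{name} out of range for a {width}-bit field")
    opcode = 0b000000  # CIM_MVM opcode per the table
    return (opcode << 26) | (rs << 21) | (rt << 16) | (re_ << 11) | (rf << 6) | flags
```

For example, `encode_cim_mvm(1, 2, 3, 4, 0b000100)` places each register index into its slot of the word.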
Syntax
CIM_MVM rs, rt, re, rf, [flags]
Operation
OutBuf += CIM[GRF[re]] × MEM[GRF[rs] : GRF[rs] + GRF[rt]]
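The operation above can be pictured as a functional model: read the input address and length from the GRF, fetch the weight matrix stored at the CIM address, and accumulate the product into the output buffer. This is a hypothetical Python model of the semantics, not the hardware behavior; the data-structure shapes (a dict for CIM, flat lists for memory and registers) are assumptions for illustration.

```python
def cim_mvm_model(grf, mem, cim, out, rs, rt, re_):
    """Functional model of one CIM_MVM: out += W @ x.

    grf : general register file (list of ints)
    mem : local memory (flat list of values)
    cim : mapping from CIM address to a weight matrix (list of rows)
    out : output accumulator (accumulation is always enabled)
    """
    addr, length = grf[rs], grf[rt]        # input addr and input len
    x = mem[addr : addr + length]          # input feature vector
    w = cim[grf[re_]]                      # weight matrix stored in CIM
    for i, row in enumerate(w):            # out += W @ x
        out[i] += sum(a * b for a, b in zip(row, x))
    return out
```

Calling the model twice on the same output buffer accumulates both results, mirroring the always-on accumulation described below.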

Operation Flags

The flags field controls execution modes for CIM matrix-vector multiplication. Multiple flags can be combined to optimize computation.

GRP (Group Mode)
Partitions the computation into groups for parallel execution. Each group handles a subset of the weight matrix.

GRP_I (Group Input Mode)
Controls how the input vector is distributed across computation groups. Works together with the GRP flag to determine data-flow patterns.

BATCH (Batch Processing)
Processes multiple input vectors consecutively, using the GRF[rf] batch count parameter, for improved throughput.

Accumulation (ACC) is always enabled by default: results are accumulated into the output buffer. This behavior is implicit and not exposed in the assembly syntax.
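Under the flags above, BATCH mode can be pictured as looping the basic MVM over consecutive input vectors. The sketch below is a hedged Python model; the assumption that batched vectors are laid out back-to-back at a stride of GRF[rt] elements is illustrative, not documented behavior.

```python
def cim_mvm_batch(grf, mem, cim, outs, rs, rt, re_, rf):
    """Model of CIM_MVM with the BATCH flag: GRF[rf] consecutive
    input vectors, each GRF[rt] elements long, starting at GRF[rs].
    outs holds one accumulator per input vector."""
    base, length, batch = grf[rs], grf[rt], grf[rf]
    w = cim[grf[re_]]                      # shared weight matrix
    for b in range(batch):
        # Assumed layout: vector b starts at base + b * length.
        x = mem[base + b * length : base + (b + 1) * length]
        for i, row in enumerate(w):
            outs[b][i] += sum(a * v for a, v in zip(row, x))
    return outs
```

Reusing the weight matrix across the batch is what makes this mode attractive on a CIM array: the weights stay in place while only the inputs stream through.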


Examples

; Example 1: Basic matrix-vector multiplication
; y = W × x, where W is 128×256, x is 256×1
G_LI  r1, 0x1000        ; Input vector address
G_LI  r2, 256           ; Input vector length
G_LI  r3, 0x0           ; Weight matrix in CIM[0]
G_LI  r4, 1             ; Single operation
CIM_MVM r1, r2, r3, r4  ; Result stored in CIM output buffer

; Example 2: Batch processing
; Process 16 input vectors
G_LI  r1, 0x2000        ; First input vector
G_LI  r2, 512           ; Vector length
G_LI  r3, 0x1000        ; Weight matrix in CIM[0x1000]
G_LI  r4, 16            ; Batch size = 16
CIM_MVM r1, r2, r3, r4, BATCH  ; Batch processing mode

; Example 3: Grouped computation
; Use grouped mode for parallel processing
G_LI  r1, 0x3000        ; Input address
G_LI  r2, 1024          ; Vector length
G_LI  r3, 0x0           ; Weight matrix
G_LI  r4, 1             ; Single operation
CIM_MVM r1, r2, r3, r4, GRP  ; Grouped computation

; Example 4: Multi-layer inference
; Sequential MVM for 3 layers
S_LI  INPUT_BITWIDTH, 8   ; Configure CIM: INT8 input
S_LI  OUTPUT_BITWIDTH, 32 ; INT32 output

; Layer 1: 784 → 512
G_LI  r1, 0x1000
G_LI  r2, 784
G_LI  r3, 0x0
G_LI  r4, 1
CIM_MVM r1, r2, r3, r4

; Apply activation (omitted for brevity)

; Layer 2: 512 → 256
G_LI  r1, 0x2000        ; Layer 1 output
G_LI  r2, 512
G_LI  r3, 0x10000       ; Layer 2 weights
G_LI  r4, 1
CIM_MVM r1, r2, r3, r4

; Layer 3: 256 → 10
G_LI  r1, 0x3000        ; Layer 2 output
G_LI  r2, 256
G_LI  r3, 0x20000       ; Layer 3 weights
G_LI  r4, 1
CIM_MVM r1, r2, r3, r4

; Example 5: Batch inference for throughput
; Process 32 images in a batch
G_LI  r1, 0x10000       ; First image features
G_LI  r2, 784           ; Feature dimension
G_LI  r3, 0x0           ; Weight matrix
G_LI  r4, 32            ; Batch count
CIM_MVM r1, r2, r3, r4, BATCH  ; Batch processing
