NumPy Tips

Data storage size

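For a single data type: types that do not state a bit width have a platform-dependent size. This applies to the default integer type np.int_. The old aliases np.int and np.float were just the Python built-ins int and float, and were removed in NumPy 1.24 (np.float always meant a 64-bit float). For types that state a bit width, such as np.int32 and np.float64, the size in bytes is the stated bit width divided by 8. For example, an np.int32 value occupies 4 bytes and an np.float64 value occupies 8 bytes.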
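As a quick check of the rule above, the item size of a type can be read from its dtype. This is a minimal sketch; the variable names are only for illustration, and the size of the default integer is deliberately not hard-coded because it depends on the platform and NumPy version.

```python
import numpy as np

# Fixed-width types: itemsize is the stated bit width divided by 8.
int32_bytes = np.dtype(np.int32).itemsize      # 4
float64_bytes = np.dtype(np.float64).itemsize  # 8

# Default integer type: its width depends on the platform and NumPy version.
default_int_bytes = np.dtype(np.int_).itemsize

print(int32_bytes, float64_bytes, default_int_bytes)
```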

For arrays:

In [39]: a
Out[39]:
array([[1.2, 1.3],
       [0. , 0. ]])

In [40]: a.dtype
Out[40]: dtype('float64')

In [41]: a.itemsize
Out[41]: 8

In [42]: a.nbytes
Out[42]: 32

In [43]: a.size
Out[43]: 4

In [44]: a.itemsize * a.size
Out[44]: 32

In [45]: a.itemsize * a.size == a.nbytes
Out[45]: True

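arr.nbytes = arr.itemsize * arr.size
arr.itemsize returns the size in bytes of one array element.
arr.size returns the number of elements in the array; arr.size = np.prod(arr.shape).
arr.nbytes returns the number of bytes needed to store all of the array's data.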

nbytes: Total bytes consumed by the elements of the array. Does not include memory consumed by non-element attributes of the array object.

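Note: sys.getsizeof returns the memory footprint of an arbitrary Python object. The memory used by a NumPy array = the array's data storage + overhead (shape, dtype, strides, and related metadata). For large arrays the overhead is negligible.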
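The difference between nbytes and sys.getsizeof can be seen directly. The sketch below assumes an array that owns its data (here created with np.zeros). For a view, sys.getsizeof reports only the object overhead, because the data buffer belongs to another array.

```python
import sys
import numpy as np

a = np.zeros((1000, 1000), dtype=np.float64)

data_bytes = a.nbytes                # itemsize * size = 8 * 1_000_000
total_bytes = sys.getsizeof(a)       # data plus the array object's own overhead
overhead = total_bytes - data_bytes  # small compared with the data

view = a[:10]                        # a view does not own its data
view_bytes = sys.getsizeof(view)     # so only the object overhead is reported

print(data_bytes, overhead, view_bytes)
```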

References:

python - How much memory is used by a numpy ndarray? - Stack Overflow
python - nbytes and getsizeof return different values - Stack Overflow

Data types and endianness

Big-endian and little-endian byte order

Endian-ness.

MSB stands for most significant bit, while LSB is least significant bit. In binary terms, the MSB is the bit that has the greatest effect on the number, and it is the left-most bit.

Little-endian (LSB first) means we start with the least significant part in the lowest address.

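Big-endian (MSB first) means we start with the most significant part. In big-endian mode, the high-order bytes of a value are stored at the lower memory addresses and the low-order bytes at the higher addresses. This layout is similar to processing the data as a string, in order: the lowest address is on the left, and the bytes follow in the same order in which we normally write the number.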
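The two layouts can be made visible by storing the same 32-bit value with each byte order and dumping the raw bytes. This is a small sketch; the value 0x01020304 is chosen only so that each byte is easy to recognize.

```python
import numpy as np

value = 0x01020304  # most significant byte is 0x01, least significant is 0x04

big = np.array([value], dtype='>u4').tobytes()     # MSB at the lowest address
little = np.array([value], dtype='<u4').tobytes()  # LSB at the lowest address

print(big.hex(), little.hex())
```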

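A NumPy dtype can specify both the byte order and the type compactly, using what the documentation calls one-character strings or array-protocol type strings. Each type has a corresponding one-letter code.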

Byte-order notation:

numpy.dtype.byteorder — NumPy v1.24 Manual

'=': native, '<': little-endian, '>': big-endian, '|': not applicable

dt = np.dtype('<i4')  # 32-bit little-endian signed integer
dt = np.dtype('>f8')  # 64-bit big-endian floating-point number, np.float64
dt = np.dtype('c16')  # 128-bit complex floating-point number
dt = np.dtype('a25')  # 25-length zero-terminated bytes
dt = np.dtype('U25')  # 25-character string

The number after the letter is the item size in bytes (for 'U', it is the number of characters).
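The item sizes implied by these strings can be checked with a short sketch. It uses 'S25' for the byte string, since newer NumPy releases deprecate the 'a' code. The last lines reinterpret the same four bytes with the opposite byte order, to show that the prefix changes how the bytes are read.

```python
import numpy as np

codes = ['<i4', '>f8', 'c16', 'S25', 'U25']
itemsizes = {code: np.dtype(code).itemsize for code in codes}
print(itemsizes)  # 'U25' stores 25 characters at 4 bytes each

# Reinterpreting the same 4 bytes with the opposite byte order gives a different value.
raw = np.array([1], dtype='<i4')   # bytes in memory: 01 00 00 00
swapped = raw.view('>i4')[0]       # the same bytes read as big-endian
print(swapped)
```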

Data type objects (dtype) — NumPy v1.24 Manual

Blocked matrix product

For blocked matrix multiplication, use np.tensordot and specify the axes to contract: (10,3) (3,2,3) -> (10,2,3)

np.tensordot(arr_a, arr_b, axes=([1], [0]))  # contract axis 1 of arr_a with axis 0 of arr_b; returns eval_ys
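To confirm the shapes, here is a self-contained sketch with random inputs. The array contents are illustrative assumptions, and the einsum line is included only as an independent cross-check of the same contraction.

```python
import numpy as np

rng = np.random.default_rng(0)
arr_a = rng.standard_normal((10, 3))
arr_b = rng.standard_normal((3, 2, 3))

# Contract axis 1 of arr_a with axis 0 of arr_b.
out = np.tensordot(arr_a, arr_b, axes=([1], [0]))

# The same contraction written with einsum, for comparison.
ref = np.einsum('ik,kjl->ijl', arr_a, arr_b)

print(out.shape)
```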

NumPy padding

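import numpy as np

# p (exclusive upper bound of the random values) and img (a 2-D array) must already be defined.
def random_pad(vec, pad_width, *_, **__):
    # np.pad calls this once per 1-D slice; fill the padded ends in place.
    vec[:pad_width[0]] = np.random.randint(0, p, size=pad_width[0])
    vec[vec.size - pad_width[1]:] = np.random.randint(0, p, size=pad_width[1])

np.pad(img, ((0, 1), (0, 0)), mode=random_pad)  # add 1 additional row along axis 0 only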
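One detail to watch: np.pad broadcasts a flat pad_width such as (0, 1) to every axis, so on a 2-D array it appends both a row and a column. To pad only axis 0, pass one (before, after) pair per axis. The sketch below illustrates the difference. The names random_fill, p = 10, and the small zero-filled img are assumptions made for this example.

```python
import numpy as np

p = 10  # hypothetical exclusive upper bound for the random fill values
rng = np.random.default_rng(0)

def random_fill(vec, pad_width, iaxis, kwargs):
    # np.pad passes each 1-D slice, already enlarged; fill the new ends in place.
    vec[:pad_width[0]] = rng.integers(0, p, size=pad_width[0])
    vec[vec.size - pad_width[1]:] = rng.integers(0, p, size=pad_width[1])

img = np.zeros((2, 3), dtype=np.int64)

both_axes = np.pad(img, (0, 1), mode=random_fill)            # pads every axis
rows_only = np.pad(img, ((0, 1), (0, 0)), mode=random_fill)  # pads axis 0 only

print(both_axes.shape, rows_only.shape)
```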

In PyTorch, Tensor.repeat() behaves similarly to numpy.tile().
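Because the names are easy to confuse, the NumPy side is sketched below: np.tile repeats the whole array as a block, whereas np.repeat repeats each element in place. On the PyTorch side, the closer counterpart to np.repeat is torch.repeat_interleave. Only NumPy is run here, so the example does not depend on PyTorch being installed.

```python
import numpy as np

x = np.array([1, 2])

tiled = np.tile(x, (2, 2))   # repeats the whole array as a block: shape (2, 4)
repeated = np.repeat(x, 2)   # repeats each element in place

print(tiled)
print(repeated)
```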