A generic, composable multidimensional array library.
(Work In Progress)
The project is still a work in progress; expect missing docs, features, and benchmarks.
- Array computation library
- Array/Tensor computational graph
- Tile-based kernel embedded DSL
nmtools is written in C++ and provides NumPy-like array computation in C++. It can also capture the computational graph at compile time.
auto x = nm::array.arange(15, /*dtype=*/nm::float32).reshape(array{3,5});
auto t1 = (x - x.max(/*axis=*/1,/*dtype=*/nm::None,/*initial=*/nm::None,/*keepdims=*/true)).exp();
auto t2 = t1.sum(-1,nm::None,nm::None,true);
auto softmax = t1 / t2;
print(x);
print(t1);
print(t2);
print(softmax);
Sample output:
shape: [ 3, 5]
[[ 0.000000, 1.000000, 2.000000, 3.000000, 4.000000],
[ 5.000000, 6.000000, 7.000000, 8.000000, 9.000000],
[ 10.000000, 11.000000, 12.000000, 13.000000, 14.000000]]
shape: [ 3, 5]
[[ 0.018316, 0.049787, 0.135335, 0.367879, 1.000000],
[ 0.018316, 0.049787, 0.135335, 0.367879, 1.000000],
[ 0.018316, 0.049787, 0.135335, 0.367879, 1.000000]]
shape: [ 3, 1]
[[ 1.571317],
[ 1.571317],
[ 1.571317]]
shape: [ 3, 5]
[[ 0.011656, 0.031685, 0.086129, 0.234122, 0.636409],
[ 0.011656, 0.031685, 0.086129, 0.234122, 0.636409],
[ 0.011656, 0.031685, 0.086129, 0.234122, 0.636409]]
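The numbers above can be cross-checked without nmtools. The following is a minimal sketch in plain standard C++ of the same row-wise, max-subtracted softmax; the names `softmax_rows` and `make_input` are hypothetical helpers introduced here for illustration and are not part of the nmtools API.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Row-wise, numerically stable softmax over a fixed-size 2-D array.
// Plain standard C++ for cross-checking only; it does not use nmtools.
template <std::size_t Rows, std::size_t Cols>
std::array<std::array<double, Cols>, Rows>
softmax_rows(const std::array<std::array<double, Cols>, Rows>& x)
{
    static_assert(Cols > 0, "each row needs at least one element");
    std::array<std::array<double, Cols>, Rows> out{};
    for (std::size_t i = 0; i < Rows; ++i) {
        // Subtract the row maximum so exp() cannot overflow
        // (the x - x.max(axis=1, keepdims=true) step).
        const double row_max = *std::max_element(x[i].begin(), x[i].end());
        double row_sum = 0.0;
        for (std::size_t j = 0; j < Cols; ++j) {
            out[i][j] = std::exp(x[i][j] - row_max);  // corresponds to t1
            row_sum += out[i][j];                     // corresponds to t2
        }
        assert(row_sum > 0.0);  // the maximum element contributes exp(0) = 1
        for (std::size_t j = 0; j < Cols; ++j) {
            out[i][j] /= row_sum;                     // t1 / t2
        }
    }
    return out;
}

// Builds the same 3x5 input as arange(15) reshaped to {3,5}.
inline std::array<std::array<double, 5>, 3> make_input()
{
    std::array<std::array<double, 5>, 3> x{};
    for (std::size_t i = 0; i < 3; ++i) {
        for (std::size_t j = 0; j < 5; ++j) {
            x[i][j] = static_cast<double>(i * 5 + j);
        }
    }
    return x;
}
```

Calling `softmax_rows(make_input())` gives three identical rows, since every row of the input differs from its maximum by the same offsets; this is why the sample output repeats the same values on each row.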
auto input = nm::random(array{3,4},dtype,gen);
auto axis = -1;
auto res = view::softmax(input,axis);
/* Static computational graph: */
auto graph = fn::get_computational_graph(res);
/* Compile-time computational graph: */
constexpr auto graph_v = nm::to_value_v<decltype(unwrap(graph))>;
See the full code: examples/tilekit/add.cpp
/* includes */
/* Multicore + SIMD */
using v128_mt = tk::thread_pool<tk::vector::context_t>;
struct add_kernel_t
{
template <typename tile_shape_t=tuple<nm::ct<2>,nm::ct<4>>, typename context_t, typename out_t, typename a_t, typename b_t>
auto operator()(context_t ctx, out_t& out, const a_t& a, const b_t& b, const tile_shape_t t_shape=tile_shape_t{})
{
auto [t_id] = tk::worker_id(ctx);
auto [t_size] = tk::worker_size(ctx);
auto a_shape = shape(a);
auto offset = tk::ndoffset(a_shape,t_shape);
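// t_size is the number of workers; each worker handles n_iter consecutive tiles
// (integer division: assumes the tile count is divisible by t_size)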
auto n_iter = (offset.size()/t_size);
for (nm_size_t i=0; i<n_iter; i++) {
auto tile_offset = offset[(t_id*n_iter)+i];
auto block_a = tk::load(ctx,a,tile_offset,t_shape);
auto block_b = tk::load(ctx,b,tile_offset,t_shape);
auto result = block_a + block_b;
tk::store(ctx,out,tile_offset,result);
}
}
};
inline auto add_kernel = add_kernel_t{};
int main(int argc, char** argv)
{
/* setup a,b,c*/
auto tile_shape = tuple{2_ct,16_ct};
auto num_threads = 8;
auto ctx = v128_mt(num_threads);
auto worker_size = num_threads;
ctx.eval(worker_size,add_kernel,c,a,b,tile_shape);
/* check or use result */
return 0;
}
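The way the kernel above splits tiles across workers can be sketched in plain standard C++. This is only an illustration under stated assumptions: `tile_origins` and `tiled_add` are hypothetical helpers (not tilekit functions), the array is a flat row-major 2-D buffer, shapes divide evenly into tiles, and the workers are simulated one after another instead of running on a thread pool.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Top-left corner of a tile inside a row-major 2-D array.
struct tile_origin_t { std::size_t row; std::size_t col; };

// Enumerates all tile origins in row-major order. Hypothetical stand-in for
// the role tk::ndoffset plays in the kernel; assumes an even division.
inline std::vector<tile_origin_t> tile_origins(std::size_t rows, std::size_t cols,
                                               std::size_t tile_rows, std::size_t tile_cols)
{
    assert(tile_rows > 0 && tile_cols > 0);
    assert(rows % tile_rows == 0 && cols % tile_cols == 0);
    std::vector<tile_origin_t> origins;
    for (std::size_t r = 0; r < rows; r += tile_rows) {
        for (std::size_t c = 0; c < cols; c += tile_cols) {
            origins.push_back(tile_origin_t{r, c});
        }
    }
    return origins;
}

// Element-wise add, tile by tile. Worker w owns the contiguous tile range
// [w * n_iter, (w + 1) * n_iter), mirroring offset[(t_id*n_iter)+i].
inline std::vector<float> tiled_add(const std::vector<float>& a, const std::vector<float>& b,
                                    std::size_t rows, std::size_t cols,
                                    std::size_t tile_rows, std::size_t tile_cols,
                                    std::size_t num_workers)
{
    assert(a.size() == rows * cols && b.size() == rows * cols);
    const auto origins = tile_origins(rows, cols, tile_rows, tile_cols);
    assert(num_workers > 0 && origins.size() % num_workers == 0);
    const std::size_t n_iter = origins.size() / num_workers;

    std::vector<float> out(rows * cols, 0.0f);
    // Sequential simulation: a real context would run each worker_id concurrently.
    for (std::size_t worker_id = 0; worker_id < num_workers; ++worker_id) {
        for (std::size_t i = 0; i < n_iter; ++i) {
            const tile_origin_t origin = origins[worker_id * n_iter + i];
            for (std::size_t r = 0; r < tile_rows; ++r) {
                for (std::size_t c = 0; c < tile_cols; ++c) {
                    const std::size_t idx = (origin.row + r) * cols + (origin.col + c);
                    out[idx] = a[idx] + b[idx];
                }
            }
        }
    }
    return out;
}
```

For example, an 8x64 array with 2x16 tiles yields 16 tiles, so 8 workers each process 2 tiles. Because the tile ranges are disjoint, no two workers write to the same output element.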
As you can see, all 8 worker threads are saturated with work.
As you can see, the addition is vectorized using SIMD instructions.
Optionally, instrumented profiling with Tracy is also supported. It provides a timeline view as well as assembly and source views.
auto gen = nm::random_engine();
auto dtype = nm::float32;
auto input = nm::random(array{4096,4},dtype,gen);
// Change hip to cuda or sycl when targeting NVIDIA or vendor-neutral devices
// auto ctx = nm::cuda::default_context();
// auto ctx = nm::sycl::default_context();
auto ctx = nm::hip::default_context();
auto gpu_res = nm::tanh(input,ctx);
auto cpu_res = nm::tanh(input);
Sample output:
./a.out
[nmtools hip] driver version: 60443484
[nmtools hip] runtime version: 60443484
[nmtools hip] number of hip devices: 1
- compute capability: major: 11 minor: 0
- device name: Radeon RX 7900 XTX
- total mem (bytes): 25753026560
- ecc enabled: 0
- async engine count: 8
- can map to host memory: 1
- can use host pointer for registered memory: 1
- maximum clock (kHz): 2482000
- compute mode: 0
- compute preemption supported: 0
- concurrent kernel execution supported: 1
- coherent access managed memory concurrently with cpu: 1
- cooperative launch supported: 1
- cooperative device launch supported: 1
- caching globals in L1 supported: 1
- host device operation is native atomic: 1
- is integrated GPU: 0
- is multi GPUs: 0
- kernel execution timeout limit: 0
- L2 cache size (bytes): 6291456
- caching locals in L1 supported: 1
- supports allocating managed memory: 1
- maximum block size per multiprocessors: 2
- maximum block size in width (x): 1024
- maximum block size in height (y): 1024
- maximum block size in depth (z): 1024
- maximum grid size in width (x): 2147483647
- maximum grid size in height (y): 65536
- maximum grid size in depth (z): 65536
- maximum threads per block: 1024
- maximum threads per multiprocessor: 2048
- global memory bus width (bits): 384
- maximum memory clock frequency (kHz): 1249000
- multiprocessor count: 48
isclose: true
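The final `isclose: true` line reports that the GPU and CPU results agree within a tolerance. The snippet above does not show how that comparison is made, so the following is only a sketch of one common criterion, the NumPy-style test |a - b| <= atol + rtol * |b| with NumPy's default tolerances. The function name `all_close` is hypothetical, and the check nmtools actually performs may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true when every element pair satisfies
//   |a[i] - b[i]| <= atol + rtol * |b[i]|
// Default tolerances follow NumPy's isclose; this is not the nmtools API.
inline bool all_close(const std::vector<float>& a, const std::vector<float>& b,
                      double rtol = 1e-5, double atol = 1e-8)
{
    assert(rtol >= 0.0 && atol >= 0.0);
    if (a.size() != b.size()) {
        return false;  // different lengths can never be element-wise close
    }
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double x = static_cast<double>(a[i]);
        const double y = static_cast<double>(b[i]);
        const double diff = std::fabs(x - y);
        // Written as a negated <= so that NaN values compare as not close.
        if (!(diff <= atol + rtol * std::fabs(y))) {
            return false;
        }
    }
    return true;
}
```

A tolerance-based comparison is the usual choice here because GPU and CPU math libraries may round transcendental functions such as tanh slightly differently, so exact equality is not a reliable test.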
- Compile-time shape inference
- constexpr compile-time evaluation
- CUDA, HIP, SYCL support
- Support for zero dynamic allocation
- Bare-metal Microcontrollers
- C++17 (full language features)
Supported compilers:
- gcc 9+
- clang 10+
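To illustrate what compile-time shape inference means in C++17 terms, here is a minimal sketch that computes the result shape of a keepdims reduction inside a `constexpr` function and checks it with `static_assert`. It uses only the standard library; `reduce_shape_keepdims` is a hypothetical helper written for this example and does not reflect how nmtools implements the feature.

```cpp
#include <array>
#include <cstddef>

// Shape of the result when reducing `shape` along `axis` with keepdims=true:
// the reduced axis collapses to 1 and all other extents are preserved.
template <std::size_t N>
constexpr std::array<std::size_t, N>
reduce_shape_keepdims(const std::array<std::size_t, N>& shape, int axis)
{
    // Negative axes count from the end, following the NumPy convention.
    const std::size_t ax =
        static_cast<std::size_t>(axis < 0 ? axis + static_cast<int>(N) : axis);
    std::array<std::size_t, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = (i == ax) ? 1 : shape[i];
    }
    return out;
}

// The shape is computed by the compiler; no code runs at program start.
constexpr std::array<std::size_t, 2> input_shape{3, 5};
constexpr auto reduced_shape = reduce_shape_keepdims(input_shape, -1);
static_assert(reduced_shape[0] == 3 && reduced_shape[1] == 1,
              "sum over the last axis of a 3x5 array keeps shape 3x1");
```

A shape mismatch expressed this way becomes a compile error instead of a runtime failure, and fixed-size `std::array` storage needs no dynamic allocation.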
Clone the repository somewhere:
git clone https://github.com/alifahrri/nmtools.git
Write some code:
// file array.cpp
#include "nmtools/nmtools.hpp"
#include "nmtools/ndarray.hpp"
#include <iostream>  // std::cout, std::endl
namespace nm = nmtools;
namespace utils = nmtools::utils;
using namespace nm::literals;
using nmtools_array;
using nmtools_tuple;
template <typename array_t>
auto print(const array_t& x)
{
std::cout << "shape: " << utils::to_string(nm::shape(x))
<< std::endl
<< utils::to_string(x)
<< std::endl;
}
int main(int argc, char** argv)
{
auto gen = nm::random_engine();
auto dtype = nm::float32;
auto a = nm::Array::random(array{2,3,2},dtype,gen);
print(a);
// similar to a[1:,1:2,...]
auto sa = a.slice("1:"_ct,"1:2"_ct,"..."_ct);
print(sa);
std::cout << "a:\n";
a.slice("1:"_ct,"1:2"_ct,"..."_ct) = nm::ones(array{1,1,2});
print(a);
std::cout << "b:\n";
auto b = nm::Array::arange(2,dtype);
print(b);
std::cout << "c=dot(a,b):\n";
auto c = nm::dot(a,b);
print(c);
return 0;
}
nmtools is a header-only library: point the compiler at its include path and compile.
# adjust the path as necessary
export NMTOOLS_INCLUDE_PATH=${HOME}/projects/nmtools/include
g++ -std=c++17 -I$NMTOOLS_INCLUDE_PATH array.cpp
Then run it:
./a.out
Sample result:
shape: [ 2, 3, 2]
[[[ 0.846539, 0.547375],
[ 0.150028, 0.481849],
[ 0.761298, 0.949123]],
[[ 0.382625, 0.582128],
[ 0.845037, 0.914821],
[ 0.660163, 0.464962]]]
shape: [ 1, 1, 2]
[[[ 0.845037, 0.914821]]]
a:
shape: [ 2, 3, 2]
[[[ 0.846539, 0.547375],
[ 0.150028, 0.481849],
[ 0.761298, 0.949123]],
[[ 0.382625, 0.582128],
[ 1.000000, 1.000000],
[ 0.660163, 0.464962]]]
b:
shape: [ 2]
[ 0.000000, 1.000000]
c=dot(a,b):
shape: [ 2, 3]
[[ 0.547375, 0.481849, 0.949123],
[ 0.582128, 1.000000, 0.464962]]
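The last result can be verified by hand: with `a` of shape [2, 3, 2] and `b` of shape [2], the dot product contracts the last axis of `a`, so c[i][j] is the sum over k of a[i][j][k] * b[k]. Since `b` is [0, 1], each output element is simply the second value of the corresponding pair in `a`. The sketch below reproduces this in plain standard C++ on a flat row-major buffer; `dot_last_axis` and `sample_a` are hypothetical helpers for this check, not nmtools functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Contracts the last axis of a row-major array `a` with the vector `b`:
//   c[row] = sum over k of a[row * K + k] * b[k],  K = b.size().
// For a of shape (2, 3, 2) this returns 6 values, i.e. shape (2, 3).
inline std::vector<double> dot_last_axis(const std::vector<double>& a,
                                         const std::vector<double>& b)
{
    assert(!b.empty() && a.size() % b.size() == 0);
    const std::size_t k_size = b.size();
    std::vector<double> c(a.size() / k_size, 0.0);
    for (std::size_t row = 0; row < c.size(); ++row) {
        for (std::size_t k = 0; k < k_size; ++k) {
            c[row] += a[row * k_size + k] * b[k];
        }
    }
    return c;
}

// The array `a` from the sample result, after the slice assignment set
// a[1][1] to ones, flattened in row-major order.
inline std::vector<double> sample_a()
{
    return {0.846539, 0.547375,  0.150028, 0.481849,  0.761298, 0.949123,
            0.382625, 0.582128,  1.000000, 1.000000,  0.660163, 0.464962};
}
```

Calling `dot_last_axis(sample_a(), {0.0, 1.0})` returns the six values 0.547375, 0.481849, 0.949123, 0.582128, 1.0, and 0.464962, matching the printed `c`.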