[Dynet-92]. Multi-device support#704
Conversation
… cpu impl temporarily.
… format(--dynet-devices CPU,GPU:0,GPU:2)
|
In general, this is great: I think multi-device support will be a great feature for DyNet to have. First, I have a high level comment. In my mind, there are two design decisions here: How do we specify the "default" device of a graph node when it is not specified explicitly
The first has the advantage of perhaps being easier to understand, but may result in hidden memory moves where people aren't expecting them. It also adds some code complexity. When some of the inputs are not on the same device what do you do?
My opinion: I tend to prefer 2./2. respectively, but could be convinced otherwise. |
|
This is really great! I like options 2 and 2 also, but would like to propose a variation of the second (2): the name letting both Another proposal (maybe its already there, I didn't look at the code) is to also allow multiple CPU devices. There, the copying will be NOPs, but we can still run different cpu devices as different threads. How do things look in terms of synchronization in the current implementation? |
|
Very helpful comments!! I also prefer 2. / 2. then. And I think to support both copy like Currently, we don't support specifying CPU id. I will think about that in the near future. |
…device interface instead of do memcpy in executor.
|
The remark of |
|
@xunzhang Yes, |
|
@neubig Right, cool. I will finish this soon. |
|
This pull request is review-ready. It will not affect old code and interface and I will split remaining work in the future pull requests. The remaining things include,
|
neubig
left a comment
There was a problem hiding this comment.
Thanks, this is great! I have a bunch of small comments, but I think once they're resolved and I can confirm that this works in my environment, I think we can merge. Also, some of my comments might just be oversights, so if there's anything that you don't think needs to be fixed, just tell me.
|
|
||
| // check consistent device | ||
| for (auto & xs_v : xs) { | ||
| DYNET_ASSERT(xs_v->device == nfxs[num_nodes_evaluated].device, "Attemp to do tensor forward in different devices"); |
There was a problem hiding this comment.
Attemp -> Attempt. Similarly for all below.
| float* my_src = batches[node2batch[id]].nfx.v + node2offset[id]; | ||
| if (tout.device->type == DeviceType::CPU) { | ||
| memcpy(dest, my_src, sz * sizeof(float)); | ||
| } else { |
There was a problem hiding this comment.
For this "else", I think we should add a tout.device->type == DeviceType::GPU, then throw an error if we get a device other than the one we're expecting. We might add other device types later, and if we do this logic will break (and it's in a somewhat inconspicuous place, so we should make sure that it's easy to catch through an error).
| return Expression(x.pg, x.pg->add_function<ScalarAdd>({x.i, y.i})); | ||
| else | ||
| return Expression(x.pg, x.pg->add_function<Sum>({x.i, y.i})); | ||
| return Expression(x.pg, x.pg->add_function<Sum>({x.i, y.i}, x.pg->nodes[x.i]->device)); |
There was a problem hiding this comment.
This shouldn't need a "device" argument, right? It can just inherit its device from its inputs.
| Expression abs(const Expression& x) { return Expression(x.pg, x.pg->add_function<Abs>({x.i})); } | ||
| Expression erf(const Expression& x) { return Expression(x.pg, x.pg->add_function<Erf>({x.i})); } | ||
| Expression tanh(const Expression& x) { return Expression(x.pg, x.pg->add_function<Tanh>({x.i})); } | ||
| Expression tanh(const Expression& x, Device *device) { |
There was a problem hiding this comment.
Similarly, we should remove the device argument from this and all others below, except the ones that deal with input, or explicitly changing devices.
| const bool is_stale() const {return (get_number_of_active_graphs() != 1 || graph_id != get_current_graph_id());} | ||
| Expression() : pg(nullptr), i(0), graph_id(0) {} | ||
|
|
||
| Expression(Device *device) : pg(nullptr), i(0), graph_id(0) {} |
There was a problem hiding this comment.
This is identical to the empty constructor, so we should delete it.
| // x = an invertible matrix | ||
| struct MatrixInverse : public Node { | ||
| explicit MatrixInverse(const std::initializer_list<VariableIndex>& a) : Node(a) {} | ||
| explicit MatrixInverse(const std::initializer_list<VariableIndex>& a, Device *device) : Node(a, device) {} |
| // y = x_1 * x_2 | ||
| struct MatrixMultiply : public Node { | ||
| explicit MatrixMultiply(const std::initializer_list<VariableIndex>& a) : Node(a) {} | ||
| explicit MatrixMultiply(const std::initializer_list<VariableIndex>& a, |
| // y = tanh x_1 | ||
| struct Tanh : public Node { | ||
| explicit Tanh(const std::initializer_list<VariableIndex>& a) : Node(a) {} | ||
| explicit Tanh(const std::initializer_list<VariableIndex>& a, Device *device) : Node(a, device) {} |
| if (v.device->type == DeviceType::CPU) { | ||
| return (*v)(index[0], index[1]); | ||
| } else { | ||
| #if HAVE_CUDA |
There was a problem hiding this comment.
Similarly here, I think we should check if the device is GPU, and throw an error if we get an unknown device.
| cudaMemcpyAsync(v.v, v_src.v, sizeof(real) * v.d.size(), cudaMemcpyDeviceToHost); | ||
| #endif | ||
| } | ||
| } else { |
There was a problem hiding this comment.
Similarly here, check if device is GPU explicitly.
|
Thanks! I confirmed that this is working as expected, so I merged. This is great to have :) |
|
Is it working actually? I tried the "./examples/train_xor-multidevice" example and got the following error:
|
|
I think documentation is not finished yet, but you need to add |
|
It works. Thanks! |
--dynet-devicesargument.cg.change_expr_devicebefore defining an expression.V * tanh(affine_transform(b, W, x)) + a.V * tanh(affine_transform(b, W, x)) + a.Debug hang issue using multiple GPUsAdd more feature tests.Original Usage:
Modified Usage:
To reviewer @neubig , you can have a quick test using code below: