[P2] security: optional signature verification for MLnode PoC endpoints#537
[P2] security: optional signature verification for MLnode PoC endpoints#537x0152 wants to merge 2 commits into
Conversation
akup
left a comment
There was a problem hiding this comment.
It is uncomplete patch, as there are following functions in MLNodeClient:
// Training operations
StartTraining(ctx context.Context, taskId uint64, participant string, nodeId string, masterNodeAddr string, rank int, worldSize int) error
GetTrainingStatus(ctx context.Context) error
// Node state operations
Stop(ctx context.Context, num *int) error
NodeState(ctx context.Context) (*StateResponse, error)
// PoC operations
GetPowStatus(ctx context.Context, num *int) (*PowStatusResponse, error)
InitGenerate(ctx context.Context, dto InitDto, num *int) error
InitValidate(ctx context.Context, dto InitDto, num *int) error
ValidateBatch(ctx context.Context, batch ProofBatch, num *int) error
// Inference operations
InferenceHealth(ctx context.Context) (bool, error)
InferenceUp(ctx context.Context, model string, args []string) error
// GPU operations
GetGPUDevices(ctx context.Context) (*GPUDevicesResponse, error)
GetGPUDriver(ctx context.Context) (*DriverInfo, error)
// Model management operations
CheckModelStatus(ctx context.Context, model Model) (*ModelStatusResponse, error)
DownloadModel(ctx context.Context, model Model) (*DownloadStartResponse, error)
DeleteModel(ctx context.Context, model Model) (*DeleteResponse, error)
ListModels(ctx context.Context) (*ModelListResponse, error)
GetDiskSpace(ctx context.Context) (*DiskSpaceInfo, error)Stop and InferenceUp are also critical to protect, as they are changing node state.
Other are get commands, but it could be more complete, to sign all the MLNode requests.
|
Current PR covers POC as the most critical - direct reward theft. Firstly would be good to understand if this direction makes sense. I should update PR description to clarify scope |
I'm just pointing that before cleaning it this PR cannot be merged, because it is bad practice to merge something semi-ready |
|
I think it is useful PR, but it should be aligned with #717 The artifacts that are sent to callback URL also should be signed, as the attacker can send bad artifacts if the URL is not protected, and node will not pass validation on PoC. I think we should first update main repo with new MLNode PoC apis (now PoC completely lives at https://github.com/gonka-ai/vllm repo). Then look and compare approaches with Jf/token auth and finally take into account with websocket that should go next |
|
@akup If the poc api update is coming, let's postpone #537 and #417 until after that. #717 can go in this upgrade - it's optional, shouldn't be affected by PoC API changes and enables cloud-hosted MLNode setups #537 and #417 for security and stability should go on top of the new api once it is validated on mainnet, to avoid regressions what do you think about that? |
|
I think that certification of communication between devshard (or dapi) and mlnode should be certified. Because there are always some attacks on open ports and this is negative to adoption and newcomers. |
|
Agreed - let's include this in v0.2.13 |
|
Hi @x0152, are you ready to include this PR in the next upgrade? |
MLnode api endpoints are currently unauthenticated, allowing an attacker who can reach the mlnode port to hijack POC by sending their own callback_url
Adds request signing using the existing signerAccount key. Network Node signs requests with
X-Signatureheader, mlnode verifies using SIGNER_PUBKEY environment variable. For standard deployment where mlnode is on the same machine, it works automatically since SIGNER_PUBKEY defaults to ACCOUNT_PUBKEYFor remote mlnode setups, you need to set SIGNER_PUBKEY manually. If SIGNER_PUBKEY is not set, verification is skipped for backward compatibility
Get public key on network node:
Start mlnode with signature verification: