SecondTouchReality is a derivative of a “VR hand-gesture based teaching-object system”, with a more modular and extensible architecture. Its focus is:
- natural-language → object generation
- pinch-gesture grabbing in 3D
- single-channel servo feedback of the pinch state
It’s a small but reasonably complete end-to-end system that keeps the whole chain while “extracting the skeleton” and making it lightweight:
- Python side: hand tracking + depth estimation + text classification model
- Unity side: 3D hand skeleton reconstruction + grabbing interaction + camera control
- Hardware side: Arduino / servo / glove (interface reserved)
Wired together as one pipeline:
Camera → Python understands your hand + your sentence → Unity generates an interactive 3D scene → drives hardware feedback
The repo is still at prototype stage, but it already shows a full loop of:
Perception → Semantics → Interaction → Hardware
From an end-user point of view, the system does three main things:
- Understand your hand
  - Python uses MediaPipe Hands to detect 21 hand keypoints.
  - It computes palm width, palm length, finger curl, side pose, and palm/back-of-hand orientation.
  - A one-time calibration converts palm width / length into a real-world wrist-to-camera distance (meters), with filtering.
  - It packs the wrist 3D position, the 20 bone direction vectors and the pinch state into JSON and sends it to Unity over UDP.
- Understand your words
  - Unity pops up a dialog where you type an English description (e.g. "a small red apple").
  - A lightweight text model on the Python side uses `HashingVectorizer` + `SGDClassifier` for multi-class classification and outputs a discrete label (e.g. `"101"`).
  - A TCP connection returns `"101"` to Unity.
- Let you grab things
  - Unity uses `HandFromVectors` to reconstruct the 3D hand skeleton from the UDP JSON, drawing it with spheres and lines.
  - `PinchGrabBall` lets you pinch objects in the scene so they follow your hand.
  - `HandOrbitCamera` lets you rotate / zoom the camera with pinch + hand movement when you’re not grabbing anything.
  - `ModelLibrary` / `RuntimeModelLoader` load the corresponding 3D model (prefab or GLB) based on the label.
Directory Structure
SecondTouchReality/
├── README.md # English / mixed-language readme (overall design)
├── README_CHN.md # Chinese readme (overall design)
├── requirements.txt # Python dependencies
├── text_model.pkl # Trained text classification model
├── main.py # combined_server, one-click to start the whole pipeline
├── .gitignore
├── unecessary/ # Old scripts or unused resources
├── tools/
│ ├── hand_easy.py # Hand distance estimation + calibration logic
│ ├── hand_udp.py # Multi-hand + bone vectors + pinch → UDP JSON
│ ├── arduino_udp_receive.py # Simple bridge: UDP → Arduino
│ ├── text_infer_server.py # Text model inference TCP server
│ ├── run_model.py # Load text_model.pkl, provide CLI inference
│ └── __pycache__/ # Python cache
├── test/
│ ├── collect_data.py # Collect “description text + label” data
│ ├── clean_dataset.py # Clean JSONL dataset
│ ├── train_model.py # Train text classification model and output text_model.pkl
│ ├── text_object_dataset.jsonl
│ ├── cleaned_text_object_dataset.jsonl
│ └── object_models_csv.csv # Text labels ↔ model ID mapping table
├── Game/ # Unity demo project (can be opened directly)
│ ├── SampleScene.unity # Main scene
│ ├── models/ # Several glb models (apple, banana, bowl, etc.)
│ └── Scripts/ # Main C# scripts
│ ├── HandFromVectors.cs
│ ├── HandOrbitCamera.cs
│ ├── PinchGrabBall.cs
│ ├── ModelLibrary.cs
│ ├── RuntimeModelLoader.cs
│ └── TextQueryClient_TMP.cs
└── ...
Root directory
- `main.py`: Recommended entry script. Opens the camera, starts the UDP → Unity hand data stream, registers the `on_payload` callback to map pinch state to serial commands, and starts the text inference server.
- `requirements.txt`: All Python dependencies; usually installed via `pip install -r requirements.txt`.
- `text_model.pkl`: Trained by `test/train_model.py`; used to map natural-language descriptions to object labels.
tools/ – runtime tools layer
- `hand_easy.py`: Encapsulates distance calibration and filtering logic; used by `hand_udp.py` and other scripts.
- `hand_udp.py`:
  - Uses MediaPipe, supports multiple hands.
  - Outputs wrist depth, bone directions, pinch state.
  - Sends JSON packets to Unity via UDP (default `127.0.0.1:5065`).
  - Allows an external `on_payload` callback.
- `text_infer_server.py`: Starts a `socketserver`-based multithreaded TCP server; uses `infer_once` from `run_model.py` to call the text model in memory.
- `arduino_udp_receive.py`: Simplified bridge program; reads hand JSON over UDP, cares only about `hands[].pinch`, detects state changes and sends `'0'`/`'1'` to the Arduino serial port.
test/ – data & model playground
- `collect_data.py`: Interactive CLI tool to quickly collect training data.
- `clean_dataset.py`: Filters out samples containing Chinese, keeps only clean English text + labels, outputs `cleaned_text_object_dataset.jsonl`.
- `train_model.py`: Trains a text classification model on the cleaned data and saves it as `text_model.pkl` for the main program.
Game/ – Unity demo
- `HandFromVectors.cs`: UDP client; parses JSON from Python, reconstructs joint positions and visualizes them with spheres and lines.
- `PinchGrabBall.cs`: Turns any 3D object into a “pinchable” object, handling grab/follow/release and smooth motion.
- `HandOrbitCamera.cs`: Uses pinch to control camera rotation and zoom.
- `ModelLibrary.cs`: Maintains a name → GameObject dictionary and provides `ShowModelByLabel(label)` for directly using text inference results.
- `TextQueryClient_TMP.cs`: TextMeshPro-based text input client; talks to the Python text server.
- `RuntimeModelLoader.cs`: Loads GLB models by index from `StreamingAssets/models` at runtime to expand the model library.
- Python 3.x
- OpenCV (`cv2`) – camera capture + HUD drawing
- MediaPipe Hands – hand keypoint detection
- NumPy – vector operations, statistics (median, EMA, etc.)
- scikit-learn – text features (`HashingVectorizer`) + linear classifier (`SGDClassifier`)
- joblib – serializes the model + label encoder (`text_model.pkl`)
- socket / socketserver – UDP + TCP communication
- pyserial – serial communication with Arduino
Main Python files:
- `hand_udp.py` – main hand tracking + UDP streaming
- `hand_easy.py` – depth estimation demo / debugging
- `collect_data.py` / `clean_dataset.py` / `train_model.py` / `run_model.py` – text model data & training toolchain
- `text_infer_server.py` – text inference TCP server
- `main.py` – combines hand tracking + text server + serial bridge into a single process
- `arduino_udp_receive.py` – alternative: standalone UDP → serial bridge
- Unity 202x
- C# scripts:
  - `HandFromVectors.cs` – UDP receiver + hand skeleton reconstruction + GUI tuning
  - `PinchGrabBall.cs` – grab logic for objects
  - `HandOrbitCamera.cs` – orbit/zoom camera around a scene target
  - `ModelLibrary.cs` – treats children/prefabs as a “model dictionary”
  - `RuntimeModelLoader.cs` – dynamic `.glb` loading with GLTFast
  - `TextQueryClient.cs` (class `TextQueryClient_TMP`) – Unity-side TCP client + UI for text
- TextMeshPro – input and text display
- GLTFast – runtime loading of `.glb` / `.gltf` models
- Arduino (Uno / Nano, etc.)
- One or more simple servos (for demo)
- Very simple serial protocol: send one ASCII char per update, e.g. `'0'` / `'1'`.
A typical workflow is:
- Start the Python side first (`main.py`).
- In the camera window that pops up, press `c` to calibrate, `r` to reset, `q` to quit.
- Then open the Unity project and load the `Skin` scene (or your own demo scene).
- In Unity, click Play:
  - The 3D skeleton hand follows your real hand.
  - Pinch to rotate the camera or grab objects.
  - Enter a sentence in the dialog box and the system will generate the corresponding 3D model in front of you.
Details follow.
- Create a virtual environment in the project root (recommended): `python -m venv .venv`, then `.venv\Scripts\activate`.
- Install dependencies: `pip install -r requirements.txt`.
- Make sure the camera is accessible by OpenCV / MediaPipe.

In the project root, run `python main.py`. You’ll get:
- A camera preview window with a HUD (FPS, calibration status, etc.).
- In the background:
  - a UDP hand data server (for Unity)
  - a text TCP server (listening on `127.0.0.1:9009`)
  - a serial port (if an Arduino is connected)
In the camera window:
- Press `c`:
  - Open your palm, face the camera, keep still; it samples about 50 frames.
  - The terminal asks you for the real wrist-to-camera distance (meters), e.g. `0.45`.
  - It uses the median palm width/length to compute `k_w` / `k_l`, which are later used to estimate Z.
- Press `r`: reset calibration.
- Press `q`: exit the Python side.
After calibration, HUD text changes from “Calib: NOT SET” to something like “Calib: OK”.
- Open the project in Unity. The main demo scene is typically `Skin.unity`. The scene should contain:
  - An object with `HandFromVectors`: `listenPort = 5065` (must match Python) and `targetCamera` set (usually the main camera).
  - The main camera with `HandOrbitCamera` attached.
  - A node with `ModelLibrary`; its children are model templates (their names usually match label names).
  - A UI Canvas with `TextQueryClient_TMP` attached, pointing to the TMP input field and buttons.
- Click Play:
  - You’ll see 3D hand bones (spheres + lines) in the camera view.
  - When you pinch (thumb + index):
    - If a `PinchGrabBall` object is nearby, it gets grabbed and follows your hand.
    - If nothing is grabbed, `HandOrbitCamera` interprets the pinch as camera control—moving your hand rotates the view.
- In the UI, click the button to open the dialog, enter an English description, for example `a green apple`, and click confirm:
  - Unity sends the string to `127.0.0.1:9009`.
  - Python runs the text model and returns something like `102|0.93`.
  - `TextQueryClient_TMP` parses `label = "102"` and calls `ModelLibrary.ShowModelByLabel("102")`.
  - `ModelLibrary` / `RuntimeModelLoader` spawn the corresponding model in front of the camera and auto-attach `PinchGrabBall` so you can pinch it.
- Flash a simple serial control sketch on the Arduino, e.g. one that calls `Serial.begin(9600);` and then reads `if (Serial.available()) char c = Serial.read();` in its loop.
  - If `c == '1'` → the servo turns to 45°; if `c == '0'` → the servo returns to 0°.
- In `main.py` or `arduino_udp_receive.py`, change `COM9` to your actual serial port.
- Run Python:
  - When hand tracking works, `on_payload` or `arduino_udp_receive` will send `'1'` / `'0'` according to the pinch state.
  - You should see the servo move as you pinch / release.
Summary of `hand_udp.py`:
- Opens the camera and uses MediaPipe Hands to detect multiple hands.
- For each frame it computes:
  - palm width / palm length (pixels)
  - finger curl `curl` (0–1)
  - side pose `side` (0–1)
  - palm/back-of-hand orientation `palm_front`
  - wrist depth `Z` (meters)
  - 20 bone direction vectors (unit vectors)
  - `pinch` (thumb + index pinching or not)
- Packs it all into JSON and sends it via UDP to:
  - the Unity port (default 5065)
  - the Arduino UDP bridge port (default 5066)
Key points:
- Calibration logic
  - Uses a `CalibState` struct to store sampled palm widths/lengths, `k_w`, `k_l`, etc.
  - When you press `c`, it samples for a while, then asks for the real distance and computes `k_w = d_real * w_med` and similar.
  - For depth estimation, it uses two channels, `Z ≈ k_w / palm_width` and `Z ≈ k_l / palm_length`, then fuses them based on curl / side (see the depth sketch under `hand_easy.py` below).
- Pinch detection
  - Typically checks the distance between `thumb_tip` and `index_tip`; if it is below a threshold, it counts as a pinch (see the sketch right after this list).
  - Writes `"pinch": true/false` directly into the JSON.
- `on_payload` callback (registered in `main.py`)
  - Receives the whole JSON once per frame; it can count how many hands are pinching and do post-processing (e.g. serial output).
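Below is a minimal sketch of the pinch test and the `on_payload` post-processing described above. The landmark indices are MediaPipe’s; the threshold value and the function names are illustrative assumptions, not the repo’s exact code.

```python
import math

THUMB_TIP, INDEX_TIP = 4, 8      # MediaPipe Hands landmark indices
PINCH_THRESHOLD = 0.05           # assumed value; tune for your camera / hand size

def is_pinching(landmarks):
    """landmarks: 21 (x, y) normalized keypoints for one hand."""
    tx, ty = landmarks[THUMB_TIP]
    ix, iy = landmarks[INDEX_TIP]
    return math.hypot(tx - ix, ty - iy) < PINCH_THRESHOLD

def on_payload(payload):
    """Example callback: count how many hands are pinching in the per-frame JSON."""
    pinching_hands = sum(1 for h in payload.get("hands", []) if h.get("pinch"))
    return pinching_hands > 0    # main.py maps this to b"1" / b"0" on the serial port
```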
`hand_easy.py` is for “pulling the depth estimation piece out” and playing with it separately, without UDP or Unity.
Important functions:
- `compute_palm_width_and_length(...)` – computes palm width & length in pixels given the landmarks; used as depth proxies.
- `compute_curl(...)` – uses finger joint angles to determine whether the hand is open or in a fist.
- `compute_side(...)` – detects whether the hand faces the camera or is turned sideways.
- `fuse_depth(Zw, Zl, curl, side, palm_front, ...)` – fuses the two depth channels into a final `Z_final` with weighting and correction terms.
- `draw_hud(...)` – prints all intermediate values on the image for easier tuning and understanding.
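A minimal sketch of the calibration and dual-channel depth estimation described above. The `k_w = d_real * w_med` relation comes from the calibration notes; the fusion weighting here is an illustrative assumption rather than `fuse_depth`’s exact formula.

```python
def calibrate(d_real_m, w_med_px, l_med_px):
    """Press 'c', enter the real wrist-to-camera distance, derive the two constants."""
    k_w = d_real_m * w_med_px    # median palm width (pixels) during calibration
    k_l = d_real_m * l_med_px    # median palm length (pixels) during calibration
    return k_w, k_l

def estimate_depth(palm_width_px, palm_length_px, k_w, k_l, curl, side):
    """Two depth channels; the width channel is down-weighted when the hand
    is curled or turned sideways (curl and side are in [0, 1])."""
    z_w = k_w / max(palm_width_px, 1e-6)
    z_l = k_l / max(palm_length_px, 1e-6)
    w_width = max(0.0, 1.0 - max(curl, side))   # assumed weighting, not the repo's exact terms
    return w_width * z_w + (1.0 - w_width) * z_l
```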
The `test/` directory holds the text-model data and training toolchain:
- Raw data: `text_object_dataset.jsonl`. Each line is a JSON record `{"text": "...", "label": "101"}`, containing both Chinese and English.
- `collect_data.py`: interactively add data.
- `clean_dataset.py`: filters out samples containing Chinese characters, writes to `cleaned_text_object_dataset.jsonl`.
- `train_model.py`: trains / incrementally trains an `SGDClassifier` on the cleaned data, saves it to `text_model.pkl`, and prints training metrics (see the sketch after this list).
- `run_model.py`: tests the model on the command line, printing top-k labels + probabilities for given sentences.
- `text_infer_server.py`:
  - Loads the model once at startup.
  - Exposes a TCP server: for each line of text it receives, it runs inference and returns `"label|prob\n"`.
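A minimal sketch of the `train_model.py` / `run_model.py` idea, assuming the JSONL layout above; the `loss="log_loss"` setting (needed so `predict_proba` can produce the probability in `"label|prob"`) and the exact pickle layout are assumptions.

```python
import json

import joblib
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Load the cleaned dataset: one {"text": ..., "label": ...} record per line.
texts, labels = [], []
with open("test/cleaned_text_object_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        texts.append(record["text"])
        labels.append(record["label"])

vec = HashingVectorizer(n_features=2**16, alternate_sign=False)
clf = SGDClassifier(loss="log_loss")          # log loss so predict_proba is available
clf.fit(vec.transform(texts), labels)

joblib.dump({"vectorizer": vec, "classifier": clf}, "text_model.pkl")

# Inference in the style of run_model.infer_once / text_infer_server.py:
def infer_once(text):
    proba = clf.predict_proba(vec.transform([text]))[0]
    best = proba.argmax()
    return clf.classes_[best], float(proba[best])

label, prob = infer_once("a small red apple")
print(f"{label}|{prob:.3f}")                  # e.g. 101|0.923
```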
`main.py` glues three things together:
- Hand tracking + UDP: starts the tracking loop from `tools.hand_udp`.
- Serial bridge: registers `on_payload(payload)`, which:
  - checks whether any hand in the JSON is pinching;
  - if yes → `ser.write(b"1")`; otherwise → `ser.write(b"0")`.
- Text TCP server: uses `TextInferHandler` + `ThreadedTCPServer` to listen on port 9009 for text from Unity.
So you only need to run `python main.py` to get Unity hand tracking + text-driven object generation + hardware feedback all at once. A sketch of this wiring follows.
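A minimal sketch of that wiring, under the assumptions that `tools.hand_udp` exposes a blocking `run(on_payload=...)` entry point and that `infer_once(text)` returns `(label, prob)`; the serial port name is also just an example.

```python
import threading
import socketserver

import serial                                  # pyserial

from tools import hand_udp                     # camera + MediaPipe + UDP streaming
from tools.run_model import infer_once         # in-memory text model inference

ser = serial.Serial("COM9", 9600, timeout=0)   # change COM9 to your port
last_pinch = None

def on_payload(payload):
    """Per-frame callback: map 'any hand pinching' to a serial edge."""
    global last_pinch
    pinching = any(h.get("pinch") for h in payload.get("hands", []))
    if pinching != last_pinch:                 # only write on state changes
        ser.write(b"1" if pinching else b"0")
        last_pinch = pinching

class TextInferHandler(socketserver.StreamRequestHandler):
    def handle(self):
        text = self.rfile.readline().decode("utf-8").strip()
        label, prob = infer_once(text)
        self.wfile.write(f"{label}|{prob:.3f}\n".encode("utf-8"))

tcp = socketserver.ThreadingTCPServer(("127.0.0.1", 9009), TextInferHandler)
threading.Thread(target=tcp.serve_forever, daemon=True).start()

hand_udp.run(on_payload=on_payload)            # blocks until you press 'q'
```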
If you don’t want to mix too much logic into `main.py`, you can run `arduino_udp_receive.py` separately:
- Listens on UDP port 5066 for the same JSON as Unity.
- Parses the current pinch state.
- When the state changes, sends `'0'` or `'1'` to the Arduino serial port.
- Good for debugging the “hardware bridge” in isolation (see the sketch below).
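A minimal sketch of that bridge; port 5066 and the `'0'`/`'1'` protocol come from the repo, while the serial port name and everything else here are illustrative.

```python
import json
import socket

import serial                                # pyserial

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 5066))               # same JSON stream as Unity, on the bridge port
ser = serial.Serial("COM9", 9600, timeout=0) # change COM9 to your port

last = None
while True:
    data, _ = sock.recvfrom(65535)
    payload = json.loads(data.decode("utf-8"))
    pinching = any(h.get("pinch") for h in payload.get("hands", []))
    if pinching != last:                     # only write on state changes
        ser.write(b"1" if pinching else b"0")
        last = pinching
```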
Responsibilities of `HandFromVectors.cs`:
- Creates a `UdpClient` to listen on the given port (default 5065).
- Parses the JSON from Python:
  - `wrist` pixel coordinates + normalized coords + `z_m` (depth)
  - 20 bone direction vectors (unit vectors)
  - `pinch` / `is_left` flags
- Uses camera intrinsics (via the Unity `Camera` projection) and pre-configured bone lengths to reconstruct the 21 joint positions in Unity world space (sketched below).
- Dynamically creates:
  - a sphere array `jointObjects` to visualize joints
  - a `LineRenderer` array `boneLines` to draw bones
- Exposes an API:
  - `TryGetJointPosition(handIndex, jointIndex, out Vector3 pos)`
  - `bool IsPinching(handIndex)`
  - `bool AnyHandPinching`
It also draws a GUI window in-scene that lets you:
- Adjust each bone’s length.
- Toggle debug options.
- See how many hands are active and their pinch states.
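The reconstruction itself boils down to placing the wrist (presumably by unprojecting its pixel coordinates at depth `z_m` through the camera) and then accumulating each bone vector scaled by its configured length. A language-agnostic sketch of that accumulation, written in Python for brevity rather than taken from the C# script:

```python
import numpy as np

def reconstruct_joints(wrist_world, bones, bone_lengths):
    """
    wrist_world  : (3,) wrist position already placed in world space
    bones        : 20 dicts {"from": i, "to": j, "vx": ..., "vy": ..., "vz": ...},
                   ordered so that each bone's parent joint is computed first
    bone_lengths : {(from, to): length_in_meters}, configured on the Unity side
    """
    joints = {0: np.asarray(wrist_world, dtype=float)}   # joint 0 = wrist
    for b in bones:
        direction = np.array([b["vx"], b["vy"], b["vz"]], dtype=float)
        joints[b["to"]] = joints[b["from"]] + direction * bone_lengths[(b["from"], b["to"])]
    return joints
```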
Attach `PinchGrabBall` to any GameObject and assign a `handTracker`; the object becomes pinch-grabbable:
- When not yet grabbed:
  - Iterates all hands and checks whether any pinching hand has its control joint (`controlJointIndex`, the index fingertip by default) within `grabDistance` of the object.
  - If so, treats that as “grabbed” and records:
    - which hand grabbed it: `grabbedHandIndex`
    - the initial offset from the follow joint to the object: `grabOffset`
- While grabbed:
  - If `usePhysics` is enabled: disables gravity on its `Rigidbody`, zeroes out velocity, and drives it by position interpolation.
  - If not using physics: directly uses `Vector3.Lerp` to move `transform.position` toward the target, controlled by `followSmoothing`.
- `pinchReleaseGrace` protects against MediaPipe glitches (see the sketch below):
  - Short pinch dropouts that immediately recover will not drop the object.
  - The object is released only if the pinch has been off for longer than the grace time.
The script has a static counter `grabbedCount`; other scripts (e.g. camera control) can query `AnyObjectGrabbed` to know whether anything is currently being held.
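A minimal sketch of just the release-grace debounce (illustrative Python, not the C# implementation; the grace value is an assumption):

```python
class ReleaseGrace:
    """Keeps an object 'grabbed' through brief pinch dropouts."""

    def __init__(self, grace_s=0.15):          # assumed pinchReleaseGrace value
        self.grace_s = grace_s
        self.last_pinch_time = None

    def still_grabbed(self, pinching, now):
        if pinching:
            self.last_pinch_time = now
            return True
        # Pinch is currently off: keep holding until the grace window expires.
        return (self.last_pinch_time is not None
                and now - self.last_pinch_time <= self.grace_s)
```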
Attach `HandOrbitCamera` to the camera so that pinch gestures control the camera whenever nothing is grabbed:
- Choose a joint (default: index fingertip) as the control point.
- Record the hand position and yaw/pitch at the moment the pinch starts.
- While the pinch is held:
  - Map hand movement on screen to yaw / pitch.
  - Clamp pitch to avoid flipping behind the head.
  - Adjust the radius `radius` based on zoom gestures or depth change to push/pull the camera.
Final camera update:
orbitCamera.transform.position = pivot + dir.normalized * radius;
orbitCamera.transform.LookAt(pivot, Vector3.up);

Design: attach `ModelLibrary` to an empty GameObject and put all model prefabs as its children:
- In `Awake()`:
  - Collect all child `GameObject`s into a `Dictionary<string, GameObject>` keyed by their names.
  - `SetActive(false)` on all children, treating them as templates.
- `ShowModelByLabel(string label)`:
  - Find the template by label.
  - If there is already a displayed instance, hide or deactivate it.
  - Spawn the new object near `spawnAnchor.position + spawnOffset`.
  - Ensure it has `PinchGrabBall` attached and configure `handTracker`, `grabDistance`, `usePhysics`.
With this, the label returned by the text model directly determines which prefab appears in the scene.
`RuntimeModelLoader` is used to “load models on the fly instead of baking all of them into the scene”.
Main interfaces:
- `MakeFileName(int index)`:
  - Default is `101 → "101.glb"`, but you can replace this with a more complex mapping (e.g. via a table).
- `LoadByIndexAsync(int index)`:
  - Builds a path (usually under `Application.streamingAssetsPath`).
  - Uses `GltfImport` to load the GLB.
  - Instantiates it as a `GameObject`.
  - If `currentInstance` exists, destroys the old one first.
  - Parents the new object under the loader and zeroes `localPosition` / `localRotation`.
- `LoadByIndex(int index)`:
  - Convenience wrapper: `_ = LoadByIndexAsync(index);` (fire-and-forget, ignores the `await`).
If you have a batch of `.glb` files in `StreamingAssets`, you can directly map labels to filenames and truly load models on demand in Unity.
Attach `TextQueryClient_TMP` to a UI GameObject; it uses TMP input + buttons to talk to the Python text service.
Flow:
- `OpenDialog()`: `dialogPanel.SetActive(true)` and focus the input field.
- Button click → `OnClickSend()`:
  - Read the user text from `descriptionInput.text`.
  - Start `SendQueryCoroutine(q)`.
- Inside `SendQueryCoroutine`:
  - Use a `TcpClient` to connect to `serverIp:serverPort` (default `127.0.0.1:9009`).
  - Send `q + "\n"` in UTF-8.
  - Block until one line of response is read.
  - Parse `<label>|<prob>`.
  - If `modelLibrary` is bound, call `modelLibrary.ShowModelByLabel(label)`.
  - Optionally show the prediction on a TMP text widget.
- `OnDestroy()` closes the stream and socket.
Hand data (UDP JSON, Python → Unity / Arduino bridge):
- Ports: 5065 (Unity) / 5066 (Arduino UDP bridge)
- Encoding: UTF-8 JSON
- Top-level fields: `timestamp`, `fps`, `num_hands`, `hands`
- Each hand contains:
  - `id` – hand index
  - `is_left` – whether it is a left hand
  - `wrist` – `{px, py, nx, ny, z_m}`
  - `bones` – list of `{from, to, vx, vy, vz}`
  - `pinch` – bool
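An illustrative single-hand packet matching the fields above (the numeric values are made up), sent the same way `hand_udp.py` streams them:

```python
import json
import socket
import time

payload = {
    "timestamp": time.time(),
    "fps": 30.0,
    "num_hands": 1,
    "hands": [{
        "id": 0,
        "is_left": False,
        "wrist": {"px": 322, "py": 241, "nx": 0.50, "ny": 0.50, "z_m": 0.45},
        "bones": [{"from": 0, "to": 1, "vx": 0.12, "vy": -0.31, "vz": 0.94}],  # 20 entries in practice
        "pinch": False,
    }],
}

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(json.dumps(payload).encode("utf-8"), ("127.0.0.1", 5065))
```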
Text query (TCP, Unity → Python):
- Address: `127.0.0.1:9009`
- Request: one line of text + `\n`
- Response: `<label>|<probability>\n`
Example: `a small red apple\n` → `101|0.923\n`
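The same round trip from Python, as a minimal client equivalent to what `TextQueryClient_TMP` does from Unity:

```python
import socket

with socket.create_connection(("127.0.0.1", 9009)) as s:
    s.sendall("a small red apple\n".encode("utf-8"))
    reply = s.makefile("r", encoding="utf-8").readline().strip()

label, prob = reply.split("|")
print(label, prob)      # e.g. 101 0.923
```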
Serial (Python → Arduino):
- Baud rate: 9600
- Data: a single-byte ASCII char
  - `'1'` – at least one hand is pinching
  - `'0'` – no hand is pinching
How Arduino interprets this is up to you; the example here is driving a servo to different angles.
- `object_models_csv.csv`:
  - Maintains a table “object ID → English name → category”, e.g.:
    - `101, red apple, fruit`
    - `203, tomato, vegetable`
    - …
  - Used to align:
    - text dataset labels
    - GLB filenames
    - Unity prefab names
- `text_object_dataset.jsonl` / `cleaned_text_object_dataset.jsonl`:
  - Can be extended over time to train stronger text models.
  - The current cleaning logic simply filters out Chinese samples and keeps only English sentences.
From an engineering perspective, this repo already connects three “worlds”:
Camera world → Algorithm world → Virtual world → (future) Hardware world
Possible extensions:
- Semantic upgrades
  - Replace the current simple `HashingVectorizer` + `SGDClassifier` with semantic retrieval or large-model embeddings.
  - Truly support mixed Chinese/English input so children can describe in Chinese and the system internally maps to English.
- Gesture upgrades
  - Beyond pinch, add more dynamic gestures like fist, pointing, waving.
  - Map gestures to different teaching interactions (select, confirm, delete, etc.).
- Content generation
  - Generate whole teaching levels at once instead of single objects.
  - Use the object CSV + text descriptions to generate contextual scenes for kids (kitchen, supermarket, classroom, …).
- Hardware feedback
  - More complex gloves, vibration motors, brake/force devices to “materialize” virtual objects in the real world.
  - Map Unity collisions and task-completion events to multi-channel hardware feedback.
- Online learning
  - Log user sentences and chosen objects, and continue fine-tuning the text classifier.
  - Let the system gradually adapt to each user’s way of speaking.
OneTouchReality started as a full “VR hand-gesture controlled robotic arm system”: Camera + MediaPipe for hand recognition → UDP sends 21 keypoints to Unity → Unity reconstructs the 3D hand and does collision detection → TCP sends “which finger, how much force” back to Python → Arduino controls 5 servos to turn virtual collisions into real pulling forces on the fingers.
- Vision pipeline preserved but cleaned up
  - Python still uses MediaPipe for hand keypoints and pose estimation, but no longer dumps all 21 points into Unity. Instead it extracts:
    - the real-world wrist distance (meters)
    - 20 bone direction vectors
    - discrete states like pinch / no pinch
  - The distance estimation module inherits OneTouchReality’s idea of “palm width / palm length dual channels + weighted fusion + median/EMA filtering”, but wraps it as `hand_easy.py`, with `c` for calibrate, `r` for reset and `q` for quit, so it’s easy to reuse in other projects.
- Interaction logic: from “robotic arm” to “teaching objects + grabbing + camera control”
  - The Unity side uses `HandFromVectors.cs` to receive JSON from Python, reconstruct the 21 joint positions from bone directions and lengths, and draw a virtual hand with spheres and lines.
  - `PinchGrabBall.cs` implements “pinch to make it follow your hand”:
    - Among all pinching hands, find the one closest to the object as the “grabbing hand”.
    - Record the finger-to-object offset.
    - Update the model position based on joint position + offset with smoothing, using either rigidbody physics or direct interpolation.
  - `HandOrbitCamera.cs` turns pinch into a “general gesture mouse”:
    - Single-hand pinch → orbit the camera around the target.
    - Two-hand pinch → change the camera radius (pinch-to-zoom).
    - While any object is grabbed by `PinchGrabBall`, the camera temporarily stops reacting to avoid conflicts.
- From “collision-sensing glove” to “scene that understands language”
  - In the original project, Unity detected which tags the finger bones collided with, then sent which fingers and how much force back to Python/Arduino.
  - In SecondTouchReality the focus shifts to “text → object”:
    - `collect_data.py` interactively gathers “description + label” pairs.
    - `clean_dataset.py` filters non-English samples, leaving only `text` / `label`.
    - `train_model.py` trains a lightweight multi-class model with `HashingVectorizer` + `SGDClassifier` and produces `text_model.pkl`.
    - `text_infer_server.py` opens a TCP service that receives descriptions from Unity and returns labels in real time.
  - On the Unity side, `TextQueryClient_TMP.cs`:
    - Shows a TMP input dialog.
    - Sends the user text over TCP to the Python text server.
    - On receiving a label, calls `ModelLibrary.ShowModelByLabel()` to activate the corresponding 3D model.
    - Displays the result in the UI and pops a short “success” panel.
- Hardware chain: from multi-servo cable tightening → single-channel pinch signal
  - OneTouchReality’s end goal is 5-servo cable pulling to simulate fingertip haptics, requiring full mechanical design, cable management, springs and an anti-twist structure.
  - In this derived project, we first complete the signal chain:
    - `hand_udp.py` marks each hand’s pinch state in every JSON frame.
    - `arduino_udp_receive.py` or `main.py` registers a callback: when any hand transitions between “not pinching → pinching” or vice versa, it sends a single character over serial: `'1'` means tighten, `'0'` means release.
    - The Arduino runs a minimal servo sketch: `'1'` → 45°, `'0'` → 0°.
  - With this, you can tie one string to your fingertip and close the loop:
    Camera → Python → Unity → Arduino → Finger
    to test latency, stability and safety first, then gradually scale to multi-servo and more complex mechanisms.
- Integration: `main.py` = one-click full pipeline
  - `main.py` (essentially `combined_server.py`) runs four things in one process:
    - Start `tools.hand_udp`: camera + MediaPipe + hand depth + pinch detection, sending JSON to Unity via UDP.
    - In `on_payload`, extract the pinch state and send `'0'`/`'1'` to the Arduino on edge changes.
    - Start the text inference TCP server so Unity can request models by natural language.
    - Manage the serial lifecycle and exceptions (auto-retry on disconnection / safe close).
Conceptually, SecondTouchReality is a teachable, extensible mini backbone carved out of the full OneTouchReality chain for fast experiments:
- Camera → hand geometry & depth
- Text description → object label
- Unity scene → grabbing / camera / teaching levels
- Pinch → single or multi-servo feedback
Later you can plug these modules back into a “full-size” force-feedback glove system, or treat this as a base for an AI-driven interactive teaching platform.
SecondTouchReality is essentially an end-to-end playground prototype: every module is simple enough to hack on freely, yet the chain is complete enough for you to experience the full loop from camera to virtual object to physical feedback.