Dist Robotics

Building affordable and safe humanoid robots through collective human guidance and distributed learning.
Github




Autonomous robots through distributed reinforcement learning


In the coming years, humanoid robots will be deployed in many aspects of our lives. But the question is: who should decide what these robots can or cannot do? The simple answer: the community, through voting, just as we vote on new laws. The means: an open-source model for both hardware and software.

Project goal: Pave the way toward an affordable, open-source robot that anyone can quickly assemble. This robot will be able to share real-world training data, guided by human feedback, with a centralized server or, better yet, a decentralized training system. All robots in the network will benefit from the updated model weights.

Through open-source code development, human task demonstrations, and the robots' mimicking capabilities, each individual will be able to suggest and teach new tasks for the robot to learn. The community will then be able to validate and reinforce, or reject, the tasks suggested by others.

A voting mechanism could ensure that tasks deemed dangerous are ruled out of the system, and a reward mechanism could remunerate contributors (writing reward functions, ranking real-world robot trajectories, recording videos for imitation learning, etc.).

The project's name, "Dist Robotics", reflects this core idea of distributed, community-driven robotic learning. But before any of this, the robot has to be built.

Challenge: Build a humanoid robot from scratch for less than 1,000 USD (about 900 CHF) within 12 months (currently at month 7) and demonstrate successful transfer of learned behaviors, real-world reinforcement learning, and mimicking capabilities.

Success Criteria: The robot should pick up a light object, like an apple, from point A and deliver it to point B, autonomously.

Several technical challenges still need to be overcome for such a project. Training simulated robots with deep reinforcement learning and then transferring the learned weights to the physical robots is regarded as the state of the art, but it requires powerful infrastructure and highly accurate modeling of the robot and its environment.

In this project, we settle for moderate modeling accuracy and instead investigate how real-world reinforcement learning (fine-tuning the weights learned in simulation) can give robots reliable control without excessive simulation overhead.

To achieve scalability in this setting, multiple robots should ideally provide feedback in parallel, generating enough data to update the model weights globally; the updated weights can then be redistributed to all participating robots.

Such an approach has significant benefits. Ultimately, this project isn't just about building a robot; it's about building a system where the community decides how autonomous agents behave.




Intelligent Control System


The robot's main cognitive component is a deep neural network that processes proprioceptive feedback from the joints (angles and angular velocities), stability information (root orientation), and task-specific inputs at 30 Hz to derive the optimal actuator speeds. Velocity control was chosen to simplify the hardware requirements and the model integration. However, to achieve truly smooth motion, torque control will probably be adopted in the long run, at a higher frequency of around 120 Hz or more.
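As a rough illustration, the 30 Hz loop could look like the sketch below, assuming a PyTorch policy and hypothetical hardware wrappers (`read_joint_angles`, `read_joint_velocities`, `read_root_orientation`, `send_velocity_commands`) that are not part of the actual codebase:

```python
import time
import torch

CONTROL_HZ = 30
PERIOD = 1.0 / CONTROL_HZ

def control_loop(policy: torch.nn.Module, task_input: torch.Tensor):
    """Query the policy at 30 Hz and forward velocity targets to the motor driver."""
    while True:
        start = time.monotonic()

        # Proprioceptive feedback: joint angles/velocities and root orientation.
        # The read_* and send_velocity_commands functions are hypothetical wrappers.
        obs = torch.cat([
            read_joint_angles(),        # shape: (num_joints,)
            read_joint_velocities(),    # shape: (num_joints,)
            read_root_orientation(),    # e.g. roll/pitch/yaw from the gyroscope
            task_input,                 # task-specific inputs (e.g. target velocity)
        ])

        with torch.no_grad():
            action = policy(obs.unsqueeze(0)).squeeze(0)  # normalized joint velocities

        send_velocity_commands(action.tolist())

        # Sleep for whatever remains of the ~33 ms control period.
        time.sleep(max(0.0, PERIOD - (time.monotonic() - start)))
```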

The output of the neural network is processed by a driver that modulates the voltage delivered to each DC motor from a 24 V power supply. Each motor can apply a maximum torque of 10 Nm; the ankle and knee motors will soon be replaced with 20 Nm ones.
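For illustration only (the real wiring is documented in Annexe I), a normalized command in [-1, 1] could be mapped to a direction signal plus a PWM duty cycle on the PCA9685; the pin numbers, the PWM frequency, and the use of the RPi.GPIO and adafruit-circuitpython-pca9685 libraries below are assumptions, not the project's exact setup:

```python
import board
import busio
import RPi.GPIO as GPIO
from adafruit_pca9685 import PCA9685

DIR_PINS = {0: 17}  # hypothetical BCM pin for motor 0's direction signal

i2c = busio.I2C(board.SCL, board.SDA)
pca = PCA9685(i2c)
pca.frequency = 1000  # PWM frequency in Hz (assumed value)

GPIO.setmode(GPIO.BCM)
for pin in DIR_PINS.values():
    GPIO.setup(pin, GPIO.OUT)

def drive_motor(motor: int, command: float):
    """Map a normalized command in [-1, 1] to a direction signal + PWM duty cycle."""
    command = max(-1.0, min(1.0, command))
    GPIO.output(DIR_PINS[motor], GPIO.HIGH if command >= 0 else GPIO.LOW)
    pca.channels[motor].duty_cycle = int(abs(command) * 0xFFFF)  # 16-bit duty cycle
```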

To learn the optimal control output for every state the robot might encounter, a reinforcement learning agent (Proximal Policy Optimization, PPO) computes the appropriate feedback at each step of many simulated (or real-world) trajectories and uses it to train the deep neural network. With this AI-driven learning strategy, the robot should achieve precise and stable motion on complex, versatile tasks with minimal hardcoding. The feedback is based on a reward function: when learning to walk, for example, part of the reward might be the position at time t minus the position at time t-1. If the reward is positive, the behavior is reinforced; if negative, it is penalized, lowering the probability of reproducing that action in similar states.
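As a toy example of such a reward (the actual reward terms and weights used in this project differ), a walking reward could combine forward displacement with an energy penalty and a fall penalty:

```python
import numpy as np

def walking_reward(pos_t, pos_prev, joint_velocities, torques, fallen,
                   w_energy=0.001, fall_penalty=10.0):
    """Toy shaped reward: forward displacement since the last step,
    minus an energy term, with a penalty for falling."""
    forward_progress = pos_t[0] - pos_prev[0]  # displacement along the walking axis
    energy = np.sum(np.abs(np.asarray(torques) * np.asarray(joint_velocities)))
    reward = forward_progress - w_energy * energy
    if fallen:
        reward -= fall_penalty
    return reward
```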




Methods


The project applies proven concepts that balance performance and simplicity to ensure fast development. The main inspirations came from the following:


| Title | Authors | Published | Link | Overview |
| --- | --- | --- | --- | --- |
| Reinforcement Learning: An Introduction | Richard S. Sutton, Andrew G. Barto | 1998 (2nd edition: 2018) | Official Book Website | Fundamental concepts and algorithms of reinforcement learning, providing the theoretical basis for training robots to learn through interaction with their environment. |
| Proximal Policy Optimization (PPO) | John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov | 2017 | Read the Paper | A robust and efficient policy gradient method, commonly used for training robot control policies due to its stability and simplicity. |
| DeepMimic: Example-Guided Deep Reinforcement Learning | Xue Bin Peng, Pieter Abbeel, Sergey Levine, Michiel van de Panne | 2018 | Project Page | An imitation learning approach that trains robots to mimic human-like motions from motion capture data, useful for humanoid robotics. |
| Continuous Control with Deep Reinforcement Learning | Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, et al. | 2016 (arXiv) | Read the Paper | Reinforcement learning in continuous action spaces, critical for motor control tasks. |
| Understanding Domain Randomization for Sim-to-Real Transfer | Chen, Wang, Zhang, and others | 2021 | Read the Paper | Domain randomization techniques for transferring RL policies from simulation to real-world robots by randomizing environment parameters during training. |



Deep Learning Models


Deep neural networks are used in various parts of the robot's software, from controlling the motors to detecting objects in space. The main models are the following:

| Model | Description | Used For |
| --- | --- | --- |
| Critic Model (custom built) | Evaluates the expected return from a state (or state-action pair), helping assess how advantageous a particular action was. | RL (PPO) |
| Actor Model (custom built) | Outputs a probability distribution over actions given a state, from which actions are sampled. | RL (PPO) |
| Curiosity Model (custom built) | Enables model-based learning and "dreaming". | Planning |
| Inception-ResNet-v2 | Object position estimation in 3D space. | Computer Vision |
| Google MediaPipe | Pose estimation for joint orientation. | DeepMimic |
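As a rough sketch of what the custom actor and critic could look like in PyTorch (the layer sizes and the Gaussian action distribution are assumptions, not the exact architecture used here):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Outputs a Gaussian distribution over joint velocity targets."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mean = self.net(state)
        return torch.distributions.Normal(mean, self.log_std.exp())

class Critic(nn.Module):
    """Estimates the expected return (value) of a state."""
    def __init__(self, state_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)
```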



Robots


The robots are designed to balance resemblance to human morphology (to mimic behaviors easily) with simplicity for fast prototyping. For now, only the lower body has been built; the rest will be assembled once basic locomotion is achieved on the lower body.

| Feature | Legs (Lower Body) - Done | Lower and Upper Body - Design in progress - ToDo | Full Body (With Fingers) - ToDo |
| --- | --- | --- | --- |
| Action Size (joints) | 8 | 18 | 38 |
| State Size | 48 | 78 | 150+ |
| Primary Function | Bipedal locomotion | Locomotion & basic upper body tasks | High level of dexterity |
| Electronics | Raspberry Pi, DC motors, encoders, gyroscope | Camera, microphone, loudspeaker | Servo motors |
| Size and Weight | 70 cm - 7 kg | 131 cm - 13.5 kg | 131 cm - 14 kg |



Hardware


The development of the robot started with the CAD design of the lower body. The design underwent three major iterations, leading to significant improvements in stability, motor placement, and articulation. The plastic parts were printed in PLA on Prusa and Creality 3D printers, and everything was assembled with M3, M4, and M6 bolts.

The electronics include a 24 V power supply, a Raspberry Pi, jumper wires, DC motors, Hall-effect encoders, a gyroscope, and multiplexers. Flange bearings are used at each joint to support the torque induced by the motors.
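As an illustration of how the joint angles can be read, the sketch below selects one channel of the TCA9548A multiplexer and reads the AS5600 raw angle over I2C with smbus2; the bus number, channel mapping, and register layout are assumptions drawn from the datasheets, not the project's actual driver code:

```python
from smbus2 import SMBus

TCA9548A_ADDR = 0x70   # I2C multiplexer address (assumed default)
AS5600_ADDR = 0x36     # AS5600 magnetic encoder address
RAW_ANGLE_REG = 0x0C   # 12-bit raw angle, high byte first

def read_joint_angle(bus: SMBus, mux_channel: int) -> float:
    """Select one multiplexer channel, then read the encoder angle in degrees."""
    bus.write_byte(TCA9548A_ADDR, 1 << mux_channel)          # enable a single channel
    hi, lo = bus.read_i2c_block_data(AS5600_ADDR, RAW_ANGLE_REG, 2)
    raw = ((hi & 0x0F) << 8) | lo                            # 0..4095
    return raw * 360.0 / 4096.0

# Example: read the eight lower-body joints through multiplexer channels 0..7.
with SMBus(1) as bus:
    angles = [read_joint_angle(bus, ch) for ch in range(8)]
```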

For now, the robot can auto-calibrate itself and perform simple flexions and extensions, but transferring the weights learned in simulation still faces challenges, mainly because of motor play and unpredictable motor dynamics.


CAD design of the lower body

CAD Design

Robot v2

Version 2

Version 3




Simulation


PyBullet was chosen for the simulation for its convenience and wide support. The "humanoid.urdf" file available in the PyBullet library was rearranged and upgraded to match the designed robot as closely as possible (link lengths, joint types, mass, inertia, friction, etc.).
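A minimal sketch of this setup (the URDF file name and the dynamics values are placeholders, not the project's actual parameters):

```python
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                       # headless; use p.GUI to visualize
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)
plane = p.loadURDF("plane.urdf")
robot = p.loadURDF("humanoid_lower_body.urdf", basePosition=[0, 0, 0.8])

# Nudge the simulated links toward the measured hardware
# (placeholder values, not the project's calibration numbers).
for link in range(p.getNumJoints(robot)):
    p.changeDynamics(robot, link, lateralFriction=0.9)

p.setTimeStep(1.0 / 30.0)  # assumption: one physics step per 30 Hz control step
```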

During reinforcement learning training, the simulation generates multiple agents (robots) in parallel; each one is automatically regenerated either when it falls or after the maximum number of steps per episode (an episode is simply the sequence of steps a robot takes during one trajectory). A parallelization of 64 to 256 agents was used and the batch size was set to 50,000, meaning that after that many steps the PPO algorithm recomputes the new optimal weights of the neural network. Episodes were capped at 300 steps, so trajectories last 10 seconds or less at 30 Hz.
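The sketch below illustrates this rollout scheme with the numbers above; the environment and policy interfaces are hypothetical and only meant to show where the batch boundary and episode resets sit:

```python
# Rollout configuration matching the numbers above.
NUM_ENVS = 64            # 64 to 256 simulated robots in parallel
BATCH_STEPS = 50_000     # PPO update after this many environment steps
MAX_EPISODE_STEPS = 300  # <= 10 seconds of motion at 30 Hz

def collect_batch(envs, policy):
    """Step all parallel environments until one PPO batch is filled."""
    batch, steps = [], 0
    states = [env.reset() for env in envs]
    while steps < BATCH_STEPS:
        for i, env in enumerate(envs):
            action = policy.act(states[i])                 # hypothetical interface
            next_state, reward, done, _ = env.step(action)
            batch.append((states[i], action, reward, done))
            # Regenerate the robot when it falls or hits the episode step limit.
            states[i] = env.reset() if done else next_state
            steps += 1
    return batch
```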

parallel training

Parallel training




Results


The first task for the robot was learning to stand, which took about 20,000 episodes for the lower body (about 10 minutes of parallel training on an Nvidia RTX 4090 GPU).
Learning to walk takes over 100,000 episodes and several hours. The "walking" task reuses the weights learned from the "standing" task, an approach referred to as "curriculum learning": increasing the difficulty of tasks over time.
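A minimal sketch of this curriculum step, reusing the Actor sketch from above (the checkpoint names and dimensions are illustrative):

```python
import torch

# Start the walking policy from the standing weights (curriculum learning).
policy = Actor(state_dim=48, action_dim=8)                  # lower-body sizes from the table
policy.load_state_dict(torch.load("standing_policy.pt"))    # hypothetical checkpoint name

# ... continue PPO training with the walking reward, then save the result:
torch.save(policy.state_dict(), "walking_policy.pt")
```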

Lower body learning to stand

Learning to stay standing - rewards over time

Lower body learning to walk

Learning to mimic a squat

Robustness techniques such as domain randomization are necessary to transfer the learned weights successfully to the real robot. In this training setting, external forces are applied to random parts of random joints, and noise of up to 30% is applied to all of the following: mass, inertia, state readings, maximum torque, maximum velocity, velocity gain, terrain shape, and friction.
This way, even if the real robot's dynamics differ from the simulation, the learned policy should still perform the task despite the imprecision of the real world. Even so, simulation-to-real transfer remains an extremely hard challenge.
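A minimal sketch of such randomization in PyBullet (the force magnitude and the subset of parameters shown here are illustrative, not the project's full list):

```python
import numpy as np
import pybullet as p

def randomize_dynamics(robot, rng: np.random.Generator, noise=0.3):
    """Perturb link mass and friction by up to +/-30% at the start of each episode."""
    for link in range(p.getNumJoints(robot)):
        info = p.getDynamicsInfo(robot, link)                # info[0]=mass, info[1]=friction
        p.changeDynamics(robot, link,
                         mass=info[0] * rng.uniform(1 - noise, 1 + noise),
                         lateralFriction=info[1] * rng.uniform(1 - noise, 1 + noise))

def random_push(robot, rng: np.random.Generator, max_force=50.0):
    """Apply an external force to a random link (force magnitude is illustrative)."""
    link = int(rng.integers(-1, p.getNumJoints(robot)))      # -1 is the base link
    force = rng.uniform(-max_force, max_force, size=3).tolist()
    p.applyExternalForce(robot, link, force, [0, 0, 0], p.LINK_FRAME)

def noisy_observation(obs, rng: np.random.Generator, noise=0.3):
    """Corrupt state readings with multiplicative noise of up to +/-30%."""
    return obs * rng.uniform(1 - noise, 1 + noise, size=obs.shape)
```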

Standing and resisting perturbations

Walking

Walking and resisting perturbations

As mentioned earlier, another aspect of learning is imitation. The approach used here is borrowed from DeepMimic, but instead of using a motion capture system, we use pose estimation from videos to extract target angles and joint positions at each time step. For now, the squat is the only task that has been transferred to the robot.
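As an illustration of this pipeline, the sketch below uses MediaPipe Pose to extract a right-knee target angle from each frame of a squat video (the landmark choice and angle convention are simplifications):

```python
import cv2
import mediapipe as mp
import numpy as np

mp_pose = mp.solutions.pose

def angle(a, b, c):
    """Angle at point b (in degrees) formed by segments b->a and b->c."""
    a, b, c = np.array(a), np.array(b), np.array(c)
    cosine = np.dot(a - b, c - b) / (np.linalg.norm(a - b) * np.linalg.norm(c - b))
    return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))

def knee_angles_from_video(path):
    """Extract right-knee target angles from a squat video, frame by frame."""
    cap = cv2.VideoCapture(path)
    targets = []
    with mp_pose.Pose() as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks is None:
                continue
            lm = result.pose_landmarks.landmark
            hip, knee, ankle = (
                (lm[i].x, lm[i].y) for i in
                (mp_pose.PoseLandmark.RIGHT_HIP,
                 mp_pose.PoseLandmark.RIGHT_KNEE,
                 mp_pose.PoseLandmark.RIGHT_ANKLE)
            )
            targets.append(angle(hip, knee, ankle))
    cap.release()
    return targets
```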


Pose estimation of a squat

Lower body mimicking the squat

Full body mimicking the squat




Next Steps


The progress so far is promising, with many core functionalities successfully implemented. However, the primary challenge remains: achieving transferability from the simulated environment to the physical robot. Once this is achieved, the upper body will be assembled. Other key challenges lie ahead.



Interested in this project?




Annexe I: Electronics Schematic


Some symbols are missing from the KiCad library, so similar ones have been used instead: AS5047D represents the AS5600 encoder, TCA9534 represents the TCA9548A multiplexer, and PCA9685BS represents the PCA9685 motor driver. The wires going from the Raspberry Pi to the right side of each motor carry the direction signals; the wires entering the motors from the left carry the PWM from the driver.

electronics

KiCad electronics schematic (lower body only - velocity control)




Annexe II: Development Timeline


| Month | Phase | Key Developments |
| --- | --- | --- |
| April 2024 | Foundation and Initial Setup | Built PyBullet simulations, implemented PPO, studied Sim2Real transfer |
| May | Standing and Early Motion | Implemented RL for standing based on torque control, added parallelization, refined reward shaping, started the robot design |
| June | Neural Network Optimization | Implemented DeepMimic, ordered electronic components, created FreeCAD parts, implemented new reward functions (walking, decreasing energy, smoothness) |
| July | Hardware Integration | Cleaned codebase, programmed the Raspberry Pi, conducted velocity & position control experiments, hyperparameter tuning |
| August | Vision and Sensor Improvements | Implemented image recognition and depth estimation, added gyroscope, encoders, multiplexers & IO extenders |
| September | Structural Improvements | Added domain randomization, developed a standing-up task, printed lower-body components, estimated motor dynamics through experiments |
| - | 4-Month Break | Working on another project |
| February 2025 | Resuming Development | Tested higher velocity & torque, improved DeepMimic implementation, improved robot parts |



Annexe III: Project Cost Breakdown


| Item | Price ($) | Units | Total ($) |
| --- | --- | --- | --- |
| PLA Filament - 1Kg | 22 | 2 | 44 |
| Raspberry Pi 5 - 8GB | 88 | 1 | 88 |
| PCA9685 Servo Driver | 3 | 3 | 9 |
| DC Motors 5840-3650 | 24 | 14 | 336 |
| DC Motor 36GP-3650 | 30 | 1 | 30 |
| MPU6050 Gyroscope | 2 | 1 | 2 |
| Rubber Grip Tape - 100cm | 2 | 1 | 2 |
| MicroSD Card - 64GB | 9 | 1 | 9 |
| Small Servo Motor with Encoder | 7 | 20 | 140 |
| DC Buck Converter (12/24V to 5V) | 1 | 1 | 1 |
| Jumper Wires Set | 3 | 2 | 6 |
| Flange Coupling | 1 | 31 | 31 |
| TCA9548A I2C Multiplexer | 1 | 3 | 3 |
| 24V Battery | 33 | 1 | 33 |
| USB Camera (OV3660) | 9 | 2 | 18 |
| Mini USB Microphone | 2 | 1 | 2 |
| AS5600 Magnetic Encoder | 1 | 15 | 15 |
| Speaker Driver Module | 1 | 1 | 1 |
| Screws and Bolts Assortment | 11 | 1 | 11 |
| Large Servo Motors (Neck and Wrist) | 16 | 3 | 48 |
| ADS1115 Analog-to-Digital Converter | 1 | 6 | 6 |
| External 5V Amplifier with Volume Control | 1 | 1 | 1 |
| Bearings for Fingers Pack | 1 | 1 | 1 |
| Bearings for Motors | 1 | 29 | 29 |
| 2mm Shaft for Fingers Pack | 1 | 1 | 1 |
| 1mm Screws for Fingers Pack | 1 | 1 | 1 |
| Silicone | 1 | 1 | 1 |
| Thread for Fingers | 1 | 1 | 1 |
| 1m Jack Audio Extender | 1 | 1 | 1 |
| 0.5m USB Extender | 1 | 3 | 3 |
| Total | | | 874 |