Body, hand and finger gestures
I suggest relying primarily on body, hand and finger gestures, so that VR environments can be controlled completely without extra hardware.
Typing by moving your fingers in the air and controlling characters with gestures would be much better than using VR controllers, and could be more accurate than a mouse and keyboard.
There are already machine learning models that do a good job of estimating body, hand and finger pose and gestures from ordinary RGB cameras. You could improve on them with self-supervised depth estimation models (monocular or binocular).
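To make the idea concrete, here is a minimal sketch of how a gesture could be recognized once a pose model has produced hand keypoints. It assumes the 21-point hand landmark layout used by MediaPipe Hands (wrist = 0, thumb tip = 4, index fingertip = 8, middle finger base = 9). The function name `is_pinch` and the threshold of 0.3 are made up for illustration and would need tuning on real data.

```python
import math

# Landmark indices follow the 21-point hand layout used by MediaPipe Hands.
# This is an assumption: other pose models number their keypoints differently.
WRIST, THUMB_TIP, INDEX_TIP, MIDDLE_MCP = 0, 4, 8, 9


def is_pinch(landmarks, threshold=0.3):
    """Return True when the thumb tip and index fingertip are close together.

    `landmarks` is a list of 21 (x, y, z) tuples from any hand pose model.
    The gap is divided by the wrist-to-middle-knuckle distance so the result
    does not depend on how far the hand is from the camera.
    `threshold` is a hypothetical value, not a tuned one.
    """
    if len(landmarks) != 21:
        raise ValueError("expected 21 (x, y, z) landmarks")
    hand_size = math.dist(landmarks[WRIST], landmarks[MIDDLE_MCP])
    if hand_size == 0:
        return False  # degenerate detection, treat as "no gesture"
    gap = math.dist(landmarks[THUMB_TIP], landmarks[INDEX_TIP])
    return gap / hand_size < threshold
```

A pinch detected this way could stand in for a click or a key press. A depth estimate for each keypoint would make the z coordinate more reliable, which is where the depth models mentioned above could help.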
Facebook already contributes a lot to AI research, with many papers and projects (the most famous one is PyTorch), so it shouldn't be hard to bring this to very good quality.
You can publish the models so that the community benefits and others can contribute. You can also contribute to Monado.
