Empowering Vision, Enabling Freedom Through AI Assistance
SeeSense-AI is a browser-based assistive vision tool designed for blind and low-vision users.
Using a camera feed, AI vision models, and text-to-speech, it turns visual information into spoken feedback so users can:
- Understand their surroundings
- Read printed text in the environment
- Identify everyday objects
All of this is driven through simple voice commands such as:
“Nova, describe”
“Nova, read”
“Nova, identify”
The goal is not to be a generic AI demo, but a focused accessibility tool that gives users more independence in unfamiliar or visually complex environments.
- Voice-activated assistant (Nova)
  Hands-free interaction using speech recognition to trigger actions:
  - `describe` – scene description
  - `read` – text reading
  - `identify` – object identification
  - `repeat` – repeat the last response
- Scene Description
  Captures a frame from the camera and sends it to a vision model, which returns a concise spoken description of what is in front of the user.
- Text Reading
  Reads menus, signs, labels, or other printed text visible to the camera and speaks it back to the user.
- Object Identification
  Identifies the main object in view and provides details like type, color, and context.
- Lighting Awareness
  Detects when the scene is too dark or too bright for reliable analysis and gives spoken guidance on how to adjust (a rough sketch of one possible check follows this list).
- Accessible Interaction Design
  - Full-screen “tap anywhere to start” onboarding
  - Large, high-contrast controls
  - Voice-first flow so users never need to rely on precise mouse/trackpad interaction
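As a rough sketch of how the lighting check could work, the snippet below estimates the mean brightness of a captured frame before it is analyzed. Pillow, the helper name `lighting_hint`, and the 40/220 thresholds are illustrative assumptions and are not taken from the SeeSense-AI source; the real check may equally well run in the browser.

```python
# Illustrative lighting check (not the actual app.py code): flag frames that are
# likely too dark or too bright for the vision model to analyze reliably.
import base64
import io
from typing import Optional

from PIL import Image  # assumes Pillow is available


def lighting_hint(image_b64: str) -> Optional[str]:
    """Return a spoken hint if the frame looks unusable, otherwise None."""
    # Decode the base64 JPEG sent by the browser and convert to grayscale.
    img = Image.open(io.BytesIO(base64.b64decode(image_b64))).convert("L")
    # The mean pixel value (0-255) is a cheap proxy for overall scene brightness.
    mean = sum(img.getdata()) / (img.width * img.height)
    if mean < 40:
        return "The scene looks too dark. Try turning on a light or moving closer to a window."
    if mean > 220:
        return "The scene looks too bright. Try moving away from direct light or glare."
    return None
```

If the hint is not `None`, the app can speak it instead of (or before) sending the frame to the model.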
- The user opens the web app and grants camera and microphone access.
- The browser listens for wake phrases such as “Nova, describe”.
- JavaScript captures a frame from the camera and sends it to the Flask backend.
- The Flask backend forwards the image to a Google Gemini vision endpoint with a task-specific prompt.
- The model response (scene description / text / object info) is returned as JSON.
- The frontend:
- Displays the result in the “AI Analysis” panel
- Uses the Web Speech API to speak the result aloud
- Users can say “Nova, repeat” to hear the last response again.
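To make the request flow concrete, here is a minimal sketch of what such a backend route could look like, assuming Flask, requests, and python-dotenv. The route name `/analyze`, the prompt wording, and the fallback model name are illustrative assumptions, not code taken from `app.py`.

```python
# Hypothetical sketch of the backend hand-off to Gemini (not the actual app.py code).
import os

import requests
from dotenv import load_dotenv  # assumes python-dotenv is used for .env loading
from flask import Flask, jsonify, request

load_dotenv()
app = Flask(__name__)

GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]
GEMINI_MODEL = os.getenv("GEMINI_MODEL", "gemini-1.5-flash")  # fallback model is a guess

# Task-specific prompts; the real prompts in app.py may differ.
PROMPTS = {
    "describe": "Describe this scene concisely for a blind user.",
    "read": "Read out any printed text visible in this image.",
    "identify": "Identify the main object in this image, including its type and color.",
}


@app.route("/analyze", methods=["POST"])  # hypothetical route name
def analyze():
    payload = request.get_json()
    task = payload.get("task", "describe")
    image_b64 = payload["image"]  # base64-encoded JPEG frame sent by the frontend

    url = (
        "https://generativelanguage.googleapis.com/v1beta/models/"
        f"{GEMINI_MODEL}:generateContent?key={GEMINI_API_KEY}"
    )
    body = {
        "contents": [{
            "parts": [
                {"text": PROMPTS.get(task, PROMPTS["describe"])},
                {"inline_data": {"mime_type": "image/jpeg", "data": image_b64}},
            ]
        }]
    }
    resp = requests.post(url, json=body, timeout=30)
    resp.raise_for_status()

    # Pull the model's text out of the generateContent response and return it as
    # JSON; the frontend displays it and speaks it via the Web Speech API.
    text = resp.json()["candidates"][0]["content"]["parts"][0]["text"]
    return jsonify({"result": text})
```

The frontend then writes the returned text into the “AI Analysis” panel and passes it to the Web Speech API for speech output, as described above.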
- Backend: Python, Flask
- AI: Google Gemini Vision API (via HTTP requests)
- Frontend: HTML, CSS, JavaScript
- Speech Input: Web Speech API (speech recognition)
- Speech Output: Web Speech API (text-to-speech)
- Environment: `.env`-based config, `requirements.txt` for dependencies
```
SeeSense-AI/
├── app.py               # Main Flask application and routes
├── requirements.txt     # Python dependencies
├── .env                 # Local environment variables (not committed)
├── static/
│   ├── index/
│   │   ├── index.html   # Main UI for the assistant
│   │   └── style.css    # Main styles
│   └── demo/
│       ├── demo.html    # Optional demo/experimental view
│       └── demo.css
└── venv/                # Local virtual environment (ignored in git)
```

You will need:

- Python 3.9+
- A Google Gemini API key
- Git
Clone the repository:

```
git clone https://github.com/hannahjan06/SeeSense-AI.git
cd SeeSense-AI
```

Create and activate a virtual environment:

```
python -m venv venv

# Windows
venv\Scripts\activate

# macOS / Linux
source venv/bin/activate
```

Install the dependencies:

```
pip install -r requirements.txt
```

Create a .env file in the project root:

```
GEMINI_API_KEY=your_api_key_here
GEMINI_MODEL=your_preferred_model_name   # optional, or change it in app.py
```

Run the app:

```
python app.py
```

By default, the app should be available at http://127.0.0.1:5000.

Open it in Chrome for the best support of the camera and speech APIs.
- Open the web page and allow camera and microphone access when prompted.
- Wait for Nova to indicate that it is listening.
- Use one of the voice commands:
  - “Nova, describe”
  - “Nova, read”
  - “Nova, identify”
  - “Nova, repeat”
- Listen to the spoken response. The same text also appears in the “AI Analysis” panel.
- If the lighting is too dark or too bright, Nova will inform you and suggest adjustments.
- Requires a stable internet connection for Gemini API access.
- Currently uses two different voices for system response and analysis output.
- Accuracy depends on camera quality, lighting and model performance.
- Works best in Chrome or Chromium-based browsers with full support for Web Speech and media APIs.
- Not a medical or safety-certified device; intended as a proof-of-concept assistive tool.
- Continuous “explore” mode with periodic scene updates
- Improved alignment guidance for framing objects and text
- Dedicated modes for:
  - Reading menus
  - Reading medicine labels
  - Locating specific objects
- Multi-language support for output
- Mobile-first layout and PWA packaging
- Option to swap Gemini with self-hosted open models for offline/edge use
This project started as a hackathon prototype. If you have ideas around accessibility, voice interaction, or multimodal AI and want to iterate on it:
- Fork the repo
- Create a feature branch
- Submit a pull request with a clear description of your changes
- Google Gemini for multimodal AI capabilities
- The broader accessibility community for continual advocacy and design principles
- Horizon Hacks for the theme “AI for Accessibility and Equity” that inspired this build