76 changes: 9 additions & 67 deletions .dockerignore
@@ -1,76 +1,18 @@
# Version control
.git
.gitignore
.gitattributes

# Python
__pycache__
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
*.so
*.egg
*.egg-info
dist
build

# Virtual Environment
venv/
env/
ENV/
.venv/

# IDE
.vscode/
.idea/
*.swp
*.swo
*~
.DS_Store

# Documentation
README.md
LICENSE
*.egg-info/
dist/
build/
.git/
.gitignore
*.md
!README.md
docs/

# Docker
Dockerfile
docker-compose.yml
docker-compose.override.yml
.dockerignore

# CI/CD
.github/
.gitlab-ci.yml
.travis.yml
Jenkinsfile

# Testing
tests/
test/
*.test.py
.pytest_cache/
.coverage
htmlcov/
.tox/

# Logs
*.log
logs/

# Temporary files
tmp/
temp/
*.tmp
*.temp

# OS files
*.png
LICENSE
.DS_Store
Thumbs.db

# Environment files
.env
.env.local
.env.*.local
185 changes: 45 additions & 140 deletions README.md
@@ -1,9 +1,8 @@
<div align="center">

# GPU Hot
### **Real-time NVIDIA GPU Monitoring Dashboard**

Monitor NVIDIA GPUs from any browser. No SSH, no configuration – just start and view in real-time.
Real-time NVIDIA GPU monitoring dashboard. Web-based, no SSH required.

[![Python](https://img.shields.io/badge/Python-3.8+-3776AB?style=flat-square&logo=python&logoColor=white)](https://www.python.org/)
[![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?style=flat-square&logo=docker&logoColor=white)](https://www.docker.com/)
@@ -14,108 +13,69 @@ Monitor NVIDIA GPUs from any browser. No SSH, no configuration – just start and view in real-time.

</div>

---

## Quick Start
## Usage

### Docker (recommended)
Monitor a single machine or an entire cluster with the same Docker image.

**Single machine:**
```bash
docker run -d --name gpu-hot --gpus all -p 1312:1312 ghcr.io/psalias2006/gpu-hot:latest
docker run -d --gpus all -p 1312:1312 ghcr.io/psalias2006/gpu-hot:latest
```

**Force nvidia-smi mode (for older GPUs):**
**Multiple machines:**
```bash
docker run -d --name gpu-hot --gpus all -p 1312:1312 -e NVIDIA_SMI=true ghcr.io/psalias2006/gpu-hot:latest
# On each GPU server
docker run -d --gpus all -p 1312:1312 -e NODE_NAME=$(hostname) ghcr.io/psalias2006/gpu-hot:latest

# On a hub machine (no GPU required)
docker run -d -p 1312:1312 -e GPU_HOT_MODE=hub -e NODE_URLS=http://server1:1312,http://server2:1312,http://server3:1312 ghcr.io/psalias2006/gpu-hot:latest
```

Open `http://localhost:1312`

### From source
**Older GPUs:** Add `-e NVIDIA_SMI=true` if metrics don't appear.

**From source:**
```bash
git clone https://github.com/psalias2006/gpu-hot
cd gpu-hot
docker-compose up --build
```

### Local dev

```bash
pip install -r requirements.txt
python app.py
```

**Requirements:** Docker + NVIDIA Container Toolkit ([install guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html))
**Requirements:** Docker + [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

---

## Features

**Sub-Second Updates:**
- **Lightning-fast refresh rates**
- Historical data tracking
- WebSocket real-time streaming

**Charts:**
- Utilization, Temperature, Memory, Power
- Fan Speed, Clock Speeds, Power Efficiency

**Monitoring:**
- Multi-GPU detection
- Process tracking (PID, memory usage)
- System CPU/RAM
- WebSocket real-time updates

**Metrics:**
- GPU & Memory Utilization (%)
- Temperature (GPU core, memory)
- Memory (used/free/total)
- Power draw & limits
- Fan Speed (%)
- Clock Speeds (graphics, SM, memory, video)
- PCIe Gen & width
- Performance State (P-State)
- Compute Mode
- Encoder/Decoder sessions
- Throttle status
- Real-time metrics (sub-second)
- Automatic multi-GPU detection
- Process monitoring (PID, memory usage)
- Historical charts (utilization, temperature, power, clocks)
- System metrics (CPU, RAM)
- Scale from 1 to 100+ GPUs

**Metrics:** Utilization, temperature, memory, power draw, fan speed, clock speeds, PCIe info, P-State, throttle status, encoder/decoder sessions

---

## Configuration

Optional. Edit `core/config.py`:

```python
UPDATE_INTERVAL = 0.5 # NVML polling interval (fast)
NVIDIA_SMI_INTERVAL = 2.0 # nvidia-smi polling interval (slower to reduce overhead)
PORT = 1312 # Web server port
DEBUG = False
```

Environment variables:
**Environment variables:**
```bash
NVIDIA_VISIBLE_DEVICES=0,1 # Specific GPUs (default: all)
NVIDIA_SMI=true # Force nvidia-smi mode for all GPUs
```

**nvidia-smi Fallback:**
- Automatically detects GPUs that don't support NVML utilization metrics
- Falls back to nvidia-smi for those GPUs
- Compatible with older GPUs (Quadro P1000, Tesla, etc.)

**Force nvidia-smi for all GPUs:**
- Docker: `docker run -e NVIDIA_SMI=true ...`
- Config: Set `NVIDIA_SMI = True` in `core/config.py`

Frontend tuning in `static/js/socket-handlers.js`:
```javascript
DOM_UPDATE_INTERVAL = 1000 // Text updates frequency (ms)
SCROLL_PAUSE_DURATION = 100 // Scroll optimization (ms)
NVIDIA_SMI=true # Force nvidia-smi mode for older GPUs
GPU_HOT_MODE=hub # Set to 'hub' for multi-node aggregation (default: single node)
NODE_NAME=gpu-server-1 # Node display name (default: hostname)
NODE_URLS=http://host:1312... # Comma-separated node URLs (required for hub mode)
```

Chart history in `static/js/charts.js`:
```javascript
if (data.labels.length > 120) // Data points to keep
**Backend (`core/config.py`):**
```python
UPDATE_INTERVAL = 0.5 # Polling interval
PORT = 1312 # Server port
```

---
@@ -131,41 +91,10 @@ GET /api/gpu-data # JSON metrics
### WebSocket
```javascript
socket.on('gpu_data', (data) => {
// Updates every 0.5s
// data.gpus, data.processes, data.system
// Updates every 0.5s (configurable)
// Contains: data.gpus, data.processes, data.system
});
```
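
The same data is also available over plain HTTP for scripting. A minimal polling sketch in Python — the endpoint is the `/api/gpu-data` route above; the `requests` dependency and the exact response fields are assumptions, so inspect the payload before relying on specific keys:

```python
# Minimal polling sketch. The response shape is assumed to mirror the
# WebSocket payload described above (gpus / processes / system).
import time
import requests  # third-party: pip install requests

URL = "http://localhost:1312/api/gpu-data"  # adjust host/port for your setup

while True:
    data = requests.get(URL, timeout=5).json()
    print(sorted(data.keys()))   # inspect the actual top-level fields
    print(data.get("gpus"))      # per-GPU metrics, if present
    time.sleep(2)                # poll less often than the server's 0.5s interval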

---

## Extending

Add new metrics:

**Backend (`core/metrics/collector.py`):**
```python
# Add NVML query
value = pynvml.nvmlDeviceGetYourMetric(handle)
gpu_data['your_metric'] = value
```
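
As a concrete illustration (not part of the current collector), PCIe throughput could be wired in the same way. The metric key `pcie_tx_kbps` and the placement are hypothetical; `nvmlDeviceGetPcieThroughput` is a standard pynvml call:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # collector holds one handle per GPU
gpu_data = {}

# Hypothetical extra metric: PCIe TX throughput (KB/s).
# The key name is illustrative; choose whatever the frontend expects.
gpu_data['pcie_tx_kbps'] = pynvml.nvmlDeviceGetPcieThroughput(
    handle, pynvml.NVML_PCIE_UTIL_TX_BYTES
)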

**Frontend (`static/js/gpu-cards.js`):**
```javascript
// Add to card template
<div class="metric-value" id="your-metric-${gpuId}">
${gpuInfo.your_metric}
</div>

// Add to update function
if (yourMetricEl) yourMetricEl.textContent = gpuInfo.your_metric;
```

**Chart (optional):**
```javascript
// static/js/charts.js
chartConfigs.yourMetric = { type: 'line', ... };
```

---

## Project Structure
@@ -196,53 +125,29 @@ gpu-hot/

---

## Performance

Frontend uses `requestAnimationFrame` batching to minimize reflows. Scroll detection pauses DOM updates during scrolling.

For heavy workloads or many GPUs, increase update intervals in `core/config.py`.

---

## Troubleshooting

**GPU not detected:**
**No GPUs detected:**
```bash
# Verify drivers
nvidia-smi

# Test Docker GPU access
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Restart Docker
sudo systemctl restart docker
nvidia-smi # Verify drivers work
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi # Test Docker GPU access
```

**Performance issues:**
- Increase `UPDATE_INTERVAL` in `core/config.py`
- Reduce chart history in `static/js/charts.js`
- Check browser console for errors

**Debug mode:**
```python
# core/config.py
DEBUG = True
**Hub can't connect to nodes:**
```bash
curl http://node-ip:1312/api/gpu-data # Test connectivity
sudo ufw allow 1312/tcp # Check firewall
```
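
To check every node a hub is configured for in one pass, a small Python sketch along these lines can help — the node URLs below are placeholders (use the same list you pass in `NODE_URLS`), and it simply hits the same `/api/gpu-data` endpoint as the curl test above:

```python
# Quick reachability check for each node URL a hub would aggregate.
import requests  # third-party: pip install requests

nodes = ["http://server1:1312", "http://server2:1312"]  # placeholders

for node in nodes:
    try:
        r = requests.get(f"{node}/api/gpu-data", timeout=5)
        print(f"{node}: HTTP {r.status_code}")
    except requests.RequestException as exc:
        print(f"{node}: unreachable ({exc})")
```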

**Performance issues:** Increase `UPDATE_INTERVAL` in `core/config.py`

---

## Contributing

PRs welcome. For major changes, open an issue first.
PRs welcome. Open an issue for major changes.

## License

MIT - see [LICENSE](LICENSE)

---

<div align="center">

[Report Bug](https://github.com/psalias2006/gpu-hot/issues) • [Request Feature](https://github.com/psalias2006/gpu-hot/issues)

</div>
