Agent-Based ML Platform
The Agent-Based ML Platform is an AI-driven backend that automates end-to-end machine learning workflows. It orchestrates dataset ingestion, feature engineering, model training, explainability, and pipeline creation — using LLM reasoning to decide and execute steps autonomously.
LLM Providers
The service is configured with API keys for two LLM providers:
| Provider | Use |
|---|---|
| Mistral | Primary reasoning — workflow planning, feature selection, code generation |
| Anthropic (Claude) | Secondary provider, used as a fallback or for specialized tasks |
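The primary/fallback arrangement in the table can be pictured as an ordered list of providers tried in turn. The following is a minimal sketch of that idea, not the platform's actual routing code. The function `complete_with_fallback`, the `Provider` type, and the two stand-in callables are hypothetical; real callables would wrap the Mistral and Anthropic SDK clients.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical shape: a provider is a (name, callable) pair, where the
# callable takes a prompt and returns the model's text response.
Provider = Tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # real code should catch the SDK's specific errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all LLM providers failed: " + "; ".join(errors))


# Stand-ins for illustration only; they do not call any real API.
def fake_mistral(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def fake_claude(prompt: str) -> str:
    return f"plan for: {prompt}"


name, text = complete_with_fallback(
    "train a model",
    [("mistral", fake_mistral), ("anthropic", fake_claude)],
)
print(name, "->", text)
```

Ordering the list puts the primary provider first, so the secondary is only reached when the primary raises.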
Data Integration
| Integration | Role |
|---|---|
| Trino | Queries data for ingestion and feature engineering — connects to Trino Gateway in-cluster (trino-gateway.trino.svc.cluster.local:8080) |
| MLflow | Logs experiments — run metrics, parameters, and models under a configured experiment name |
| Spark | Executes large-scale feature engineering and training jobs — connects via the Spark master URL |
| S3 | Reads and writes datasets and artifacts, authenticating via IRSA (IAM Roles for Service Accounts) with full S3 access |
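As a rough illustration of how the connection details in the table might be gathered in one place, here is a sketch that reads them from environment variables. The variable names (`TRINO_HOST`, `TRINO_PORT`, `MLFLOW_EXPERIMENT_NAME`, `SPARK_MASTER_URL`) and the `IntegrationSettings` class are assumptions made for this example. Only the Trino Gateway host and port defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class IntegrationSettings:
    """Hypothetical container for the integration endpoints listed above."""
    trino_host: str
    trino_port: int
    mlflow_experiment: str
    spark_master_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> IntegrationSettings:
    """Build settings from a mapping (defaults to os.environ).

    The environment variable names are assumptions, not documented ones.
    """
    env = os.environ if env is None else env
    return IntegrationSettings(
        # Defaults taken from the in-cluster Trino Gateway address in the table.
        trino_host=env.get("TRINO_HOST", "trino-gateway.trino.svc.cluster.local"),
        trino_port=int(env.get("TRINO_PORT", "8080")),
        # No defaults are documented for these, so a missing value raises KeyError.
        mlflow_experiment=env["MLFLOW_EXPERIMENT_NAME"],
        spark_master_url=env["SPARK_MASTER_URL"],
    )


settings = load_settings({
    "MLFLOW_EXPERIMENT_NAME": "demo-experiment",       # illustrative value
    "SPARK_MASTER_URL": "spark://spark-master:7077",   # illustrative value
})
print(settings)
```

S3 has no entry here because IRSA supplies credentials through the pod's service account, so no access keys need to be configured in the application.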
Data Tier Thresholds
The platform classifies datasets by size and routes them to the appropriate compute path:
| Tier | Threshold | Compute |
|---|---|---|
| Small | Below small_data_threshold_mb | In-process or lightweight execution |
| Large | From small_data_threshold_mb up to massive_data_threshold_mb | Spark distributed processing |
| Small row cap | max_rows_small_tier | Maximum row count for small-tier sampling |
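The routing rule in the table can be sketched as a small function. This is an illustration under stated assumptions, not the platform's implementation. It assumes the Large tier covers sizes from small_data_threshold_mb up to and including massive_data_threshold_mb. The table does not say what happens above that limit, so the sketch raises an error there. The `TierConfig` class, both function names, and the numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierConfig:
    """Field names mirror the settings in the table; the class itself is hypothetical."""
    small_data_threshold_mb: float
    massive_data_threshold_mb: float
    max_rows_small_tier: int


def classify_tier(size_mb: float, cfg: TierConfig) -> str:
    """Return "small" or "large" for a dataset of the given size in MB."""
    if size_mb < cfg.small_data_threshold_mb:
        return "small"   # in-process or lightweight execution
    if size_mb <= cfg.massive_data_threshold_mb:
        return "large"   # Spark distributed processing
    # Assumption: behavior above the massive threshold is not documented.
    raise ValueError(f"{size_mb} MB exceeds massive_data_threshold_mb")


def small_tier_sample_rows(row_count: int, cfg: TierConfig) -> int:
    """Cap the number of rows sampled for the small tier."""
    return min(row_count, cfg.max_rows_small_tier)


# Illustrative values only; the real thresholds are deployment configuration.
cfg = TierConfig(
    small_data_threshold_mb=100,
    massive_data_threshold_mb=10_000,
    max_rows_small_tier=1_000_000,
)
print(classify_tier(50, cfg))
print(classify_tier(5_000, cfg))
print(small_tier_sample_rows(2_500_000, cfg))
```

Keeping the thresholds in one configuration object means the routing logic does not change when a deployment tunes the cut-offs.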
Go Deeper
- MLflow — experiment tracking used by the platform
- Spark Team — provides the Spark compute used for large-scale jobs
- Trino — data access layer for ingestion and feature queries