Agent-Based ML Platform

The Agent-Based ML Platform is an AI-driven backend that automates end-to-end machine learning workflows. It orchestrates dataset ingestion, feature engineering, model training, explainability, and pipeline creation — using LLM reasoning to decide and execute steps autonomously.

LLM Providers

The service is configured with API keys for two LLM providers:

| Provider | Use |
| --- | --- |
| Mistral | Primary reasoning: workflow planning, feature selection, code generation |
| Anthropic (Claude) | Secondary provider, used as a fallback or for specialized tasks |
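
One way to picture the primary/fallback split is the minimal sketch below. It is not the platform's actual routing code: the `Provider` type, `complete_with_fallback`, and both stub functions are hypothetical, and a real service would wrap the Mistral and Anthropic clients instead of the stand-ins used here.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: a provider is any callable that maps a prompt to a reply.
Provider = Callable[[str], str]


def complete_with_fallback(
    prompt: str, providers: Sequence[Tuple[str, Provider]]
) -> Tuple[str, str]:
    """Try each (name, provider) pair in order and return (name, reply)
    from the first one that does not raise."""
    last_error: Optional[Exception] = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise RuntimeError("all LLM providers failed") from last_error


# Stand-in callables (assumptions); real code would call the provider SDKs.
def mistral_stub(prompt: str) -> str:
    raise TimeoutError("simulated primary outage")


def claude_stub(prompt: str) -> str:
    return f"plan for: {prompt}"


used, reply = complete_with_fallback(
    "select features",
    [("mistral", mistral_stub), ("anthropic", claude_stub)],
)
print(used, "->", reply)
```

Ordering the list by priority keeps the policy in one place: swapping the primary provider means reordering the list, not changing the calling code.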

Data Integration

| Integration | Role |
| --- | --- |
| Trino | Queries data for ingestion and feature engineering; connects to the in-cluster Trino Gateway (`trino-gateway.trino.svc.cluster.local:8080`) |
| MLflow | Logs experiments (run metrics, parameters, and models) under a configured experiment name |
| Spark | Executes large-scale feature engineering and training jobs; connects via the Spark master URL |
| S3 | Reads and writes datasets and artifacts; access is granted through IRSA (IAM Roles for Service Accounts) with full S3 access |
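
The sketch below shows how these connection details might be gathered into one settings object. Only the Trino Gateway address comes from the table. The `IntegrationSettings` class, the `load_settings` helper, the environment variable names, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class IntegrationSettings:
    """Hypothetical container for the endpoints listed in the table."""
    trino_host: str
    trino_port: int
    mlflow_experiment: str
    spark_master_url: str


def load_settings(env: Mapping[str, str]) -> IntegrationSettings:
    # The variable names here are illustrative, not the platform's real keys.
    endpoint = env.get(
        "TRINO_ENDPOINT", "trino-gateway.trino.svc.cluster.local:8080"
    )
    # Split "host:port" on the last colon; assumes a port is always present.
    host, _, port = endpoint.rpartition(":")
    return IntegrationSettings(
        trino_host=host,
        trino_port=int(port),
        mlflow_experiment=env["MLFLOW_EXPERIMENT_NAME"],
        spark_master_url=env["SPARK_MASTER_URL"],
    )


settings = load_settings({
    "MLFLOW_EXPERIMENT_NAME": "example-experiment",   # placeholder value
    "SPARK_MASTER_URL": "spark://spark-master:7077",  # placeholder value
})
print(settings.trino_host, settings.trino_port)
```

S3 has no entry in this sketch because IRSA supplies credentials through the pod's service account rather than through explicit keys.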

Data Tier Thresholds

The platform classifies datasets by size and routes them to the appropriate compute path:

| Tier | Threshold | Compute |
| --- | --- | --- |
| Small | Below `small_data_threshold_mb` | In-process or lightweight execution |
| Large | From `small_data_threshold_mb` up to `massive_data_threshold_mb` | Spark distributed processing |

A separate setting, `max_rows_small_tier`, caps the number of rows used for small-tier sampling.
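
The routing rule can be sketched as follows. The three setting names come from this section. The default values, the `TierConfig` class, the function names, the boundary handling (strictly below for small, inclusive for large), and the "massive" label for anything above `massive_data_threshold_mb` are assumptions, since the section does not spell them out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierConfig:
    # Field names come from the section; the default values are placeholders.
    small_data_threshold_mb: float = 100.0
    massive_data_threshold_mb: float = 10_000.0
    max_rows_small_tier: int = 1_000_000


def classify_tier(size_mb: float, cfg: TierConfig) -> str:
    """Map a dataset size in MB to a tier name."""
    if size_mb < cfg.small_data_threshold_mb:
        return "small"    # in-process or lightweight execution
    if size_mb <= cfg.massive_data_threshold_mb:
        return "large"    # Spark distributed processing
    return "massive"      # assumed label; handling is not described here


def rows_to_sample(row_count: int, cfg: TierConfig) -> int:
    """Apply the small-tier row cap to a dataset's row count."""
    return min(row_count, cfg.max_rows_small_tier)


cfg = TierConfig()
for size in (50, 100, 10_000, 10_001):
    print(size, classify_tier(size, cfg))
print(rows_to_sample(5_000_000, cfg))
```

Keeping the thresholds in a single config object lets the tier boundaries be tuned per deployment without changing the routing logic.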

Go Deeper

MLflow — experiment tracking used by the platform
Spark Team — provides the Spark compute used for large-scale jobs
Trino — data access layer for ingestion and feature queries
