Agent-Based ML Platform

The Agent-Based ML Platform is an AI-driven backend that automates end-to-end machine learning workflows. It orchestrates dataset ingestion, feature engineering, model training, explainability, and pipeline creation, using LLM reasoning to decide which steps to run and then executing them autonomously.

LLM Providers

The service is configured with API keys for two LLM providers:

Provider | Use
Mistral | Primary reasoning: workflow planning, feature selection, code generation
Anthropic (Claude) | Secondary: fallback or specialized tasks
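
The primary/fallback split between the two providers can be pictured with a minimal Python sketch. It is not the platform's actual implementation: `complete_with_fallback`, `mistral_stub`, and `claude_stub` are hypothetical names, and the stubs stand in for real client calls so the example runs without API keys.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a provider is any callable that maps a prompt to text.
Provider = Callable[[str], str]


def complete_with_fallback(
    prompt: str, providers: Sequence[Tuple[str, Provider]]
) -> Tuple[str, str]:
    """Try each (name, provider) in order; return (name, response) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in callables (assumptions); real code would wrap the provider SDKs here.
def mistral_stub(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def claude_stub(prompt: str) -> str:
    return f"plan for: {prompt}"


used, answer = complete_with_fallback(
    "select features",
    [("mistral", mistral_stub), ("anthropic", claude_stub)],
)
print(used, "->", answer)
```

Keeping providers as an ordered list of plain callables means the order itself encodes which provider is primary, and a specialized task could pass a different order without changing the function.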

Data Integration

Integration | Role
Trino | Queries data for ingestion and feature engineering; connects to the in-cluster Trino Gateway (trino-gateway.trino.svc.cluster.local:8080)
MLflow | Logs experiments (run metrics, parameters, and models) under a configured experiment name
Spark | Executes large-scale feature engineering and training jobs; connects via the Spark master URL
S3 | Reads and writes datasets and artifacts via IRSA (IAM Roles for Service Accounts), with full S3 access
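
As a rough illustration of how these connection settings might be gathered in one place, here is a sketch of a settings object. The environment variable names (`TRINO_HOST`, `TRINO_PORT`, `MLFLOW_EXPERIMENT_NAME`, `SPARK_MASTER_URL`) and the `"default"` and `"local[*]"` fallbacks are assumptions made for this example; only the Trino Gateway host and port come from the table above. S3 has no entry because IRSA supplies credentials through the pod's service account rather than through configured keys.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class IntegrationSettings:
    trino_host: str
    trino_port: int
    mlflow_experiment_name: str
    spark_master_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> IntegrationSettings:
    """Build settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return IntegrationSettings(
        # Host and port default to the in-cluster Trino Gateway address.
        trino_host=env.get("TRINO_HOST", "trino-gateway.trino.svc.cluster.local"),
        trino_port=int(env.get("TRINO_PORT", "8080")),
        # Placeholder defaults below are assumptions, not platform values.
        mlflow_experiment_name=env.get("MLFLOW_EXPERIMENT_NAME", "default"),
        spark_master_url=env.get("SPARK_MASTER_URL", "local[*]"),
    )


settings = load_settings({})
print(f"{settings.trino_host}:{settings.trino_port}")
```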

Data Tier Thresholds

The platform classifies datasets by size and routes them to the appropriate compute path:

Tier | Threshold | Compute
Small | Below small_data_threshold_mb | In-process or lightweight execution
Large | Up to massive_data_threshold_mb | Spark distributed processing
Small row cap | max_rows_small_tier | Maximum row count for small-tier sampling
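
The routing rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's code: `Tier`, `TierConfig`, `classify`, and `small_tier_row_limit` are hypothetical names, and the threshold values are made up. The page does not say what happens above `massive_data_threshold_mb`, so the sketch raises an error there rather than guessing.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SMALL = "small"  # in-process or lightweight execution
    LARGE = "large"  # Spark distributed processing


@dataclass(frozen=True)
class TierConfig:
    small_data_threshold_mb: float
    massive_data_threshold_mb: float
    max_rows_small_tier: int


def classify(size_mb: float, cfg: TierConfig) -> Tier:
    """Route a dataset to a tier based on its size in MB."""
    if size_mb < cfg.small_data_threshold_mb:
        return Tier.SMALL
    if size_mb <= cfg.massive_data_threshold_mb:
        return Tier.LARGE
    # Behaviour beyond the massive threshold is not documented; fail loudly.
    raise ValueError(f"{size_mb} MB exceeds massive_data_threshold_mb")


def small_tier_row_limit(row_count: int, cfg: TierConfig) -> int:
    """Cap the number of rows sampled for small-tier processing."""
    return min(row_count, cfg.max_rows_small_tier)


# Illustrative values only.
cfg = TierConfig(
    small_data_threshold_mb=100,
    massive_data_threshold_mb=10_000,
    max_rows_small_tier=50_000,
)
print(classify(42, cfg), classify(2_500, cfg))
```

Treating the small threshold as exclusive ("below") and the massive threshold as inclusive ("up to") follows the wording in the table; the real boundary handling may differ.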

Go Deeper

  • MLflow — experiment tracking used by the platform
  • Spark Team — provides the Spark compute used for large-scale jobs
  • Trino — data access layer for ingestion and feature queries