πŸ§ͺ Experiments

ai+me experiments are comprehensive AI security tests that simulate real-world adversarial attacks to evaluate your AI system's robustness, safety, and compliance. Think of them as automated penetration testing for AI applications: they systematically probe your AI's boundaries and surface vulnerabilities before attackers can exploit them.

🎯 What are ai+me Experiments?

ai+me experiments function much like penetration testing in cybersecurity, but instead of probing software vulnerabilities, they test how well a GenAI assistant aligns with its expected behavior and business scope. Each experiment simulates adversarial interactions to evaluate how the assistant handles unexpected or potentially unsafe inputs.

πŸ” Key Concepts

Adversarial Testing

  • Purpose: Identify vulnerabilities in AI systems through systematic testing
  • Method: Generate and execute adversarial prompts to test AI responses
  • Goal: Find weaknesses before attackers do

Behavioral QA Testing

  • Purpose: Understand how users interact with your AI system
  • Method: Test AI responses against expected behaviors and use cases
  • Goal: Ensure AI performs as intended in real-world scenarios

LLM-as-a-Judge

  • Purpose: Use AI to evaluate AI responses for safety and accuracy
  • Method: Automated evaluation of AI responses against predefined criteria
  • Goal: Consistent, scalable assessment of AI behavior

πŸ—οΈ How ai+me Experiments Work

The ai+me testing pipeline follows these structured steps:

πŸ“Š Experiment Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Test Data     β”‚    β”‚   AI System     β”‚    β”‚   Evaluation    β”‚
β”‚   Generation    │───▢│   Under Test    │───▢│   Engine        β”‚
β”‚                 β”‚    β”‚                 β”‚    β”‚                 β”‚
β”‚ β€’ Adversarial   β”‚    β”‚ β€’ Your AI       β”‚    β”‚ β€’ LLM-as-Judge  β”‚
β”‚   Prompts       β”‚    β”‚   Application   β”‚    β”‚ β€’ Safety        β”‚
β”‚ β€’ Edge Cases    β”‚    β”‚ β€’ API Endpoint  β”‚    β”‚   Criteria      β”‚
β”‚ β€’ Real-world    β”‚    β”‚ β€’ Integration   β”‚    β”‚ β€’ Rules         β”‚
β”‚   Scenarios     β”‚    β”‚   Points        β”‚    β”‚                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                       β”‚   Results &     β”‚
                       β”‚   Analytics     β”‚
                       β”‚                 β”‚
                       β”‚ β€’ Pass/Fail     β”‚
                       β”‚   Reports       β”‚
                       β”‚ β€’ Vulnerability β”‚
                       β”‚   Analysis      β”‚
                       β”‚ β€’ Recommenda-   β”‚
                       β”‚   tions         β”‚
                       β”‚ β€’ Performance   β”‚
                       β”‚   Metrics       β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸš€ Creating Your First Experiment

Step 1: Access Experiment Creation

  1. Navigate to your Project Dashboard
  2. Go to the Experiments tab
  3. Click "Create Experiment" button
  4. Choose your experiment type:
    • Adversarial Testing: Security-focused testing
    • Behavioral QA: User interaction testing
    • Custom Presets: Pre-configured test scenarios

Step 2: Basic Experiment Configuration

Experiment Information

  • Name: Choose a descriptive name (e.g., "Security Test - Customer Support")
  • Description: Explain what you're testing and why
  • Language: Select the primary language for testing
  • Model Provider: Choose your configured AI model provider

Experiment Types

Adversarial Testing
  • OWASP LLM Top 10: Tests against the OWASP LLM Top 10 vulnerabilities
  • OWASP Agentic: Tests against the OWASP Agentic threats
  • Adaptive Testing: Multi-turn conversation testing for complex attacks
Behavioral QA Testing
  • User Interaction: Tests how users interact with your AI
  • Functional Testing: Validates AI responses against expected behaviors
  • Edge Case Testing: Tests boundary conditions and unusual inputs

Step 3: Advanced Configuration

Conversation Turn Types

Single-Turn Testing
  • Purpose: Test individual prompts and responses
  • Use Case: Simple Q&A scenarios, basic functionality testing
  • Advantage: Fast execution, clear pass/fail results
  • Best For: Initial testing, quick validation
Multi-Turn Testing
  • Purpose: Simulate full conversations with back-and-forth interaction
  • Use Case: Complex scenarios, conversational AI testing
  • Advantage: More realistic testing, catches context-based vulnerabilities
  • Best For: Advanced testing, conversational AI systems

Testing Levels

Quick (~500 tests)
  • Duration: 5-15 minutes
  • Coverage: Basic security validation
  • Use Case: Initial testing, rapid feedback
  • Best For: Development phase, quick validation
Thorough (~1200 tests)
  • Duration: 15-30 minutes
  • Coverage: Balanced security and performance testing
  • Use Case: Pre-production testing, comprehensive validation
  • Best For: Most production scenarios
Comprehensive (~2000+ tests)
  • Duration: 30-60 minutes
  • Coverage: Deep security analysis, edge case testing
  • Use Case: Critical systems, compliance requirements
  • Best For: High-security applications, regulatory compliance

Step 4: Integration Configuration

API Endpoint Setup

Configure how ai+me connects to your AI system:

Thread Initialization (Multi-turn only)
  • Endpoint URL: API endpoint for starting conversations
  • Headers: Authentication and configuration headers (JSON format)
  • Payload: Request body for conversation initialization (JSON format)
Chat Completion
  • Endpoint URL: API endpoint for sending messages
  • Headers: Authentication and configuration headers (JSON format)
  • Payload: Request body format for message sending (JSON format)
  • Streaming: Enable/disable real-time response streaming

Authentication Configuration

  • API Keys: Secure storage of authentication credentials
  • Headers: Custom headers for authentication
  • Payload Authentication: Token-based or session-based auth

Step 5: Launch Your Experiment

Click "Create Experiment" to start testing. Your experiment will:

  1. Initialize: Set up the testing environment
  2. Generate Tests: Create contextual test scenarios
  3. Execute Tests: Run prompts against your AI
  4. Analyze Results: Evaluate responses for issues
  5. Generate Report: Compile findings and insights
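
Conceptually, steps 2-5 reduce to a simple loop. The sketch below uses hypothetical names (it is not the ai+me API) to show how generation, execution, evaluation, and reporting fit together:

```python
# Conceptual experiment loop: generate tests, execute them against the
# target, evaluate each response, and summarize. All names are hypothetical.

def run_experiment(generate_tests, target, evaluate):
    results = []
    for prompt in generate_tests():                     # 2. Generate Tests
        try:
            response = target(prompt)                   # 3. Execute Tests
            results.append(evaluate(prompt, response))  # 4. Analyze Results
        except Exception:
            results.append("error")
    passed = results.count("pass")                      # 5. Generate Report
    return {"total": len(results),
            "pass_rate": round(100 * passed / max(len(results), 1), 1)}

report = run_experiment(
    generate_tests=lambda: ["safe question", "prompt injection attempt"],
    target=lambda p: ("Sure, ignoring my rules!" if "injection" in p
                      else "I can help with that."),
    evaluate=lambda p, r: "fail" if "ignoring my rules" in r.lower() else "pass",
)
print(report)  # {'total': 2, 'pass_rate': 50.0}
```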

⏱️ Expected Duration: 5-60 minutes depending on testing level and model response time.

πŸ“Š View Experiment Results

Once your experiment completes, explore the comprehensive results to understand your AI's security posture and identify potential vulnerabilities.

Access Your Results

  1. Go to your project's Experiments page
  2. Find your completed experiment (status will show "Finished")
  3. Click on the experiment name to view detailed results

Overview Dashboard

The Overview tab provides a comprehensive summary with key insights:

πŸ“ˆ Performance Metrics Dashboard

Core Metrics:

  • Total Performance Index (TPI): Comprehensive performance score (0-100)
  • Reliability Score: Statistical confidence in test results (90%+ is high confidence)
  • Fail Impact: Assessment of the severity and potential impact of failed tests
  • Pass Rate: Percentage of tests your AI handled correctly (with risk level indicators)
  • Error Rate: Percentage of tests that resulted in technical errors

Metrics are color-coded by risk level:

  • 🟒 Green: Excellent performance (Pass Rate β‰₯95%, Error Rate ≀5%)
  • πŸ”΅ Blue: Good performance (Pass Rate 85-94%, Error Rate 6-15%)
  • 🟠 Orange: Fair performance (Pass Rate 70-84%, Error Rate 16-30%)
  • πŸ”΄ Red: Poor performance (Pass Rate < 70%, Error Rate >30%)

πŸ“Š Test Results by Category

View detailed breakdown by security category:

  • Risk Category (Threat): Specific vulnerability type tested
  • Risk Level: Risk assessment (High, Medium, Low)
  • Failed Tests: Number of tests that failed in each risk category
  • Security Framework Mapping: Mappings to security frameworks like the OWASP LLM Top 10

πŸ’‘ AI-Powered Insights

  • Security Insights: AI-generated analysis of vulnerabilities found
  • Severity Assessment: Risk levels with detailed explanations
  • Pattern Recognition: Common attack vectors that succeeded

Detailed Logs Analysis

The Logs tab provides granular test-by-test examination:

πŸ” Advanced Filtering System

Filter by Result:

  • Pass: Tests your AI handled correctly
  • Fail: Tests where vulnerabilities were detected
  • Error: Tests with technical issues

Filter by Categories:

  • Risk Categories: Prompt injection, data leakage, scope violations
  • Data Strategy Categories: Test creation strategies and approaches

Additional Filters:

  • Representatives Only: Show only representative test cases
  • Search Functionality: Find specific prompts or responses

πŸ“‹ Individual Test Analysis

Click on any test row to see comprehensive details in the resizable detail pane:

πŸ“ Basic Information:

  • Test ID: Unique identifier for tracking
  • Created/Updated Timestamps: When the test was executed
  • Result Badge: Pass/Fail status with color coding

πŸ”¬ Detailed Evaluation:

  • Result: Pass, Fail, or Error with severity indicators
  • Data Strategy: How the test was generated (for custom QA experiments)
  • Severity Level: High, Medium, Low risk assessment (for failed tests)
  • Risk Category: Specific vulnerability type identified
  • AI Explanation: Detailed reasoning why the test passed or failed
  • Conversation Flow:: Full conversation between test prompt and AI response

Understanding Your Results

βœ… Passed Tests (Green)

  • Meaning: Your AI handled the scenario correctly and securely
  • Security Status: No vulnerabilities detected for this test case
  • Action: Document as acceptable behavior pattern
  • Confidence: High reliability when pass rate is β‰₯95%

❌ Failed Tests (Red)

  • Meaning: Potential security vulnerability or inappropriate response detected
  • Risk Levels:
    • High Severity: Critical security issues requiring immediate attention
    • Medium Severity: Important issues that should be addressed
    • Low Severity: Minor concerns for future consideration
  • Action Required: Review prompt, response, and AI explanation
  • Next Steps: Implement fixes based on specific recommendations

⚠️ Error Tests (Gray)

  • Meaning: Technical issues during test execution
  • Common Causes:
    • API connection timeouts
    • Invalid responses from your AI system
    • Configuration or authentication problems
  • Action: Check integration settings and API connectivity
  • Impact: High error rates (>30%) indicate system issues