🧪 Running AIandMe Experiments - Contextual AI Pen-Testing

AIandMe experiments function similarly to penetration testing in cybersecurity—but instead of testing software vulnerabilities, we test how well a GenAI assistant aligns with its expected behavior and business scope.

Each experiment simulates adversarial interactions to evaluate how the AI assistant handles unexpected or potentially unsafe inputs.

How AIandMe Experiments Work

The AIandMe testing pipeline follows these structured steps:

Experiment Workflow:

Experiment Pipeline

Running an Experiment

Once your project is set up, you can start an experiment by following these steps:

Create an Experiment

Go to the Experiments page and click "Create Experiment".
Fill in the experiment details (name, description, etc.).
Select the model provider for LLM-as-a-Judge evaluations.
Configure the GenAI assistant integration for testing.
Click "Create" to launch the experiment.

Experiment Execution: Step-by-Step

Once started, the experiment runs automatically in the background, executing the following steps:

🔹 Step 1: Adversarial Data Generation

If no dataset exists for the current project scope, AIandMe auto-generates adversarial synthetic prompts.
These prompts simulate real-world edge cases and unexpected user interactions.

🔹 Step 2: AI Assistant Testing

Each adversarial prompt is sent to the GenAI assistant.
The response is evaluated against the expected business behavior.
AIandMe’s LLM-as-a-Judge assigns a pass/fail verdict based on predefined guidelines.

🔹 Step 3: Experiment Completion & Insights

AIandMe compiles a final report with:
🔍 Findings from the experiment
📊 Detailed logs for auditing & human review
🛠 Recommendations for improving AI robustness

Experiment Completion Overview:

🔗 Next Steps

⚖️ Understanding LLM-as-a-Judge
🔥 AIandMe Firewall: Safe AI Responses
⚙️ AIandMe Integration

💡 Need help? Check out FAQs or Join the AIandMe Community.

Creating a Project Understanding LLM-as-a-Judge