midscene/apps/site/docs/en/API.mdx

# API Reference


> In the documentation below, you might see function calls prefixed with `agent.`. If you utilize destructuring in Playwright (e.g., `async ({ ai, aiQuery }) => { /* ... */ }`), you can call these functions without the `agent.` prefix. This is merely a syntactical difference.

## Constructors

Each Agent in Midscene has its own constructor.

* In Puppeteer, use [PuppeteerAgent](./integrate-with-puppeteer)
* In Bridge Mode, use [AgentOverChromeBridge](./bridge-mode-by-chrome-extension#constructor)

These Agents share some common constructor parameters:

* `generateReport: boolean`: If true, a report file will be generated. (Default: true)
* `autoPrintReportMsg: boolean`: If true, report messages will be printed. (Default: true)
* `cacheId: string | undefined`: If provided, this cacheId will be used to save or match the cache. (Default: undefined, means cache feature is disabled)
* `actionContext: string`: Some background knowledge that should be sent to the AI model when calling `agent.aiAction()`, like 'close the cookie consent dialog first if it exists' (Default: undefined)

In Puppeteer, there are 3 additional parameters:

* `forceSameTabNavigation: boolean`: If true, page navigation is restricted to the current tab. (Default: true)
* `waitForNetworkIdleTimeout: number`: The timeout for waiting for network idle between each action. (Default: 2000ms, set to 0 to disable the timeout)
* `waitForNavigationTimeout: number`: The timeout for waiting for navigation finished. (Default: 5000ms, set to 0 to disable the timeout)

## Interaction Methods

Below are the main APIs available for the various Agents in Midscene.

:::info Auto Planning v.s. Instant Action

In Midscene, you can choose to use either auto planning or instant action.

* `agent.ai()` is for Auto Planning: Midscene will automatically plan the steps and execute them. It's more smart and looks like more fashionable style for AI agents. But it may be slower and heavily rely on the quality of the AI model.
* `agent.aiTap()`, `agent.aiHover()`, `agent.aiInput()`, `agent.aiKeyboardPress()`, `agent.aiScroll()` are for Instant Action: Midscene will directly perform the specified action, while the AI model is responsible for basic tasks such as locating elements. It's faster and more reliable if you are certain about the action you want to perform.

:::

### `agent.aiAction()` or `.ai()`

This method allows you to perform a series of UI actions described in natural language. Midscene automatically plans the steps and executes them.

* Type

```typescript
function aiAction(prompt: string, options?: {
  cacheable?: boolean;
}): Promise<void>;
function ai(prompt: string): Promise<void>; // shorthand form
```

* Parameters:
  * `prompt: string` - A natural language description of the UI steps.
  * `options?: Object` - Optional, a configuration object containing:
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](./caching.mdx). True by default.

* Return Value:
  * Returns a Promise that resolves to void when all steps are completed; if execution fails, an error is thrown.

* Examples:

```typescript
// Basic usage
await agent.aiAction('Type "JavaScript" into the search box, then click the search button');

// Using the shorthand .ai form
await agent.ai('Click the login button at the top of the page, then enter "test@example.com" in the username field');

// When using UI Agent models like ui-tars, you can try a more high-level prompt
await agent.aiAction('Post a Tweet "Hello World"');
```

:::tip

Under the hood, Midscene uses AI model to split the instruction into a series of steps (a.k.a. "Planning"). It then executes these steps sequentially. If Midscene determines that the actions cannot be performed, an error will be thrown. 

For optimal results, please provide clear and detailed instructions for `agent.aiAction()`. For guides about writing prompts, you may read this doc: [Tips for Writing Prompts](./prompting-tips).

Related Documentation:
* [Choose a model](./choose-a-model)

:::

### `agent.aiTap()`

Tap something.

* Type

```typescript
function aiTap(locate: string, options?: Object): Promise<void>;
```

* Parameters:
  * `locate: string` - A natural language description of the element to tap.
  * `options?: Object` - Optional, a configuration object containing:
    * `deepThink?: boolean` - If true, Midscene will call AI model twice to precisely locate the element. 
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](./caching.mdx). True by default.

* Return Value:
  * Returns a `Promise<void>`

* Examples:

```typescript
await agent.aiTap('The login button at the top of the page');

// Use deepThink feature to precisely locate the element
await agent.aiTap('The login button at the top of the page', { deepThink: true });
```

### `agent.aiHover()`

Move mouse over something.

* Type

```typescript
function aiHover(locate: string, options?: Object): Promise<void>;
```

* Parameters:
  * `locate: string` - A natural language description of the element to hover over.
  * `options?: Object` - Optional, a configuration object containing:
    * `deepThink?: boolean` - If true, Midscene will call AI model twice to precisely locate the element. 
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](./caching.mdx). True by default.

* Return Value:
  * Returns a `Promise<void>`

* Examples:

```typescript
await agent.aiHover('The version number of the current page');
```

### `agent.aiInput()`

Input text into something.

* Type

```typescript
function aiInput(text: string, locate: string, options?: Object): Promise<void>;
```

* Parameters:
  * `text: string` - The final text content that should be placed in the input element. Use blank string to clear the input.
  * `locate: string` - A natural language description of the element to input text into.
  * `options?: Object` - Optional, a configuration object containing:
    * `deepThink?: boolean` - If true, Midscene will call AI model twice to precisely locate the element. 
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](./caching.mdx). True by default.

* Return Value:
  * Returns a `Promise<void>`

* Examples:

```typescript
await agent.aiInput('Hello World', 'The search input box');
```

### `agent.aiKeyboardPress()`

Press a keyboard key.

* Type

```typescript
function aiKeyboardPress(key: string, locate?: string, options?: Object): Promise<void>;
```

* Parameters:
  * `key: string` - The web key to press, e.g. 'Enter', 'Tab', 'Escape', etc. Key Combination is not supported. 
  * `locate?: string` - Optional, a natural language description of the element to press the key on.
  * `options?: Object` - Optional, a configuration object containing:
    * `deepThink?: boolean` - If true, Midscene will call AI model twice to precisely locate the element. 
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](./caching.mdx). True by default.

* Return Value:
  * Returns a `Promise<void>`

* Examples:

```typescript
await agent.aiKeyboardPress('Enter', 'The search input box');
```

### `agent.aiScroll()`

Scroll a page or an element.

* Type

```typescript
function aiScroll(scrollParam: PlanningActionParamScroll, locate?: string, options?: Object): Promise<void>;
```

* Parameters:
  * `scrollParam: PlanningActionParamScroll` - The scroll parameter
    * `direction: 'up' | 'down' | 'left' | 'right'` - The direction to scroll.
    * `scrollType: 'once' | 'untilBottom' | 'untilTop' | 'untilRight' | 'untilLeft'` - Optional, the type of scroll to perform.
    * `distance: number` - Optional, the distance to scroll in px.
  * `locate?: string` - Optional, a natural language description of the element to scroll on. If not provided, Midscene will perform scroll on the current mouse position.
  * `options?: Object` - Optional, a configuration object containing:
    * `deepThink?: boolean` - If true, Midscene will call AI model twice to precisely locate the element. 
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](./caching.mdx). True by default.

* Return Value:
  * Returns a `Promise<void>`

* Examples:

```typescript
await agent.aiScroll({ direction: 'up', distance: 100, scrollType: 'once' }, 'The form panel');
```

:::tip About the `deepThink` feature

The `deepThink` feature is a powerful feature that allows Midscene to call AI model twice to precisely locate the element. It is useful when the AI model find it hard to distinguish the element from its surroundings.

:::

## Data Extraction

### `agent.aiQuery()`

This method allows you to extract data directly from the UI using multimodal AI reasoning capabilities. Simply define the expected format (e.g., string, number, JSON, or an array) in the `dataDemand`, and Midscene will return a result that matches the format.

* Type

```typescript
function aiQuery<T>(dataShape: string | Object): Promise<T>;
```

* Parameters:
  * `dataShape: T`: A description of the expected return format.

* Return Value:
  * Returns any valid basic type, such as string, number, JSON, array, etc.
  * Just describe the format in `dataDemand`, and Midscene will return a matching result.

* Examples:

```typescript
const dataA = await agent.aiQuery({
  time: 'The date and time displayed in the top-left corner as a string',
  userInfo: 'User information in the format {name: string}',
  tableFields: 'An array of table field names, string[]',
  tableDataRecord: 'Table records in the format {id: string, [fieldName]: string}[]',
});

// You can also describe the expected return format using a string:

// dataB will be an array of strings
const dataB = await agent.aiQuery('string[], list of task names');

// dataC will be an array of objects
const dataC = await agent.aiQuery('{name: string, age: string}[], table data records');
```

### `agent.aiBoolean()`

Extract a boolean value from the UI.  

* Type

```typescript
function aiBoolean(prompt: string): Promise<boolean>;
```

* Parameters:
  * `prompt: string` - A natural language description of the expected value.

* Return Value:
  * Returns a `Promise<boolean>` when AI returns a boolean value.

* Examples:

```typescript
const bool = await agent.aiBoolean('Whether there is a login dialog');
```

### `agent.aiNumber()`

Extract a number value from the UI.

* Type

```typescript
function aiNumber(prompt: string): Promise<number>;
```

* Parameters:
  * `prompt: string` - A natural language description of the expected value.

* Return Value:
  * Returns a `Promise<number>` when AI returns a number value.

* Examples:

```typescript
const number = await agent.aiNumber('The remaining points of the account');
```

### `agent.aiString()`

Extract a string value from the UI.

* Type

```typescript
function aiString(prompt: string): Promise<string>;
```

* Parameters:
  * `prompt: string` - A natural language description of the expected value.

* Return Value:
  * Returns a `Promise<string>` when AI returns a string value.

* Examples:

```typescript
const string = await agent.aiString('The first item in the list');
```

## More APIs

### `agent.aiAssert()`

Specify an assertion in natural language, and the AI determines whether the condition is true. If the assertion fails, the SDK throws an error that includes both the optional `errorMsg` and a detailed reason generated by the AI.

* Type

```typescript
function aiAssert(assertion: string, errorMsg?: string): Promise<void>;
```

* Parameters:
  * `assertion: string` - The assertion described in natural language.
  * `errorMsg?: string` - An optional error message to append if the assertion fails.

* Return Value:
  * Returns a Promise that resolves to void if the assertion passes; if it fails, an error is thrown with `errorMsg` and additional AI-provided information.

* Example:
```typescript
await agent.aiAssert('The price of "Sauce Labs Onesie" is 7.99');
```

:::tip
Assertions are critical in test scripts. To reduce the risk of errors due to AI hallucination (e.g., missing an error), you can also combine `.aiQuery` with standard JavaScript assertions instead of using `.aiAssert`.

For example, you might replace the above code with:

```typescript
const items = await agent.aiQuery(
  '"{name: string, price: number}[], return product names and prices'
);
const onesieItem = items.find(item => item.name === 'Sauce Labs Onesie');
expect(onesieItem).toBeTruthy();
expect(onesieItem.price).toBe(7.99);
```
:::

### `agent.aiLocate()`

Locate an element using natural language.

* Type

```typescript
function aiLocate(locate: string, options?: Object): Promise<{
  rect: {
    left: number;
    top: number;
    width: number;
    height: number;
  };
  center: [number, number];
}>;
```

* Parameters:
  * `locate: string` - A natural language description of the element to locate.
  * `options?: Object` - Optional, a configuration object containing:
    * `deepThink?: boolean` - If true, Midscene will call AI model twice to precisely locate the element. 
    * `cacheable?: boolean` - Whether cacheable when enabling [caching feature](./caching.mdx). True by default.

* Return Value:
  * Returns a `Promise` when the element is located parsed as an locate info object.

* Examples:

```typescript
const locateInfo = await agent.aiLocate('The login button at the top of the page');
console.log(locateInfo);
```

### `agent.aiWaitFor()`

Wait until a specified condition, described in natural language, becomes true. Considering the cost of AI calls, the check interval will not exceed the specified `checkIntervalMs`.

* Type

```typescript
function aiWaitFor(
  assertion: string, 
  options?: { 
    timeoutMs?: number;
    checkIntervalMs?: number;
  }
): Promise<void>;
```

* Parameters:
  * `assertion: string` - The condition described in natural language.
  * `options?: object` - An optional configuration object containing:
    * `timeoutMs?: number` - Timeout in milliseconds (default: 15000).
    * `checkIntervalMs?: number` - Interval for checking in milliseconds (default: 3000).

* Return Value:
  * Returns a Promise that resolves to void if the condition is met; if not, an error is thrown when the timeout is reached.

* Examples:

```typescript
// Basic usage
await agent.aiWaitFor("There is at least one headphone information displayed on the interface");

// Using custom options
await agent.aiWaitFor("The shopping cart icon shows a quantity of 2", {
  timeoutMs: 30000,    // Wait for 30 seconds
  checkIntervalMs: 5000  // Check every 5 seconds
});
```

:::tip
Given the time consumption of AI services, `.aiWaitFor` might not be the most efficient method. Sometimes, using a simple sleep function may be a better alternative.
:::

### `agent.runYaml()`

Execute an automation script written in YAML. Only the `tasks` part of the script is executed, and it returns the results of all `.aiQuery` calls within the script.

* Type

```typescript
function runYaml(yamlScriptContent: string): Promise<{ result: any }>;
```

* Parameters:
  * `yamlScriptContent: string` - The YAML-formatted script content.

* Return Value:
  * Returns an object with a `result` property that includes the results of all `.aiQuery` calls.

* Example:

```typescript
const { result } = await agent.runYaml(`
tasks:
  - name: search weather
    flow:
      - ai: input 'weather today' in input box, click search button
      - sleep: 3000

  - name: query weather
    flow:
      - aiQuery: "the result shows the weather info, {description: string}"
`);
console.log(result);
```

:::tip
For more information about YAML scripts, please refer to [Automate with Scripts in YAML](./automate-with-scripts-in-yaml).
:::

### `agent.setAIActionContext()`

Set the background knowledge that should be sent to the AI model when calling `agent.aiAction()`.

* Type

```typescript
function setAIActionContext(actionContext: string): void;
```

* Parameters:
  * `actionContext: string` - The background knowledge that should be sent to the AI model.

* Example:

```typescript
await agent.setAIActionContext('Close the cookie consent dialog first if it exists');
```

### `agent.evaluateJavaScript()`

Evaluate a JavaScript expression in the web page context.

* Type

```typescript
function evaluateJavaScript(script: string): Promise<any>;
```

* Parameters:
  * `script: string` - The JavaScript expression to evaluate.

* Return Value:
  * Returns the result of the JavaScript expression.

* Example:  

```typescript
const result = await agent.evaluateJavaScript('document.title');
console.log(result);
```

## Properties

### `.reportFile`

The path to the report file.

## Additional Configurations

### Setting Environment Variables at Runtime

You can override environment variables at runtime by calling the `overrideAIConfig` method.

```typescript
import { overrideAIConfig } from '@midscene/web/puppeteer'; // or another Agent

overrideAIConfig({
  OPENAI_BASE_URL: "...",
  OPENAI_API_KEY: "...",
  MIDSCENE_MODEL_NAME: "..."
});
```

### Print usage information for each AI call

Set the `DEBUG=midscene:ai:profile:stats` to view the execution time and usage for each AI call.

```bash
export DEBUG=midscene:ai:profile:stats
```

### Customize the run artifact directory

Set the `MIDSCENE_RUN_DIR` variable to customize the run artifact directory.

```bash
export MIDSCENE_RUN_DIR=midscene_run # The default value is the midscene_run in the current working directory, you can set it to an absolute path or a relative path
```

### Using LangSmith

LangSmith is a platform for debugging large language models. To integrate LangSmith, follow these steps:

```bash
# Set environment variables

# Enable debug mode
export MIDSCENE_LANGSMITH_DEBUG=1 

# LangSmith configuration
export LANGSMITH_TRACING_V2=true
export LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
export LANGSMITH_API_KEY="your_key_here"
export LANGSMITH_PROJECT="your_project_name_here"
```

After starting Midscene, you should see logs similar to:

```log
DEBUGGING MODE: langsmith wrapper enabled
```