# Preserve Your Identity with Crawl4AI
Crawl4AI empowers you to navigate and interact with the web using your **authentic digital identity**, ensuring you're recognized as a human and not mistaken for a bot. This tutorial covers:

1. **Managed Browsers**: the recommended approach for persistent profiles and identity-based crawling.
2. **Magic Mode**: a simplified fallback for quick automation without a persistent identity.
---
## 1. Managed Browsers: Your Digital Identity Solution
**Managed Browsers** let developers create and use **persistent browser profiles**. These profiles store local storage, cookies, and other session data, letting you browse as your **real self**—complete with logins, preferences, and cookies.
### Key Benefits
- **Authentic Browsing Experience**: Retain session data and browser fingerprints as though you're a normal user.
- **Effortless Configuration**: Once you log in or solve CAPTCHAs in your chosen data directory, you can re-run crawls without repeating those steps.
- **Empowered Data Access**: If you can see the data in your own browser, you can automate its retrieval with your genuine identity.
---
The next section shows how to create a user-data directory using **Playwright's Chromium** binary rather than a system-wide Chrome/Edge. We'll **locate** that binary and launch it with a `--user-data-dir` argument to set up your profile. You can then point `BrowserConfig.user_data_dir` to that folder for subsequent crawls.

---
## 2. Creating a User Data Directory (Command-Line Approach via Playwright)
If you installed Crawl4AI (which installs Playwright under the hood), you already have a Playwright-managed Chromium on your system. Follow these steps to launch that **Chromium** from your command line, specifying a **custom** data directory:
1. **Find** the Playwright Chromium binary:

    - On most systems, installed browsers live under `~/.cache/ms-playwright/` or a similar path.
    - To see an overview of installed browsers, run:

      ```bash
      python -m playwright install --dry-run
      ```

      or

      ```bash
      playwright install --dry-run
      ```

      (depending on your environment). This shows where Playwright keeps Chromium.
    - For instance, you might see a path like:

      ```
      ~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome
      ```

      on Linux, or a corresponding folder on macOS/Windows.

2. **Launch** the Playwright Chromium binary with a **custom** user-data directory:

    ```bash
    # Linux example
    ~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome \
        --user-data-dir=/home/<you>/my_chrome_profile
    ```

    ```bash
    # macOS example (Playwright's internal binary)
    ~/Library/Caches/ms-playwright/chromium-1234/chrome-mac/Chromium.app/Contents/MacOS/Chromium \
        --user-data-dir=/Users/<you>/my_chrome_profile
    ```

    ```powershell
    # Windows example (PowerShell/cmd)
    "C:\Users\<you>\AppData\Local\ms-playwright\chromium-1234\chrome-win\chrome.exe" ^
        --user-data-dir="C:\Users\<you>\my_chrome_profile"
    ```

    **Replace** the path with the actual subfolder shown in your `ms-playwright` cache.

    - This **opens** a fresh Chromium with your new or existing data folder.
    - **Log into** any sites or configure the browser the way you want.
    - **Close** the browser when done; your profile data is saved in that folder. (If you prefer not to hunt for the binary path by hand, see the Python sketch after this list.)

3. **Use** that folder in **`BrowserConfig.user_data_dir`**:

    ```python
    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

    browser_config = BrowserConfig(
        headless=True,
        use_managed_browser=True,
        user_data_dir="/home/<you>/my_chrome_profile",
        browser_type="chromium"
    )
    ```

    - The next time you run your code, it reuses that folder, **preserving** your session data, cookies, local storage, etc.
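
If you'd rather not dig through the `ms-playwright` cache by hand, the sketch below uses the Playwright Python API (installed alongside Crawl4AI) to print where the Chromium binary lives and to open an interactive window bound to your profile folder so you can log in once. This is a convenience script, not part of Crawl4AI's own API; the profile path and login URL are placeholders you should adjust.

```python
# Convenience sketch (assumptions: profile path and login URL are placeholders).
from pathlib import Path
from playwright.sync_api import sync_playwright

PROFILE_DIR = Path.home() / "my_chrome_profile"  # placeholder profile folder

with sync_playwright() as p:
    # Where the Playwright-managed Chromium binary is installed on this machine
    print("Chromium binary:", p.chromium.executable_path)

    # Open a visible window against the profile folder; log in, then close it.
    context = p.chromium.launch_persistent_context(
        user_data_dir=str(PROFILE_DIR),
        headless=False,
    )
    page = context.new_page()
    page.goto("https://example.com/login")  # replace with the site you need
    input("Log in in the opened window, then press Enter here to close...")
    context.close()
```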
---
## 3. Using Managed Browsers in Crawl4AI
Once you have a data directory with your session data, pass it to **`BrowserConfig`**:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # 1) Reference your persistent data directory
    browser_config = BrowserConfig(
        headless=True,             # 'True' for automated runs
        verbose=True,
        use_managed_browser=True,  # Enables persistent browser strategy
        browser_type="chromium",
        user_data_dir="/path/to/my-chrome-profile"
    )

    # 2) Standard crawl config
    crawl_config = CrawlerRunConfig(
        wait_for="css:.logged-in-content"
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com/private", config=crawl_config)
        if result.success:
            print("Successfully accessed private data with your identity!")
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
### Workflow
1. **Login** externally (via CLI or your normal Chrome with `--user-data-dir=...`; see the example below this list).
2. **Close** that browser.
3. **Use** the same folder in `user_data_dir=` in Crawl4AI.
4. **Crawl**: the site sees your identity as if you're the same user who just logged in.
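
As a concrete illustration of step 1, you can launch a system-installed Chrome (or the Playwright Chromium shown earlier) against the profile folder, log in by hand, and close it. The command below is a sketch; the binary name and profile path depend on your OS and installation.

```bash
# Launch a system Chrome/Chromium against the profile folder (Linux example;
# binary name and path are placeholders for your system).
google-chrome --user-data-dir=/home/<you>/my_chrome_profile
# ...log in to the sites you need, then close the window.
# The same folder is then passed to BrowserConfig(user_data_dir=...) in Crawl4AI.
```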
---
## 4. Magic Mode: Simplified Automation
If you **don't** need a persistent profile or an identity-based approach, **Magic Mode** offers a quick way to simulate human-like browsing without storing long-term data.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(
                magic=True,  # Simplifies a lot of interaction
                remove_overlay_elements=True,
                page_timeout=60000
            )
        )
        print(result.success)

if __name__ == "__main__":
    asyncio.run(main())
```
**Magic Mode**:
- Simulates a user-like experience
- Randomizes user agent & navigator
- Randomizes interactions & timings
- Masks automation signals
- Attempts pop-up handling

**But** it's no substitute for **true** user-based sessions if you want a fully legitimate identity-based solution.

---
## 5. Comparing Managed Browsers vs. Magic Mode
| Feature | **Managed Browsers** | **Magic Mode** |
|----------------------------|---------------------------------------------------------------|-----------------------------------------------------|
| **Session Persistence** | Full localStorage/cookies retained in user_data_dir | No persistent data (fresh each run) |
| **Genuine Identity** | Real user profile with full rights & preferences | Emulated user-like patterns, but no actual identity |
| **Complex Sites** | Best for login-gated sites or heavy config | Simple tasks, minimal login or config needed |
| **Setup** | External creation of user_data_dir, then use in Crawl4AI | Single-line approach (`magic=True`) |
| **Reliability** | Extremely consistent (same data across runs) | Good for smaller tasks, can be less stable |
---
## 6. Summary
- **Create** your user-data directory by launching Chrome/Chromium externally with `--user-data-dir=/some/path`.
- **Log in** or configure sites as needed, then close the browser.
- **Reference** that folder in `BrowserConfig(user_data_dir="...")` + `use_managed_browser=True`.
- Enjoy **persistent** sessions that reflect your real identity.
- If you only need quick, ephemeral automation, **Magic Mode** might suffice.

**Recommended**: Always prefer a **Managed Browser** for robust, identity-based crawling and simpler interactions with complex sites. Use **Magic Mode** for quick tasks or prototypes where persistent data is unnecessary.

With these approaches, you preserve your **authentic** browsing environment, ensuring the site sees you exactly as a normal user—no repeated logins or wasted time.