# Preserve Your Identity with Crawl4AI

Crawl4AI empowers you to navigate and interact with the web using your **authentic digital identity**, ensuring you're recognized as a human and not mistaken for a bot. This tutorial covers:

1. **Managed Browsers** – the recommended approach for persistent profiles and identity-based crawling.
2. **Magic Mode** – a simplified fallback for quick automation without a persistent identity.

---

## 1. Managed Browsers: Your Digital Identity Solution

**Managed Browsers** let developers create and use **persistent browser profiles**. These profiles store local storage, cookies, and other session data, letting you browse as your **real self**—complete with logins, preferences, and cookies.

### Key Benefits

- **Authentic Browsing Experience**: Retain session data and browser fingerprints as though you're a normal user.
- **Effortless Configuration**: Once you log in or solve CAPTCHAs in your chosen data directory, you can re-run crawls without repeating those steps.
- **Empowered Data Access**: If you can see the data in your own browser, you can automate its retrieval with your genuine identity.

---

## 2. Creating a User Data Directory (Command-Line Approach via Playwright)

This section shows how to create a user-data directory using **Playwright's Chromium** binary rather than a system-wide Chrome/Edge: you **locate** that binary, launch it with a `--user-data-dir` argument to set up your profile, and then point `BrowserConfig.user_data_dir` at that folder for subsequent crawls.

If you installed Crawl4AI (which installs Playwright under the hood), you already have a Playwright-managed Chromium on your system. Follow these steps to launch that **Chromium** from your command line, specifying a **custom** data directory:
1. **Find** the Playwright Chromium binary:

   - On most systems, installed browsers live under `~/.cache/ms-playwright/` or a similar path.
   - To see an overview of installed browsers, run:

     ```bash
     python -m playwright install --dry-run
     ```

     or

     ```bash
     playwright install --dry-run
     ```

     (depending on your environment). This shows where Playwright keeps Chromium.
   - For instance, you might see a path like:

     ```
     ~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome
     ```

     on Linux, or a corresponding folder on macOS/Windows.

2. **Launch** the Playwright Chromium binary with a **custom** user-data directory:

   ```bash
   # Linux example
   ~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome \
       --user-data-dir=/home//my_chrome_profile
   ```

   ```bash
   # macOS example (Playwright’s internal binary)
   ~/Library/Caches/ms-playwright/chromium-1234/chrome-mac/Chromium.app/Contents/MacOS/Chromium \
       --user-data-dir=/Users//my_chrome_profile
   ```

   ```powershell
   # Windows example (PowerShell/cmd)
   "C:\Users\\AppData\Local\ms-playwright\chromium-1234\chrome-win\chrome.exe" ^
       --user-data-dir="C:\Users\\my_chrome_profile"
   ```

   **Replace** the path with the actual subfolder indicated in your `ms-playwright` cache structure.

   - This **opens** a fresh Chromium with your new or existing data folder.
   - **Log into** any sites or configure the browser the way you want.
   - **Close** it when done—your profile data is saved in that folder.

3. **Use** that folder in **`BrowserConfig.user_data_dir`**:

   ```python
   from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

   browser_config = BrowserConfig(
       headless=True,
       use_managed_browser=True,
       user_data_dir="/home//my_chrome_profile",
       browser_type="chromium"
   )
   ```

   Next time you run your code, it reuses that folder—**preserving** your session data, cookies, local storage, etc.

---
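The lookup in step 1 can also be done with a short script instead of browsing the cache by hand. Below is a minimal sketch using only the standard library; the per-platform locations are the assumed defaults shown above, and the real `PLAYWRIGHT_BROWSERS_PATH` environment variable, when set, overrides them:

```python
import os
import sys
from pathlib import Path


def playwright_cache_dir() -> Path:
    """Default ms-playwright cache location per platform (assumed
    defaults; PLAYWRIGHT_BROWSERS_PATH overrides them when set)."""
    override = os.environ.get("PLAYWRIGHT_BROWSERS_PATH")
    if override:
        return Path(override)
    if sys.platform == "darwin":
        return Path.home() / "Library" / "Caches" / "ms-playwright"
    if sys.platform.startswith("win"):
        return Path.home() / "AppData" / "Local" / "ms-playwright"
    return Path.home() / ".cache" / "ms-playwright"


def chromium_builds(cache: Path) -> list[Path]:
    """Return the chromium-* build folders found in the cache,
    sorted by revision name (empty list if the cache is missing)."""
    if not cache.is_dir():
        return []
    return sorted(cache.glob("chromium-*"))


for build in chromium_builds(playwright_cache_dir()):
    print(build)
```

Each printed folder contains the platform-specific binary (e.g. `chrome-linux/chrome`) to use in step 2.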
## 3. Using Managed Browsers in Crawl4AI

Once you have a data directory with your session data, pass it to **`BrowserConfig`**:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # 1) Reference your persistent data directory
    browser_config = BrowserConfig(
        headless=True,             # 'True' for automated runs
        verbose=True,
        use_managed_browser=True,  # Enables persistent browser strategy
        browser_type="chromium",
        user_data_dir="/path/to/my-chrome-profile"
    )

    # 2) Standard crawl config
    crawl_config = CrawlerRunConfig(
        wait_for="css:.logged-in-content"
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com/private", config=crawl_config)
        if result.success:
            print("Successfully accessed private data with your identity!")
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

### Workflow

1. **Log in** externally (via the CLI or your normal Chrome with `--user-data-dir=...`).
2. **Close** that browser.
3. **Use** the same folder in `user_data_dir=` in Crawl4AI.
4. **Crawl** – the site sees your identity as if you're the same user who just logged in.

---

## 4. Magic Mode: Simplified Automation

If you **don't** need a persistent profile or an identity-based approach, **Magic Mode** offers a quick way to simulate human-like browsing without storing long-term data.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(
                magic=True,  # Simplifies a lot of interaction
                remove_overlay_elements=True,
                page_timeout=60000
            )
        )

asyncio.run(main())
```

**Magic Mode**:

- Simulates a user-like experience
- Randomizes user agent & navigator properties
- Randomizes interactions & timings
- Masks automation signals
- Attempts pop-up handling

**But** it's no substitute for a **true** user-based session if you want a fully legitimate identity-based solution.
---

## 5. Comparing Managed Browsers vs. Magic Mode

| Feature                 | **Managed Browsers**                                      | **Magic Mode**                                      |
|-------------------------|-----------------------------------------------------------|-----------------------------------------------------|
| **Session Persistence** | Full localStorage/cookies retained in `user_data_dir`     | No persistent data (fresh each run)                 |
| **Genuine Identity**    | Real user profile with full rights & preferences          | Emulated user-like patterns, but no actual identity |
| **Complex Sites**       | Best for login-gated sites or heavy configuration         | Simple tasks; minimal login or config needed        |
| **Setup**               | External creation of `user_data_dir`, then use in Crawl4AI | Single-line approach (`magic=True`)                |
| **Reliability**         | Extremely consistent (same data across runs)              | Good for smaller tasks; can be less stable          |

---

## 6. Summary

- **Create** your user-data directory by launching Chrome/Chromium externally with `--user-data-dir=/some/path`.
- **Log in** or configure sites as needed, then close the browser.
- **Reference** that folder in `BrowserConfig(user_data_dir="...")` together with `use_managed_browser=True`.
- Enjoy **persistent** sessions that reflect your real identity.
- If you only need quick, ephemeral automation, **Magic Mode** may suffice.

**Recommended**: Prefer a **Managed Browser** for robust, identity-based crawling and simpler interactions with complex sites. Use **Magic Mode** for quick tasks or prototypes where persistent data is unnecessary.

With these approaches, you preserve your **authentic** browsing environment, ensuring the site sees you exactly as a normal user—no repeated logins or wasted time.
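The create-then-reference loop in the summary can be partially scripted. Below is a minimal sketch; the binary path and profile folder are hypothetical placeholders, not Crawl4AI APIs, so adjust them to your system:

```python
import subprocess
from pathlib import Path


def login_command(chromium_binary: str, profile_dir: Path) -> list[str]:
    """Build the command line used in the first step: open the browser
    against a persistent profile folder."""
    return [chromium_binary, f"--user-data-dir={profile_dir}"]


def open_for_login(chromium_binary: str, profile_dir: Path) -> None:
    """Create the profile folder if needed, then open the browser so
    you can log in manually; close the window when finished."""
    profile_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(login_command(chromium_binary, profile_dir), check=True)


# Afterwards, point Crawl4AI at the same folder (as in Section 3):
#
#   browser_config = BrowserConfig(
#       headless=True,
#       use_managed_browser=True,
#       browser_type="chromium",
#       user_data_dir=str(Path.home() / "my_chrome_profile"),
#   )
```

Run `open_for_login(...)` once per profile; every subsequent crawl that references the same folder reuses the saved session.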