
# Preserve Your Identity with Crawl4AI

Crawl4AI empowers you to navigate and interact with the web using your **authentic digital identity**, ensuring you’re recognized as a human and not mistaken for a bot. This tutorial covers:

1. **Managed Browsers** – The recommended approach for persistent profiles and identity-based crawling.
2. **Magic Mode** – A simplified fallback solution for quick automation without persistent identity.

---

## 1. Managed Browsers: Your Digital Identity Solution

**Managed Browsers** let developers create and use **persistent browser profiles**. These profiles store local storage, cookies, and other session data, letting you browse as your **real self**—complete with logins, preferences, and cookies.

### Key Benefits

- **Authentic Browsing Experience**: Retain session data and browser fingerprints as though you’re a normal user.
- **Effortless Configuration**: Once you log in or solve CAPTCHAs in your chosen data directory, you can re-run crawls without repeating those steps.
- **Empowered Data Access**: If you can see the data in your own browser, you can automate its retrieval with your genuine identity.

---

The rest of this section walks through **creating a user-data directory** using **Playwright’s Chromium** binary rather than a system-wide Chrome/Edge. We’ll show how to **locate** that binary and launch it with a `--user-data-dir` argument to set up your profile. You can then point `BrowserConfig.user_data_dir` to that folder for subsequent crawls.

---

## 2. Creating a User Data Directory (Command-Line Approach via Playwright)

If you installed Crawl4AI (which installs Playwright under the hood), you already have a Playwright-managed Chromium on your system. Follow these steps to launch that **Chromium** from your command line, specifying a **custom** data directory:

1. **Find** the Playwright Chromium binary:
   - On most systems, installed browsers live under `~/.cache/ms-playwright/` or a similar path.
   - To see an overview of installed browsers, run:
     ```bash
     python -m playwright install --dry-run
     ```
     or
     ```bash
     playwright install --dry-run
     ```
     (depending on your environment). This shows where Playwright keeps Chromium; you can also query the path from Python, as sketched after these steps.
   - For instance, you might see a path like:
     ```
     ~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome
     ```
     on Linux, or a corresponding folder on macOS/Windows.

2. **Launch** the Playwright Chromium binary with a **custom** user-data directory:
   ```bash
   # Linux example
   ~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome \
       --user-data-dir=/home/<you>/my_chrome_profile
   ```
   ```bash
   # macOS example (Playwright’s internal binary)
   ~/Library/Caches/ms-playwright/chromium-1234/chrome-mac/Chromium.app/Contents/MacOS/Chromium \
       --user-data-dir=/Users/<you>/my_chrome_profile
   ```
   ```powershell
   # Windows example (PowerShell/cmd)
   "C:\Users\<you>\AppData\Local\ms-playwright\chromium-1234\chrome-win\chrome.exe" ^
       --user-data-dir="C:\Users\<you>\my_chrome_profile"
   ```
   **Replace** the path with the actual subfolder shown in your `ms-playwright` cache.
   - This **opens** a fresh Chromium with your new or existing data folder.
   - **Log into** any sites or configure the browser the way you want.
   - **Close** the browser when done—your profile data is saved in that folder.

3. **Use** that folder in **`BrowserConfig.user_data_dir`**:
   ```python
   from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

   browser_config = BrowserConfig(
       headless=True,
       use_managed_browser=True,
       user_data_dir="/home/<you>/my_chrome_profile",
       browser_type="chromium"
   )
   ```
   - Next time you run your code, it reuses that folder—**preserving** your session data, cookies, local storage, etc.
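
If you’d rather not read the `--dry-run` output by hand (step 1), Playwright’s Python API can report the same binary location directly; a minimal sketch:

```python
# Minimal sketch: ask Playwright where its bundled Chromium lives instead of
# hunting through the ms-playwright cache manually.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Path to the Chromium binary described in step 1
    print(p.chromium.executable_path)
```

Combine the printed path with `--user-data-dir=...` exactly as in step 2.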

---

## 3. Using Managed Browsers in Crawl4AI

Once you have a data directory with your session data, pass it to **`BrowserConfig`**:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # 1) Reference your persistent data directory
    browser_config = BrowserConfig(
        headless=True,             # 'True' for automated runs
        verbose=True,
        use_managed_browser=True,  # Enables the persistent-browser strategy
        browser_type="chromium",
        user_data_dir="/path/to/my-chrome-profile"
    )

    # 2) Standard crawl config
    crawl_config = CrawlerRunConfig(
        wait_for="css:.logged-in-content"
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com/private", config=crawl_config)
        if result.success:
            print("Successfully accessed private data with your identity!")
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

### Workflow

1. **Log in** externally (via the CLI or your normal Chrome with `--user-data-dir=...`); this can also be scripted, as sketched after this list.
2. **Close** that browser.
3. **Use** the same folder in `user_data_dir=` in Crawl4AI.
4. **Crawl** – the site sees your identity as if you’re the same user who just logged in.
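
The login step lends itself to a small helper. Below is a hedged sketch of workflow steps 1–2: it opens the profile browser so you can log in by hand, then blocks until you close it. Both paths are placeholders; substitute the binary path from your `ms-playwright` cache and your own profile folder.

```python
# Sketch of workflow steps 1-2: launch Chromium with the persistent profile,
# log in manually, then close the window. Both paths are placeholders.
import os
import subprocess

chromium = os.path.expanduser(
    "~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome"  # placeholder
)
profile = os.path.expanduser("~/my_chrome_profile")  # placeholder

# Blocks until you close the browser; cookies/localStorage land in `profile`.
subprocess.run([chromium, f"--user-data-dir={profile}"])
```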

---

## 4. Magic Mode: Simplified Automation

If you **don’t** need a persistent profile or identity-based approach, **Magic Mode** offers a quick way to simulate human-like browsing without storing long-term data.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(
                magic=True,  # Simplifies a lot of interaction
                remove_overlay_elements=True,
                page_timeout=60000
            )
        )

if __name__ == "__main__":
    asyncio.run(main())
```

**Magic Mode**:

- Simulates a user-like experience
- Randomizes user agent & navigator
- Randomizes interactions & timings
- Masks automation signals
- Attempts pop-up handling

**But** it’s no substitute for **true** user-based sessions if you want a fully legitimate identity-based solution.

---

## 5. Comparing Managed Browsers vs. Magic Mode

| Feature                 | **Managed Browsers**                                        | **Magic Mode**                                      |
|-------------------------|-------------------------------------------------------------|-----------------------------------------------------|
| **Session Persistence** | Full localStorage/cookies retained in `user_data_dir`       | No persistent data (fresh each run)                 |
| **Genuine Identity**    | Real user profile with full rights & preferences            | Emulated user-like patterns, but no actual identity |
| **Complex Sites**       | Best for login-gated sites or heavy config                  | Simple tasks, minimal login or config needed        |
| **Setup**               | External creation of `user_data_dir`, then use in Crawl4AI  | Single-line approach (`magic=True`)                 |
| **Reliability**         | Extremely consistent (same data across runs)                | Good for smaller tasks, can be less stable          |

---

## 6. Summary

- **Create** your user-data directory by launching Chrome/Chromium externally with `--user-data-dir=/some/path`.
- **Log in** or configure sites as needed, then close the browser.
- **Reference** that folder in `BrowserConfig(user_data_dir="...")` + `use_managed_browser=True`.
- Enjoy **persistent** sessions that reflect your real identity.
- If you only need quick, ephemeral automation, **Magic Mode** might suffice.

**Recommended**: Always prefer a **Managed Browser** for robust, identity-based crawling and simpler interactions with complex sites. Use **Magic Mode** for quick tasks or prototypes where persistent data is unnecessary.
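
To make that recommendation concrete, here is a hedged sketch (using only parameters shown earlier on this page) of switching between the two modes behind a single flag; `make_configs` and `profile_dir` are illustrative names, not part of the library:

```python
# Sketch: Managed Browser for identity-based crawls, Magic Mode otherwise.
# Every parameter below appears in the examples above; profile_dir is a placeholder.
from crawl4ai import BrowserConfig, CrawlerRunConfig

def make_configs(identity: bool, profile_dir: str = "/path/to/my-chrome-profile"):
    if identity:
        # Managed Browser: persistent profile, real session data
        browser = BrowserConfig(
            headless=True,
            use_managed_browser=True,
            browser_type="chromium",
            user_data_dir=profile_dir,
        )
        run = CrawlerRunConfig()
    else:
        # Magic Mode: ephemeral, human-like simulation only
        browser = BrowserConfig(headless=True)
        run = CrawlerRunConfig(magic=True, remove_overlay_elements=True)
    return browser, run
```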

With these approaches, you preserve your **authentic** browsing environment, ensuring the site sees you exactly as a normal user—no repeated logins or wasted time.