At https://huyenchip.com/llama-police, Chip Huyen maintains a regularly updated list of open-source AI tools and repositories built around foundation models.

I really enjoy exploring these tools and try to get hands-on with them. Every week, I pick one and build a small toy project to better understand how it works. Each project teaches me something new about AI and large language models (LLMs).

Last week's tool was Crawl4AI. I wanted to build a toy project with it, and chose the CIA World Factbook as the website to crawl.

Let me show you what I did in more detail!

The first thing I did was set up a Python virtual environment and install crawl4ai:

pip install crawl4ai

Then I ran:

crawl4ai-setup

This command installs or updates the Playwright browsers that crawl4ai depends on.
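
If the browser download misbehaves, there are two handy fallbacks: crawl4ai ships a crawl4ai-doctor diagnostic command, and the Chromium browser can also be installed manually through Playwright itself:

crawl4ai-doctor
python -m playwright install chromium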


A Simple Crawl Example

To test the setup, I ran a basic crawl example:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Create an instance of AsyncWebCrawler
    async with AsyncWebCrawler() as crawler:
        # Run the crawler on a URL
        result = await crawler.arun(url="https://crawl4ai.com")

        # Print the extracted content
        print(result.markdown)

# Run the async main function
asyncio.run(main())
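
The returned result object carries more than the Markdown. Here is a minimal sketch, assuming the CrawlResult attribute names from the crawl4ai docs (success, status_code, links), that prints a few of them:

import asyncio
from crawl4ai import AsyncWebCrawler

async def inspect_result():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://crawl4ai.com")
        print(result.success)                         # did the fetch succeed?
        print(result.status_code)                     # HTTP status code of the page
        print(len(result.links.get("internal", [])))  # internal links found on the page

asyncio.run(inspect_result())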

Crawling the CIA World Factbook

Now let's move on to something more interesting—crawling the CIA World Factbook. Here's a simple deep crawling example:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

async def main():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            include_external=False
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://www.cia.gov/the-world-factbook/", config=config)

        print(f"Crawled {len(results)} pages in total")

        for result in results[:3]:
            print(f"URL: {result.url}")
            print(f"Depth: {result.metadata.get('depth', 0)}")

if __name__ == "__main__":
    asyncio.run(main())
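
By default the deep crawl collects everything and returns one big list at the end. The crawl4ai docs also describe a streaming mode; here is a sketch of that variant, assuming your version supports stream=True on CrawlerRunConfig:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, include_external=False),
        stream=True,  # yield results as pages finish instead of returning one list
    )
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://www.cia.gov/the-world-factbook/", config=config):
            print(f"Just finished: {result.url}")

if __name__ == "__main__":
    asyncio.run(main())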

Saving Crawl Results to Disk

To save the results, I wrote a helper function that creates a safe filename from a URL:

import re
from urllib.parse import urlparse

def filename_from_url(url, extension=""):
    parsed = urlparse(url)
    base = parsed.netloc + parsed.path
    base = base.rstrip('/')
    base = re.sub(r'[<>:"/\\|?*]', '_', base)
    if parsed.query or parsed.fragment:
        extra = parsed.query + parsed.fragment
        extra = re.sub(r'[<>:"/\\|?*]', '_', extra)
        base += "_" + extra
    if extension and not base.endswith(extension):
        base += extension if extension.startswith('.') else '.' + extension
    return base
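
For example, a typical Factbook country page maps to a flat, Windows-safe filename (the URL below is just an illustration):

print(filename_from_url("https://www.cia.gov/the-world-factbook/countries/france/", "md"))
# -> www.cia.gov_the-world-factbook_countries_france.md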

Adding Filters and Fine-Tuning the Crawl

To focus only on relevant pages and exclude unnecessary ones, I used filters in the crawl configuration (the filter chain itself is sketched right after this list):

  • Tag exclusions: 'form', 'header', 'footer', 'nav', 'img'
  • URL filtering using URLPatternFilter
  • MIME type filtering via ContentTypeFilter
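
The tag exclusions go directly on CrawlerRunConfig via excluded_tags, while the URL and MIME filters are combined into a FilterChain. Here is the chain on its own, mirroring the one used in the final script; reverse=True inverts the second pattern filter so anything under the archives is skipped:

from crawl4ai.deep_crawling import FilterChain, URLPatternFilter, ContentTypeFilter

filter_chain = FilterChain([
    # only follow URLs that belong to the Factbook section
    URLPatternFilter(patterns=["*the-world-factbook*"]),
    # reverse=True inverts the match: drop anything under the archives
    URLPatternFilter(patterns=["*archives*"], reverse=True),
    # keep only HTML pages
    ContentTypeFilter(allowed_types=["text/html"]),
])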

Final Version of the Script

Here’s the complete version that crawls the World Factbook, filters pages, and saves the results as Markdown files:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy, FilterChain, URLPatternFilter, ContentTypeFilter
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

import re
from urllib.parse import urlparse

def filename_from_url(url, extension=""):
    parsed = urlparse(url)
    # Combine netloc and path
    base = parsed.netloc + parsed.path
    # Remove trailing slashes and unsafe characters
    base = base.rstrip('/')
    base = re.sub(r'[<>:"/\\|?*]', '_', base)  # Windows-safe characters
    # Add query or fragment hash if needed
    if parsed.query or parsed.fragment:
        extra = parsed.query + parsed.fragment
        extra = re.sub(r'[<>:"/\\|?*]', '_', extra)
        base += "_" + extra
    # Add optional file extension
    if extension and not base.endswith(extension):
        base += extension if extension.startswith('.') else '.' + extension
    return base

filter_chain = FilterChain(
    [
        URLPatternFilter(patterns=["*the-world-factbook*"]),
        URLPatternFilter(patterns=["*archives*"], reverse=True),
        ContentTypeFilter(allowed_types=["text/html"])
    ]
)

async def main():
    # Configure the deep crawl; max_depth is set very high so the whole Factbook section gets covered
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=200000, 
            include_external=False,
            filter_chain=filter_chain
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True,
        # Tag exclusions
        excluded_tags=['form', 'header', 'footer', 'nav', 'img'],
        # Link filtering
        exclude_external_links=True,    
        exclude_social_media_links=True,
        exclude_external_images=True,
        wait_for_images=False,
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://www.cia.gov/the-world-factbook/", config=config)
        print(f"Crawled {len(results)} pages in total")
        # Access individual results
        for result in results:
            filepath = filename_from_url(result.url, "md")
            print(f"URL: {result.url}")
            print(f"Depth: {result.metadata.get('depth', 0)}")
            with open(filepath, "w", encoding="utf-8") as f:
                f.write(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
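
As written, the script drops every Markdown file into the current working directory. A small variation, assuming you prefer a dedicated output folder (the factbook_pages name is just an example), would create the directory up front and join it into each path:

import os

OUTPUT_DIR = "factbook_pages"  # example folder name, pick whatever you like
os.makedirs(OUTPUT_DIR, exist_ok=True)

# ... then, inside the result loop:
filepath = os.path.join(OUTPUT_DIR, filename_from_url(result.url, "md"))
with open(filepath, "w", encoding="utf-8") as f:
    f.write(result.markdown)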

To learn more about the open-source AI ecosystem behind these tools, see https://huyenchip.com/2024/03/14/ai-oss.html.
