Web scraping

Web scraping is the acquisition of data from web servers, typically by automated programs that retrieve and parse HTML pages or call public APIs. The data could be text (news articles, product reviews, government filings), images (Wikipedia photos, satellite imagery), video (YouTube), audio (podcasts), financial records, or weather observations. The training corpora for large language models are drawn largely from text scraped from the public web.

The point of scraping, as a data collection strategy, is that data already in the world is often a faster and cheaper source than data we go out and collect ourselves. If somebody else has already gathered the measurements we need, retrieving them beats building a new collection apparatus.

In Python, the standard tools are requests (for HTTP), BeautifulSoup and lxml (for HTML parsing), Scrapy (a complete crawling framework), and Selenium or Playwright (for JavaScript-heavy sites that need a real browser). API-based scraping typically uses the service’s own client library where one exists, and plain HTTP requests otherwise.
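The parse step can be sketched in a few lines. To stay dependency-free and runnable offline, this example swaps in the standard library's html.parser for BeautifulSoup or lxml and works on an inline HTML string rather than a fetched page. The sample markup, the example.com URLs, and the names LinkExtractor and extract_links are illustrative assumptions, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute URL, link text) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Hypothetical helper: return every link found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up page content standing in for a downloaded document.
SAMPLE_HTML = """
<html><body>
  <h1>Filings</h1>
  <ul>
    <li><a href="/filings/2023.html">2023 annual report</a></li>
    <li><a href="https://example.org/q1.html">Q1 <b>update</b></a></li>
    <li><a name="top">no href here</a></li>
  </ul>
</body></html>
"""

links = extract_links(SAMPLE_HTML, "https://example.com/index.html")
for url, text in links:
    print(url, "|", text)
```

In a live scraper the HTML string would come from an HTTP call, for example requests.get(url).text, and BeautifulSoup or lxml would usually replace the hand-written parser class.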

Scraping has ethical and legal complications that pure sensor data doesn’t. Sites publish a robots.txt that tells well-behaved crawlers which paths are off-limits; ignoring it is technically legal in most places but socially read as bad faith. Many sites’ terms of service explicitly prohibit scraping. The U.S. picture has been shaped by hiQ Labs v. LinkedIn: the Ninth Circuit ruled in 2019, and reaffirmed in 2022, that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act (CFAA), narrowing the statute’s reach. That covers the access question, not the contract or copyright questions: terms-of-service breach and copyright infringement are still live theories of liability. If the data involves people, privacy frameworks (such as GDPR, HIPAA, or PIPEDA) can still apply regardless of how the data was acquired. A scraping project that ignores any of these is fast and cheap up front and expensive at the end.
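Honoring robots.txt is easy to automate. The sketch below uses the standard library's urllib.robotparser on an inline rules string so that it runs without network access. The rules, the user-agent name, and the example.com URLs are made up for illustration, and passing this check says nothing about terms of service, copyright, or privacy obligations.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would download the site's own file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "notes-example-bot"  # hypothetical crawler name


def build_policy(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


policy = build_policy(SAMPLE_ROBOTS)


def allowed(url):
    """True if the parsed rules permit USER_AGENT to fetch this URL."""
    return policy.can_fetch(USER_AGENT, url)


for url in (
    "https://example.com/public/page.html",
    "https://example.com/private/data.html",
):
    print(url, "allowed" if allowed(url) else "blocked")

# Seconds the site asks crawlers to wait between requests (None if unspecified).
delay = policy.crawl_delay(USER_AGENT)
print("crawl delay:", delay)
```

Against a live site, RobotFileParser.set_url() followed by read() fetches the file instead of parsing a string; the crawler would then call can_fetch before every request and sleep for the crawl delay between them.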

Idriss Rami — Notes
