Crawling
Introduction
Crawling, often called spidering, is the automated process of systematically browsing the World Wide Web.
robots.txt
Technically, robots.txt is a simple text file placed in the root directory of a website (e.g., www.example.com/robots.txt). It adheres to the Robots Exclusion Standard, a set of guidelines that define how web crawlers should behave when visiting a website. This file contains instructions in the form of "directives" that tell bots which parts of the website they can and cannot crawl.
A directive might look like this:
User-agent: *
Disallow: /private/
This directive tells all user-agents (* is a wildcard matching any crawler or bot) that they are not allowed to access any URL starting with /private/.
Some common directives include:
Disallow: Specifies paths or patterns that the bot should not crawl. E.g.
Disallow: /admin/
Allow: Explicitly permits the bot to crawl specific paths or patterns, even if they fall under a broader Disallow rule. E.g.
Allow: /public/
Crawl-delay: Sets a delay (in seconds) between successive requests from the bot to avoid overloading the server. E.g.
Crawl-delay: 10
Sitemap: Provides the URL to an XML sitemap for more efficient crawling. E.g.
Sitemap: https://www.example.com/sitemap.xml
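To see how these directives are interpreted in practice, here is a minimal Python sketch using the standard library's urllib.robotparser; the target URL, paths, and expected results are placeholders, not a real site:
from urllib.robotparser import RobotFileParser

# Hypothetical target; replace with the site being examined
parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()  # fetch and parse the robots.txt file

# Ask whether the wildcard user-agent may crawl specific paths
print(parser.can_fetch("*", "https://www.example.com/private/page"))  # False if /private/ is disallowed
print(parser.can_fetch("*", "https://www.example.com/public/page"))   # True if /public/ is allowed

# Crawl-delay declared for the wildcard user-agent, if any
print(parser.crawl_delay("*"))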
Web Recon for robots.txt
robots.txt can be useful for:
Uncovering hidden directories
Mapping website structure
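As a quick reconnaissance helper, the following sketch (standard library only; the target URL is a placeholder) pulls every Disallow entry out of robots.txt as a list of candidate paths to investigate:
import urllib.request

# Hypothetical target; replace with the site being examined
url = "https://www.example.com/robots.txt"

with urllib.request.urlopen(url) as response:
    body = response.read().decode("utf-8", errors="replace")

# Each Disallow entry is a path the site owner did not want crawled -
# these are often worth a closer look during reconnaissance
hidden_paths = [
    line.split(":", 1)[1].strip()
    for line in body.splitlines()
    if line.strip().lower().startswith("disallow:")
]
for path in hidden_paths:
    print(path)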
Well-Known URIs
The .well-known standard, defined in RFC 8615, serves as a standardized directory within a website's root domain. This designated location, typically accessible via the /.well-known/ path on a web server, centralizes a website's critical metadata, including configuration files and information related to its services, protocols, and security mechanisms.
For instance, to access a website's security policy, a client would request https://example.com/.well-known/security.txt.
Some notable examples are:
security.txt: Contains contact information for security researchers to report vulnerabilities
/.well-known/change-password: Provides a standard URL for directing users to a password change page
openid-configuration: Defines configuration details for OpenID Connect, an identity layer on top of the OAuth 2.0 protocol
assetlinks.json: Used for verifying ownership of digital assets (e.g., apps) associated with a domain
mta-sts.txt: Specifies the policy for SMTP MTA Strict Transport Security (MTA-STS) to enhance email security
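A small sketch that probes these common .well-known paths and reports which ones respond; the host is a placeholder and the path list is illustrative, not exhaustive:
import urllib.error
import urllib.request

host = "https://example.com"  # hypothetical host
paths = [
    "/.well-known/security.txt",
    "/.well-known/change-password",
    "/.well-known/openid-configuration",
    "/.well-known/assetlinks.json",
    "/.well-known/mta-sts.txt",
]

for path in paths:
    try:
        with urllib.request.urlopen(host + path, timeout=5) as response:
            print(f"{path} -> {response.status}")
    except urllib.error.HTTPError as error:
        print(f"{path} -> {error.code}")       # path exists but returned an error status
    except urllib.error.URLError as error:
        print(f"{path} -> unreachable ({error.reason})")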
The openid-configuration URI is part of the OpenID Connect Discovery protocol, an identity layer built on top of the OAuth 2.0 protocol. This endpoint returns a JSON document containing metadata about the provider's endpoints, supported authentication methods, token issuance, and more:
{
"issuer": "https://example.com",
"authorization_endpoint": "https://example.com/oauth2/authorize",
"token_endpoint": "https://example.com/oauth2/token",
"userinfo_endpoint": "https://example.com/oauth2/userinfo",
"jwks_uri": "https://example.com/oauth2/jwks",
"response_types_supported": ["code", "token", "id_token"],
"subject_types_supported": ["public"],
"id_token_signing_alg_values_supported": ["RS256"],
"scopes_supported": ["openid", "profile", "email"]
}
From this document, we can extract information such as:
Endpoint discovery
JWKS URI (JSON Web Key Set)
Algorithm details
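For instance, a short sketch (standard library only; the issuer URL is a placeholder) that fetches the discovery document and prints these fields:
import json
import urllib.request

issuer = "https://example.com"  # hypothetical issuer
with urllib.request.urlopen(issuer + "/.well-known/openid-configuration") as response:
    config = json.load(response)

# Endpoint discovery
print("authorization_endpoint:", config.get("authorization_endpoint"))
print("token_endpoint:", config.get("token_endpoint"))
print("userinfo_endpoint:", config.get("userinfo_endpoint"))

# JWKS URI and algorithm details
print("jwks_uri:", config.get("jwks_uri"))
print("id_token_signing_alg_values_supported:", config.get("id_token_signing_alg_values_supported"))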
Popular Web Crawlers
Burp Suite Spider
OWASP ZAP
Scrapy (Python Framework)
Apache Nutch (Scalable Crawler)
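As a concrete example of the last category, here is a minimal Scrapy spider sketch (assuming Scrapy is installed; the domain, start URL, and class name are placeholders) that follows in-scope links and records every URL it discovers:
import scrapy


class MappingSpider(scrapy.Spider):
    # Hypothetical spider that maps a site by following every in-scope link
    name = "mapping"
    allowed_domains = ["example.com"]            # placeholder scope
    start_urls = ["https://www.example.com/"]    # placeholder entry point
    custom_settings = {"ROBOTSTXT_OBEY": True}   # honour robots.txt directives

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.urljoin(href)}              # record the discovered URL
            yield response.follow(href, callback=self.parse)   # keep crawling from it
It can be run without a full project via scrapy runspider, e.g. scrapy runspider mapping_spider.py -o urls.json (the filenames here are arbitrary).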