Crawling

Introduction

Crawling, often called spidering, is the automated process of systematically browsing the World Wide Web.

robots.txt

Technically, robots.txt is a simple text file placed in the root directory of a website (e.g., www.example.com/robots.txt). It adheres to the Robots Exclusion Standard, a set of guidelines for how web crawlers should behave when visiting a website. This file contains instructions in the form of "directives" that tell bots which parts of the website they can and cannot crawl.

A directive might look like this:

User-agent: *
Disallow: /private/

This directive tells all user-agents (the * wildcard matches any crawler or bot) that they are not allowed to access any URLs starting with /private/.

Some common directives include (see the sketch after this list):

  • Disallow : Specifies paths or patterns that the bot should not crawl, e.g. Disallow: /admin/

  • Allow : Explicitly permits the bot to crawl specific paths or patterns, even if they fall under a broader Disallow rule, e.g. Allow: /public/

  • Crawl-delay : Sets a delay (in seconds) between successive requests from the bot to avoid overloading the server, e.g. Crawl-delay: 10

  • Sitemap : Provides the URL to an XML sitemap for more efficient crawling, e.g. Sitemap: https://www.example.com/sitemap.xml
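
To see how a crawler interprets these directives in practice, here is a minimal sketch using Python's built-in urllib.robotparser; www.example.com and the sample path are placeholders:

import urllib.robotparser

# Point the parser at the target's robots.txt (placeholder host)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Check whether a generic crawler (*) may fetch a given URL
print(rp.can_fetch("*", "https://www.example.com/private/secret.html"))

# Crawl-delay for the wildcard user-agent, or None if not set
print(rp.crawl_delay("*"))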

Web Recon for robots.txt

For web reconnaissance, robots.txt can be useful for (see the sketch below):

  • Uncovering hidden directories : Disallowed paths often point to areas the site owner does not want indexed, such as admin panels or staging content

  • Mapping website structure : the listed paths give an outline of how the site is organized
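
During reconnaissance, a quick way to harvest these paths is to download robots.txt and list every Disallow entry as a candidate for further enumeration. A minimal sketch, assuming www.example.com as a placeholder target:

import urllib.request

# Fetch the target's robots.txt (placeholder host)
url = "https://www.example.com/robots.txt"
with urllib.request.urlopen(url) as response:
    body = response.read().decode("utf-8", errors="replace")

# Collect every Disallowed path for later enumeration
disallowed = [
    line.split(":", 1)[1].strip()
    for line in body.splitlines()
    if line.lower().startswith("disallow:")
]

for path in disallowed:
    print(path)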

Well-Known URIs

The .well-known standard, defined in RFC 8615, serves as a standardized directory within a website's root domain. This designated location, typically accessible via the /.well-known/ path on a web server, centralizes a website's critical metadata, including configuration files and information related to its services, protocols, and security mechanisms.

For instance, to access a website's security policy, a client would request https://example.com/.well-known/security.txt.

Some notable examples (probed in the sketch after this list) are:

  • security.txt : Contains contact information for security researchers to report vulnerabilities

  • /.well-known/change-password : Provides a standard URL for directing users to a password change page

  • openid-configuration : Defines configuration details for OpenID Connect, an identity layer on top of the OAuth 2.0 protocol

  • assetlinks.json : Used for verifying ownership of digital assets (e.g., apps) associated with a domain

  • mta-sts.txt : Specifies the policy for SMTP MTA Strict Transport Security (MTA-STS) to enhance email security
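
A simple way to leverage this during reconnaissance is to request each of these well-known paths and note which ones respond. A minimal sketch, assuming the target host is passed on the command line:

import sys
import urllib.request
import urllib.error

host = sys.argv[1]  # e.g. example.com

# Well-known paths taken from the examples above
paths = [
    "/.well-known/security.txt",
    "/.well-known/change-password",
    "/.well-known/openid-configuration",
    "/.well-known/assetlinks.json",
    "/.well-known/mta-sts.txt",
]

for path in paths:
    url = f"https://{host}{path}"
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            print(f"{response.status}  {url}")
    except urllib.error.HTTPError as err:
        print(f"{err.code}  {url}")
    except urllib.error.URLError:
        print(f"ERR  {url}")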

The openid-configuration URI, served at /.well-known/openid-configuration, is part of the OpenID Connect Discovery protocol. This endpoint returns a JSON document containing metadata about the provider's endpoints, supported authentication methods, token issuance, and more:

{
  "issuer": "https://example.com",
  "authorization_endpoint": "https://example.com/oauth2/authorize",
  "token_endpoint": "https://example.com/oauth2/token",
  "userinfo_endpoint": "https://example.com/oauth2/userinfo",
  "jwks_uri": "https://example.com/oauth2/jwks",
  "response_types_supported": ["code", "token", "id_token"],
  "subject_types_supported": ["public"],
  "id_token_signing_alg_values_supported": ["RS256"],
  "scopes_supported": ["openid", "profile", "email"]
}

From this document we can extract information such as (see the sketch below):

  • Endpoint discovery : the authorization, token, and userinfo endpoints reveal additional attack surface

  • JWKS URI (JWKS - JSON Web Key Set) : the location of the public keys used to verify token signatures

  • Algorithm details : the supported signing algorithms and scopes
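
A minimal sketch of pulling these details, assuming https://example.com as a placeholder issuer:

import json
import urllib.request

# Placeholder issuer; replace with the target's domain
url = "https://example.com/.well-known/openid-configuration"

with urllib.request.urlopen(url) as response:
    config = json.load(response)

# Print the fields most useful for reconnaissance
for key in ("authorization_endpoint", "token_endpoint",
            "userinfo_endpoint", "jwks_uri",
            "id_token_signing_alg_values_supported"):
    print(f"{key}: {config.get(key)}")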
