Crawling
Introduction
Crawling, often called spidering, is the automated process of systematically browsing the World Wide Web.
robots.txt
Technically, robots.txt is a simple text file placed in the root directory of a website (e.g., www.example.com/robots.txt). It adheres to the Robots Exclusion Standard, a set of guidelines that define how web crawlers should behave when visiting a website. This file contains instructions in the form of "directives" that tell bots which parts of the website they can and cannot crawl.
A directive might look like this:
User-agent: *
Disallow: /private/
This directive tells all user-agents (* is a wildcard matching any crawler or bot) that they are not allowed to access any URL starting with /private/.
Some common directives include:
Disallow: Specifies paths or patterns that the bot should not crawl. E.g.
Disallow: /admin/
Allow: Explicitly permits the bot to crawl specific paths or patterns, even if they fall under a broader Disallow rule. E.g.
Allow: /public/
Crawl-delay: Sets a delay (in seconds) between successive requests from the bot to avoid overloading the server. E.g.
Crawl-delay: 10
Sitemap: Provides the URL to an XML sitemap for more efficient crawling. E.g.
Sitemap: https://www.example.com/sitemap.xml
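To see how these directives are interpreted in practice, here is a minimal Python sketch using the standard library's urllib.robotparser; the target URL, paths, and expected results are placeholders, not a real site:
from urllib.robotparser import RobotFileParser

# Hypothetical target; replace with the site being examined
parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()  # fetch and parse the robots.txt file

# Ask whether the wildcard user-agent may crawl specific paths
print(parser.can_fetch("*", "https://www.example.com/private/page"))  # False if /private/ is disallowed
print(parser.can_fetch("*", "https://www.example.com/public/page"))   # True if /public/ is allowed

# Crawl-delay declared for the wildcard user-agent, if any
print(parser.crawl_delay("*"))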
Web Recon for robots.txt
robots.txt can be useful for:
Uncovering hidden directories
Mapping website structure
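As a quick reconnaissance helper, the following sketch (standard library only; the target URL is a placeholder) pulls every Disallow entry out of robots.txt as a list of candidate paths to investigate:
import urllib.request

# Hypothetical target; replace with the site being examined
url = "https://www.example.com/robots.txt"

with urllib.request.urlopen(url) as response:
    body = response.read().decode("utf-8", errors="replace")

# Each Disallow entry is a path the site owner did not want crawled -
# these are often worth a closer look during reconnaissance
hidden_paths = [
    line.split(":", 1)[1].strip()
    for line in body.splitlines()
    if line.strip().lower().startswith("disallow:")
]
for path in hidden_paths:
    print(path)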
Well-Known URIs
The .well-known standard, defined in RFC 8615, serves as a standardized directory within a website's root domain. This designated location, typically accessible via the /.well-known/ path on a web server, centralizes a website's critical metadata, including configuration files and information related to its services, protocols, and security mechanisms.
For instance, to access a website's security policy, a client would request https://example.com/.well-known/security.txt.
Some notable examples are:
security.txt: Contains contact information for security researchers to report vulnerabilities
/.well-known/change-password: Provides a standard URL for directing users to a password change page
openid-configuration: Defines configuration details for OpenID Connect, an identity layer on top of the OAuth 2.0 protocol
assetlinks.json: Used for verifying ownership of digital assets (e.g., apps) associated with a domain
mta-sts.txt: Specifies the policy for SMTP MTA Strict Transport Security (MTA-STS) to enhance email security
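A small sketch that probes these common .well-known paths and reports which ones respond; the host is a placeholder and the path list is illustrative, not exhaustive:
import urllib.error
import urllib.request

host = "https://example.com"  # hypothetical host
paths = [
    "/.well-known/security.txt",
    "/.well-known/change-password",
    "/.well-known/openid-configuration",
    "/.well-known/assetlinks.json",
    "/.well-known/mta-sts.txt",
]

for path in paths:
    try:
        with urllib.request.urlopen(host + path, timeout=5) as response:
            print(f"{path} -> {response.status}")
    except urllib.error.HTTPError as error:
        print(f"{path} -> {error.code}")       # path exists but returned an error status
    except urllib.error.URLError as error:
        print(f"{path} -> unreachable ({error.reason})")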
The openid-configuration URI is part of the OpenID Connect Discovery protocol, an identity layer built on top of the OAuth 2.0 protocol. This endpoint returns a JSON document containing metadata about the provider's endpoints, supported authentication methods, token issuance, and more:
{
"issuer": "https://example.com",
"authorization_endpoint": "https://example.com/oauth2/authorize",
"token_endpoint": "https://example.com/oauth2/token",
"userinfo_endpoint": "https://example.com/oauth2/userinfo",
"jwks_uri": "https://example.com/oauth2/jwks",
"response_types_supported": ["code", "token", "id_token"],
"subject_types_supported": ["public"],
"id_token_signing_alg_values_supported": ["RS256"],
"scopes_supported": ["openid", "profile", "email"]
}
From this document, we can extract information such as:
Endpoint discovery
JWKS URI (JSON Web Key Set)
Algorithm details
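For instance, a short sketch (standard library only; the issuer URL is a placeholder) that fetches the discovery document and prints these fields:
import json
import urllib.request

issuer = "https://example.com"  # hypothetical issuer
with urllib.request.urlopen(issuer + "/.well-known/openid-configuration") as response:
    config = json.load(response)

# Endpoint discovery
print("authorization_endpoint:", config.get("authorization_endpoint"))
print("token_endpoint:", config.get("token_endpoint"))
print("userinfo_endpoint:", config.get("userinfo_endpoint"))

# JWKS URI and algorithm details
print("jwks_uri:", config.get("jwks_uri"))
print("id_token_signing_alg_values_supported:", config.get("id_token_signing_alg_values_supported"))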
Popular Web Crawlers
Burp Suite Spider
OWASP ZAP
Scrapy (Python Framework)
Apache Nutch (Scalable Crawler)
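As a concrete example of the last category, here is a minimal Scrapy spider sketch (assuming Scrapy is installed; the domain, start URL, and class name are placeholders) that follows in-scope links and records every URL it discovers:
import scrapy


class MappingSpider(scrapy.Spider):
    # Hypothetical spider that maps a site by following every in-scope link
    name = "mapping"
    allowed_domains = ["example.com"]            # placeholder scope
    start_urls = ["https://www.example.com/"]    # placeholder entry point
    custom_settings = {"ROBOTSTXT_OBEY": True}   # honour robots.txt directives

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.urljoin(href)}              # record the discovered URL
            yield response.follow(href, callback=self.parse)   # keep crawling from it
It can be run without a full project via scrapy runspider, e.g. scrapy runspider mapping_spider.py -o urls.json (the filenames here are arbitrary).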