cewl: Wordlist Builder from Web Content


cewl is a lightweight tool for generating wordlists by crawling web content. It’s handy when you’re preparing a password audit or need target-specific brute-force vocabularies. This guide focuses on pragmatic usage for beginners and intermediate users.

Quickstart: practical examples

Start with a simple crawl to a shallow depth and save the words to a file:

cewl --depth 2 --write /path/to/wordlist.txt http://example.com

Tip: Use a small depth first to get a feel for the site structure and word quality. Increase depth only when you need more words and you’re sure you won’t overwhelm the server.

If you want words that include numbers and are at least 5 characters long, combine --with-numbers and --min_word_length:

cewl --with-numbers --min_word_length 5 http://example.com

For debugging or to see where words come from, run in debug mode:

cewl --debug http://example.com

Pro tip: Debug mode can help identify noisy sources or malformed pages that produce junk words.

Advanced usage: auth and proxies

If the target site requires HTTP Basic or Digest authentication, provide credentials:

cewl --auth_type basic --auth_user USER --auth_pass PASS http://protected.example.com

If you’re behind a proxy, pass proxy settings so that cewl can fetch the page:

cewl --proxy_host 127.0.0.1 --proxy_port 8080 http://example.com

Common pitfalls and how to avoid them

  • Overloading a site: Keep depth modest and respect robots.txt and site terms. Use polite delays if you’re scraping a lot.
  • Noise in wordlists: Debug mode helps, but you may still collect boilerplate terms (Home, About, Contact). Consider post-filtering or offline deduplication.
  • Large sites: A deep crawl on a big domain can take a while and consume bandwidth. Prefer targeted subpages or seed URLs.
  • Authentication quirks: Some sites require cookies or multiple auth steps; if basic/digest fails, check headers and consider headless browser scraping (out of scope for cewl).
  • Legal and ethical use: Only crawl sites you own or have permission to test. This tool should be used responsibly.
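The post-filtering mentioned above can be done with standard Unix tools. A minimal sketch, assuming a raw cewl output file and a hand-made blocklist of boilerplate terms (both filenames are illustrative):

```shell
# Illustrative files: raw_words.txt stands in for cewl output,
# boilerplate.txt holds navigation terms you want to drop.
printf 'Home\nAbout\nContact\nacme\nwidgets\n' > raw_words.txt
printf 'Home\nAbout\nContact\n' > boilerplate.txt

# grep flags: -v invert match, -i case-insensitive, -x whole-line match,
# -F fixed strings, -f read patterns from a file; sort -u deduplicates.
grep -vixFf boilerplate.txt raw_words.txt | sort -u > filtered.txt
cat filtered.txt
```

In this toy example only "acme" and "widgets" survive; against real cewl output you would build the blocklist iteratively as you spot junk words.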

Practical tips

  • Start small, iterate: test with a single page, then widen the crawl.
  • Combine with other wordlist sources: Merge cewl output with wordlists you already have for better coverage.
  • Normalize words: Consider filtering out common words or converting to lowercase to improve password-cracking efficiency.
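Merging, lowercasing, and length filtering can all be scripted with standard tools. A minimal sketch (the filenames are illustrative, and the 4-character minimum is an arbitrary choice):

```shell
# merged.txt stands in for cewl output combined with other wordlists.
printf 'Widget\nwidget\nGadget\nio\n' > merged.txt

# Lowercase everything, drop words shorter than 4 characters, deduplicate.
tr '[:upper:]' '[:lower:]' < merged.txt \
  | awk 'length($0) >= 4' \
  | sort -u > normalized.txt
cat normalized.txt
```

Note that lowercasing before `sort -u` also merges case variants ("Widget" and "widget" become one entry), which shrinks the list without losing coverage for case-insensitive cracking modes.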

Quick reference: option highlights

  • --depth or -d: Set crawl depth (default 2). Example: --depth 2
  • --write or -w: Write results to a file. Example: --write /path/words.txt
  • --with-numbers: Accept words containing numbers as well as letters; takes no argument. Example: --with-numbers
  • --min_word_length or -m: Minimum word length (default 3). Example: --min_word_length 5
  • --debug: Enable extra debugging output. Example: --debug
  • --auth_type: Authentication method (basic|digest)
  • --auth_user/--auth_pass: Credentials for the site
  • --proxy_host/--proxy_port: Proxy configuration for requests

Putting it together: a small, sensible workflow

  1. Crawl a site with a modest depth and write to a file:
cewl --depth 2 --write my_words.txt https://example.org
  2. If you need numbers in the words:
cewl --with-numbers --min_word_length 5 https://example.org
  3. Access a protected site through a proxy:
cewl --proxy_host 127.0.0.1 --proxy_port 3128 --auth_type basic --auth_user user --auth_pass pass https://private.example.org

Conclusion

cewl is a focused tool for turning web content into practical wordlists. Start with simple crawls, add features as needed, and remember to respect site policies and legal boundaries.