This post is all about logging in to phpBB through manually constructed requests. Once you're logged in, scraping the forum is up to you. You can use the usual techniques, but be careful not to crawl the links for logout, report, watch, bookmark, or the user, moderator, and admin control panels, since all of those are activated with plain GET requests.
Your logged-in session is stored in the cookies. I suggest using the cookie-jar pattern rather than manually constructing the cookies for each request.
The user agent also matters. You cannot change the User-Agent header at any point, or you will be served the logged-out version of the page.
These curl commands are demonstrations only; they will not work on their own. Their purpose is to show which URLs, cookies, headers, and body data are necessary. You will want to use your own programming language instead of curl, and you will need to fill in the placeholder data.
Here's the tutorial:
1. GET the login page
(something along these lines) curl -v 'https://example.com/forum/ucp.php?mode=login'
You need to do this to get cookies. Save the returned cookies in the jar.
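For reference, step 1 might look something like the sketch below in Python (standard library only). The URL and user agent are placeholders; the important part is that the cookie jar and the User-Agent header are created once and reused for every later request.

```python
# Sketch of step 1: GET the login page with a cookie jar attached.
# LOGIN_URL and USER_AGENT are placeholders -- fill in your own.
import urllib.request
from http.cookiejar import CookieJar

LOGIN_URL = "https://example.com/forum/ucp.php?mode=login"
USER_AGENT = "my-scraper/1.0"  # must stay identical for every request

# The jar lives on the opener, so Set-Cookie headers from any response
# are stored automatically and replayed on every subsequent request.
jar = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.addheaders = [("User-Agent", USER_AGENT)]

# Uncomment to actually perform the request:
# opener.open(LOGIN_URL)
```

Reuse this same `opener` for steps 2 and 3 so the cookies and user agent stay consistent.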
2. POST the login endpoint
(something along these lines) curl -v 'https://example.com/forum/ucp.php?mode=login' -X POST --data-raw 'username=USERNAME&password=PASSWORD&redirect=.%2Fucp.php%3Fmode%3Dlogin&redirect=index.php&login=Login' -H 'Cookie: THREE_COOKIES' -H 'Content-Type: application/x-www-form-urlencoded'
Save the returned cookies in the jar.
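In Python, the form body from that curl command can be assembled with `urllib.parse.urlencode`. Note that the form submits two `redirect` fields, so the fields must be passed as a list of tuples; a plain dict would collapse the duplicate key. A sketch:

```python
from urllib.parse import urlencode

def build_login_body(username: str, password: str) -> bytes:
    # The phpBB login form submits "redirect" twice; a list of tuples
    # preserves the duplicate key, which a plain dict cannot.
    fields = [
        ("username", username),
        ("password", password),
        ("redirect", "./ucp.php?mode=login"),
        ("redirect", "index.php"),
        ("login", "Login"),
    ]
    # urlencode percent-encodes "/", "?" and "=", producing exactly the
    # --data-raw string shown above.
    return urlencode(fields).encode("ascii")
```

POST this body with `Content-Type: application/x-www-form-urlencoded`; if you reuse the opener from step 1, the step-1 cookies are sent automatically and the new ones are captured in the jar.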
(Note: If your language's cookie jar library is properly programmed, the previously stored cookies with the same name will be overwritten with these new ones. If the cookie jar remembers all versions of each cookie, just make sure the constructed cookie header has the NEW values at the START. For example,
Cookie: foobar_cookies_u=34567; foobar_cookies_u=1; foobar_etc_and_so_on= will work. Hopefully you shouldn't have to worry about this at all!!)
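If you do end up in that fallback case, a helper along these lines can build the header with the newest values first. It is hypothetical and assumes you can get the stored cookies out of your jar as (name, value) pairs, oldest first:

```python
def cookie_header_newest_first(pairs):
    # `pairs` holds (name, value) tuples in the order they were stored,
    # oldest first.  Reversing the list puts the most recently received
    # value of each cookie name at the front of the header, which is
    # the occurrence the server honours.
    return "; ".join(f"{name}={value}" for name, value in reversed(pairs))
```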
3. GET the main page
curl -v 'https://example.com/forum/' -H 'Cookie: THREE_COOKIES' > am_i_logged_in.html
Open up the page in your web browser and see whether it looks logged in.
If the page doesn't look logged in, the login didn't work. You can't get error messages out of phpBB, so you'll have to keep examining your process and trying things until it finally works.
Print out the raw HTTP request you sent in step 3. Does the user agent look right? Do the cookies look right? There should be 3 cookies sent here:
foobar_cookies_sid=hexadecimal; foobar_cookies_k=; foobar_cookies_u=34567. If the cookies are wrong here, print out the HTTP request you sent and the response you received in step 2 and examine those.
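One way to sanity-check the jar from code is a small helper like this (the `foobar_cookies` prefix is an assumption for illustration; use whatever cookie name prefix your board is configured with):

```python
def has_phpbb_session_cookies(cookie_names, prefix="foobar_cookies"):
    # After a successful step 2 the jar should contain the three phpBB
    # cookies: <prefix>_sid, <prefix>_k and <prefix>_u.
    expected = {f"{prefix}_{suffix}" for suffix in ("sid", "k", "u")}
    return expected.issubset(set(cookie_names))
```

With an `http.cookiejar` jar you would call it as `has_phpbb_session_cookies(c.name for c in jar)`.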
Another idea for debugging is to try sending this request with curl to the forum's main page, and tweaking it until the message stops saying "Login" (this means you haven't succeeded and aren't logged in) and changes to "Logout" (this means you succeeded and are logged in):
curl https://example.com/forum/ -H 'User-Agent: USER_AGENT' -H 'Cookie: THREE_COOKIES' | grep Log -m 1
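That grep check is easy to automate once you fetch the page from your own code. A rough sketch, using plain string matching just like the grep above (crude, but adequate for a pass/fail signal):

```python
def login_state(html: str) -> str:
    # phpBB's header shows "Logout [ username ]" when a session is
    # active and a "Login" link otherwise, so check for "Logout" first.
    if "Logout" in html:
        return "logged in"
    if "Login" in html:
        return "logged out"
    return "unknown"
```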
- The user agent must match the one that was used for the login process. If the User-Agent header changes, you will receive the logged-out version of the page.
- Remember to keep adding the cookies to all subsequent requests.
- If, instead of logging in through program code, you decide to export an existing logged-in session from your browser via a cookies.txt exporter, the file may have
#HttpOnly_ markers at the start of some lines. You must remove those markers (keeping the rest of each line intact) before feeding the file into your scraping program. Remember to set your scraping program's user agent to the same user agent as the logged-in browser.
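Stripping those markers takes only a few lines. This sketch removes the `#HttpOnly_` prefix and keeps the rest of each line, since the prefix is immediately followed by the cookie's domain, which must be preserved:

```python
HTTPONLY_PREFIX = "#HttpOnly_"

def strip_httponly_markers(cookies_txt: str) -> str:
    # Many cookies.txt parsers treat any line starting with "#" as a
    # comment, so HttpOnly cookies would be silently dropped.  Removing
    # the marker leaves a normal, valid Netscape-format line.
    out = []
    for line in cookies_txt.splitlines():
        if line.startswith(HTTPONLY_PREFIX):
            line = line[len(HTTPONLY_PREFIX):]
        out.append(line)
    return "\n".join(out)
```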
Please send any feedback, refinements and corrections to my email with a specific subject line so that I can improve this page!