Any way to download the whole contents of one forum thread in one go?

33772 cr points
39 / Inside your compu...
Posted 9/20/19 , edited 9/20/19
Old, inactive CR forum threads get locked and could be deleted later. Even if a thread isn't deleted, I can't count on any website (Crunchyroll included) being around forever.

Any way to download the whole contents of one forum thread in one go? I know there's the Internet Archive (which MIGHT stay up forever, but who knows), but I don't think they go that deep into every site on the net.

Edit: I see that the Internet Archive DOES go that deep. However, it only archives a few pages of a thread. So if there are maybe 50 pages in a thread, it picks up the first page and then randomly does maybe 4 more pages if you're lucky.
34793 cr points
22 / M / Inside Thanos's anus
Posted 9/20/19 , edited 9/21/19
Unlikely. The best way I can think of is a program that basically screen-caps each forum post and page, then organises the captures into a single file, repeating for every post.

Granted, you could save the HTML of the website as it is now, but I'm not 100% sure that works with Crunchyroll the way it does with some sites.

Any way of doing it is going to be a pain in the ass though.
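
Something along these lines would be the rough idea: a sketch using selenium plus Pillow, with a placeholder URL and page count. Note that save_screenshot only captures the visible viewport, so grabbing whole long pages would take more work, and you'd need geckodriver installed for Firefox.

#!/usr/bin/python3
# Rough sketch of the screen-cap idea: grab each thread page with selenium
# and stitch the captures into a single PDF with Pillow.
# The URL and page count below are placeholders.
from selenium import webdriver
from PIL import Image

url = 'https://www.crunchyroll.com/forumtopic-1052581/what-does-friendship-mean-to-you?pg='
pages = 3

driver = webdriver.Firefox()
shots = []
for i in range(pages):
    driver.get(url + str(i))
    path = 'page_{}.png'.format(i)
    driver.save_screenshot(path)  # captures the visible viewport only
    shots.append(Image.open(path).convert('RGB'))
driver.quit()

# Combine every capture into one PDF
shots[0].save('thread.pdf', save_all=True, append_images=shots[1:])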
14933 cr points
51 / M / New England, USA
Posted 9/21/19 , edited 9/21/19

nanikore2 wrote:

Old, inactive CR forum threads get locked and could be deleted later. Even if a thread isn't deleted, I can't count on any website (Crunchyroll included) being around forever.

Any way to download the whole contents of one forum thread in one go? I know there's the Internet Archive (which MIGHT stay up forever, but who knows), but I don't think they go that deep into every site on the net.

Edit: I see that the Internet Archive DOES go that deep. However, it only archives a few pages of a thread. So if there are maybe 50 pages in a thread, it picks up the first page and then randomly does maybe 4 more pages if you're lucky.


I tend to use my browser's "print to file" feature to make each page into its own PDF, then use a merger/combiner to make them into one PDF. I just found a forum post that gives some other options I'm now thinking about trying.

https://www.donationcoder.com/forum/index.php?topic=43576.0
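
For the merge step, a few lines of PyPDF2 would do it. A minimal sketch, assuming the per-page PDFs were saved as page_0.pdf, page_1.pdf, and so on (those filenames and the page count are placeholders):

#!/usr/bin/python3
# Minimal sketch of the merge step with PyPDF2 (pip install PyPDF2).
# Assumes each thread page was already printed to page_0.pdf, page_1.pdf, ...
from PyPDF2 import PdfFileMerger

pages = 3  # however many per-page PDFs you printed

merger = PdfFileMerger()
for i in range(pages):
    merger.append('page_{}.pdf'.format(i))  # append each page's PDF in order

merger.write('thread_combined.pdf')
merger.close()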
2644 cr points
30 / M / Hampshire
Posted 9/26/19 , edited 9/26/19
Something tells me you'd have better luck looking for a Python script, since the URLs follow a pattern (i.e. ending in pg=0, pg=1, ...).

You could just find a script that saves one page and loop it. You'll need basic Python knowledge or the will to learn.

I imagine such a script would be fairly straightforward. I'm a bit rusty with Python but might give it a go at some point. But realistically speaking, I'll likely never get to it since it's not necessary to me.

Here's an example search you could perform: https://www.google.com/search?client=ubuntu&channel=fs&q=python+script+to+save+web+page&ie=utf-8&oe=utf-8
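
To make that concrete, the loop itself only needs a few lines. A sketch using requests; it just saves the raw HTML of each page (no images or CSS), and the URL and page count are placeholders:

#!/usr/bin/python3
# Sketch of the "loop over pg=" idea: fetch each page and save its raw HTML.
# The URL and page count below are placeholders.
import requests

url = 'https://www.crunchyroll.com/forumtopic-1052581/what-does-friendship-mean-to-you?pg='
pages = 3  # total number of pages in the thread

for i in range(pages):
    resp = requests.get(url + str(i), timeout=30)
    resp.raise_for_status()  # stop if a page fails to load
    with open('page_{}.html'.format(i), 'w', encoding='utf-8') as f:
        f.write(resp.text)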

The reason web archives don't usually save everything, besides storage limits, is that crawling every page drains resources from the website, and that doesn't please the websites, which might end up banning you entirely.
2644 cr points
30 / M / Hampshire
Posted 9/26/19 , edited 9/26/19
So I got bored and decided to make you the script. Or rather, I modified one I found, but I lost the URL to the original. You'll need to modify:

- "url" to the one you need. Make sure it's in the same format ending in "pg="

- "download_folder" to your preferred location

- "pages" to the total number of pages in the thread

- "project name" to whatever, avoid special characters

Fair warning: at the end this will open every single page in a new tab. I can't be arsed fixing that as I've got to work soon. Once it's done you can navigate the saved pages offline from a single tab.

Another fair warning: in this example each page comes to about 10 MB, since you're saving the whole webpage after all. So think carefully before you go downloading a 3000-page thread, as that would be 30 GB and you'll almost certainly get yourself in trouble. Check the T&C of the website and do not breach them.

Actual code:

#!/usr/bin/python3

# Saves every page of the thread (HTML plus assets) with pywebcopy.
from pywebcopy import save_webpage

url = 'https://www.crunchyroll.com/forumtopic-1052581/what-does-friendship-mean-to-you?pg='
download_folder = '/home/p/cr_files/'
pages = 3  # total number of pages in the thread

kwargs = {'bypass_robots': True, 'project_name': 'friendship'}

for i in range(0, pages):
    temp_url = url + str(i)
    save_webpage(temp_url, download_folder, **kwargs)
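
If anyone reuses this, a slightly gentler variant would pause between pages and keep going when one page fails. Same placeholders as above, and just a sketch: it assumes save_webpage behaves as in the script here.

#!/usr/bin/python3
# Variant of the same loop with a polite delay and basic error handling.
import time

from pywebcopy import save_webpage

url = 'https://www.crunchyroll.com/forumtopic-1052581/what-does-friendship-mean-to-you?pg='
download_folder = '/home/p/cr_files/'
pages = 3
kwargs = {'bypass_robots': True, 'project_name': 'friendship'}

for i in range(pages):
    try:
        save_webpage(url + str(i), download_folder, **kwargs)
    except Exception as exc:  # carry on if one page fails
        print('page {} failed: {}'.format(i, exc))
    time.sleep(5)  # pause between pages so the crawl stays gentle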
33772 cr points
39 / Inside your compu...
Posted 10/8/19 , edited 10/9/19

cosninety wrote:

So I got bored and decided to make you the script. Or rather, I modified one I found, but I lost the URL to the original. You'll need to modify:

- "url" to the one you need. Make sure it's in the same format ending in "pg="

- "download_folder" to your preferred location

- "pages" to the total number of pages in the thread

- "project name" to whatever, avoid special characters

Fair warning: at the end this will open every single page in a new tab. I can't be arsed fixing that as I've got to work soon. Once it's done you can navigate the saved pages offline from a single tab.

Another fair warning: in this example each page comes to about 10 MB, since you're saving the whole webpage after all. So think carefully before you go downloading a 3000-page thread, as that would be 30 GB and you'll almost certainly get yourself in trouble. Check the T&C of the website and do not breach them.

Actual code:

#!/usr/bin/python3

# Saves every page of the thread (HTML plus assets) with pywebcopy.
from pywebcopy import save_webpage

url = 'https://www.crunchyroll.com/forumtopic-1052581/what-does-friendship-mean-to-you?pg='
download_folder = '/home/p/cr_files/'
pages = 3  # total number of pages in the thread

kwargs = {'bypass_robots': True, 'project_name': 'friendship'}

for i in range(0, pages):
    temp_url = url + str(i)
    save_webpage(temp_url, download_folder, **kwargs)


Oh boy. It'd be really dumb if I get banned just for downloading a thread.

Looks like I just have to save one page as a document at a time. If I just save 10 pages a day, I can do a 100-page thread in 10 days...

Ugh, I'm getting tired just thinking about it, but oh well, that's life.
33772 cr points
39 / Inside your compu...
Posted 10/8/19 , edited 10/9/19

neugenx wrote:


nanikore2 wrote:

Old, inactive CR forum threads get locked and could be deleted later. Even if a thread isn't deleted, I can't count on any website (Crunchyroll included) being around forever.

Any way to download the whole contents of one forum thread in one go? I know there's the Internet Archive (which MIGHT stay up forever, but who knows), but I don't think they go that deep into every site on the net.

Edit: I see that the Internet Archive DOES go that deep. However, it only archives a few pages of a thread. So if there are maybe 50 pages in a thread, it picks up the first page and then randomly does maybe 4 more pages if you're lucky.


I tend to use my browser's "print to file" feature to make each page into its own PDF, then use a merger/combiner to make them into one PDF. I just found a forum post that gives some other options I'm now thinking about trying.

https://www.donationcoder.com/forum/index.php?topic=43576.0


HOLY TOLEDO THIS GUY'S A CODING GENIUS

https://www.printwhatyoulike.com/pagezipper

I saved a big thread into one 3MB txt file and an 8MB html

Sure everything's one big &%$# mess but HEY BETTER THAN NOTHING

Thanks for the tip man
14933 cr points
51 / M / New England, USA
Posted 10/9/19 , edited 10/9/19

nanikore2 wrote:


neugenx wrote:


nanikore2 wrote:

Old, inactive CR forum threads get locked and could be deleted later. Even if a thread isn't deleted, I can't count on any website (Crunchyroll included) being around forever.

Any way to download the whole contents of one forum thread in one go? I know there's the Internet Archive (which MIGHT stay up forever, but who knows), but I don't think they go that deep into every site on the net.

Edit: I see that the Internet Archive DOES go that deep. However, it only archives a few pages of a thread. So if there are maybe 50 pages in a thread, it picks up the first page and then randomly does maybe 4 more pages if you're lucky.


I tend to use my browser's "print to file" feature to make each page into its own PDF, then use a merger/combiner to make them into one PDF. I just found a forum post that gives some other options I'm now thinking about trying.

https://www.donationcoder.com/forum/index.php?topic=43576.0


HOLY TOLEDO THIS GUY'S A CODING GENIUS

https://www.printwhatyoulike.com/pagezipper

I saved a big thread into one 3MB txt file and an 8MB html

Sure everything's one big &%$# mess but HEY BETTER THAN NOTHING

Thanks for the tip man


Yvw, I'm glad I could help.