> Uncensored gallery backup project

 
post Aug 11 2019, 18:44
Post #1
nobodyserio



Lurker
Group: Recruits
Posts: 9
Joined: 20-February 10


After talking a little on the IRC chat, it seems the 50TB site rip won't necessarily be made available, and that it is now the duty of the community to make sure the unique data is preserved.

I know not many of you have 50TB of free space available, but for the few that do (especially the tape users, with archival properties of ~30 years), having a full site rip available to archive could prevent a worst case scenario.

For those of you aware of the tracker scene, think about What.cd or Asiandvdclub, the equivalents of e-hentai for music and Asian DVDs (who would have thought).
Both got taken down, and other trackers tried to replace them, but those are just a shadow of the former glory.

Think about Kyoani, who were only able to recover lost data due to pure luck of the fire not damaging a server placed in the same building.
And many many more examples in this area.

So I ask the community to help me create a backup, since the more backups the better. I am open to ideas, but the simplest scenario I could think of would be the following:

1. A Google Doc page for uncensored e-hentai galleries is created.
2. Users download all galleries from a page.
3. Upload them to mega.
4. Post the link in the Google Doc page.

If your page has already been uploaded, go to the next one and so on.

After most of it is uploaded, the links can be placed in jdownloader or megabasterd. I would then upload it in 1TB chunks to Usenet (just like the console gaming collections done by other sites).

I know there are 27000 pages and this is by no means the ultimate plan, but since this is a community project, we should be able to slowly and steadily back up the whole site and have several backups available in the worst-case scenario.

What do you say?

This post has been edited by nobodyserio: Aug 11 2019, 18:47

 
post Aug 12 2019, 01:18
Post #2
blue penguin



in umbra, igitur, pugnabimus
***********
Group: Gold Star Club
Posts: 10,046
Joined: 24-March 12
Level 500 (Godslayer)


Crawling EH is difficult but not impossible. I strongly believe that the biggest value of EH is its metadata: tags and definitions. Just storing the metadata itself in some reasonable format can be a very valuable task.

 
post Aug 12 2019, 03:12
Post #3
DGze



Headphone Fetishist | Luna's Bride
*******
Group: Catgirl Camarilla
Posts: 1,043
Joined: 12-February 12
Level 358 (Godslayer)


QUOTE(blue penguin @ Aug 11 2019, 19:18) *

Crawling EH is difficult but not impossible. I strongly believe that the biggest value of EH is its metadata: tags and definitions. Just storing the metadata itself in some reasonable format can be a very valuable task.


You should have been there when people were saying "tags are secondary, who gives a shit" in IRC chat.

 
post Aug 12 2019, 04:43
Post #4
Spectre



The Bell Tolls for All.
**********
Group: Global Mods
Posts: 8,640
Joined: 8-February 06
Level 272 (Godslayer)


QUOTE(DGze @ Aug 11 2019, 21:12) *

You should have been there when people were saying "tags are secondary, who gives a shit" in IRC chat.

Poor misguided souls (laugh)

 
post Aug 12 2019, 07:56
Post #5
RandomGuy1412



Newcomer
*
Group: Members
Posts: 29
Joined: 19-September 16
Level 139 (Lord)


QUOTE(DGze @ Aug 11 2019, 18:12) *

You should have been there when people were saying "tags are secondary, who gives a shit" in IRC chat.


Imagine scrolling through EH without any metadata

 
post Aug 12 2019, 14:53
Post #6
nobodyserio



Lurker
Group: Recruits
Posts: 9
Joined: 20-February 10


Thank you for the replies. The metadata wasn't my first concern, but it absolutely should be archived as well.

I also realized a huge flaw in my "simple" idea: the page numbers aren't stable because the page count keeps growing, so using dates would make much more sense.


 
post Aug 13 2019, 12:55
Post #7
derp-z2



Casual Poster
****
Group: Members
Posts: 454
Joined: 17-September 14
Level 300 (Ascended)


Greetings

Again, it's just my opinion, but I would also chime in with my 2 cents. While we can make comparisons to other torrent or private tracker sites, the main comparison I would make is that this site is the equivalent of (or rather congruent to [look up the mathematical term]) sites like [(Emuparadise) && TheISOZone + original Nyaa (sukebei) torrent site] all together.

My reference choice

TheISOZone: the e(x)hentai of all PC games (Mac, Windows, Linux) plus console homebrew, rivaling or even surpassing Emuparadise. It shit its pants and its creator took everything down, while Emuparadise basically went the e-hentai + e(x)hentai route to comply with |\|1|\|T3|\|D00!!!

Nyaa sites: basically what happened to sad panda initially ....

Both of the above never recovered from the loss and are forever gone into the dark abyss, while </salute_start> Tenboro Sir somehow ensured some kind of contingency and prevented the same sadistic fate for the panda. (again ... godspeed you magnificent badass) </salute_end>

Now here comes exactly the project being discussed in this forum topic .... Does anyone remember goodolddownloads and the IGG-GAMES vs GOD doxxing debacle? Well, GOD is back on track with the onion site project, while its parent repository is safe and sound and backed up across different sections of CS.RIN.

I believe the same archiving and mirroring (a hemorrhoid-constipation-level painful, snail-paced task) needs to be completed and kept in the shadows, and only publicly disclosed in seasoned releases on Nyaa trackers in case the administration believes some other big bros are coming knocking.

Well those are my 2 cents to this pot

Godspeed and Peace
[again : Never say goodbye .. only say ... have a safe & great next journey He-man movie quote]

This post has been edited by derp-z2: Aug 13 2019, 13:00

 
post Aug 13 2019, 13:35
Post #8
nobodyserio



Lurker
Group: Recruits
Posts: 9
Joined: 20-February 10


Very good point indeed. Thankfully, console/PC games are easier to maintain thanks to sites like redump and the fact that far fewer games are released in an average month compared to manga, animation, etc.

Just making the database available over Usenet, for example, would be the simplest thing to do, but apparently now is the time for the community to show its sincerity.

So that a second What.cd or Asiandvdclub doesn't happen. And thankfully 50 TB is "little" compared to the potential petabytes of other archives.

In regard to my community effort:

Simply replacing my page idea with one release per day should solve my thinking error.

Something like:

2017-09-11: mega.nz blabla
2017-09-12: mega.nz blablabla
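
A minimal sketch of how such a per-day index could be kept machine-readable, assuming one "date: link" entry per line as in the example above. The function names `parse_index` and `missing_days` are hypothetical, and the links are whatever placeholder text follows the date.

```python
from datetime import date, timedelta


def parse_index(text):
    """Parse lines like '2017-09-11: <link>' into a {date: [links]} mapping."""
    index = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split at the first colon only, so links containing ':' stay intact.
        day_part, sep, link = line.partition(":")
        if not sep:
            continue  # no colon, so this is not an index entry
        try:
            day = date.fromisoformat(day_part.strip())
        except ValueError:
            continue  # skip lines that do not start with a YYYY-MM-DD date
        index.setdefault(day, []).append(link.strip())
    return index


def missing_days(index, start, end):
    """Return the days in [start, end] that have no upload listed yet."""
    days = []
    current = start
    while current <= end:
        if current not in index:
            days.append(current)
        current += timedelta(days=1)
    return days
```

Running `missing_days` over the shared document would show which days nobody has claimed yet, which is the same job as "go to the next one" in the page-based plan.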

Tedious, yes. Extremely annoying, absolutely. But this can make sure we have an external backup just in case.
And the people with 50 TB available (especially the tape drive users) can archive the whole thing.

I have used mega before and you can easily make several accounts just by using a different e-mail address.
1 Mega account = 50GB, so 1000 accounts/users should do the trick.
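
The account estimate can be checked with a quick calculation. This is only a sketch: the 50 TB total and the 50 GB per-account quota are the figures quoted in this thread, not verified values, and `accounts_needed` is a hypothetical helper name.

```python
import math


def accounts_needed(total_tb, per_account_gb, gb_per_tb=1000):
    """Number of accounts needed to hold total_tb, rounding up."""
    total_gb = total_tb * gb_per_tb
    return math.ceil(total_gb / per_account_gb)


print(accounts_needed(50, 50))                   # decimal units: 1 TB = 1000 GB
print(accounts_needed(50, 50, gb_per_tb=1024))   # binary units: 1 TB = 1024 GB
```

With decimal units this gives 1000 accounts, matching the estimate; with binary units it gives 1024, so a little headroom would be needed either way.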

I will put this first idea into motion soon. But I am still open to ideas.
Writing a script obviously comes to mind, but I don't want to get banned in the process by stressing the servers too much.
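
A minimal sketch of how a script could avoid stressing the servers, assuming a fixed minimum delay between requests is acceptable. The `Throttle` class is hypothetical, and the right delay is unknown here since the site's actual limits are not documented in this thread.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the previous request: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A downloader would call `throttle.wait()` before every request, so the pace stays bounded no matter how fast the rest of the script runs.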

 
post Aug 13 2019, 14:06
Post #9
Testaccount321



Newcomer
**
Group: Members
Posts: 58
Joined: 24-April 18


QUOTE(RandomGuy1412 @ Aug 12 2019, 05:56) *

Imagine scrolling through EH without any metadata



You really need both. I mean, without metadata finding anything is nearly impossible. But on the other hand, what use is filtering for exactly what you want to see, when none of the results is actually available? Then it just creates frustration by telling you exactly how much was lost.

 
post Aug 19 2019, 15:14
Post #10
Vexxille



Casual Lurker
******
Group: Gold Star Club
Posts: 846
Joined: 20-August 12
Level 375 (Godslayer)


Since E-H keeps getting new content and tags, will the site backup/archives be renewed periodically?

This post has been edited by Vexxille: Aug 19 2019, 15:15

 
post Aug 19 2019, 15:16
Post #11
Maximum_Joe



Legendary Poster
***********
Group: Gold Star Club
Posts: 24,074
Joined: 17-April 11
Level 500 (Dovahkiin)


The backups should mirror the site (with perhaps a small delay).

 
post Sep 11 2019, 20:55
Post #12
Goofy Dan



Lurker
Group: Lurkers
Posts: 2
Joined: 5-August 19


Here's a backup: [files.catbox.moe] https://files.catbox.moe/wxwav5.zip

This post has been edited by Goofy Dan: Sep 11 2019, 21:09

 
post Sep 15 2019, 19:07
Post #13
Nazfellun



Lurker
Group: Recruits
Posts: 8
Joined: 8-October 08
Level 152 (Ascended)


QUOTE(blue penguin @ Aug 12 2019, 09:18) *

Crawling EH is difficult but not impossible. I strongly believe that the biggest value of EH is its metadata: tags and definitions. Just storing the metadata itself in some reasonable format can be a very valuable task.


Crawling it isn't too hard, mostly just a tad annoying due to the lack of documentation around the various rate limits. I started creating a full backup during the period where EH was still ostensibly doomed. I didn't finish, but I did complete the first few phases of the process, including backing up all available gallery and tag group metadata.

[mega.nz] Here's the largely-untouched original JSON metadata in individual files for the galleries, and the tag group data in JSON as well.

[mega.nz] Here's the same in an SQLite database. I created this from the JSON using Python & sqlalchemy; [mega.nz] here's an extract of the code, defining the schema used for the DB.

The JSON version is around 450MB zipped, and like 2-5x that unzipped (depending on how much per-file overhead your filesystem has, and if your filesystem does transparent compression). The SQLite version is around 389MB.
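
For readers who only grab the JSON version, here is a minimal sketch of loading one-file-per-gallery JSON into SQLite. It uses the standard-library `sqlite3` module rather than sqlalchemy, and it is not the schema from the linked extract: the two tables and the `gid`, `title` and `tags` field names are assumptions made for illustration.

```python
import json
import sqlite3
from pathlib import Path


def load_galleries(conn, json_dir):
    """Load every *.json file in json_dir into two simple tables."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gallery (gid INTEGER PRIMARY KEY, title TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gallery_tag ("
        "gid INTEGER, tag TEXT, PRIMARY KEY (gid, tag))"
    )
    count = 0
    for path in sorted(Path(json_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        gid = record["gid"]  # assumed field name
        conn.execute(
            "INSERT OR REPLACE INTO gallery (gid, title) VALUES (?, ?)",
            (gid, record.get("title")),
        )
        # One row per (gallery, tag) pair keeps tag searches a plain indexed lookup.
        conn.executemany(
            "INSERT OR IGNORE INTO gallery_tag (gid, tag) VALUES (?, ?)",
            [(gid, tag) for tag in record.get("tags", [])],
        )
        count += 1
    conn.commit()
    return count
```

Usage would be something like `load_galleries(sqlite3.connect("metadata.db"), "galleries/")`, with both paths being placeholders.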

QUOTE(Maximum_Joe @ Aug 19 2019, 23:16) *

The backups should mirror the site (with perhaps a small delay).


Once it was clear that EH wasn't going to be doomed in the short term, I did consider maintaining a live-ish full mirror. But unless I were willing to go to annoying lengths to reduce costs, it'd take me at least $600 USD/month, which isn't financially viable for me.

I guess I could try to crowd-source the money from fellow perverts, but the social and financial aspects involved in managing that are a lot less interesting to me than the technical aspects of building and running a mirror.

 
post Sep 15 2019, 19:41
Post #14
Elevens



Casual Poster
****
Group: Members
Posts: 274
Joined: 18-December 10
Level 130 (Ascended)


Thanks for the packs.

 
post Sep 16 2019, 00:33
Post #15
Mikoyan Gurevich



Newcomer
**
Group: Gold Star Club
Posts: 63
Joined: 6-June 13
Level 74 (Champion)


I'm just about done with backing up the first 10 TB of data, going in chronological order, oldest first. I'm only backing up doujins, manga, non-H, and artist CG, but it's a start. I actually was meaning to scrape the metadata from every gallery using the API but this is good too. Losing the years of work that went into tagging and adding metadata would be a huge blow to the archive, because it would effectively render it unsearchable without another tagging effort.

As for backing up the website, it's really a matter of time. I do not know how the site's iptables works in blocking people, but it is ostensibly feasible to back up the website using several ranges of IPs with dedicated computers working around the clock to back up a small portion of the data every day. Naturally this would cost a decent amount of money to do, but once the initial backup is complete, the costs should shrink significantly. The high operating costs only arise when you intend to have this data hosted online and available 24/7. E-H only works as well as it does since it uses a distributed image hosting platform and offloads its server capacity around the world.

 
post Jan 5 2020, 13:24
Post #16
nobodyserio



Lurker
Group: Recruits
Posts: 9
Joined: 20-February 10


Thank you very much for the effort.

It seems your method is far superior to mine. I actually started backing up, but like you said, without the metadata it feels lacking.

Can I join your effort to eventually back up the gallery completely? I assume you used a scraper/crawler?


 
post Jan 7 2020, 17:42
Post #17
sakuracircle



Casual Poster
***
Group: Gold Star Club
Posts: 154
Joined: 21-May 10
Level 292 (Godslayer)


QUOTE(blue penguin @ Aug 12 2019, 07:18) *

Crawling EH is difficult but not impossible. I strongly believe that the biggest value of EH is its metadata: tags and definitions. Just storing the metadata itself in some reasonable format can be a very valuable task.


I value these the most until we can have an AI who can read and tag galleries like a pervert.

 
post Jan 7 2020, 18:25
Post #18
Mayriad



SUPER ★ BUSY ★ TIME
*******
Group: Global Mods
Posts: 2,061
Joined: 18-December 10
Level 135 (Lord)


QUOTE(sakuracircle @ Jan 7 2020, 17:42) *
I value these the most until we can have an AI who can read and tag galleries like a pervert.

We are going off-topic, but say hello to [kanotype.iptime.org] Deep Danbooru, your friendly pervert AI tagger. I have been wondering when I should bring it up.

It is quite good most of the time, and handles some tricky cases surprisingly well. It would be interesting if we apply something like this to EH galleries; for example, you can use it to automatically generate tags for all images in a gallery, and then check which tags have enough presence. There are obviously big problems like non-visual tags, definition mismatch between EH and boorus, and 2000-image game CG galleries, and hence we cannot use this particular AI, but in the far future, we might be able to use an AI to suggest or add tags like a pervert.
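
The "check which tags have enough presence" step could be sketched like this, assuming the tagger has already produced a list of tag strings for each image. The function name `gallery_tags` and the 30% default threshold are illustrative assumptions, not anything the tagger itself provides.

```python
from collections import Counter


def gallery_tags(per_image_tags, min_presence=0.3):
    """Return tags predicted on at least min_presence of a gallery's images."""
    if not per_image_tags:
        return []
    counts = Counter()
    for tags in per_image_tags:
        counts.update(set(tags))  # count each tag at most once per image
    total = len(per_image_tags)
    return sorted(tag for tag, n in counts.items() if n / total >= min_presence)
```

A fraction-based threshold at least keeps the rule independent of gallery size, although a 2000-image game CG gallery would probably still need a different cutoff than a 20-page doujin.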

 
post Jan 7 2020, 18:29
Post #19
Shank



Roll for Initiative
**********
Group: Global Mods
Posts: 9,142
Joined: 19-May 12
Level 500 (Ponyslayer)


QUOTE(mayriad @ Jan 7 2020, 16:25) *

We are going off-topic, but say hello to [kanotype.iptime.org] Deep Danbooru, your friendly pervert AI tagger. I have been wondering when I should bring it up.

It is quite good most of the time, and handles some tricky cases surprisingly well. It would be interesting if we apply something like this to EH galleries; for example, you can use it to automatically generate tags for all images in a gallery, and then check which tags have enough presence. There are obviously big problems like non-visual tags, definition mismatch between EH and boorus, and 2000-image game CG galleries, and hence we cannot use this particular AI, but in the far future, we might be able to use an AI to suggest or add tags like a pervert.

I didn't know that existed. That explains why all the boorus I use are tagged like shit.

Edit: Random image I tested
Attached Image

Edit2: An actual image pulled from a booru fared better.
Attached Image

This post has been edited by Ubershank: Jan 7 2020, 19:08

 
post Jan 7 2020, 19:35
Post #20
Mayriad



SUPER ★ BUSY ★ TIME
*******
Group: Global Mods
Posts: 2,061
Joined: 18-December 10
Level 135 (Lord)


QUOTE(Ubershank @ Jan 7 2020, 18:29) *
Edit2: An actual image pulled from a booru fared better.

I actually know where this image is from... The tags for this one seem super accurate. The "mature" tag is interesting; I think Takizawa is around 30 years old, so the "mature" tag is reasonable considering the average age of waifus on the boorus.

