> Better upload tool

 
post May 15 2020, 16:26
Post #1
genl



Casual Poster
***
Group: Members
Posts: 190
Joined: 17-January 11
Level 54 (Expert)


I stopped uploading galleries on e-h years ago. One of the major reasons is the absence of a convenient tool for uploading and managing galleries with a multi-folder structure. That mostly means digital galleries in the Artist CG category.

It's somewhat difficult and annoying to properly upload the contents of all folders, keep track of all the variation sets, remember all the page numbers and include them as an index in the gallery description. This is also why certain mistakes are more apparent in the Artist CG section: duplicates, wrong order, missing sets of images etc.

Uploading such galleries is a chore, and it shouldn't be one. I thought I was the only one with this opinion, but it seems more users share it. Some people just stop uploading certain types of galleries. Although I myself may never return to it, I wish the community could use better instruments to do what it does more effectively.

If at some point admins decide to improve the gallery upload tool, here are things I can suggest:
- The best case is when a user uploads a single archive and it gets processed into the gallery with minimal further effort from him.
- Properly process Japanese filenames inside the zip archives.
- Do not throw away folder names and structure from the processing. Use it to visualize the output for the sorting stage and auto-generate the index for the description.
- Detect duplicates and mark them wherever possible.
- Maybe also keep the duplicate pages (but not files) in the gallery itself and keep them marked. This way, the gallery will have the original amount of pages but the archive will come without them.
- Process PDF files. Tricky, but should be possible. In many cases, the pages are just JPEG files and can be extracted with proper tools. In other cases, it should be possible to warn the user that he may need to process it himself to render stuff at proper resolution.
- Warn the user that adding a cover is preferable, or help him find it on the internet based on the product ID or description. Or better: allow him to just insert the original product page URL and suggest a proper title based on it.
- With the inclusion of the product page URL, it should be possible to also automatically add it in the description, to inform viewers where they can purchase it or follow the circle.
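To make the "use the folder structure to auto-generate the index" suggestion concrete, here is a minimal sketch. It assumes the archive's member paths are already available as plain strings; `build_index` and the "pages N-M" line format are made up for illustration and are not anything e-h actually does.

```python
from pathlib import PurePosixPath


def build_index(paths):
    """Return (ordered_paths, index_lines) for a list of archive member paths.

    Files are grouped by folder so each folder occupies one contiguous
    page range, which is what a gallery description index needs.
    """
    ordered = sorted(
        paths,
        key=lambda p: (str(PurePosixPath(p).parent), PurePosixPath(p).name),
    )
    ranges = {}  # folder -> (first_page, last_page), in order of appearance
    for page, path in enumerate(ordered, start=1):
        folder = str(PurePosixPath(path).parent)
        first = ranges.get(folder, (page, page))[0]
        ranges[folder] = (first, page)
    lines = [f"{folder}: pages {first}-{last}"
             for folder, (first, last) in ranges.items()]
    return ordered, lines


if __name__ == "__main__":
    _, index = build_index(["base/002.jpg", "base/001.jpg", "variation A/001.jpg"])
    print("\n".join(index))
```

A real tool would let the uploader reorder folders before the ranges are computed; this sketch just uses alphabetical order.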

Along with the above, inclusion of extra information about the original product should help to clearly mark (automatically or by users) digital galleries with specific attributes (tags or whatever) that may be useful:
- Sample
- Incomplete
- Outdated (then automatically warn the uploader so he may either update the gallery with the latest version or choose to leave it as is)
- Watermarked
- Tainted (e.g. screenshots instead of images of original quality and size)
- Optimized (smaller file size but with 100% original quality, e.g. after using PNGGauntlet)

 
post May 15 2020, 16:47
Post #2
Z.G.



I'm the sukebei, for I am holding all hentai in my hands
*******
Group: Gold Star Club
Posts: 1,286
Joined: 3-December 09
Level 271 (Ascended)


I just want to give my 2 cents about something you said. I don't think the upload system should care about extracting PDF or other formats ftm. It's your "job" to rip the contents, not the upload system's.
Same for the cover: I think it's kind of obvious, and even if the uploader left it out, people would comment about it. The uploader can then add it, or someone could (?) replace the gallery with one that has the cover.

For the wrong order, most of the time it's because people batch upload stuff with badly numbered files, for example naming them 1, 10, 100 instead of 001, 010, 100.
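For anyone wondering why 1, 10, 100 breaks the order: a plain string sort compares character by character, so "10.jpg" lands before "2.jpg". A small sketch of the two usual fixes (a numeric-aware sort key, or zero-padding the numbers) follows. The names `natural_key` and `zero_pad` are just illustrative helpers, and padding every digit run in a name is an assumption that may not suit every naming scheme.

```python
import re


def natural_key(name):
    """Sort key that compares digit runs as numbers, so 2 sorts before 10."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd positions are always the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def zero_pad(name, width=3):
    """Left-pad every digit run in the name with zeros to the given width."""
    return re.sub(r"\d+", lambda m: m.group().zfill(width), name)


if __name__ == "__main__":
    names = ["1.jpg", "10.jpg", "100.jpg", "2.jpg"]
    print(sorted(names))                   # plain string sort
    print(sorted(names, key=natural_key))  # numeric-aware sort
    print([zero_pad(n) for n in names])    # renamed so any sort works
```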

 
post May 15 2020, 17:00
Post #3
genl



Casual Poster
***
Group: Members
Posts: 190
Joined: 17-January 11
Level 54 (Expert)


QUOTE(ero-onizuka @ May 15 2020, 17:47) *
I just want to give my 2 cents about something you said. I don't think the upload system should care about extracting PDF or other formats ftm. It's your "job" to rip the contents, not the upload system's.

I could agree but too often I'm noticing bad PDF rips, and even worse - translations based on them. They can be upscaled and oversized, downscaled, blurry etc.
There is no definitive PDF extraction tutorial and handling it requires certain experience. Even if there was one, there is nothing that would automatically ask the uploader to read it.

 
post May 15 2020, 17:01
Post #4
Maximum_Joe



Legendary Poster
***********
Group: Gold Star Club
Posts: 24,074
Joined: 17-April 11
Level 500 (Dovahkiin)


QUOTE(genl @ May 15 2020, 10:26) *

Properly process Japanese filenames inside the zip archives.

IIRC this is more of an OS issue.

QUOTE
Detect duplicates and mark them wherever possible.

Huge waste of server resources. The user should be the one to take the time to do this, especially for near-duplicates and alternate resolutions.

QUOTE
Process PDF files.

Speaking as someone who has processed a lot of PDFs this is impossible to implement correctly. Plenty of PDFs are not encoded to be easily extracted so it would be a massive waste of upload bandwidth only for the system to spit back "I can't read this crap".

QUOTE
Warn the user that adding cover is preferable

The instances of missing covers have sharply declined. This just seems like nagging people.

QUOTE
With the inclusion of the product page URL, it should be possible to also automatically add it in the description, to inform viewers where they can purchase it or follow the circle.

Automatically how? Which storefronts (especially Japanese ones) have an API / are that scraper-friendly?

 
post May 15 2020, 19:57
Post #5
pork:zero



All the World's Evil
********
Group: Catgirl Camarilla
Posts: 2,876
Joined: 10-August 13
Level 339 (Godslayer)


A lot of this is just putting more burden on the site to do what even users struggle to do.
At best, it'll be inaccurate and require clean-up and correcting.

The effort spent in verifying and fixing things is about the same as it would take to organize it properly in the first place.
There wouldn't be that much net gain from this.

This post has been edited by saltcutlet: May 15 2020, 20:04

 
post May 16 2020, 15:10
Post #6
genl



Casual Poster
***
Group: Members
Posts: 190
Joined: 17-January 11
Level 54 (Expert)


QUOTE(Maximum_Joe @ May 15 2020, 18:01) *
IIRC this is more of an OS issue.
This issue is caused by the fact that most archive formats did not support UTF-8 at first and the fact that most Japanese portals still generate archives using older versions of archivers.
As for the e-h, from what I see it removes any non-ANSI characters from filenames, and if some file could not be processed its Japanese name is displayed with non-Japanese encoding (gibberish) in the log.

I propose to recognize such filenames (mostly folder names, really) and use them during gallery creation and possibly index creation. Imagine a list+tree view where the user is able to change the order of both files and folders.
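As a rough illustration of how such names can be recovered: Python's `zipfile` decodes member names as CP437 when the archive does not set the UTF-8 flag (general-purpose bit 11), so the original bytes can be restored and decoded again with the legacy encoding. This is only a sketch; `recover_name` and `list_names` are hypothetical helpers, and Shift-JIS (`cp932`) as the default is an assumption that fits Japanese storefront archives but not every archive.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the member name is stored as UTF-8


def recover_name(name, flag_bits, legacy_encoding="cp932"):
    """Undo the CP437 decoding that zipfile applies to non-UTF-8 names."""
    if flag_bits & UTF8_FLAG:
        return name  # already decoded correctly
    try:
        return name.encode("cp437").decode(legacy_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name  # not valid in the assumed encoding, keep what we have


def list_names(archive_path, legacy_encoding="cp932"):
    """Return the member names of a zip file with legacy names recovered."""
    with zipfile.ZipFile(archive_path) as archive:
        return [recover_name(info.filename, info.flag_bits, legacy_encoding)
                for info in archive.infolist()]
```

The `legacy_encoding` parameter is where a "let the uploader pick the encoding" option would plug in.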

QUOTE(Maximum_Joe @ May 15 2020, 18:01) *
Huge waste of server resources. The user should be the one to take the time to do this, especially for near-duplicates and alternate resolutions.
I'm pretty sure the server already generates some kind of hash for each image upon loading, so looking up identical hashes inside the gallery being prepared should not waste many resources. Also, this can be implemented on the client side.
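A minimal sketch of the in-gallery lookup, assuming the file contents are already in memory as bytes. `find_exact_duplicates` is a made-up name and SHA-256 is an arbitrary choice; nothing here reflects which hash e-h actually uses. It only catches byte-identical files, not resized or re-encoded near-duplicates.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(files):
    """files: mapping of path -> bytes.

    Returns a list of groups, each a sorted list of paths whose contents
    are byte-for-byte identical.
    """
    groups = defaultdict(list)
    for path, data in files.items():
        groups[hashlib.sha256(data).hexdigest()].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


if __name__ == "__main__":
    gallery = {
        "base/001.jpg": b"same image bytes",
        "variation A/001.jpg": b"same image bytes",
        "base/002.jpg": b"other image bytes",
    }
    for group in find_exact_duplicates(gallery):
        print("duplicate set:", ", ".join(group))
```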

QUOTE(Maximum_Joe @ May 15 2020, 18:01) *
Speaking as someone who has processed a lot of PDFs this is impossible to implement correctly. Plenty of PDFs are not encoded to be easily extracted so it would be a massive waste of upload bandwidth only for the system to spit back "I can't read this crap".
I've processed hundreds of PDFs. One properly configured algorithm path can take care of around 70% of all PDF cases (extract all JPEG and compare them to the number of pages).
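To show what "compare the JPEGs to the number of pages" could mean as a pre-check, here is a deliberately crude sketch that scans the raw PDF bytes. It is a heuristic under loud assumptions: it only counts `/Type /Page` dictionaries and `/DCTDecode` filters that appear uncompressed in the file, so PDFs that keep page objects inside object streams will be miscounted, and a real implementation would use a proper PDF library. The function name is hypothetical.

```python
import re

# A page object is "/Type /Page"; the lookahead excludes "/Type /Pages".
PAGE_RE = re.compile(rb"/Type\s*/Page(?![A-Za-z])")
# JPEG-encoded image streams declare the DCTDecode filter.
JPEG_RE = re.compile(rb"/DCTDecode")


def looks_like_simple_jpeg_pdf(pdf_bytes):
    """Return (is_simple, page_count, jpeg_count) from a raw byte scan.

    "Simple" here means exactly one JPEG stream per page, which is the
    case where the images can usually be copied out without re-encoding.
    """
    pages = len(PAGE_RE.findall(pdf_bytes))
    jpegs = len(JPEG_RE.findall(pdf_bytes))
    return pages > 0 and pages == jpegs, pages, jpegs
```

If the check fails, the tool would fall back to the "please process it yourself" message rather than guess.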

Upload bandwidth sounds like a valid concern, but I'm not sure it's fair to say it will be massively wasted. Upscaled PDF rips can be oversized by as much as 400% of the original PDF size. That's a lot of waste too.

Maybe server resources are really the more serious concern, since processing PDF files often takes some time. Still:
- The process seems single-threaded and can be queued.
- Even though removing protection is probably the most CPU-demanding step, it can be combined with the extraction and shouldn't waste memory. On my machine it only eats 2 MB of RAM.

I guess it's also something that can work on the client side.

QUOTE(Maximum_Joe @ May 15 2020, 18:01) *
The instances of missing covers have sharply declined. This just seems like nagging people.
All I can see is that it's still happening. Also, I'm not proposing to nag anyone. Just a field that can be optionally filled to get more details automatically, including the cover.

QUOTE(Maximum_Joe @ May 15 2020, 18:01) *
Automatically how? Which storefronts (especially Japanese ones) have an API / are that scraper-friendly?
Right, hardly any Japanese portal provides any API access at all. But maintaining a crawler for just 2 or 3 portals would cover maybe 2/3~3/4 of all digital galleries.
Depending on which data we're adding, some things can be done by just reading the <title> tag of the provided product page URL/ID. Cover JPEGs are really easy with just the ID, since the portals have used static URL schemes for many years.
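The "just read the <title> tag" part needs nothing beyond the standard library. A small sketch, assuming the product page HTML has already been fetched as text (fetching, and any per-portal cleanup of the title, is left out); `TitleParser` and `extract_title` are illustrative names.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html_text):
    """Return the page title with surrounding whitespace removed."""
    parser = TitleParser()
    parser.feed(html_text)
    return "".join(parser.title_parts).strip()
```

The result would only be a suggestion for the gallery title, which the uploader can still edit.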

QUOTE(saltcutlet @ May 15 2020, 20:57) *
A lot of this is just putting more burden on the site to do what even users struggle to do.
At best, it'll be inaccurate and require clean-up and correcting.
I don't think so. And again, many of these features can be implemented on client side.

QUOTE(saltcutlet @ May 15 2020, 20:57) *
The effort spent in verifying and fixing things is about the same as it would take to organize it properly in the first place.
There wouldn't be that much net gain from this.
The effort mods spend on solving and verifying seems constant over time. A better upload tool may help many users and prevent many of the mistakes that mods keep on solving.

This post has been edited by genl: May 16 2020, 15:11

 
post May 16 2020, 15:58
Post #7
blue penguin



in umbra, igitur, pugnabimus
***********
Group: Gold Star Club
Posts: 10,045
Joined: 24-March 12
Level 500 (Godslayer)


Come on genl, you of all people can automate this on your own machine with some 2h of scripting.

QUOTE(genl @ May 16 2020, 14:10) *
This issue is caused by the fact that most archive formats did not support UTF-8 at first and the fact that most Japanese portals still generate archives using older versions of archivers.
As for the e-h, from what I see it removes any non-ANSI characters from filenames, and if some file could not be processed its Japanese name is displayed with non-Japanese encoding (gibberish) in the log.
Sounds like a works-on-my-machine approach.

QUOTE
I propose to recognize such filenames (mostly folder names, really) and use them during gallery creation and possibly index creation. Imagine a list+tree view where the user is able to change the order of both files and folders.

I'm pretty sure the server already generates some kind of hash for each image upon loading, so looking up identical hashes inside the gallery being prepared should not waste many resources. Also, this can be implemented on the client side.
Just phash (www.phash.org), order by hash, count duplicates, retrieve counts, print paths. Easy on a filesystem, freaking pain on an indexing database.

QUOTE
I've processed hundreds of PDFs. One properly configured algorithm path can take care of around 70% of all PDF cases (extract all JPEG and compare them to the number of pages).

Upload bandwidth sounds like a valid concern, but I'm not sure it's fair to say it will be massively wasted. Upscaled PDF rips can be oversized by as much as 400% of the original PDF size. That's a lot of waste too.
And for the other 30% we would need to add a radio button for every command line option from pdftk and poppler. Then we would be doing support for the people who have no clue how a PDF is constructed inside instead of them going to the real documentation of those libraries.

QUOTE
Right, hardly any Japanese portal provides any API access at all. But maintaining a crawler for just 2 or 3 portals would cover maybe 2/3~3/4 of all digital galleries.
Depending on which data we're adding, some things can be done by just reading the <title> tag of the provided product page URL/ID. Cover JPEGs are really easy with just the ID, since the portals have used static URL schemes for many years.
What for? I run my own crawlers that plug into EH and update content automatically. I develop them myself and maintain them myself. No extra dev time for Tenb

You can do it too. It is a community driven effort. Don't be lazy and ask for convenience, you're way more than capable of implementing lots of things yourself. Chip in and lend a hand instead.

 
post May 16 2020, 20:47
Post #8
genl



Casual Poster
***
Group: Members
Posts: 190
Joined: 17-January 11
Level 54 (Expert)


QUOTE(blue penguin @ May 16 2020, 16:58) *
Come on genl, you of all people can automate this on your own machine with some 2h of scripting.
The issue is not that I won't be doing this kind of thing to make my own life easier (because I'm not planning to get back to uploading even if it will get implemented, also I'm not that good with JS), but that it has to be enabled globally to have good effect on uploaders and overall quality of their galleries.

QUOTE(blue penguin @ May 16 2020, 16:58) *
Sounds like a works-on-my-machine approach.
One of us is missing something. I'm talking about the fact that if we want to help uploader with sorting and ordering his upload, Japanese folder names inside non-UTF-8 zip archives are the only thing that can be automatically read and used for building any kind of named sequences (or hierarchies). Both for visual sorting and providing adequate index for the gallery description.

QUOTE(blue penguin @ May 16 2020, 16:58) *
Just phash (www.phash.org), order by hash, count duplicates, retrieve counts, print paths. Easy on a filesystem, freaking pain on an indexing database.
This doesn't have anything to do with databases because this, again, can be done by drawing frames over the pictures in the gallery with a JS script on the client side.
Deduplication in file system is another story and can be skipped (also I think it's already being done in some form).

QUOTE(blue penguin @ May 16 2020, 16:58) *
And for the other 30% we would need to add a radio button for every command line option from pdftk and poppler. Then we would be doing support for the people who have no clue how a PDF is constructed inside instead of them going to the real documentation of those libraries.
I'm not proposing to properly process every single PDF case. Just saying that one predefined path can handle most of them, which could turn tons of bad rips into good rips during gallery creation. With no effort from the uploader.

As for what to do with unsupported cases - yes, user should be met by "this case is too difficult for automatic system so please try processing it yourself" message. But then he can also be directed to some forum thread where an adequate guide can be prepared.

QUOTE(blue penguin @ May 16 2020, 16:58) *
What for? I run my own crawlers that plug into EH and update content automatically. I develop them myself and maintain them myself. No extra dev time for Tenb
More covers can be inserted properly in time (less reuploads and expunges), more sorting issues can be solved before the publishing (less edits and obscure updates), more product URLs can be provided from the start (less stress on Google and uploader's PM box, more visitors for creators). + Additional tags that can give more idea about the state of the gallery faster.

QUOTE(blue penguin @ May 16 2020, 16:58) *
You can do it too. It is a community driven effort. Don't be lazy and ask for convenience, you're way more than capable of implementing lots of things yourself. Chip in and lend a hand instead.
I don't know what to say here. Maybe "I'm lazy" would be best for everyone.

You see, e-h is missing a lot of potentially well-maintained content and a lot of potentially active uploaders. Continuously. Maybe because those uploaders are lazy, or maybe because they are expected to know, script and build stuff in order to provide good error-free galleries. People upload a few things and really just give up after they see it requires an amount of effort they can't offer.

 
post May 17 2020, 08:02
Post #9
Tenboro

Admin




QUOTE(genl @ May 16 2020, 15:10) *

I've processed hundreds of PDFs. One properly configured algorithm path can take care of around 70% of all PDF cases (extract all JPEG and compare them to the number of pages).


Anything that works only 70% of the time is worthless, FWIW.

 
post May 17 2020, 11:14
Post #10
blue penguin



in umbra, igitur, pugnabimus
***********
Group: Gold Star Club
Posts: 10,045
Joined: 24-March 12
Level 500 (Godslayer)


QUOTE(genl @ May 16 2020, 19:47) *
I don't know what to say here. Maybe "I'm lazy" would be best for everyone.
I agree. And then there's the story of clever lazy and stupid lazy (and yes, I'm definitely placing you in the clever lazy category :) )

But then it becomes a philosophical point: when one gives raw tools to people, they learn how to use them and become better people by improving the tools themselves; when one gives convenient tools to people, they become stupid lazy and just ask for more.

I have always been on the former front, give people the raw tools and allow them to work it by themselves. Not everyone can do it but the ones who can are worth gold. But I digress, this is a non-argument statement when one involves philosophy.

 
post May 17 2020, 15:29
Post #11
genl



Casual Poster
***
Group: Members
Posts: 190
Joined: 17-January 11
Level 54 (Expert)


QUOTE(Tenboro @ May 17 2020, 09:02) *
Anything that works only 70% of the time is worthless, FWIW.
Think of it like this: if for every PDF-ripped gallery you had the original PDF file, it would be possible to replace many of them with original quality content, through one predefined action. In practice it's pretty much impossible because we wouldn't even know which galleries are made from PDF-ripped images (it might have been useful to mark such galleries with a corresponding attribute or tag from the very start, btw).

It's possible to provide support for 100% of cases. But it would require much more code and effort. And if we want to achieve maximum quality for most difficult cases, the process also has to take some input from the user at several points.

But none of this matters. Processing PDF files is just one of the suggestions, that's all.

QUOTE(blue penguin @ May 17 2020, 12:14) *
I agree. And then there's the story of clever lazy and stupid lazy (and yes, I'm definitely placing you in the clever lazy category :) )

But then it becomes a philosophical point: when one gives raw tools to people, they learn how to use them and become better people by improving the tools themselves; when one gives convenient tools to people, they become stupid lazy and just ask for more.

I have always been on the former front, give people the raw tools and allow them to work it by themselves. Not everyone can do it but the ones who can are worth gold. But I digress, this is a non-argument statement when one involves philosophy.
My primary point: because of how e-h works, more people make mistakes in galleries, which later translates into more mistakes in derivative galleries (e.g. translated ones). People don't really know what to do when they upload their first gallery and proceed to repeat the mistakes. They are not greeted by any written guide that would make them understand more about what they should do (maybe because from e-h's point of view, it's better to not nag the user with advice and let him do what he decided to do, even if it results in a series of mistakes). They might even meet harsh reactions because of such mistakes. Some editors still use the 1280x images because they never knew they could get better ones.

Even if you could say that e-h provides raw tools to people, they don't seem to learn about it until it's too late to fix their past mistakes. Do people have to work more in order to contribute better stuff to the community? Some people spent tons of time on preparing hundreds of uploads. Some couldn't last a day. I'd say they all deserve better treatment already because they decided to contribute something.

 
post May 17 2020, 15:42
Post #12
Maximum_Joe



Legendary Poster
***********
Group: Gold Star Club
Posts: 24,074
Joined: 17-April 11
Level 500 (Dovahkiin)


We can't automate away stupidity. If you want to assist you can help write a guide to uploading (or improve the existing wiki article) to prevent as many of these mistakes as possible.

 
post May 17 2020, 16:20
Post #13
Mayriad



SUPER ★ BUSY ★ TIME
*******
Group: Global Mods
Posts: 2,061
Joined: 18-December 10
Level 127 (Lord)


QUOTE(Maximum_Joe @ May 17 2020, 15:42) *
wiki article

I think the wiki is essentially the biggest problem on EH
QUOTE(Ubershank @ Apr 10 2020, 21:35) *
because who reads the fucking wiki

 
post May 17 2020, 16:31
Post #14
Agoraphobia



✝️ Ascension of Angel ✝️
***********
Group: Global Mods
Posts: 11,056
Joined: 12-August 19
Level 500 (Ponyslayer)


QUOTE(Ubershank @ Apr 10 2020, 19:35) *
who reads the fucking wiki
this quote needs to be included somewhere in the EHWiki, like making it an easter egg or something like that.

my stomach hurts.

 
post May 18 2020, 23:06
Post #15
@43883




************
Group: Gold Star Club
Posts: 31,466
Joined: 6-March 08
Level 500 (Newbie)


1. How do you know filenames are going to be in Japanese? It could be Coronese Chinese, it could be Russian, Greek, Thai, Arabic, and it could even be mojibake. Nowadays, smart devs and artists include the Japanese in the readme files and use ASCII for filenames which in turn leads to zero encoding errors. No one cares if it's Engrish as long as any system can read it just fine - it's not like EH asks you to limit filenames to length 8 and all caps either. And yes, this is why whenever I can, I go for an English (American, Australian, Canadian, British, whatever as long as it's ASCII) operating system.

The current "dirty ASCII trimming" allows users to process archives and send them with a chance of an out-of-order result, which is better than no way to upload at all or mojibake everywhere. Per the wiki (not sure if it was Joe, myself or someone else writing this part) you are advised to archive every subfolder because yes, a single archive is going to be messy. Also, this is usually a good idea as you want to upload multiple files to avoid hitting the recommended (500MB) or absolute (1200MB) file limit as well as use as much of your upload bandwidth as possible.

2. Tools like Bulk Rename Utility can prefix subfolders to filenames in seconds for someone who has never touched a computer in their entire life. Minimal effort here, has been suggested before, almost sure it's on the wiki. People know there's a wiki (nag screens were rejected because that would scare away the userbase, also it'd eventually be bypassed by script kiddies even with advanced stuff like un-OCRable must-read-or-GTFO picture-generated article links), if they don't want to read it, user issue, please replace user and try again.
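For those who prefer a script over a GUI renamer, the "prefix subfolders to filenames" step can be sketched in a few lines of Python. This is an illustration only: `flatten_with_prefix` is a made-up helper, it copies instead of renaming in place, it expects the target folder to sit outside the source folder, and it does not handle the rare case where two different paths flatten to the same name.

```python
import shutil
from pathlib import Path


def flatten_with_prefix(source_dir, target_dir, separator="_"):
    """Copy every file under source_dir into target_dir as one flat folder.

    Each file name is prefixed with its subfolder names, so sorting the
    flat folder by name keeps the files of each subfolder together.
    """
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(p for p in source.rglob("*") if p.is_file()):
        flat_name = separator.join(path.relative_to(source).parts)
        shutil.copy2(path, target / flat_name)
        copied.append(flat_name)
    return copied
```

Calling `flatten_with_prefix("work", "upload")` on a folder with `setA/001.jpg` and `setB/001.jpg` would produce `setA_001.jpg` and `setB_001.jpg` in `upload`.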

3. Duplicate File Eraser does this for you in a matter of minutes even with a slow CPU, and that only catches exact dupes. Nothing will catch "close" dupes without a full similarity scan. And all it takes is a single byte to wreck the entire hashes and checksum. The system also will detect already uploaded files on its own.

4. The cover warning can be added as text to the uploading interface but won't prevent people from not reading it, just like the wiki, the content warning (clicks alright already, complains about objectionable content) or the Terms of Services (automatically checks I have read the ToS, "but no one told me about the ToS"). No system can automatically detect if there is a proper cover (relying on filenames is plain stupid and nagging pop-ups will quickly become annoying especially if there is one already).

Also, this accelerates the takedown process by effectively fetching resources from a shopping site as doujindb or whatever that crap is called is about as reliable as 2G/EDGE signals (at least until it was re-edited by TGP and gring). This thread itself is dangerous since it is fully crawlable, and already has been crawled and archived by tons of spiders and bots.

5. Good idea but bad practice. The *user* has to provide that information. They're supposed to own the work so they know where they bought it, right? Then if they need alternate shopping sites, they can use <insert preferred search engine> and insert it in the uploader comment field.

6.
- sample: There is a tag for this crap already. It wasn't even supposed to make it through because we have the bounty system. Some samples are easy to recognize but many are not.
- incomplete: How do you tell? MoeMoe? Sometimes even the shopping sites themselves are wrong on this, also, some pages may have been intentionally left out (and not just the blank pages) without qualifying for a proper incomplete replacement.
- outdated: Older works are not unwanted as long as they don't already exist. It is not easy to check whether they do.
- watermarked: Okay, now watermarks are relatively easy to identify... but still...
- tainted: The upload tool isn't this sentient. It needs to fetch the original somewhere. And then compare. And then ask itself "isn't there a reason to keep both after all?" which is why there is a thread for this...
- optimized: Checking every single file for proper removal of rogue bytes? Telling the user "YOOOO, JPEG100 is retarded, it's still lossy, consider using JPEG80 and fuck you and fuck your artifact OCD or for fuck's sake, look bro, this shit's greyscale, USE FUCKING PNG!!" "Hey wait, 4-color palette animation?! USE GIF MAN! THIS THING'S OLDER THAN OLD!" "Do you even scan?! This is pure black and white, not even greyscale!!" Upload servers would return Derpy Hooves all the time.

Okay, slightly less "negative" feedback because there is good intent in there and you are not stupid.
QUOTE
admins
Well actually, maybe you are, a little bit. Only Tenboro can touch this. There is only one admin. ALL of us are plebs. Even Joe. Or Ghosty. But I'll still give you benefit of doubt on "admins" being used all the time even for casual staffers/helpful community users, yes, like yourself.

Give the community pointers on how to include this and maybe Tenboro will consider it at some point. Like, for scans, the ISBN can be OCR'd in many situations... but is it CPU-efficient? Do we have to nag the user to give some info and potentially lose contributors?

tl;dr: As already stated, most can be solved by reading the wiki and taking a few minutes to fill in some info if the uploader wants to help even more. Linking to at least one shopping site (preferably the one they bought the work from) is more than enough for crawling purposes.

 
post May 20 2020, 02:27
Post #16
genl



Casual Poster
***
Group: Members
Posts: 190
Joined: 17-January 11
Level 54 (Expert)


QUOTE(Luna_Flina @ May 19 2020, 00:06) *
1. How do you know filenames are going to be in Japanese? It could be Coronese Chinese, it could be Russian, Greek, Thai, Arabic, and it could even be mojibake. Nowadays, smart devs and artists include the Japanese in the readme files and use ASCII for filenames which in turn leads to zero encoding errors. No one cares if it's Engrish as long as any system can read it just fine - it's not like EH asks you to limit filenames to length 8 and all caps either. And yes, this is why whenever I can, I go for an English (American, Australian, Canadian, British, whatever as long as it's ASCII) operating system.
I'm not aware of any portal massively using zip archives with non-Japanese encoding. But I'm talking about commercial portals like DLsite here. User is of course free to create an archive with any other encoding, although it's not a case I'm trying to cover. If it matters, it's fairly possible to let the user choose which encoding to use each time, with Japanese set by default.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
The current "dirty ASCII trimming" allows users to process archives and send them with a chance of an out-of-order result, which is better than no way to upload at all or mojibake everywhere.
I'm not saying that Japanese should be preserved in filenames. I'd like that too for the sake of completeness (metadata is data too), but the more important issue here is that folder names and filenames can help with the sorting of the gallery before publishing. It's possible to remove those when the gallery is published if there is such a need.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
you are advised to archive every subfolder because yes, a single archive is going to be messy.
This is what I think should be covered. Creating a system for solving the order of files inside the multi-folder archive should not be rocket science.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
2. Tools like Bulk Rename Utility can prefix subfolders to filenames in seconds for someone who has never touched a computer in their entire life. Minimal effort here, has been suggested before, almost sure it's on the wiki. People know there's a wiki (nag screens were rejected because that would scare away the userbase
You are suggesting that users should know and use additional tools. It's not even about not reading the wiki. Given how the web has evolved in recent years, it's not surprising that the vast majority of new users expect to be able to use all the power of a service without reading a line of documentation. It's not their fault and they are not wrong. Expecting them to follow some guidelines (since they are submitting the content) is understandable, but providing means for that which are as clear and easy as possible is only logical.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
also it'd eventually be bypassed by script kiddies even with advanced stuff like un-OCRable must-read-or-GTFO picture-generated article links)
Not every potential uploader is a script kiddie. Far from that.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
if they don't want to read it, user issue, please replace user and try again.
That's where you lose potential active uploaders.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
3. Duplicate File Eraser does this for you in a matter of minutes even with a slow CPU, and that only catches exact dupes.
That's 2nd tool you suggest to use. May I also suggest using an archiver application that is capable of reading and extracting Japanese filenames? So the uploader doesn't get too confused by all the gibberish and proceed to sort out, rename all the folders and archive them one by one again... like you suggested earlier. That's how they lose more of their time.

Please understand. It's not about how good you can get with certain software or how much time you can save with some great scripts. It's about the fact that almost nothing is being done to spread such knowledge, to the point where some people spend months here and leave without ever knowing it exists.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
4. The cover warning can be added as text to the uploading interface but won't prevent people from not reading it, just like the wiki, the content warning (clicks alright already, complains about objectionable content) or the Terms of Services (automatically checks I have read the ToS, "but no one told me about the ToS"). No system can automatically detect if there is a proper cover (relying on filenames is plain stupid and nagging pop-ups will quickly become annoying especially if there is one already).
I repeat an option that I proposed for digital galleries: add a field (product URL) and tell the user that filling it in may help with metadata. Process it once and show the user what can be added from it: cover, title, circle, type, description, category etc. It's just an example of how it could work. The URL can be automatically added as a permanent non-editable line near the description, and it can be used for fetching any info of interest or just the cover; let the admin/coder decide.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
Also, this accelerates the takedown process by effectively fetching resources from a shopping site as doujindb or whatever that crap is called is about as reliable as 2G/EDGE signals (at least until it was re-edited by TGP and gring). This thread itself is dangerous since it is fully crawlable, and already has been crawled and archived by tons of spiders and bots.
Can't say anything about "takedown process". I've described the motivation behind my suggestions already.

As for fetching resources, at least with covers their URLs can be easily generated even without sending any requests.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
5. Good idea but bad practice. The *user* has to provide that information. They're supposed to own the work so they know where they bought it, right? Then if they need alternate shopping sites, they can use <insert preferred search engine> and insert it in the uploader comment field.
Fetching some details from a single product URL (which they may provide) might be enough to save some of their time. Imagine uploading a gallery and filling in all the basic details without needing to switch to another tab or a different application.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
- incomplete: How do you tell? MoeMoe? Sometimes even the shopping sites themselves are wrong on this, also, some pages may have been intentionally left out (and not just the blank pages) without qualifying for a proper incomplete replacement.
In many cases the product page contains information that makes this easy to identify for Artist CG: file size, inclusion of a PDF version, number of pages with all variations. Making the product URL available (after motivating the uploader to include it) makes incomplete galleries easier for other members to recognize.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
- outdated: Older works are not unwanted as long as they don't already exist. It is not easy to check whether they do.
It's fairly easy. Go to the product page and see if it has any update dated after the gallery was posted. DLsite/DMM include this information. So if members see the gallery is outdated, they tag it so. If the uploader gets notified about this, there is a better chance that he updates the gallery and the tag gets reset.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
- watermarked: Okay, now watermarks are relatively easy to identify... but still...
That equals low-quality content. Some users may decide to provide a better version, so searching by such a tag may help them find which galleries need an improvement.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
- tainted: The upload tool isn't this sentient. It needs to fetch the original somewhere. And then compare. And then ask itself "isn't there a reason to keep both after all?" which is why there is a thread for this...
Of course it has to be tagged by someone. Again, having the product URL helps with identifying this quicker, mostly by file size.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
- optimized: Checking every single file for proper removal of rogue bytes? Telling the user "YOOOO, JPEG100 is retarded, it's still lossy, consider using JPEG80 and fuck you and fuck your artifact OCD or for fuck's sake, look bro, this shit's greyscale, USE FUCKING PNG!!" "Hey wait, 4-color palette animation?! USE GIF MAN! THIS THING'S OLDER THAN OLD!" "Do you even scan?! This is pure black and white, not even greyscale!!" Upload servers would return Derpy Hooves all the time.
This one is tricky. Maybe it needs some work or maybe it's not needed at all. For me, having a 50 MB file is better than a 300 MB one, and crunching those with PNGGauntlet can take an hour. I can understand if a gallery containing greatly optimized pages deserved a tag or some note about it, so that I wouldn't try to crunch it again and lose more of my time only to see it's already fully optimized.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
Okay, slightly less "negative" feedback because there is good intent in there and you are not stupid. Well actually, maybe you are, a little bit. Only Tenboro can touch this. There is only one admin. ALL of us are plebs. Even Joe. Or Ghosty. But I'll still give you benefit of doubt on "admins" being used all the time even for casual staffers/helpful community users, yes, like yourself.
I wasn't trying to point at anyone. I don't know how the situation changes day to day, or how it may have changed by the time something gets into development.

QUOTE(Luna_Flina @ May 19 2020, 00:06) *
Give the community pointers on how to include this and maybe Tenboro will consider it at some point.
Well, how about a client-side web application? The simplest form is a userscript, but maybe something else is possible. WebAssembly? Maybe either option is stupidly difficult to implement. Maybe an executable application could be created that would prepare the files and help with the sorting, so a user would just create the gallery and upload all the files in one go. Such an application would either be multi-platform or be limited to a certain OS.

Anyway, the key point of trying to make it work mostly (or fully) client side is of course to avoid putting additional burden on the servers.
Here is what, to my knowledge, can theoretically be achieved on the client side:
- Read the archive using proper codepage where needed.
- Extract all the relevant files in work/temp folder and show which files can not be uploaded (e.g. PDF, txt, MP4 etc.).
- Produce some base sorting where files are sorted by groups (1 group = 1 folder), and allow further sorting by group. Each group named by corresponding folder name (or folder path, in case there are subfolders).
- Provide support for sorting by files within the group, using any usable sorting method, and by hand.
- Show resolutions clearly for each file or group, so the user can easily see if there are lo-res dupes and get rid of them asap.
- Fetch details from the product URL, including cover, description, title, circle etc.
- Process PDF (difficult, will need further input from user, but should cover many of usual cases). This should be discussed separately, and I can help with logic details.
- Provide the possibility to compare any 2 images, preferably with zoom. A feature for switching sides would further help with identifying the better quality image quicker.
- After all is prepared, if nothing is changed on the e-h side, move all files into one folder under prepared names that always sort correctly by filename, so the gallery order is 1:1 with the order set before the final step. Alternatively, a special naming scheme could be prepared that e-h would understand, re-sorting all the files automatically according to that scheme before the gallery is published.
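To make the grouping and final naming steps from the list above more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the function names, the `IMAGE_EXTS` set (a guess at accepted formats, not the site's actual list) and the zero-padded naming scheme are assumptions for illustration only. It takes the member paths of an archive, sets aside files that can't be uploaded, orders the rest folder by folder with a natural sort, and produces flat names plus a page index for the description.

```python
import re
from pathlib import PurePosixPath

# Assumption: the real list of accepted formats is whatever the upload page states.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif"}


def natural_key(text):
    """Sort key that orders '2.png' before '10.png'."""
    parts = re.split(r"(\d+)", text)
    return [int(p) if p.isdigit() else p.lower() for p in parts]


def plan_upload(paths):
    """Return (ordered, rejected): ordered is a list of (folder, filename)."""
    groups, rejected = {}, []
    for raw in paths:
        if raw.endswith("/"):
            continue  # directory entry, nothing to upload
        path = PurePosixPath(raw)
        if path.suffix.lower() not in IMAGE_EXTS:
            rejected.append(raw)  # e.g. PDF, txt, MP4
            continue
        folder = "" if str(path.parent) == "." else str(path.parent)
        groups.setdefault(folder, []).append(path.name)
    ordered = []
    for folder in sorted(groups, key=natural_key):  # default order; user may rearrange
        for name in sorted(groups[folder], key=natural_key):
            ordered.append((folder, name))
    return ordered, rejected


def flat_names(ordered):
    """Zero-padded names that keep the chosen order under a plain filename sort."""
    width = max(4, len(str(len(ordered))))
    return [
        f"{page:0{width}d}{PurePosixPath(name).suffix.lower()}"
        for page, (_folder, name) in enumerate(ordered, start=1)
    ]


def build_index(ordered):
    """Return (folder, first_page, last_page) tuples for the description index."""
    index = []
    for page, (folder, _name) in enumerate(ordered, start=1):
        if index and index[-1][0] == folder:
            index[-1] = (folder, index[-1][1], page)
        else:
            index.append((folder, page, page))
    return index
```

The sketch only plans the result; actually copying the files under their new names and letting the user reorder groups by hand would sit on top of it.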

 
post May 20 2020, 03:09
Post #17
Maximum_Joe



Legendary Poster
***********
Group: Gold Star Club
Posts: 24,074
Joined: 17-April 11
Level 500 (Dovahkiin)


QUOTE(genl @ May 19 2020, 20:27) *

- Extract all the relevant files in work/temp folder and show which files can not be uploaded (e.g. PDF, txt, MP4 etc.).

We already list accepted formats in bold text on the upload page. If someone got an archive straight from DLSite and wants to dump it maybe they can take 10 seconds to look through said archive first.

QUOTE
- Produce some base sorting where files are sorted by groups (1 group = 1 folder), and allow further sorting by group. Each group named by corresponding folder name (or folder path, in case there are subfolders).

Letting any files be in a "group" in the UI is just confusing since it will mislead users into thinking that having groupings is supported.

QUOTE
- Provide support for sorting by files within the group, using any usable sorting method, and by hand.

Drag-and-move or selection rectangles would be nice but I wouldn't go any further than that.

QUOTE
- Show resolutions clearly for each file or group, so the user can easily see if there are lo-res dupes and get rid of them asap.

This is extremely irrelevant to most uploads and just clutters the post-upload page with useless information. As a reminder we are okay with lower rez versions if they are part of the official release.

QUOTE
- Fetch details from the product URL, including cover, description, title, circle etc.

We already have the community incentivized to do this for cleanup and comment points. The uploader need not concern themselves about anything besides the cover and having everything in order.

QUOTE
- Process PDF (difficult, will need further input from user, but should cover many of usual cases). This should be discussed separately, and I can help with logic details.

You're much better off writing a guide to PDF extraction. Another thought that came to me was just how long it takes to extract PDFs (I can easily foresee files that would take hours). No way is anyone gonna sit and stare at the upload screen for that long without thinking that something broke.

QUOTE
- Provide the possibility to compare any 2 images, preferably with zoom.

There is no way in hell we can make something better than the offline programs that are dedicated to such functions.

 
post May 20 2020, 13:59
Post #18
blue penguin



in umbra, igitur, pugnabimus
***********
Group: Gold Star Club
Posts: 10,045
Joined: 24-March 12
Level 500 (Godslayer)


We are forgetting one detail. The entire argument is about losing potential uploaders but most uploaders are just complete shit

They upload content that simply needs to be thrown away: the daily bumpers, the pixiv (re)scrapers, the free comic competitors, the "I'm offended by a comment on my gallery" snowflakes, the "my friend cannot see my gallery" forum spammers. And I do miss some, I'm sure.

I'm not confident that we want to make life easier for those. I'll argue that the need for some knowledge in order to upload galleries protects us from dealing with a good deal of those idiots.

 
post May 20 2020, 17:22
Post #19
genl



Casual Poster
***
Group: Members
Posts: 190
Joined: 17-January 11
Level 54 (Expert)


QUOTE(Maximum_Joe @ May 20 2020, 04:09) *
We already list accepted formats in bold text on the upload page. If someone got an archive straight from DLSite and wants to dump it maybe they can take 10 seconds to look through said archive first.
Ideally they wouldn't need to unpack or repack anything. It wouldn't matter if they feed the upload tool with it without looking inside first if it's all going to be done on client side.

QUOTE(Maximum_Joe @ May 20 2020, 04:09) *
Letting any files be in a "group" in the UI is just confusing since it will mislead users into thinking that having groupings is supported.
I don't think it's a serious enough issue. It should be solved along with designing the UI.

QUOTE(Maximum_Joe @ May 20 2020, 04:09) *
We already have the community incentivized to do this for cleanup and comment points. The uploader need not concern themselves about anything besides the cover and having everything in order.
I'm still seeing galleries without any URL. Why not incentivize the uploader too?
Side note. In fact, one of the most annoying things I do almost every time I download an archive from e-h, is finding the product ID and including it in the filename, in order to be able to find it faster by ID in future.

QUOTE(Maximum_Joe @ May 20 2020, 04:09) *
Another thought that came to me was just how long it takes to extract PDFs (I can easily foresee files that would take hours). No way is anyone gonna sit and stare at the upload screen for that long without thinking that something broke.
I don't know what tools you are using. Extracting one ~100 MB PDF file takes less than a minute here (just a few seconds if it's not "secured"). Rendering is another process which is needed much less often, but it also doesn't take more than a few seconds per page.
A PDF that could take hours... As long as it's not several GB in size, I assume it's just not being handled properly.

QUOTE(Maximum_Joe @ May 20 2020, 04:09) *
There is no way in hell we can make something better than the offline programs that are dedicated to such functions.
First, I disagree because it doesn't seem overly complicated. It only needs 2 rectangles (with an image inside each) which would scroll simultaneously. Second, like I've said, an executable application (offline program as you call it) is a possibility too.

QUOTE(blue penguin @ May 20 2020, 14:59) *
We are forgetting one detail. The entire argument is about losing potential uploaders but most uploaders are just complete shit

They upload content that simply needs to be thrown away: the daily bumpers, the pixiv (re)scrapers, the free comic competitors, the "I'm offended by a comment on my gallery" snowflakes, the "my friend cannot see my gallery" forum spammers. And I do miss some, I'm sure.

I'm not confident that we want to make life easier for those. I'll argue that the need for some knowledge in order to upload galleries protects us from dealing with a good deal of those idiots.
I think people like those are not going to be affected, since their uploads rarely have anything to do with multi-folder structures. Most uploaders upload that kind of content because it's free. Those who are ready to upload quality stuff are a minority, and I hoped that my suggestions could keep that minority from shrinking further.

Btw, galleries with pixiv/twitter content can be useful for preservation since particular files can get altered or wiped at the source, or the whole artist profile can get removed.

 
post May 20 2020, 21:55
Post #20
saythe14wordsforme



Newcomer
*
Group: Recruits
Posts: 19
Joined: 19-January 20


QUOTE(blue penguin @ May 20 2020, 13:59) *


the daily bumpers, the pixiv (re)scrappers



What's the issue again? A lot of artists on Pixiv will remove stuff and it's a blessing to have people re-upload their stuff to e-hentai and other sites to preserve their content.