Filenames: Possibilities, Limitations, Problems, Ease of Use, Compatibility and Processing Speed

 
post Mar 9 2025, 20:51
Post #1
Katajanmarja



Regular Poster
Group: Gold Star Club
Posts: 659
Joined: 9-November 13
Level 386 (Godslayer)


So, I have been wondering about a few things related to organizing my saved files.

Back in the day, all DOS filenames were at most eight plus one plus three ASCII characters long, e.g. "SAVEPIC7.BMP". My assumption is that at the time, each and every filename in itself took up the same amount of space on a disk (or diskette or whatever), i.e. it did not speed up any processes or save any disk space to name a file "SP7.BMP" instead of "SAVEPIC7.BMP".

Then longer filenames became possible, and later various Unicode characters. And, of course, this caused clashes between filesystems. Older Microsoft OSes were surprisingly good at dealing with files named under newer ones, though; I vaguely recall filenames like "Savedpicture-Ran-Yakumo.jpg" and "Savedpicture-Sailor-Moon.jpg" routinely transforming into something like "SAVEDP~1.JPG" and "SAVEDP~2.JPG" when opened within an older system.

Various filename limits must still exist today. It occasionally happens that I try to save files to which somebody has given ridiculously long names, and Ubuntu, or some program used on top of it, refuses to handle them. I’m not exactly sure how the name length is counted. My hunch is that names featuring several kanji have a lower maximum number of characters than ones featuring ASCII only, but both types are perfectly capable of reaching the limits.
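To illustrate the hunch (a rough sketch; I believe the limit is usually counted in UTF-8 bytes rather than characters, but correct me if I'm wrong):
CODE
# Rough sketch: if the limit is 255 *bytes* (as on ext4), a kanji-heavy
# name hits it about three times sooner than a pure-ASCII one.
names = [
    "Savedpicture-Sailor-Moon.jpg",   # ASCII: 1 byte per character
    "保存した画像・八雲藍.jpg",        # kanji/kana: 3 bytes each in UTF-8
]
for name in names:
    print(f"{name}: {len(name)} chars, {len(name.encode('utf-8'))} bytes")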

It has also occurred, albeit fairly rarely, that I have had serious issues while copying stuff. I might, for one reason or another, have ended up saving files with ASCII punctuation characters such as "<", ">", "|", "#", "%", "&", or "?" in their names. While first saving, I’ve had no issues, but once I try to copy the files from one device to another, nothing can be done unless I replace some of these characters with something else or remove them. Apparently webpage titles tolerate a large assortment of ASCII characters and some filesystems tolerate almost as many, but some other filesystems have reserved much of ASCII punctuation for special uses.

Finally, I have understood that Linux can indeed handle the ordinary space (Unicode: U+0020) in filenames but considers it an exception that requires extra operations. When I give commands in the terminal, I must remember to use quotation marks with filenames or directory names featuring spaces, otherwise the system has no idea what I’m trying to tell it to do. And that is very easy to forget about while just trying to get a job done.
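When I script such operations instead of typing them, I try to sidestep the quoting problem altogether; a minimal sketch (the filename is made up):
CODE
import shlex
import subprocess

name = "Savedpicture-Sailor Moon.jpg"  # note the unquoted space

# Passed as its own list element, the name is never split by a shell.
subprocess.run(["ls", "-l", name])

# If a shell command string really is needed, let shlex add the quotes.
print(shlex.quote(name))  # 'Savedpicture-Sailor Moon.jpg'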

There was a time when I tried to come up with old-fashioned filenames like "SAVEPIC7.BMP" for everything I kept; in the 00s, I still had one computer with Windows 3.1x in limited home use. After some time, I started to consider such naming conventions impractical and unnecessary. If everybody else around seemed to be sporting "●PIXIV● アルデヒド aldehyde [578571] 756.png’s", why should I not?

Nowadays, I’m increasingly bothered by various filename problems I’ve encountered since then. They may be fairly rare, but they can slow me down terribly and even cause my schedules to explode when they hit me out of the blue.

So I’ve decided I should replace spaces with hyphens or underscores whenever practical, and avoid most other ASCII punctuation characters; stick to alphanumeric ASCII when it does not cause too much confusion or require bad latinizations or translations; and make most filenames no longer than required to easily identify the files in the relevant contexts, be that length anything between 8 and 128 characters (occasionally I use even longer ones, perhaps most notably when saving stuff from EH galleries).
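A minimal sketch of that policy, with my own (arbitrary) length cut-off and character blacklist:
CODE
import re
import unicodedata

MAX_LEN = 128  # my own comfort limit, not a filesystem limit

def tidy_name(name: str) -> str:
    """Spaces to hyphens, drop the punctuation I try to avoid, keep it short.
    Non-ASCII letters are left alone, since I can't always romanize them."""
    name = unicodedata.normalize("NFC", name)
    name = re.sub(r"\s+", "-", name)              # spaces -> hyphens
    name = re.sub(r'[<>:"/\\|?*#%&]', "", name)   # punctuation I try to avoid
    return name[:MAX_LEN]  # a real version should keep the extension intact

print(tidy_name("Sailor Moon lingerie.png"))  # Sailor-Moon-lingerie.png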

It’s bugging me a bit that I’m not able to get rid of kanji, kana and the like entirely. While I might be able to learn all hiragana and katakana characters with a bit of effort (I have long recognized a few with ease), I will never learn to read most kanji, and the same goes almost certainly for several foreign scripts such as Thai. Filenames that I cannot read aloud are definitely not something I like to use. Furthermore, the encoding corruption cases that were extremely common in e‑mails some twenty years ago keep haunting me, even if I have not encountered similar problems with non-ASCII filenames in a good while. (For reasons related to such encoding corruptions, I generally avoid using such European letters as ń, ö, or þ as well, even though I can read those and many of them could increase the readability of my filenames.)

When renaming files, one thing I’d love to know more about is whether it saves filesystem resources or not.

If I rename "Sailor Moon lingerie" to "Sailor-Moon-lingerie", it obviously saves my effort if I have to do something to that file in the future. But if I rename "●PIXIV● アルデヒド aldehyde [578571] 756.png" to "aldehyde-578571-Pixiv-756.png", does that make my computer’s or my external HDD’s work any easier?

My fear is that the filesystem might check the longest filename it encounters on a device and use that much space for every other filename, too. For example, if I have "SAVEPIC7.BMP" and "Maruta__Tomodachi-ga-Sukunakutemo-Yoi-Riyuu__A-Good-Reason-for-Less-friends__COMIC-Penguin-Club-2011-12__engl__TV_10.jpg" in the same folder, could my filesystem actually read them as something like "SAVEPIC7________________________________________________________________________________________________________________.BMP" and "Maruta__Tomodachi-ga-Sukunakutemo-Yoi-Riyuu__A-Good-Reason-for-Less-friends__COMIC-Penguin-Club-2011-12__engl__TV_10____.jpg", respectively, or are filesystems sensibly good at optimization? In other words, does it make much sense to shorten many filenames unless one shortens all of them?

Furthermore, if an ASCII punctuation character such as "[" seemingly never causes any problems, does that mean it is no more resource-hungry than, say, "7", "B", or "m"? Or do some punctuation characters require special attention from filesystems, similarly to how "%" requires special treatment in percent-encoding? As a side note, where could I find a listing of "forbidden" characters per filesystem?
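The rough picture I have pieced together so far, written down as a sketch (not exhaustive, and worth double-checking against each filesystem's own documentation):
CODE
# My rough notes so far -- not an authoritative list, please correct me.
FORBIDDEN = {
    "ext4 and most Linux filesystems": {"/", "\x00"},
    "Windows rules (NTFS/exFAT/FAT32)": set('<>:"/\\|?*') | {chr(c) for c in range(32)},
}
# Shells and URLs add their own troublemakers (#, %, &, spaces), which is a
# separate issue from what the filesystem itself forbids.

def problematic(name: str) -> dict:
    """Report which rule sets a proposed file name would violate."""
    return {fs: sorted(set(name) & bad)
            for fs, bad in FORBIDDEN.items() if set(name) & bad}

print(problematic("[C81] what? | why.tar"))  # only the Windows rules complain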

I’m also wondering whether questions like this play any significant role when transferring gibibytes of data from one device to another at once. My hunch is that copying fifty 500-MiB files named like "Savedvideo-Sailor-Moon.mp4" is noticeably faster than copying five thousand 5‑MiB files named like "[C81] Metal (Shinobu) Puella Magi Sex Utage (Puella Magi Madoka Magica) [Chinese] [KNC_速食机翻].tar", but this could be just my imagination.

Any and all explanations, commentaries, further ponderings, and easy reading suggestions by you savvier folks are welcome.

 
post Mar 9 2025, 21:41
Post #2
Gingiseph



Newcomer
Group: Members
Posts: 28
Joined: 20-September 22
Level 28 (Apprentice)


Filesystems can be quite complex and have optimizations for lots of parameters, depending on usage and storage type.
Which ones are you using?
The documentation for that specific filesystem will probably give the most precise info you seek.

About copying files to a different medium or filesystem, I think the filename size is totally negligible. The work is mostly converting the metadata from your starting filesystem to the new one, then transferring the high volume of data.

QUOTE
I’m also wondering whether questions like this play any significant role when transferring gibibytes of data from one device to another at once. My hunch is that copying fifty 500-MiB files named like "Savedvideo-Sailor-Moon.mp4" is noticeably faster than copying five thousand 5‑MiB files named like "[C81] Metal (Shinobu) Puella Magi Sex Utage (Puella Magi Madoka Magica) [Chinese] [KNC_速食机翻].tar", but this could be just my imagination.

It is not your imagination at all; the hint is not in the file names but in the file count.
5,000 is two orders of magnitude more than 50, and the copying process has to allocate and re-create the source directory tree as well.

Also, folders tend to become slightly slower in I/O when the number of files they contain gets this big, since each listing operation has to restart from the very top.
(This can of course be avoided by splitting the folder into sub-folders; see the sketch below.)
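Just to illustrate what I mean by splitting (a rough sketch; path and bucket size are made up):
CODE
# Sketch: spread the files of one big folder into sub-folders of at most
# 500 entries each, so listing operations stay fast. Paths are hypothetical.
import os
import shutil

SRC = "/path/to/big_folder"   # made up
BUCKET = 500

files = sorted(f for f in os.listdir(SRC)
               if os.path.isfile(os.path.join(SRC, f)))
for i, name in enumerate(files):
    sub = os.path.join(SRC, f"part_{i // BUCKET:03d}")
    os.makedirs(sub, exist_ok=True)
    shutil.move(os.path.join(SRC, name), os.path.join(sub, name))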

Hope this can help you figure out the process a bit better.

 
post Mar 10 2025, 01:29
Post #3
Katajanmarja



Regular Poster
Group: Gold Star Club
Posts: 659
Joined: 9-November 13
Level 386 (Godslayer)


Many thanks, Gingiseph!

QUOTE(Gingiseph @ Mar 9 2025, 21:41) *

Which ones are you using?

For the most part, Ext4 for internal drives and FAT32 for external ones. I have recently begun a process of migrating from FAT32 to exFAT (for external devices that need Windows compatibility, i.e. the majority of them) and possibly Ext4 (for ones to be used with Linux only). Now is a good time if somebody feels like explaining why I should pick something else.

QUOTE(Gingiseph @ Mar 9 2025, 21:41) *

About copying files to different medium or filesystem, I think the filename size is totally negligible.

I guess that makes sense, but I wanted to ask regardless. (There was presumably some reason why 100-character filenames were not widely used in the 80s or 90s.)

QUOTE(Gingiseph @ Mar 9 2025, 21:41) *

The amount of work is mostly to convert your starting filesystem to the new one, then transferring the high volume of data.

My doubts could be re-worded as, "Does it significantly complicate the filesystem conversion process if the filenames are very long or contain unusual characters?" If not, then good.

QUOTE(Gingiseph @ Mar 9 2025, 21:41) *

It is not your imagination at all, the hint is not in the file names but in the file count.

That would have been my second guess. Thanks.

QUOTE(Gingiseph @ Mar 9 2025, 21:41) *

Also, folders tends to become slightly slower in I/O when the amount of file contained becomes this big, since each listing operation would have to restart from the very top. (This of course can be avoided by splitting the folder in sub-folders)

Actually, I prefer to build complex directory trees, keeping the number of files or sub-directories per directory below 50 (true, with quite a few notable exceptions). I am doing this for the sake of locating my files manually more easily. Well, at least those I’ve been able to sort out and rename properly. Are you saying this habit actually makes copying operations faster, even though it increases the number of directories (and thus units to copy) a lot?

Another weird feeling I have is that when I copy from ext4 to FAT32, the more units I copy with one command the more time it takes per unit, regardless of how the directories are split. Not to mention that copying a large mass at once is far more likely to cause system crashes or other complications. It’s good if weird, long filenames are not a significant factor here either.

This post has been edited by Katajanmarja: Mar 10 2025, 03:14

 
post Mar 10 2025, 07:19
Post #4
Moonlight Rambler



Let's dance.
Group: Gold Star Club
Posts: 6,427
Joined: 22-August 12
Level 372 (Dovahkiin)


A bit tired to reply to all points of your post at the moment, but in unix/linux usually any character is allowed in file names except for null (0x00) or '/' (0x2f). This includes newlines and ASCII control codes unfortunately.

FAT32 and NTFS and other OS filesystems have much stricter limits, which can cause trouble when copying files. Also, some filesystems may not support things like Unicode characters at all and may be limited to 7 bits (all ASCII text only uses the lowest 7 bits of a byte, leaving the topmost bit empty). A filesystem that preserves all 8 bits of every byte is called "8-bit clean"; one limited to 7 bits is not.

There are cases where windows can allow a longer filename than unix/linux does, too. But I don't remember the specifics off the top of my head. When a file name is too long in linux/unix I believe that's usually the full path it's counting, starting from the leading '/'. So a file in /home/username has a longer name than the same file in /.
QUOTE(Katajanmarja @ Mar 9 2025, 14:51) *
In other words, does it make much sense to shorten many filenames unless one shortens all of them?
Yes, it makes sense. You do not have to worry about a few specific longer names. As far as I can tell, space is allocated dynamically and the text string is terminated with a 0x00 (null) character to tell the system "the file name ends here," since that's one of the two characters not allowed in a file name.

I know the linux kernel does impose a maximum length on file names but I do not remember what that number usually is at the moment. I know I've hit that number before, though, when unpacking zip files and such. Usually the solution then is to move the zip file up to a higher level in the file hierarchy, extract it, rename everything, and then move it back to its destination.

the `mmv` tool (not mv, i mean mmv) is useful for mass-renaming files in cases like this. for instance:
CODE
mmv 'foobar_*.png' 'foo_#1.png'
…would rename all files with names like foobar_001.png or foobar_abcde.png to foo_001.png or foo_abcde.png.

I also use mmv to zero-pad numbers. say i have "page_1.jpg" through "page_9.jpg" and then it goes to "page_10.jpg". If I want page_01.jpg through page_09.jpg, I could do something like
CODE
mmv 'page_[0-9].jpg' 'page_0#1.jpg'

The '#1' means to insert whatever the first wildcard matched was in the "origin" file name at that spot in the output filename. If you have multiple wildcards in your source, you can use #2, #3, and so on to add the others. It's also useful for just shuffling around parts of file names. Say i have file names like "title_page_01.jpg" but I want them to be like "01_title_page.jpg" instead. I could do
CODE
mmv '*_[0-9][0-9].jpg' '#2#3_#1.jpg'
In that example, #2 is the first of the two digits, and #3 is the second of the two digits. #1 is the asterisk.
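If mmv is not installed, the last example can be approximated in Python too; a rough sketch that only prints the planned renames (uncomment the os.rename line to actually do them):
CODE
import os
import re

# Matches names like "title_page_01.jpg": group 1 is the text part,
# groups 2 and 3 are the two digits (same roles as #1, #2, #3 above).
pattern = re.compile(r"^(.*)_([0-9])([0-9])\.jpg$")

for name in sorted(os.listdir(".")):
    m = pattern.match(name)
    if m:
        new = f"{m.group(2)}{m.group(3)}_{m.group(1)}.jpg"
        print(name, "->", new)
        # os.rename(name, new)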

This post has been edited by Moonlight Rambler: Mar 10 2025, 07:34

 
post Mar 10 2025, 07:32
Post #5
-terry-



Veteran Poster
Group: Global Mods
Posts: 2,623
Joined: 9-August 19
Level 500 (Ponyslayer)


max path and max file name length are different, here are mine (in bytes)
QUOTE
~ ❯ getconf NAME_MAX /
255
~ ❯ getconf PATH_MAX /
4096


Here is how i sanitize file paths in my downloader script, since it appears to me your use case is similar.
CODE
MAX_FILENAME_LENGTH = 255  # NAME_MAX as reported by getconf above

# Characters a Linux filesystem will not accept in a name, and their stand-ins.
INVALID_CHARS_REPLACEMENTS = {
    "\x00": "",  # NUL: dropped entirely
    "/": "⁄",    # forward slash: replaced with U+2044 FRACTION SLASH
}

def get_safe_path(title: str, gid: str) -> str:
    # Strip or replace the characters the filesystem will not accept.
    for invalid_char, replacement in INVALID_CHARS_REPLACEMENTS.items():
        title = title.replace(invalid_char, replacement)

    # Truncate the title so that "title [gid].zip" still fits the limit.
    base = f"{title} [{gid}]"
    suffix_len = len(f" [{gid}].zip")
    if len(base) > (MAX_FILENAME_LENGTH - suffix_len):
        title = title[:(MAX_FILENAME_LENGTH - suffix_len)]
        base = f"{title} [{gid}]"
    return base
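For example, with a made-up title and gallery id, just to show what comes out:
CODE
print(get_safe_path("Sailor Moon / lingerie", "12345"))
# Sailor Moon ⁄ lingerie [12345]   (the "/" was swapped for a fraction slash)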


I don't know if file names were relevant to performance 40 years ago, but today they most certainly aren't.
Moving a couple of big files is likely going to be faster than thousands of small ones simply because there is less overhead, though I suppose with modern high-end drives that's less of a worry anyway.


 
post Mar 10 2025, 07:35
Post #6
Moonlight Rambler



Let's dance.
Group: Gold Star Club
Posts: 6,427
Joined: 22-August 12
Level 372 (Dovahkiin)


QUOTE(-terry- @ Mar 10 2025, 01:32) *

I don't know if file names were relevant to performance 40 years ago, but today they most certainly aren't.

in one way they were; try running 'ls' on a directory with really long file names on a 4800 baud (or 2400, or whatever) terminal.

QUOTE(Katajanmarja @ Mar 9 2025, 19:29) *
Another weird feeling I have is that when I copy from ext4 to FAT32,
I'd advise that you do not ever try that.
Mostly because FAT32 is not a good filesystem, but also because its handling of non-ASCII names is a mess; it's from the age of codepages and Windows 9X. Whenever I save a Japanese filename on my English Windows 98SE filesystem from my Japanese Win2K install (on the same hard disk), the Windows 98 file system checker thinks the Japanese file is corrupted data. It might be fine if you mount the FAT32 filesystem with the Japanese character set (iocharset and codepage options) in Linux, but unless you can guarantee that any other computer (like Windows machines) is in a Japanese locale, you may have problems. codepage=932,iocharset=euc-jp might work; maybe iocharset=utf8 would also. That's how I mount PC-9801 disk images in Linux without getting mojibake. (Yes, those are technically probably FAT16 or something, not FAT32, but FAT32 inherits the same issues.)

This post has been edited by Moonlight Rambler: Mar 10 2025, 07:48

 
post Mar 11 2025, 08:14
Post #7
Gingiseph



Newcomer
Group: Members
Posts: 28
Joined: 20-September 22
Level 28 (Apprentice)


QUOTE(Katajanmarja @ Mar 10 2025, 02:29) *

Many thanks, Gingiseph!


glad to be of some help.

QUOTE(Katajanmarja @ Mar 10 2025, 02:29) *

For the most part, Ext4 for internal drives and FAT32 for external ones. I have recently begun a process of migrating from FAT32 to exFAT (for external devices that need Windows compatibility, i.e. the majority of them) and possibly Ext4 (for ones to be used with Linux only). Now is a good time if somebody feels like explaining why I should pick something else.


It depends on how the external drive is used and where you plan to connect it.
Some devices that are not PCs might need to read it, so an old format like FAT32 is less efficient but more portable/compatible.

QUOTE(Katajanmarja @ Mar 10 2025, 02:29) *

I guess that makes sense, but I wanted to ask regardless. (Probably there was some reason why 100-character filenames were not widely used in the 80s or 90s, as far as I know.)


There were several character sets for supporting non-Latin characters, meaning that not all systems have all the supporting sets installed, and some might break the character encoding when reading/rendering a name.
In these cases, when possible and when I don't trust the "host system", I prefer to mount external partitions read-only first, so it cannot mess with the partition.

QUOTE(Katajanmarja @ Mar 10 2025, 02:29) *

Actually, I prefer to build complex directory trees, keeping the number of files or sub-directories per one directory below 50 (true, with quite a few notable exceptions). I am doing this for the sake of manually locating my files more easily. Well, at least those ones I’ve been able to sort out and rename properly. Are you saying this habit is actually making copying operations faster, even though it increases the number of directories (and thus units to copy) a lot?


Just my experience, of course: I've seen that the filesystem doesn't really run any risk even if a folder has thousands of files. Your experience might just be slower, because of the operations needed to list a high number of elements.
The hardware itself plays a role: a slow disk or a damaged disk really slows down every process.

QUOTE(Katajanmarja @ Mar 10 2025, 02:29) *

Another weird feeling I have is that when I copy from ext4 to FAT32, the more units I copy with one command the more time it takes per unit, regardless of how the directories are split. Not to mention that copying a large mass at once is far more likely to cause system crashes or other complications. It’s good if weird, long filenames are not a significant factor here either.


In this case I'd look into two things.
First, FAT32 is sequential: the process has to reserve space for each file based on its size (that's not the case with ext partitions, which use inodes), and it may also limit the number of parallel writes to avoid fragmenting the files (implementation dependent).

Second, a good thing to check is the SMART status of the destination disk, especially if it's an HDD and a bit old.
If you suddenly experience lag spikes, there is a good chance your disk is starting to develop bad blocks and small damage.
In that case, IMMEDIATELY STOP USING the disk and find a replacement ASAP, backing up any irreplaceable data first.

 
post Mar 11 2025, 22:07
Post #8
Katajanmarja



Regular Poster
Group: Gold Star Club
Posts: 659
Joined: 9-November 13
Level 386 (Godslayer)


Thank you all for your input, I am impressed to see people spending time on my woes like this.

I do not understand each and every point you have made above, but let’s say I did understand a good percentage, and at least a chunk of the rest I find worth looking into. If you have more comments, keep them coming, even if they are not of the simplest sort.

I’m embarrassed to admit that I probably confused NTFS with FAT32 above, because I wrote the opening in a tired moment. I’m pretty sure I’ve made use of both in the past, but it is highly likely that my normal external devices (except for the one I’ve formatted as exFAT) actually have NTFS today. My apologies. In any case, I’m happy to have read your comments regarding FAT32. It is a crucial bit of info to me that FAT32 handles non-ASCII filenames so poorly. (I’m writing this away from home and cannot check my external drives right now.)

As for mmv, I’m pretty sure I’ve used it on occasion. However, after getting used to XFCE, I use Thunar’s bulk renamer all the time, so there’s less incentive to use command line tools for renaming files.

