I want to manage a huge number of files on my server (say, millions). The files need to be stored in two or three levels of folders to keep the number of files in each folder low. On the other hand, creating too many folders wastes inodes.

What is the optimal number of files per folder? Is there a theoretical approach to determine this, or does it depend on the server specifications?

Server specifications are likely to be less of an issue than the file system you are using. Different file systems have different approaches to storing directory data. This will impact the scanning speed at various sizes.

Another important consideration is the lifecycle of the files. If you frequently add and delete files, you may want the leaf directories to be smaller than they otherwise might be.

You may want to look at the cache directory structures used by the Apache web server and the Squid proxy. These are well-tested caches that handle relatively high rates of change and scale well.
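As a rough illustration of the kind of multi-level layout those caches use, here is a minimal sketch that spreads files over hashed subdirectories. It is similar in spirit only and does not reproduce either project's actual scheme. The function names, the choice of SHA-256, and the defaults (two levels, two hex digits each) are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def shard_path(root: str, name: str, levels: int = 2, width: int = 2) -> Path:
    """Map a file name to root/<h1>/<h2>/.../name using leading hex digits of its hash.

    Hypothetical helper: `levels` and `width` are illustrative defaults.
    """
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return Path(root, *parts, name)


def expected_files_per_leaf(total_files: int, levels: int = 2, width: int = 2) -> float:
    """Average files per leaf directory, assuming the hash spreads names evenly."""
    leaf_dirs = (16 ** width) ** levels  # 16 possible values per hex digit
    return total_files / leaf_dirs


if __name__ == "__main__":
    print(shard_path("/data/files", "report-2024.pdf"))
    print(expected_files_per_leaf(10_000_000))
```

With two levels of two hex digits there are 65,536 leaf directories, so ten million files would average roughly 153 per leaf. Adjusting `levels` and `width` lets you trade directory count (and inodes) against files per directory for your file system.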

EDIT: The answer to your question depends significantly on the life-cycle and access patterns of the files. These factors will significantly influence the disk I/O and buffer memory requirements. Number of files is likely to be a less significant factor.

Besides the file system chosen, memory, disk interfaces, number of disks, and RAID setup all affect disk access performance. Performance needs to meet your requirements with some leeway.

Disk setup tends to be more important as writes and deletes increase. It can also be more important as access to files becomes more random. These factors tend to increase the requirement for disk throughput.

Increasing memory generally makes it more likely that files are served from disk buffers rather than from disk. This will increase file access performance for most systems. Access to many large files may result in poorer disk caching.

For most systems I have worked with, the likelihood a file will be accessed is related to when it was last accessed. The more recently a file was accessed the more likely it will be accessed again. Hashing algorithms tend to be important in optimizing retrieval in these cases. If file access is truly random, this is less significant.

The disk I/O required to delete a file may be significantly higher than that required to add one. Many systems have significant problems deleting large numbers of files from large directories. The higher the rate of file additions and deletions, the more significant this becomes. File lifecycle is therefore an important consideration when weighing these factors.
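One way to soften the cost of clearing a large directory is to stream through its entries rather than build a full listing first, optionally pausing between batches so other I/O can proceed. The sketch below assumes a flat directory of regular files; `delete_streaming`, `batch_size`, and `pause` are hypothetical names chosen for the example, and the right values depend on your file system and load.

```python
import os
import time


def delete_streaming(directory: str, batch_size: int = 1000, pause: float = 0.0) -> int:
    """Delete regular files in `directory` one entry at a time.

    Subdirectories and symlinks are left alone. If `pause` is non-zero,
    sleep that many seconds after every `batch_size` deletions.
    Returns the number of files removed.
    """
    removed = 0
    with os.scandir(directory) as entries:  # lazy iterator, no full listing in memory
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                os.unlink(entry.path)
                removed += 1
                if pause and removed % batch_size == 0:
                    time.sleep(pause)
    return removed
```

Keeping leaf directories small, as discussed above, reduces how often a cleanup like this is needed at all.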

Backups are another issue and may need to be scheduled so they don't cause disk buffering issues. Newer systems allow I/O to be niced so that backups and other maintenance programs have less impact on the application.
