
file system to handle photos (millions of photos)

Discussion in 'Business & Enterprise Computing' started by lavi, Aug 8, 2008.

  1. lavi

    lavi Member

    Joined:
    Dec 20, 2002
    Messages:
    4,008
    Location:
    Brisbane
    Say you've got several TB of images (no pr0n, I swear): about a million files at roughly 2 kilobytes each. What would be a decent filesystem to use?

    To give you an example... I have a directory with quite a few files in it stored on NTFS. Size on disk is 400MB, but if I zip it with no compression it comes in under 3MB.

    The files need to be accessible from Windows (major requirement), although there will also be a web interface for them.

    Has anybody been through something like this? I've been playing with cluster sizes on RAID arrays but it makes bugger all difference.

    Also, backing it all up takes an amazing amount of time, mainly because of the small size and sheer number of the files.
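    Rough numbers on why that happens, as a quick Python sketch (the 2KB average file size and 4KB cluster size are just assumptions for illustration, not measured from my actual data):

    Code:
    # Back-of-the-envelope estimate of cluster "slack" for lots of tiny files.
    # Assumed numbers: 1,000,000 files of ~2 KB each on a volume with the
    # default 4 KB NTFS cluster size.
    NUM_FILES = 1000000
    AVG_FILE_BYTES = 2 * 1024
    CLUSTER_BYTES = 4 * 1024

    def size_on_disk(file_bytes, cluster_bytes):
        """Round a file's size up to a whole number of clusters."""
        clusters = max(1, -(-file_bytes // cluster_bytes))  # ceiling division
        return clusters * cluster_bytes

    logical = NUM_FILES * AVG_FILE_BYTES
    physical = NUM_FILES * size_on_disk(AVG_FILE_BYTES, CLUSTER_BYTES)

    print("data itself : %.2f GiB" % (logical / 2.0**30))
    print("size on disk: %.2f GiB" % (physical / 2.0**30))
    print("wasted slack: %.2f GiB" % ((physical - logical) / 2.0**30))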
     
  2. looktall

    looktall Working Class Doughnut

    Joined:
    Sep 17, 2001
    Messages:
    27,463
    Not sure about the file system question, but for backing up, couldn't you run a script to zip your files up somewhere and then just back up that single file?
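    Something like this rough Python sketch (paths are made up), using an uncompressed zip so the archiving pass stays I/O-bound instead of burning CPU on JPEGs that won't compress anyway:

    Code:
    import os
    import zipfile

    # Hypothetical paths, purely for illustration.
    SOURCE_DIR = r"D:\photos"
    ARCHIVE_PATH = r"E:\backup\photos.zip"

    # ZIP_STORED = store only, no compression; allowZip64 for archives over 4 GB.
    with zipfile.ZipFile(ARCHIVE_PATH, "w",
                         compression=zipfile.ZIP_STORED, allowZip64=True) as zf:
        for root, dirs, files in os.walk(SOURCE_DIR):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, arcname=os.path.relpath(full, SOURCE_DIR))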
     
  3. OP
    lavi

    lavi Member

    Joined:
    Dec 20, 2002
    Messages:
    4,008
    Location:
    Brisbane
    Tried that; it takes even longer. Basically, processing that many little files kills performance completely.
     
  4. NSanity

    NSanity Member

    Joined:
    Mar 11, 2002
    Messages:
    18,856
    Location:
    Frankfurt, Germany
    It will kill restore speed, but backing up via NDMP or another block-level method would be the way to go imho.

    As far as the filesystem goes, no idea. Does ZFS stack up with a CIFS/SMB share?
     
  5. platinum

    platinum Member

    Joined:
    Mar 5, 2003
    Messages:
    2,038
    Location:
    Adelaide
    ReiserFS deals well with small files (and kills your wife!), but XFS might be worth looking into, assuming you're thinking of using a Linux-type partition.
     
  6. stalin

    stalin (Taking a Break)

    Joined:
    Jun 26, 2001
    Messages:
    4,581
    Location:
    On the move
    I was going to suggest ReiserFS v3... until I kept reading the post and saw he needs Windows support. Also, XFS is more suited to large files, not small ones.

    Why not mount it in a container on the filesystem (e.g. TrueCrypt): an NTFS host filesystem holding a TC container that is itself formatted NTFS inside. Don't back up the actual files, back up the TC container. Your normal access times won't improve (and CPU use will increase due to the encryption, though you can pick weak encryption to keep that down), but backup performance on the single container file will be extremely fast.

    You should be able to auto-mount the TC volume, then script your web service to load AFTER it's mounted, so it isn't pointing at files that don't exist... or just put up with "file not found" 404 errors for a couple of minutes after a reboot.
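    A rough sketch of that start-up ordering in Python (the drive letter, marker folder and service name are made up; the TrueCrypt auto-mount itself is assumed to happen elsewhere):

    Code:
    import os
    import subprocess
    import time

    MOUNT_POINT = "X:\\"                              # drive letter the TC container mounts as (hypothetical)
    SENTINEL = os.path.join(MOUNT_POINT, "photos")    # a folder known to exist inside the container
    START_WEB = ["net", "start", "PhotoWebService"]   # however your web front end gets started (hypothetical)

    # Poll until the container is actually mounted, then start the web service,
    # so it never serves 404s against an empty drive letter after a reboot.
    for _ in range(120):                              # give up after ~10 minutes
        if os.path.isdir(SENTINEL):
            subprocess.run(START_WEB, check=True)
            break
        time.sleep(5)
    else:
        raise SystemExit("TrueCrypt volume never appeared at " + MOUNT_POINT)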
     
  7. checkers

    checkers Member

    Joined:
    Jan 22, 2007
    Messages:
    613
    Location:
    Perth
    To partially reduce the space taken per file, reduce the cluster size of the disk to something like 512 bytes (the smallest NTFS allows). NTFS has both a fixed and a variable overhead component for each file though, so this will only save you so much space.

    As for the filesystem, since you're on Windows you are probably stuck. Storing the images inside a database might help; it might not.
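    As a rough Python comparison of how much the cluster size alone matters (same assumed 1,000,000 x 2KB mix as earlier in the thread, ignoring the fixed per-file overhead; the real gain depends on how many files are well under 2KB):

    Code:
    # Compare allocation slack at different NTFS allocation unit sizes.
    NUM_FILES = 1000000
    AVG_FILE_BYTES = 2 * 1024

    def on_disk(file_bytes, cluster):
        clusters = max(1, -(-file_bytes // cluster))   # ceiling division
        return clusters * cluster

    # NTFS allocation units range from 512 bytes up to 64 KB.
    for cluster in (512, 1024, 2048, 4096, 65536):
        total = NUM_FILES * on_disk(AVG_FILE_BYTES, cluster)
        print("%6d byte clusters -> %6.2f GiB on disk" % (cluster, total / 2.0**30))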
     
  8. shift

    shift Member

    Joined:
    Jul 28, 2001
    Messages:
    2,943
    Location:
    Hillcrest, Logan
    ZFS should handle small files well.

    If your access is through a web front end, then running FreeBSD/Solaris on the server shouldn't be an issue.
    Backups can be done at filesystem level, even incremental ones (via snapshot delta dumps).
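    A rough sketch of the snapshot-delta approach (the dataset name, snapshot labels and backup path are made up; it just drives the standard zfs snapshot / zfs send -i commands):

    Code:
    import datetime
    import subprocess

    DATASET = "tank/photos"                      # hypothetical pool/dataset
    today = datetime.date.today().isoformat()    # e.g. "2008-08-09"
    previous = "2008-08-08"                      # yesterday's snapshot name (placeholder)

    # Take today's snapshot (cheap and near-instant on ZFS).
    subprocess.run(["zfs", "snapshot", "%s@%s" % (DATASET, today)], check=True)

    # Dump only the blocks that changed between the two snapshots.
    with open("/backup/photos-%s-to-%s.zfs" % (previous, today), "wb") as out:
        subprocess.run(
            ["zfs", "send", "-i", "%s@%s" % (DATASET, previous), "%s@%s" % (DATASET, today)],
            stdout=out,
            check=True,
        )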
     
  9. elvis

    elvis OCAU's most famous and arrogant know-it-all

    Joined:
    Jun 27, 2001
    Messages:
    47,968
    Location:
    Brisbane
    tar was built for precisely that sort of thing some decades ago.

    The only other thing I could think of to assist you would be to store the images as binary blobs in a database. Look at sites like www.smugmug.com that host millions of images, and that's how they do it (which they discuss in their attached blogs). Backups can be done via database dump tools, however you run the risk of losing a lot of data should one part of the database file become corrupt (which is why we back things up, of course).

    The other upside to the DB option is that adding tags and metadata becomes trivial, as does replication. The downside is making it all trivially available as a filesystem (Windows/SMB export, etc.), and the extra administration overhead that comes with it.

    No real solutions there, but maybe some ideas to throw about.
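    A tiny sketch of the blobs-in-a-database idea, using Python's built-in sqlite3 purely for illustration (the schema is invented; at this scale you would be on a proper client/server database):

    Code:
    import sqlite3

    db = sqlite3.connect("photos.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS photos (
            id       INTEGER PRIMARY KEY,
            filename TEXT UNIQUE,
            tags     TEXT,
            data     BLOB        -- the image bytes themselves
        )
    """)

    def store(path, tags=""):
        """Read an image off disk and insert it as a blob with its tags."""
        with open(path, "rb") as f:
            db.execute(
                "INSERT INTO photos (filename, tags, data) VALUES (?, ?, ?)",
                (path, tags, sqlite3.Binary(f.read())),
            )
        db.commit()

    def fetch(path):
        """Return the raw image bytes for a stored filename, or None."""
        row = db.execute("SELECT data FROM photos WHERE filename = ?", (path,)).fetchone()
        return row[0] if row else None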
     
  10. OP
    lavi

    lavi Member

    Joined:
    Dec 20, 2002
    Messages:
    4,008
    Location:
    Brisbane
    The DB path is in the back of my mind... but once it hits about 20TB, backing it up will be a bit of a pain. What I can probably do is split it by date: dump everything older than a year somewhere else until it's needed. In other words, rewrite our software to use multiple databases and keep the old ones (read: over a year old) offline, since once a photo is in the DB it will never change.

    Thanks guys... never thought it would be this difficult to archive these dumb photos while keeping them online.
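    Roughly what I have in mind, as a sketch (the naming scheme and one-database-per-year split are just assumptions, not how our software works today):

    Code:
    import datetime

    # Hypothetical: one database per calendar year; anything more than a
    # year old is frozen and can be taken offline until someone asks for it.
    def database_for(taken):
        return "photos_%d.db" % taken.year

    def is_archivable(taken, today=None):
        today = today or datetime.date.today()
        return (today - taken).days > 365

    print(database_for(datetime.date(2007, 3, 14)))    # photos_2007.db
    print(is_archivable(datetime.date(2007, 3, 14)))   # True -> that database can go offline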
     
  11. checkers

    checkers Member

    Joined:
    Jan 22, 2007
    Messages:
    613
    Location:
    Perth
    Just another note, if you do go down the database route, don't reinvent the wheel - you are essentially doing what large websites already do. So get ready to learn all about database sharding :)
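    For the flavour of it, a toy shard-picking function (the shard count is invented; real setups size it for growth and handle re-sharding):

    Code:
    import hashlib

    NUM_SHARDS = 8   # hypothetical number of database shards

    def shard_for(photo_id):
        """Deterministically map a photo ID to a shard so lookups always hit the same DB."""
        digest = hashlib.md5(photo_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    print(shard_for("ABFTPD"))   # the same ID always lands on the same shard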
     
  12. hapkido

    hapkido Member

    Joined:
    Sep 20, 2003
    Messages:
    291
    Location:
    Brisbane
    Looking at the storage and backup side of this:

    I just put in a NetApp storage array (take a look at this stuff sometime :)) for a 'scientific' customer that had major problems managing lots of small files: 32 million files in a single volume, with several volumes (about 8TB of data in total) and file sizes ranging from roughly 3KB to 200KB. The data is accessed via NFS. Anyway, we are running de-dupe this weekend and I'll know the results then.

    NetApp of course offers both SAN and NAS connectivity options from within the same box - which I'm guessing you already know ;)

    As NSanity pointed out - have a look at NDMP.
    One of the major problems this customer had was backing up this scientific data. It was taking them many days for a full backup: millions of small files over a congested network. Phew.

    So for this customer, we are connecting an existing autoloader directly to the array and will be using NDMP. NDMP is just a raw block-level data dump: it's fast, and when connected directly to the storage device it is typically faster than even SAN-based backups. SAN or NAS does not matter, and NDMP is not limited to direct-attached methods (it can run over a network).

    You do need an NDMP agent in your backup system. Some backup systems can also build a file index while still doing the block-level backup. That is a little slower, of course, but it may be required if you want to restore single files; without it you would need to restore the whole volume, which is a bit of a pain for a single file.

    NDMP is not in any way restricted to NetApp storage, but NetApp does have the advantage of non-impacting, long-retention snapshots, which handle most whole-volume or single-file restores (almost immediately) before you ever need to go back to the tape/backup device.

    I also know that many of the higher-tier backup applications have a snapshot/block-level image agent to handle this same issue.

    This customer also has some new equipment that will be generating many TIFF files as well, so we will see how de-dupe works on that data.



    Regards,
    Hapkido
     
    Last edited: Aug 9, 2008
  13. Jase

    Jase Member

    Joined:
    Jun 28, 2001
    Messages:
    196
    Location:
    Sydney 2081
    When you say you played with the cluster size, did you mean the NTFS cluster size? That would have been my first place to look.

    As for backups: is the CIFS server a NAS appliance or a native Wintel box? Is the backend storage on a SAN? What is your backup software?

    Also, do you ever need to edit or delete these files, or are they WORM?
     
  14. username_taken

    username_taken Member

    Joined:
    Oct 19, 2004
    Messages:
    1,352
    Location:
    Austin, TX
    I did a similar thing for short videos about 8 years ago. I ended up settling on a two-level hashed directory structure to store the videos, with a MySQL database containing stats and metadata about them.

    Hashed directory structures are great at storing lots of small files. Basically, you have a directory for each alphanumeric character, each of those has a subdirectory for each alphanumeric character, and so on for as many levels as you need.

    File ABFTPD.jpg would be located in /A/B/ABFTPD.jpg
    File TGCV4.jpg would be located in /T/G/TGCV4.jpg

    Postfix (the mail server) uses a similar directory structure for the emails in its various mail queues. It's very easy to traverse manually or with a script, as the file name contains all the information needed to determine its pathname. You can even span multiple volumes or servers pretty easily by splitting the directories (say A-G and H-Z) across them.
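    A little sketch of that layout (the base path is made up; it assumes file names are at least as long as the number of levels):

    Code:
    import os

    BASE_DIR = "/srv/photos"   # hypothetical root of the store
    LEVELS = 2                 # two directory levels, as described above

    def hashed_path(filename):
        """Derive the storage path from the first characters of the file name itself."""
        name = os.path.basename(filename)
        parts = [name[i].upper() for i in range(LEVELS)]
        return os.path.join(BASE_DIR, os.path.join(*parts), name)

    print(hashed_path("ABFTPD.jpg"))   # /srv/photos/A/B/ABFTPD.jpg
    print(hashed_path("TGCV4.jpg"))    # /srv/photos/T/G/TGCV4.jpg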
     
  15. MudBlood

    MudBlood Member

    Joined:
    May 17, 2004
    Messages:
    1,215
    Location:
    Mildura, Victoria
    If you have all your directory names set like:

    c:\blah\blah\photos_of_trip\day_1

    and not like:
    c:\blah\blah\photos of trip\day 1

    couldn't you just use xcopy to back them up to an external location? Then only the newly modified files in the source location get copied across to the destination (xcopy's /d switch does that).

    Even if you are using a web service to store them, grab SftpDrive, set it to map a drive on your computer, and point xcopy at that.

    Set it to run at a certain time of the day/night with the Windows scheduler?

    It's what I do for pics/docs/etc that I need to have backups of...
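    If you'd rather script it than rely on xcopy switches, a minimal mtime-based incremental copy sketch (paths are made up):

    Code:
    import os
    import shutil

    SOURCE = r"C:\photos"          # hypothetical source tree
    DEST = r"Z:\backup\photos"     # hypothetical backup location

    for root, dirs, files in os.walk(SOURCE):
        target_dir = os.path.join(DEST, os.path.relpath(root, SOURCE))
        if not os.path.isdir(target_dir):
            os.makedirs(target_dir)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_dir, name)
            # Copy only files that are new or newer than the existing backup copy.
            if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
                shutil.copy2(src, dst)

    shutil.copy2 preserves timestamps, which is what makes the newer-than check work on the next run.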

    Muddy
     
  16. username_taken

    username_taken Member

    Joined:
    Oct 19, 2004
    Messages:
    1,352
    Location:
    Austin, TX
    Mate, I think there's a slight difference of scale.
     
  17. OP
    lavi

    lavi Member

    Joined:
    Dec 20, 2002
    Messages:
    4,008
    Location:
    Brisbane
    They're not home photos lol, and yeah, it's one thing to back up 500 gig and another to back up 6-7 TB.
     
  18. bugayev

    bugayev Whammy!

    Joined:
    May 15, 2003
    Messages:
    4,092
    Location:
    Melbourne
    I'd seriously look at the Oracle CMSDK. I've used it in multi-terabyte content stores without any issue at all, and because you can connect in a variety of ways, multi-platform access can be a real cinch.

    You then get the advantages of using technologies like RMAN to perform differential backups or high-speed online backups even past the ten terabyte barrier.
     
  19. r3sist4nce

    r3sist4nce Member

    Joined:
    Oct 22, 2007
    Messages:
    25
    You will find xcopy will crash on that volume of data. xxcopy will work... but it requires a license to back up to a network drive.
     
