Website: http://www.flickerdown.com/
Original page: http://flickerdown.com/2009/08/micro-burst-metadata/
In the end you can never hope to capture all the "metadata" in the object so you are going to be forced to deal with your option 2 anyway.
To me, it's all just linked-data anyway....
thanks... in the end, I'd definitely agree with the "it's all linked data anyway" sentiment, but the method of linkage is what's important. If any of those links "break" you've effectively "lost" or "isolated" data that, depending on its importance to your business, could be critical. *shrug* Hopefully, I'll be able to dig a bit deeper into Atmos' method of metadata linkage.
cheers,
Dave
1) the domain hosting the link is gone
2) the network is down
3) the "linked-data" resource has been destroyed/deleted
4) the "linked-data" resource has been migrated/moved/archived
Most of these point to a more interesting question of ownership, not linkage. If you assume you "own" all the data, then you either have control of the above situations or you don't. If you don't have control and you need to mitigate, the typical strategy is to "copy" the data locally. IMO, that has just as many downsides as "lost" or "broken" links (perhaps more). Missing data is better than stale/bad/incorrect data that is out of sync with the "authoritative" data.
interesting note on ownership to which I'd say that there has to be dual ownership, one from the system level (with immutable meta such as creation date, etc.) as well as mutable data (e.g. user generated meta). The meta db then needs to maintain and track 2 different levels. Policy can affect either, fwiw.
dave
I would argue that "linked-data" requires polar opposite semantics, where the user of the data needs to assume that data will be missing and inaccessible (read: 404). The ownership point I was making before is part of this discussion: can I assume that I can make geographically distributed updates (or even just validations of garbage collection)? Do I have the rights? Even if I have the rights, can I tolerate the latency?
IMO, in this world we need to move to "BASE" semantics (Basically Available, Soft state, Eventually consistent). It is more like the way the web works today, and isn't that the point of Cloud storage?
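To make the "assume it will 404" point concrete, here is a minimal sketch in Python of a consumer that records a missing link as missing instead of failing or falling back to a stale copy. Everything in it (the `obj://` URIs, `LinkUnavailable`, `demo_fetch`) is hypothetical and only illustrates the idea; it is not any product's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


class LinkUnavailable(Exception):
    """Raised by a fetcher when a linked resource cannot be reached (e.g. a 404)."""


@dataclass
class Resolved:
    uri: str
    value: Optional[str]  # None means "missing right now"
    available: bool


def resolve_links(uris: List[str], fetch: Callable[[str], str]) -> List[Resolved]:
    """Resolve each link, recording failures instead of aborting the whole read."""
    results = []
    for uri in uris:
        try:
            results.append(Resolved(uri, fetch(uri), True))
        except LinkUnavailable:
            # Treat the link as missing rather than substituting a stale copy.
            results.append(Resolved(uri, None, False))
    return results


# Hypothetical in-memory stand-in for the remote side: one resource exists,
# the other has been deleted or moved.
_store: Dict[str, str] = {"obj://a/meta": "owner=dave"}


def demo_fetch(uri: str) -> str:
    if uri not in _store:
        raise LinkUnavailable(uri)
    return _store[uri]


results = resolve_links(["obj://a/meta", "obj://b/meta"], demo_fetch)
for r in results:
    print(r.uri, "->", r.value if r.available else "missing (retry later)")
```

The caller gets a partial answer plus an explicit list of what was unavailable, which is roughly what "basically available" asks of a client.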
Having separate metadata (such as in a database) allows the metadata to be queried much more easily than if the metadata is stored with the data. But as you point out, there are trade-offs all around... Perhaps a hybrid approach is best?
Definitely think that hybrid models work especially as you look to bridge between "classic" block 'n file environments (where metadata is 99% system generated) and object environments where the emphasis is on user-generated content NOT system meta. (Objects, by nature, redact the meaning of the underlying file and thus require descriptors).
dave
-- mark
thanks for the comments! I'm very well aware of the XAM standard as well as the (in process) SNIA Cloud Storage Standard (am a member of EMC's SNIA group; though silent) and frankly, I see great things being done there.
As I've said elsewhere (the first "micro-burst"), these articles are meant to be more Socratic than anything else, as I'm really not trying to choose one particular position over the other. In cFS (cloud file system) terms, the capability to expand beyond a limited NAS role has to be tied to forward-looking object-based storage (imho, of course). Replicating the metadata is obviously part of that process and, in the Atmos world, we're doing synchronous metadata replication along with background consistency checking for "issues" that can arise from running separate and discrete object/meta repositories.
as always, keep up the constructive comments!
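For readers wondering what a background consistency check between separate object and metadata repositories might look like, here is a minimal sketch in Python. It assumes both repositories can be listed by object ID; the function name, the dictionaries, and the report keys are all made up for illustration and say nothing about how Atmos actually does it.

```python
def check_consistency(metadata: dict, objects: dict) -> dict:
    """Compare the IDs known to the metadata store with those in the object store."""
    meta_ids = set(metadata)
    object_ids = set(objects)
    return {
        # Metadata records whose object is gone: a "broken link".
        "dangling_metadata": sorted(meta_ids - object_ids),
        # Objects with no metadata record: stored, but effectively unfindable.
        "undescribed_objects": sorted(object_ids - meta_ids),
    }


# Hypothetical repositories that have drifted apart.
metadata = {"obj-1": {"owner": "dave"}, "obj-2": {"owner": "mark"}}
objects = {"obj-2": b"payload-2", "obj-3": b"payload-3"}

report = check_consistency(metadata, objects)
print(report)
```

A real checker would also need to decide what to do with each finding (repair, quarantine, alert), which is where policy comes in.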
The challenges of the second model have been encountered for years in the Enterprise Content Management world. How do you keep the metadata in your database in sync with content on the SAN for backups when users access the system all the time? It is a headache, but not an unknown one.
There are systems that can do this though. If you just want a good scalable database to house the metadata in, look at xDB, formerly X-Hive. Conveniently enough, owned by EMC.
-Pie
Thanks a lot for the feedback! I agree that there's potentially greater flexibility in running meta separate from object and, while I'm not intimately familiar with ECM and its requirements, I'm becoming more aware as time moves on. (thanks to the likes of Craig Randall and others in Documentum!) ;) I know we're doing some REALLY cool things with some of the other groups within EMC using Atmos and Atmos Online (that I can't discuss right now...stay tuned!) and I think that these two services (using a common REST API) are just the ticket for keeping meta and data functional, flexible, and powerful!
cheers,
Dave
Anyway, to Metadata. It is not all about protection schemes etc.; most of the value from metadata comes from the ability to search. To find or not to find, etc. It's the bit that interests me most anyway.
1. Wrapping it around the data implies proprietary and thus platform and vendor tied. For archive this won't do. Your data should outlive your vendor.
2. Metadata is a side-dish best served cold. OK, if you have metadata in a DB loosely coupled to the data itself, then there will be an inevitable split one day. The old adage of "If you can't find it you do not have it" rings true here. Lose the DB, lose the data. Potentially.
3. Metadata Objects. Going for a bit of metadata on the side? Then why not tie the metadata to the object so that they will always live on the same archive media (same node, tape etc). Build an in-memory DB on each node that looks after search for its own data. Cluster those nodes. Have those nodes do fancy self-healing/failover. Clustered distributed search.
4. Filesystem support. File systems like XFS support metadata for files on the FS. Pretty sure ZFS offers the same. Data and Metadata tied together using the FS. Open(ish)
5. Hybrid. Go for a 3 + 4 cocktail. Use some open format (XML, Java Class) to store your MD as a file on the FS. Tie the file to the Content (GUIDs etc). Have that backed up by using an FS that supports MD natively.
Just my musings..
om_nick
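As a rough illustration of the "open format on the FS, tied to the content by GUID" idea in point 5, here is a small Python sketch that writes content and an XML sidecar sharing one GUID. The file naming (`<guid>.bin`, `<guid>.xml`), the element names, and the helper functions are assumptions made for the example, not a standard or any vendor's layout.

```python
import tempfile
import uuid
import xml.etree.ElementTree as ET
from pathlib import Path


def write_with_sidecar(directory: Path, content: bytes, metadata: dict) -> str:
    """Store content and an XML metadata sidecar under the same GUID."""
    guid = str(uuid.uuid4())
    (directory / f"{guid}.bin").write_bytes(content)
    root = ET.Element("metadata", attrib={"guid": guid})
    for key, value in metadata.items():
        ET.SubElement(root, "field", attrib={"name": key}).text = str(value)
    ET.ElementTree(root).write(
        str(directory / f"{guid}.xml"), encoding="utf-8", xml_declaration=True
    )
    return guid


def read_sidecar(directory: Path, guid: str) -> dict:
    """Load the sidecar and verify it really belongs to this GUID."""
    root = ET.parse(str(directory / f"{guid}.xml")).getroot()
    if root.get("guid") != guid:
        raise ValueError("sidecar does not match content GUID")
    return {field.get("name"): field.text for field in root.findall("field")}


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    guid = write_with_sidecar(folder, b"frame data", {"title": "Demo clip", "creator": "example"})
    loaded = read_sidecar(folder, guid)
    print(loaded)
```

Because the sidecar is plain XML sitting next to the content, it travels with the data onto whatever archive media the files land on, which is the portability argument being made above.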
definitely love the points you've made! One of the bigger issues I see out there is block/file storage companies trying to "trespass" and reinvent themselves as "cloud storage providers" without a thought (or maybe it's just not apparent yet) for how to accomplish this. The beauty of the Atmos product family is that we've tackled some of these proposals (e.g. your point #3) by using a self-healing filesystem that runs consistency checks to ensure appropriate linkage between meta and db. From the ground up, we designed this for scalable, large/small object storage and put a heck of a lot of effort into making that data accessible. ;)
again, thanks for your comments!
dave
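For anyone trying to picture point #3 (each node indexing its own objects in memory, with search fanned out across the cluster), a toy Python sketch follows. The `Node` class and `cluster_search` function are hypothetical names for illustration only; this is not a description of Atmos internals, and it leaves out the self-healing and failover parts entirely.

```python
from collections import defaultdict


class Node:
    """One storage node holding objects plus an in-memory index of their metadata."""

    def __init__(self, name):
        self.name = name
        self.objects = {}               # object_id -> metadata dict
        self.index = defaultdict(set)   # (key, value) -> set of object_ids

    def put(self, object_id, metadata):
        self.objects[object_id] = metadata
        for key, value in metadata.items():
            self.index[(key, value)].add(object_id)

    def search(self, key, value):
        # .get avoids creating empty index entries for failed lookups.
        return set(self.index.get((key, value), set()))


def cluster_search(nodes, key, value):
    """Scatter the query to every node and gather (node, object_id) hits."""
    hits = []
    for node in nodes:
        for object_id in node.search(key, value):
            hits.append((node.name, object_id))
    return sorted(hits)


node_a, node_b = Node("node-a"), Node("node-b")
node_a.put("obj-1", {"type": "video"})
node_a.put("obj-2", {"type": "audio"})
node_b.put("obj-3", {"type": "video"})

hits = cluster_search([node_a, node_b], "type", "video")
print(hits)
```

Since each index lives on the same node as the objects it describes, losing a node loses that node's index and data together rather than leaving one without the other.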
Glad you tackled Metadata, as for us it is becoming as important as the essence itself.
Additionally, you're right in saying that this isn't just a cloud problem. I look at the previous work I've done with SAN-based products and we're so used to looking at data as 01010101 versus a more descriptive method of understanding. I think Tom Maguire made the point earlier that a movie file (for example) is nothing but a file name in a file system until you start describing it and providing characterization against it for your programs. (since I know you do stuff with Final Cut Pro). sorry my thoughts this morning are a little less collected than yesterday but I appreciate the dialogue!
cheers,
Dave
In the digital world, ECM, records management and other applications store the most important metadata in a database, and the file is stored in some folder. The context of that file in some folder is lost if that database is lost or disappears after a period of years. However, storing more descriptive metadata along with the file allows that context to persist over time, without worrying about losing the ability to easily determine the value of a file. It's what I like to call Content in Context. The question of how much metadata to store in an object becomes a question for the organization and its requirements for information management. There are a number of industry standards for metadata that can be used for specific types of content, such as the NBII Biological Metadata Standard, the Content Standard for Geospatial Metadata and Dublin Core (there are many more). Perhaps only a subset of the metadata standard is required, or maybe all of it. That's a level of flexibility information/records managers should have. A lot of content is going to be kept for decades, and keeping that context alive is critical. The hybrid model makes a lot of sense at this point in time, and I expect new applications will emerge that take advantage of object stores, allowing dynamic views/organization of content/files.
Derek Gascon
VP Marketing, Caringo, Inc.