I think that depends on the use case and the full data lifecycle. How is data acquired and added to datasets? How is it consumed? An online, database-like product matters most if you need to coordinate concurrent activity by a number of data producers and consumers while maintaining a coherent view of the growing data. If you can break the work into distinct phases with individual actors, passive serialization formats can make more sense.

Adoption of object-storage semantics would help eliminate some of the corruption/concurrency hazards mentioned in Cyrille's post. You write entire files and expose them to read-only consumers once they are valid and complete, side-stepping concurrent writer/reader scenarios.
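As a rough illustration of that write-then-publish discipline, here is a minimal Python sketch for a local filesystem (the function name and paths are hypothetical): the object is staged as a temporary file and only exposed to readers once it is complete, via an atomic rename.

```python
import os
import tempfile

def publish_object(data: bytes, final_path: str) -> None:
    """Write a complete object, then atomically expose it to readers."""
    dir_name = os.path.dirname(final_path) or "."
    # Stage in the same directory so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make the bytes durable before publishing
        os.replace(tmp_path, final_path)  # atomic publish; never mutated afterwards
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Readers either find the finished object at `final_path` or nothing at all. An object store's put-then-get semantics give you the same property without the rename trick.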
However, object storage still has coherence problems. If you expect metadata to need rounds of editing or curation, you don't want it embedded in bulk objects, where mutation is expensive: it means rewriting an entirely new version of the object. A companion-file strategy makes this easier, since you can regenerate smaller metadata files alongside immutable bulk data files. The object store can ensure that concurrent users only ever encounter a coherent snapshot of each metadata file, but it does not provide coherent views of multi-file changes. You avoid low-level codec failures from concurrent writes to shared files, yet you may still see semantic inconsistencies when interpreting the evolving multi-object dataset.

A hybrid approach would be to have some other data-management system with database-like properties track the objects. The database stores a coherent catalog of which data objects are available. New objects are added and registered in the catalog for others to read, but never mutated afterwards. The catalog should contain the metadata needed to coordinate the workflows that produce and consume objects. You could benefit from using the catalog as the authoritative store of metadata that is still being curated/refined, but you can also export snapshots of metadata into companion objects at appropriate milestones in the processing workflow. These could even be versioned to record well-defined "data releases" or provenance info for individual processing tasks.
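To make the catalog idea concrete, here is a minimal sketch using SQLite; the table and column names are invented for illustration, not a reference design.

```python
import sqlite3

# Hypothetical minimal catalog: one row per immutable object, plus the
# curated metadata needed to coordinate producers and consumers.
conn = sqlite3.connect("catalog.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS objects (
        object_id   TEXT PRIMARY KEY,   -- e.g. an S3 key or file path
        kind        TEXT NOT NULL,      -- e.g. 'image_stack', 'trace_array'
        created_at  TEXT NOT NULL,
        release_tag TEXT,               -- set when a "data release" snapshots metadata
        metadata    TEXT                -- curated JSON; refined here, not in the object
    )
""")

def register_object(object_id: str, kind: str, metadata_json: str) -> None:
    """Register a newly published, immutable object for consumers to find."""
    conn.execute(
        "INSERT INTO objects (object_id, kind, created_at, metadata) "
        "VALUES (?, ?, datetime('now'), ?)",
        (object_id, kind, metadata_json),
    )
    conn.commit()
```

The bulk data never changes once registered; only the catalog rows (and any exported metadata snapshots) evolve.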
Of course, this still leaves the question of what file format(s) to use for the individual objects in the system. One could do all of the above and still choose to use HDF5! But these individual files would be simpler, e.g. storing a single N-dimensional numerical array per file. Microscopy projects might choose OME-TIFF for individual image stacks, or something even more mundane like a directory of individual 2D TIFF or PNG files, if that maps well to how their acquisition or processing pipeline works. Sometimes the right file layout is just as important as choosing chunking parameters within a format like HDF5: naive pipelines often process whole files in RAM, so you end up choosing file layouts that align with the kind of sparse or sequential access your data producer and consumer programs need.
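For instance, a simple one-array-per-file HDF5 object might look like the sketch below (file name, axis labels, and chunk shape are illustrative; in practice the chunking should follow your actual read pattern).

```python
import h5py
import numpy as np

# One simple, self-describing object per file: a single N-dimensional
# array plus a few attributes describing it.
stack = np.random.rand(200, 512, 512).astype(np.float32)  # e.g. z, y, x

with h5py.File("stack_0001.h5", "w") as f:
    dset = f.create_dataset(
        "data",
        data=stack,
        chunks=(1, 512, 512),  # whole 2D planes: matches plane-by-plane reads
        compression="gzip",
    )
    dset.attrs["axes"] = "zyx"
    dset.attrs["units"] = "micrometers"
```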
Relational DBs aren't great for time-series data, which is why solutions like TimescaleDB, InfluxDB, and other time-series databases exist. Of course, a relational DB can work perfectly fine if you don't have too much data. The fundamental problem with time-series data is often that the insert pattern is the exact opposite of the typical retrieval pattern. For example, you may insert thousands of properties at once (most efficiently stored as interleaved data), yet typical access patterns involve retrieving a single property over many timestamps. Even so, it can work fine despite the inefficiency: if you have a few hundred sensors sampled every minute, storing data for a month, it would likely be fine.
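A small NumPy sketch of that mismatch, with invented sizes: each insert is one contiguous row of interleaved properties, while reading one property back is a strided scan over every row.

```python
import numpy as np

n_timestamps, n_properties = 10_000, 1_000

# Insert pattern: one row per timestamp, all properties interleaved.
# Row-major (C) layout makes each insert a single contiguous write.
samples = np.zeros((n_timestamps, n_properties), dtype=np.float32)
samples[0, :] = np.random.rand(n_properties)  # a new reading arrives

# Retrieval pattern: one property across all timestamps. In this layout
# the read is strided -- it touches every row to collect one column.
trace = samples[:, 42]

# A column-major (Fortran-order) copy flips the trade-off: per-property
# reads become contiguous, but each timestamped insert is now scattered.
by_property = np.asfortranarray(samples)
trace_contiguous = by_property[:, 42]
```

Time-series databases essentially manage this transposition (along with time-based partitioning) for you, which is why they outperform a naive relational schema once the data volume grows.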