Flat files are great for storing small amounts of data when you don’t need to look things up all the time. You can just parse each line into an array of objects and iterate through them. However, you run into a problem when you try to load a flat file of multiple GBs into memory, since you are limited by system RAM. There are ways to get around this, such as reading the file one line at a time instead of loading it all at once. Or you could just keep running it on a bigger and bigger machine.
The other caveat is that lookups will always be limited to a time complexity of O(N), because finding a record means scanning through the file until you hit it.
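As a rough illustration of the two points above, here is a minimal Python sketch. The file contents, column names, and function names are all made up for the example. It shows both the "load everything into memory" approach and a streaming variant that avoids the RAM limit, and in both cases the lookup is still a linear scan.

```python
import csv
import io

# Hypothetical flat file contents; in practice this would be a file on disk.
FLAT_FILE = """id,name,city
1,Alice,Toronto
2,Bob,Montreal
3,Carol,Vancouver
"""

def load_records(handle):
    """Parse every line into a dict and keep them all in memory."""
    return list(csv.DictReader(handle))

def find_by_name(records, name):
    """Linear scan: in the worst case every record is inspected, so O(N)."""
    for record in records:
        if record["name"] == name:
            return record
    return None

def stream_find_by_name(handle, name):
    """Same O(N) scan, but reads one line at a time instead of the whole file."""
    for record in csv.DictReader(handle):
        if record["name"] == name:
            return record
    return None

records = load_records(io.StringIO(FLAT_FILE))
match = find_by_name(records, "Carol")
streamed = stream_find_by_name(io.StringIO(FLAT_FILE), "Bob")
```

Streaming fixes the memory problem but not the speed problem: every lookup still starts from the top of the file.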
Fundamentally, databases are just a bunch of files that exist partly on disk and partly in RAM. The difference between them and flat files comes from their ability to index data. A database index is similar to the index you would find at the back of a textbook. Someone has saved you the time of looking through the whole textbook and has created a map (index) of where everything is, which is really useful when you are trying to search through your data set.
Databases do a whole lot more, but indexing is one of the key things they do.
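To make the textbook analogy concrete, here is a small Python sketch of an index. It is only an illustration: the records and the `build_index` and `lookup` helpers are hypothetical, and a plain dictionary stands in for the on-disk structures (commonly B-trees) that real databases use.

```python
# Hypothetical records, as they might look after parsing a flat file.
records = [
    {"id": 1, "name": "Alice", "city": "Toronto"},
    {"id": 2, "name": "Bob", "city": "Montreal"},
    {"id": 3, "name": "Carol", "city": "Vancouver"},
]

def build_index(rows, key):
    """Map each value of `key` to the positions of the rows that contain it."""
    index = {}
    for position, row in enumerate(rows):
        index.setdefault(row[key], []).append(position)
    return index

def lookup(rows, index, value):
    """Jump straight to the matching rows instead of scanning every one."""
    return [rows[position] for position in index.get(value, [])]

city_index = build_index(records, "city")
montreal_rows = lookup(records, city_index, "Montreal")
```

Building the index costs one pass over the data, but after that each lookup goes directly to the right rows, much like flipping to a page number from the back of the book.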
Relational databases are one of the oldest and most widely used forms of data store. Most companies, if not all, use them. What relational databases do is model complex data as a set of tables and the relationships between them, and then give you a very straightforward and powerful query language (SQL) to get information out.
For example, let’s say you work for a company that runs warehouses, and each warehouse keeps a record of what goes out and what comes in. If you tried to store this in a flat file, you would have to write a lot of custom code to manage these relationships and to aggregate data about how much comes in and out of each warehouse.
Whereas with a relational database, you would have one table for warehouses and one for shipments, where the shipments table holds the ID of the warehouse each shipment was sent from. That way we could run a query like the following:
SELECT warehouse_id, COUNT(id) FROM shipments GROUP BY warehouse_id;
And get the number of shipments per warehouse ID (here warehouse_id stands in for whatever the column holding the warehouse ID is actually called).
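Here is a runnable sketch of the warehouse example using Python's built-in `sqlite3` module with an in-memory database. The table layout, the `warehouse_id` and `shipmenttype` column names, and the sample rows are assumptions made for illustration, not a real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: each shipment points at the warehouse it came from.
conn.executescript("""
    CREATE TABLE warehouses (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE shipments (
        id INTEGER PRIMARY KEY,
        warehouse_id INTEGER REFERENCES warehouses(id),
        shipmenttype TEXT
    );
    INSERT INTO warehouses (id, name) VALUES (1, 'North'), (2, 'South');
    INSERT INTO shipments (warehouse_id, shipmenttype) VALUES
        (1, 'inbound'), (1, 'outbound'), (1, 'outbound'), (2, 'inbound');
""")

# Number of shipments per warehouse.
counts = conn.execute(
    "SELECT warehouse_id, COUNT(id) FROM shipments "
    "GROUP BY warehouse_id ORDER BY warehouse_id"
).fetchall()

# Same idea, broken down further by shipment type.
by_type = conn.execute(
    "SELECT warehouse_id, shipmenttype, COUNT(id) FROM shipments "
    "GROUP BY warehouse_id, shipmenttype "
    "ORDER BY warehouse_id, shipmenttype"
).fetchall()

conn.close()
```

Note that the aggregation that would need custom code with a flat file is a single line of SQL here, and adding the breakdown by type only means adding one more column to the GROUP BY.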
In relational databases, all operations happen inside transactions. You can think of transactions as units of work.
The most important feature of relational databases is that they are ACID compliant.
ACID stands for:
Atomicity – “All or nothing”, as in all the work in a transaction must succeed. If it does not succeed then all the changes it made are rolled back.
Consistency – Only valid data can exist in the database. Data in the database must follow the rules that have been defined. If a change breaks those rules, it will be rolled back.
Isolation – Transactions behave as if they happened one after another, meaning concurrent reads and writes to the DB will not affect each other. This does not mean two operations can’t happen at the same time. Multiple transactions can run at the same time as long as they don’t affect each other.
Durability – Once your transaction succeeds you can rest easy knowing your data won’t disappear if the server dies.
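A small sketch of atomicity and consistency in practice, again using Python's `sqlite3` module. The `stock` table, the CHECK rule, and the `move_stock` helper are hypothetical. The point is only that when the second half of a transfer breaks a rule, the first half is undone as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rule: a warehouse can never hold a negative quantity.
conn.execute(
    "CREATE TABLE stock ("
    "warehouse TEXT PRIMARY KEY, "
    "quantity INTEGER NOT NULL CHECK (quantity >= 0))"
)
conn.execute("INSERT INTO stock VALUES ('North', 10), ('South', 0)")
conn.commit()

def move_stock(connection, source, target, amount):
    """Move stock between warehouses; both updates commit or neither does."""
    try:
        with connection:  # commits on success, rolls back on an exception
            connection.execute(
                "UPDATE stock SET quantity = quantity + ? WHERE warehouse = ?",
                (amount, target),
            )
            connection.execute(
                "UPDATE stock SET quantity = quantity - ? WHERE warehouse = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK rule was violated, so the whole transfer was rolled back.
        return False

ok = move_stock(conn, "North", "South", 4)         # valid transfer
rejected = move_stock(conn, "North", "South", 50)  # would make North negative
final = dict(conn.execute("SELECT warehouse, quantity FROM stock"))
```

In the rejected transfer, the update that adds stock to the target runs first and succeeds on its own, yet it does not survive, because the transaction as a whole failed.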
Hadoop & HBase (For BIGG data!)
When you have large amounts of data (petabytes and exabytes of it), you run into a number of special problems that can’t easily be solved, such as spreading your data across different machines, replicating it, and running operations on it. Google tackled these problems a while ago with GFS (the Google File System), and the open source incarnation of that idea is called “Hadoop”.
Hadoop is not a database. Rather, it is a framework built around a distributed file system (HDFS) that is modeled on GFS, and it lends itself really well to performing computations across machines.