The Big Wide World of Data Stores

Flat Files

Great for storing small amounts of data that you don’t need to look up all the time. You can just serialize each line into an object and iterate through the results. However, you run into a problem when you try to dump a flat file that’s multiple GBs into memory, since you get limited by system RAM. There are ways around this (like streaming the file), or you could just keep running it on a bigger and bigger machine.

The other caveat is that lookups will always have a time complexity of O(n), since you have to scan the whole file.
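If the file really is too big for RAM, you can process it as a stream instead of loading it all at once; here is a minimal sketch in Python (the file name and record format are made up for illustration):

```python
import json

# BAD for big files: reads the entire file into RAM at once
# records = [json.loads(line) for line in open("events.log")]

# BETTER: stream one line at a time; memory use stays flat
# no matter how big the file gets
def count_errors(path):
    errors = 0
    with open(path) as f:
        for line in f:  # the file object yields lines lazily
            if json.loads(line).get("level") == "error":
                errors += 1
    return errors
```

The trade-off from the paragraph above still applies: every lookup is a full O(n) scan.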


Databases

Fundamentally, databases are just a bunch of files that exist partly on disk and partly in RAM. The difference between them and flat files comes from their ability to index data. An index is similar to the index you would find at the back of a textbook: someone has saved you the trouble of looking through the whole book by creating a map (the index) of where everything is. Which is really useful when trying to search through your data set.

Databases do a whole lot more, but indexing is one of the key things they do.
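The textbook-index idea in miniature, sketched in Python with a plain dict standing in for the index (the records are made up):

```python
# Without an index: scan every record -- O(n) per lookup
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "c@example.com"},
]

def find_slow(email):
    for r in records:
        if r["email"] == email:
            return r

# With an index: build the "map of where everything is" once...
index = {r["email"]: r for r in records}

# ...then every lookup is a single hash probe -- O(1)
def find_fast(email):
    return index.get(email)

print(find_fast("b@example.com")["id"])  # -> 2
```

Real database indexes are usually B-trees rather than hash maps, but the payoff is the same: you trade some space and upfront work for much faster lookups.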

Relational Databases

These databases are one of the oldest and most widely used forms of data stores. Most, if not all, companies use them. What relational databases do is model complex data as a set of tables and the relationships between them, and then give you a very straightforward and powerful query language to get information out.

For example, let’s say you work for a company that runs warehouses, and each warehouse keeps a record of what goes out and what comes in. If you tried to store this in a flat file, you would have to write a lot of custom code to manage these different relationships and aggregate data about how much comes in and out of each warehouse.

Whereas with a relational database, you would have a warehouse table and a shipments table, where each row in the shipments table carries the ID of the warehouse it was sent from. That way we could run a query like the following:

SELECT warehouse_id, COUNT(id) FROM shipments GROUP BY warehouse_id

And get the number of shipments by warehouse ID.
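Here is that idea as a runnable sketch using Python’s built-in sqlite3; the table layout and sample rows are my own illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shipments (id INTEGER PRIMARY KEY, warehouse_id INTEGER, shipment_type TEXT)"
)
conn.executemany(
    "INSERT INTO shipments (warehouse_id, shipment_type) VALUES (?, ?)",
    [(1, "inbound"), (1, "outbound"), (1, "inbound"), (2, "outbound")],
)

# The database does the aggregation for us -- no custom flat-file code needed
query = """
    SELECT warehouse_id, COUNT(id)
    FROM shipments
    GROUP BY warehouse_id
    ORDER BY warehouse_id
"""
for warehouse_id, total in conn.execute(query):
    print(warehouse_id, total)  # -> "1 3", then "2 1"
```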

In relational databases, all operations are transactions. You can think of transactions as units of work.

The most important feature of relational databases is that they are ACID compliant.

ACID stands for:

Atomicity – “All or nothing”, as in all the work in a transaction must succeed. If it does not succeed then all the changes it made are rolled back.

Consistency – Only valid data can exist in the database. Data in the database must follow the rules defined, if not changes will be rolled back.

Isolation – Transactions happen in order, meaning reads and writes to the DB will not affect each other. This does not mean two operations can’t happen at the same time: multiple transactions can run concurrently as long as they don’t affect each other.

Durability – Once your transaction succeeds you can rest easy knowing your data won’t disappear if the server dies.
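Atomicity and consistency are easy to see in action with Python’s built-in sqlite3 (a toy schema of my own — a transfer that would overdraw an account):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except sqlite3.IntegrityError:
    # The CHECK rule fired (consistency), so BOTH updates were rolled back (atomicity)
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
print(balances)  # -> {'alice': 100, 'bob': 0}
```

Neither half of the transfer survives: it was all or nothing.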

Hadoop & HBase (For BIG Data!)

When you have large amounts of data (petabytes and exabytes of it), you run into a number of special problems that can’t easily be solved: scaling your data out across different machines, replicating it, and running operations on it. Google solved these problems a while ago with GFS (the Google File System), and the open-source incarnation of it is called “Hadoop”.

Hadoop is not a database. Rather, it is a file system based on GFS that lends itself really well to performing computations across machines. HBase is the database that runs on top of Hadoop, modeled after Google’s Bigtable.

Why Stored Procedures Don’t Scale

Disclaimer: When I say “Scale” I mean a couple of ideas:

  • Accommodating large increases in load with less than a couple of hours’ worth of work.
  • Growing your technology team up to hundreds, or thousands, of application developers.
  • The cost of your hardware infrastructure and licensing does not correlate linearly with how much your product gets used. Rather, the relationship should be log(n), as in it should taper off.

Also, this post deals with proprietary database management systems. Specifically, I mention SQL Server; however, the same can be said about Oracle and others.

Now, with that out of the way, let’s get going.

Stored procedures don’t scale (in terms of people, cost, and infrastructure)

And here’s why:

Points “1” and “2” won’t really matter if you work in an industry that has more money than it knows what to do with, and/or has no competition, like government agencies, hospitals, and big telecom.

1) The supply of people who know TSQL, and know it well, isn’t as large as the number of people who know modern object-oriented programming languages like Java, Python, etc. If you start dumping business logic into TSQL that sits in a proprietary SQL Server database, you are going to have to start hiring DBAs who specialize in SQL Server, which is expensive.

2) If you don’t hire people who specialize in TSQL and SQL Server tuning, you’re going to have to increase the size of the hardware that SQL Server runs on. However, since SQL Server is proprietary software, its licensing is tied to the amount of hardware it runs on. So, if you want to run a bunch of poorly optimized TSQL on bigger infrastructure, you must pay more for both the additional hardware and the licensing (since it’s per core).

Holy moly, that’s expensive [1]

3) Let’s say you’re okay with both points “1” and “2”. There is still going to be a hard limit, around 128 cores and a terabyte of RAM, on the machine you can run a SQL Server instance on. At that point, if the TSQL is not optimized and your data set is big enough, you’re still not going to have lightning-fast query times. Which really means that dumping core business logic into TSQL for performance reasons was just covering up an underlying problem of bad schema design.

4) When you use TSQL stored procedures, everything is inside SQL Server. Now, there are some problems with that:

4 – A) Your business logic does not live in Git, its changes are not tracked, and there is no audit log of who changed what, when. You may say, “My stored procs do live in Git, inside a DB project”, and that may be true. However, that does not guarantee that what’s in source control matches what’s on any of the SQL Server instances you may have, since someone could always log in to SQL Server and change a stored proc without anyone knowing. To stop this from happening, you could have an experienced DBA act as the guardian of your SQL Servers, or you could buy one of Redgate’s products for seamless change deployments; however, that again is expensive.

4 – B) If you have multiple “Staging”, “Prod”, and “Dev” SQL Server instances, you can never be 100% sure that all the stored procs are the same across the environments, or reliably determine how out of date a stored proc is.

4 – C) Using TSQL a lot leads to abusing TSQL (unless you have good DBAs), with no checks or balances in your data tier: cross-DB actions, synonyms, sending out emails, doing large ETL jobs, creating functions, etc. At this point the database stops being your data tier and pretty much becomes your application tier. Except it’s a lot more expensive to maintain and run (SQL Server licensing and DBAs) than a web app that does the same sort of stuff.

[1] – https://redmondmag.com/articles/2019/11/08/sql-server-2019-licensing.aspx

Key Performance Counters for Windows Server using Performance Monitor

Every Windows Server ships with a lot of free tools to help monitor and better understand what’s happening on the system. One of the most useful is called “Performance Monitor”. Don’t let the outdated UI fool you: Performance Monitor is an extremely useful tool that allows you to log, collect, and visualize all your performance-related data.

Before you go increase your VM size on your cloud provider, or switch to SSDs, be sure to take a look at Performance Monitor. It could surprise you.

The data you collect can be broken down into four main groups:

  • CPU
    • Spawning too many threads
    • Doing too much work at one time
    • Background job that could be eating up all the CPU cycles
    • Too many Garbage collector calls
  • Memory
    • Creating too much garbage
    • Memory leaks
    • Trying to load too much data into memory at one time
    • Not streaming in data
  • Disk
    • Writing to disk all the time
    • Using too much ram at one time, so the OS starts to use Disk
    • Serving too much static content for your site (use a CDN)
  • Network
    • More traffic than your one machine has bandwidth for
    • Other applications on machine saturating connection

A hiccup in any one of these areas could have a cascading effect on performance, since each one can act as a bottleneck for the others.

On to the actual metrics you want to watch for, grouped by type:


% Processor Time:

  • What is it ?
    • Total time the processor was busy processing.
  • Why should I care ?
    • Gives a very general measure of how busy the processor can get.
  • How does it help me ?
    • If this counter is consistently very high, you’ll need some of the counters below to narrow down why

% Privileged Time:

  • What is it ?
    • Total time the processor spent executing in kernel mode.
  • Why should I care ?
    • This measure takes into account only the kernel-related operations the processor does, like memory management
  • How does it help me ?
    • If this counter is consistently above 20%, you likely have a driver or hardware issue.

% User Time:

  • What is it ?
    • Total time the processor spent executing in any user application code.
  • Why should I care ?
    • Gives you an idea of how much work your application code forces the processor to do.
  • How does it help me ?
    • If this percentage is too high, it may be crowding out privileged processor time. You always want some buffer between “user time” and “privileged time” so the system can run smoothly.

Queue Length:

  • What is it ?
    • Number of threads waiting for a core to become available.
  • Why should I care ?
    • Gives you an idea of how much work your machine is trying to do at any given time.
  • How does it help me ?
    • Divide this number by the machine’s core count. If the value is greater than 3, there is too much CPU pressure on the machine and a backlog has built up.

Process (*) \ Thread Count:

  • What is it ?
    • Number of threads currently active in this process.
  • Why should I care ?
    • More threads means more CPU utilization.
  • How does it help me ?
    • Could be an indication that your application is spinning up too many threads at once.


Average Disk Queue Length

  • What is it ?
    • A simplified definition is how many disk operations (read & writes) were queued.
  • Why should I care ?
    • Gives you an idea of how saturated your disk is.
  • How does it help me ?
    • If the queue length stays over 2 for prolonged periods of time, it could be an indication of a disk bottleneck.

% Disk Idle Time

  • What is it ?
    • How much time your disk spends doing nothing.
  • Why should I care ?
    • Gives you a picture of when the disk is free.
  • How does it help me ?
    • If “Disk Idle Time” is low, the disk is doing a lot of work, which is okay as long as “Average Disk Queue Length” stays below 2. It can also indicate that the disk isn’t busy when it should be.

Avg Disk sec/Read & Avg Disk sec/Write

  • What is it ?
    • Latency of your disks.
  • Why should I care ?
    • Gives you an indication of how much lag there is to do anything with the disk.
  • How does it help me ?
    • Could indicate a hardware issue if the latency is too high.


Available MBytes

  • What is it ?
    • Amount of physical Memory available to processes.
  • Why should I care ?
    • Lets you know if you’re running out of memory.
  • How does it help me ?
    • Could indicate a memory leak if it continues to decrease

Pages / Sec

  • What is it ?
    • Rate at which pages are read directly from disk (slows down the whole system)
  • Why should I care ?
    • Occurs when there are page faults which cause system wide delays.
  • How does it help me ?
    • The higher this number gets, the more the system is running out of memory. The more it runs out of memory the more page faults will occur.

Pool Nonpaged Bytes

  • What is it ?
    • Area of memory for objects that cannot be written to disk
  • Why should I care ?
    • Eats up your available memory (ram)
  • How does it help me ?
    • If it becomes greater than 80%, it could lead to the system halting – the Nonpaged Pool Depletion issue (Event ID 2019)

Pool Paged Bytes

  • What is it ?
    • Area of memory for objects that can be written to disk, when not being used.
  • Why should I care ?
    • The more objects you have in this area, the more expensive it will be to retrieve them.
  • How does it help me ?
    • The bigger this value gets the longer it takes to retrieve objects from memory.

Process (*) \ Private Bytes

  • What is it ?
    • The current size in bytes of the memory this process has allocated, that cannot be shared with other running processes.
  • Why should I care ?
    • Tells you how much memory your process takes up.
  • How does it help me ?
    • If this value consistently grows over time, it could indicate that your application has a memory leak.


Output Queue Length

  • What is it ?
    • The length of the output packet queue.
  • Why should I care ?
    • The higher this number, the larger the backlog of queued packets, and the longer it takes to send out data.
  • How does it help me ?
    • Helps indicate a network bottleneck. If it is greater than one, the system’s network is nearing capacity.


How to Containerize a Ruby on Rails 6 App


Ruby on Rails should give you superpowers. However, it’s not the easiest to deploy in the traditional way (on a Linux virtual machine).

Docker to the rescue! If you already know what Docker is, you can just copy and paste the Dockerfile below. If you don’t, here it is in a nutshell:

What is Docker ?

Docker is a virtualization tool that does not need a whole operating system for each application. It does not emulate an entire operating system; instead, it uses the concept of “containers” to run your applications.

Containers act like a virtual world that your application lives in. The magic of Docker comes from the “Docker Engine”: it’s what translates the system calls from the Docker container to the base operating system’s kernel.

Long story short, you will never have to worry about dependency management ever again, or have a case of “it works on my machine”.

Now on to the Dockerfile:

FROM ruby:2.7.2

RUN apt-get update && apt-get install -y postgresql-client

RUN curl https://deb.nodesource.com/setup_12.x | bash

RUN curl https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add -

RUN echo "deb https://dl.yarnpkg.com/debian/ stable main" | tee /etc/apt/sources.list.d/yarn.list

RUN apt-get update && apt-get install -y nodejs yarn

RUN apt-get install -y imagemagick

# Serve static assets (JS & CSS) straight from Rails
ENV RAILS_SERVE_STATIC_FILES=true


RUN mkdir -p /home/node/application

WORKDIR /home/node/application

COPY Gemfile Gemfile.lock ./

RUN bundle install 

COPY package.json yarn.lock ./

COPY . .

RUN yarn install


RUN chmod -R 777 /home/node/application

# Let the app talk to the outside world on port 8080
EXPOSE 8080

RUN rake assets:precompile

# Start the application server
CMD ["rails", "server", "-p", "8080", "-b", "0.0.0.0"]

What Are we doing ?

  1. First we use a “base image” (the set of stuff our app’s dependencies are built on top of) of ruby:2.7.2
  2. All the “RUN” commands install the other dependencies. We’re going to be adding:
    1. postgresql-client
    2. Node
    3. Yarn
    4. imagemagick (included because I want to use “ActionText”)
  3. Now we set the Environment variable called RAILS_SERVE_STATIC_FILES, because we want our rails app to serve our JavaScript and CSS.
  4. After that we create a directory inside our container where our project files will live.
  5. We copy over our Gem file and install our dependencies.
  6. Then we do the same with our package.json file
  7. Now we copy over all our files, and then install all our front end dependencies.
  8. We allow our node process to get into our “/home/node/application” folder
  9. We expose a port that allows our app to communicate with the outside world: port 8080.
  10. We run the rake assets:precompile command so that we compile any SCSS to CSS and put them in the right folders.
  11. Finally, we specify the command that runs when we start our container:
    1. “rails server” : starts our server up
    2. “-p 8080” : lets our app know which port to listen to
    3. “-b” : lets the app know it should listen on all the IPs of the container

Hope this helps you deploy your Ruby apps a lot more easily!

Should You Switch to .NET 5?

.NET Core 3.1 is in LTS mode (support ends in Dec 2022), whereas .NET 5 is coming to general availability in November. However, there will never be an LTS version of .NET 5: the release lifecycle has changed, and every odd-numbered release gets only a general-availability release rather than an LTS one.


Pros of .NET 5

  • Unification of runtimes (mono, …)
  • Performance gains
  • Language interoperability (swift, java, …)
  • Not a lot of breaking changes

Cons of .NET 5

  • No LTS release
  • Fundamental packages like xUnit, Swashbuckle, Serilog, etc. have not transitioned over yet.
  • Still a lot of unknowns in terms of bugs and stability.
  • No default base image provided by Microsoft yet; everything remains experimental.

Recommendation: I think everyone who can should switch to .NET 6 some time after it comes out in Nov 2021. That would give you a whole year to migrate to 6 while 3.1 is still supported.




How to make your program faster, regardless of programming language or hardware

Person A: Does this sound like an impossible task ?

Person B: No, not really.

Person A: Does it have limits ?

Person B: Yes it does, but you see a dramatic difference in speed regardless.

Person A: This sounds like a scam. Is it a scam ? How much will this cost ?

Person B: No, it’s not a scam, and it’s free.

Person A: So what is it ?

Person B: You just have to master runtime complexity.

Person A: What !? …

Person B: Yeah, I know, I thought that too. But it works.

Person A: Yeahhhh, well, I never really understood that sort of stuff. I just implement business logic for a living. I don’t know much about computer science, and I am not gonna waste my time on this. If I need speed, I’ll just deploy it on a bigger server.

Person B: Come on … it’s easy. You don’t have to understand the computer science. You just need to understand graphs and recognize patterns.

Person A: What … really ?

Person B: Yeah, you just have to use the graph below. Or just Google something like “Runtime Complexity Graph”. All you have to know is that things in red are the danger zone, things in orange are the “meh” zone, and things in the yellowish-green zone are okay.

Person A: Wait, what does this have to do with programming, and what about those function things ?

Person B: Oh yeah, right. So those function things represent how your program can run. The “n” represents the input to your function, like an array of objects or numbers. In the danger zone, if you add just one more element, your time to completion more than doubles. Whereas in the yellowish-green zone, adding another element doesn’t do much of anything.

Person A: This still doesn’t help me.

Person B: Okay okay okay, how about I make a cheat sheet for you ? Your little guide to spotting when you’re in the danger zone.

Person A: Show me.

Person B:

| Type | Function | Description |
| --- | --- | --- |
| Constant Time | 1 | No matter how many elements/inputs you give your function, its runtime will always stay the same. |
| Logarithmic Time | log(n) | Doubling the number of inputs/elements does not double the runtime. This is the runtime of most search algorithms. |
| Linear Time | n | Doubling the number of inputs/elements doubles your runtime. This is a for loop spanning from zero to the end of the input. |
| | n + m | Two for loops, one after the other, going over two different collections. |
| Quasi-linear Time | n · log(n) | A slightly worse version of linear time. This is the runtime of most sorting algorithms. |
| Quadratic Time | n² | Every element in the input is compared with every other element. This is the classic double “for loop” over a single array. Every additional nested “for loop” adds one to the exponent, so 5 nested for loops would be n⁵. |
| | n * m | Two nested for loops, but going over two different collections. |
| Exponential Time | 2ⁿ | A single extra input doubles the runtime. You never want this, ever. |
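To make the cheat sheet concrete, here’s the same problem (does a list contain a duplicate ?) written once in the danger zone and once in the green zone; a toy example of my own:

```python
def has_duplicate_quadratic(items):
    # Danger zone: every element compared with every other -- O(n^2),
    # the classic double "for loop" over a single collection
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicate_linear(items):
    # Green zone: one pass with a set -- O(n)
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False

data = [3, 1, 4, 1, 5]
print(has_duplicate_quadratic(data), has_duplicate_linear(data))  # -> True True
```

Both give the same answer; the difference only shows up as the input grows, which is exactly what the table above is describing.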

WTF is Async & Await in C# ?

Simply put, they allow you to easily write asynchronous programs without having to reorganize your code, which can lead to big gains in responsiveness and throughput.

The “async” and “await” keywords mark a method as asynchronous and mark the points where it waits on asynchronous work. “async” goes in the method’s signature, and “await” goes right before the call whose result you’re waiting on. An async method needs to return a Task (or Task&lt;T&gt;) object.

Now you can use the different parts of the Task Asynchronous Programming (TAP) model, such as starting a bunch of tasks and waiting for them all to finish, or scheduling a new task to run when another task completes, all while the rest of your application keeps running.

How is this possible ? Does it start a bunch of new threads ? Yes and no. If you start a bunch of tasks and wait for them to complete, then yes, they can run on separate threads. Whereas if you await a heavy task, your method simply yields control back to its caller the second it hits the await keyword, and the rest of the method runs later, when the task completes and time is available on the current thread. So you aren’t able to tell that your program is waiting around for something.
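The post is about C#, but the same model exists in Python’s asyncio, where the “yield at the await, resume when the task finishes” behavior is easy to demo (a toy sketch, not the C# API):

```python
import asyncio

async def fetch(name, delay):
    # Simulates a slow I/O call; while we await, the event loop runs other tasks
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    # Start several tasks and wait for them all -- like Task.WhenAll in C#
    results = await asyncio.gather(fetch("a", 0.2), fetch("b", 0.1))
    print(results)  # -> ['a done', 'b done']

asyncio.run(main())
```

Both tasks sleep concurrently, so the whole thing takes about 0.2 seconds, not 0.3 — no extra threads required.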

Questions People have Asked Me – Part 2

What is the root object in the base class library ?

For Java and C# that would be “Object”.

What methods does “Object” have ?

For C#:

    • Equals – Supports comparisons between two objects.
    • Finalize – Performs cleanup operations on unmanaged resources being used by the object, before the object is destroyed.
    • GetHashCode – Generates a number based on the value of an object; used to support hash tables.
    • ToString – Creates a human-readable piece of text that describes the object.

Is “String” mutable ?

For C# & Java: Strings are always IMMUTABLE

What is Boxing and Unboxing ?

For C#:

Boxing is the process of converting a value type to the type object (or to any interface type implemented by that value type). Storing an int in an object is “boxing”, which is implicit, as in you can do it without thinking about it. Taking that object and converting it back is “unboxing”, which is explicit: something like “int i = (int)x” where x is of type object. Why does this matter ? Value types get stored on the stack, whereas reference types get stored on the heap. Boxing lets a value type go anywhere an object is expected (like non-generic collections), but it comes at the cost of a heap allocation, so doing it a lot in hot paths can hurt performance.


WTF is S.O.L.I.D – WTF is D ?

D = Dependency Inversion Principle

This principle is about making sure you never have to rewrite your core logic. If your class or piece of code has a dependency on something else, it should never access it directly. Instead, it should go through some intermediary that abstracts the functionality away.

For example, if your application talked to a database, you wouldn’t want to be writing SQL statements directly into your code. Or, if you were using an ORM (Object Relational Mapper), you wouldn’t want queries everywhere, especially if at some point you decide to move to another database type or ORM. To fix this problem, you would create a wrapper around it, abstracting complex queries into simple, common method calls like “update user profile” or “set user password”.

This way, if you ever had to change how you access the database, you could do it without changing any of your core application logic, since your core application wouldn’t rely on how the method is implemented. This can also be thought of as always coding to an interface rather than to a direct implementation.
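A minimal sketch of the principle in Python (the repository and method names are my own illustration):

```python
from abc import ABC, abstractmethod

class UserRepository(ABC):
    """The intermediary: core logic depends on this, never on a concrete DB."""
    @abstractmethod
    def set_password(self, user_id, password_hash): ...

class PostgresUserRepository(UserRepository):
    def set_password(self, user_id, password_hash):
        # Real SQL / ORM calls would live here, hidden from the core logic
        print(f"UPDATE users SET password = ... WHERE id = {user_id}")

class InMemoryUserRepository(UserRepository):
    """A swap-in implementation -- handy for tests."""
    def __init__(self):
        self.passwords = {}
    def set_password(self, user_id, password_hash):
        self.passwords[user_id] = password_hash

def change_password(repo: UserRepository, user_id, new_hash):
    # Core logic codes to the interface; swapping databases changes nothing here
    repo.set_password(user_id, new_hash)

repo = InMemoryUserRepository()
change_password(repo, 42, "hash123")
print(repo.passwords)  # -> {42: 'hash123'}
```

Swapping Postgres for anything else means writing one new class; `change_password` never changes.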

WTF is S.O.L.I.D – WTF is I ?

I = Interface Segregation Principle

This is all about separating your interfaces and making them smaller, since we don’t want to end up with mega interfaces with tons of properties and methods that may or may not be used by the classes that implement them. If you don’t follow this principle, you are probably gonna end up with a hairball of OOP design that leads to harder refactors further down the line.

For example, let’s say you have an “Entity” interface that has the properties “name”, “attackDamage”, and “health”, and the methods “move”, “attack”, and “takeDamage”. Now let’s say the classes “Player”, “Enemy”, “Chair”, and “MagicChest” implement it. Does this make sense ? Should the “Chair” class need to implement the “attack” method ? Most likely no, but it does still need the “name” property. So we can factor out the pieces common to the classes that implement “Entity”: instead of just the “Entity” interface, we can have “BaseEntity”, “ActorEntity”, and “StaticObject” interfaces. This way we won’t have any unnecessary implementations in any of the classes that implement them.
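The Entity example above, sketched in Python with abstract base classes standing in for interfaces (the class bodies are my own toy fillers):

```python
from abc import ABC, abstractmethod

# Instead of one mega "Entity" interface, split it by capability
class BaseEntity(ABC):
    @abstractmethod
    def name(self): ...

class ActorEntity(BaseEntity):
    """Things that move and fight."""
    @abstractmethod
    def move(self): ...
    @abstractmethod
    def attack(self): ...
    @abstractmethod
    def take_damage(self, amount): ...

class Player(ActorEntity):
    def name(self): return "player"
    def move(self): pass
    def attack(self): return 10
    def take_damage(self, amount): pass

class Chair(BaseEntity):
    # A chair only needs a name -- no dead "attack" stub forced on it
    def name(self): return "chair"

print(Player().attack(), Chair().name())  # -> 10 chair
```

`Chair` now implements exactly what it uses and nothing else, which is the whole point of the principle.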