Motivation
I see a lot of people looking to speed up the loading of certain pages. Although there are many techniques that can be used for this purpose, this article will discuss building summary tables, a common technique employed with big data.
Definition/Purpose
A summary table is a table that holds the precomputed results of typical or resource-intensive queries. Imagine, for example, that you have a page on a web site that produces aggregate reports. You can either rerun the same aggregation queries over all rows for every user, or you can do that work ahead of time, perhaps breaking it out into chunked time periods. The summary table will, by definition, contain far fewer rows, and be much quicker to scan.
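As a quick hypothetical sketch (all names here are invented for illustration), imagine a fact table of raw page views next to a summary table of daily counts:

-- fact table: one row per page view
create table page_views ( view_time datetime, page text, user_id int );
-- summary table: one precomputed row per page per day
create table page_views_daily ( view_date date, page text, view_count int );
-- do the aggregation work ahead of time
insert into page_views_daily
select date( view_time ), page, count(*) from page_views group by 1, 2;

A report that reads page_views_daily touches one row per page per day instead of one row per view.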
Approach
In a Linux environment it is quite simple to write a script that queries the fact table and inserts into the summary table, then schedule that script with cron. A note here:
If you’re resummarizing some of the data due to requirements, make sure to do your delete/truncate in the same transaction as your insert. I recommend this delete-and-insert method where possible, as opposed to updating rows in place, because, all else being equal, one or two SQL statements will be much faster than many updates.
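Here’s a minimal sketch of such a script, where the database and all table and column names (mydb.db, fact, fact_daily_sum, and so on) are hypothetical placeholders; it resummarizes the last seven days, with the delete and insert in one transaction:

#!/bin/bash
# refresh_summary.sh -- all names here are hypothetical placeholders
sqlite3 mydb.db <<'SQL'
begin;
-- drop the rows about to be recomputed...
delete from fact_daily_sum where day >= date('now', '-7 days');
-- ...and recompute them inside the same transaction,
-- so readers never see a half-populated summary
insert into fact_daily_sum
select date( event_time ) day, item, sum( amount )
from fact
where event_time >= date('now', '-7 days')
group by 1, 2;
commit;
SQL

A crontab entry like 0 1 * * * /path/to/refresh_summary.sh would then rerun it nightly at 1am.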
Implementation
For this example I’ll be using a SQLite database containing stock market data for the NASDAQ and NYSE markets over 15 years.
echo ".schema stock_hist" | sqlite3 dbs/stocks.db
CREATE TABLE stock_hist ( DATE date, SYMBOL text, PCT_DIFF float, OPEN float, CLOSE float, VOLUME bigint );
CREATE INDEX IDX_date_symbol_01 ON stock_hist ( date, symbol );
CREATE INDEX IDX_pct_diff_open_01 ON stock_hist ( pct_diff, open );
echo "select * from stock_hist where date = '2010-01-29' order by symbol desc limit 5;" | sqlite3 dbs/stocks.db
2010-01-29|USL|-1.33368891257995|37.490002|36.990002|30600
2010-01-29|USATP|0.0|9.0|9.0|0
2010-01-29|URG|1.26582278481013|0.79|0.8|119700
2010-01-29|UQM|-2.05338809034909|4.87|4.77|251700
2010-01-29|UPRO|-4.53726461789673|140.16996|133.810078|17454000
Here you can see what the fact table looks like.
Imagine I wanted to create a dashboard that queried this table to find the biggest average losers since 2014, 2015, and so on. With about 10,000 symbols per day, each of those queries may take a little while to run. Imagine, though, if those averages were already computed and stored in a much smaller table. At that point finding the biggest losers would mean scanning only x rows, where x is the number of stocks to return. It should be quite fast, in theory. Let’s look at the actual performance.
Here’s a query that looks at the worst daily average from 2014 on.
time echo " select avg( pct_diff ), symbol from stock_hist where date >= '2014-01-01' group by 2 order by 1 limit 1;" | sqlite3 dbs/stocks.db
-7.62124711316397|AFI
real 0m22.099s
user 0m9.900s
sys 0m12.200s
You can see this took over 20 seconds.
Similarly, the query from 2015 on.
time echo " select avg( pct_diff ), symbol from stock_hist where date >= '2015-01-01' group by 2 order by 1 limit 1;" | sqlite3 dbs/stocks.db
-7.62124711316397|AFI
real 0m18.928s
user 0m5.790s
sys 0m8.540s
You can turn this concept into a simple bash script.
echo " select avg2014, avg2015, avg2016, s2014.symbol from ( select avg( pct_diff ) avg2014, symbol from stock_hist where date >= '2014-01-01' group by 2 ) s2014, ( select avg( pct_diff ) avg2015, symbol from stock_hist where date >= '2015-01-01' group by 2 ) s2015, ( select avg( pct_diff ) avg2016, symbol from stock_hist where date >= '2016-01-01' group by 2 ) s2016 where s2014. symbol = s2015.symbol and s2014.symbol = s2016.symbol;" | sqlite3 dbs/stocks.db > outfile
echo " drop table if exists year_sum; " | sqlite3 dbs/stocks.db
echo "create table if not exists year_sum( avg2014 float, avg2015 float, avg2016 float, symbol text ); " | sqlite3 dbs/stocks.db
echo -e ".separator \"|\"\n.import outfile year_sum" | sqlite3 dbs/stocks.db
Now you get the answer in less than a tenth of a second, plus the answers for 2015 and 2016.
time echo "select * from year_sum where symbol = 'AFI';" | sqlite3 dbs/stocks.db
-7.62124711316397|-7.62124711316397|-7.62124711316397|AFI
real 0m0.054s
user 0m0.030s
sys 0m0.030s
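One more tweak worth considering: year_sum has no index, so the lookup above scans the whole table. With roughly one row per symbol that’s cheap, but if your summary table grows larger, an index (named here in the same style as the ones on stock_hist) keeps point lookups fast:

echo "create index if not exists IDX_symbol_01 on year_sum ( symbol );" | sqlite3 dbs/stocks.db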
Putting it All Together
This is a very simple, and admittedly somewhat contrived, example, but it illustrates the point: queries can often be precalculated at intervals, allowing rapid, repeated querying in an efficient manner. This technique is very useful for long-running, resource-intensive, or heavily used queries.