Aaron Adel – Software Engineer

Technical Tips & Tricks

Googoose 1.0.0 Released

Announcements

I’m pleased to announce the official release of Googoose 1.0.0. This is the minimum viable product release.

Googoose is a JavaScript/jQuery plugin that lets you easily convert any HTML content into a Microsoft Word document. It’s lightweight, fast, and easy to use.

More will be developed in the future. The GitHub page can be viewed here.

Performance Tip – Memoization

What is Memoization

Memoization is a caching technique. It’s generally used to speed up program execution by storing the result of a function call for each given input and checking whether the result for that input has already been computed. If it has, the stored value is returned straight away instead of being recomputed.

What does this mean for me?

Most major languages have an implementation of memoization, either built in or available as a library. So, whatever you’re writing in, memoization can be beneficial for you to know.

Once you understand how memoization works, it will start to become obvious what should and should not be memoized.

How it Works

Generally speaking, the best memoization implementations should be seamless to the application writer.

Perl’s Memoize module does a fantastic job of this. You write and call functions as you normally would. You simply declare one time that a function should be memoized, and it is. You can even do this inside a loop, iterating over a list of function names.

This module replaces the function name in the symbol table with the memoized version, which first checks a hash for a key made from the given arguments joined into a single string. If the key exists in the hash, the stored value is returned. Otherwise, the original function is called, and the key-value pair is stored for later use.
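The mechanism described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Perl's Memoize itself: the `memoize` helper, the `slow_square` function, and the separator character used to join the arguments are all assumptions made for the example.

```python
import functools

def memoize(func):
    """Wrap func so results are cached under a string built from its arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        # Join the arguments into a single string key (separator is an arbitrary choice).
        key = "\x1c".join(map(str, args))
        if key not in cache:
            # First time this input is seen: call the original and store the result.
            cache[key] = func(*args)
        return cache[key]

    return wrapper

calls = []  # records every real invocation of the original function

def slow_square(n):
    calls.append(n)
    return n * n

# Rebind the name to the memoized version, much like swapping the symbol table entry.
slow_square = memoize(slow_square)

slow_square(4)
slow_square(4)  # served from the cache, the original is not called again
slow_square(5)
```

Because the key is a plain string, arguments that print the same way (for example `1` and `"1"`) share a cache entry, which is one of the trade-offs of this approach.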

I’ve actually attempted to implement this same idea in C with an open source library called libBlondie.

There is another C memoization library that I’m aware of; as far as I know, these two are the only ones at this time. The benefit of libBlondie is that it follows the same protocol described above. You essentially declare and call functions the same way that you normally would without memoization. You wrap your function code in a C preprocessor macro, which adds the necessary code for creating and checking the hash. It’s really very smooth, in my opinion, and I encourage you to check it out.

However, at this time it’s certainly not production-ready. In particular, it doesn’t seem to handle all data types. I’ve tried a few different things, and I intend to come back to it eventually. I think the way to go will ultimately be to keep both the key and the value as JSON strings in the hash. Forming the key to check will mean encoding the arguments to JSON, and using a stored value will mean decoding it from JSON back into the native C type.
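To make the JSON idea concrete, here is a minimal sketch in Python rather than C. It is not libBlondie code; the `json_memoize` decorator and the `describe` function are hypothetical names used only to show keys and values both living in the hash as JSON strings.

```python
import json

def json_memoize(func):
    cache = {}  # maps JSON-encoded arguments to JSON-encoded results

    def wrapper(*args):
        key = json.dumps(args)  # encode the arguments to form the key
        if key not in cache:
            # Store the result as a JSON string, whatever its native type was.
            cache[key] = json.dumps(func(*args))
        # Decode the stored string back into a native value for the caller.
        return json.loads(cache[key])

    wrapper.cache = cache  # exposed only so the example can inspect it
    return wrapper

@json_memoize
def describe(name, scores):
    return {"name": name, "total": sum(scores)}

first = describe("a", [1, 2, 3])
second = describe("a", [1, 2, 3])  # decoded from the cached JSON string
```

The appeal of this design is that any argument or return type that can be serialized gets a uniform representation, at the cost of an encode and decode on every call.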

Anyway, my point is that whichever language you want to use will likely have memoization available. That’s great news for the developer, because memoization is a very handy technique to know. Below I’ll hopefully demonstrate that.

Test Results

Let’s look at the classic memoization use case, Fibonacci, using libBlondie.

Here is the C code to recursively calculate Fibonacci numbers.

int Fibonacci(int n) {
   if ( n == 0 )
      return 0;
   else if ( n == 1 )
      return 1;
   else
      return ( Fibonacci(n-1) + Fibonacci(n-2) );
} 

Here is that same code re-written to be memoized.

BlondieMemoize(
    int, Fibonacci, $P(int n), $V(n), {
        if ( n == 0 )
            return 0;
        else if ( n == 1 )
            return 1;
        else
            return ( Fibonacci(n-1) + Fibonacci(n-2) );
    }
)

As you can see, there is very little difference. However, the standard version takes over a second and a half to calculate the 38th Fibonacci number, whereas the memoized version takes just under three hundredths of a second to do the same. In fact, the higher the Fibonacci number you want to calculate, the longer the standard version takes, while the memoized version shows almost no difference.

Why is that? It’s because the standard version has to recalculate the Fibonacci number of n-1 and n-2 recursively each time, whereas the memoized version only has to calculate each Fibonacci number once.
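Counting function calls makes the difference visible without a stopwatch. The sketch below is in Python and uses the standard library's `functools.lru_cache` in place of libBlondie; the counter variables and the choice of n = 20 are assumptions made for the example.

```python
from functools import lru_cache

naive_calls = 0
memo_calls = 0

def fib_naive(n):
    global naive_calls
    naive_calls += 1  # count every invocation, including repeated inputs
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    global memo_calls
    memo_calls += 1  # the body only runs on a cache miss
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

N = 20
result_naive = fib_naive(N)
result_memo = fib_memo(N)
print(naive_calls, memo_calls)  # 21891 21
```

The unmemoized call count grows exponentially with n, while the memoized body runs once for each value from 0 to n.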

Putting it All Together

Don’t go running off thinking that every function should be memoized. There is a time and place, because creating the hash and checking it on each function call does take a bit of time. So if a function is only ever called once for each unique input, it generally won’t be any faster when memoized. However, if a function is called with the same input multiple times and it does a system call or something else that generally takes a lot of time, such as an intensive computation, memoization should definitely be considered.
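One rough way to judge whether a function is a good candidate is to look at its cache hit rate. The sketch below uses Python's `functools.lru_cache` and its `cache_info()` counters; the `lookup` function is a hypothetical stand-in for something slow.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def lookup(key):
    # Hypothetical stand-in for a slow system call or heavy computation.
    return sum(ord(c) for c in key)

# Repeated inputs: most calls are answered from the cache.
for key in ["a", "b", "a", "a", "b"]:
    lookup(key)
repeated = lookup.cache_info()  # hits=3, misses=2

# Unique inputs: every call misses, so the cache is pure overhead.
lookup.cache_clear()
for key in ["a", "b", "c", "d", "e"]:
    lookup(key)
unique = lookup.cache_info()  # hits=0, misses=5
```

A function whose hit count stays near zero under realistic traffic is paying for the hash without getting anything back.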

Performance Tip – Summary Tables

Motivation

I see a lot of people looking to increase the loading speed of certain pages. Although there are many techniques that can be used for this purpose, this article will discuss building summary tables. This is a common technique employed with big data.

Definition/Purpose

A summary table is a table that holds the precomputed results of typical or resource-intensive queries. Imagine, for example, that you have a page on a website that produces aggregate reports. You can either rerun the same aggregation queries over all rows for every user, or you can do that work ahead of time, maybe breaking it out into chunked time periods. The summary table will, by definition, have much less data in it and be much quicker to scan.

Approach

In a Linux environment it is quite simple to write a script that queries the fact table and inserts into the summary table, then schedule that script in cron. A note here:

If you’re resummarizing some of the data due to requirements, make sure to do your delete/truncate in the same transaction as your insert. I recommend this delete-and-insert method where possible, as opposed to updating rows one at a time, because, all else being equal, one or two SQL statements will be much faster than many updates.
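As a minimal sketch of that note, here is a delete and insert wrapped in one transaction using Python's built-in `sqlite3` module. The `fact` and `summary` tables, their columns, and the sample rows are made up for the example and are not the schema used later in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact (day TEXT, symbol TEXT, pct_diff REAL);
    CREATE TABLE summary (symbol TEXT, avg_pct REAL);
    INSERT INTO fact VALUES ('2014-01-02', 'AAA', 1.0);
    INSERT INTO fact VALUES ('2014-01-03', 'AAA', 3.0);
    INSERT INTO fact VALUES ('2014-01-02', 'BBB', -2.0);
""")

def refresh_summary(conn):
    # "with conn" commits if both statements succeed and rolls back if either
    # fails, so the delete never takes effect without the matching insert.
    with conn:
        conn.execute("DELETE FROM summary")
        conn.execute(
            "INSERT INTO summary "
            "SELECT symbol, AVG(pct_diff) FROM fact GROUP BY symbol"
        )

refresh_summary(conn)
refresh_summary(conn)  # re-running replaces the rows instead of duplicating them
rows = conn.execute(
    "SELECT symbol, avg_pct FROM summary ORDER BY symbol"
).fetchall()
```

The whole refresh is two SQL statements regardless of how many symbols there are, which is the point of preferring it over per-row updates.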

Implementation

For this example I’ll be using a SQLite database, which contains stock market data for the NASDAQ and NYSE markets for 15 years.

echo ".schema stock_hist" | sqlite3 dbs/stocks.db
CREATE TABLE stock_hist ( DATE date,  SYMBOL text,  PCT_DIFF float,  OPEN float,  CLOSE float,  VOLUME bigint );
CREATE INDEX IDX_date_symbol_01 ON stock_hist ( date, symbol );
CREATE INDEX IDX_pct_diff_open_01 ON stock_hist ( pct_diff, open );

 

echo "select * from stock_hist where date = '2010-01-29' order by symbol desc limit 5;" | sqlite3 dbs/stocks.db
2010-01-29|USL|-1.33368891257995|37.490002|36.990002|30600
2010-01-29|USATP|0.0|9.0|9.0|0
2010-01-29|URG|1.26582278481013|0.79|0.8|119700
2010-01-29|UQM|-2.05338809034909|4.87|4.77|251700
2010-01-29|UPRO|-4.53726461789673|140.16996|133.810078|17454000

Here you can see what the fact table looks like.

Imagine I wanted to create a dashboard that queried this table to find the biggest average losers since 2014, 2015, etc. With about 10,000 symbols per day, this query may take a little while to run. Imagine, though, if that data were already available, and date were indexed, along with the numerical average percent difference, etc. At that point finding the biggest losers would only be scanning x rows where x is the number of stocks to return. It should be quite fast, in theory. Let’s look at the actual performance.

Here’s a query that looks at the worst daily average from 2014 on.

time echo " select avg( pct_diff ), symbol from stock_hist where date >= '2014-01-01' group by 2 order by 1 limit 1;" | sqlite3 dbs/stocks.db
-7.62124711316397|AFI                                                                       
real    0m22.099s                             
user    0m9.900s                              
sys     0m12.200s

You can see this took over 20 seconds.

Similarly, the query from 2015 on.

time echo " select avg( pct_diff ), symbol from stock_hist where date >= '2015-01-01' group by 2 order by 1 limit 1;" | sqlite3 dbs/stocks.db                                         
-7.62124711316397|AFI                                                                       
real    0m18.928s                             
user    0m5.790s                              
sys     0m8.540s

You can turn this concept into a simple bash script.

echo " select avg2014, avg2015, avg2016, s2014.symbol
       from ( select avg( pct_diff ) avg2014, symbol from stock_hist where date >= '2014-01-01' group by 2 ) s2014,
            ( select avg( pct_diff ) avg2015, symbol from stock_hist where date >= '2015-01-01' group by 2 ) s2015,
            ( select avg( pct_diff ) avg2016, symbol from stock_hist where date >= '2016-01-01' group by 2 ) s2016
       where s2014.symbol = s2015.symbol and s2014.symbol = s2016.symbol;" | sqlite3 dbs/stocks.db > outfile

echo " drop table if exists year_sum; " | sqlite3 dbs/stocks.db

echo "create table if not exists year_sum( avg2014 float, avg2015 float, avg2016 float, symbol text ); " | sqlite3 dbs/stocks.db

echo -e ".separator \"|\"\n.import outfile year_sum" | sqlite3 dbs/stocks.db

Now you get the answer in less than a tenth of a second, plus the answers for 2015 and 2016.

time echo "select * from year_sum where symbol = 'AFI';" | sqlite3 dbs/stocks.db          
-7.62124711316397|-7.62124711316397|-7.62124711316397|AFI                                                                                 
real    0m0.054s                              
user    0m0.030s                              
sys     0m0.030s
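The dashboard lookup described earlier, finding the biggest losers straight from the summary table, might look roughly like this in Python with the built-in `sqlite3` module. The sample rows, the index name, and the `biggest_losers` helper are assumptions for illustration; the summary is filled with a single INSERT ... SELECT instead of the outfile and .import steps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stock_hist (date TEXT, symbol TEXT, pct_diff REAL);
    INSERT INTO stock_hist VALUES ('2014-06-02', 'AAA', -4.0);
    INSERT INTO stock_hist VALUES ('2015-06-01', 'AAA', -2.0);
    INSERT INTO stock_hist VALUES ('2016-06-01', 'AAA', -6.0);
    INSERT INTO stock_hist VALUES ('2014-06-02', 'BBB', 1.0);
    INSERT INTO stock_hist VALUES ('2015-06-01', 'BBB', 3.0);
    INSERT INTO stock_hist VALUES ('2016-06-01', 'BBB', 5.0);

    CREATE TABLE year_sum (avg2014 REAL, avg2015 REAL, avg2016 REAL, symbol TEXT);

    -- Same three per-year aggregates as the shell script, joined on symbol.
    INSERT INTO year_sum
    SELECT s2014.a, s2015.a, s2016.a, s2014.symbol
    FROM (SELECT AVG(pct_diff) a, symbol FROM stock_hist
          WHERE date >= '2014-01-01' GROUP BY symbol) s2014
    JOIN (SELECT AVG(pct_diff) a, symbol FROM stock_hist
          WHERE date >= '2015-01-01' GROUP BY symbol) s2015
      ON s2014.symbol = s2015.symbol
    JOIN (SELECT AVG(pct_diff) a, symbol FROM stock_hist
          WHERE date >= '2016-01-01' GROUP BY symbol) s2016
      ON s2014.symbol = s2016.symbol;

    -- Hypothetical index so ordering by the 2014 average need not sort the table.
    CREATE INDEX idx_year_sum_avg2014 ON year_sum (avg2014);
""")

def biggest_losers(conn, limit):
    # Reads only the summary table; the fact table is never touched here.
    return conn.execute(
        "SELECT symbol, avg2014 FROM year_sum ORDER BY avg2014 LIMIT ?",
        (limit,),
    ).fetchall()

losers = biggest_losers(conn, 1)
```

With one row per symbol, the lookup cost depends on the number of symbols rather than on the number of trading days in the fact table.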

Putting it All Together

This is a very simple and slightly meaningless example, but it illustrates the point: queries can often be precalculated at intervals, which allows rapid, repeated querying in an efficient manner. This technique is very useful for long-running, resource-intensive, or heavily used queries.

Programmatically Writing a Word Document Using HTML

Intro

So, you want to generate a Word document automatically? This is a surprisingly straightforward process that can be easily written in any language, not just those on the .NET framework. I will be reviewing the method of generating a Word document using standard HTML.

Basics

A standard Word document should start from this HTML shell.

<html xmlns:o='urn:schemas-microsoft-com:office:office' xmlns:w='urn:schemas-microsoft-com:office:word' xmlns='http://www.w3.org/TR/REC-html40' >
    <head>
    <!--[if gte mso 9]>
        <xml>
        <w:WordDocument>
        <w:View>Print</w:View>
        <w:Zoom>75</w:Zoom>
        <w:DoNotOptimizeForBrowser/>
        </w:WordDocument>
        </xml>
    <![endif]-->
    <style>
    </style>
    </head>
    <body>
    </body>
</html>

Explanation

The lines

        <w:View>Print</w:View>
        <w:Zoom>75</w:Zoom>

tell Word to open the document in print layout view, rather than the HTML viewer that would otherwise be the default, and to open it at a 75% zoom level.
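Since the shell is just text, any language can produce it. Here is a small sketch in Python that fills in the view and zoom values; the `word_html_shell` function and its parameter names are hypothetical, and the markup is copied from the shell above.

```python
def word_html_shell(body_html, view="Print", zoom=75, css=""):
    """Return an HTML string using the Word shell with the given view and zoom."""
    return f"""<html xmlns:o='urn:schemas-microsoft-com:office:office' xmlns:w='urn:schemas-microsoft-com:office:word' xmlns='http://www.w3.org/TR/REC-html40'>
<head>
<!--[if gte mso 9]>
<xml>
<w:WordDocument>
<w:View>{view}</w:View>
<w:Zoom>{zoom}</w:Zoom>
<w:DoNotOptimizeForBrowser/>
</w:WordDocument>
</xml>
<![endif]-->
<style>
{css}
</style>
</head>
<body>
{body_html}
</body>
</html>
"""

# Build a document that opens in print view at 100% zoom.
document = word_html_shell("<p>Hello, Word.</p>", zoom=100)
```

The returned string can then be written to a file or sent as a download.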

Styling Your Word Document

Most styling that’s possible in HTML is also possible when you’re styling your Word document.

You put the CSS rules that style your Word document inside the style tags in the head tag. Below I’ll list some common uses, although most anything is possible.

Some Typical Style Rules

<style>

.MsoFooter, .MsoHeader {
    margin:0in;
    margin-bottom:.0001pt;
    mso-pagination:widow-orphan;
}

<!-- /* Style Definitions */
@page {
    size:8.5in 11.0in; 
    margin:1.0in 1.0in 1.0in 1.0in ;
    font-family:"Arial";
}
@page Section1 {
    mso-header-margin:.5in;
    mso-header:h1;
    mso-footer: f1; 
    mso-footer-margin:.5in;
}
div.Section1 {
    page:Section1;
}

#h1 {
    width: 100%;
    text-align: center;
}
#f1 {
    margin-right: 1in;
    position: absolute;
    width: 100%;
    text-align: right;
}
table#hrdftrtbl{
    margin:0in 0in 0in 9in;
}
-->
</style>

Explanation

The @page directive has a rule for the size of the page and a rule for its margins. These rules, specifically, state that each page should be a standard US Letter page (8.5 by 11 inches) with 1 inch margins all around.

You can see that there are also some rules about the header and footer. The header is an image, and it is centered. The footer is a page number, and it is right aligned.

Additionally, there’s this interesting style:

table#hrdftrtbl{
    margin:0in 0in 0in 9in;
}

This is necessary because, without it, the header and footer content would be displayed both in the header/footer and in the normal flow of the document. The rule hides that second copy by positioning it off the page to the right, so long as you wrap the header and footer in the hrdftrtbl table.

Putting it All Together

<html xmlns:o='urn:schemas-microsoft-com:office:office' xmlns:w='urn:schemas-microsoft-com:office:word' xmlns='http://www.w3.org/TR/REC-html40' >

    <head>
        <!--[if gte mso 9]>
            <xml>
            <w:WordDocument>
            <w:View>Print</w:View>
            <w:Zoom>75</w:Zoom>
            <w:DoNotOptimizeForBrowser/>
            </w:WordDocument>
            </xml>
        <![endif]-->
        <style>

.MsoFooter, .MsoHeader {
    margin:0in;
    margin-bottom:.0001pt;
    mso-pagination:widow-orphan;
}

            <!-- /* Style Definitions */
        @page {
            size:8.5in 11.0in; 
            margin:1.0in 1.0in 1.0in 1.0in ;
            font-family:"Arial";
        }
        @page Section1
        {
            mso-header-margin:.5in;
            mso-header:h1;
            mso-footer: f1; 
            mso-footer-margin:.5in;
        }
        div.Section1
        {
            page:Section1;
        }

        #h1 {
            width: 100%;
            text-align: center;
        }
        #f1
        {
            margin-right: 1in;
            position: absolute;
            width: 100%;
            text-align: right;
        }
        table#hrdftrtbl{
            margin:0in 0in 0in 9in;
        }
        -->
        </style>
    </head>

    <body lang=EN-US style='tab-interval:.5in'>
        <div class=Section1>

            <br clear=all style='mso-special-character:line-break;page-break-before:always'>

            <!-- Table of Contents -->

            <p class=MsoToc1> 
            <!--[if supportFields]> 
                <span style='mso-element:field-begin'></span> 
                TOC \o "1-3" \u 
                <span style='mso-element:field-separator'></span> 
            <![endif]--> 
            <span style='mso-no-proof:yes'>Table of content - Please right-click and choose "Update fields".</span> 
            <!--[if supportFields]> 
                <span style='mso-element:field-end'></span> 
            <![endif]--> 
            </p>

            <br clear=all style='mso-special-character:line-break;page-break-before:always'>

            <h1>Other Common Uses</h1>
            <ol>
                <li>
                    <h2>Table of Contents</h2>
                    <p>
                    Many people like to add a table of contents to their word documents. This is easiest to generate after the document is rendered in Word because it contains page numbers, which are calculated fields. However, doing this is very simple. Right click on the page above this and click either 'Update Field', or 'Edit Field'.
                    </p>
                </li>
                <li><h2>Page Breaks</h2>
                    <p>
                    You should write a page break like this:
                    &nbsp;&nbsp;

                    <br clear=all style='mso-special-character:line-break;page-break-before:always'>
                    </p>
                </li>
            </ol>

            <h1>The World is Yours!</h1>
            <p>The rest is basically self-explanatory. Images can be embedded with the img tag. If you want to style something in a non-default way, feel free to do so using CSS.
            </p>
        </div>

        <table id='hrdftrtbl' border='1' cellspacing='0' cellpadding='0'>
            <tr>
                <td>
                    <div style='mso-element:header' id=h1>
                        <p class="MsoHeader">
                        <img  width="200" height="40"  src="img/header.jpg"/>
                        </p>
                    </div>
                </td>
                <td>
                    <div style='mso-element:footer' id=f1>
                        <p class=MsoFooter><span class=SpellE><span
                                                       class=MsoPageNumber><span style='mso-no-proof:yes'>1</span></span><!--[if supportFields]><span
                                                       class=MsoPageNumber><span style='mso-element:field-end'></span></span><![endif]--></span>
                        </p>
                    </div>
                </td>
            </tr>
        </table>
    </body>
</html>

One other thing that should be mentioned is that you should name your file with a .doc file extension. If the file is being downloaded through a web browser, you should also send the proper header information. For example, in PHP:

header('Content-Type: application/msword');
header("Content-Disposition: attachment; filename=\"$filename.doc\"");
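The same headers can be sent from other stacks too. Below is a rough Python equivalent written as a plain WSGI application with no framework. The `make_word_download_app` factory and the sample filename are hypothetical, and using the `attachment` disposition to force a download is an assumption about the desired behavior.

```python
def make_word_download_app(filename, html):
    """Build a WSGI app that serves html as a downloadable .doc file."""
    body = html.encode("utf-8")

    def app(environ, start_response):
        headers = [
            ("Content-Type", "application/msword"),
            # "attachment" asks the browser to save the file under this name.
            ("Content-Disposition", f'attachment; filename="{filename}.doc"'),
            ("Content-Length", str(len(body))),
        ]
        start_response("200 OK", headers)
        return [body]

    return app

# Call the app directly with a stub start_response to see what it would send.
captured = {}

def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)

app = make_word_download_app("report", "<html><body><p>Hi</p></body></html>")
chunks = app({}, fake_start_response)
```

To serve it for real, the app can be passed to any WSGI server, for example `wsgiref.simple_server.make_server` from the standard library.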

Goals

My intent is basically to occasionally add technical articles about topics that interest me. Topics will generally be related to:

  1. Performance, and techniques to enhance performance.
  2. Data Techniques – loading data, querying data effectively.
  3. Miscellaneous Tips that are obscure, or not well-known

Why WordPress?

I’m a full-stack software engineer. Why would I use WordPress?

I’m using WordPress for a few reasons:

  1. It’s much easier/requires less of my time
  2. Updating the theme frequently is one of my main goals.
  3. This site is mainly about content.

From there it was an easy decision, really, and I’m happy with it so far.