Data and the Web

Archive for the ‘examples’ Category

A Wee Bit of Housekeeping…

Friday, July 17th, 2009

We haven't been doing much regular blogging lately, but we're hoping this will change in the coming weeks.

In the meantime, we've recently done some housekeeping on our website, so if you haven't visited recently we'd encourage you to do so. We've updated many pages with new content, but here are two sections in particular that we'd steer you toward:

  • Examples Section. This is a long overdue section that puts together some quick examples of how Kirix Strata™ can be applied to common data problems. The section is still a work in progress with more videos still to be produced. However, we expect what we have now will prove useful to new and old Strata users alike. Check it out.
  • Video Tutorials and Archive. We've done a bunch of different videos and screencasts over the past year or so, but they've been posted all over our website. This new section wrangles all of the videos together in one place for posterity. The feature tutorials, in particular, are worth viewing as they help give a more comprehensive look at how to use specific features in Strata. Take a look.

So, in a nod to the Matrix, where one cannot be told what it is, but one must see for oneself, we've tried to make some high-quality video documentation available. Stay tuned for more to come. Enjoy!

Fun (and Fraud Detection) with Benford's Law

Tuesday, July 22nd, 2008

Benford's law is one of those things your high school math teacher would break out on a slow, rainy day when the students' attention span was even lower than usual.

He'd start out by asking the class to look at a list of numbers and then predict how often each digit from 1 to 9 would appear as the leading digit. The students would make some guesses and eventually come to the consensus that the nine digits would be about equally likely, at roughly 11% each.

Then, the teacher would just sit back, smile, and gently shake his head at his simple-minded pupils. He would then go on to explain Benford's law, which would blow everyone's mind — at least through lunchtime.

Play Benford's Law Video

(Click the image above… or here's an embeddable YouTube version)

Per Wikipedia:

Benford's law, also called the first-digit law, states that in lists of numbers from many real-life sources of data, the leading digit is distributed in a specific, non-uniform way.

Specifically, in this way:

Leading Digit     Probability
      1              30.1%
      2              17.6%
      3              12.5%
      4               9.7%
      5               7.9%
      6               6.7%
      7               5.8%
      8               5.1%
      9               4.6%
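For anyone who'd like to see where those percentages come from: Benford's law gives the probability of leading digit d as log10(1 + 1/d). Here's a minimal Python sketch that reproduces the table (the function name is just for illustration and has nothing to do with the Strata extension):

```python
import math


def benford_probability(digit: int) -> float:
    """Expected share of values whose leading digit is `digit` (1 through 9)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Print the same table as above, one row per leading digit.
for d in range(1, 10):
    print(f"{d}  {benford_probability(d):.1%}")
```

The nine probabilities add up to exactly 1, since the sum collapses to log10(10).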

Again, from Wikipedia:

This counter-intuitive result applies to a wide variety of figures, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature).

Boiling it down, this means that for almost any naturally occurring data set, the digit 1 will appear first about 30% of the time. And “naturally occurring” can mean check amounts or stock prices or website statistics. Non-naturally occurring data would be pre-assigned numbers like postal codes or UPC numbers.

Besides being fun to play with, Benford's is used in the accounting profession to detect fraud. Because data like tax returns and check registers follow Benford's, auditors can use it as a high-level check of a data set. If there are anomalies, they may be worth investigating more closely as potential fraud.
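To make the "high-level check" idea a bit more concrete, here's a rough Python sketch that compares the observed leading-digit shares of a data set with the shares Benford's law predicts. This is only an illustration, not how auditors or the Strata extension actually do it; the function names are made up, and the powers of 2 are a stand-in for real data such as check amounts.

```python
import math
from collections import Counter
from typing import Iterable, Optional


def leading_digit(value: float) -> Optional[int]:
    """Return the first non-zero digit of a number, or None for zero."""
    mantissa = f"{abs(value):.15e}"  # scientific notation, e.g. '4.25...e+02'
    first = int(mantissa[0])
    return first if first != 0 else None


def benford_deviation(values: Iterable[float]) -> dict:
    """Map each digit 1-9 to a pair (observed share, expected Benford share)."""
    digits = [d for d in map(leading_digit, values) if d is not None]
    if not digits:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    total = len(digits)
    return {
        d: (counts[d] / total, math.log10(1 + 1 / d))
        for d in range(1, 10)
    }


# Stand-in sample: the first 100 powers of 2.
sample = [2 ** n for n in range(1, 101)]
for digit, (observed, expected) in benford_deviation(sample).items():
    print(f"{digit}  observed {observed:.1%}  expected {expected:.1%}")
```

A digit whose observed share sits far from the expected one is a reason to look closer, nothing more; a proper audit would use a statistical test and a lot more context.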

If you're interested in further information about fraud detection using Benford's, definitely give these two articles by Malcolm W. Browne and Mark J. Nigrini a read.

Try It Out for Yourself

Take a look at the demonstration video above to see Benford's law in action with data sets from the web. If you'd like to play with it yourself, just install the Benford's Law extension for Kirix Strata™ and have fun.

Also, please note that I used the following data sets in the video, if you'd like to give those a spin:

Wikipedia List of Lakes in Minnesota
US Census Data Sets
Social Blade - Digg Statistics

And here are a few other worthy ones that didn't make it in the video:

NASDAQ Historical Stock Price
Wikipedia List of Countries by Population
And plenty more at Delicious here…

Enjoy!

Kirix Strata Beta 7: Quick Filter, Data Link Refresh and Report Writer

Monday, January 7th, 2008

(NOTE: See screencast video below for a quick look at some of the new features!)

Hope everyone had a lovely holiday season!

We're happy to report that our developers provided lots of shiny new toys in our Strata stocking over this past month, including further work on Data Links, the inclusion of a “Quick Filter” mechanism and the introduction of our new report writer. Please feel free to download Strata Beta 7 and let us know what you think!

Here's more information on what's new in this latest version:

Data Links

The ability to bookmark data files is coming into its own. We've got things working pretty well on CSV and RSS files at the moment, with some more work still to do on HTML tables. Here's a general synopsis:

  1. Open a CSV or RSS table from the web.
  2. Perform your own analysis, using calculated fields or marks.
  3. Save the data URL as a simple bookmark.
  4. Click the Refresh icon or open up the bookmark in the future. Your data (and your calculations) will refresh based upon the new or updated data on the server.

We've been finding this quite useful internally, particularly in relation to analyzing our web log data. Check out the screencast below for further info.

Report Writer

With Beta 7, we are also introducing our new report writer.

You can create your report in a design view (similar to a template) and then toggle to a layout view for a preview of what you'll see when you print. As a bonus, the layout view enables you to manipulate and format your data directly, instead of being bound to a “print preview” mode.

Another cool thing is that, besides creating reports from data in your project, you can also create reports directly from external data, such as local CSVs or MySQL tables. (First go to File > Create Connection, then you can select it as your source data in the report writer). Check out the screencast below for a quick demo of the report writer in action.

Please note that there are a few known bugs with Report Writer in Beta 7. These include:

  • When using groups, the first group does not display properly.
  • The layout view can be extremely slow when using large files. Now that we've got some big features in, optimizations will soon follow.
  • Items in the Report Header in the design view do not display properly on the top of the page.

Other Enhancements

Here are some of the other improvements that have been implemented:

  • Quick filter allows tables to be filtered really easily (see screencast below for a quick demonstration).
  • Quick import for MS Access and Kirix Package files via the File > Open command instead of File > Import.
  • Support for CSV files with Unicode character sets.
  • CSV auto-sensing determines the field delimiter so lots of different delimited files are parsed and opened automatically (e.g. comma, tab, semi-colon, colon, pipe, tilde).
  • A bunch of scripting additions, including functions to access a database table list and table structure information. We've also added functions to encrypt/decrypt strings.
  • Automatic plugin detection (Strata now doesn't need to reinstall programs like Flash plug-ins if you have already downloaded them for other browsers).
  • Streamlined extension installation and uninstallation.
  • A new “loading” icon that appears on tabs while web pages are being downloaded.
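As an aside on the CSV auto-sensing item above: the general idea of guessing a delimiter from a sample of the file can be sketched in a few lines of Python using the standard library's csv.Sniffer. To be clear, this is not how Strata does it internally (we're not describing that here); the candidate list simply mirrors the delimiters mentioned in the bullet, and the function names are made up for the example.

```python
import csv
import io

# Candidate delimiters: comma, tab, semi-colon, colon, pipe, tilde.
CANDIDATE_DELIMITERS = ",\t;:|~"


def sniff_delimiter(sample: str) -> str:
    """Guess the field delimiter from a sample of the file's text."""
    dialect = csv.Sniffer().sniff(sample, delimiters=CANDIDATE_DELIMITERS)
    return dialect.delimiter


def read_rows(text: str) -> list:
    """Parse delimited text after sensing the delimiter from its first 4 KB."""
    delimiter = sniff_delimiter(text[:4096])
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


pipe_text = "name|state|calls\nAnn|IL|12\nBob|CA|7\n"
print(repr(sniff_delimiter(pipe_text)))
print(read_rows(pipe_text))
```

The sniffer is a heuristic, so very short or irregular files can still fool it; csv.Sniffer raises csv.Error when it can't decide.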

Please check out this screencast, which provides an overview of Data Links, Quick Filter and Report Writer:
Play Video

(And here's an embeddable YouTube version…)

NOTE: For those interested, here is the Yahoo URL used in this screencast. Check out Gummy Stuff's extremely useful Yahoo Stock Ticker CSV API site for further information.

Thanks for downloading it and giving it a spin. Please let us know if you run into any bugs or need help with anything!

Mr. MacGyver, Meet Kirix Strata

Tuesday, October 16th, 2007

(NOTE: Screencast of this exercise is available below.)

A few days ago, the always datariffic folks at Juice Analytics posted an article about MacGyver-ing call volume data and pushing it into an online mapping application called Mapeteria. Basically, they were doing some ad hoc data visualization that combined public web data, private phone call data and a web service that provided the visualization (which in turn used the Google Maps API).

Huh… local data, web data and web APIs? Sounds like a perfect application for a data browser (well, it would've been perfect if the web service accepted a POST command, but I digress). A data browser enables you to easily access web data, combine it with local data, perform any required data clean up and then push/pull data from the web — without ever leaving the tool.

It also would've saved Juice a bit of time, particularly with grabbing area codes and prepping that file. Let's look at the four steps they went through and we'll see how Kirix Strata™ might improve the experience:

1. Pull out the area codes.

The data had phone number values like “12345678901” as well as “2345678901”, so they used the following formula to pull out the area codes using Excel:

=VALUE(IF(LEFT(E7,1)="1",MID(E7,2,3),MID(E7,1,3)))

Strata would use a similar formula:

iif(left(tel,1)="1",substr(tel,2,3),substr(tel,1,3))

The main time savings here (particularly with large files) is that the calculated field populates automatically for every record in Strata, instead of needing to paste formulas. OK… not terribly exciting thus far.
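For readers who don't have Excel or Strata handy, the same logic looks like this as a small Python sketch (the function name is made up for the example). Note that Python slices count from zero, whereas MID and substr in the formulas above count from one.

```python
def area_code(tel: str) -> str:
    """Skip a leading '1' country code if present, then take the next three digits."""
    return tel[1:4] if tel.startswith("1") else tel[:3]


print(area_code("12345678901"))
print(area_code("2345678901"))
```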

2. Convert area codes into states

This is a multi-part step:

a) Locate a table from the web that has area code data associated with a state ID (while fending off parasitic scammers).
b) Clean up the table as necessary.
c) Do a lookup from the phone call data that adds in the state where the call originated.

Strata can really cut down the amount of time spent on this step. Because of the website used, the folks at Juice surely had to create their lookup table manually. I went to Delicious, searched for “area codes” and found this very useful website, which had all the data in a nice HTML table. With Strata, I simply right-clicked and selected “Import Data” and immediately had the table I needed for the lookup.

Finally, I created a relationship between my two tables and dragged the state codes (e.g., CA, IL, NY) into the phone call data.
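Conceptually, that relationship is just a lookup keyed on the area code. Here's a minimal Python sketch of the idea; the three-entry lookup table and the phone numbers are made-up sample data (the real table from the web has every area code), and the function name is hypothetical.

```python
# Tiny stand-in for the area code table imported from the web.
AREA_CODE_TO_STATE = {"212": "NY", "312": "IL", "415": "CA"}

# Made-up phone call records.
calls = [
    {"tel": "13125550100"},
    {"tel": "4155550101"},
    {"tel": "9995550102"},  # no match in the lookup table
]


def add_state(rows, lookup):
    """Return copies of the rows with 'area_code' and 'state' columns added."""
    enriched = []
    for row in rows:
        tel = row["tel"]
        code = tel[1:4] if tel.startswith("1") else tel[:3]
        # Unmatched area codes get None rather than raising an error.
        enriched.append({**row, "area_code": code, "state": lookup.get(code)})
    return enriched


for row in add_state(calls, AREA_CODE_TO_STATE):
    print(row)
```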

3. Create a summary data set

This was done using a pivot table in Excel. Strata doesn't have classic pivot tables in its feature set at this point, but it does have a nice li'l grouping utility. So, once I knew what CSV format was required for the Mapeteria web service, I grouped the data accordingly.
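The grouping step boils down to counting calls per state and writing the result out as CSV. A rough Python sketch follows; the two-column "state,calls" layout is an assumption made for the example, since the exact format Mapeteria expects isn't spelled out in this post, and the function name is made up.

```python
import csv
import io
from collections import Counter


def summarize_by_state(states) -> str:
    """Count calls per state and return the summary as CSV text."""
    counts = Counter(state for state in states if state)  # skip unmatched rows
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["state", "calls"])  # assumed header, see note above
    for state, total in sorted(counts.items()):
        writer.writerow([state, total])
    return buffer.getvalue()


print(summarize_by_state(["IL", "CA", "IL", None, "NY", "IL"]))
```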

4. Create a colorized map of the U.S.

This is the “almost perfect” part I referred to above.

Though Mapeteria is a very cool visualization service using Google Maps, it needs to fetch a CSV file embedded in a URL from elsewhere on the web. If the service were able to accept data via a POST command (or something like an “Upload Data” button), Strata would have been able to just take the table we created and push it to the web service, no CSV transformation required (in fact, we've got some stuff cooking in our labs that would make this as easy as copy and paste). And, if we were just able to push the data out like this, we would have immediately gotten the map without ever leaving our data browser.

But, like Zach at Juice, I had to save the file in a CSV format and then upload it to a server before I was able to get my map. Here's a screencast of the entire process… once I found the area code data on the web, it took less than 5 minutes to get my map.

Play Video

(And here's an embeddable YouTube version…)

If anyone wants to try this process out for themselves, please feel free to download Strata and give it a try. This data browser is in beta and completely free to use; we're also giving away free full licenses to anyone who provides feedback during the beta period. Oh, and here is the sample phone call volume data I used for this exercise:

Click here to download Phone Call Volume Sample Data (.csv, 10KB)

This is a pretty simple example of how Strata can be used for ad hoc data access and manipulation with data from the web (or, as one can imagine, within a corporate intranet) and make this kind of analysis very efficient. Throw in some web services, web APIs or very large files into the mix, and you've got the chance to do some fairly interesting things.

As always, if anyone has any questions, either post in the comments below or in our support forums… or just shoot us a support email. Thanks!

Playing Nice with Yahoo Pipes

Wednesday, October 10th, 2007

Yahoo Pipes is a pretty slick tool that makes it easy to combine and mash up data sources from around the web and then output the data into formats like RSS and JSON. One of the really nice things is its interface, which lets non-programmers lurk and meddle in this otherwise fearsome domain.

Today I came across a post by tag aficionado Jon Udell who was looking for a way to combine multiple feeds (based on a single tag) into a single feed for consumption. Within an hour and a half, a person named engtech created a Yahoo Pipe called Tagosphere to solve the problem. Pop in the tag you want, hit Run and get your results. Very cool.

To digress for a moment, one of the pet projects I've had on my (long) to do list is to use Kirix Strata™ to create an application that alerts me when someone references “kirix” in a blog post, article, or elsewhere on the web. I currently do this by subscribing to feeds from Google News, Google Blog Search, Technorati, Bloglines, Topix, Digg, etc. This is fine, but a bit clunky due to the many duplicate entries. It also is not comprehensive.

So the other thing I want to do is bring in my website referrers from AWstats or Google Analytics (or our raw Apache web logs). Lots of times we'll see people coming to our site from blogs, forum posts or websites that never get picked up by those above-referenced feeds. So then, I would just need to combine all the data, remove duplicates, timestamp it… and now I have a pretty comprehensive idea of where the latest buzz is coming from.
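The "combine, remove duplicates, timestamp" part is simple enough to sketch in Python. Everything here is hypothetical: the field names ("link", "published"), the example.com URLs and the function name are all invented for illustration, and matching duplicates on the link alone is a simplification.

```python
def merge_mentions(*sources):
    """Combine several lists of mentions, keep one per link, newest first."""
    by_link = {}
    for source in sources:
        for item in source:
            key = item["link"].rstrip("/")  # treat '/post/1' and '/post/1/' alike
            by_link.setdefault(key, item)   # the first occurrence wins
    return sorted(by_link.values(), key=lambda item: item["published"], reverse=True)


# Made-up entries standing in for feed results and web log referrers.
feed_items = [
    {"link": "http://example.com/post/1", "published": "2007-10-09"},
    {"link": "http://example.org/news/7", "published": "2007-10-10"},
]
referrers = [
    {"link": "http://example.com/post/1/", "published": "2007-10-09"},
    {"link": "http://example.net/forum/42", "published": "2007-10-08"},
]

for mention in merge_mentions(feed_items, referrers):
    print(mention["published"], mention["link"])
```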

So, the Tagosphere Pipe mentioned above is a pretty good start. I can create a feed for “kirix” and get a combined set of data with the duplicates removed. However, because I want to sort and filter this dataset, I need to get it into Strata. I could just manually go to the Tagosphere page in Strata and click on the RSS feed to get my table. However, because I'm looking at actually using this Pipe for a future application, I decided it would be nice to show how Strata can work directly with the Yahoo Pipe via a script:

1. In Strata, go to File > New > Script.

2. Copy the following text into the script tab:

// Ask the user for the tag to search on.
var t = new TextEntryDialog;
t.setCaption("Pipes Search");
t.setMessage("Please enter search term:");
if (t.showDialog())
{
    // Append the tag to the Tagosphere pipe's RSS URL and open the feed.
    var s = "http://pipes.yahoo.com/pipes/pipe.run?_id=mFZPs1l33BGJYdGGn0artA&_render=rss&tag=";
    s += t.getText();
    HostApp.openWeb(s);
}

3. Save the Script then go to Tools > Run Script/Query

As you see, a dialog opens where you can enter your tag. Enter the tag, click OK and up pops the feed in a table format.
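One thing the script doesn't do is escape the tag before appending it to the URL, so a tag containing spaces or an ampersand could produce a broken request. Outside of Strata, the URL-building step might look like the following Python sketch, which lets the standard library handle the encoding. The pipe ID and parameters are the ones from the script above; the function name is made up, and the sketch only builds the URL without fetching anything.

```python
from urllib.parse import urlencode

PIPE_RUN_URL = "http://pipes.yahoo.com/pipes/pipe.run"
TAGOSPHERE_PIPE_ID = "mFZPs1l33BGJYdGGn0artA"


def tagosphere_feed_url(tag: str) -> str:
    """Build the Tagosphere RSS URL for a tag, with the tag safely encoded."""
    query = urlencode({"_id": TAGOSPHERE_PIPE_ID, "_render": "rss", "tag": tag})
    return f"{PIPE_RUN_URL}?{query}"


print(tagosphere_feed_url("kirix"))
print(tagosphere_feed_url("data browser"))
```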

This example is obviously very simplistic. But, if I then take it to its logical conclusion and bring in my referrer data, remove duplicates and run it on a regular basis, I've got my own personal Pub Sub. Even better, I can stand on the shoulders of giants by using all the great stuff already written in Yahoo Pipes or Dapper or anything else that exports data as RSS or CSV.

We've got a ton of ideas that we plan on sharing with everyone, but have been really focused on getting the Strata beta fully functional and stable. Stay tuned though, more fun stuff to come soon…

P.S. If anyone wants to play around with using Yahoo Pipes with Strata and needs any help at all, please just shoot us a support email or post something in the forums. Also, if you come up with a cool app, let us know, we'd be thrilled to hear about it. Thanks!

Embedded phpBB Search Terms within Apache Web Logs

Friday, August 24th, 2007

This afternoon I was doing some analysis on our web logs and thought it may make for a good screencast and blog post. We currently use a combination of AWstats and Google Analytics for our web stats but are increasingly using Kirix Strata™ to dig deeper into the raw web logs for the more customized things that aren't readily available otherwise.

Also, honestly, it is kind of fun to plow through almost a million records on your own. Hmmm, maybe I should get out more.

The topic of the screencast below is the search terms people enter to find things in our phpBB3 support forums. These terms are embedded in the “request” field of the Apache logs and I couldn't find a way to get them without digging into the logs themselves (NOTE: I wouldn't doubt that there is some way to do this via a mod to phpBB or a filter in Google Analytics… but since I couldn't find anything via a quick Google search, using Strata just ended up being a lot faster).

An example of a search string we're dealing with is:

GET /forums/search.php?keywords=proxy HTTP/1.1

So the trick was to parse the search keywords out of the field and then group them together to see what people were searching for… and in turn give us the chance to improve our support area by targeting some of these search terms and expanding our documentation accordingly.
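For comparison, here's roughly what the same parse-and-group step looks like as a Python sketch. It isn't what the screencast does (that uses Strata's own functions), and the function name and sample requests are made up, apart from the first request, which is the example string above.

```python
from collections import Counter
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def search_keywords(request: str) -> Optional[str]:
    """Pull the 'keywords' value out of a phpBB search request, if there is one."""
    parts = request.split(" ")
    if len(parts) < 2:
        return None
    url = urlsplit(parts[1])  # parts[0] is the method, parts[1] the path and query
    if not url.path.endswith("/search.php"):
        return None
    values = parse_qs(url.query).get("keywords")
    return values[0].lower() if values else None


requests = [
    "GET /forums/search.php?keywords=proxy HTTP/1.1",
    "GET /forums/search.php?keywords=Proxy HTTP/1.1",
    "GET /forums/search.php?keywords=data+link HTTP/1.1",
    "GET /index.html HTTP/1.1",
]

# Group the keywords and list the most frequent first.
term_counts = Counter(k for k in map(search_keywords, requests) if k)
for term, count in term_counts.most_common():
    print(count, term)
```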

Hope this video proves helpful:

Play Video

(And here's an embeddable YouTube version…)

TECHNICAL NOTE:

I downloaded the Apache logs from the server and, due to the file size, decided to import them into Strata rather than open the file and work with it directly. To import your logs, go to Import, select text-delimited files, and then import as space delimited with quotation marks as the text qualifier. Update: You can now use a handy little log parsing extension to pull in your web log files without having to mess around with a straight text import.
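If you're curious what "space delimited with quotation marks as the text qualifier" does to a log line, this Python sketch applies the same two settings with the standard csv module. The log line is a made-up example in the common "combined" layout, so the field positions are an assumption about your log format.

```python
import csv

# Made-up log line in the combined log layout.
line = (
    '192.0.2.1 - - [24/Aug/2007:14:02:11 -0500] '
    '"GET /forums/search.php?keywords=proxy HTTP/1.1" 200 5120 "-" "Mozilla/5.0"'
)


def split_log_line(text: str) -> list:
    """Split one log line on spaces, treating double quotes as the text qualifier."""
    return next(csv.reader([text], delimiter=" ", quotechar='"'))


fields = split_log_line(line)
for index, value in enumerate(fields):
    print(index, value)
```

Note that the bracketed timestamp contains a space, so it comes through as two fields, and the quoted request lands in a single field of its own. Quotes escaped with a backslash inside a field aren't handled by this simple approach.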

TECHNICAL NOTE 2:

For posterity, here are the functions that were used in this screencast:

STRPART(string, section [, delimiter])
SUBSTR(string, start [, length])
CONTAINS(string, search string)
IIF(boolean test, true value, false value)

Horizontal Tab Groups Make Bug Entry Fun!

Friday, August 10th, 2007

Bug Entry is Fun! (screenshot)

Thanks for all the bug reports this week; we're working hard to sort them out.

We're hoping to have a new beta for everyone early next week. In addition to a lot of nickel and dime fixes, we'll definitely be adding a configuration page for proxy settings, our most requested feature.

Have a good weekend!

About

Data and the Web is a blog by Kirix about accessing and working with data, wherever it is located. We have a particular fondness for data usability, ad hoc analysis, mashups, web APIs and, of course, playing around with our data browser.