Data and the Web

Archive for January, 2008

For Large Data Sets and the People Who Love Them…

Friday, January 18th, 2008

You may remember one of our earlier quests on this blog related to tagging the world’s public data. We’re still eagerly adding and tagging publicly available data sets when we find them.

Today, a friend of mine alerted me to another lode of mine-able data offered by Peter Skomoroch of Data Wrangling. There are a bunch of great sets here; hope you find something to your liking as well.

Besides enjoying his great blog name, I was also happy to be directed to a site called theinfo.org, which simply states it is “for people with large data sets.” It’s a site built to gather data-enjoyers across the web and collaborate in three areas:

  • Get: scrapers, crawlers, phone calls, buyouts
  • Process: conversions, queries, regressions, collaborative filtering
  • View: tables, graphs, maps, websites

Here’s a quick peek into their mission:

Some of us have spent years scraping news sites. Others have spent them downloading government data. Others have spent them grabbing catalog records for books. And each time, in each community, we reinvent the same things over and over again: scripts for doing crawls and notifying us when things are wrong, parsers for converting the data to RDF and XML, visualizers for plotting it on graphs and charts.

It’s time to start sharing our knowledge and our tools. But more than that, it’s time for us to start building a bigger picture together. To write robust crawl harnesses that deal gracefully with errors and notify us when a regexp breaks. To start converting things into common formats and making links between data sets. To build visualizers that will plot numbers on graphs or points on maps, no matter what the source of the input.

We’ve all been helping to build a Web of data for years now. It’s time we acknowledge that and start doing it together.

If you love data, this appears to be well worth checking out.

Have a great weekend!

Kirix Strata Beta 7: Quick Filter, Data Link Refresh and Report Writer

Monday, January 7th, 2008

(NOTE: See screencast video below for a quick look at some of the new features!)

Hope everyone had a lovely holiday season!

We’re happy to report that our developers provided lots of shiny new toys in our Strata stocking over this past month, including further work on Data Links, the inclusion of a “Quick Filter” mechanism and the introduction of our new report writer. Please feel free to download Strata Beta 7 and let us know what you think!

Here’s more information on what’s new in this latest version:

Data Links

The ability to bookmark data files is coming into its own. We’ve got things working pretty well on CSV and RSS files at the moment, with some more work still to do on HTML tables. Here’s a general synopsis:

  1. Open a CSV or RSS table from the web.
  2. Perform your own analysis, using calculated fields or marks.
  3. Save the data URL as a simple bookmark.
  4. Click the Refresh icon or open up the bookmark in the future. Your data (and your calculations) will refresh based upon the new or updated data on the server.
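
For readers who like to see the idea in code, here is a minimal Python sketch of the refresh cycle described in the steps above: keep the URL plus your calculations, then re-fetch and re-apply them on demand. This is only an illustration, not how Strata implements Data Links; the `DataLink` class, its method names, and the `example.invalid` URL are all hypothetical.

```python
import csv
import io
from urllib.request import urlopen


def fetch_text(url):
    """Download the resource at `url` and decode it as UTF-8 text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")


class DataLink:
    """Hypothetical bookmark: a data URL plus the calculations applied to it."""

    def __init__(self, url, fetch=fetch_text):
        self.url = url
        self.fetch = fetch      # injectable, so the sketch can run offline
        self.calculated = {}    # field name -> function(row dict) -> value

    def add_calculated_field(self, name, func):
        self.calculated[name] = func

    def refresh(self):
        """Re-fetch the CSV and re-apply every calculated field to each row."""
        text = self.fetch(self.url)
        rows = list(csv.DictReader(io.StringIO(text)))
        for row in rows:
            for name, func in self.calculated.items():
                row[name] = func(row)
        return rows


# Usage with a stand-in fetcher instead of a real server (assumed sample data).
server = {"body": "symbol,price,shares\nABC,10,3\n"}
link = DataLink("http://example.invalid/quotes.csv", fetch=lambda url: server["body"])
link.add_calculated_field("value", lambda r: float(r["price"]) * int(r["shares"]))

first = link.refresh()
server["body"] = "symbol,price,shares\nABC,12,3\n"  # the data changes on the server
second = link.refresh()
```

The calculation is stored once and recomputed on every refresh, so updated source data flows through without redoing the analysis by hand.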

We’ve been finding this quite useful internally, particularly in relation to analyzing our web log data. Check out the screencast below for further info.

Report Writer

With Beta 7, we are also introducing our new report writer.

You can create your report in a design view (similar to a template) and then toggle to a layout view for a preview of what you’ll see when you print. As a bonus, the layout view enables you to manipulate and format your data directly, instead of being bound to a “print preview” mode.

Another cool thing is that, besides creating reports from data in your project, you can also create reports directly from external data, such as local CSVs or MySQL tables. (First go to File > Create Connection; you can then select that connection as your source data in the report writer.) Check out the screencast below for a quick demo of the report writer in action.

Please note that there are a few known bugs with Report Writer in Beta 7. These include:

  • When using groups, the first group does not display properly.
  • The layout view can be extremely slow when using large files. Now that we’ve got some big features in, optimizations will soon follow.
  • Items in the Report Header in the design view do not display properly on the top of the page.

Other Enhancements

Here are some of the other improvements that have been implemented:

  • Quick filter allows tables to be filtered really easily (see screencast below for a quick demonstration).
  • Quick import for MS Access and Kirix Package files via the File > Open command instead of File > Import.
  • Support for CSV files with Unicode character sets.
  • CSV auto-sensing determines the field delimiter so lots of different delimited files are parsed and opened automatically (e.g. comma, tab, semi-colon, colon, pipe, tilde).
  • A bunch of scripting additions, including functions to access a database table list and table structure information. We’ve also added functions to encrypt/decrypt strings.
  • Automatic plugin detection (Strata no longer needs to reinstall plug-ins such as Flash if you have already downloaded them for other browsers).
  • Streamlined extension installation and uninstallation.
  • A new “loading” icon that appears on tabs while web pages are being downloaded.
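
To give a feel for what delimiter auto-sensing involves, here is a rough Python sketch of one simple heuristic: pick the candidate character that appears the same non-zero number of times on every line. This is an assumption about how such sensing could work, not a description of Strata's actual logic, and the function names are made up for the example. It also ignores quoted fields, which a real implementation would need to handle.

```python
import csv
import io

# Candidate delimiters taken from the list in the bullet above.
CANDIDATES = [",", "\t", ";", ":", "|", "~"]


def guess_delimiter(sample, candidates=CANDIDATES):
    """Return the candidate with a consistent, non-zero count on every line.

    Falls back to a comma when no candidate qualifies.
    """
    lines = [ln for ln in sample.splitlines() if ln.strip()]
    best, best_count = None, 0
    for delim in candidates:
        counts = {ln.count(delim) for ln in lines}
        if len(counts) == 1:            # same count on every non-blank line
            n = counts.pop()
            if n > best_count:          # prefer the one that splits the most
                best, best_count = delim, n
    return best or ","


def read_rows(text):
    """Parse delimited text using the guessed delimiter."""
    delim = guess_delimiter(text)
    return list(csv.reader(io.StringIO(text), delimiter=delim))


pipe_rows = read_rows("name|city|zip\nAnn|Chicago|60601\n")
tab_delim = guess_delimiter("x\ty\n1\t2\n")
```

Python's standard library also ships `csv.Sniffer`, which tackles the same problem with a more elaborate heuristic.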

Please check out this screencast, which provides an overview of Data Links, Quick Filter and Report Writer:

(And here’s an embeddable YouTube version…)

NOTE: For those interested, here is the Yahoo URL used in this screencast. Check out Gummy Stuff’s extremely useful Yahoo Stock Ticker CSV API site for further information.

Thanks for downloading it and giving it a spin. Please let us know if you run into any bugs or need help with anything!

About

Data and the Web is a blog by Kirix about accessing and working with data, wherever it is located. We have a particular fondness for data usability, ad hoc analysis, mashups, web APIs and, of course, playing around with our data browser.