Data and the Web

Announcing Kirix Strata 4.3

April 28th, 2009

We’re pleased to announce that we just released a new upgrade to Kirix Strata, version 4.3! Kudos to our developers for adding a lot of nice features and bug fixes.  The full list of notes to this release is below the jump, but here are a few of the bigger changes:

External Database Connectivity

We’ve really improved the way that Strata works with external databases by optimizing our pass-through queries for databases like Oracle, SQL Server and MySQL.  In addition, queries in the query builder that reference external database tables pass the query through to the external database, significantly increasing the speed of queries on external databases.  Furthermore, you can now edit individual cells in Strata and have them update in your external database table.  This is very welcome news to folks that want to use Strata as a front-end to their external database tables.

UPDATE (04/30/2009):  Just a quick note of clarification: on the “read” side, you can work with external databases for things like sorting, filtering, marks, calculated fields, grouping and copying.  On the “write” side, we currently only have cell editing available, but will work on adding other features in the future such as append (i.e., insert record), delete, update, and some modify structure operations.  If you need these additional “write” features, please send us a note to let us know how you plan to use them; it will help us prioritize our development efforts. Thanks!

Improved SQL Support

We added a console panel to allow direct querying of internal and external databases with SQL commands, as well as to provide feedback for database operations and scripts.  You can learn more here.

EBCDIC Conversion

Strata now handles EBCDIC.  We haven’t added copybook support just yet, but you can either manually set your breaks using the text-import or create scripts to convert the EBCDIC file to ASCII format.  You can learn more here.
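
For the scripted route, here is a minimal sketch in Python (not Strata's own scripting language) of converting fixed-length EBCDIC records to ordinary text. It assumes code page 037 (US/Canada EBCDIC), a made-up record layout, and character-only fields; packed-decimal fields would need separate handling. The names LAYOUT, RECORD_LENGTH and ebcdic_to_rows are hypothetical.

```python
# Hypothetical layout: (field name, start offset, length) within each record.
LAYOUT = [("name", 0, 10), ("city", 10, 8)]
RECORD_LENGTH = 18


def ebcdic_to_rows(data, layout=LAYOUT, record_length=RECORD_LENGTH,
                   codepage="cp037"):
    """Split fixed-length EBCDIC records into dicts of decoded strings."""
    rows = []
    # Step through the file one whole record at a time.
    for start in range(0, len(data) - record_length + 1, record_length):
        record = data[start:start + record_length]
        row = {}
        for name, offset, length in layout:
            raw = record[offset:offset + length]
            # Decode EBCDIC bytes to text and drop the trailing padding.
            row[name] = raw.decode(codepage).rstrip()
        rows.append(row)
    return rows


# Build a small EBCDIC sample in memory so the sketch runs on its own.
sample = ("JOHN SMITHCHICAGO " + "JANE DOE  BOSTON  ").encode("cp037")
rows = ebcdic_to_rows(sample)
```

The offsets in LAYOUT play the same role as the breaks you would set by hand in the text import.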

Fixed Length and Delimited Table Export

We’ve also added Fixed-length export (this also works when using File > Save As External).  In addition, we’ve expanded the text-delimited export so that you can specify your own delimiters, such as pipe-delimited and semi-colon delimited. You can learn more about the new text-delimited functionality here.
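
To make the difference between the two export styles concrete, here is a small Python sketch (an illustration only, not how Strata implements its export). The sample rows, field widths and function names are made up for the example.

```python
import csv
import io

# Made-up sample rows for illustration.
rows = [
    {"id": "1", "name": "Acme", "amount": "19.99"},
    {"id": "2", "name": "Globex", "amount": "5.00"},
]
fields = ["id", "name", "amount"]


def to_delimited(rows, fields, delimiter="|"):
    """Write rows as delimited text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(fields)
    for row in rows:
        writer.writerow([row[f] for f in fields])
    return out.getvalue()


def to_fixed_length(rows, widths):
    """Pad (or truncate) each field to its width; no delimiters, no header."""
    lines = []
    for row in rows:
        lines.append("".join(row[f].ljust(w)[:w] for f, w in widths.items()))
    return "\n".join(lines) + "\n"


pipe_text = to_delimited(rows, fields)                 # pipe-delimited
semi_text = to_delimited(rows, fields, delimiter=";")  # semicolon-delimited
fixed_text = to_fixed_length(rows, {"id": 4, "name": 8, "amount": 7})
```

In the fixed-length output every line has the same width (19 characters here), so whoever reads the file needs to know the field widths in advance.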

Handling Tablenames & Fieldnames with Spaces

One of our most common support questions relates to spaces in a fieldname (like “my field” instead of “my_field”).  We’ve now solved this issue by allowing spaces to be used by enclosing the name in brackets.  So, for example, these are now all valid expressions:

[Field  1] * [Field  2]
Field1 * [Field  2]
[Table 1].[Field 1]*[Table 2].Field2

You can learn more here.
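
As a rough illustration of how bracketed names can be picked out of an expression, here is a Python sketch. It is not Strata's parser; the pattern and the function name field_references are assumptions made for the example, and the bare-word branch would also match function names or words inside string literals in a real expression.

```python
import re

# A name is either [anything inside brackets] or a bare word like Field1.
NAME = re.compile(r"\[([^\]]+)\]|([A-Za-z_][A-Za-z0-9_]*)")


def field_references(expression):
    """Return the names referenced in an expression, brackets removed."""
    names = []
    for match in NAME.finditer(expression):
        # group(1) is a bracketed name, group(2) a bare one.
        names.append(match.group(1) or match.group(2))
    return names


refs = field_references("[Table 1].[Field 1]*[Table 2].Field2")
```

Because the brackets mark where a name starts and ends, the space inside "Table 1" is no longer ambiguous.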

Much Much More…

There are plenty of other upgrades like project handling, new keyboard shortcuts, auto-fill group and sort dialogs, new script classes, etc.  You can check out all the changes below the jump.

Please download the latest Strata (or just click “Check for Updates” in the Help menu), give it a whirl and let us know what you think!

Read the rest of this entry »

Data.gov

March 5th, 2009

We recently posted an article about Vivek Kundra, who was named United States CIO this morning by the Obama administration.  He’s got $71 billion in IT spending under his care.  Hmm, that’s a lot of data browsers.

One interesting tidbit appeared in this Saul Hansell NY Times article:

Another initiative will be to create a new site, Data.gov, that will become a repository for all the information the government collects. He pointed to the benefits that have already come from publishing the data from the Human Genome Project by the National Institutes of Health, as well as the information from military satellites that is now used in GPS navigation devices.

“There is a lot of data the federal government has and we need to make sure that all the data that is not private, or restricted for national security reasons, can be made public,” [Kundra] said.

In another bit of interesting news, Jonathan Stein at Mother Jones notes that Mike Honda (D-Calif) added a provision into the recent appropriations bill that requires government entities to make their public data available in raw form:

If the Senate passes the bill with the provision intact, citizens seeking information about Congress’ activities—such as bill names and numbers, amendments, votes, and committee reports—won’t have to rely on government websites, which often filter information, are incomplete, or are difficult to use. Instead, the underlying data will be available to anyone who wants to build a superior site or tool to sift through it. “The language is groundbreaking in that it supports providing unfiltered legislative information to the public,” says Honda’s online communications director, Rob Pierson. “Instead of silo-ing the information, and only allowing access through a limited web form, access to the raw data will make it easier for people to learn what their government is doing.”

Kim Zetter from Wired has more on the story here.

Maybe once the data is made more accessible, some clever folks can put an interface on things that improves the complex aftermath of the “laws and sausages” routine.  I did my best to search for Honda’s three-sentence provision in the latest omnibus bill with no luck.  Anyone know what the actual provision stated? [UPDATE:  Rob Pierson, Online Communications Director of Congressman Honda’s office, provided a link to an O’Reilly post with the full text of the provision.  Give the full article a read; it’s quite worthwhile.]

And, for posterity, here are some of the data repositories mentioned in the articles above:

AWS Public Data Sets Continues to Expand

February 25th, 2009

Previously, we posted some information on Amazon’s foray into making huge public data sets available to users of their web services.  Yesterday they announced some very sizable additions:

If you use AWS, the announcement provides more info on these datasets as well as how to access them.  If you don’t use AWS, you can still access much of this data directly from the websites linked above.

Free E-Gov Conference (via webcast) on February 17, 2009

February 11th, 2009

As a follow up to my previous post on e-government, just wanted to let those who are interested know that there’s a free conference offered next week that will get much more in-depth about the initiatives for changing the way government uses and disburses information.  The conference will also have a particular emphasis on using semantic technologies.

Here are the details:

From E-Gov to Connected Governance: the Role of Cloud Computing, Web 2.0 and Web 3.0 Semantic Technologies

Tuesday, February 17, 2009.

Morning session: 8:30 am EST to 12:00 noon. Afternoon session: 1:00 pm EST to 4:00 pm EST.

Synopsis:  “We have a new administration that values transparency, citizen participation, collaboration, information sharing, and internet technology… The purpose of this conference is to operationalize this vision, demonstrate the kinds of changes that are coming to next stage web-based systems in government, and to map the role of  information and communication technologies (specifically, cloud computing, Web 2.0, and Web 3.0 semantic technologies) in the evolution of government information systems from e-gov (silos with web front ends) to connected governance (e.g. distributed social computing environments for collaborative work, information sharing, knowledge management, and participatory decision-making.)”

Webcast sign-up here (or, if you are in Washington DC area, you could attend in person)

Further information about the conference can be found here.

More Government Data Coming to a Browser Near You…

February 6th, 2009

It was intriguing to see how all this newfangled web 2.0 technology was applied during the US presidential campaign this past year (organization, multimedia, etc.).  It’s also quite interesting to hear about some of the big ideas for how the new administration wants to change how government works.  And, not to be outdone, the opposition party is also getting into the Web 2.0 game.

According to Nextgov, it appears that Vivek Kundra, current CTO of the District of Columbia, is going to be given the nod as the next e-government liaison.  From the article:

Kundra also is a strong proponent of giving the public access to government data. “Why does the government keep information secret?” he rhetorically asked during an interview with Nextgov. “Why not put it all out in the government domain?” “[Since arriving in Washington], I’ve made all the government databases public. Every 311 call, every abandoned automobile, who has responded, etc. It provides high-level oversight of the daily tasks of government.”

A more in-depth bio of Kundra can be found at this recent Washington Post article.  A couple of the more intriguing things that he promoted in the District of Columbia were the DC Data Catalog and “Apps for Democracy.”

The data catalog covers all kinds of DC data, from crime statistics to (ahem) the most recent roadkill pickups.  It’s also available in a wide variety of formats. “Apps for Democracy” was a kind of mashup contest to see what kind of apps could be developed to improve DC residents’ access to data.  It was highly successful, providing 47 different applications for a fraction of the cost of formally contracting out these projects.

Of course, changing such a huge, bureaucratic system as the Federal government will not happen overnight, but it is encouraging to see more of a focus on making data available in a timely manner (and in usable formats).

For those interested in this sort of thing, I’d also recommend checking out the Sunlight Foundation, which is focused on government transparency.  Also, TechPresident and Nextgov are both news sources focused on following all things e-gov.

Got any other interesting links on this topic?  Please feel free to post ‘em in the comments below.

Cooking the (Quick)Books

January 14th, 2009

Ah, tax season… could there be a more thrilling time of the year?

So, today I was reviewing a sales & use tax form for the State of Illinois. Since our governor really isn’t helping matters in our state these days, we felt the least we could do to help was to make sure to pay our taxes on time.

So, I was looking at our sales tax report in Quickbooks and, like a good accountant, just quickly checked to make sure it matched up against the total revenues in the income statement. They didn’t match.

Hmm… funny thing about accounting, things really ought to balance.

It was a small discrepancy, but after searching unsuccessfully for the difference, it was clear that the issue involved more than one transaction. And, unfortunately, there were just far too many transactions to try and come up with a solution manually.

So, since I happened to have this data browser lying around, I exported both reports as CSV files and opened them up in Strata.

The Quickbooks CSVs were obviously meant for spreadsheet export (they included subtotals and odd record breaks), so I quickly did some clean up and then did a few gymnastics to compare the tables. Turns out there were a few manual journal entries that weren’t mapped to the sales tax codes required by Quickbooks. And here I was hoping to blame Quickbooks… oh well.
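
For anyone without a data browser handy, the same kind of comparison can be sketched in a few lines of Python. This is only an illustration: the column names (txn_id, amount) and the sample rows are hypothetical, and it assumes the exports have already been cleaned down to one row per transaction.

```python
import csv
import io
from decimal import Decimal

# Hypothetical, already-cleaned exports: one row per transaction.
income_csv = """txn_id,amount
1001,250.00
1002,75.50
1003,40.00
"""
sales_tax_csv = """txn_id,amount
1001,250.00
1002,75.50
"""

ZERO = Decimal("0")


def load_amounts(text):
    """Map transaction id -> total amount (grouping repeated ids)."""
    totals = {}
    for row in csv.DictReader(io.StringIO(text)):
        txn = row["txn_id"]
        totals[txn] = totals.get(txn, ZERO) + Decimal(row["amount"])
    return totals


income = load_amounts(income_csv)
sales_tax = load_amounts(sales_tax_csv)

# Keep only transactions that are missing from one report or differ in amount.
differences = {
    txn: income.get(txn, ZERO) - sales_tax.get(txn, ZERO)
    for txn in sorted(set(income) | set(sales_tax))
    if income.get(txn, ZERO) != sales_tax.get(txn, ZERO)
}
```

The grouping step and the lookup across the two tables are the parts that are awkward to do by hand in a spreadsheet.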

Running through this process was a 5 minute affair, but it made me wonder about all these other small data manipulation tasks that are out there. There have got to be millions, nay, billions, of these things — 5 minute one-off, ad hoc data tasks that just can’t be solved with the help of a spreadsheet (in this case, grouping or relationships were needed to do this quickly).

What do people normally do in these situations? I fear that they probably spend hours working the problem manually. Got a similar story and/or solution? Feel free to share in the comments section below.

Amazon Gets into the Public Data Sets Game

December 4th, 2008

Amazon announced the launch of its Public Data Sets service this evening.  Bottom line, they asked people for different public or non-proprietary data sets and they got ‘em.  Here’s a sample of the (pretty hefty) stuff they are hosting for free:

  • Annotated Human Genome Data provided by ENSEMBL
  • A 3D Version of the PubChem Library provided by Rajarshi Guha at Indiana University
  • Various US Census Databases provided by The US Census Bureau
  • Various Labor Statistics Databases provided by The Bureau of Labor Statistics

Though the individual sets are huge, there aren’t many of them at this point; it appears that Amazon will be filling this out over time.

How do you access them?  Well, there’s a slight hitch.  You need to fire up an EC2 instance, hook into the set and then perform your analysis.  You just pay for the cost of the EC2 service.  Given how massive these tables are, it seems like a pretty good way to go.  A step closer to the supercomputer in the cloud.

We’re devoted users of Amazon S3 here and have also done some work with EC2, which is quite impressive.  Overall, this is another example of a nice trend where large data sets are becoming more easily accessible.

If anyone has the chance to play with this service, let us know how it goes.

A CSV File You Can Believe In

December 1st, 2008

This is not a blog that delves into political issues, but I happened to notice that the Obama transition team released the names of all their donors today.  However, inexplicably, they don’t have them in a CSV format for easy slicing and dicing in your favorite data analysis software.

A couple clicks in Strata took care of that pretty quickly. (*.csv, 120 KB)

Some interesting bits of information:

  • Google is the employer with the most total donations at $14,200 (from “Google” and “Google, Inc.”, 8 employees).
  • Microsoft employees only gave $500 (2 employees)
  • 74 different colleges and universities were represented for $25,900 (81 employees)
  • 4 people who defined themselves as “Not Employed” gave a total of $11,250.
  • There are 1,776 donors in the list.  Mere coincidence… or more evidence that Obama is truly “that one” (alternatively, the list could have been hacked because he is “the one”)?

The data is a little bit dirty (particularly the “Employer” field), but you might have some fun poking around.  Shoot us a message in the comments if you find anything interesting.
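
If you would rather poke at a file like this with a script, here is a hedged Python sketch of the kind of grouping behind the numbers above. The rows, column names and the normalize_employer helper are made up for illustration and are not taken from the actual donor list.

```python
import csv
import io
import re
from collections import defaultdict

# Made-up rows for illustration only; not taken from the actual donor list.
donors_csv = """name,employer,amount
A. Donor,"Example Corp",500
B. Donor,"Example Corp, Inc.",250
C. Donor,Not Employed,100
"""


def normalize_employer(name):
    """Lowercase, drop punctuation and a trailing 'inc' so variants group together."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()
    return re.sub(r"\s+inc$", "", cleaned)


# Group by the normalized employer, counting donors and summing amounts.
totals = defaultdict(lambda: {"donors": 0, "amount": 0})
for row in csv.DictReader(io.StringIO(donors_csv)):
    key = normalize_employer(row["employer"])
    totals[key]["donors"] += 1
    totals[key]["amount"] += int(row["amount"])
```

Normalizing before grouping is what lets two spellings of the same employer count as one.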

P.S.  Also, I saw this article about data overload during the campaign… looks like the Federal Election Commission could have used the Kirix Strata government discount;)

Update:  Also, looks like George Lucas jumped in and we see an employee of the notorious Dewey, Cheetham & Howe

Announcing Kirix Strata 4.2

November 4th, 2008

We’re happy to announce another new Strata upgrade today.  If you’ve got a version of Kirix Strata 4.x on your system, this is a free minor upgrade, so download it now.

This release contains a ton of various bug fixes and adds a few notable features:

Saved Indexes

Indexes are fundamental to the inner workings of a database.  Strata has always saved indexes per session for things like sorts and relationships; however, you can now save them from session to session so they don’t need to be recreated.  You can save indexes by choosing Edit Indexes from the Data menu.  Once saved, you can just right-click on any field header and select Sort Orders to access your saved sorts instantly.

Saved Column Views

When performing your analysis, it is almost inevitable that, at some point, you’ll change the way your columns are shown.  Some will be rearranged, others will be hidden.  However, particularly with data sets that contain many fields, it is often useful to be able to save column views to access at different times.  You can now save column views in Strata by selecting Data > Columns > Edit Column Views.  Once you save your views, you can easily access them by right-clicking on any field header and selecting Column Views.

Function Help

We’ve made it a little bit easier to get help with various functions when you use the formula/expression builder.  If you hover over any function in the list, you’ll see a tool tip with the function syntax and short description.  If you click on a function, you’ll see the syntax appear on the bottom left of the formula builder with a hyperlink to further information.  Functions are such an integral part of data analysis that we hope this extra little bit of help will make things faster.

Scripting/Extensions

Overall, we’ve fixed lots of different bugs and have continued to improve the strength and breadth of our scripting language.  Because of some scripting tweaks, some of the extensions you previously downloaded may not appear in your menu as expected.  If this is the case, we’ve updated all the extensions with the new syntax, so just go ahead and re-install the extension and you’ll be ready to roll.

As always, if you have any recommendations or see any bugs, please let us know — it is extremely helpful to have this type of feedback.

The Dirty Data of Data Mining

October 28th, 2008

Today I came across a survey on data mining by a consulting firm called Rexer Analytics.  Their survey took into account 348 responses from data mining professionals around the world.  A few interesting tidbits:

* Dirty data, data access issues, and explaining data mining to others remain the top challenges faced by data miners.

* Data miners spend only 20% of their time on actual modeling.  More than a third of their time is spent accessing and preparing data.

* In selecting their analytic software, data miners place a high value on dependability, the ability to handle very large datasets, and quality output.

We’ve found these issues to hold true with our clients as well, particularly in various auditing industries.  Auditors will get a hold of their client’s data, maybe in some delimited text file.  The data set is inevitably too large for Excel to handle easily, so they may try Access (of course, once they are eternally frustrated, they give Strata a shot).

Once they can actually see that data set, they start exploring it to learn about what they’re looking at and then inevitably find out how dirty it is.  Multiple fields are mashed together or individual ones are stripped apart.  Company names appear multiple times in various forms (“I.B.M” vs. “IBM”).  An important bit of numeric information is embedded in a text field.  There is no end of time spent “purifying” the data set to make sure to avoid the “garbage in, garbage out” syndrome.
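
To give a flavor of what that purifying looks like in practice, here is a minimal Python sketch covering the three problems just mentioned. The function names and sample values are hypothetical, and real data usually needs rules tailored to the file at hand.

```python
import re


def canonical_company(name):
    """Collapse punctuation and case so 'I.B.M' and 'IBM' compare equal."""
    return re.sub(r"[^A-Za-z0-9]", "", name).upper()


def extract_number(text):
    """Pull the first number out of free text, or None if there isn't one."""
    match = re.search(r"-?\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group(0).replace(",", ""))


def split_city_state(value):
    """Split a mashed-together 'City, ST' field into two fields."""
    city, _, state = value.partition(",")
    return city.strip(), state.strip()


same_company = canonical_company("I.B.M") == canonical_company("IBM")
units = extract_number("approx. 1,250 units")
city, state = split_city_state("Chicago, IL")
```

Each helper is tiny on its own; the time goes into discovering which of them a given data set needs.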

Often overlooked, data cleansing is really as important as the analysis itself.  Only once this step is complete can you move on to your data mining or other data analysis.

Check out the survey summary yourself and let us know if it matches your experience.

About

Data and the Web is a blog by Kirix about accessing and working with data, wherever it is located. We have a particular fondness for data usability, ad hoc analysis, mashups, web APIs and, of course, playing around with our data browser.