Data and the Web

Amazon Gets into the Public Data Sets Game

Amazon announced the launch of its Public Data Sets service this evening.  Bottom line, they asked people for different public or non-proprietary data sets and they got ‘em.  Here’s a sample of the (pretty hefty) stuff they are hosting for free:

  • Annotated Human Genome Data provided by ENSEMBL
  • A 3D Version of the PubChem Library provided by Rajarshi Guha at Indiana University
  • Various US Census Databases provided by The US Census Bureau
  • Various Labor Statistics Databases provided by The Bureau of Labor Statistics

Though the individual sets are huge, there aren’t many of them at this point; it appears that Amazon will be filling this out over time.

How do you access them?  Well, there’s a slight hitch.  You need to fire up an EC2 instance, hook into the set and then perform your analysis.  You just pay for the cost of the EC2 service.  Given how massive these tables are, it seems like a pretty good way to go.  A step closer to the supercomputer in the cloud.
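To get a rough feel for what "just pay for the cost of the EC2 service" might mean in practice, here is a minimal back-of-the-envelope sketch. It assumes billing per started instance-hour, and the hourly rate is a made-up placeholder, not a quoted Amazon price; the function name `estimate_ec2_cost` is ours, purely for illustration.

```python
import math


def estimate_ec2_cost(hours_used: float, hourly_rate_usd: float) -> float:
    """Estimate the cost of an analysis session on one instance.

    Assumption: each started hour is billed as a full hour.
    """
    if hours_used < 0 or hourly_rate_usd < 0:
        raise ValueError("hours and rate must be non-negative")
    billed_hours = math.ceil(hours_used)  # round partial hours up
    return round(billed_hours * hourly_rate_usd, 2)


# Hypothetical rate for illustration only; check current pricing.
ASSUMED_HOURLY_RATE_USD = 0.10

if __name__ == "__main__":
    # A 2.5-hour session is billed as 3 hours under the assumption above.
    print(estimate_ec2_cost(2.5, ASSUMED_HOURLY_RATE_USD))
```

Plug in the real rate for whatever instance type you pick, since the data itself is hosted at no charge and the instance time is the main cost.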

We’re devoted users of Amazon S3 here and have also done some work with EC2, which is quite impressive.  Overall, this is another example of a nice trend where large data sets are becoming more easily accessible.

If anyone has the chance to play with this service, let us know how it goes.

3 Responses to “Amazon Gets into the Public Data Sets Game”

  1. Garrett McAulife says:

    We’ve played with their census data a bit. The real benefit is that it saves you a **LOT** of time from having to download it from Census.gov’s FTP site (i.e., instead of taking tens of hours to get the full census data sets, you can have the data up and running on a server in literally less than 5 minutes).

    That said, you still have to do all the number crunching, modifications, correlations, etc. yourself (using an EC2 instance). But, since you pay by the hour for an EC2 instance, and can have the data back in minutes — it’s possible to use the data for a few hours, pay less than a dollar for that time, then turn off your EC2 instance, go home, and have the data back up and running in a few minutes the next day. That’s pretty cool — and I believe it will spur some pretty cool innovation on Amazon’s cloud…

    We’re hosting a subset of the census data right now on our product (hosted on EC2/EBS/S3). We’ve taken the data and made it so that consumers can see it directly on a web site. We just posted it on Friday, and have a long way to go to make it pretty and usable, but if you’re interested in Census/Crime/Demographic type data, you can get a sneak peek here: http://www.infopogo.com

  2. Ken Kaczmarek says:

    Hi Garrett,

    Thanks for posting about your EC2 experience; sounds like it hits the nail on the head. Your InfoPogo site looks interesting. I’ll bookmark it and look forward to seeing how it progresses in the coming months.

  3. AWS Public Data Sets Continues to Expand | Data and the Web says:

    […] we posted some information on Amazon’s foray into making huge public data sets available to users of their web services.  Yesterday they announced the addition of some very sizable […]