42 Big Data Startups – Vote for the Top 10

Update: Voting for the top 10 Big Data startups has closed, and the roundup featuring the 10 finalists is now available on CIO.com.

Update #2: A more comprehensive and more recent list of Big Data startups is available in my Big Data 50 2014 report. Check it out!

The Big Data space is heating up, and unlike some over-hyped trends (cloud, I’m looking at you), it’s pretty easy to pinpoint the ROI with these tools.

When I put out calls for nominees through my Story Source Newsletter, HARO, Twitter, etc., for my upcoming CIO.com story, “10 Big Data Startups to Watch,” I received more than 100 recommendations. Usually, when I get that many recommendations, a good chunk of them can be dismissed out of hand. Some are clearly science projects; others have zero funding, no management pedigree and a dubious value proposition, and a few are clearly the products of fevered malarial hallucinations.

Not so this time. Very few of the startups left off this list of 42 nominees were whacky long shots. Most were decent ideas, but were left out because they were too old, too new, or, in a few cases, just not convincing.

I’m after those Goldilocks startups. Not old enough that everyone already knows who they are. Yet, they can’t be so new that they have zero customers or funding. In other words, they need to be viable but still somewhat under the radar.

Now comes the hard work. Paring this list down to 42 was hard enough, but picking 10 is traditionally a subjective, manual, error-prone process – which is why I’m asking you to vote. Select up to three of your favorite startups (although you can select only one if you choose). The voting will be weighted, so if you are voting for your own startup, please take the time to give some support to your colleagues. The weighting will ensure that you help yourself the most, even when supporting others.

There is one big wrinkle to the voting this time around: this list of 42 isn’t locked. If your startup competes directly with someone on this list, and you think your startup is superior, file a startup challenge. Now, just because you file a challenge doesn’t mean I’ll set one up. If your startup is too old, still in stealth mode, hasn’t raised VC money, has a green management team, etc., I may not consider it a worthwhile challenge.

If I do set up a challenge, the startup will have a chance to fight its way onto the list of nominees, and, perhaps, knock someone else off it in the process. Or maybe it’ll just convince me to expand the list.

I’m going to leave the voting open for a week or so. To get updates, and to learn when the CIO.com story runs, be sure to sign up for Startup50 Updates. As a bonus, you’ll get a copy of my Mobile 50 report (Edit: the report now available is the updated Big Data 50 report) when you sign up.

Voting closes Monday, June 3, 2013 at 5 PM PT. 

1. Alpine Data Labs

What they do: Provide data science solutions for Hadoop and Big Data.

What I like: While there are a ton of Big Data tools entering the market, many companies still struggle to gain actionable insight from these mountains of data.

Alpine Data set out to simplify machine learning methods and make them available on petabyte-scale datasets. Their tools make these methods available in a lightweight web application with a code-free, drag-and-drop interface. Alpine Data leverages the parallel processing power of Hadoop and MPP databases and implements data mining algorithms in MapReduce and SQL.

Alpine Data’s visual environment helps teams collaborate and quickly create and deploy analytics workflows and predictive models.

Why they might not make the cut: They’re a strong entrant. They’re in the process of closing a $10-13 million Series B funding round, which builds on a $7.5 million Series A. However, SAS dominates this market, and other startups are moving into this space too, including Platfora, Skytree, Revolution Analytics and Rapid-I.

2. Bright.com

What they do: Provide a job search/job recruiting site that relies on Big Data insights to match potential employees and employers.

What I like: Bright takes the abstract concept of Data Science/Big Data and applies it to a practical real life problem: employment. Bright claims to have solved the “spray-and-pray” problem of searching and applying for jobs online by creating a 1-to-1 signal between job seekers and open positions.

Why they might not make the cut: Bright is in a strong position in their niche. In terms of traffic, they are the number one Big-Data-built jobsite, and are fifth overall in terms of traffic. However, this is a tough space, and I could easily see the Monster.coms of the world rolling out a similar feature. Moreover, I was pitched on several similar services, including Cangrade.com and TalentBin.

3. Cloudant

What they do: They’re a Database-as-a-Service provider.

What I like: According to Cloudant, the problem with databases is that if an application is successful, organizations often outgrow them. This is commonly referred to as the “App Store Effect.” Even “scale-out” distributed databases and caches are limited by cluster hardware and partitioning schemes.

The Cloudant Database-as-a-Service (DBaaS) is a managed service purpose-built for data-driven Web and mobile application developers who want to handle Big Data workloads without ever having to deal with distributed database design, sharding, partitioning, backup, etc.

Why they might not make the cut: They’re a strong contender, but plenty of exciting startups will be left out of the top 10.

4. Cloudera

What they do: Provide a Hadoop-based Big Data platform.

What I like: Big Data is hot, and Cloudera is the pioneer that first developed a Hadoop-based platform for Big Data. Moreover, they’re sitting on a mountain of VC cash and have a solid management team.

Cloudera lets users query all of their structured and unstructured data and have a view beyond what’s available from relational databases. Cloudera recently released Impala, a new open-source interactive query engine for Hadoop that enables interactive querying on massive data sets in real time.

Why they might not make the list: Cloudera might end up being a victim of their success. Readers flock to “Startups to Watch” roundups to discover new startups, rather than to see which ones have already outgrown the “startup” label.

5. Cloudmeter

What they do: Develop tools to gather insights from Big Data streams generated from enterprise analytics apps and customer-facing apps.

What I like: According to Cloudmeter, there is a wealth of information that lies in network traffic, but traditionally that data has been hard to get at. Cloudmeter claims to capture and transform real-time network data into actionable information for IT and business users, delivering insights that companies can use to optimize end-user experiences. Cloudmeter helps companies push the right kind of data into their analytical solutions.

Why they might not make the cut: Should these products be standalone ones, or would they really best serve users as features of larger platforms?

6. CloudPhysics

What they do: Provide Big Data tools for analyzing data-center infrastructures.

What I like: According to CloudPhysics, virtualization and cloud management platforms wholly lack actionable information that administrators can use to design, configure, operate and troubleshoot their infrastructure systems. This results in waste and risk, which undermine the very reasons for virtualizing and enabling cloud approaches in the first place.

CloudPhysics’ tools combine big data science, simulation/modeling/analysis and virtualization resource management. The mashup helps admins discover and analyze problems that often go unnoticed by users of traditional systems management platforms.

Why they might not make the cut: They’re a little light on funding at the moment. On the other hand, they have an impressive customer roster, which includes Equinix and the Government of Denmark.

To stay on top of cool startups like CloudPhysics, be sure to sign up for the Startup50 Newsletter to receive a monthly summary of hot startups, as well as exclusive access to special content and new Top 50 reports.

7. Concurrent

What they do: Provide Big Data application platforms.

What I like: Concurrent’s goal is to make it easy for non-experts to adopt Hadoop for Big Data application development. When developing applications on Apache Hadoop, developers program in MapReduce or scripting query languages like Pig and Hive. However, there is a growing realization that MapReduce is difficult to use and doesn’t really scale terribly well. Meanwhile, Pig and Hive fall short of enabling anything more than simple ad-hoc analysis.

Concurrent’s Cascading platform is an alternative API to MapReduce for processing huge data sets on Hadoop. Cascading is an open source Java application framework that enables developers to build data-processing and data-management applications on Apache Hadoop.

Why they might not make the cut: They’re a solid entry, but when I start making cuts to get this roundup down to 10, plenty of worthwhile startups won’t make the cut.

8. Continuum Analytics

What they do: Provide Python-based data analytics tools.

What I like: Continuum’s goal is to help firms break down vast amounts of data and handle issues associated with effectively collaborating on analysis. Their first two products are: 1) Anaconda – an enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing; and 2) Wakari – a web-based Python and Linux environment for collaborative data analysis, exploration and visualization.

Why they might not make the cut: While they recently won a $3 million research fund from DARPA’s XDATA program, they aren’t as well-funded as many Big Data analytics challengers.

9. Datameer

What they do: Provide an analytics application that sits on top of Hadoop.

What I like: It’s no secret that Hadoop isn’t the easiest technology in the world to use. Hadoop has no end-user interface, so you have to know MapReduce to use it as a platform for big data analytics. In other words, using Hadoop for Big Data is typically an IT-intensive project.

Datameer removes the need to code a custom solution, enabling any business user to integrate, analyze and visualize their data.

Why they might not make the cut: Datameer is a strong contender, with $17.8 million in funding in the bank and such top-tier customers as Visa and Newegg. However, they’ll need to turn out the vote to get into the final 10.

10. DataSift

What they do: Provide a social data platform that helps organizations aggregate, filter and extract insights from the billions of public social conversations on Twitter, leading social networks and millions of other sources.

What I like: Delivered as a cloud platform, DataSift does the heavy lifting for companies that want to create social media monitoring, social CRM, business intelligence, financial trading and news monitoring applications.

DataSift provides access to both real-time and historical social data to uncover insights and trends that relate to brands, businesses, financial markets, news and public opinion. Moreover, they’re sitting on a healthy stack of VC funding and have a top-tier client roster.

Why they might not make the cut: There’s not much to complain about here, but DataSift will need to turn out the vote to ensure they make the top 10.

11. DataStax

What they do: Provide a Big Data application platform.

What I like: Traditional relational databases from Oracle and the like are incapable of supporting the emerging real-time line-of-business applications. Moreover, relational databases are far more expensive, less scalable and more vulnerable to going offline during disasters. DataStax enables companies to develop applications that can analyze data in real time, scale as usage increases, and avoid disaster by spanning multiple databases and the cloud.

The company also has a solid management team and a serious amount of VC funding.

DataStax performed well in the challenge, and a couple of customers emailed me directly to sing their praises, in great detail.

Why they might not make the final cut: The competition is really tight for this particular subsector of the Big Data market, with DataStax competing against 10Gen, NuoDB, Hortonworks, Cloudera and Pivotal. I’m sure there are others I left off that list. This space is a land grab for the time being, but it won’t stay that way forever.

12. Enigma

What they do: Provide “a platform for finding relevant data, discovering unknown data sources and uncovering the relationships across datasets.”

What I like: The key problem Enigma addresses is that developing data-rich portraits of companies, people and locations is limited by the messy and disconnected organization of public data sources. At best, identifying data relevant to a question remains a scattered process – and comprehensively searching and relating this data is currently impossible.

Enigma’s platform collects, curates and pinpoints relationships among data in public data sources. Enigma empowers users to find hidden facts and connections across a universe of disparate and siloed sources.

Why they might not make the cut: They’re a pretty strong entrant. I like the focus on public data, since so many Big Data plays are internal facing.

13. FusionOps

What they do: Provide BI/Big Data SaaS tools aimed at supply chain business users.

What I like: FusionOps argues that Big Data and Business Intelligence (BI) go hand-in-hand. However, Big Data has exposed a big problem in BI – there’s actually not that much intelligence in typical, one-size-fits-all BI solutions.

With a traditional BI approach, customers buy (or subscribe to) a tool that starts them out at zero, and then rely on management and IT to collaborate to develop the necessary KPIs. Sometimes the business doesn’t even know what the KPIs for the supply chain should be.

A new approach – domain-specific BI – is emerging, and FusionOps is an early entrant into this space, focusing on supply chains.

Why they might not make the cut: They were founded clear back in 2000, making them downright geriatric for a startup roundup. They did refocus and relaunch as a BI/Big Data company in 2005, but that’s still probably a stretch for this roundup.

14. Garantia Data

What they do: Offer enterprise-class Redis and Memcached hosting for developers.

What I like: Redis, an open source NoSQL database, is catching on with developers. Redis is doing well enough that it is gradually replacing Memcached as a caching engine for accelerating database performance.  However, Redis doesn’t meet the enterprise requirements of high availability and scalability.

Garantia Data’s enterprise-class Redis hosting platform is designed to overcome these obstacles.

Why they might not make the cut: There’s not much to complain about here, other than the fact that this is a fairly obscure sub-sector of the Big Data ecosystem. Many Big Data platforms are advancing to abstract much of the development complexity away. Then again, those same tools often lose some of the customization and flexibility of developer-focused platforms in the process.

15. Gravity

What they do: Develop data analytic tools to help “personalize” the Web.

What I like: You know all of that information about you that everyone from Google to Facebook to your bank gathers? Many of the use cases for that information are, if not downright nefarious, at the very least not really what consumers think they’ve signed off on.

However, that doesn’t mean this information can’t be put to good use. There is an overwhelming amount of information on the web, and Gravity’s goal is to help people sift through it. Then, Gravity presents what is most interesting and relevant in real-time. Based on Interest Graph, Gravity’s core IP, the platform semantically understands each user’s individual interests, calculates the strength of those attachments over time and returns recommendations designed to optimize engagement and user experience.

Why they might not make the cut: Personalization is a hot topic these days. Competitors include behemoths like Google, as well as such companies as Outbrain, Prismatic and Vu Digital.

16. Hortonworks

What they do: Hortonworks’ mission is to revolutionize and commoditize the storage and processing of big data via open source tools.

What I like: Hadoop is emerging as a next-generation data platform. It enables companies to capture and store massive amounts of data, and it also helps organizations integrate new forms of data that have never been captured before.

However, Hadoop is still a complicated beast. Hortonworks helps organizations access Hadoop data from traditional data sources (RDBMS, OLTP, OLAP), data systems (EDW, MPP, RDBMS), and business applications (analytics/BI, customer applications, enterprise applications). Hortonworks has strategic partnerships in place with Microsoft, Rackspace, Red Hat, Teradata and dozens of other companies.

Why they might not make the cut: Competitors Cloudera and MapR are also on this list. I can pretty much guarantee that not all of them will make the cut.

17. Icimo

What they do: Develop tools that connect to an organization’s data, presenting that data visually so it can be easily analyzed and interpreted.

What I like: While everyone and their brother tries to take advantage of Big Data, one thing that is often overlooked is ease of use. Many Big Data platforms are cumbersome, expensive beasts.

Icimo believes the best way to simplify things is to think not about Big Data, but about bridging the gap between data and decisions. Icimo offers a toolset built on top of Tableau, which makes it easy to connect to data in many different places. Icimo then serves up that information in a single dashboard, from which the user can take actions and improve decision making.

Why they might not make the cut: They have yet to raise any funding, and it could be easy to dismiss something like this as little more than a better dashboard.

18. LucidWorks

What they do: Provide enterprise search tools to help navigate Big Data.

What I like: IT organizations are beginning to collect orders of magnitude more data. Collecting data is one thing; however, making actual use of it is another. Enterprise search clearly has a role to play in terms of making Big Data accessible. The challenge is doing it in a way that other applications can utilize.

The LucidWorks Search product is designed to help developers build highly secure, scalable and cost-effective search applications, while providing a simple and comprehensive way to access open source search technologies.

LucidWorks Big Data is an application development platform that integrates search capabilities into the foundational layer of Big Data implementations.

Why they might not make the cut: They’re a strong competitor, but will have to turn out the votes to fight their way into the final 10.

19. MapR Technologies

What they do: Provide a Hadoop/NoSQL Big Data platform.

What I like: Both Hadoop and anything to do with Big Data are getting a lot of attention these days. Big names such as Yahoo and Facebook have built applications on top of Hadoop. Meanwhile, Big Data promises to transform organizations as data analysts turn up all sorts of insights that were previously opaque.

MapR claims that it is able to merge Hadoop, NoSQL, database and streaming applications in one unified Big Data platform. Speed has been an issue with Hadoop, but MapR claims to have cleared this hurdle, while also offering such enterprise-grade features as “high availability, business continuity, real-time streaming, standard file-based access through NFS, full database access through ODBC, and support for mission-critical SLAs.”

Why they might not make the cut: Not much to complain about here. They’re already making a name for themselves in the Hadoop/Big Data world, and they also have solid named customers, including Ancestry.com, the Rubicon Project, comScore and NextBio.

20. MemSQL

What they do: Develop real-time data analytics platforms.

What I like: Inside organizations today, data is available from hundreds of different sources in massive quantities. For the most part, companies store this data in existing relational database management systems that were not designed to capture, query, and store Big Data in real-time.

Since Big Data is only useful if it’s accessible in a timely manner, organizations are struggling to justify the sometimes high costs associated with it.

MemSQL claims to solve this problem with a distributed in-memory database that processes Big Data in real time. Using in-memory database technology and SQL-to-C++ conversion, the company says it can process hundreds of terabytes in milliseconds.

Why they might not make the cut: MemSQL is a strong entrant, with a decent Series A round under their belt and a list of top-tier customers.

21. Metamarkets

What they do: Provide a data analysis platform for the real-time bidding (RTB) ad-buying market.

What I like: The RTB online advertising market (also known as programmatic buying), which includes everything from video to mobile to display, now generates petabytes of live streaming data each day.

To meet the requirements of Metamarkets’ online advertising customers, which have data volumes often upwards of hundreds of billions of events per month and need highly interactive queries on live data streams, Metamarkets decided to build Druid, its own distributed real-time analytics database designed to provide insight on large quantities of streaming data.

While Metamarkets is focused on the online advertising space, Druid was recently open-sourced (in October 2012) to help foster analytics for live streaming data. Druid is now used across many industries to analyze large quantities of data with both speed and efficiency.

Why they might not make the cut: Not much to gripe about here. They’ve raised a good stack of VC money, have impressive customers and are even giving back to the community by open-sourcing Druid. Now, they’ll need to attract votes.

22. Mortar Data

What they do: Deliver the Hadoop platform as a service for building Big Data pipelines.

What I like: Mortar Data bills itself as a company that can deliver “Hadoop in an hour.” Considering that the complexity of Hadoop can scare plenty of potential users away, this is solid positioning – assuming they live up to that promise. Mortar is also focused exclusively on engineers and data scientists, rather than analysts.

Mortar Data’s service is designed to facilitate team collaboration, allowing users to easily share, repeat, and maintain their code. Data scientists and engineers using Mortar get full code history, full execution history, automated testing, and one-button deployment.

Why they might not make the cut: They’ll face stiff competition from the likes of AWS (Elastic MapReduce), Infochimps, Qubole, Continuuity, and Treasure Data. As of now, their funding ($1.8 million) looks a little light for a fight of this magnitude.

23. NGDATA

What they do: Provide a Big Data management platform.

What I like: Consumer-focused companies, such as banks and retailers, have actually collected more data about their consumers than companies like Google and Amazon. However, in spite of possessing so much data, these companies still don’t know their consumers well.

There are hundreds of internal and external data sources, and each represents a small silo of information about the consumer.

NGDATA delivers a combination of interactive Big Data management, machine-learning technologies and consumer intelligence in a single platform. The company’s consumer intelligence solution, Lily, enables storing, indexing and analyzing massive data sets and provides a “360-degree view of the consumer.”

Why they might not make the cut: This is a tough space. NGDATA will compete with the likes of Cognos, Continuuity, Platfora, SAS and WibiData, to name only a few.

24. ParStream

What they do: Develop database technologies to enable “Fast Data.”

What I like: Traditional databases just weren’t designed for Big-Data-scale analytics, and they certainly aren’t able to deliver those insights in real time. Traditional databases analyze data sequentially and aren’t able to take advantage of advances in multi-core processing.

When I spoke with CEO Michael Hummel at CTIA 2013, he noted that memory is a big bottleneck for traditional databases. Meanwhile, the Big Data database darling, Hadoop, has trouble scaling efficiently.

ParStream enables “Fast Data” by using a distributed architecture that processes data in parallel. ParStream was specifically engineered to deliver both big data and fast data, enabled by a unique High Performance Compressed Index (HPCI). This removes the extra step and time required for decompression of data.

Why they might not make the cut: They’re a strong entry. However, having just moved their founding team to the U.S., they will have to execute quickly to keep up with numerous startups in this space.

25. Platfora

What they do: Develop software that transforms raw data in Hadoop into interactive, in-memory business intelligence.

What I like: Platfora is focused on the main challenge of Big Data: namely, how to make sense of it.

While businesses have been rapidly adopting Apache Hadoop as a scalable and inexpensive solution to store near-infinite amounts of data, they struggle to extract value from that data. Traditional relational database and analytics tools just can’t deal with massive amounts of structured and unstructured data. So, businesses must perform a complex and rigid set of steps between the customer interactions that generate data and analyzing that data with business intelligence (BI) software.

Platfora tries to simplify that process and automatically transform raw data in Hadoop into interactive, in-memory business intelligence, with no ETL or data warehousing required. Platfora provides an exploratory BI and analytics platform designed for business analysts and not just IT.

Why they might not make the cut: They’re a solid contender, but plenty of those won’t make the final 10.

26. RainStor

What they do: Provide database solutions that offer high levels of data compression.

What I like: Businesses today struggle to cope with an unprecedented explosion of data, which according to various sources (such as McKinsey Global Institute, IDC and Wikibon) is increasing at a rate of 40-60 percent per year. As more mobile and post-PC devices enter networks, and don’t forget to add machine-to-machine communications into the mix, keeping up with Big Data will be a massive struggle.

RainStor’s database has been purposely designed from the ground-up to cost-effectively manage Big Data and enable fast queries and analytics. As a result, RainStor claims that it changes the economics of managing Big Data and yields a 90 percent cost savings on data storage.

Why they might not make the cut: They’re probably a bit long in the tooth for this roundup, having been founded in 2004.

27. Rocket Fuel

What they do: Provide a programmatic buying platform that uses artificial intelligence to interpret the petabytes of data available to advertisers and “learn” in real-time which ads work to drive engagement and sales.

What I like: John Wanamaker, the founder of modern advertising, once said, “Half the money I spend on advertising is wasted; the trouble is I don’t know which half.” Despite all of our advanced marketing analytic tools, this quote still rings true.

According to Rocket Fuel, advertisers waste $250B a year on digital ads that don’t reach their intended audiences. With programmatic buying, identifying the best location and time to place an ad is no longer guess work; however, the task of analyzing and interpreting data in real time is cumbersome, prone to error and slow.

Rocket Fuel’s programmatic buying platform uses artificial intelligence to interpret the petabytes of data available to advertisers and “learn” which ads work to drive engagement and sales.

Why they might not make the cut: They could be too successful for their own good – in terms of this roundup. Having raised $76.6 million in total funding and having a client roster that features Allstate, BMW, DISH and others means they’re bordering on incumbent status, despite being only about five years old.

28. ScaleArc

What they do: Provide database infrastructure software that simplifies the way database environments are deployed and managed.

What I like: ScaleArc’s flagship product, iDB, is software that inserts transparently between applications and databases, requiring no modifications to applications or databases. ScaleArc claims that it can be deployed in about 15 minutes. Then, users gain visibility into all database traffic with granular real-time SQL analytics.

Why they might not make the cut: I can’t really complain about much with ScaleArc. They’ve raised a significant amount of VC funding, have top-tier customers and are targeting a very real pain point. Now, they’ll need to attract votes.

29. SiSense

What they do: Provide Big Data analytics platforms.

What I like: According to SiSense, traditional big data analytics solutions are like battleships: They’re expensive, complicated to operate, and are actually overkill for most businesses, which just don’t need that much processing. The typical business does not need to analyze petabytes of data. Rather, they’d be happy gaining insights on terabytes of data, but that’s either too expensive or forces them to rely on in-memory solutions, which cannot later scale to handle massive amounts of data.

SiSense Prism is built to offer big data analytics technology to businesses of all sizes. With no coding or scripting required, business analysts can analyze data themselves, without having to draw IT or data scientists into the process. SiSense claims that Prism allows non-technical users to analyze 100 times more data than current in-memory analytics solutions, and it does so 10 times faster. There’s no need to set up complex data warehouse systems or OLAP cubes.

Why they might not make the cut: SiSense was founded way back in 2004, a lifetime for many startups. However, they spent six solid years on development, and then took a couple more years for testing before officially launching in 2012.

30. Skyhigh Networks

What they do: Provide a Hadoop-based Big Data platform to perform statistical and behavioral anomaly detection for traffic flows to the cloud.

What I like: Skyhigh Networks is focused on a major, but often ignored, cloud problem: shadow IT. It’s ridiculously easy for some non-tech-savvy person in an organization to sign up for a cloud service on a credit card, or even expense it under some nebulous “other” category.

Of course, this introduces all sorts of security and compliance risks.

Skyhigh’s service puts control back in IT’s hands. Skyhigh helps customers safely realize positive cloud returns by finding and reducing risk, cost, and missed opportunities resulting from unmanaged “Shadow IT” cloud adoption.

Why they might not make the cut: Skyhigh is a strong contender, especially with Cisco and Equinix on their named customer roster. If they don’t make the cut, chances are their supporters didn’t support them enough in voting.

31. Skytree

What they do: Develop machine-learning-based platforms for Big Data analytics.

What I like: According to Skytree, advanced analytics, contrary to popular belief, “is not a meat grinder into which you can dump data in one end and expect nuggets of wisdom to come out of the other end.”

Skytree has created a general-purpose platform that lets data scientists focus on what matters most, which Skytree says is Mean Time to Insights (MTI), and on what they are good at: building and deploying analytic models rather than coding algorithms. Skytree is delivered as an application within a data center that can be used by many, as opposed to an individual application used on a single PC.

Why they might not make the cut: They’re a strong contender with a good chunk of VC funding and a roster of named customers. However, whittling this list down to 10 will mean leaving out a number of strong contenders.

32. SnapLogic

What they do: Provide a cloud-based enterprise data integration platform.

What I like: Serving up data from cloud, SaaS and even on-premise applications to Big Data sources isn’t easy. The traditional approach to integration – lots of hand-coding and specialized teams of consultants and engineers – cannot keep up with the reams of data being created in social, cloud and SaaS applications, nor can it even keep up with integration demands from legacy applications that were data silos.

With the SnapLogic platform, companies can connect applications across the enterprise through a standardized engine. IT managers can build projects with full interface support from the native SnapLogic library and SnapStore components. Its modular architecture delivers data integration for the cloud or for on-premise applications. Its cloud integration solution and “containerized” snaps (application-specific connectors) from its SnapStore connect any number of data sources and cloud applications.

Why they might not make the cut: The data integration sector is pretty full already. It includes incumbents like Informatica, Microsoft, IBM, Oracle and SAP, as well as such startups as Talend and Pentaho.

33. SolidFire

What they do: Provide SSD cloud storage solutions suited for Big Data applications.

What I like: Today’s cloud storage infrastructure has a tough time handling Big-Data-scale requests. To circumvent this issue, IT has created shared infrastructures where one application can temporarily borrow resources from other applications.

However, this approach simply pushes problems elsewhere. For instance, a noisy neighbor can steal resources, causing all the others in the infrastructure neighborhood to experience variable, unpredictable performance. Issues like this limit the enterprise’s use of the cloud for mission-critical applications, and they also present a bottleneck to Big Data queries.

The SolidFire storage architecture was built to solve the issue of unpredictable performance in large-scale multi-tenant clouds, which helps enterprises move mission-critical applications to the cloud and take advantage of Big Data.

Why they might not make the cut: Cloud and big data storage plays are abundant. Competition will come from incumbents like EMC, as well as from startups such as Amplidata, NexGen, Pure Storage and others.

34. Sqrrl

What they do: Provide a Big Data platform that powers real-time applications.

What I like: Large organizations struggle to build mission-critical real-time applications on top of Big Data because of security concerns around the data and because Big Data application development tools are complex and still immature.

Sqrrl’s Big Data platform includes “cell-level security” capabilities. Sqrrl tags every piece of data with a security label that dictates who can access that data at the application layer. The company has also built SQL, graph, full-text search, and statistical tools on top of its platform to simplify application development.
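To make the cell-level idea concrete, here is a minimal Python sketch of label-based filtering. It is an illustration only, not Sqrrl’s actual API: the `Cell` class, the `visible_cells` function, the sample data, and the rule that a reader must hold every authorization in a cell’s label are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One stored value plus its security label (hypothetical structure)."""
    row: str
    column: str
    value: str
    # Assumption: a reader must hold every authorization listed here.
    label: frozenset


def visible_cells(cells, user_auths):
    """Return only the cells whose label is fully covered by the user's authorizations."""
    auths = frozenset(user_auths)
    return [cell for cell in cells if cell.label <= auths]


# Made-up sample data: three cells in the same row, each labeled differently.
CELLS = [
    Cell("patient-17", "name", "A. Example", frozenset()),
    Cell("patient-17", "diagnosis", "flu", frozenset({"medical"})),
    Cell("patient-17", "billing", "$120", frozenset({"finance", "audit"})),
]

if __name__ == "__main__":
    for cell in visible_cells(CELLS, {"medical"}):
        print(cell.column, cell.value)
```

In this sketch the check happens on every read, so two users querying the same row can get different columns back without the application keeping separate copies of the data.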

Why they might not make the cut: Sqrrl is less than a year old, only has seed funding and has no named customers yet. However, with a strong management team, that could all change quickly.

35. Statwing

What they do: Develop tools that make it easy for anyone to use the same statistical analysis tools that data scientists and statisticians use.

What I like: Currently, statistical analysis software, such as IBM SPSS, is so hard to use that non-experts quickly give up and revert to Excel. Excel is good for organizing data, but it’s fairly clunky and time-consuming to use for good data analysis (it can take minutes to make a simple chart or examine a single hypothesis in the data), and it almost entirely misses out on the power of statistical analysis.

Statwing automates routine data analysis decisions, and translates the results of statistical tests into clear visualizations and plain English that anyone can understand.

Why they might not make the cut: They face tough competition from the likes of IBM, Tableau and Cognos.

36. SumAll

What they do: SumAll’s product is an analytics tool that helps businesses make more money by using their own data. SumAll tries to break down various data silos, from those associated with legacy apps to those involved with social media.

What I like: I’m a big Big Data advocate, and I like how SumAll removes the pain of tethering a million and one APIs together to get true visibility into all of the actionable data businesses either already own or have access to.

Why they might not make the cut: This was pitched to me as a social media play, but that’s a stretch. Sure, they have hooks into social media sites and tools, but this is really a Big Data play. The social media positioning could work in their favor in a roundup like this.

37. VoltDB

What they do: Provide an in-memory database management system.

What I like: VoltDB is a “NewSQL” database solution designed to deliver both the ACID compliance of a relational database and the scalability of NoSQL. It is purpose-built to support any application – financial, gaming, retail or otherwise – with high requirements for ingesting large quantities of information at a high rate (database throughput requirements reaching millions of operations per second).

It provides visibility into real-time data and enables the application to make automated decisions based upon the data. It does this by employing a closed loop process where analytics are fed back into the decision-making process, allowing data to inform the front end of the application that is acting on new data as it arrives, thus maximizing business value.

Why they might not make the cut: With most companies just figuring out how to access and analyze Big Data in the first place, there’s no guarantee that the speed message will resonate. However, the “Fast Data” subsegment is slowly growing. VoltDB will compete with the likes of ParStream, NuoDB and Akiban.

38. Vyopta

What they do: Provide a SaaS-based platform that ties video collaboration to business workflows and processes, so companies can gather business intelligence from video collaboration.

What I like: Big Data is all the rage these days. Why not try to pull video collaboration into the mix?

Why they might not make the cut: They’ve been around since 2007, but have only raised $700K in funding and don’t have on-the-record customers lined up yet. That said, they do have a partnership with Cisco that should help.

39. WibiData

What they do: Develop tools that help companies build Big Data applications.

What I like: Enterprises across industries are collecting data from their users in back-end systems, yet there are few applications that allow companies to utilize this data in their front-end systems in real time to create better user experiences.

WibiData closes this gap by building and delivering Big Data applications. Using WibiData’s Big Data applications, organizations are able to store data for each of their end users across a variety of application channels and dynamically apply predictive models to that data, adapting the application with each click. Organizations are then able to add new data sources and experiment with new models to create personalized experiences without having to dedicate vast engineering resources.

Why they might not make the cut: They’re a strong contender, having raised seed funding from Eric Schmidt and with a series A from top-tier VCs. [Edit: They just closed a Series B a few days ago, raising $15 million in a round led by Canaan Partners with participation from existing investors, including NEA and Google Chairman Eric Schmidt.] Moreover, CEO Christophe Bisciglia was on Cloudera’s founding team. However, they’ll need to turn out the vote to crack the top 10.

40. Xplenty

What they do: Provide Hadoop as a Service for Big Data analytics.

What I like: Hadoop is being hyped to the moon these days, but development, implementation, and maintenance of Hadoop require a very specific and arcane skill set. Xplenty’s goal is to eliminate your need to learn any of that.

Xplenty provides a data integration platform that processes Big Data. A drag-and-drop interface eliminates the need to write complex scripts or code of any kind.

Xplenty is cloud-based, so there is no installation of anything on an end user’s servers, and there is no software to download onto workstations. With automated server configuration, users simply point to a data source, configure the data transformation tasks and tell the platform where to write the results. Xplenty’s platform uses SQL terminology, so for data analysts, the learning curve should be minimal.

Why they might not make the cut: This is a new entrant into this space, only backed by seed funding and with no on-the-record customers yet.

41. Zettaset

What they do: Provide Big Data management tools.

What I like: Various industry estimates calculate that as much as 80 percent of all data that exists is unstructured. Hadoop has become an increasingly popular option to process, organize, and store huge volumes of semi-structured and unstructured data, making that data available for data mining and business analytics.

Hadoop and similar NoSQL data stores enable any organization, large or small, to collect, manage and analyze immense data sets. However, these nascent technologies were not designed with comprehensive security in mind.

Zettaset Orchestrator is intended to solve this problem. Orchestrator is not a Hadoop distribution, but is a software application that sits on top of any Hadoop distribution and functions as an independent management layer. Orchestrator provides access control, policy management and compliance support.

Why they might not make the cut: A lack of on-the-record customers is always a concern.

42. Zoomdata

What they do: Help companies visualize business insights with a mobile platform that crunches and merges data streams from internal and external sources to create real-time, interactive visualizations.

What I like: Data administrators are able to quickly connect the Zoomdata stream-processing engine to a wide variety of real-time data feeds and historical data sources. Once connected, a real-time data feed is instantly available, allowing users to visualize and analyze trends within their most critical data, whether it arises from social networks, IT operations, trading data, transactional systems, or virtually any enterprise or cloud-based application.

Zoomdata works in conjunction with databases and BI tools to make the results easy to understand, easy to share and simple to dig through and analyze.

Why they might not make the cut: Whenever a startup tells me, “[we’re] a complement to existing tools, not a competitor,” my feeling is that they need to study the market more diligently, since that’s almost never true.

Update (5/30/13): The first challenger has broken through, forcing me to add them to the list:

43. Pivotal

What they do: Develop PaaS and Big Data tools.

What I like: Many established enterprises are being left out of the cloud, social and mobile app gold rush. Developing new capabilities that work alongside entrenched technologies and existing data platforms can be stunningly difficult.

Pivotal’s goal is to deliver “a next generation Enterprise Platform-as-a-Service that makes it possible, for the first time, for the employees of the enterprise to rapidly create consumer-grade applications. To create powerful experiences that serve a consumer in the context of who they are, where they are, and what they are doing in the moment. To store, manage and deliver value from fast, massive data sets. To build, deploy and scale at an unprecedented pace.”

Okay, that sounds like a lot of vendor hokum to me, but Pivotal has an impressive management team, led by Paul Maritz, who previously served as Chief Strategist of EMC and CEO of VMware. They’ve raised a tractor-trailer’s worth of VC money, with GE, EMC and VMware as the investors, and they already have solid beta customers, such as GE and NYSE Euronext.

Moreover, when you drill down beneath the spin, there is a real core there. Pivotal claims it will not just address the need to store and process analytics on a grand scale, but is aiming to build a complete platform that will integrate large-scale analytics (“Big Data”), real-time processing of information arriving from multiple sources (“Fast Data”) and the development of the applications that take advantage of these data capabilities. The result will be the creation of new experiences and business models, all done in a cloud-portable manner.

Why they might not make the cut: Pivotal is probably jumping the gun here. Their consumer-grade product hit the market a couple of months ago, but the enterprise products won’t be available until Q4. Still, it’s hard to ignore their management team, the loftiness of their goals and their behemoth backers.

To stay on top of cool startups like Pivotal, be sure to sign up for the Startup50 Newsletter to receive a monthly summary of hot startups, as well as exclusive access to special content and new Top 50 reports.