Fivetran Brings Data Lake Interoperability to Google Cloud

Alongside data lake support for Microsoft Fabric, data integration vendor Fivetran expanded its Managed Data Lake Service to support Google Cloud Storage (GCS), following previous launches on AWS and Azure. The Fivetran Managed Data Lake Service, which the vendor launched last year, automatically converts data into open table formats, specifically Apache Iceberg and Delta Lake, and facilitates interoperability with popular query engines and metadata catalogs.
In announcing the new service at Google Cloud Next in Las Vegas, Fivetran said it has around 4,000 joint customers with Google and is already onboarding GCS customers.
Anjan Kundavaram, chief product officer at Fivetran, said in an interview with The New Stack that Fivetran has native integration with Google’s BigQuery metastore. This ensures that data in GCS is automatically cataloged in BigQuery’s metastore, improving governance and interoperability across Google’s data ecosystem. “Customers who are used to Google BigQuery really can’t tell the difference between a BigQuery interaction and an Iceberg query running on Google’s Cloud Storage with the Fivetran Managed Data Lake Service,” he said.
What Is a Data Lake?
Unlike a data warehouse, which stores data in an ACID-compliant system (i.e., one that has atomicity, consistency, isolation and durability), a traditional data lake is a system or repository of data stored in a raw format, usually as object blobs or files. The goal is to have a single store of data, including raw copies of source system data, sensor data and social data.
The term “data lake” was coined in 2010 by James Dixon, then chief technology officer at Pentaho. Dixon wrote that he wanted a term distinct from “data mart,” which is a smaller repository of interesting attributes derived from raw data.
To add to the terminology confusion, the term “data lakehouse” is often used somewhat interchangeably with “data lake.” Strictly speaking, a data lakehouse is a hybrid approach; like a data lake, it can ingest a wide variety of raw data formats, but it also supports ACID transactions like a data warehouse does. However, a modern data lake leverages open table formats, which store data in an ACID-compliant manner, bringing data warehouse-like functionality to the lake.
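The mechanism behind that ACID guarantee can be sketched in miniature. Open table formats such as Apache Iceberg treat data files in object storage as immutable; a write adds new files and then atomically swaps a single pointer to a new snapshot, so readers always see a consistent table version. The toy class below illustrates the idea only; it is not Iceberg's actual metadata layout.

```python
# Toy model of how an open table format layers ACID semantics over
# immutable object-store files: writers never modify existing data files,
# they add new ones and atomically publish a new snapshot pointer.
class ToyTable:
    def __init__(self):
        self.files = {}       # "object store": path -> rows, immutable once written
        self.snapshots = []   # each snapshot records the file list it includes
        self.current = -1     # the single atomic pointer to the live snapshot

    def commit(self, path, rows):
        """Write a new data file, then publish a new snapshot atomically."""
        self.files[path] = rows
        prev = self.snapshots[self.current]["files"] if self.current >= 0 else []
        self.snapshots.append({"files": prev + [path]})
        self.current = len(self.snapshots) - 1  # the only mutable step

    def scan(self, snapshot=None):
        """Read a consistent view; a published snapshot never changes."""
        idx = self.current if snapshot is None else snapshot
        if idx < 0:
            return []
        return [row for p in self.snapshots[idx]["files"] for row in self.files[p]]


table = ToyTable()
table.commit("data-0.parquet", [{"id": 1}])
table.commit("data-1.parquet", [{"id": 2}])
print(len(table.scan()))   # 2 rows in the latest snapshot
print(len(table.scan(0)))  # "time travel": the first snapshot still has 1
```

Because old snapshots are never rewritten, the same design also yields time travel and safe concurrent reads, which is a large part of why warehouses and query engines can all sit atop the same lake.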
Data lakes can be tricky to manage, especially when not actively maintained, and consequently are sometimes derisively called “data swamps.” In a 2014 report from PwC, Sean Martin, CTO of Cambridge Semantics, said, “We see customers creating big data graveyards, dumping everything into the Hadoop Distributed File System and hoping to do something with it down the road. But then they just lose track of what’s there. The main challenge is not creating a data lake but taking advantage of the opportunities it presents.”
How GenAI Is Boosting Data Lakes
This perhaps explains why data lakes seemed to fall briefly out of favor. However, Kundavaram suggested that generative AI (GenAI) has been a catalyst for a new wave of data lake-based initiatives. This, he said, is because “for agents or RAG [retrieval-augmented generation], you really want all your data, structured and unstructured, in one place.”
Fivetran has a partnership with OpenAI, the company that has — for better or worse — become the poster child for the tidal wave of hype around GenAI. “OpenAI has the same data pipeline problem that everyone has, though probably at a larger scale,” Kundavaram said. “We’ve been close partners with them, supporting their use case and innovating alongside [them].”
Along with its ability to handle both structured and unstructured data from multiple sources, Kundavaram offered two additional reasons a data lake is the best approach for GenAI projects: future-proofing and cost. “It’s built on open standards, and if you want to use any number of querying tools like Google, Snowflake or Databricks, you can,” he said. “It is also very cost-effective, since you don’t need to make copies of data, and customers experience significant savings on ingestion costs.”
More generally, Fivetran said that companies including Disney, Sonos, Workday and PwC are turning to managed data lakes as they look to centralize high volumes of structured and unstructured data for AI workloads.
Given the renewed interest in data lakes, I was curious why Fivetran hasn’t launched a data lake product before now. Building a new product inevitably takes time and considerable engineering investment, of course, but Kundavaram said that the open table formats — particularly Apache Iceberg — also needed time to become sufficiently well-developed. “It’s matured quite a bit in the last couple of years,” he said.
Landscape, Pricing and Outlook
Data integration is a highly competitive space. Among dozens of vendors, major players include Microsoft with Azure Data Factory, SQL Server Integration Services and Power Query for data integration, and Microsoft Fabric as its main data platform; Informatica has its Intelligent Data Management Cloud; and Oracle has Oracle Cloud Infrastructure, Oracle GoldenGate and Oracle Data Integrator.
To win customers, Fivetran needs an edge. A core strength is its 700+ connector ecosystem, and it continues to invest heavily here, adding about 60 to 70 new connectors per quarter, Kundavaram said. The vendor’s Powered by Fivetran program enables its customers to embed Fivetran connectors into their own applications, and a Connector SDK enables partners to create custom connectors as needed. By leveraging this connector ecosystem, enterprises can centralize large volumes of data in Google Cloud Storage, creating a foundation for training custom large language models (LLMs).
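The general shape of such a custom connector can be sketched as two concerns: declaring a schema, and incrementally syncing rows newer than a saved cursor. The function names and state format below are hypothetical, purely to illustrate the pattern; they are not Fivetran Connector SDK API.

```python
# Illustrative sketch of a custom source connector's two core pieces
# (hypothetical interface, not Fivetran's actual Connector SDK API).

def schema():
    """Declare the tables and primary keys the connector delivers."""
    return [{"table": "orders", "primary_key": ["order_id"]}]


def update(state):
    """Incremental sync: emit only rows newer than the saved cursor,
    then return a new checkpoint so the next sync resumes from there."""
    cursor = state.get("cursor", 0)
    source_rows = [  # stand-in for a paginated API or database read
        {"order_id": 1, "updated_at": 1},
        {"order_id": 2, "updated_at": 2},
    ]
    new_rows = [r for r in source_rows if r["updated_at"] > cursor]
    new_state = {"cursor": max((r["updated_at"] for r in new_rows), default=cursor)}
    return new_rows, new_state


rows, state = update({"cursor": 1})
print(rows)   # only the row updated after the cursor
print(state)  # checkpoint persisted for the next sync
```

The cursor-plus-checkpoint pattern is what lets a connector re-run safely: a failed sync simply restarts from the last committed state instead of reloading the source.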
Fivetran includes a number of data governance capabilities, such as role-based access control (RBAC), data encryption, and column blocking and hashing. In addition, its Hybrid Deployment model can be used to keep the data plane and all pipelines within the customer’s own secure network.
“We have a lot of customers with sensitive data who run our product using Hybrid Deployment,” Kundavaram said. “This ensures that only functional metadata gets shared back to our control plane, while no data leaves their environment.”
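The column hashing mentioned above is conceptually simple: sensitive values are replaced with a one-way hash before rows land in the destination, so the raw values never leave the source environment while equality joins on the column still work. The sketch below is illustrative only; Fivetran’s built-in feature is configured per column rather than hand-coded.

```python
import hashlib

# Illustrative sketch of column hashing for data governance: replace a
# sensitive column's values with a one-way SHA-256 hash before loading.
def hash_column(rows, column):
    out = []
    for row in rows:
        masked = dict(row)  # copy so the source rows are left untouched
        masked[column] = hashlib.sha256(str(row[column]).encode()).hexdigest()
        out.append(masked)
    return out


rows = [{"id": 1, "email": "a@example.com"},
        {"id": 2, "email": "a@example.com"}]
masked = hash_column(rows, "email")
# The hash is deterministic: identical inputs map to identical digests,
# so analysts can still group or join on the masked column.
```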
Compared with its larger competitors, Fivetran takes a different approach to data transformation. The vendor offers a simpler set of around 55 dbt Core-compatible Quickstart data models for its most popular connectors, including Marketo, Mixpanel, Salesforce and SAP. Around 40% of its customers use these when setting up the source integration, Kundavaram said, and land “transformed, analytics-ready tables in the destination.” Alternatively, customers can build their own dbt models, which Fivetran can schedule and manage.
Fivetran is venture-funded; its most recent funding round, a $565 million Series D in 2021, valued the company at $5.6 billion. In September 2024, Fivetran announced it had surpassed $300 million in annual recurring revenue, up from $200 million in 2023, although as a privately held company its figures are not subject to public-company audit requirements.
Historically, small and midsize businesses (SMBs) have been Fivetran’s focus but, aided by its acquisition of HVR in 2021 alongside its Series D funding round, the vendor has expanded its reach beyond the midmarket segment. Pfizer, for example, uses Fivetran “to support scalable analytics platforms and enable real-time analytics, which is particularly crucial in areas such as clinical trials and supply chain operations,” according to a Fivetran case study.
From a pricing perspective, Fivetran uses a tiered, consumption-based model keyed to monthly active rows (MAR) processed. This lets SMB customers start projects without securing significant upfront capital expenditure, while larger enterprises can better manage costs as volumes scale.
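The behavior of such a tiered consumption model can be sketched numerically. The tier boundaries and per-row rates below are entirely hypothetical, chosen only to show how the effective per-row cost declines as volume grows; they are not Fivetran’s actual prices.

```python
# Hedged sketch of tiered, consumption-based pricing on monthly active
# rows (MAR). All tier sizes and rates are hypothetical illustrations.
TIERS = [                    # (rows covered by this tier, price per row)
    (1_000_000, 0.0005),     # first 1M rows at the highest rate
    (9_000_000, 0.0002),     # next 9M rows at a discounted rate
    (float("inf"), 0.0001),  # everything beyond at the lowest rate
]

def monthly_cost(active_rows):
    """Bill each tier's rows at that tier's rate, like tax brackets."""
    cost, remaining = 0.0, active_rows
    for size, rate in TIERS:
        used = min(remaining, size)
        cost += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return cost


print(monthly_cost(500_000))     # → 250.0: a small workload, top rate only
print(monthly_cost(20_000_000))  # → 3300.0: larger volumes blend cheaper tiers
```

Because rows are billed bracket by bracket, the marginal cost of growth falls with scale, which is the property that makes one pricing scheme workable for both SMBs and large enterprises.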
Learn more about Fivetran’s Managed Data Lake Service for Google Cloud Storage.