Managing data with rapid growth

Explosive growth in personnel and internal data resources (data tables, dashboards, reports, metrics) can lead to a messy data ecosystem within a company. What started as a single source of truth can quickly turn into an excessive number of sources, with knowledge siloed inside each team. Data resources multiply as the company grows, but more data does not necessarily lead to better decisions. In fact, it can cause confusion, discourage data exploration, and hinder new ideas.


Effectively navigating endless data resources of differing quality, complexity, relevance, and trustworthiness is a challenge. A modern data catalog designed for data analysts can mitigate these issues. A data catalog is a metadata management tool used to inventory and organize data in any system. Typical benefits include improvements to data discovery, governance, and access.


Uber and Twitter both built their own internal data catalog tools to increase the productivity of data consumers. Datalogz mimics their strategy to make this technology accessible to any organization via a unique collaborative data catalog.

Uber's Databook

As Uber grew, so did its already copious data. Calling it big data was an understatement; Uber is "processing trillions of messages per day, storing hundreds of petabytes of data in HDFS across multiple data centers, and supporting millions of weekly analytical queries." After acquisitions and expansions, Uber owned JUMP Bikes, Postmates, and Drizly, each with its own data lake and its own mix of analytical and data warehousing tools. Uber's data analysts needed to understand this complicated data ecosystem and self-serve data, and the sheer abundance of data made that a challenge.


According to Uber, "Big data by itself, though, isn’t enough to leverage insights; to be used efficiently and effectively, data at Uber scale requires context to make business decisions and derive insights." This led Uber to build their own internal data catalog, Databook.


Databook, Uber's self-built data catalog, manages metadata and enables data discovery, exploration, and self-service data understanding. Databook "ensures that context about data—what it means, its quality, and more—is not lost among the thousands of people trying to analyze it." In short, Databook’s metadata empowers Uber’s engineers, data scientists, and operations teams to move from viewing raw data to having actionable knowledge on how to use it quickly and effectively.


With such fast growth, a system for searching all datasets and insights is an absolute must for making data useful.

Twitter's Data Access Layer


Like Uber, Twitter has heaps of data and runs over one hundred thousand daily processing jobs on tens of petabytes of data. The data team at Twitter has grown from a single data group to hundreds of employees. At that scale, it is nearly impossible to find and understand data insights without spending considerable time searching. To mitigate this problem and others, Twitter built the Data Access Layer (DAL) to document and organize all internal data.


Twitter's DAL was built with the goal of giving data consumers answers to four common questions: 1) How can we find the most important datasets? 2) Who creates them, and how? 3) What does the data represent logically? 4) How can the datasets be consumed?
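To make those four questions concrete, here is a minimal Python sketch of a catalog record and a search over it. It is an illustration only and not how DAL or Databook is implemented: the class names, field names, the sample datasets, and the use of weekly query counts as a stand-in for "importance" are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One catalog record. Every field name here is hypothetical."""
    name: str
    owner: str                # question 2: who creates the dataset
    produced_by: str          # question 2: how it is created
    description: str          # question 3: what the data represents logically
    access: str               # question 4: how the dataset can be consumed
    weekly_queries: int = 0   # question 1: assumed proxy for importance
    tags: list[str] = field(default_factory=list)


class Catalog:
    """A toy in-memory inventory of dataset metadata."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list[DatasetEntry]:
        """Case-insensitive match on name, description, or tags, most-queried first."""
        needle = term.lower()
        hits = [
            e for e in self._entries.values()
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in t.lower() for t in e.tags)
        ]
        return sorted(hits, key=lambda e: e.weekly_queries, reverse=True)


# Made-up sample data, for illustration only.
catalog = Catalog()
catalog.register(DatasetEntry(
    name="trips_raw_events",
    owner="platform-ingest",
    produced_by="streaming ingest job",
    description="Unprocessed trip events",
    access="files in the data lake",
    weekly_queries=40,
    tags=["trips", "raw"],
))
catalog.register(DatasetEntry(
    name="trips_daily",
    owner="marketplace-analytics",
    produced_by="nightly batch job",
    description="One row per completed trip per day",
    access="SQL table",
    weekly_queries=1200,
    tags=["trips"],
))

for entry in catalog.search("trips"):
    print(entry.name, "|", entry.owner, "|", entry.access)
```

A production catalog would harvest this metadata automatically from the underlying systems instead of relying on manual registration, but even this small record shows how a single entry can answer all four questions at once.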


DAL allows Twitter's hundreds of data consumers to find and understand data at a deeper level and faster pace.

Twitter's and Uber's decisions set the tone for data management at leading tech companies. Their experience shows that cataloging data in this way is a best practice for data teams. However, most companies don't have the resources or bandwidth to build their own DAL or Databook, and that is where Datalogz fits in. Datalogz is a simple, secure, data-consumer-focused discovery and understanding tool that helps analytics teams of any size scale data management with growth.

