Automated Metadata Tagging - Why do data catalogs fail?

Jan 3
5-min



Current Situation for Most Companies

Documenting data is essential to any successful data initiative, yet most people despise doing it. Documentation takes time and effort, and because it is often written on the fly, it rarely becomes a priority. At Datalogz we are changing this by building an automated metadata tagging system that classifies data with tags drawn from your company's business glossary.

Most data is undocumented and unlabeled, which makes it hard for teams to quickly find the right data for a project, so useful data is likely underutilized. As a result, projects fall short of their potential, time is wasted on data discovery, and data users lack a complete picture of what data is available to them. Better-documented data can mean more efficient data users, better analytics, and higher-performing machine learning models.

 

Consider an example: a data consumer is building a machine learning model to predict optimal sleep times for smartwatch wearers. Their first step is to determine which datasets are relevant for training the model. They might need data on sleep, performance stamina, training regime, and exertion. Today, they would search through data warehouses or ask colleagues to find this data. Because the available data in these categories is not easily discoverable, it could take days or weeks to find suitable datasets, and some of the most relevant ones would likely go undiscovered. The data is untagged, and tagging all of it manually is not an option, so both project time and the reliability of results suffer directly.


What if your data’s metadata could be auto-tagged to create an indexable and discoverable environment for data users to quickly find correct data assets?

Datalogz

Datalogz is already making a mark on the metadata management space by building the most robust metadata pipeline available on the market. We are experts in generating insights from query history, sourcing information from users, and tying user information to the data itself. From inception, our goal has been to make data more accessible and understandable for organizations of any size, and we equip data teams of all sizes with the tools they need to unlock their full potential. Our next goal is to incorporate AI into our product to document data automatically and keep that documentation up to date.

Together, we bring a novel approach to solving and executing the most complex data problems. Our team is capable and ready to solve automated metadata challenges. Innovating and demonstrating significant impact with as much efficiency as possible is at the core of what we do.

Approach

Using an AI clustering model to match similar fields based on proprietary profiling across datasets


In an enterprise data environment, thousands of data columns can exist in various formats and with unclear naming conventions. Auto-documentation therefore begins with grouping like data at the enterprise level. To do this, a proprietary profiling algorithm will be developed to profile every column. The profiles can then be used to find similar groupings across tables and show how data could be connected. This clustering approach reveals joinability across datasets, and the resulting groupings set up the tagging step.
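To make the profile-then-group idea concrete, here is a minimal sketch in Python. It is not the Datalogz profiling algorithm, which is proprietary and not described in this post. It assumes a toy profile (inferred value type plus the set of distinct values), uses Jaccard overlap of values as a stand-in similarity measure, and groups columns greedily. The column names, the `0.5` threshold, and every function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnProfile:
    name: str
    kind: str              # "number", "boolean", or "string"
    distinct_ratio: float  # distinct / non-null count, kept for illustration
    values: frozenset


def infer_kind(values):
    """Classify non-null values as boolean, number, or string."""
    if not values:
        return "string"
    if all(isinstance(v, bool) for v in values):
        return "boolean"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "number"
    return "string"


def profile(name, values):
    """Build a toy profile for one column."""
    non_null = [v for v in values if v is not None]
    distinct = frozenset(non_null)
    ratio = len(distinct) / len(non_null) if non_null else 0.0
    return ColumnProfile(name, infer_kind(non_null), ratio, distinct)


def similarity(a, b):
    """Jaccard overlap of distinct values; 0.0 when the inferred types differ."""
    if a.kind != b.kind or not (a.values or b.values):
        return 0.0
    return len(a.values & b.values) / len(a.values | b.values)


def group_columns(profiles, threshold=0.5):
    """Greedy grouping: a column joins the first group whose first member is similar enough."""
    groups = []
    for p in profiles:
        for g in groups:
            if similarity(g[0], p) >= threshold:
                g.append(p)
                break
        else:
            groups.append([p])
    return groups


# Hypothetical sample columns from two tables.
columns = {
    "users.id": [1, 2, 3, 4],
    "orders.user_id": [1, 2, 2, 3],
    "users.is_active": [True, False, True, True],
    "users.country": ["US", "DE", "US", "FR"],
}
profiles = [profile(name, values) for name, values in columns.items()]
groups = [[p.name for p in g] for g in group_columns(profiles)]
print(groups)
```

In this toy data, `users.id` and `orders.user_id` share most of their values, so they land in the same group, which is the kind of signal that hints at joinability. A production system would likely need richer profiles and a proper clustering model.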

Matching grouped data to a centralized tagging system

The second step is a system that tags columns with the keywords associated with each grouping. The model will ship with common, pre-trained tags and will also accept user feedback to create additional ones. Users will be able to add tags from a central portal, and creating a new tag will train the model in real time: the new tag is matched to the nearest cluster, and the model then tags the relevant tables. This will unlock data discoverability and autonomous tagging at scale.
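The "match a new tag to the nearest cluster" step could look roughly like the following Python sketch. The post does not say how distance is measured, so this example assumes a simple token-overlap (Jaccard) score between a tag's glossary terms and the column names in each cluster. The cluster contents, the tag, and all function names are hypothetical.

```python
import re


def tokens(text):
    """Split an identifier such as 'sleep_duration_min' into lowercase tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def vocabulary(items):
    """Union of the tokens of every string in items."""
    vocab = set()
    for item in items:
        vocab |= tokens(item)
    return vocab


def nearest_cluster(tag_terms, clusters):
    """Return (cluster_id, score) for the cluster with the highest token overlap."""
    tag_vocab = vocabulary(tag_terms)
    best_id, best_score = None, 0.0
    for cluster_id, cluster_columns in clusters.items():
        cluster_vocab = vocabulary(cluster_columns)
        union = tag_vocab | cluster_vocab
        score = len(tag_vocab & cluster_vocab) / len(union) if union else 0.0
        if score > best_score:
            best_id, best_score = cluster_id, score
    return best_id, best_score


# Hypothetical clusters produced by the grouping step.
clusters = {
    "c1": ["wearables.sleep_start", "wearables.sleep_duration_min"],
    "c2": ["workouts.heart_rate_avg", "workouts.exertion_score"],
}
tags = {}  # tag name -> columns carrying that tag


def add_tag(name, terms):
    """Attach a new tag to every column in its nearest cluster, if one matches."""
    cluster_id, _ = nearest_cluster(terms, clusters)
    if cluster_id is not None:
        tags[name] = list(clusters[cluster_id])
    return cluster_id


matched = add_tag("Sleep", ["sleep", "duration"])
print(matched, tags["Sleep"])
```

A tag whose terms overlap with no cluster is left unassigned here rather than forced onto a poor match; that is a design choice of this sketch, not something the post specifies.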


Continuous training and feedback loop


Users will be able to reassign tags and adjust the model's results as needed. The model will support programmatic retraining in real time based on this feedback, so each correction improves tagging accuracy and, in turn, data discoverability.
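One way to picture the feedback loop is as a running score per column and tag that user corrections push up or down. The Python sketch below is only an illustration under that assumption. It uses weighted vote counts as a stand-in for real model retraining, and the weights, class name, and column names are all hypothetical.

```python
from collections import defaultdict

MODEL_WEIGHT = 1  # assumed weight of an automatic suggestion
USER_WEIGHT = 2   # assumed weight of an explicit user correction


class TagFeedback:
    """Keeps a net score for every (column, tag) pair."""

    def __init__(self):
        self.scores = defaultdict(lambda: defaultdict(int))

    def record_model_tag(self, column, tag):
        """Register a tag proposed by the model."""
        self.scores[column][tag] += MODEL_WEIGHT

    def reassign(self, column, old_tag, new_tag):
        """A user replaces old_tag with new_tag on a column."""
        self.scores[column][old_tag] -= USER_WEIGHT
        self.scores[column][new_tag] += USER_WEIGHT

    def best_tag(self, column):
        """Return the highest-scoring tag with a positive score, or None."""
        candidates = {t: s for t, s in self.scores[column].items() if s > 0}
        if not candidates:
            return None
        return max(candidates, key=candidates.get)


feedback = TagFeedback()
feedback.record_model_tag("workouts.exertion_score", "sleep")   # wrong guess
before = feedback.best_tag("workouts.exertion_score")
feedback.reassign("workouts.exertion_score", "sleep", "exertion")
after = feedback.best_tag("workouts.exertion_score")
print(before, "->", after)
```

Weighting a user correction above a model suggestion means a single reassignment is enough to overturn a wrong automatic tag in this sketch.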

Where information is generated:


  • DB Name
  • Schema Name
  • Table Name
  • Column Name
  • Column Values
      • String
      • Boolean
      • Number
  • Usage History
  • User profile information
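As a rough illustration, the sources listed above could be gathered into one record per column. The Python sketch below is an assumption about how such a record might be shaped, not the Datalogz schema: the field names, the way usage history is summarized as a query count, and the sample values are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ColumnMetadata:
    db_name: str
    schema_name: str
    table_name: str
    column_name: str
    value_kind: str                                      # "string", "boolean", or "number"
    sample_values: list = field(default_factory=list)
    query_count: int = 0                                 # hypothetical summary of usage history
    frequent_users: list = field(default_factory=list)   # hypothetical link to user profiles

    def qualified_name(self):
        """Fully qualified column name: db.schema.table.column."""
        return ".".join(
            [self.db_name, self.schema_name, self.table_name, self.column_name]
        )

    def search_tokens(self):
        """Lowercase tokens from the names and sample values, usable as index keys."""
        text = " ".join([self.qualified_name(), *map(str, self.sample_values)])
        return set(re.findall(r"[a-z0-9]+", text.lower()))


record = ColumnMetadata(
    db_name="analytics",
    schema_name="wearables",
    table_name="sleep_log",
    column_name="duration_min",
    value_kind="number",
    sample_values=[412, 388],
    query_count=57,
    frequent_users=["data_science"],
)
print(record.qualified_name())
print(sorted(record.search_tokens()))
```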


In Summary


Undocumented and poorly documented data cause problems for most data teams. Data users would love to have all of their data documented, but the effort required is far too large. At Datalogz, we’re building an automated data documentation system to instantly document up to one-third of your company's data based on your business glossary. Your users provide semantics, examples, and terms. Datalogz then automatically tags columns and tables with relevant keywords for instant discoverability at your company.

Stop investing in data catalogs that create more work for your team. Legacy solutions like Amundsen, Atlan, Collibra, and Alation rely on creating business processes to document data; Datalogz changes the status quo by making your data discoverable instantly.

