Engineering

Countly's New Data Schema: What's Changing, Why It Matters, and How It Affects You

Last updated on

December 22, 2024

Arturs Sosins

CTO at Countly

For historical reasons, we have been using dynamic collection creation. This means that each new event creates a new collection in the database. While this provides some benefits, such as managing permissions on the collection level, it also has many downsides.

Because we wanted to future-proof Countly, we decided to take this step and change the schema, which will potentially allow us to do more in the future.

What Are We Changing With This New Data Schema?

In a nutshell, we are consolidating all events from all apps into a single collection. For aggregated data, instead of multiple eventsHASH collections, you will have a single events_data collection. For granular data, instead of multiple drill_eventsHASH collections, you will see a single drill_events collection.

While aggregated data is mostly meant for dashboard reports and is not used by customers directly, granular data, on the other hand, is used a lot outside of Countly, so let's discuss the changes to the drill_events collection in more detail.

Of course, to combine data from all events and apps into one collection, we need fields that indicate which app and which event each document comes from. We do that by adding the a and e fields, respectively.

We also removed some fields to reduce the documents' weight because these values can be calculated at run time based on the ts field.
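To illustrate what "calculated at run time" can look like, here is a minimal sketch. The post does not name the removed fields, so the calendar parts below are only an assumed example, and the sketch also assumes that ts is an epoch timestamp in milliseconds interpreted in UTC. The function name calendar_parts is hypothetical.

```python
from datetime import datetime, timezone


def calendar_parts(ts_ms: int) -> dict:
    """Derive calendar buckets from a timestamp.

    Assumption: ts_ms is epoch time in milliseconds, read as UTC.
    The returned keys are illustrative, not Countly field names.
    """
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "iso_week": dt.isocalendar()[1],  # ISO 8601 week number
    }


# 1734868800000 ms is 2024-12-22 12:00:00 UTC
sample = calendar_parts(1734868800000)
```

Since everything here is derived from a single value, storing the parts alongside ts in every document would only add weight.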

Image showcasing Countly before the new data schema.

Why Do We Make These Changes?

We want to be forward-thinking and future-proof Countly for what is coming next, but we also want to remove the inconveniences that our current users face.

Easier to export data

For example, previously, if you wanted to export all events from Countly, you had to jump through many hoops. After this change, it will be as easy as exporting a single collection, or querying it with a filter to get a subset.
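As a rough sketch of such a filter, the helper below builds a MongoDB-style query document for the single drill_events collection. It assumes that a holds the app ID, e holds the event key, and ts is an epoch timestamp in milliseconds; the helper name drill_filter and the sample values are hypothetical.

```python
from typing import Optional


def drill_filter(app_id: str,
                 event_key: Optional[str] = None,
                 start_ms: Optional[int] = None,
                 end_ms: Optional[int] = None) -> dict:
    """Build a query document for drill_events.

    Assumptions: "a" is the app ID, "e" is the event key and
    "ts" is epoch milliseconds. Omitted arguments are not filtered on.
    """
    query = {"a": app_id}
    if event_key is not None:
        query["e"] = event_key
    ts_range = {}
    if start_ms is not None:
        ts_range["$gte"] = start_ms  # inclusive lower bound
    if end_ms is not None:
        ts_range["$lt"] = end_ms     # exclusive upper bound
    if ts_range:
        query["ts"] = ts_range
    return query


# All events of one app vs. a single event in a time window
all_app_events = drill_filter("my_app_id")
one_event = drill_filter("my_app_id", "purchase", 1734825600000, 1734912000000)
```

The resulting dictionary can be passed as the filter to a MongoDB driver's find call or to an export tool that accepts a query.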

Easier to manage data

Similarly, managing a single collection is much easier than managing multiple collections, including:

Creating indexes
Sharding
Managing TTL and data retention
Deleting old, unused collections

Better performance

Because of the way MongoDB (or, specifically, WiredTiger) handles writes to collections, it is actually much faster to write to a single collection than to multiple collections in parallel. This also means that there will be no hard limit on event keys.

While we still suggest keeping the number of event keys manageable from the dashboard's point of view, there will no longer be a hard limit or a performance penalty.

Future growth

In the future, storing data in a single collection would allow us to:

Run faster cross-event queries
Handle consolidated data and applications better
Introduce support for other databases for granular data to handle different kinds of loads

What Are The Downsides and Am I Affected?

Migration of data

The main downside is migrating existing data to the new collection. If you have a lot of data, it will take a significant amount of time.

In this case, we suggest not migrating data at all. Just let new data be written to the new collection, and apply data expiration to the old collections so that data is deleted when it is no longer needed. Countly will offer an option to query both the new and old data models during the transition.

Of course, you can migrate the data if you need to, but it can take up to 100 hours per 2 billion documents. If you need clarification, please discuss this topic with your account manager.
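For a quick back-of-the-envelope figure, the sketch below scales the "100 hours per 2 billion documents" number linearly. Linear scaling is an assumption; actual duration depends on hardware, document size, and load, so treat the result as a rough upper bound. The function name is hypothetical.

```python
def estimate_migration_hours(doc_count: int,
                             hours_per_batch: float = 100.0,
                             docs_per_batch: int = 2_000_000_000) -> float:
    """Scale the quoted rate (100 hours per 2 billion documents) linearly.

    Assumption: migration time grows proportionally with document count.
    """
    return doc_count * hours_per_batch / docs_per_batch


# 500 million documents at the quoted rate
hours = estimate_migration_hours(500_000_000)
```

At the quoted rate, 500 million documents work out to about 25 hours and 2 billion to about 100 hours.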

Data retention and custom indexes

If you have applied data retention or any custom indexes to your drill collections, they must be reapplied to the new drill_events collection.
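As a hedged sketch of what reapplying could look like, the helper below builds a MongoDB createIndexes command document for drill_events. The compound index on a, e, and ts is just an example of a custom index, and the TTL field name cd is a placeholder: MongoDB TTL indexes only work on date-valued fields, so use whichever date field your documents actually carry. The helper name and the 90-day retention are hypothetical.

```python
def create_indexes_command(collection: str,
                           custom_keys: list,
                           ttl_field: str,
                           retention_days: int) -> dict:
    """Build a MongoDB createIndexes command document.

    custom_keys: list of indexes, each a list of (field, direction) pairs.
    ttl_field:   placeholder name of a date-valued field used for expiry.
    """
    indexes = []
    for keys in custom_keys:
        indexes.append({
            "key": dict(keys),
            "name": "_".join(f"{field}_{direction}" for field, direction in keys),
        })
    # TTL index: documents expire retention_days after the date in ttl_field
    indexes.append({
        "key": {ttl_field: 1},
        "name": f"{ttl_field}_ttl",
        "expireAfterSeconds": retention_days * 86400,
    })
    return {"createIndexes": collection, "indexes": indexes}


command = create_indexes_command(
    "drill_events",
    [[("a", 1), ("e", 1), ("ts", 1)]],  # example custom compound index
    "cd",                               # assumed date field, adjust as needed
    90,
)
```

The command document can then be run against the countly_drill database with your driver's generic command method, ideally in a test deployment first.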

Exporting data from Countly

If you periodically export data from Countly or access raw data, things will become easier for you, but you will still need to make some changes to make it work.

Instead of using the dynamic per-event collections, you need to switch to the drill_events collection in the same countly_drill database.

You can refer to this data model document for the data schema for this collection, but if you have already worked with drill collections, it will be very familiar to you.

Accessing data in DB Viewer

If you are an avid DB Viewer user, then instead of multiple collections in the countly_drill database, you will now see a single main collection named drill_events, containing all events from all apps.

Migration checklist

The first thing to decide is whether you need to migrate all data immediately or if you are okay with seamless migration using both new and old data.

If you need to migrate all data right away:

Make sure you have enough disk space, which is double what countly_drill currently consumes.
Delete all the data you no longer need, such as older data or unused events, to ensure you migrate as little data as possible.
Calculate the time needed for migration: around 100 hours for 2 billion documents.
Take note of the custom indexes and data retention settings you have so they can be reapplied to the new drill_events collection.
Try the new Countly in a test deployment and make sure all your exports and raw data access work with the new data model.
Contact Countly for help during migration.

If you are OK with a migration period during which both old and new data are used:

Make sure old drill collections have data retention set up (TTL indexes).
Take note of the custom indexes and data retention settings you have so they can be reapplied to the new drill_events collection.
Try the new Countly in a test deployment and make sure all your exports and raw data access work with the new data model.
Install the new Countly in your production environment.
Enable using old drill collections under Settings -> Drill -> Union with data from old collections.
After the data retention period expires, remove all empty old drill collections and disable the setting.
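For the last step, a small sketch like the one below can help list which old collections are safe to remove. It relies on the naming described earlier (old collections are named drill_events followed by a hash, the new one is exactly drill_events) and assumes no unrelated collection in countly_drill shares that prefix. The function name and the sample collection names are hypothetical; the document counts would come from your own database.

```python
from typing import List, Mapping

NEW_COLLECTION = "drill_events"


def empty_legacy_collections(doc_counts: Mapping[str, int]) -> List[str]:
    """Return old per-event drill collections that hold no documents.

    doc_counts maps collection name -> document count.
    Assumption: any name that starts with "drill_events" but is longer
    than it is an old drill_eventsHASH collection.
    """
    return sorted(
        name
        for name, count in doc_counts.items()
        if name.startswith(NEW_COLLECTION)
        and name != NEW_COLLECTION
        and count == 0
    )


counts = {
    "drill_events": 10,      # new consolidated collection, never listed
    "drill_eventsabc": 0,    # old and empty: candidate for removal
    "drill_eventsdef": 5,    # old but still holding data
    "drill_meta": 0,         # unrelated name, ignored
}
to_remove = empty_legacy_collections(counts)
```

Review the resulting list before dropping anything, since a dropped collection cannot be recovered without a backup.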