Google Analytics 4 [GA4] BigQuery - 9 challenges to surprise you in data analysis

Oskar Maciejek
Founder | Tech lead & Data Solutions Architect
Data analysis
Jan 13, 2023

    Continuation of a series of articles about Google Analytics 4 [GA4] data in BigQuery


    This article is part of a series of posts about Google Analytics 4 and the export of data to Google BigQuery. It consists of the following components:

    In the previous post of Google Analytics 4 BigQuery - why you should use it ? we discussed a number of arguments about why it is worthwhile to use the export of Google Analytics 4 data in BigQuery.
    However, while interacting with data, we may come across various difficulties that we should be aware of before we spend many hours on taking attempts to overcome them. So we decided to write an informative article about these problems - this is a collection of our experience from projects based on data export from Google Analytics 4.

    #1 Data reported in Google BigQuery will be different than those seen in the Google Analytics 4 panel

    The first task we handled while working with the export of Google Analytics 4 data in BigQuery was to trace the logic of reports that are available in the interface. In this way, we can learn a lot and understand exactly how the metrics and dimensions are calculated and validate the correctness of our queries.

    Here, the task turned out to be difficult, because according to Google Analytics 4 documentation , the data in the panel may differ from those that we will calculate in BigQuery.

    Even if the data in the Google Analytics 4 panel shows that they are not sampled, the number of sessions is still an estimate based on the number of unique session identifiers. Read more about this topic in: Unique count approximation in Google Analytics


    How Google Analytics 4 Calculate Sessions
    In BigQuery we most often do not use the estimation function, and hence we can spot differences, for instance in the number of sessions compared to the results given in standard and exploratory reports or in Looker Studio.

    Here are the recommendations from Google:

    • If we wish to get more accurate results based on raw data, we should use Google BigQuery to export data
    • If we need to quickly obtain results taking into account the margin of error, it is best to check them in the reports in the panel

    Google Analytics 4 BigQuery Data Precision

    #2 Information about campaign sources - in BigQuery we only have the first source of user acquisition

    Little doubt, a report showing the source/medium was the most popular report in Google Analytics Universal. Such a report was also reproduced in Google Analytics 4. It is available in the default Acquisition -> Traffic Acquisition section.

    Google Analytics 4 - Source/Medium Report

    This report uses a dimension called “Session source/medium”, which carries information about the source of the session.

    While viewing data exported to Google BigQuery for GA4, we can notice fields called traffic_source.source, traffic_source.medium, traffic_source.name (campaign name), which are very often misused to reproduce the above report.

    Google Analytics 4 BigQuery traffic_source field example


    The problem, however, is that these are fields with the information about the things that occurred in the first campaign since acquiring this user, not about the source that occurred in a relevant session, as we can view it in the reports and the "Session source/medium" dimension.


    Google Analytics 4 BigQuery traffic_source field description


    By using these fields in the table, we will thus achieve a different image than for the “Session source / medium” dimension in the panel.


    Google Analytics 4 BigQuery traffic_source field path example

    UPDATE - the following parameters are currently available at the event parameter level and in the columns in the collected_traffic_source group. However, please remember that these are not data calculated according to the dimension logic, i.e. Session source/medium/campaign. These are just parameters that occur with the event.

    To reproduce this dimension, we need to retrieve the campaign parameters from the events, and then calculate these data according to the logic described in the documentation for all sessions: [GA4] Scopes of traffic-source dimensions - Analytics Help


    Google Analytics 4 BigQuery Campaign Parameters

    #3 Google Analytics 4 [GA4] performs deduplication on users between devices


    The data may be connected between platforms/devices, this being a major revolution in Google Analytics 4 compared to Google Analytics Universal. By implementing Google Analytics 4 on our website and in the mobile application, we are able to obtain one transparent source of analysis for these platforms.

    By default, Google Analytics generates a “user_pseudo_id” for each device/browser, that is a new cookie (website) or identifier (mobile application). Based on this identifier, the calculations necessary to present all metrics and dimensions in the panel are performed.
    One user can visit our website or app through different devices. If you visit us on the website on your computer, then on the website on your mobile device, and finally in the mobile application, you will be identified as 3 different users.

    If we enable Google Signals or implement a User ID function (by way of illustration a custom identifier from a database when logging in or making a purchase)Google Analytics will perform a deduplication on your data (as far as possible in Google Signals), which will affect the number of users and the attribution of marketing channels. We may view data at different levels (whether deduplicated or not) depending on our identity settings in Google Analytics 4 - more on this in: [GA4] Reporting Identity - Analytics - Help

    Google Analytics 4 User Identity

    However, it is worth remembering that Google Signals data is data that Google can connect with users who have logged into their Google accounts and have ad personalization enabled. By linking this data to logged-in users, the reports may present a number of users more accurately. In view of the users' privacy policy, Google cannot share this data with us, which also affects the discrepancies in data between the interface and the reports in Google BigQuery. Read more about this topic in: [GA4] Activate Google signals for Google Analytics 4 properties


    Google Analytics 4 BigQuery Show More Users


    Take note - if we have implemented the user_id parameter on our website or application, we can seek to recreate the deduplication of users in BigQuery - only in this way will we achieve the correct number of users and the correct attribution of traffic sources to the session in accordance with the “Cross-Device” logic.

    Complex Deduplication in BigQuery | by Benjamin Campbell , expands the issue of the deduplication problem and a potential solution.

    #4 One session ID (ga_session_id) can be assigned to 2 different users


    Time stamps are used intensively in programming.
    This data determines the moment when a specific event occurred.
    The value of this stamp is defined on the basis of "Unix Time", a system of time representation measuring the number of seconds since the early 1970.

    For example, Timestamp 1672481243 represents the date and time: 2022-12-31 10:07:23
    That's 1672481243 seconds since the early 1970s.

    When browsing Google Analytics 4 data in BigQuery, we also have a lot of contact with time stamps, for instance the field event_timestamp contains the number of microseconds since 1970.

    When we look at the identifier named “ga_session_id”, we see that this is the approximate time of the first event starting the sessions.


    Google Analytics 4 GA Session ID Duplication


    Whereas several users can start sessions at the same time (exactly the same second), the parameter per se, that is ga_session_id, does not give a unique session identifier. To this end, we need to combine user_pseudo_id or user ID and ga_session_id to obtain a unique session ID for our calculations. We can do this with the CONCAT function.

    SELECT
       CONCAT(user_pseudo_id, ga_session_id) as session_id
    FROM
       your_google_analytics_4_events


    We will then develop a unique string, one which will communicate the identifier of a session:


    1020668977.16723547091672354709



    Of note - if we look at the value of user_pseudo_id and see what it is composed of, we will also notice a timestamp communicating the date of creation of the user in question.


    1020668977.1672354709 = {{random number}} + “.” + {{user created timestamp}}

    #5 One session ID (ga_session_id) can occur on 2 different days


    Google Analytics Universal created a new session each time:

    • there has been no interaction for more than 30 minutes (based on default settings)
    • when campaign parameters changed ( that is utm, gclid, referral)
    • when the session took place between one and the other day


    By way of illustration, if a user started the session at 23:58, and the purchase was made without walking away from the computer at 00:05 the following day, Google Analytics Universal created 2 different session IDs in this case - the one lasted from 23:58 to 00:00 and the other from 00:00 to 00:05 (assuming that the user closed the browser as soon as he had made his purchase).


    For Google Analytics 4, the session ID will remain the same between one and two days. Accordingly, if we want to calculate the exact metrics for a relevant session, we have to take into account the data from the previous day, be it to check the entry/destination page for a specific session. This affects the size of the processed data and the need to apply additional modifications in SQL queries.

    #6 URL (page_location) can be up to 420 1000 characters long


    UPDATE - Character limit restrictions for page_location parameter changed from 420 to 1000 characters

    If you have long urls, ones using many parameters, the analysis of this data in BigQuery may surprise you. During one of the projects, the client asked us to analyze the filters selected by the user based on the parameters in the url addresses. We sat down to the task with optimism, writing regular expressions that allow us to extract parameters from url addresses.
    Following a brief analysis, we noticed that parts of the parameters are missing or often cut out in url addresses. So we decided to check the maximum length of url addresses and it turned out that a large part is always 420 characters - all these addresses had cut off parameters due to length.


    Google Analytics 4 page_location max 420 characters


    Our recommendation: if we are aware that the relevant url parameters will be very important for us during data analysis, we should pass them as event parameters. Take note, however, that page_location is a system parameter that can be 420 1000 characters long. For custom parameters, the character limit is 100. Read more about the limits: [GA4] Event collection limits - Analytics Help

    #7 Data without analytical consent contain no information about user_pseudo_id and ga_session_id


    If we have correctly implemented the Google consent mode function, the data may reach Google Analytics in various forms. If you do not agree to be identified by a cookie, your events will be submitted to Google BigQuery, but the parameters, that is user_pseudo_id and session_id, will be null. Hence, if we plan to use this data for combining or other calculations, it is worthwhile to keep it in mind, because many data can be grouped into one non-existent user or into one non-existent session.


    Google Analytics 4 BigQuery Consent Mode Data

    #8 When our transactions go to Google Analytics in different currencies, we need to have them converted


    If users can make purchases in different currencies on our website or in the application, we must ensure that the value is converted to the currency that has been set in the settings of the respective service.
    In the case of submitting the “currency” event parameter in the implementation, Google Analytics 4 provides us with additional fields in which the data converted by default to USD are located. Sadly, we cannot define the currency in which we would like to drop data to Google BigQuery. Respectively, if we have a selected currency in our service settings, for instance PLN, in which case we will have to download the day-specific rate to Google BigQuery and convert this information in order to be able to map what we see in the panel.


    Google Analytics 4 BigQuery - Currency Conversion Value

    #9 Lack of complete information about full Google Ads data and vulnerability to campaign name changes


    In exporting Google Analytics Universal data to BigQuery, we were accustomed to comprehensive information about Google Ads campaigns.

    Google Analytics 4 BigQuery Missing Google Ads Data

    As we mentioned earlier, for Google BigQuery data, we have only 3 fields calculated, that is traffic_source.source, traffic_source.medium and traffic_source.name, and these are fields that talk about acquiring a user, not about the sources of a specific session.
    If you want to complement your Google Analytics 4 data with additional information from Google Ads, including your advertising account identifier, campaign identifier, campaign type, we need to make sure that Google Ads tables are included in Google BigQuery, and only in the subsequent step do we link them with additional SQL instructions. This process is quite capable of expanding the logic of queries, and so this constitutes another difficulty that we must be ready for while exploring data.

    Summary

    Google Analytics 4 and the ability to export data to Google BigQuery is a great solution for all companies that wish to make efficient decisions based on data. However, if we want to fully draw from their potential and act in accordance with the state-of-the-art practices related to measuring traffic between devices, we must spend a lot of time processing this data (user deduplication, attribution of traffic sources, combining Google Ads data). We also need to be prepared for data discrepancies between the Google Analytics 4 panel and what we get as a result in Google BigQUery.

    Luckily, most of the problems related to data processing can be automated - we discussed our approach to this issue in our next article “Google Analytics 4 BigQuery - traffic sources, sessions, attribution, marketing costs and ready data for analysis”.


    Rate article

    Rates: 0 Avarage rate: 0