A/B testing sounds simple. Show half your users version A, show the other half version B, measure which performs better. In practice, building a reliable A/B testing system that produces trustworthy results without degrading your application is a genuine engineering challenge. We have built experimentation infrastructure on multiple projects at Veld Systems, and the gap between "we run A/B tests" and "we run A/B tests that actually produce reliable data" is wider than most teams realize.
This post covers the architecture behind production A/B testing: how assignment works, how to avoid common pitfalls, and what infrastructure you need to run experiments without breaking things.
The Core Problem: Consistent, Unbiased Assignment
The foundation of any A/B test is user assignment. Each user needs to see exactly one variant, and that assignment needs to be consistent (they see the same variant every time they visit) and unbiased (the groups are statistically equivalent).
The simplest approach is random assignment stored in a database. When a user first encounters an experiment, generate a random number, assign them to a variant, and store that assignment. On subsequent visits, look up their assignment and serve the same variant. This works for basic cases but breaks down quickly.
The problems with database stored assignment:
First, it creates a database query on every page load for every active experiment. If you are running five experiments simultaneously (which is common for any team serious about optimization), that is five additional queries per page load. Under high traffic, this degrades performance noticeably.
Second, it does not handle anonymous users well. If a user is not logged in, you need another identifier. Cookies work but are increasingly unreliable due to browser privacy changes. Device fingerprinting is invasive and inaccurate. You need a strategy that handles the logged out to logged in transition without reassigning the user to a different variant.
The better approach: deterministic hashing. Instead of storing assignments, compute them. Take a combination of the user identifier (or anonymous ID) and the experiment name, hash it, and use the hash to determine the variant. The same input always produces the same output, so assignments are consistent without any storage. The hash function's uniform distribution ensures unbiased splitting.
In practice, this looks like: `hash(userId + experimentName) % 100`. If the result is under 50, variant A. If 50 or above, variant B. For more complex splits (10/90, 33/33/34), adjust the thresholds accordingly. One caveat: the hash must be stable across processes and deployments, so use a fixed function like SHA-256 or MurmurHash rather than a language's built-in hash, which may be randomly seeded per process.
This approach eliminates the database dependency for assignment, handles anonymous users (use a cookie based anonymous ID that persists across sessions), and scales to any number of experiments without additional infrastructure load.
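A minimal sketch of this in Python (the function name and the default 50/50 split are illustrative; any stable hash such as SHA-256 works):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, split: int = 50) -> str:
    """Deterministically assign a user to variant A or B.

    Uses SHA-256 rather than Python's built-in hash(), which is salted
    per process and would break assignment consistency.
    """
    key = f"{user_id}:{experiment}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "A" if bucket < split else "B"
```

The same user and experiment always map to the same bucket, and because the experiment name is part of the hash input, a user's variant in one experiment is independent of their variant in any other.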
Server Side vs. Client Side: A Critical Architecture Decision
Whether you evaluate experiments on the server or in the browser has significant implications for both performance and reliability.
Client side A/B testing (the approach used by most plug and play tools) works by loading a JavaScript library that modifies the page after it renders. The user briefly sees the original page, then the library swaps in the variant. This causes a flash of original content that is visible, annoying, and can skew your results because users who notice the change behave differently.
Client side testing also has a fundamental reliability problem: ad blockers and privacy tools frequently block the testing library's JavaScript. When that happens, those users always see the control variant, creating a systematic bias in your results. You are not comparing variant A to variant B. You are comparing "variant A plus all the privacy conscious users" to variant B.
Server side A/B testing evaluates the experiment before rendering the page. The user receives the correct variant on the first render with no flash and no dependency on client side JavaScript. This is more work to implement but produces cleaner data and better user experience.
For web and mobile applications we build at Veld, we strongly prefer server side evaluation. The implementation uses feature flags as the underlying mechanism, with experiment assignment layered on top. The feature flag system determines which variant to render, and the analytics layer tracks which variant each user saw and what they did.
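As a sketch, a server-side handler might look like the following. The handler, the `log_event` sink, the template names, and the inlined `assign_variant` helper are all hypothetical stand-ins for whatever your feature flag and analytics systems provide:

```python
import hashlib

EVENT_LOG = []  # stand-in for the analytics pipeline

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministic hash-based assignment (see earlier section)."""
    bucket = int(hashlib.sha256(f"{user_id}:{experiment}".encode()).hexdigest(), 16) % 100
    return "A" if bucket < 50 else "B"

def log_event(kind: str, **fields):
    EVENT_LOG.append({"kind": kind, **fields})

def checkout_page(user_id: str) -> str:
    # Evaluate the experiment before rendering: the user receives the
    # correct variant on the first render, with no client-side flicker
    # and no dependency on a JavaScript library loading.
    variant = assign_variant(user_id, "checkout-redesign")
    log_event("exposure", user=user_id,
              experiment="checkout-redesign", variant=variant)
    return "checkout_b.html" if variant == "B" else "checkout_a.html"
```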
The Event Pipeline: Measuring What Matters
Assignment is only half the system. The other half is measurement, and this is where most A/B testing implementations produce misleading data.
The measurement pipeline needs to capture three things for every user in an experiment:
1. The assignment event. When a user is assigned to (or evaluated for) an experiment variant, log it. This is your denominator. Without accurate assignment counts, you cannot calculate conversion rates correctly.
2. The exposure event. Did the user actually see the variant? Assignment and exposure are different. A user might be assigned to variant B of a checkout page experiment but never visit the checkout page during the experiment period. Including them in the analysis dilutes your results. Track exposure separately from assignment.
3. The outcome event. Did the user perform the action you are measuring? This might be a purchase, a sign up, a click, or any other measurable action. The outcome event needs to be linked to the correct experiment variant with the correct timestamp.
These three events flow into an analytics pipeline that computes the experiment results. The pipeline needs to handle deduplication (a user who converts twice should count once for conversion rate calculations), attribution windows (how long after exposure does a conversion still count?), and data delays (events from mobile apps might arrive hours or days after they occur).
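The deduplication and attribution-window rules can be made concrete with a small sketch (field names and the 7-day window are illustrative):

```python
from datetime import datetime, timedelta

def conversion_rate(exposures: dict, outcomes: list,
                    window: timedelta = timedelta(days=7)) -> float:
    """Compute a conversion rate from exposure and outcome events.

    exposures: {user_id: exposure_timestamp} -- the denominator
    outcomes:  [(user_id, outcome_timestamp), ...]

    A user counts at most once, and only if the outcome falls inside
    the attribution window after their exposure.
    """
    converted = set()
    for user_id, ts in outcomes:
        exposed_at = exposures.get(user_id)
        if exposed_at is not None and exposed_at <= ts <= exposed_at + window:
            converted.add(user_id)  # set membership dedups repeat conversions
    return len(converted) / len(exposures) if exposures else 0.0
```

Note that outcomes from users who were never exposed are dropped entirely, which is exactly why exposure must be tracked separately from assignment.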
On projects like Traderly, where accurate measurement directly impacts business decisions, we build this pipeline with particular care. A flawed measurement system is worse than no measurement because it gives you confidence in wrong conclusions.
Statistical Rigor: Avoiding False Conclusions
The most common A/B testing mistake is not technical. It is statistical. Teams look at results too early, declare a winner when the data is not conclusive, and ship changes based on noise rather than signal.
Sample size matters. If your baseline conversion rate is 5% and you want to detect a 10% relative improvement (from 5% to 5.5%), you need roughly 30,000 users per variant for a statistically significant result at 95% confidence and 80% power. Most teams do not run that math before starting an experiment and end up making decisions based on samples that are far too small.
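That arithmetic can be checked with a standard two-proportion power calculation (normal approximation, 80% power assumed; Python standard library only):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p_base: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Users per variant needed to detect a relative lift in a
    conversion rate (two-sided test, normal approximation)."""
    p1, p2 = p_base, p_base * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

print(sample_size_per_variant(0.05, 0.10))  # roughly 31,000 per variant
```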
Peeking invalidates results. Checking your experiment results daily and stopping when you see a "winner" is not the same as waiting for the predetermined sample size. Early peeking dramatically increases the false positive rate. If you check results 10 times during an experiment, your actual false positive rate can be 25 to 40% instead of the expected 5%. This is called the peeking problem, and it means many teams are shipping changes that have no real impact while believing they do.
Solutions that work in practice: Use sequential testing methods (like Bayesian approaches or alpha spending functions) that are designed for continuous monitoring. These methods adjust the significance threshold to account for repeated looks at the data. Alternatively, commit to a fixed sample size before starting the experiment and do not look at results until you reach it.
Multiple comparisons. If you are testing five metrics simultaneously, the probability that at least one shows a "significant" result by chance is not 5%. It is 23%. Correct for this by using a method like Bonferroni correction or, better yet, designate a single primary metric before the experiment starts and treat everything else as exploratory.
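Both numbers follow directly from the familywise error formula, which is worth a quick check:

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Probability that at least one of m independent tests produces
    a false positive at significance level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(alpha: float, m: int) -> float:
    """Bonferroni-corrected per-test significance threshold."""
    return alpha / m

print(round(familywise_error_rate(0.05, 5), 3))  # 0.226, i.e. about 23%
print(bonferroni_alpha(0.05, 5))                 # each metric tested at 0.01
```

Bonferroni is conservative (it assumes the worst about dependence between metrics), which is another argument for designating one primary metric up front instead.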
Infrastructure Considerations
Running experiments in production requires infrastructure that most teams underestimate:
Configuration management. You need a system to create, configure, activate, and deactivate experiments without deploying code. This is typically a database backed admin interface that your product team can use independently. The experiment configuration specifies the variants, the traffic split, the targeting criteria (maybe you only want to run this experiment for users in a specific country), and the start and end dates.
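A configuration record for such a system might look like the following sketch (all field names and values are illustrative):

```python
# Hypothetical experiment record, as a product team might create it
# through a database-backed admin interface.
experiment_config = {
    "name": "checkout-redesign",
    "variants": {"A": 50, "B": 50},           # traffic split in percent
    "targeting": {"country": ["DE", "AT"]},   # eligibility criteria
    "start": "2025-03-01",
    "end": "2025-03-28",
    "active": True,
}

# A sanity check the admin interface should enforce on save:
assert sum(experiment_config["variants"].values()) == 100
```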
Mutual exclusion. If you are running multiple experiments that affect the same page or flow, you need to ensure users are not in conflicting experiments simultaneously. A user in an experiment testing the checkout page layout should not also be in an experiment testing the checkout page copy. These interactions make results uninterpretable. Implement experiment layers or mutual exclusion groups to prevent this.
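One common implementation (a sketch; layer and experiment names are illustrative) hashes on the layer rather than on the individual experiment, so each user lands in exactly one experiment per layer:

```python
import hashlib

def layer_assignment(user_id: str, layer: str, experiments: list) -> str:
    """Assign a user to exactly one experiment within a mutually
    exclusive layer.

    Hashing on (user, layer) -- not on the experiment name --
    guarantees a user cannot appear in two experiments that share
    the layer.
    """
    key = f"{user_id}:{layer}".encode("utf-8")
    slot = int(hashlib.sha256(key).hexdigest(), 16) % len(experiments)
    return experiments[slot]
```

For example, `layer_assignment("user-123", "checkout", ["layout-test", "copy-test"])` deterministically places the user in either the layout experiment or the copy experiment, never both.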
Kill switches. Every experiment needs an instant off switch. If variant B causes errors, crashes, or a significant drop in a critical metric, you need to shut it down immediately without waiting for a code deployment. This is another place where feature flags earn their keep. The same system that controls flag rollouts controls experiment shutdowns.
Monitoring integration. Your application monitoring should be aware of experiments. If error rates spike, you need to know whether the spike is concentrated in one experiment variant. This requires tagging your monitoring data (logs, error reports, performance metrics) with the active experiment variants for each user. Without this, debugging production issues during an active experiment becomes much harder.
When Not to A/B Test
Not everything should be A/B tested. The overhead of running a properly controlled experiment is real, and sometimes the juice is not worth the squeeze.
Do not A/B test when you do not have enough traffic. If you get 500 visitors a day and your conversion rate is 3%, you will need to run most experiments for months to get a reliable result. At that scale, qualitative user research (talking to customers, watching session recordings) will give you better insights faster.
Do not A/B test bug fixes or obvious improvements. If your checkout button is broken on mobile, fix it. You do not need an experiment to prove that a working button converts better than a broken one.
Do not A/B test when the decision is strategic, not incremental. Whether to pivot your product, enter a new market, or redesign your entire user experience is not a question A/B testing can answer. Those decisions require vision, judgment, and willingness to take calculated risks.
Building It Right
A/B testing architecture is one of those systems that looks simple until you try to build it properly. The difference between "we swap button colors sometimes" and "we run reliable experiments that drive confident decisions" is significant engineering work across assignment, measurement, statistics, and infrastructure.
If your product is at the stage where data driven optimization matters, investing in proper experimentation infrastructure pays for itself quickly. The alternative, making changes based on opinion or unreliable data, is how teams waste months building features that do not move metrics.
We build experimentation systems as part of our system architecture work. If you want to run experiments that produce results you can trust, reach out to us and we will design an approach that fits your product and traffic.