First published on Insights & Observations by UsabilityHub

If you’re in e-commerce, the way you lay out your products is a critical part of your customer experience.

It’s your digital shopfront. You need everything nicely on display in order to generate as many sales as possible, just like a regular store.

A huge advantage e-commerce businesses have over traditional shopfronts is that you don't need foot traffic analysis or video monitoring to see the impact of layout changes. You can use online usability testing to analyse design and layout variations in a fraction of the time.

This case study demonstrates and tests alternative product grid layouts, with an aim to answer one simple question — what does the perfect product grid look like?

Before we begin

I’m not a maths guy. I approached this case study from the viewpoint of a keen designer armed with a small budget, a short timeline and a basic question.

There’s no comprehensive statistical analysis here. The intention was to be as pragmatic as possible while still having confidence in the results.

Background

In late 2015, I worked at a grocery delivery startup, YourGrocer, based near my office. I’d been in corporate e-commerce for a while, so a change of gear and more ownership of an end-to-end customer experience appealed to me.

The crazy pace of a seed-stage business meant we didn't have time to test every idea in a meaningful way. There were stronger priorities: ideas with more potential for customer impact, or a better use of developer and designer time.

When we revised the product layout, we went through a quick cycle of the user-centred design process, tested our updates with customers through moderated user testing and contextual inquiry, and then released them.

After implementation, we recorded conversion, 2 month retention, time spent per order, and a host of other metrics.

However, the results of these tests often didn’t show huge changes, and we didn’t have enough customer volume to run a series of meaningful A/B tests in a short amount of time.

Often, we concluded that an update hadn't broken anything, had received good qualitative feedback from users, and looked better to us, so we moved on to the next problem.

A more pragmatic approach

Without expensive paid traffic through Google or Facebook, it would have taken months to get to statistically significant answers from real site data. That’s far too long for a seed-stage business. We needed answers in days or weeks, not months.

On reflection, we should have tried online testing on a heap of small variations before shipping updates. I only really discovered how much online testing platforms had matured towards the end of my time at YourGrocer. Now we can get some data back in minutes, instead of months.

This is fortunate for you, reader, because I can now present a case study of how changes to your product grid affect the scanability of your design! ^_^

What's the hypothesis?

Improving the product layout was driven by a series of hypotheses:

  1. Bigger everything == shorter scanning time.
  2. Shorter scanning time == shorter order time.
  3. Shorter order time == greater perception of product/service speed to the customer.

… all of which would contribute to customers returning more often.

In this case study, we’re only looking to test one piece of the puzzle — the bigger everything == shorter scanning time part.

Test setup

For this research, I used a multi-variate click test on UsabilityHub. This allows us to test multiple layout concepts with exactly the same task setup every time.

Here’s what the starting point looked like:

A pretty bog-standard product grid. Art.

I tested five variations, along with the original. Each variation was seen by 25 unique participants. This gave a total participation count of 150.

Test participants only saw one of the variations. The intention was to record the first, fresh impression of each participant, with past usage or familiarity with the product having as little influence on the problem as possible.

I asked the participants to imagine they were doing grocery shopping online. The task was to look at the ‘Fresh vegetables’ category page, and add one bunch of bok choy to their shopping cart.

This specific language was used because we wanted to test the scanability of the grid within the context of a common user journey, in this case the purchasing flow.

The data you get from a test like this is limited, but it’s enough for our purposes. The mean completion time is returned, as well as a heatmap showing where the participants clicked. The raw data can also be exported to CSV format for further analysis.

Results

Comparing the mean completion time for each test gives us a basic indication about which variation performs the best. If you can find an item quickly, that’s success. If you struggle to scan the grid and it takes a long time to find an item, that’s failure.

Here’s a gif of 3 different grids, and the average completion time for each. There’s a transcript underneath.

3,4 and 5 column layout comparison

Test 1: existing version with 3 columns

  • 49 seconds average completion time.
  • The target item was on the 11th row of the layout, with the image starting around 5,760 px down the page.
  • This represents an average scanning time of 4.45 seconds per row.

Test 2: 4 columns

  • 41.5 seconds average completion.
  • Position of the item was a couple of rows higher, on the 9th row, 4,630 px down.
  • Average scanning time of 4.61 seconds per row.

Test 3: 5 columns with slightly smaller product cards
This version had the product cards slightly scaled down to fit within the same resolutions as the previous tests.

  • 43.8 seconds average completion
  • Position of the item was higher again, on the 7th row, 3,080 px down.
  • Average scanning time of 6.26 seconds per row.

Initial insights

From this first dataset, it seems that there is a point of diminishing returns for how many products you cram into the grid.

The average scanning time increased 34% per row in the jump from 4 to 5 columns.
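As a sanity check, here's how those per-row figures fall out of the numbers above. This is a quick Python sketch using the rounded mean times quoted in this article; the unrounded raw data will differ slightly.

```python
# Per-row scanning time = mean completion time / row of the target item.
# Figures are the rounded values quoted in the article, not the raw export.
tests = {
    "3 columns":         {"mean_s": 49.0, "target_row": 11},
    "4 columns":         {"mean_s": 41.5, "target_row": 9},
    "5 columns (small)": {"mean_s": 43.8, "target_row": 7},
}

per_row = {name: t["mean_s"] / t["target_row"] for name, t in tests.items()}
for name, secs in per_row.items():
    print(f"{name}: {secs:.2f} s/row")

# Relative jump in scanning time from 4 to 5 columns.
jump = per_row["5 columns (small)"] / per_row["4 columns"] - 1
print(f"4 to 5 column increase: {jump:.0%}")
```

With these rounded inputs the jump comes out at roughly 36%, in the same ballpark as the 34% quoted above, which was presumably computed from the unrounded raw data.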

Is it the decrease in size of the product cards that causes this effect? Or the fact that there are more products in each row, overwhelming the test participants?

To get some more data on this, I repeated a previous test, but with the cards scaled down to the same size as the 5 column test.

I can’t get enough of these gifs.

Test 4: 4 columns with smaller product cards

  • 40.3 seconds average completion
  • Target was on the 9th row, 3,940 px down the page
  • Average scanning time of 4.47 seconds per row

This result is close to the previously tested 4 column version: the mean completion time is 1.2 seconds faster, but the average row scanning times are within 0.2 seconds of each other. To me, this suggests it's the extra column slowing participants down rather than the smaller product card size. But it's not super convincing.

Ok. So what if we play with the scale of the layout?

Hey, I like the design, but can you make it bigger?

One of the things to keep a sharp eye out for when conducting moderated user sessions is whether or not a participant is struggling to make out details on a page due to their size.

You might notice a participant lean in to see a detail, or they might verbalise it in an off-the-cuff way.

If someone pulls out reading glasses when you first start a test session, it’s not an issue. When they pull them out and you’re 20 minutes in? Red flag.

For marketing pages, campaign sites and other consumer-focused content, it’s critical that your design passes the squint test: can you squint your eyes and still make out the main CTAs?

For the next test, I took the original 3 column design and scaled up as many elements on the page as possible, to measure what kind of impact this change can have on scanning time.

Yes, it does.

Test 5: 3 columns, scaled up

  • 39.7 seconds average completion, almost 10 seconds quicker than the original 3 column layout in the first test.
  • Target product on the 11th row, 4,940 px down the page.
  • Average scanning time per row is 3.61 seconds.

This change took nearly a second off the average scanning time per row. The scale of this improvement is a surprising result.

I wonder how many other seemingly small changes would have dramatic results?

Embiggening

Having good images is critical to your design — but just how important is the size of them?

To test this, I hacked together a variation of the scaled up 3 column version with comically large images, and threw it into design-concept-Thunderdome:

The difference in image scale for this test.

Test 6: Comedic images

  • 38.9 seconds average completion time.
  • Target on 11th row, down 4,940 px.
  • Average scanning time of 3.54 seconds per row.

This is 0.07 seconds per row faster than the previous winner, which I’d argue is within the error margins for these tests.

From this data, it seems that larger images make a slight difference, but not much of one. I'd argue it's not worth corrupting your visual style with images that run right up to the content container's padding for a 0.07 seconds-per-row gain in scanability.

Interpreting the data

Putting together the mean completion time, where the target item was situated in each variation, and the number of products per row allows us to see a representation of how many products a customer could scan through over a 30 second period.

This is a fuzzy metric, because it’d be rare for a user to do nothing but scan products when they were shopping. However, it gives us more context to talk about the differences in designs and how they might affect the key user journeys for this product.

Extrapolating this further out to a longer session, a customer would be able to scan through ~50–70 more items every five minutes using one of the optimised 3 or 4 column layouts compared to the baseline layout.
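The arithmetic behind that extrapolation is simple: columns per row divided by seconds per row gives products scanned per second. Here's a sketch using the rounded per-row times reported earlier, so treat the outputs as rough estimates only.

```python
# Estimate products scanned in a five-minute session for each layout:
# (columns per row) / (seconds per row) = products scanned per second.
# Per-row times are the rounded figures reported earlier in the article.
layouts = {
    "3 columns (original)":    {"cols": 3, "s_per_row": 4.45},
    "4 columns":               {"cols": 4, "s_per_row": 4.61},
    "4 columns (small cards)": {"cols": 4, "s_per_row": 4.47},
    "3 columns (scaled up)":   {"cols": 3, "s_per_row": 3.61},
}

SESSION_S = 300  # five minutes
scanned = {name: l["cols"] / l["s_per_row"] * SESSION_S
           for name, l in layouts.items()}

baseline = scanned["3 columns (original)"]
for name, count in scanned.items():
    print(f"{name}: ~{count:.0f} items ({count - baseline:+.0f} vs baseline)")
```

The optimised layouts come out somewhere around 50–70 items ahead of the baseline over five minutes, which is where the range above comes from.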

I think this is a significant enough result to justify a closer look at more variations and optimisations we can find in the layout of product grids via user testing.

Further analysis

My usual process for writing finishes with a 'sanity check' of the final draft by a small social email list of STEM professionals from different backgrounds.

They introduced me to a different way to analyse this data: histograms. I'm not a maths guy, and it's clear to me now that I need to take some biostats classes.

If you're not familiar with histograms, they're a great way of showing the distribution of results in an experiment. Here's what our test data looks like in this format:

The standout insight from this histogram for me is that almost all of the layout variations performed well compared to the original, with the exception of the 5 column layout.

The big black bulge in the 60–90 second bucket shows the bad performance of the original. The 3 column and 4 column alternative layouts all have tighter groupings of results.

The 5 column design (in light blue) has a wider spread of responses, including 3 responses between 60–90 seconds, and the longest response of all tests (not shown) at 147 seconds.
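If you export the raw completion times to CSV, you can rebuild a histogram like this yourself in a few lines. A minimal sketch; the times below are hypothetical placeholders, not this study's dataset.

```python
from collections import Counter

# Hypothetical completion times in seconds; substitute the raw times
# from your own test's CSV export here.
times = [32, 38, 41, 45, 47, 52, 58, 63, 71, 88, 147]

BUCKET = 30  # bucket width in seconds, matching the 60-90 s bands above
buckets = Counter((t // BUCKET) * BUCKET for t in times)

# Print a quick text histogram: one '#' per response in each bucket.
for start in sorted(buckets):
    print(f"{start:>3}-{start + BUCKET}s | {'#' * buckets[start]}")
```

Even this crude text version makes the shape of the spread, and any stragglers in the long tail, visible at a glance.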

I don't consider these results 'outliers' in the data, because ultra-long completion times for isolated participants are entirely relevant. They represent the fact that someone sat in front of the screen and struggled to complete the task. It's another red flag.

The beauty of the histogram is that it shows you this data without having to take in the numbers themselves. It's obvious there are designs that seem to work better than the original, but I'm also fascinated by what would happen to this diagram if we increased the participant numbers.

Would gaps between the variations develop? Or would the curves become more similar? I feel another article coming on.

Is there a clear winner?

If you only compare mean completion times, then the last version, with its comedy-sized images, is the winner.

Taking the histograms into account, I’d say there’s no clear winner — but there’s definitely a couple of losers that I would count out of the running in further tests: the 5 column variation and the original.

If you put a gun to my head, I'd take the version from test 5: the 3 column layout with many page elements scaled up.

It definitely outperforms the original 3 column version, and the wider spread of data from the 5 column tests is not a good sign when you’re optimising for scanability.

The difference in performance between this version and the comedy-sized images version is small, and it’s not nearly as jarring to look at, especially when you load in some less-than-perfect product images.

What have we learned?

  1. We need more tests
    This initial testing shows that it's worth continuing our investigation of the 'bigger everything == shorter scanning time' hypothesis. So far, the results support it, but the path forward isn't clear cut.
  2. Bigger seems better… for now
    I hesitate to say MAKE ALL EVERYTHINGS AS BIG AS POSSIBLE after this initial round of research, but there is evidence here that shows scaled up elements are more performant than smaller elements when trying to increase the scanability of this product grid.
  3. Your mileage will vary
    The only way to know if your own product grid is at a good scale is by testing your solution. There are too many unknowns to develop specific catch-all guidelines like ‘your images must cover x% area compared to your CTAs’ at this stage.
  4. Start small
    Testing 3–4 variations of a product grid layout will give a preliminary idea about how effective it is. Some data is better than no data in this case, because it gives you a baseline to start further experimentation. Plus, it’s cheap and quick. And fun! Right?!
  5. Get your books out
    I need to pull out my copy of Quantifying the User Experience: Practical Statistics for User Research again and freshen up on this topic. The more I think about it, the more research I want to do — but I want to make sure that I’m not misrepresenting the data.

To that end, here are the complete test results in a Google Sheet. If you’ve got an eye for this sort of thing, I’d love to hear about different ways to look at this data.

We need more data!

If you want to try the experiment out for yourself on your own product grid, hack a couple of concepts together in your layout tool of choice, and get testing.

I don’t often see this kind of research done in the open with real data. If you’ve got different testing tactics, then please let the world know in the responses.

I’d love to hear about similar tests you can talk about in a public forum. Even if you can’t share the exact results or dataset, I’d love to hear about the methodology and whether you thought it was effective (or not).