Tag Archives: Usability

Why Publish().RefCount() doesn’t work well with Retry()

When you’re designing a library with a large number of methods, one of the hard problems is making sure that all of the methods work together in a consistent way. If done correctly, then you can make your library a joy to use. However, if done badly, then your users will spend a frustrating number of hours debugging what’s going on…

And that experience of spending a frustrating number of hours debugging what’s going on is exactly what happened to me recently when I was using Reactive Extensions.

It all started with a performance problem.

We quickly identified what was causing the performance problem, and so we added a performance optimisation — Publish().RefCount() — in order to fix it. And that’s when our code broke.

We were lucky — our tests caught the fact that the performance optimisation had broken our code — and so we set about trying to get the best of both worlds: working code that was also fast.

This blog post goes into the detail of precisely what happened and how we fixed it.

A failing test

Let’s look at a test that demonstrates this. The test below creates an observable, uses Retry() to add error handling, and then it asserts that the observable is equal to 0 followed by 1.

[Test]
public void Test([Values(false, true)] bool withPublishAndRefCount)
{
    IObservable<int> original = CreateObservable(withPublishAndRefCount);

    var withRetry = original.Retry(2);

    Assert.That(withRetry.ToList().Wait(), Is.EqualTo(new[] { 0, 1 }));
}

The test is executed twice. The difference between the two executions is the value of the parameter withPublishAndRefCount.

Creating the observable

The method below shows how that parameter is used. When it’s set to false, Publish().RefCount() isn’t used, and the test passes. However, when it’s set to true, CreateObservable wraps the observable using Publish().RefCount() and just like it did for us, this causes the test to fail.

private IObservable<int> CreateObservable(bool withPublishAndRefCount)
{
    var observable = Observable.Defer(ObservableFactory);
    if (withPublishAndRefCount)
    {
        return observable.Publish().RefCount();
    }
    else
    {
        return observable;
    }
}

Note that the return type of CreateObservable is IObservable<int>. At this point, it’s probably worth mentioning the Liskov Substitution Principle which essentially says that swapping one implementation of IObservable<int> for another shouldn’t cause the test to fail, but that’s precisely what’s happening! When we use Publish().RefCount() we get a different implementation of IObservable<int> and the test fails.

The observable factory

The method below shows the observable factory. The first subscription results in OnNext(0) followed by OnError(e). When Retry() sees this error, it will hide the error from everything downstream and re-subscribe. The second subscription results in OnNext(1) followed by OnCompleted()

private int nextSubscription = 0;
private IObservable<int> ObservableFactory()
{
    var thisSubscription = nextSubscription;
    nextSubscription++;
    switch (thisSubscription)
    {
        case 0:
            var e = new ApplicationException();
            return Observable.Return(thisSubscription)
                             .Concat(Observable.Throw<int>(e));
        default:
            return Observable.Return(thisSubscription);
    }
}

The root cause

So why does wrapping the observable using Publish().RefCount() cause the test to fail?

The test fails because Retry() isn’t able to recover from the error. When it sees the error and re-subscribes, it sees precisely the same error again, and again, and again… If we hadn’t specified a maximum number of times that it can retry, Retry() would spin forever (well, until the test times out)! Essentially the error has been cached by Publish().RefCount() so that subscriptions aren’t independent.

So what’s caching this error?

Well Publish().RefCount() is short-hand for Multicast(new Subject<T>()).RefCount()

And it is Subject<T> that caches the error.

The fix

Now that we know precisely what’s happening, we can work around it.

What we need to do is write a replacement for Subject<T> that keeps subscriptions independent, and then we simply replace all instances of Publish().RefCount() with Multicast(new IndependentSubscriptionsSubject<T>()).RefCount()

And here’s the code for IndependentSubscriptionsSubject<T>

public class IndependentSubscriptionsSubject<T> : ISubject<T>
{
    private ISubject<T> _innerSubject = new Subject<T>();

    public IDisposable Subscribe(IObserver<T> observer)
    {
        return _innerSubject.Subscribe(observer);
    }

    public void OnNext(T value)
    {
        _innerSubject.OnNext(value);
    }

    public void OnCompleted()
    {
        _innerSubject.OnCompleted();
    }

    public void OnError(Exception error)
    {
        var erroringInnerSubject = _innerSubject;
        _innerSubject = new Subject<T>();
        erroringInnerSubject.OnError(error);
    }
}

Conclusion

We’ve seen how Publish().RefCount() caches errors, which breaks Retry(), and we’ve also seen how to work around it by writing a replacement for Subject<T> that doesn’t cache errors.

Dealing effectively with abstraction

I often hear that the hardest problems in writing software are cache invalidation, naming things, and off-by-one errors.

However, I think that there’s something else, something that’s not only harder than all of those, but crucially pervades all aspects of writing software that’s often forgotten: the problem of dealing effectively with abstraction.

Abstractions are ubiquitous in how we deal with the world. They let us take complicated ideas and summarise them by choosing which details to reveal and, crucially, which to hide. This blog post looks at several areas of writing software, pointing out some abstractions that occur and how tricky they are to get right.

Marketing

In a Google AdWord you’ve got about 95 characters of text to play with. That’s not very much. In fact, I’ve gone over that limit already in this paragraph alone. You need a really good abstraction — text that’s both really short but also manages to get people interested enough to click it.

User experience design

When a new user starts using your software for the first time, chances are that they won’t know how to use it. But to make software ingeniously simple, the fact that they’ve never seen it before must not matter. And they’re not going to read the manual — they need to instantly ‘get it’.

Essentially there needs to be a good abstraction. You need to give the user a mental model, and the challenge is to give them one that’s good enough to use the product, and to give it to them quickly, before they get bored and give up. But, more than that, the challenge is to constantly adjust the abstraction you give over time.

Beginners will want to feel like they’re getting value out of the tool quickly and easily without being overwhelmed, whereas experts will want something more sophisticated. They already know what your core concepts are and how to use the product to help them do their day job, but that keyboard shortcut that’ll save them 10 seconds each time they do that operation? That’s what they love.

Writing code

I recently encountered commits that failed code review because of poor abstraction, so I’ll use that as an example. The commits tried to add a new feature that involved getting a string from another library, parsing it, and then doing something with the parsed result (see what I did there? — I gave you an abstraction so that I didn’t bore you with the details of the new feature :) This is a perfect example of when abstraction goes wrong.

First up, the problem with the commits is that they didn’t handle several things that the library could throw at us. When writing a parser, you need to know precisely all the forms that it can take. If you don’t go and peel away every abstraction and do that literature review, but instead rely on only a few examples, you literally don’t know what could happen, because you don’t know what you don’t know. So you shouldn’t be surprised that there’s several bugs in your code.

Second up, the library suffers from primitive obsession and is broken. An object would be a vastly better abstraction than a string because an object has properties and methods, so that we, and lots of other consumers of that library, wouldn’t need to write a parser to pull apart the string. Overall, there’d be less repetition of code if the library just did what lots of its consumers wanted.

So, choosing the abstraction is hard. If you get it right, you’ll make it faster overall for consumers of your library to write code, and you’ll reduce their bug rate. If you get it wrong, people will find your library annoying to work with — they’ll spend their time writing boilerplate code, and fixing bugs in code they should never have had to write in the first place.

Naming

But abstraction is about more than just the behaviour of an API. It’s also about its names: an article on the importance of naming things concludes that the point of a name is “telling the user exactly what they need to know and no more”. In other words, you first need to decide what abstraction you want — what does the user need to know — and only then can you come up with a name that does precisely that.

Naming things goes beyond just an API though; for example, consider the terms “false positive” and “false negative”. They are catch-all terms that cover all the distinct ways in which a system might fail. They are a way of talking about failures without going into the specific details. They are an abstraction, and I don’t think they’re a very good one.

Firstly, I can never actually remember which way round positive and negative are. For example, if a user buys something on the internet and the purchase gets wrongly rejected, is that a false positive or a false negative? Choosing different terms could make things easier to understand.

Secondly, we’ve just annoyed one of our users. Whenever we’re discussing possible features or bug fixes to the internet purchasing system, I want to keep the user at the forefront of our mind, and one way to help do that is to choose an evocative word. That’s why I prefer the terms “insult rate” and “fraud rate”, instead of “false positive” and “false negative”.

Testing

When writing any test you need to trade-off between the insult rate and the fraud rate. If you’re not careful, unit tests become too tightly coupled to the product code under test. Any small change to the application causes tonnes of tests to fail erroneously (a high insult rate). The tests are “brittle” and a failure to handle abstraction is often at play.

When writing a unit test you need to decide what the asserts should actually test and, on the other hand, what is implementation-defined behavior. In other words, you need to choose what are the implementation details of the method that should be hidden from the test, and you shouldn’t write asserts for that behaviour.

This makes the test less coupled to the product, less brittle, and less likely to break when the product code gets changed slightly.

Abstraction

We’ve seen how abstraction (i.e. choosing which details to reveal and, crucially, which to hide), cuts across many different areas when writing software, from marketing to user experience design to writing code to designing the behaviour of an API to naming stuff to testing — it gets everywhere.

And not only is it really hard, but often it’s subtle to spot.

And that’s why I think it’s more important than cache invalidation, naming things, and off-by-one errors.

And mastering it is crucial in upping your game as a member of a software team.

Next time, we’ll look at HTTPS and how the abstraction that it provides is not only the wrong one, but also dangerous. By default, HTTPS is not as secure as it can be, but by peeling back the abstraction, and looking within, we get access to a whole host of options that we can tweak to make our website safer.