Optimizing a NServiceBus "batch job" part 2

by christerdk 25. February 2013 12:46

In the previous post I explained how we used NServiceBus to break down a large RSS import task into several isolated tasks. Instead of importing a whole RSS feed in one transactional go, a number of Add/Update/Delete tasks were created to be processed separately. That gave us quite a performance improvement.

However, when we looked at each one of these tasks in detail, there were still some issues with the implementation. Overall, the original code was quite well structured, but there was no clear distinction between the sub-tasks that had to be performed. This meant, among other things, that images were downloaded in the same transactional scope as the database update.

The UpdateRSSItem task, for example, could be broken down into the following sub-tasks:

DownloadImages (started by UpdateRSSItemMessage, this will download any new images)
UpdateDataInDatabase (we then update the DB)
RemoveOldImages (we then remove old images)

If all of these sub-tasks were processed together, it would mean downloading from the web, updating the DB and doing some file I/O in one transaction. That may be simpler to understand in synchronous code, but it would also mean that if an error occurred while removing the old images, parsing the data and/or adding data to the DB, the images would have to be downloaded from the web again on retry.
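To make the retry argument concrete, here is a minimal C# sketch under stated assumptions: the three sub-tasks are modeled as hypothetical message types (DownloadImages, UpdateDataInDatabase, RemoveOldImages), and a toy in-memory queue stands in for the endpoint's queue. It is not the NServiceBus API. Each message is handled on its own, and only a failing message is re-queued.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical message types: one message per sub-task of UpdateRSSItem.
public record DownloadImages(int ItemId);
public record UpdateDataInDatabase(int ItemId);
public record RemoveOldImages(int ItemId);

// Toy in-memory stand-in for an endpoint's queue (not the NServiceBus API):
// every message is handled separately, and a failing message is retried alone.
public class SubTaskPipeline
{
    private readonly Queue<object> _queue = new();
    private readonly Dictionary<object, int> _failures = new();
    private readonly int _simulatedRemoveFailures;

    public int Downloads { get; private set; }
    public int RemoveAttempts { get; private set; }

    public SubTaskPipeline(int simulatedRemoveFailures) =>
        _simulatedRemoveFailures = simulatedRemoveFailures;

    public void Send(object message) => _queue.Enqueue(message);

    public void Run(int maxAttempts = 5)
    {
        while (_queue.Count > 0)
        {
            var message = _queue.Dequeue();
            try
            {
                Handle(message);
            }
            catch (InvalidOperationException)
            {
                _failures.TryGetValue(message, out var failed);
                _failures[message] = failed + 1;
                // Retry only the sub-task that failed; earlier sub-tasks stay done.
                if (failed + 1 < maxAttempts) _queue.Enqueue(message);
            }
        }
    }

    private void Handle(object message)
    {
        switch (message)
        {
            case DownloadImages m:
                Downloads++; // the expensive web download would happen here
                Send(new UpdateDataInDatabase(m.ItemId));
                break;
            case UpdateDataInDatabase m:
                Send(new RemoveOldImages(m.ItemId));
                break;
            case RemoveOldImages _:
                RemoveAttempts++;
                if (RemoveAttempts <= _simulatedRemoveFailures)
                    throw new InvalidOperationException("Simulated file I/O error");
                break;
        }
    }
}
```

Started with `new SubTaskPipeline(simulatedRemoveFailures: 2)` and a single `DownloadImages` message, the images are downloaded once even though RemoveOldImages needs three attempts. Had all three steps shared one handler and one transaction, each of those retries would have repeated the download.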

So it makes sense to break down even a “small” task like updating content based on RSS into smaller tasks.

Not only did we actually experience this kind of repeated download on retry in my previous project; on www.miljoparkering.se, too, we've seen the advantages of dividing a task into very small sub-tasks in our daily notification batch. Although the sub-tasks introduced some indirection into what is conceptually one coherent task, they also made it much easier to see what happened during the notification batch in our error and audit queues.


Optimizing a NServiceBus "batch job"

by christerdk 1. October 2012 13:36

On the last project I worked on as a consultant, we used NServiceBus throughout the system.

After some time in production we started to experience issues with the system. First the system became slow, then we saw transactional deadlocks in the database, and data was lost when the database server automatically resolved these deadlocks.

I had the pleasure of being the one tasked with tracking down the culprit and resolving the issue. (No irony, I love tracking down issues… :))

Problem

Some of our NServiceBus endpoints were concerned with high-level business tasks, others with more infrastructural tasks. It turned out that the issue could be traced to an NServiceBus endpoint whose task was to continuously import content from RSS feeds owned by our customers. That is, for each feed, create, update (yes, some feeds updated existing feed items) or delete items locally according to the feed contents. Importing also meant downloading images referenced in the feed items. Each RSS feed import was represented by a saga instance, which used timeouts to initiate imports.

The logic was placed in an NServiceBus endpoint to be able to "interact" with the import. We wanted to be able to control the import remotely and make it possible for users to start the import when needed.

The implementation was unfortunately much like what you would find in a typical batch job. It consisted of a message handler which contained calls to private methods, more or less like this:

public void Handle(StartRSSImport message) {
    var feedContent = _feedLoader.GetFeedItems(message.Url);

    UpdateExistingItems(feedContent); // Update already known items
    AddNewItems(feedContent);         // Add new items from the RSS feed
    RemoveItems(feedContent);         // Remove items no longer present in the RSS feed
}

It worked without problems in the beginning. But then the RSS import feature got popular (oh no! :), more feeds were added and they contained a lot more items. Basically, the big RSS imports, which of course were inherently dependent on various external systems, ended up keeping transactions open and therefore blocked other clients from accessing/updating the DB tables they used. Issue identified!

A solution

The good thing was that the import code itself was pretty well organized, and methods handling one item at a time were already present (UpdateItem(), RemoveItem(), AddItem()).

However, in the “plural” methods, the code for discovery of RSS feed changes was entangled with the code actually performing the changes. This was my goal and what I ended up doing under the circumstances:

  1. My idea was: importing items from an RSS feed to the local system did not need to be viewed as one big task. It could instead be viewed as a list of separate tasks with no transactional relation, identified by one initial task. So I wanted to...

  2. Have the StartRSSImport message handler method load the feed and only discover which actions needed to be taken locally.

    This was considered a lightweight operation on the production DB, as we already saved hashes locally representing the RSS items' previously known state. That meant we didn't need to traverse our object graph to discover potential changes for each item in the feed.

    For each identified task, this message handler would then send a matching message to its own queue: UpdateRSSItemMessage, DeleteRSSItemMessage or AddRSSItemMessage.

    These messages would contain all the information needed for their respective message handlers to perform their task (local item id, hashes, text content, image URLs).

  3. Create new message handlers for UpdateRSSItemMessage, DeleteRSSItemMessage and AddRSSItemMessage.

    Each of these would perform its particular task at item level.
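The hash-based discovery in step 2 can be sketched roughly as follows. This is a minimal C# illustration, not our production code: FeedItem, ItemAction and FeedDiff are hypothetical names, and SHA-256 over title and body is an assumed choice of hash. Only the locally stored hashes are consulted, so no object graph is loaded; each resulting task would then be sent as its own Add/Update/Delete message.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical shape of a parsed RSS item; Id corresponds to the item's <guid>.
public record FeedItem(string Id, string Title, string Body);

public enum ItemAction { Add, Update, Delete }

public static class FeedDiff
{
    // Assumed hash: any change in title or body yields a different value.
    public static string Hash(FeedItem item)
    {
        using var sha = SHA256.Create();
        var bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(item.Title + "\n" + item.Body));
        return Convert.ToBase64String(bytes);
    }

    // Compares the feed with locally stored hashes (item id -> hash) and returns
    // one task per item that needs work locally.
    public static List<(ItemAction Action, string Id)> Discover(
        IEnumerable<FeedItem> feed, IReadOnlyDictionary<string, string> knownHashes)
    {
        var tasks = new List<(ItemAction Action, string Id)>();
        var seen = new HashSet<string>();

        foreach (var item in feed)
        {
            seen.Add(item.Id);
            if (!knownHashes.TryGetValue(item.Id, out var oldHash))
                tasks.Add((ItemAction.Add, item.Id));      // not known locally yet
            else if (oldHash != Hash(item))
                tasks.Add((ItemAction.Update, item.Id));   // known, but content changed
        }

        // Known locally but no longer present in the feed.
        foreach (var id in knownHashes.Keys.Where(k => !seen.Contains(k)))
            tasks.Add((ItemAction.Delete, id));

        return tasks;
    }
}
```

Unchanged items produce no task at all, which is what keeps the discovery step light compared to loading and comparing every item.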

 

After refactoring our code looked more or less like this (compacted for readability – I’m not a fan of train wrecks):

public void Handle(StartRSSImport message) {
    var feedContent = _feedLoader.GetFeedItems(message.Url);

    foreach (var item in FindUpdatedItems(feedContent)) { Bus.SendLocal(new UpdateRSSItemMessage( … )); }
    foreach (var item in FindAddedItems(feedContent))   { Bus.SendLocal(new AddRSSItemMessage( … )); }
    foreach (var item in FindDeletedItems(feedContent)) { Bus.SendLocal(new DeleteRSSItemMessage( … )); }
}

public void Handle(UpdateRSSItemMessage message) {
    //Updates existing item
}
public void Handle(AddRSSItemMessage message) {
    //Adds new item
}
public void Handle(DeleteRSSItemMessage message) {
    //Deletes item
}

This solution is basically comparable to reading a menu in a restaurant. You order the items you want and then they arrive later. You don't, as we did earlier, call the chef and make him create your food on the spot. It's just bad style… ;o)

Separating the discovery of changes from actually performing the changes locally meant we moved away from an N+1 task in one message handler and, more importantly, in one transaction. It was a huge step forward, and the errors disappeared.

Final thoughts

The ability to interact with an RSS import instance was a popular feature in our system, and one we achieved pretty easily with NServiceBus. This issue, and the way we resolved it, also showed us how NServiceBus can help make our implementation simpler. But it also taught us that we had to get our heads around slicing the control flow into separate flows before we could benefit from this.

So we were OK, right? Well, it's true that the transaction scopes were now much smaller, fitting item-level handling instead of feed-level handling. But there were still some issues with this implementation. I'll follow up on this in a later blog post.

Tags:

nservicebus

NServiceBus on Amazon EC2 voodoo

by christerdk 20. September 2012 14:04

The www.miljoparkering.se web site is hosted on an Amazon EC2 instance.

We’re using NServiceBus as a component in www.miljoparkering.se.

Our solution had been running really well for quite a while. Then, at some point, we started getting issues with our server instance, and Amazon support personnel advised us to force a restart of the server. The problem was a hardware failure on the physical server on which our EC2 instance was placed. This incident happened before summer.

Recently we noticed that we “magically” had gotten a couple of new outgoing queues on our server.


It seemed as if NServiceBus was trying to deliver messages to an endpoint on server ip-0af12737, which was the server name before the shutdown and startup mentioned above. The current server name, however, is ip-0aeb39db.

It turned out that this automatic server name change at startup (after a shutdown) is actually by design on Amazon's part.

So these outgoing queues actually come from orphaned/stale NServiceBus endpoint subscriptions, as the subscription documents in Raven reveal.


We assume these queues are created based on an interpretation made by NServiceBus on these data, concluding that the old server name is some (external) endpoint out there, which should receive these messages.
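The reasoning above can be illustrated with a small C# sketch. It assumes the stored subscriptions can be flattened into a message type plus a subscriber address, and that addresses use the "queue@machine" form; the Subscription record and StaleSubscriptions class are hypothetical and do not read anything from Raven. The sketch only flags subscribers whose machine part differs from the current machine name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flattened subscription entry: message type + subscriber address.
public record Subscription(string MessageType, string SubscriberAddress);

public static class StaleSubscriptions
{
    // Machine part of an assumed "queue@machine" address; "" when no machine is given.
    public static string MachineOf(string address)
    {
        var at = address.LastIndexOf('@');
        return at < 0 ? "" : address.Substring(at + 1);
    }

    // Subscribers pointing at a machine other than this one, for example the
    // pre-restart EC2 name that now only produces orphaned outgoing queues.
    public static List<Subscription> Find(IEnumerable<Subscription> subscriptions, string currentMachine) =>
        subscriptions
            .Where(s =>
            {
                var machine = MachineOf(s.SubscriberAddress);
                return machine.Length > 0 &&
                       !string.Equals(machine, currentMachine, StringComparison.OrdinalIgnoreCase);
            })
            .ToList();
}
```

On the server itself, `StaleSubscriptions.Find(subscriptions, Environment.MachineName)` would list candidates such as an address ending in `@ip-0af12737`. It is a diagnostic aid only; the cleanup we actually did is the manual procedure described further down.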

So what we have here is not really a bug. It's a combination of two products, where the EC2 server name change feature, together with NServiceBus, is, well, not breaking the solution, but at least adding complexity and a grain of voodoo. ;-)

It is possible to turn off this server name change feature. There’s a tool for that installed on your server instance called EC2ConfigService Settings.


In this tool you have to uncheck the “Set Computer Name” check-box on the General tab and press OK.


Preferably, you would do this before you start using NServiceBus on the server. But I'm sure many people won't know about this problem until they stumble upon it.

If so, turning off the server name change is not enough. You also have to clean up the subscription documents in Raven: shut your endpoints down, remove all the Subscription documents in Raven, and then start the endpoints again. (This worked on our server; I can't guarantee it works on yours - you're on your own ;))

By the way, this issue does not happen when you do an ordinary reboot of the server. It only happens when you shut down and restart the server instance from the AWS console.

My first piece of public music

by christerdk 16. July 2012 10:42

... is on stenstroms.com - internationally acclaimed for their fine shirts and garments! :) 

Check out the video  

Long story short, the track first chosen for the behind-the-scenes video didn't match expectations. Caroline, my friend and the producer of the video, knew that I had taken up fiddling with music again and asked if I was interested in giving it a shot. Three prototypes and one very sleep-deprived weekend later, the track was approved by the customer! :)

Lessons learned:

  1. We prototyped tracks to create a tighter feedback loop, which was very productive. On the 2nd day of work, a prototype was chosen for completion. 
  2. Seeing the video online, it's clear to me that even though you're doing your best to mix for the best result, web compression of the video (and therefore the sound) can severely impact the result. Unfortunately, time constraints meant we couldn't test that part. Next time, I'm going to do two mixes, one for hi-fi and one for lo-fi...

 


www.miljoparkering.se - leveraging our events and architecture to ensure customer happiness

by christerdk 11. July 2012 15:06

Last week, a particular situation came to our attention that puzzled us. A user removed all street subscriptions. First reaction: Why why why would someone do that? Next question: No, really, why?

Of course, there can be many reasons for that. Maybe the user moved to another town outside our service area. Maybe the user moved to a street which is not covered by our service. Or maybe the user didn't find the service satisfactory. 

But then we started speculating on how we could use our events and architecture to ensure customer happiness and service delivery, and how we would be able to know and respond if, from a business perspective, a negative user scenario would occur.

There are some quite interesting negative user scenarios for www.miljoparkering.se ...

1. User begins signing up via email, but never completes the account creation process (that is, clicking the confirmation link)

2. User completes account creation process, but never subscribes to any streets

3. User has an account and street subscriptions, but for some reason decides to remove all street subscriptions 

It is not only interesting from a theoretical perspective to know why these scenarios occur; it's absolutely vital for any business to know when they occur and how often, and to bring them into the light so that the business can respond.

This is how we handle these scenarios at www.miljoparkering.se:

1. User signs up via email, but never completes the account creation process (that is, clicking the confirmation link)
When a user signs up, we send them the confirmation link. The link takes the user to a page where they complete the registration. It is crucial for us that users don't abandon this process for whatever reason (technical issues etc.). Behind the scenes, an NServiceBus saga is therefore started by AuthorizationCreatedForNewEmailAccountEvent. The message handler requests a timeout of 4 days. When the timeout fires, the timeout message handler checks whether the user finalized account creation. If not, we reach out via mail, asking if we can be of service and whether the user needs further guidance.

2. User completes account creation, but never subscribes to any streets
When the user completes the sign-up process, the main page of the site is shown to the user. On this page the user can choose among the many streets we cover. To ensure that none of our users gets stuck for some reason (confusing UI, browser issues etc.), we start an NServiceBus saga when UserCreatedEvent is published. This saga requests a timeout of 4 days. If the user adds a street subscription, the saga is marked as complete. If not, the timeout fires. Whether or not reaching out via mail is the way to go is still to be decided, but we will, as a minimum, get a notification when the scenario happens.

3. User has an account and street subscriptions, but for some reason decides to remove all street subscriptions  
There can actually be many reasons for this scenario to occur naturally. If the user moves to a different part of the city, the user might choose to remove all streets before adding the new ones, temporarily ending up with 0 subscriptions. The user can also choose to remove them for a longer period because, say, they are going on vacation. However, if the user does this because of unhappiness with our service, then it is very much in our interest to try to turn around whatever negative experience the user might have had. To handle this, when a user reaches 0 subscriptions, we start a saga. Again, we make use of a timeout, which, when fired, checks the number of street subscriptions the user has. If it is still zero, we reach out by email and hope, at a minimum, to understand why the user opted out and what we can do better.
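The timeout pattern behind these sagas can be simulated in a few lines of C#. This is a sketch under assumptions: FakeTimeoutScheduler and NoStreetsYetSaga are hypothetical names, a fake clock replaces real time, and nothing is persisted, whereas a real NServiceBus saga stores its data and relies on the framework's own timeout mechanism. It models scenario 2: request a 4-day timeout, complete early if a street subscription arrives, otherwise record a notification when the timeout fires.

```csharp
using System;
using System.Collections.Generic;

// Toy timeout scheduler driven by a fake clock (not the NServiceBus API).
public class FakeTimeoutScheduler
{
    private readonly List<(DateTime Due, Action Callback)> _pending = new();
    public DateTime Now { get; private set; } = new DateTime(2012, 7, 1, 0, 0, 0, DateTimeKind.Utc);

    public void RequestTimeout(TimeSpan delay, Action callback) =>
        _pending.Add((Now + delay, callback));

    // Moves the clock forward and fires every timeout that has become due.
    public void AdvanceBy(TimeSpan span)
    {
        Now += span;
        var due = _pending.FindAll(t => t.Due <= Now);
        _pending.RemoveAll(t => t.Due <= Now);
        foreach (var t in due) t.Callback();
    }
}

// Scenario 2: started when UserCreatedEvent is published.
public class NoStreetsYetSaga
{
    private readonly List<string> _notifications;
    public string Email { get; }
    public bool Completed { get; private set; }

    public NoStreetsYetSaga(string email, FakeTimeoutScheduler scheduler, List<string> notifications)
    {
        Email = email;
        _notifications = notifications;
        scheduler.RequestTimeout(TimeSpan.FromDays(4), OnTimeout);
    }

    // The user added a street in time: nothing left to watch for.
    public void OnStreetSubscriptionAdded() => Completed = true;

    // The timeout fires regardless; it only acts if the saga is still open.
    private void OnTimeout()
    {
        if (Completed) return;
        _notifications.Add($"No streets after 4 days: {Email}");
        Completed = true;
    }
}
```

Scenarios 1 and 3 have the same shape; only the check made when the timeout fires differs (whether the account was confirmed, or whether the subscription count is still zero).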

Some final thoughts:
This involves a lot of mail sending, but since it is quite unlikely for a user to experience all scenarios, I think we're still keeping the nagging level low. ;-) 
Instead of using sagas, it would have been possible to do the same through daily batch jobs. However, tapping into existing events and setting up wait periods (timeouts) is so darn lean with NServiceBus.
We haven't done any changes to our core domain and service, but still we've gained a lot already from the EDA refactoring. I wonder what's next...

www.miljoparkering.se - the first post EDA release change

by christerdk 28. June 2012 21:27

When someone signs up on www.miljoparkering.se, Björn and I receive an e-mail notification ... to stimulate our curiosity :)

Users have suggested that we give them more "feedback". So for some time we had wanted to send a welcome mail when users sign up, but we didn't want to change anything until after introducing the EDA-inspired architecture. 

So, now the "requirement" was to send two mails instead of one. In the old architecture, we might have been tempted to simply throw in some more code in more or less the same place as the existing mail-sending code. That code still existed in the new architecture, but with minor refactoring, we hoped to see the first of the positive effects of following the EDA principles.

Here's what we did:
1. Identified the event (UserCreatedEvent) and its contents (name, email, facebook id etc.)
2. Identified where the UserCreatedEvent should be published (two places: after creating an account either via facebook or via mail)
3. Added a new event handler that handled UserCreatedEvent and moved the existing code which sends the mail to Björn and me there.
4. Added a new event handler that handled UserCreatedEvent which sends the new welcome mail.
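The four steps above amount to one event with two independent handlers. Here is a minimal C# sketch of that shape, assuming a toy in-process publisher in place of the service bus; IHandleEvent, EventPublisher and the two handler classes are hypothetical names, and "sending" a mail just records a string.

```csharp
using System;
using System.Collections.Generic;

// Event contents as listed in step 1 (name, email, Facebook id).
public record UserCreatedEvent(string Name, string Email, string FacebookId);

// Hypothetical handler contract; not the NServiceBus interface.
public interface IHandleEvent<T>
{
    void Handle(T e);
}

// Toy publisher standing in for the bus: every subscriber gets every event.
public class EventPublisher<T>
{
    private readonly List<IHandleEvent<T>> _handlers = new();
    public void Subscribe(IHandleEvent<T> handler) => _handlers.Add(handler);
    public void Publish(T e)
    {
        foreach (var handler in _handlers) handler.Handle(e);
    }
}

// Step 3: the existing notification to the founders, moved into its own handler.
public class NotifyFoundersHandler : IHandleEvent<UserCreatedEvent>
{
    public List<string> Sent { get; } = new();
    public void Handle(UserCreatedEvent e) => Sent.Add($"To founders: new user {e.Email}");
}

// Step 4: the new welcome mail, added without touching the sign-up code.
public class WelcomeMailHandler : IHandleEvent<UserCreatedEvent>
{
    public List<string> Sent { get; } = new();
    public void Handle(UserCreatedEvent e) => Sent.Add($"To {e.Email}: Welcome, {e.Name}!");
}
```

Both sign-up paths (Facebook and e-mail, step 2) would call the same `Publish`, so a third reaction to new users later means adding a third handler rather than editing either sign-up path.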

The result: An event / event handler setup using an EDA event, which not only fulfills the new requirement, but which also can be easily reused in the future.

(I know, I know... send two e-mails instead of one, make code reusable - not really a big challenge, is it? :) Well, we're using www.miljoparkering.se, with its fairly simple domain, to test and gain experience with principles, frameworks and alternative ways of reaching goals - the fact that the code itself does quite simple things, and everything is therefore quite overdesigned, is of minor importance)

www.miljoparkering.se is getting refactored into an EDA inspired architecture

by christerdk 27. June 2012 23:00
Apart from solving the issues with "unfair" parking fines, www.miljoparkering.se also serves as a technical sandbox for Björn and me to test new ideas and emerging trends and technologies. 

Up until now, the NCQRS framework has taken up the front row of our application, and we have used commands to initiate actions and model state changes, and events to respond to these changes and update the read model.
 
However, as we progressed in our fast-paced MVP fashion, some of the responsibilities that we had assigned to the event handlers seemed less and less appropriate or obvious. For example, should the event handlers' sole responsibility be to update the read model, or was sending e-mails also a valid responsibility (or any integration, for that matter)? We also needed event handlers to issue new commands, and as we only used the in-memory queue that came with NCQRS, we had to introduce a hack to make it work (the issue was that we could not start up a new context in the same thread, and we didn't want to introduce an ESB at that stage).
 
I had been reading up on Event Driven Architectures (EDA), and after a talk we decided to split things up, place responsibilities in a more structured way and make the whole solution more flexible for change, inspired by the principles of EDA. We decided to push the NCQRS (+Event Sourcing) framework at least one row back and limit its responsibilities to handling changes in the domain model and writing these changes to the read model. Application flow control was instead to live in the exchange of events between event handlers over an ESB; not NCQRS-style events, which communicate fine-grained changes in the domain model, but more coarse-grained "business events".
 
And tonight, after 3 weeks refactoring and a lot of testing, we released the new EDA inspired version during the Portugal / Spain football match. :)
 
 

www.miljoparkering.se is overdesigned – by design

by christerdk 11. December 2011 23:10

Under the hood www.miljoparkering.se is, admittedly, quite overdesigned.

The web site, in its current form, is very simple: People sign up. They then choose the streets they want reminders for. It's also possible for them to change their e-mail address. Every night a recurring job goes through all the user profiles, matches them with the street cleaning tasks for the day, and then sends out reminders to the users that have these streets selected. The site could easily have been built in a short time with a simple relational database, a simple data access layer and some CRUD code.

For the last year or so, however, the Command Query Responsibility Segregation (CQRS) principle and Event Sourcing have been surging up the trend curve. It's hard to get around if you're interested in system design. Many people talk about it, and it's discussed at community meetings. We actually considered CQRS when we had to make the initial technology choices for Mikz.com, the daytime project I'm currently working on. But CQRS+EventSourcing is quite different from the classic de-facto enterprise system design, the n-layered design.

In an n-layered design, you'll typically see at least three layers: UI, domain logic and data access. Depending on your flavor, you might have more layers, and the data access layer might or might not be implemented with an ORM framework such as NHibernate. After changes are made to the domain model, the current state of the model is persisted into the database. When you need to get information from the system, the same layers are used to query for the current state.

In CQRS, you don't use the same model for both updating and querying. There's a model that facilitates safe and structured changes to the domain objects (through the use of "void" commands). And there's a model for querying the system. The first model may be relatively slow to work with, but that's OK because we need to validate the information and ensure that the changes are made correctly, and we may even have to notify other systems that a change has occurred. The read model, however, generally needs to be fast since, most often, there are many reads for each update you make to a system.

Creating a system based on CQRS can be done without Event Sourcing. But Event Sourcing has some advantages that don't come naturally from a system that just contains the current state, such as getting answers to new questions from a system that is already running. This makes the Event Sourcing concept very interesting. Data in a system based on Event Sourcing is also quite different: current state is not persisted in the data store. Instead, all the events that occurred during the system's lifetime are saved, and the current state of any domain object can be derived from them!
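To make "current state is derived from the events" concrete, here is a minimal C# sketch. The event types and the UserState/NewQuestions classes are hypothetical, loosely based on what the site does (street subscriptions, e-mail changes); no event store or NCQRS type is involved. Current state is rebuilt by replaying a user's events in order, and a question nobody planned for is answered from the same history.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical domain events for one user.
public abstract record UserEvent(Guid UserId);
public record StreetSubscribed(Guid UserId, string Street) : UserEvent(UserId);
public record StreetUnsubscribed(Guid UserId, string Street) : UserEvent(UserId);
public record EmailChanged(Guid UserId, string NewEmail) : UserEvent(UserId);

public class UserState
{
    public string Email { get; private set; } = "";
    public HashSet<string> Streets { get; } = new();

    // Current state is never stored; it is rebuilt by applying every event in order.
    public static UserState Replay(IEnumerable<UserEvent> history)
    {
        var state = new UserState();
        foreach (var e in history)
        {
            switch (e)
            {
                case StreetSubscribed s: state.Streets.Add(s.Street); break;
                case StreetUnsubscribed u: state.Streets.Remove(u.Street); break;
                case EmailChanged c: state.Email = c.NewEmail; break;
            }
        }
        return state;
    }
}

// A "new question" for a running system: how often has each street been dropped?
// A store holding only current state could not answer this; the event log can.
public static class NewQuestions
{
    public static Dictionary<string, int> UnsubscribeCounts(IEnumerable<UserEvent> allEvents) =>
        allEvents.OfType<StreetUnsubscribed>()
                 .GroupBy(e => e.Street)
                 .ToDictionary(g => g.Key, g => g.Count());
}
```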

With many years of experience developing n-layered systems, Björn and I agreed that it was about time to get our hands dirty with CQRS and Event Sourcing. We wanted to gain real-life experience, and thereby get some distance from the theoretical discussions, some of which were a bit too glorifying, and instead be able to make choices on behalf of our enterprise customers based on real experience. This is the reason behind the overdesign of www.miljoparkering.se – it's by design.

As of writing this, with version 1.0 of the system, we are aware that this is only the start of our experience with CQRS+EventSourcing. Our professional experience tells us that we need to go through at least (1) an upgrade iteration and, at some point, (2) a previous design choice that has to be refactored/rectified/expanded. I will blog when we reach those experiences.

www.miljoparkering.se in the Amazon Cloud

by christerdk 11. December 2011 00:14

When we set out to create www.miljöparkering.se, we wanted it to be a win situation from the very start. That is, even if the site wouldn’t ever gain significant user adoption, our goal was to at least make it a personal win.

One of the things that both Björn and I had wanted to break into was cloud computing. We wanted to gain experience in hosting a web service in the cloud. We also wanted to control and maintain the entire stack of the web service, from web server to database server, to background services.

The inspiration for using Amazon's cloud services (or, to be exact, Amazon Web Services (AWS)) came from talking to Lars Krantz and Thomas Mårtensson, the guys behind www.malmofestivalen.se. During our communication while I was creating the Android application, I got a lot of input on how they did things on the site and got to know their thoughts on how to maintain and scale it. And as you might have noticed over the last two years, that site has run perfectly with high availability. 

So, in preparation for developing www.miljoparkering.se, I sent Lars some questions, which yielded an answer that I, at first, found rather cryptic. But basically it was just very much in line with one of the toughest parts of getting involved with Amazon's AWS: you have to understand a huge pile of acronyms, services and entities and how they all relate to each other!

In all honesty, I think Amazon AWS has a steep learning curve. Documentation is present, but it is huge and overwhelming. Trying to predict costs is an almost intimidating task (although it's gotten a lot better since the first time I looked through their pricing information some years ago). The detailed sign-up process includes identity validation by phone (very well implemented, though, technically speaking). Once you're validated, you can log on to your personal web console, an AWS universe full of detailed information. Also, using the AWS SDK for .NET has its quirks, such as initially failing when used with log4net (caused by an unexpected side effect). 

However, for those considering trying out Amazon AWS, I do have some encouragement: True, Amazon AWS is very advanced and can be quite overwhelming. But you can ignore most of it and just focus on the part of AWS that you initially need. For us it was EC2 (Elastic Compute Cloud) servers. After getting a server up and running, we started to look at other services, such as Elastic IP (a public IP for the server, hard to get around ;-), firewall setup, and SES (Simple Email Service) for sending out our notifications.

Although the site itself could fit easily into a normal web hotel, the Amazon server gave us the flexibility and control we needed. And over time, the cryptic mail from Lars turned out to be very useful information, once I had gotten acronyms in place (thanks Lars ;o)

Funnel vision

by christerdk 5. November 2011 15:41

When Björn and I put up our landing page for www.miljoparkering.se we had Google Analytics (GA) in place from the very start.

Our landing page contained two points of user interaction, the sign-up and the contact form, so creating goals based on these actions was obvious to us.

Soon, however, we started getting curious about how much of the information on our landing page the users were seeing and at what point they would start losing interest and leave. But as our landing page didn't have sub-pages, we didn't really have any obvious data to tell us what the users were doing other than visiting the first page. From a GA perspective we were a bit in the dark …

We started thinking about what we could do about this. What we'd really like was a solution based on GA goals, which would mean that we could use funnel visualization, a great way to display users' progress and their exit points. However, GA goals and funnels are based on page views and cannot be based on events, so we needed to hack it a bit.

Our solution ended up like this:

  • We imported the jQuery appear plug-in. This plug-in makes it easy to add event listeners to elements on a page, firing the event only once, when the element enters the user's viewport.
  • We chose the images on the page to be the measure points which we would use to track the users progress. We gave the images recognizable identifiers: picfine, piclady, picgreyfine, picconfusion, picui and picmap.
  • We then added appear event handlers on the images. Each event handler would use the GA client side tracker object to send information to GA. Asynchronously, of course, so that the user experience wouldn’t get crippled.
  • The information sent to GA would be “fake” page views including the names of the pictures in the URL: /scroll/[image name]

    With this setup we could then create a GA goal called "See all page", in which the users had to browse through all the fake pages and end up at /scroll/picmap to complete the goal. Of course, the users weren't really browsing around, they were scrolling.

    This is the funnel we now see in GA.


    The funnel shows 73 visits to the page and the first image (picfine). After that, 63 (86%) visits are made to the second image, piclady. Hereafter, 61 (97%) proceed to picgreyfine. And from there on 58 (95%) to picconfusion. And so on…

    After some time we could see three behaviors: people either (1) visited the page and exited right after, (2) exited the page after seeing the first sign-on form or (3) continued to the very end.

    This solution gave us much more detailed insight into the users' interest in our page. Fortunately, as of writing this, a whole 72.6% continue to the very end, which we, with regard to keeping the users' interest, consider a success! :-)
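The funnel figures above can be checked with a little arithmetic. The tracking itself ran client side in JavaScript; this C# sketch (ScrollFunnel is a hypothetical name) only reproduces the /scroll/[image name] path convention and the step-to-step conversion, computed as the visits at one step divided by the visits at the previous step.

```csharp
using System;
using System.Collections.Generic;

public static class ScrollFunnel
{
    // Image ids from the post, in page order; each one became a "fake" page view.
    public static readonly string[] Steps =
        { "picfine", "piclady", "picgreyfine", "picconfusion", "picui", "picmap" };

    public static string VirtualPath(string imageId) => "/scroll/" + imageId;

    // Percentage of visits at step i-1 that also reached step i, rounded.
    public static List<int> StepConversionPercent(IReadOnlyList<int> visits)
    {
        var result = new List<int>();
        for (int i = 1; i < visits.Count; i++)
            result.Add((int)Math.Round(100.0 * visits[i] / visits[i - 1]));
        return result;
    }
}
```

For the first four steps, `StepConversionPercent(new[] { 73, 63, 61, 58 })` gives 86, 97 and 95, matching the percentages GA showed.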


About Christer

Do you need an enterprise .NET software developer for your project? I am available to help you reach your business goals from September 2013!

Christer is an independent software development contractor with more than 13 years of experience. 

Christer is an NServiceBus Community Champ 2013.

When not working on enterprise projects, Christer uses his time making people's lives a little easier, through either software or the written word. 

Christer's software on the Web:
Miljöparkering.se - a site which helps citizens avoid fines when parking. It is also an in-production POC that serves as a testing ground for new technologies / architectural styles.

Christer's software for Windows:
Mobile Broadband Logging Monitor - if you feel your computer gets slow while using mobile broadband.
Mobile Broadband Log Level Utility - to change the excessive logging in 3Connect.

Christer's Android software (find them in Android Market):
Malmökartan for Android - stuff you won't find on Google Maps.
Malmöfestivalen for Android - an Open Source project to support the festival! :)

Danish blog about message-based architectures and enabling technologies such as NServiceBus. 

Christer also pretends to have a life IRL. Here he enjoys the company of his girlfriend Lydia, their dog Xena, and loads of books. 

Feel free to get in contact!