Monday, 19 February 2018

Adding Upsert support for Entity Framework Core

Like many others, I have used Entity Framework and have been a fan of the simplicity it allows in accessing database. It's quite powerful and can be used to execute a large variety of queries. Any queries that can't be expressed using Linq syntax I usually move to a stored procedure or a function and call that from EF. One thing that I've always moved to a stored procedure was the Upsert command.

Actually, I've never called it upsert until recently when I stumbled upon a reference to it. Since the database engine I've worked with most is SQL Server, I've used the MERGE statement to execute an atomic (not really) UPDATE/INSERT, and it looks something like this:

MERGE dbo.[Countries] AS [T]
USING ( VALUES ( 'Australia', 'AU' ) ) AS [S] ( [Name], [ISO] )
    ON [T].[ISO] = [S].[ISO]
WHEN MATCHED THEN
    UPDATE SET
        [Name] = [S].[Name]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ( [Name], [ISO] )
    VALUES ( [Name], [ISO] );

Other databases that I started working with recently have similar syntax available. For example, in PostgreSQL, one could use the INSERT … ON CONFLICT DO UPDATE syntax:

INSERT INTO public."Countries" AS "T" ( "Name", "ISO" )
VALUES ( 'Australia', 'AU' )
ON CONFLICT ( "ISO" )
DO UPDATE SET "Name" = 'Australia'

I thought that it would be interesting to see whether this would be possible to do in Entity Framework directly, rather than having to write it in SQL. Out of the box EF doesn't support it, even though there is interest in adding it, and there's even an issue on EF Core's GitHub project discussing this. But the concept itself is simple, so I thought it would be an interesting project to play around with.

In a typical Upsert command, I believe there are three main components:

  • The Insert statement that defines the initial state of the entity inserted
  • The list of columns that will be used to uniquely identify an entity
  • The update statement, setting the state of an "updated" entity

The beauty of an Upsert command is that the state of the object when it's first created may be different to the state we may try set it to when updating. This can be useful in certain situations, for example when populating an Updated column, while leaving the Created value untouched. Furthermore, the update statement may update some columns based on the existing state of the object. Consider a table that records the number of page visits on a daily basis. Every day we want a new record added to the table, but multiple visits within the same day should just increment the column counting the visits. This can be easily achieved with the following statement (using the Postgre SQL syntax):

INSERT INTO public."DailyVisits" AS "T" ( "UserID", "Date", "Visits" )
VALUES ( @userID, @date, 1 )
ON CONFLICT ( "UserID", "Date" )
DO UPDATE SET "Visits" = "T"."Visits" + 1

As you can see, we're using the UserID and Date fields as a unique key, and setting Visits to 1 on the initial execution, and incrementing that value with each successive execution.

With all that in mind, I set out to create a simple extension method that would help me achieve the same generated SQL but in a "Linq" way, while keeping it simple and easy to understand. The main challenge was coming up with a simple fluent approach to describing the query, and parsing the Expression trees that would be passed in. While my initial approach was to pass in all arguments to a single method, I realised that using a more fluent approach would make the statement considerably easier to read and understand. I also wanted to make sure that the I can use both a single and multiple columns as a unique identity for the query.

One benefit to approaching this in Entity Framework Core is the beautiful metadata APIs available to developers. Given an entity type, you can get a IModel object that gives you access to all the metadata for the entity that Entity Framework uses. You can then access the IProperty entries describing each of the columns. With that I can easily filter out any columns that are database generated, and only use columns that can be updated in the query. I can also get the database column names and the table name which can then be used to generate the final SQL statements.

In the end I came up with what I think is a rather elegant syntax. The queries above can now be described like this:

DataContext.Upsert(new Country
    {
        Name = "Australia",
        ISO = "AU",
        Created = DateTime.UtcNow,
    })
    .On(c => c.ISO)
    .UpdateColumns(c => new Country
    {
        Name = "Australia"
        Updated = DateTime.UtcNow,
    })
    .RunAsync();

DataContext.Upsert(new DailyVisits
    {
        UserID = userID,
        Date = DateTime.UtcNow.Date,
        Visits = 1,
    })
    .On(v => new { v.UserID, v.Date })
    .UpdateColumns(v => new DailyVisits
    {
        Visits = v.Visits + 1,
    })
    .RunAsync();

The code is quite simple, but also limited. I've only implemented several basic types of expressions that can be used in the update expression tree, and the columns can either have new values set or be incremented/decremented for the time being. But it's a pretty good proof of concept and it does work quite well for simple projects.

For this to be implemented as part of the Entity Framework project it would have to be a LOT more serious than this. Besides better expression support and proper error handling, it should properly update the state of entities currently in the data context. For example, if the "Australia" country was already loaded in the current data context, and you were to use Upsert to update it, the cached state of the entity in the context would not be updated with this extension.

With that in mind, if all you're doing is tracking page views in a small project, this could be an acceptable way to do it, and it was a fun project to write!

I've posted the source code for the project and also deployed it to NuGet if anyone is interested in checking it out: