Sunday 12 January 2020

Reconciling schedule data (Solve a logic problem)

I have a problem I need to solve using Node. The question I have is surrounding the best logical way to solve it. Any advise is appreciated.

Summary

You will build a tool that imports train schedules from an external data source and is stored in an internal database. 

Eternal Data Source

There is a service that provides a list of trains:

[{"id":1,"name":"A EXPRESS"},{"id":2,"name":"B EXPRESS"},{"id":3,"name":"C EXPRESS"},{"id":4,"name":"D EXPRESS"},{"id":5,"name":"E EXPRESS"}]

And a list of train stations:

Example for train "A EXPRESS":

[{"arrival":"2019-04-30T11:48:00.000Z","departure":"2019-05-01T05:42:00.000Z","service":"Loop 5","station":{"id":"ST1","name":"Waterloo"}},{"arrival":"2019-05-13T18:00:00.000Z","departure":"2019-05-14T05:00:00.000Z","service":"Loop 5","station":{"id":"ST2","name":"Paddington"}},{"arrival":"2019-05-15T04:00:00.000Z","departure":"2019-05-15T11:00:00.000Z","service":"Loop 5","station":{"id":"ST3","name":"Heathrow"}},{"arrival":"2019-05-16T20:00:00.000Z","departure":"2019-05-17T10:00:00.000Z","service":"Loop 5","station":{"id":"ST4","name":"Wimbledon"}},{"arrival":"2019-05-18T15:00:00.000Z","departure":"2019-05-19T21:00:00.000Z","service":"Loop 5","station":{"id":"ST5","name":"Reading"}},{"arrival":"2019-05-21T04:00:00.000Z","departure":"2019-05-21T21:00:00.000Z","service":"Loop 5","station":{"id":"ST6","name":"Algate"}},{"arrival":"2019-05-31T03:00:00.000Z","departure":"2019-05-31T15:00:00.000Z","service":"Loop 5","station":{"id":"ST1","name":"Waterloo"}}]

Note: this train stops at station "ST1" ("Waterloo") twice.

To do

For each imported station call, we want to maintain the latest information as well as the history of the station call: 

-What station is the train calling? 

-What are the latest arrival and departure dates? 

-When was the station call first imported? 

-When was the station call last updated? 

-How did the station call change as time went? (evolution of arrival and departure dates with time) This kind of information is useful for us to understand how often trains are delayed, when do schedule changes happen and if there are patterns to these changes. 

How it works

-The external data source is a simulation of train schedules forecasts 

-The data covers a time range from January 1st 2019 to May 31st 2019 

-This 5 months time window is compressed and simulated over a 24 hours cycle 

-This 24 hours cycle restarts every day at 00:00 UTC 

-The data source provides endpoints to request train schedules

-A train list endpoint provides a dynamic list of trains that you can import (see Data above)

-A train schedules endpoint provides a dynamic list of stations calls for a specific train (see Data above)

-A train schedule consists of a list of station calls with a varying amount of past station calls and future station calls

-This external data source does not provide a unique identifier for each station call

-This means that merging station calls is not straightforward. This is the crux of my question: reconciling external station calls with the existing ones in the database. 

-Station calls arrival and departure dates routinely change, sometimes by multiple days. Sometimes they swap, get deleted or new ones appear

-Station calls can sometimes be deleted (the train will not stop in that specific station

-New station calls can sometimes be created (the train will make an unscheduled stop)

-The train schedules endpoint changes the data returned every 15 minutes. 

Specific requirements

You need to capture 24 hours of all train schedules starting at 00:00:00 UTC on one day and ending at 23:59:59 UTC on the same day.   

Question

As you can see from the task above, there is a need to reconcile the new imported data with existing data as the new data changes.

There is no ID that can be used to match station visits, but these visits need to be updated when the external data changes.

We do have a train ID and a station ID as well as the date of visits.

What is the logic I can apply to keep the database data accurate and up to date?

Thank you

Answer?

My initial thoughts are to do the following, however I am not sure if this is the best solution. If you have a better solution, or see a problem with mine, please let me know.

The list of station calls retrieved from the external service will always be saved to the internal database under its retrieval timestamp. This will always be displayed as the current information in the UI (latest database entry).

Each 15 min a call is made to the external service and a new latest entry is added to the internal database, and reflected on the UI.

The status of the latest entry needs to reflect the difference between the the latest entry and the previous entry. E.g. delayed by X min, cancelled, etc. This is the tricky part, because the current station visit needs to be matched with a previous station visit.

My thinking is for a specific train is to just find the matching station ID.

  • If there is no previous matching station ID, then the status is "new".

  • if there is previous station ID, and no matching current station ID, the status is "cancelled".

  • if there is one matching station ID, then there times are compared, and the status is updated to "early" or "delayed".

  • If there is more than one matching station ID, the station previous and current station IDs with the closest timestamps are matched, and their status updates accordingly to "early" or "delayed".

Is my logic correct?



from Reconciling schedule data (Solve a logic problem)

No comments:

Post a Comment