Sunday 22 November 2020

ArangoDB best way to get_or_create a document

I'm performing what I imagine is a common pattern when indexing data into a graph database: my data is a list of edges and I want to "stream" the upload of this data. That is, for each edge, I want to create the two nodes on either side and then create the edge between them; I don't want to upload all the nodes first and link them afterwards. A naive implementation would obviously produce a lot of duplicate nodes, so I want to implement some sort of "get_or_create" to avoid duplication.

My current implementation is below, using pyArango:

def get_or_create_graph(self):
    db = self._get_db()
    if db.hasGraph('citator'):
        self.g = db.graphs["citator"]
        self.judgment = db["judgment"]
        self.citation = db["citation"]
    else:
        self.judgment = db.createCollection("judgment")
        self.citation = db.createCollection("citation")
        self.g = db.createGraph("citator")


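def get_or_create_node_object(self, name, vertex_data):
    # First round trip: look for an existing vertex with this name.
    object_list = self.judgment.fetchFirstExample(
            {"name": name}
            )
    if object_list:
        node = object_list[0]
    else:
        # Second round trip: nothing found, so create the vertex.
        # Another writer can insert the same name between the lookup
        # above and this save, because the check happens in the application.
        node = self.g.createVertex('judgment', vertex_data)
        node.save()
    return node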

My problems with this solution are:

  1. Since the application, not the database, is checking existence, there could be an insertion between the existence check and the creation. I have found duplicate nodes in practice, and I suspect this is why.
  2. It isn't very fast, probably because it can hit the DB twice per node (one lookup, then one insert).

I am wondering whether there is a faster and/or more atomic way to do this, ideally a native ArangoDB query. Suggestions? Thank you.
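
For reference, one direction that seems to fit the "native query" idea is an AQL UPSERT combined with a unique index on name. The following is only a minimal sketch, not a tested solution: the function names are hypothetical, it assumes `db` is a pyArango database and `judgment` the vertex collection from the code above, and it assumes the lookup value is what should be stored in the document's `name` field. As far as I understand, UPSERT alone does not guarantee uniqueness under concurrent writers; it is the unique index that makes the database reject a second document with the same name, so a caller may still need to catch that error and retry.

```python
# AQL that looks up a document by name and inserts it only if it is missing.
# UPDATE {} leaves an existing document unchanged; RETURN NEW yields the
# document that is in the collection afterwards, whichever branch ran.
UPSERT_JUDGMENT_AQL = """
UPSERT { name: @name }
    INSERT @doc
    UPDATE {}
    IN @@collection
    RETURN NEW
"""


def ensure_unique_name_index(judgment):
    """Ask the database to enforce one document per name (hypothetical helper).

    Assumption: `judgment` is a pyArango collection.
    """
    judgment.ensureHashIndex(["name"], unique=True)


def get_or_create_judgment(db, name, vertex_data, collection="judgment"):
    """Get or create a vertex in a single round trip (hypothetical helper).

    Assumption: `db` is a pyArango Database exposing AQLQuery().
    """
    # Store the lookup value in `name` so the UPSERT search expression
    # can find the document again on the next call.
    doc = dict(vertex_data, name=name)
    result = db.AQLQuery(
        UPSERT_JUDGMENT_AQL,
        bindVars={"name": name, "doc": doc, "@collection": collection},
        rawResults=True,
    )
    # With rawResults=True each result is a plain dict, including its _id.
    return result[0]
```

The returned dict carries the `_id` of the vertex, which should be enough to build the edge afterwards; whether this is actually faster than the two-step version would need to be measured.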

Update: As requested, the calling code is shown below. It's in a Django context, where Link is a Django model (i.e. data in a database):

        ... # Class definitions etc

        links = Link.objects.filter(dirty=True)

        for i, batch in enumerate(batch_iterator(links, limit=LIMIT, batch_size=ITERATOR_BATCH_SIZE)):
            for link in batch:
                source_name = cleaner.clean(link.case.mnc)
                target_name = cleaner.clean(link.citation.case.mnc)

                if source_name == target_name:
                    continue  # skip links whose source and target are the same case

                source_data = _serialize_node(link.case)
                target_data = _serialize_node(link.citation.case)

                populate_pair(citation_manager, source_name, source_data, target_name, target_data, link)

def populate_pair(citation_manager, source_name, source_data, target_name, target_data, link):
    source_node = citation_manager.get_or_create_node_object(
        source_name,
        source_data
        )
    target_node = citation_manager.get_or_create_node_object(
        target_name,
        target_data
        )
    description = source_name + " to " + target_name
    citation_manager.populate_link(source_node, target_node, description)

    link.dirty = False
    link.save()

And here's a sample of what the data looks like after cleaning and serializing:

source_data: {'name': 'P v R A Fu', 'court': 'ukw', 'collection': 'uf', 'number': 'CA 139/2009', 'tag': 'NA', 'node_id': 'uf89638', 'multiplier': '5.012480529547776', 'setdown_year': 0, 'judgment_year': 0, 'phantom': 'false'}
target_data: {'name': 'Ck v R A Fu', 'court': 'ukw', 'collection': 'uf', 'number': '10/22147', 'tag': 'NA', 'node_id': 'uf67224', 'multiplier': '1.316227766016838', 'setdown_year': 0, 'judgment_year': 0, 'phantom': 'false'}
source_name: [2010] ZAECGHC 9
target_name: [2012] ZAGPJHC 189
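
Names like the ones above contain brackets and spaces, so they may not be usable directly as an ArangoDB document key. Another possible way to get database-enforced uniqueness is to derive the `_key` deterministically from the name, so that inserting the same node twice collides on the primary key instead of creating a duplicate. This is only a sketch under that assumption; `judgment_key` and `with_key` are hypothetical helpers, and SHA-1 is used purely to produce a stable string of key-safe characters.

```python
import hashlib


def judgment_key(name):
    """Derive a stable, key-safe identifier from a cleaned case name."""
    # The hex digest contains only 0-9 and a-f, and the same name
    # always produces the same digest.
    return hashlib.sha1(name.encode("utf-8")).hexdigest()


def with_key(name, vertex_data):
    """Return a copy of vertex_data with a deterministic _key added."""
    return dict(vertex_data, _key=judgment_key(name))
```

With this approach the existence check moves into the database: a second insert of the same name would be rejected as a key conflict, which the application can treat as "already exists".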


