Friday, 24 December 2021

How to find list of unique affixes given a list of words?

An affix can be a prefix (at the start of a word), an infix (in the middle of a word), or a suffix (at the end of a word). I have a list of 200k+ Latin/Greek names used in biological taxonomy. Unfortunately, it turns out there is no centralized list of all the affixes used in the taxonomy, other than this very basic list.

The question is, how can I take that list of 200k+ Latin/Greek names and divide it into a list of affixes (ideally using just plain JavaScript)?

I don't really know where to begin on this one. If I construct a trie, I somehow need to test for specific chunks of words instead of whole words. Or, if a chunk can be extended, it shouldn't be included until we reach a final extension of some sort...

const fs = require('fs')

// One name per line; the /\n+/ split also skips blank lines.
const words = fs.readFileSync(`/Users/lancepollard/Downloads/all.csv`, 'utf-8').trim().split(/\n+/)
const trie = { children: {} }

words.forEach(word => addToTrie(trie, word))

// Insert a word letter by letter, creating child nodes as needed.
function addToTrie(trie, word) {
  let node = trie
  for (const letter of word.trim()) {
    node = node.children[letter] = node.children[letter] || { children: {} }
  }
  // Mark the node where a complete word ends.
  node.isWord = true
}

It doesn't need to be exact, in the sense that each affix actually means something; it can be dirty (some chunks will mean something, some won't). But it shouldn't just list every permutation of a word's letters. It should include "potential affix candidates", which are chunks that appear more than once in the list. This will at least get me partway there, and I can then manually go through and look up the definitions for each of these "chunks". Ideally, it should also tell whether each chunk is a prefix, infix, or suffix. Maybe the output is CSV in the format affix,position.
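
To make the "chunks which appear more than once" idea concrete, here is a rough brute-force sketch that counts every chunk per position and prints affix,position lines. The function name countChunks and the thresholds minLength and minWords are made up for illustration. It is deliberately noisy: it keeps every shared chunk, not just the longest ones.

```javascript
// Count every chunk (substring) per position and keep those seen in 2+ words.
// `minLength` and `minWords` are illustrative thresholds, not part of the question.
function countChunks(words, minLength = 2, minWords = 2) {
  const counts = new Map() // "chunk,position" -> number of words containing it
  for (const word of words) {
    const seen = new Set() // count each chunk at most once per word
    for (let start = 0; start < word.length; start++) {
      for (let end = start + minLength; end <= word.length; end++) {
        if (start === 0 && end === word.length) continue // skip the whole word
        const position =
          start === 0 ? 'prefix' : end === word.length ? 'suffix' : 'infix'
        seen.add(`${word.slice(start, end)},${position}`)
      }
    }
    for (const key of seen) counts.set(key, (counts.get(key) || 0) + 1)
  }
  return [...counts]
    .filter(([, n]) => n >= minWords)
    .map(([key]) => key)
    .sort()
}

// Prints one "affix,position" line per shared chunk.
console.log(countChunks(['abrogati', 'abrowendi']).join('\n'))
```

On the full 200k list the map of chunks could get large, so raising minLength or capping the chunk length may be needed; that has not been measured here.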

You can get creative in how this is solved, as without knowing the list of possible affixes in advance, we don't know what the exact output should be. This is basically an attempt to find the affixes as best as possible. If it includes things like aa- as a prefix, for example, which is probably a common sequence of letters yet I don't think is an affix, that is fine with me; it can be filtered out manually. But if there are two words (I am making this up), say abrogati and abrowendi, then abro would be a "common prefix", and that should be included in the final list, not abr, ab, and a, even though those are common too. Basically, I want the longest common prefix. However, if we also have the words apistal and ariavi, we could say that a is a common prefix as well, so our final list would include a and abro.
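
One possible reading of the "longest common prefix" rule is: report a prefix only at the points where the trie branches. The sketch below assumes that reading. It adds a count field to the trie nodes, and the names buildTrie and commonPrefixes are hypothetical.

```javascript
// Build a trie like the one above, also tracking how many words pass through each node.
function buildTrie(words) {
  const root = { children: {}, count: 0, isWord: false }
  for (const word of words) {
    let node = root
    node.count++
    for (const letter of word) {
      node = node.children[letter] =
        node.children[letter] || { children: {}, count: 0, isWord: false }
      node.count++
    }
    node.isWord = true
  }
  return root
}

// Report a prefix when 2+ words share it and the words diverge right after it
// (the node has several children, or a word ends there while others continue).
function commonPrefixes(words) {
  const result = []
  const walk = (node, prefix) => {
    const branches = Object.keys(node.children).length + (node.isWord ? 1 : 0)
    if (prefix && node.count >= 2 && branches >= 2) result.push(prefix)
    for (const [letter, child] of Object.entries(node.children)) {
      walk(child, prefix + letter)
    }
  }
  walk(buildTrie(words), '')
  return result.sort()
}

console.log(commonPrefixes(['abrogati', 'abrowendi', 'apistal', 'ariavi']))
// → [ 'a', 'abro' ]
```

With this rule abr and ab are skipped because every word that reaches them continues the same way, while a and abro are kept because the words split after them.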

To go into slightly more detail, say we have the two words aprineyanilantli and aboneyanomantli. They have the common prefix a- and the common suffix -antli, as well as the infix -neyan-, so those should all be in the final list.
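
A rough way to get all three kinds of affix from that example is to look at the letters next to each shared chunk: a chunk is kept only when its neighbouring letters vary, meaning it cannot be extended without splitting its occurrences. This is a sketch under that assumption. The name findAffixes and the minInfixLength threshold are invented for illustration, and short infixes are skipped because they are mostly noise.

```javascript
// For every chunk and position, collect the letters seen right before and after it.
// '^' and '$' stand for the word boundaries. `minInfixLength` is an illustrative threshold.
function findAffixes(words, minInfixLength = 3) {
  const stats = new Map() // "position:chunk" -> details about its occurrences
  words.forEach((word, index) => {
    for (let start = 0; start < word.length; start++) {
      for (let end = start + 1; end <= word.length; end++) {
        if (start === 0 && end === word.length) continue // whole word, not an affix
        const position =
          start === 0 ? 'prefix' : end === word.length ? 'suffix' : 'infix'
        if (position === 'infix' && end - start < minInfixLength) continue
        const chunk = word.slice(start, end)
        const key = `${position}:${chunk}`
        if (!stats.has(key)) {
          stats.set(key, { position, chunk, words: new Set(), left: new Set(), right: new Set() })
        }
        const entry = stats.get(key)
        entry.words.add(index)
        entry.left.add(start === 0 ? '^' : word[start - 1])
        entry.right.add(end === word.length ? '$' : word[end])
      }
    }
  })
  const result = []
  for (const { position, chunk, words: inWords, left, right } of stats.values()) {
    if (inWords.size < 2) continue // must be shared by at least two words
    // Keep a chunk only when the letters next to it vary on its open side(s).
    if (position === 'prefix' && right.size > 1) result.push(`${chunk}-`)
    if (position === 'suffix' && left.size > 1) result.push(`-${chunk}`)
    if (position === 'infix' && left.size > 1 && right.size > 1) result.push(`-${chunk}-`)
  }
  return result.sort()
}

console.log(findAffixes(['aprineyanilantli', 'aboneyanomantli']))
// → [ '-antli', '-neyan-', 'a-' ]
```

This enumerates every substring, so it is quadratic in word length; for short taxonomic names that is probably acceptable, but it has not been timed on the real list.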

It doesn't necessarily need to be efficient, as in theory this is only going to run once, on the 200k+ list. But if it is efficient as well, that would be a bonus. Ideally it shouldn't take hours to run, though I am not sure what's possible :)

Another example is this:

brevidentata
brevidentatum
brevidentatus
crassidentata
crassidentatum
crassidentatus

Here, the first 3 have the common prefix brevidentat, and words 2-3 have the common prefix brevidentatu. But later (with human knowledge), we find identat is probably the infix we desire, and a/um/us are word form suffixes. Also, we see that identat is an infix in both groups of words, crass... and brev.... So the end result should be:

brev-
crass-
-identat-
-a
-us
-um

That, in theory, would be the ideal outcome. But you could also have this:

brev-
crass-
-identat-
-identata
-identatus
-identatum

That would also work, and we could do some simple filtering to filter those out later.
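
The "simple filtering" could look something like the sketch below, which assumes the infixes and suffixes have already been collected into two arrays. The name splitCompoundSuffixes is hypothetical, and it simply takes the first matching infix rather than the longest one.

```javascript
// Reduce compound suffixes such as "identata" to "a" when they start with a known infix.
// Both input lists are assumed to come from an earlier candidate-finding pass.
function splitCompoundSuffixes(infixes, suffixes) {
  const reduced = new Set() // a Set removes duplicates after reduction
  for (const suffix of suffixes) {
    const infix = infixes.find(
      candidate => suffix.length > candidate.length && suffix.startsWith(candidate)
    )
    reduced.add(infix ? suffix.slice(infix.length) : suffix)
  }
  return [...reduced]
}

console.log(splitCompoundSuffixes(['identat'], ['identata', 'identatus', 'identatum']))
// → [ 'a', 'us', 'um' ]
```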

Note, I don't care about infixes in the sense of word parts that surround something else, like stufffoo...barstuff, where foo...bar wraps something. I just care about the word parts which are repeated, such as prefixes, suffixes, and stuff in the middle of words.



from How to find list of unique affixes given a list of words?
