Wednesday, 13 February 2019

How to group Wikipedia categories in Python?

For each concept in my dataset I have stored the corresponding Wikipedia categories. For example, consider the following five concepts and their Wikipedia categories.

  • hypertriglyceridemia: ['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
  • enzyme inhibitor: ['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
  • bypass surgery: ['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
  • perth: ['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
  • climate: ['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']

As you can see, the first three concepts belong to the medical domain, whereas the remaining two do not.

More precisely, I want to classify my concepts as medical or non-medical. However, it is very difficult to do this using the categories alone. For example, even though enzyme inhibitor and bypass surgery are both in the medical domain, their categories are very different from each other.

Therefore, I would like to know whether there is a way to obtain the parent categories of these categories (for example, the categories of enzyme inhibitor and bypass surgery both belong to a medical parent category).
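To make the question concrete, here is a minimal sketch of what "obtaining the parent categories" could look like against the public MediaWiki API (`action=query` with `prop=categories`), using only the standard library rather than pymediawiki or pywikibot. The function names `build_query`, `parse_parents` and `fetch_parents` and the User-Agent string are my own placeholders, and the sketch assumes English Wikipedia and ignores result continuation.

```python
import json
import urllib.parse
import urllib.request

# Assumption: English Wikipedia; other language editions use a different host.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_query(category_title):
    """Return the API parameters asking for the categories of one page."""
    return {
        "action": "query",
        "prop": "categories",
        "titles": category_title,
        "cllimit": "max",
        "clshow": "!hidden",  # skip hidden maintenance categories
        "format": "json",
        "formatversion": "2",  # "pages" comes back as a list
    }


def parse_parents(response):
    """Extract parent category titles from a decoded API response."""
    parents = []
    for page in response.get("query", {}).get("pages", []):
        for cat in page.get("categories", []):
            parents.append(cat["title"])
    return parents


def fetch_parents(category_title):
    """Fetch parent categories over the network (no continuation handling)."""
    url = API_URL + "?" + urllib.parse.urlencode(build_query(category_title))
    # Placeholder User-Agent; replace with something identifying your script.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-grouping-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as reply:
        return parse_parents(json.load(reply))
```

Calling `fetch_parents("Category:Enzyme inhibitors")` would then return the titles of the categories that this category itself belongs to.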

I am currently using pymediawiki and pywikibot. However, I am not restricted to those two libraries and am happy to have solutions using other libraries as well.

EDIT

As suggested by @IlmariKaronen, I am now also using the categories of the categories, and the results I got are as follows (the small font next to each category shows the categories of that category).

[Image: each category listed with its own categories in small font]

However, I still could not find a way to use these category details to decide whether a given term is medical or non-medical.
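One way the decision could be framed, as a sketch rather than a known-good answer: walk upwards from a concept's categories, breadth-first and only a few levels, and call the concept medical if a chosen root category is reached. Everything named here is an assumption of mine: the function `reaches_root`, the root `Category:Medicine`, the depth limit of 3, and the toy graph, which is made up and is not real Wikipedia data. The depth limit matters because the real category graph is not a strict tree, so an unbounded walk tends to reach almost everything.

```python
from collections import deque


def reaches_root(start_categories, get_parents, root_categories, max_depth=3):
    """Return True if any root category is within max_depth parent hops."""
    start = list(start_categories)
    roots = set(root_categories)
    seen = set(start)  # guards against cycles in the category graph
    queue = deque((cat, 0) for cat in start)
    while queue:
        category, depth = queue.popleft()
        if category in roots:
            return True
        if depth == max_depth:
            continue  # do not climb any higher from here
        for parent in get_parents(category):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return False


# Hypothetical toy graph for illustration only, NOT real Wikipedia data.
TOY_PARENTS = {
    "Category:Enzyme inhibitors": ["Category:Medicinal chemistry"],
    "Category:Medicinal chemistry": ["Category:Medicine"],
    "Category:Climate": ["Category:Climatology"],
    "Category:Climatology": ["Category:Earth sciences"],
}


def toy_get_parents(category):
    """Look up parents in the toy graph; unknown categories have none."""
    return TOY_PARENTS.get(category, [])


medical_roots = {"Category:Medicine"}  # assumed choice of root
is_medical = reaches_root(["Category:Enzyme inhibitors"], toy_get_parents, medical_roots)
is_climate_medical = reaches_root(["Category:Climate"], toy_get_parents, medical_roots)
print(is_medical, is_climate_medical)  # prints: True False
```

In practice `get_parents` would be any function that maps a category title to the titles of its parent categories, and both the root set and the depth limit would need tuning against concepts whose label is already known.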

Moreover, as pointed out by @IlmariKaronen, using WikiProject details could be promising. However, the Medicine WikiProject does not seem to cover all medical terms, so we would need to check other WikiProjects as well.
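A rough sketch of the WikiProject idea, under several assumptions of mine: that WikiProject banners sit on an article's talk page as templates titled `Template:WikiProject <name>`, that the transcluded templates are listed with `prop=templates` on the `Talk:` page, and that the set of "medical" projects below is a reasonable starting point. The project set is purely illustrative and would need to be extended, which is exactly the coverage problem described above. The function names are placeholders.

```python
PREFIX = "Template:WikiProject "

# Assumption: an illustrative, incomplete set of medically oriented projects.
MEDICAL_PROJECTS = {"Medicine", "Pharmacology", "Anatomy"}


def talk_templates_query(article_title):
    """Return API parameters listing templates used on the article's talk page."""
    return {
        "action": "query",
        "prop": "templates",
        "titles": "Talk:" + article_title,
        "tlnamespace": "10",  # Template namespace only
        "tllimit": "max",
        "format": "json",
        "formatversion": "2",
    }


def wikiproject_names(template_titles):
    """Extract project names from template titles, skipping subpages."""
    names = set()
    for title in template_titles:
        if title.startswith(PREFIX) and "/" not in title:
            names.add(title[len(PREFIX):])
    return names


def tagged_as_medical(template_titles, medical_projects=MEDICAL_PROJECTS):
    """Return True if any banner belongs to one of the chosen projects."""
    return bool(wikiproject_names(template_titles) & set(medical_projects))
```

Given the template titles from a talk page, `tagged_as_medical(titles)` answers whether any of the chosen projects has claimed the article; terms that no listed project covers would still fall through and need the category-based check.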

I am happy to provide more details if needed.



