Login
Login
National Data Archive
An Online Microdata Catalog
  • Home
  • Catalog
  • Citations
    Home / Central Data Catalog / ZAYCEV_MOKUJIN
central

A language agnostic semantic triples extractor.

zaycev_mokujin
zaycev
Created on December 02, 2019 Last modified December 02, 2019 Page views 33 Metadata JSON
  • Project Description

Overview

Abstract
# MOKUJIN

A language-agnostic toolset for semantic triples extraction and processing.

### Requirements

* Python 2.7.X
* LevelDB
* LZ4 (optional)
* Django 1.5 (optional)

### Quick Start:

1. Prepare input data (list of sentences in first-order-logic form generated by [Metaphor](https://github.com/metaphor-adp/Metaphor-ADP)):
a. Create LF file using one of Metaphor's pipeline.
b. Split LF if they are too large (recommended size ~ 1GB).
2. Extract triples from LF sentences using `findtriples.py`:

`python findtriples.py < sentences.lf.txt > triples.csv `

The output will be the following:

```
noun_verb_adv, <NONE>, быть-VB, можно-RB, <->, <->, 156
noun_adj, поле-NN, ледяной-ADJ, <->, <->, <->, 73
noun_verb_adv, <NONE>, быть-VB, надо-RB, <->, <->, 68
noun_verb_adv, <NONE>, быть-VB, нельзя-RB, <->, <->, 65
noun_adj, океан-NN, ледовитый-ADJ, <->, <->, <->, 47
...
```


3. Create triples index using `createtriplesindex.py`:

`python createtriplesindex.py -i triples.csv -o triples-index-dir`

4. Create *query-file* `query.json`:

```
{
"query": [
{
"label": "poverty",
"source": [
"source_word_1",
"source_word_2",
...
"source_word_n",
],
"target": [
"target_word_1",
"target_word_2",
...
"target_word_n"
]
}
]
}
```

Note that `source` or `target` filesds may be empy depending on your next step.

5. Run `findsources.py` to find source (requires a list of targets in *query* file):


```
python findsources.py -i triples-index-dir -o output-dir -q query.json
```

6. Prepare file with list of sources (each on separate string):

```
source_1
source_2
...
source_n
```

7. Run `findpatterns.py` to find patterns:

```
python findpatterns.py -i triples-index-dir -o output-dir -qf sources.txt
```


## Relation Triples Extractor

Usage:

```
python mokujin.py [<input file in logical form>] [<output file>]
```

### Features

* Input format are sentences in first-order logic form produced by [Metaphor](https://github.com/metaphor-adp/Metaphor-ADP) semantic pipelines.
* Extracts the following relationships:

**Verbs**

1. `subj_verb_dirobj([noun*],verb,[noun+]) ("John reads a book")`
2. `subj_verb_indirobj([noun*],verb,[noun+]) ("John gives to Mary")`
3. `subj_verb_instr([noun*],verb,[noun+]) ("Джон работает топором")`
4. `subj_verb([noun+], verb) ("John runs") // only if there is no dirobj and indirobj`
5. `subj_verb_prep_compl([noun*],verb,prep,[noun+]) ("John comes from London")`
6. `subj_verb_verb_prep_noun([noun*],verb,verb,prep,[noun+]) ("John tries to go into the house")`
7. `subj_verb_verb([noun+],verb,verb) ("John tries to go") // only if there is no prep attached to the second verb`

**Nouns**

1. `noun_be_prep_noun(noun,verb,prep,noun) ("intention to leave for money")`
2. `noun_be(noun,verb) ("intention to leave") // only if there is no prep attached to verb`
3. `noun_adj_prep_noun(noun,adjective,prep,noun) ("The book is good for me") -> only if "for" has "good" (and not "is") as its arg`
5. `noun_adj([noun+],adjective) ("The book is good") // only if there is no prep attached to adj as its arg`
6. `noun_verb_adv_prep_noun(adverb,verb) ("John runs fast for me") -> only if "for" has "fast" (and not "runs") as its arg`
7. `noun_verb_adv([noun*],verb,adverb) ("John runs fast") // only if there is no prep attached to adv`
8. `nn_prep([noun+],prep,noun) ("[city]&bike for John") // only if "for" has "bike" (and not some verb) as its arg`
9. `nn(noun,noun) ("city bike") // only if there is no prep attached to the second noun`
10. `nnn(noun,noun,noun) ("Tzar Ivan Grozny")`
11. `noun_equal_prep_noun(noun,noun,prep,noun) ("John is a man of heart") // only if "of" has "man" (and not "is") as its arg.`
12. `noun_equal_noun(noun,noun) ("John is a biker") // only if there is no prep attached to the second noun`
13. `noun_prep_noun(noun,prep,noun) ("house in London")`
14. `noun_prep_prep_noun(noun,prep,prep,noun) ("book out of the store")`

**Verbs**

1. `compl(anything,anything) ("близкий мне")`

### Input/Output Examples:

**Input (Logical Form):**

```
% В четверг , 7 февраля 2013 года , стартовала официальная продажа билетов на Олимпийские игры в Сочи —
% ровно за год до начала соревнований .
id(1).
[1001]:в-in(e1,e5,x1) & [1002]:четверг-nn(e2,x1) & [1005]:февраль-nn(e3,x2) & [1007]:год-nn(e4,x3) &
[1009]:стартовать-vb(e5,x4,u1,u2) & [1010]:официальный-adj(e6,x4) & [1011]:продажа-nn(e7,x4) &
[1012]:билет-nn(e8,x5) & [1013]:на-in(e9,x5,x6) & [1014]:олимпийский-adj(e10,x6) & [1015]:игра-nn(e11,x6) &
[1016]:в-in(e12,x6,x7) & [1017]:сочи-nn(e13,x7) & [1019]:ровно-rb(e14,e15) & [1020]:за-in(e15,e5,x8) &
[1021]:год-nn(e16,x8) & [1022]:до-in(e17,x9,x10) & [1023]:начало-nn(e18,x10) & [1024]:соревнование-nn(e19,x11) &
card(e20,u3,7) & card(e21,x3,2013) & of-in(e22,x2,x3) & of-in(e23,x4,x5) & typelt(e24,x5,s1) & typelt(e25,x6,s2) &
of-in(e26,x10,x11) & typelt(e27,x11,s3) & past(e28,e5)

% В первые же часы билеты на самые интересные широкому кругу болельщиков виды программы — хоккей , биатлон ,
% сноуборд — были раскуплены чуть менее чем полностью .
id(2).
[2001]:в-in(e1,x1,x2) & [2004]:часы-nn(e2,x2) & [2005]:билет-nn(e3,x1) & [2006]:на-in(e4,x1,x3) &
[2008]:интересный-adj(e5,x3) & [2009]:широкий-adj(e6,x3) & [2010]:круг-nn(e7,x3) & [2011]:болельщик-nn(e8,x4) &
[2012]:вид-nn(e9,x1) & [2013]:программа-nn(e10,x5) & [2015]:хоккей-nn(e11,x6) & [2017]:биатлон-nn(e12,x7) &
[2019]:сноуборд-nn(e13,x8) & [2022]:раскупить-vb(e14,u1,x8,u2) & [2023]:чуть-rb(e15,e16) & [2024]:менее-rb(e16,e14) &
[2025]:чем-cnj(e17,x9) & [2026]:полностью-rb(e18,e17) & card(e19,x2,1) & typelt(e20,x2,s1) & typelt(e21,x1,s2) &
of-in(e22,x3,x4) & typelt(e23,x4,s3) & typelt(e24,x1,s4) & of-in(e25,x1,x5) & past(e26,x8) & past(e27,e14)

% Что касается мужского хоккея , например , то недоступными оказались пропуска на все игры плей-офф — и это при том ,
% что даже сетка турнира составлена пока не целиком .
id(3).
[3002]:касаться-vb(e1,u1,x1,u2) & [3003]:мужской-adj(e2,x1) & [3004]:хоккей-nn(e3,x1) & [3006]:например-rb(e4,e5) &
[3008]:то-cnj(e5,x2) & [3009]:недоступный-adj(e6,x3) & [3010]:оказаться-vb(e7,x4,u3,u4) & [3011]:пропуск-nn(e8,x4) &
[3012]:на-in(e9,x4,x5) & [3014]:игра-nn(e10,x5) & [3015]:плей-офф-nn(e11,x6) & thing(e12,x7)
[3019]:при-in(e13,x8,x7) & [3024]:сетка-nn(e14,x9) & [3025]:турнир-nn(e15,x10) & [3026]:составить-vb(e16,x9,u5,u6) &
[3027]:пока-cnj(e17,x11) & [3029]:целиком-rb(e18,e17) & of-in(e19,x5,x6) & of-in(e20,x9,x10) & not(e21,e18) &
past(e22,e7) & past(e23,e16)
```

**Output (List of Triples in CSV format):**

```
rel_type,arg1,arg2,arg3,arg4,arg5,arg6,freq
noun_adj,федерация-NN, российский-ADJ,<->,<->,<->,162267
subj_verb,речь-NN,идти-VB,<->,<->,<->,85846
subj_verb_dirobj,<NONE>,обратить-VB,внимание-NN,<->,<->,64583
noun_adj,житель-NN,местный-ADJ,<->,<->,<->,17450
```

## Triples Indexer
## Sources Finder
## Patterns Finder
Authoring entity
Agency Name Role
zaycev owner
Language
English

Methods, software and scripts

Software
Name Libraries or packages used
Python
json, numpy, django, cPickle, glob, re, string, plyvel, threading, mokujin, array, argparse, csv, gensim, marshal, multiprocessing, createlfindex, unidecode, fnmatch, StringIO, logging, zlib, sys, findsources, random, collections, gc, lz4, os, unittest
License
Name
Apache License 2.0

Metadata production

Producers
Name Role
GitHub Bot bot
Date of Production
01 December 2019
National Data Archive

© National Data Archive, All Rights Reserved.