xxx.dtd a template #6

Open
funderburkjim opened this issue Oct 13, 2019 · 22 comments

Comments

@funderburkjim (Contributor)

In the previous revision of csl-pywork, the dictionary dtds (xxx.dtd) were in 'distinctfiles'. That is, when reconstructing a 2020 dictionary, csl-pywork used a separate version of the xxx.dtd file for each dictionary. Now, csl-pywork uses a single template, one.dtd, to create the different versions.

This is an improvement, because now we can see all the variations in one place.

@funderburkjim (Contributor, Author)

How one.dtd was constructed.

While similar in some ways to the relation between the make_xml.py template (see #5) and its individual versions, this relation is somewhat different.

In the case of the make_xml.py template, the template is used to generate, for each dictionary xxx, a version of make_xml.py for that dictionary that is functionally the same as the prior distinct version;
namely, the generated and distinct versions create the same xxx.xml file.

The xxx.dtd generated by one.dtd is also functionally similar to the previous distinct xxx.dtd, in that the xml file xxx.xml is judged valid by both.

However, the xxx.dtd generated by one.dtd is quite different from the previous distinct xxx.dtd.
The way the one.dtd template was developed is described in readme_dtd.txt. In brief, one.dtd started out as a copy of the previous distinct acc.dtd. Then, one.dtd was adjusted one dictionary at a time by

  • adding elements that were in the next dtd but not yet in one.dtd
  • adding attributes that were in the next dtd but not yet in one.dtd
  • adjusting attribute value specifications to accommodate values used by the next dtd but not yet allowed by one.dtd
  • checking that the resulting one.dtd still validated all the dictionaries handled thus far.
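The incremental merge described above can be sketched in python. This is only an illustration, not csl-pywork code: the helper names are hypothetical, and a real DTD needs more careful parsing (an <!ATTLIST ...> may declare several attributes; this crude regex captures only the first).

```python
import re

def declared_elements(dtd_text):
    """Element names declared via <!ELEMENT name ...>."""
    return set(re.findall(r'<!ELEMENT\s+(\S+)', dtd_text))

def declared_attributes(dtd_text):
    """(element, attribute) pairs; only the first attribute of each
    <!ATTLIST ...> is captured by this crude sketch."""
    return set(re.findall(r'<!ATTLIST\s+(\S+)\s+(\S+)', dtd_text))

# Hypothetical fragments standing in for one.dtd and the next dictionary's dtd.
one = '<!ELEMENT H1 (h,body,tail)>\n<!ATTLIST div n CDATA #IMPLIED>'
nxt = ('<!ELEMENT H1 (h,body,tail)>\n<!ELEMENT ab (#PCDATA)>\n'
       '<!ATTLIST ab n CDATA #IMPLIED>')

print(declared_elements(nxt) - declared_elements(one))      # elements to add
print(declared_attributes(nxt) - declared_attributes(one))  # attributes to add
```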

@funderburkjim (Contributor, Author)

Must one.dtd be a template?

For dictionary xxx, the xml root of xxx.xml is xxx. In other words, the xml structure of xxx.xml is

<xxx>
<!-- many other elements used in the xml form of the dictionary entries -->
</xxx>

So, at least the difference in root elements dictates that one.dtd must treat xxx as a template variable.

Otherwise, there are only two places where template variables are used.

  • For the AP dictionary, there are many elements like <div n="?">... ; that is, the div element has an
    n attribute with value ?. One way attribute values can be specified in a dtd is by a
    list of possible values. In all other dictionary dtds, the possible values of the n attribute of the div
    element are given by such a list. But, according to the very strict rules of dtd formation, there are
    restrictions on the character set allowed in the specification of a possible attribute value in such a
    list; and the ? character is not allowed.
    Thus, in ap.dtd, we must specify the n attribute of div by the more general CDATA specification,
    which does validate the <div n="?"> usage.
  • In the case of all dictionaries but mw, the children of the root element are all of type H1.
    But in the case of mw, the children of the root element are of 20 types, which can be described
    by the regular expression H[1-4][ABCE]?.
    Currently, the one.dtd template generates different values for the children of the root.
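To make the two template variables concrete, here is a hypothetical sketch of the generation step (not the actual one.dtd or csl-pywork code): only the root element name and the root's permitted children vary per dictionary.

```python
# Hypothetical template fragment: {root} and {children} are the only
# per-dictionary variables, mirroring the two cases described above.
DTD_TEMPLATE = '<!ELEMENT {root} ({children})*>\n<!ELEMENT H1 (h,body,tail)>\n'

def generate_dtd(dictcode):
    if dictcode == 'mw':
        # H[1-4][ABCE]? : 4 digits x (empty or A/B/C/E suffix) = 20 types
        children = '|'.join(f'H{d}{s}' for d in '1234'
                            for s in ('', 'A', 'B', 'C', 'E'))
    else:
        children = 'H1'
    return DTD_TEMPLATE.format(root=dictcode, children=children)

print(generate_dtd('acc').splitlines()[0])  # <!ELEMENT acc (H1)*>
```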

@funderburkjim (Contributor, Author)

Removal of template logic for AP

We remove the template distinction for the AP dictionary as follows.

  • The <div n="?"> form is introduced by make_xml.py. We can change this to
    <div n="Q"> [Q is not used elsewhere as a value of 'n' for 'div'; and Q suggests Question].
  • We must also check (in csl-websanlexicon) whether basicdisplay.php does anything special
    for <div n="?"> in the ap dictionary; we see (in function sthndl_div for 'ap') that any value of 'n'
    other than '2' or '3' is rendered simply as a line break; so no change is required in basicdisplay.php.
  • In one.dtd, we add 'Q' as an allowed value of the n attribute of div.
  • Regenerate the ap dictionary.
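The first steps reduce to a one-line text substitution in what make_xml.py emits; a minimal sketch (the function name is hypothetical):

```python
def fix_div_n(line):
    """Replace the '?' value of div's n attribute (illegal in a DTD
    enumerated value list) with 'Q', per the plan above."""
    return line.replace('<div n="?">', '<div n="Q">')

print(fix_div_n('<div n="?">question text</div>'))  # <div n="Q">question text</div>
```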

@funderburkjim (Contributor, Author)

Suggestions for improvement

With all the dtds represented in one.dtd, we can now examine one.dtd with an eye towards simplification.

  • There are minor spelling variants for values of the 'n' attribute of the 'lang' element (e.g. Arabic vs. arabic).
  • The 'br' and 'lb' elements are used to indicate line breaks in some dictionaries, while other
    dictionaries use <div n="lb">. It would be simpler to choose one or the other as the standard,
    and then change the non-conforming dictionaries. This would probably involve selective changes
    to make_xml.py in csl-pywork/v02, and to basicdisplay.php (in both csl-websanlexicon/v02 and
    apidev).
  • The multitude of possible values for the 'n' attribute of 'div' (see one.dtd) can almost surely
    be simplified. This would require examination of the 'meaning' of the attribute values (from
    make_xml.py).
  • The above are just those that occur to me at this moment. Further examination of one.dtd will
    likely find many other opportunities for simplification.

Some additional tools needed.

In investigating such simplifications as above, some additional software tools will probably be
needed. One that comes to mind and that is already written is:

  • check_xml_tags.py (currently exists in MWScan/2014/pywork/). This program reads a text file
    and writes to an output file all instances of <...>. This is useful for determining the tags,
    attributes, and attribute values actually occurring in a particular xxx.xml.
    • One application would be finding the 'Arabic/arabic' values mentioned above.
      Write a bash shell script to run check_xml_tags on all dictionaries, then grep the resulting
      files for information on the <lang tag. Use the results as a guide to developing changes to
      various xxx.txt digitizations. Finally, when all dictionaries are changed,
      modify one.dtd to remove the now-unused attribute values of <lang n=.
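For readers without the MWScan copy at hand, the behavior attributed to check_xml_tags.py above (a plain-text scan for <...> instances, tallied per distinct tag) can be sketched as follows; this is a reimplementation of the described behavior, not the original program:

```python
import re
from collections import Counter

def check_xml_tags(lines):
    """Tally every <...> instance found by a plain-text scan."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r'<[^>]+>', line))
    return counts

sample = ['<H1><lang n="greek">x</lang></H1>', '<lang n="greek">y</lang>']
for tag, n in sorted(check_xml_tags(sample).items()):
    print(f'{n:06d} {tag}')   # e.g. 000002 <lang n="greek">
```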

@gasyoun (Member)

gasyoun commented Oct 20, 2019

@funderburkjim

But in case of mw, the children of the root element are of 20 types which can be understood by the regular expression H[1-4][ABCE]?.

What tool would you use to count how many of each 20 are there?

@YevgenJohn (Contributor)

YevgenJohn commented Oct 20, 2019

@funderburkjim

But in case of mw, the children of the root element are of 20 types which can be understood by the regular expression H[1-4][ABCE]?.

What tool would you use to count how many of each 20 are there?
Can we use something like https://github.com/teeshop/rexgen?
[root@localhost rexgen]# rexgen H[1-4][ABCE] | wc -l
16
[root@localhost rexgen]# rexgen H[1-4][ABCE]
H1A
H2A
H3A
H4A
H1B
H2B
H3B
H4B
H1C
H2C
H3C
H4C
H1E
H2E
H3E
H4E
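The same expansion can be done in plain python, without an external tool. Note that the rexgen run above omitted the trailing '?' of H[1-4][ABCE]?, which is why it printed 16 strings rather than the 20 that include the bare H1..H4:

```python
from itertools import product

# Empty suffix covers H1..H4; A/B/C/E cover the rest of H[1-4][ABCE]?.
types = [f'H{d}{s}' for d, s in product('1234', ('', 'A', 'B', 'C', 'E'))]
print(len(types))   # 20
print(types[:6])    # ['H1', 'H1A', 'H1B', 'H1C', 'H1E', 'H2']
```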

@YevgenJohn (Contributor)

YevgenJohn commented Oct 20, 2019

Some additional tools needed.

  • check_xml_tags.py (currently exists in MWScan/2014/pywork/) This program reads a text file

Where in https://github.com/sanskrit-lexicon can I find that MWScan/2014/pywork/?
None of the CSL* repos have check_xml_tags.py. Neither csl-orig nor csl-pywork has 2014 (2020 only).

Please advise, I'd like to work on this bash tool. Thank you!

@funderburkjim (Contributor, Author)

I've added a v02/utilities/ folder to this repository, and put check_xml_tags.py there.
It is an analytical tool, not used in the dictionary generation.

@YevgenJohn (Contributor)

Thank you! I see it there.

@funderburkjim (Contributor, Author)

Note on .gitignore

There are often occasions where I want to do some kind of analysis; an example might be to try
check_xml_tags.py. But I don't want to add material to what is tracked by git. The .gitignore
has a 'temp*' line in it. Thus I can add a 'tempxyz' directory in any convenient place in the local copy of csl-pywork, and put anything in there.

@YevgenJohn (Contributor)

We might benefit from another branch for this repo.
One could switch between branches, using either the default one for generic dictionary use, or the other branch for some analytics use, if that's convenient for the team.

@funderburkjim (Contributor, Author)

What tool would you use to count how many of each 20 are there?

A one-line variation of check_xml_tags.py does the trick. Change line 10 to:
tags = re.findall(r'<H.*?>',line)

Call the new program, for example, v02/utilities/temp.py. And run it with
python temp.py ../../../mw/pywork/mw.xml temp.txt

Then temp.txt contains the list of 20, with counts. For example 009468 <H1A>.
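As a self-contained sketch of that variation (the regex is the one quoted above; the function name is hypothetical, and the sketch takes lines of text rather than a filename so it is easy to try):

```python
import re
from collections import Counter

def count_h_tags(lines):
    """Count the <H...> tags found by a plain-text scan, as in the
    one-line variation of check_xml_tags.py described above."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r'<H.*?>', line))
    return counts

# On mw.xml this yields the 20 types with counts, e.g. 009468 <H1A>.
demo = ['<H1A>a</H1A>', '<H1A>b</H1A>', '<H2>c</H2>']
for tag, n in sorted(count_h_tags(demo).items()):
    print(f'{n:06d} {tag}')
```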

@funderburkjim (Contributor, Author)

We might benefit from another branch for this repo

My understanding of git does not yet extend to how to make use of branches. If you have something specific in mind, go ahead and give it a try. Let's take it in baby steps until we all
understand how to make use of branches. If you do this, be careful about the size of files added to the
repository. Currently, the repository tracks just fairly small program files.

@gasyoun (Member)

gasyoun commented Oct 20, 2019

Currently, the repository tracks just fairly small program files.

Sure. I think we can benefit from Yevgeniy's experience.

@YevgenJohn (Contributor)

YevgenJohn commented Oct 20, 2019

I am from the GitLab world, but GitHub should have it also, as that's part of regular Git functionality.
Git has a -b option for clone, checkout, etc.
https://stackoverflow.com/questions/1911109/how-do-i-clone-a-specific-git-branch
These are the official docs (which I consider the best): https://git-scm.com/book/en/v2/Git-Branching-Basic-Branching-and-Merging
Please let me try whether I can add a branch to the repo with my credentials. A branch can always be removed when no longer needed. We use personal branches at work, per person, per task, etc., so we end up eventually merging some of them and removing others.
Understood about the size; as we are dealing with scripts, it shouldn't be an issue. I am surprised that even the scanned images fit there.
One of my Git projects currently has over 5,000 branches and it is doing well, so Git has the capacity: it was made for the Linux kernel, with thousands of participants who often each have their own branch, since a merge request must come from a branch in order to be added to the master branch. It is a powerful mechanism to let code exist in parallel yet linked to the same repository.
Here's how GitHub manages it: https://help.github.com/en/articles/creating-and-deleting-branches-within-your-repository
When I try it using my credentials it doesn't show 'Create branch', so I must not have that permission; but the repo owner has that option, so an 'analytics' branch (for example) could be created.
All we need to do is use '-b analytics' when we work with that branch.
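For reference, the workflow boils down to a couple of commands; 'analytics' is just the example branch name from this thread, and the sketch below runs in a throwaway repo so it is safe to try (in csl-pywork itself you would skip the init lines, and pushing requires write access):

```shell
#!/bin/sh
# Throwaway repo so the sketch is self-contained.
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email "you@example.com" && git config user.name "you"
git commit -q --allow-empty -m "initial commit"

git checkout -b analytics        # create the 'analytics' branch and switch to it
git branch                       # '*' marks the current branch
# git push -u origin analytics   # publish it (requires write access to the repo)
```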

@YevgenJohn (Contributor)

A one-line variation of check_xml_tags.py does the trick. Change line 10 to:

That's probably the safest way, in case the "rexgen" tool's regex engine has differences, as some metasymbols might be interpreted slightly differently depending on the parser (there's an entire book on those regex-engine subtleties on Safari; apologies for the sidetrack).

I hope that the regex notation used in the DTD discussion is interpreted the same way by the parser python uses.

@YevgenJohn (Contributor)

YevgenJohn commented Oct 20, 2019

  • Write a bash shell to run check_xml_tags on all dictionaries,

This is what I see parsing all the dictionaries:

for i in $(ls ../../../ | egrep -v csl) ; do python check_xml_tags.py ../../../${i}/pywork/${i}.xml ${i}.txt ;done
[root@localhost utilities]# for i in $(ls *.txt); do grep '<lang' $i | awk -F'<' '{print $2}'; done | sort | uniq
lang n="arabic">
lang n="greek">
lang n="Greek">
lang n="meter">
lang n="Old-Church-Slavonic">
lang n="oldhebrew">
lang n="russian">
lang n="Russian">
lang n="slavic">
lang script="Arabic" n="Arabic">
lang script="Arabic" n="Hindustani">
lang script="Arabic" n="Persian">
lang script="Arabic" n="Turkish">

We need to unify Greek/greek (pwg capitalized vs. the rest) and Russian/russian (pwg capitalized vs. pw); besides that, everything else seems unique.
It's simpler to lowercase those two cases in pwg, rather than modifying the several dictionaries that use lowercase greek.

[root@localhost utilities]# grep 'lang n="russian"' *
pw.txt:000001 <lang n="russian">
[root@localhost utilities]# grep 'lang n="Russian"' *
pwg.txt:000023 <lang n="Russian">
[root@localhost utilities]# grep 'lang n="greek"' *
ben.txt:001490 <lang n="greek">
bhs.txt:000003 <lang n="greek">
bop.txt:001701 <lang n="greek">
bur.txt:000677 <lang n="greek">
cae.txt:000003 <lang n="greek">
gra.txt:000229 <lang n="greek">
gst.txt:000013 <lang n="greek">
inm.txt:010778 <lang n="greek">
md.txt:000008 <lang n="greek">
mw72.txt:001665 <lang n="greek">
mw.txt:001157 <lang n="greek">
pwg.txt:000397 <lang n="greek">
pw.txt:000186 <lang n="greek">
snp.txt:000001 <lang n="greek">
stc.txt:000001 <lang n="greek">
vei.txt:000147 <lang n="greek">
wil.txt:000023 <lang n="greek">
[root@localhost utilities]# grep 'lang n="Greek"' *
pwg.txt:000001 <lang n="Greek">
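The unification proposed above (lowercasing only the two pwg variants) could be scripted along these lines; a sketch with a hypothetical function name, deliberately leaving 'Arabic' and the other values untouched:

```python
import re

def normalize_lang(line):
    """Lowercase the two variant values of lang's n attribute
    (Greek -> greek, Russian -> russian); everything else is unchanged."""
    return re.sub(r'(<lang n=")(Greek|Russian)(")',
                  lambda m: m.group(1) + m.group(2).lower() + m.group(3),
                  line)

print(normalize_lang('<lang n="Greek">...</lang>'))  # <lang n="greek">...</lang>
```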

@YevgenJohn (Contributor)

I tried to commit the shell script, but it seems I don't have that permission:

 create mode 100755 v02/utilities/find_lang_unique.sh
[root@localhost utilities]# git push
Username for 'https://github.com': YevgenJohn
Password for 'https://[email protected]':
remote: Permission to sanskrit-lexicon/csl-pywork.git denied to YevgenJohn.
fatal: unable to access 'https://github.com/sanskrit-lexicon/csl-pywork.git/': The requested URL returned error: 403

Basically, the results above could have been produced using one shell script:

#!/bin/bash
for i in $(ls ../../../ | egrep -v csl) ; do python check_xml_tags.py ../../../${i}/pywork/${i}.xml ${i}.txt ;done
for i in $(ls *.txt); do grep '<lang' $i | awk -F'<' '{print $2}'; done | sort | uniq
for s in `for i in $(ls *.txt); do grep '<lang' $i | awk -F'<' '{print $2}'; done | sort | uniq | awk -F'n=' '{print $2}' | uniq -ic | egrep -v ' 1 ' | awk '{printf("n=\"%s%s\nn=%s\n",toupper(substr($2,2,1)),substr($2,3),$2)}'`; do grep "lang $s" * ; done
rm -f *.txt

which would give the same results, so I'm not sure you need the script in the repo, as this seems to be a one-time search:

pwg.txt:000001 <lang n="Greek">
ben.txt:001490 <lang n="greek">
bhs.txt:000003 <lang n="greek">
bop.txt:001701 <lang n="greek">
bur.txt:000677 <lang n="greek">
cae.txt:000003 <lang n="greek">
gra.txt:000229 <lang n="greek">
gst.txt:000013 <lang n="greek">
inm.txt:010778 <lang n="greek">
md.txt:000008 <lang n="greek">
mw72.txt:001665 <lang n="greek">
mw.txt:001157 <lang n="greek">
pwg.txt:000397 <lang n="greek">
pw.txt:000186 <lang n="greek">
snp.txt:000001 <lang n="greek">
stc.txt:000001 <lang n="greek">
vei.txt:000147 <lang n="greek">
wil.txt:000023 <lang n="greek">
pwg.txt:000023 <lang n="Russian">
pw.txt:000001 <lang n="russian">

@funderburkjim (Contributor, Author)

I am surprised that even scanned images fit there.

No, actually the scanned images are NOT part of any repository.

Currently, the logic involved in displaying scanned images (this logic is part of csl-websanlexicon)
looks for a local copy of the images (in the web/pdfpages directory). But if it fails to find images there, it gets the images from Cologne server.

The images are also available from an AWS-S3 bucket, but using that source of images is not
currently built into csl-websanlexicon code.

It is precisely for size reasons that the scanned images are not in a repository -- I think their
total size would be about 50-60GB.

If we want to give Ubuntu (and other local) installations the option to have local copies of the
images, we need to develop some way to do this, and add this to the installation instructions.

If you want to work on this, I can provide some further details.

@funderburkjim (Contributor, Author)

I hope that the regex used in DTD is the same parser python uses.

The check_xml_tags.py program actually does not use a python xml parser. It just reads the xml file as
lines of text and then looks for <...> tags.

Aside on xml validators

On a local XAMPP system, it is hard to get the xmllint xml validator -- xmllint is used in the redo_xml.sh script to check that a given dictionary validates according to its dtd.
As a substitute, I have written a (simple) xml validator in python, based on the lxml python library. In recent work, I found that the python validator and the xmllint validator seemed always to give the same results, so I feel comfortable using the python validator locally. However, the xmllint validator is often much faster at detecting errors than the python validator.
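The core of such an lxml-based validator can be sketched as follows; this is not the actual script, just the essential calls (lxml must be installed, and the tiny DTD here is a made-up example):

```python
from io import StringIO
from lxml import etree

# Compile a tiny DTD and validate a parsed document against it.
dtd = etree.DTD(StringIO('<!ELEMENT acc (H1)*><!ELEMENT H1 (#PCDATA)>'))
doc = etree.fromstring('<acc><H1>text</H1></acc>')
print(dtd.validate(doc))         # True
bad = etree.fromstring('<acc><H2>text</H2></acc>')
print(dtd.validate(bad))         # False; dtd.error_log holds the reason
```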

@YevgenJohn (Contributor)

actually the scanned images are NOT part of any repository.
If you want to work on this, I can provide some further details.

Absolutely, I would like to work on this: in case the Cologne server is not available, or the VM runs offline, we need an option to stock the VM with local images. I suspect some discrepancies between the digital version and the pictures are inevitable for a project of this size, so it's important to have the picture alongside the digital version of the dictionary.

I don't know if GitHub charges for 50-60GB of pictures, which would be accessed read-only, and whether that's cheaper compared to an AWS-S3 bucket.

Please advise what to take a look at (I guess image fetching is part of the php), so we can give that option to the standalone builds.

Thank you!

@gasyoun (Member)

gasyoun commented Jan 25, 2020
