Skip to content

Ensure fsencode/fsdecode works on ASCII FS. Use tarball for unicode test files #758

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Oct 2, 2017

Conversation

pombredanne
Copy link
Member

For #688 this should solved the paths encoding issue even on Doker images without locale and an ASCII FS encoding

For #755 this ensures the repo can be cloned on Windows by using a tarball for test paths with weird encodings

Both as reported by @sschuberth : Thanks!

@codecov
Copy link

codecov bot commented Sep 19, 2017

Codecov Report

Merging #758 into develop will increase coverage by 0.36%.
The diff coverage is 94.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           develop    #758      +/-   ##
==========================================
+ Coverage    78.54%   78.9%   +0.36%     
==========================================
  Files           91      91              
  Lines        11263   11484     +221     
==========================================
+ Hits          8846    9061     +215     
- Misses        2417    2423       +6
Impacted Files Coverage Δ
src/commoncode/fileutils.py 84.75% <100%> (+0.14%) ⬆️
src/extractcode/libarchive2.py 94.72% <100%> (+0.06%) ⬆️
src/typecode/magic2.py 92.45% <100%> (+0.21%) ⬆️
src/commoncode/command.py 83.33% <80%> (+0.4%) ⬆️
src/commoncode/testcase.py 83.59% <90%> (-0.01%) ⬇️
src/typecode/contenttype.py 85.51% <0%> (+0.35%) ⬆️
src/scancode/cli.py 93.82% <0%> (+0.36%) ⬆️
src/scancode/cache.py 89.67% <0%> (+1.36%) ⬆️
src/scancode/api.py 74.66% <0%> (+2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 969d84d...4ee8053. Read the comment docs.

@sschuberth sschuberth changed the title Ensure fsencode/fsdecode works on ASCII FS. Use tarbell for unicode test files Ensure fsencode/fsdecode works on ASCII FS. Use tarball for unicode test files Sep 20, 2017
@sschuberth
Copy link
Collaborator

The Unicode fixture tests/scancode/data/unicode_fixtures/empty.txt is present in the TAR but has not been removed from Git, and the fixture tests/scancode/data/unicode_fixtures/snow ☃/.gitkeep is not present in the TAR (only a snow directory is) and it has also not been removed from Git.

FS_ENCODING = sys.getfilesystemencoding() or sys.getdefaultencoding()
FS_ENCODING = sys.getfilesystemencoding()
# normalize the encoding name
FS_ENCODING = codecs.lookup(FS_ENCODING).name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still the correct thing to do? I though when using a version of backports.os that has PiDelport/backports.os#8 merged, we don't need to do normalization anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FS_ENCODING is now used only in command.py and libarchive2.py to encode ctypes-accessed DLLs/SOs paths ... It is also used implicit in magic2.py and should not be duplicated ... and to your point this is unlikely needed in this form now with @pjdelport fixes and should be replaced by using a proper fsencode/fsdecode in these two places instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this change is still present with backports.os 0.1.1 in use. Shouldn't your either revert this change, or use the proper fsencode/fsdecode that you mentioned?

@@ -15,3 +15,6 @@ copyright: Copyright (c) Piët Delport, and Python Software Foundation
notes: the invalid_utf8_indexes() helper is copied from Bob Ippolito's
pyutf8 package (pyutf8/ref.py) and is MIT-licensed
https://github.com/etrepum/pyutf8/blob/master/pyutf8/ref.py
This verion is an advanced patched version fixing the #688 issue
and contain the code in the contributed pull request at
https://github.com/pjdelport/backports.os/pull/7
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. This custom version of backports.os still has your #7 merged, whereas we should now use a version that has #8 merged. I guess we can wait until #8 is merged upstream and then use a version straight from upstream.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sschuberth yes, I will update this with the latest PiDelport/backports.os#8 that @pjdelport kindly published to Pypi too.

@pombredanne
Copy link
Member Author

@sschuberth you wrote:

The Unicode fixture tests/scancode/data/unicode_fixtures/empty.txt is present in the TAR but has not been removed from Git, and the fixture tests/scancode/data/unicode_fixtures/snow ☃/.gitkeep is not present in the TAR (only a snow directory is) and it has also not been removed from Git.

Good catch. Will fix

@pombredanne
Copy link
Member Author

@sschuberth If I check https://github.com/nexB/scancode-toolkit/tree/688-fsencode-fsdecode-755-unicode-test-files/tests/scancode/data unicode_fixtures/ is gone-gone as a dir?

And the archive seems to contains properly the needed files?

$ tar -tvf unicode_fixtures.tar.gz 
-rw-rw-r-- pombreda/pombreda 9 2017-09-17 21:18 unicode_fixtures/pets/names.txt
-rw-rw-r-- pombreda/pombreda 11 2017-09-17 21:18 unicode_fixtures/.hidden
-rw-rw-r-- pombreda/pombreda  0 2017-09-17 21:18 unicode_fixtures/empty.txt
-rw-rw-r-- pombreda/pombreda  9 2017-09-17 21:18 unicode_fixtures/nums
-rw-rw-r-- pombreda/pombreda  3 2017-09-17 21:18 unicode_fixtures/foo bar
drwxrwxr-x pombreda/pombreda  0 2017-09-17 21:18 unicode_fixtures/pets/
-rw-rw-r-- pombreda/pombreda 23 2017-09-17 21:18 unicode_fixtures/users/index.html
drwxrwxr-x pombreda/pombreda  0 2017-09-17 21:18 unicode_fixtures/
-rw-rw-r-- pombreda/pombreda 18 2017-09-17 21:18 unicode_fixtures/todo.html
-rw-rw-r-- pombreda/pombreda  0 2017-09-17 21:18 unicode_fixtures/snow ☃/.gitkeep
-rw-rw-r-- pombreda/pombreda  6 2017-09-17 21:18 unicode_fixtures/users/tobi.txt
drwxrwxr-x pombreda/pombreda  0 2017-09-17 21:18 unicode_fixtures/users/
-rw-rw-r-- pombreda/pombreda 11 2017-09-17 21:18 unicode_fixtures/todo.txt
drwxrwxr-x pombreda/pombreda  0 2017-09-17 21:18 unicode_fixtures/snow ☃/

@pombredanne
Copy link
Member Author

This branch now has the latest @pjdelport backports.os 1.1 !

Copy link
Collaborator

@sschuberth sschuberth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I got confused about how GitHub displays the removal of empty files. Looking at https://github.com/nexB/scancode-toolkit/pull/758/files I see No changes. being displayed for both tests/scancode/data/unicode_fixtures/empty.txt and tests/scancode/data/unicode_fixtures/snow ☃/.gitkeep, but you can "guess" from the red boxes that these empty files have in fact been removed. Sorry about that.

FS_ENCODING = sys.getfilesystemencoding() or sys.getdefaultencoding()
FS_ENCODING = sys.getfilesystemencoding()
# normalize the encoding name
FS_ENCODING = codecs.lookup(FS_ENCODING).name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this change is still present with backports.os 0.1.1 in use. Shouldn't your either revert this change, or use the proper fsencode/fsdecode that you mentioned?

@sschuberth
Copy link
Collaborator

But is thirdparty/prod/backports.os-0.1.patch1-py2-none-any.whl still needed?

@sschuberth
Copy link
Collaborator

When downloading https://github.com/nexB/scancode-toolkit/raw/0ae20e45a735b660c60e5c1b2b8db449d83fc261/tests/scancode/data/unicode_fixtures.tar.gz and opening it with WinRAR, I still only see a snow directory:

snow

On Linux, it shows correctly, however, and also if I run tar -tvf from Git Bash. Very odd, I'll report this as a bug in WinRAR.

@sschuberth
Copy link
Collaborator

I'll report this as a bug in WinRAR.

FYI, the RAR devs already confirmed the bug in WinRAR 5.5 realted to Unicode chars in TAR files with PAX extended headers. It will be fixed in the next version.

 * see PiDelport/backports.os#7 for details

Signed-off-by: Philippe Ombredanne <[email protected]>
 * this ensures that the repo can be cloned on Windows

Signed-off-by: Philippe Ombredanne <[email protected]>
 * this fixes the issues for ASCII fs encodings
 * big thank you to @pjdelport for fixing this

Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne pombredanne force-pushed the 688-fsencode-fsdecode-755-unicode-test-files branch from 0ae20e4 to 9f5595d Compare September 22, 2017 05:30
@pombredanne
Copy link
Member Author

@sschuberth good catch on Winrar ;)
I do not use it though.
This latest commit should be all set now.

@sschuberth
Copy link
Collaborator

The code looks good to me now, but Windows CI still fails with

WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: 'c:\\users\\appveyor\\appdata\\local\\temp\\1\\scancode_appveyor\\tst\\ rbabmi\\dqxinc\\unicode_fixtures.tar.gz\\unicode_fixtures\\snow ?'

@pombredanne
Copy link
Member Author

pombredanne commented Sep 22, 2017

All should be passing now: I am skipping some tests on Windows. Merging soon!

@pombredanne
Copy link
Member Author

@sschuberth you wrote

But I doubt that's true. I can download and extract https://raw.githubusercontent.com/nexB/scancode-toolkit/d5f3116f0ef8b21c81e60a0a7c18eb46bb257630/tests/scancode/data/unicode_fixtures.tar.gz on Windows just fine. It's probably just a matter of doing it right programmatically.

I can download and extract things alright on Windows eventually using 7zip or extractcode... but they will mangle the names such that they are valid windows names which defeats the purpose of the test: Windows cannot git clone this nor could I find a way to recreate some of the express test files either.
So I have tried many ways but there is no way some tests run on Windows.... so I am going to merge this as it is now.

@pombredanne pombredanne merged commit 6b93456 into develop Oct 2, 2017
@pombredanne pombredanne deleted the 688-fsencode-fsdecode-755-unicode-test-files branch October 2, 2017 21:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants