debian tag cloud

Sponsored by
HostEurope Logo

debian tag cloud

All 1 2 3 4 5 6 7 8 9

Before we begin, everyone should read up on hashtables and what open addressing / closed hashing is. The context is lines 111‥190 of Python’s Objects/dictobject.c as of today (so we get the line numbers straight).

(I’ve reworded this wlog entry a bit; I originally wrote it too late at night for it to read coherent.) Basically, I’ve got an application where I’d like to use a hashtable for a number of things – not as generic as Python, and with focus on small footprint. I’d like to offer associative arrays in a scripting language, where the keys are always arbitrary byte strings excluding NUL. Also, I’d like to use the hashtable as backend for indexed arrays, where the keys are uint32_t and the usual use case is sequential. Finally, I’m using it for several internal tables, such as a list of keywords, one of builtins, one of special variables, etc. which is a reason for me to not use a self-balancing binary tree as data structure (reading further below might suggest that, but getting a sorted list of hashtable keys is not the focus, though not unimportant).
My questions on this are:

① Why is the shift on perturb done after its first use? In my experiments (using 32-bit width everywhere), for the pathological case of an 8-element (i = 3) table with three entries 0, 0x40000000 and 0x800000000, the “second round” yields 1 for all three, so it cannot have to do with the upper bits. My lookup looks like:

	mask = 2ⁱ - 1;
	j = perturb = hash(key);
	goto find_first_slot;

	 find_next_slot:
	j = (j << 2) + j + perturb + 1;
	perturb >>= PERTURB_SHIFT;
	/* FALLTHROUGH */

	 find_first_slot:
	entry = table[j & mask];
	if (!match(entry)) goto find_next_empty_slot;
 

This means that my first check is always the bare hash (so “only do it if needed” is no reason) and, since I’m using gotos, I could just move the perturb >>= PERTURB_SHIFT; line before the line recalculating the next j to use. This seems to make more sense, even in the face of Python. (I actually looked at the Python file’s comments again today because I thought to use a different resolution, but they have a good rationale for using the multiplication by 5.)

② Why can’t we just use i as the PERTURB_SHIFT? Sure, this changes a shift-right by a constant, which can possibly be encoded as immediate value in assembly (unless you’re on a pre-80186, which can only do SHR AX,1 and SHR AX,CL but not SHR AX,4, but that’s outside of mksh’s scope) into a right-shift by a variable, but i is already known, and I think the behaviour is better (it wouldn’t eat any bits; assume the same 8-entry hashtable and pathologic keys 0, 8 and 16). Again: who do I think I am to go against the wisdom of the Python people, who seem to have shed more thought on this than everyone else I saw, asked, read about (including Spammipedia). That’s why I’m asking here. On that reference: I don’t support spammers or people nagging for donations or premium accounts, like Xing and Groundspeak/Geocaching.COM, at all. In fact, I urge others to do the same, so it really hurts them; it may be their business model, but not if they spam me. Besides, OpenCaching.DE exists.

Another thing is: to avoid CVE-2011-4815, I’m randomising the hash used, with one “seed” value per hashtable, changed before a resize operation. I originally thought to seed it with nonzero, but then I have to rehash on hashtable resize, so I’ll be XORing the final hash value instead (thanks ciruZ for the idea). I’m thinking of omitting that for indexed arrays, as an attacker almost certainly cannot determine the keys there. (To directly use the indexed array keys, which are already uint32_t, as hashes makes using i from ② even more important.) The hash I’m using is a modified Jenkins one-at-a-time called NZAAT: it’s my new generic standard nōn-cryptographic hash, and the changes are thus: while adding a byte, another increment of the hash is done (so NUL counts), and the finaliser got prefixed with the shift-left-add+shift-right-xor sequence of the adder (but not adding any value or the +1), to get best avalanche for all bytes. I actually compiled several versions of Hash.cs on a Windows® VM at work to analyse the original one-at-a-time and all of my modifications; these turned out to be the simplest ones (I originally had added 0x100 instead of 1, but the effect was the same, and +1 is usually cheaper on most CPUs).

Also, to avoid people being able to get to the seed, a user will always get only a sorted list of hashtable keys (numeric for indexed arrays, ASCIIbetically otherwise; see also my thoughts on JSON from the previous wlog entry). What algorithm do I use? For strings, comparisons are much more expensive, so I’d like to keep them low. Memory use is also a factor; allocating one large(r) block is better than many small ones due to the pool allocator overhead and due to portability to ancient Unicēs (which is another reason for me to use a hashtable which is a small struct plus an array of pointers, and then pass the list of keys as array of string pointers, instead of a tree). For both reasons, I’m thinking a relatively simple MergeSort: I need to allocate the result array anyway, so I can just get two and free the one that isn’t the end result, and it’s AFAICT the cheapest on comparisons other than Tree Sort (which nobody really seems to use, and which would effect to using a balanced binary tree again). Since keys are unique, stability and duplicate handling is never an issue. I’d like to use only one algorithm and one data structure, not a combination, as compactness is a design goal.

Please drop your thoughts on Freenode, e.g. by /msg MemoServ send mirabilos your text here or per eMail to the domains debian, freewrt or mirbsd, which are organisations, with the localpart tg. Or just contact me as usual, if you’re already acquainted. Or lookup 0xE99007E0. Thanks in advance! (Especially, Python Developers’ thoughts are welcome.)

The following proposal extends the JSON specification, with the idea of using JSON as an information interchange format, rather than just a way of writing certain ECMAscript values. They do not add anything but only restrict valid JSON content and encoders with some rationale.

First of, I’d like to remind everyone, including JSON’s author, that JSON is case-sensitive, except in the four hexdigits after a backslash-u sequence in a String.

Second, I’d like to remind everyone that JSON is not binary-safe. No way around that, it implements Unicode (actually, 16-bit UCS-2, and it doesn’t guarantee that UTF-16 surrogates are correctly paired) text. I also consider only UTF-{8,{16,32}{B,L}E} valid encodings for JSON. (No PDP endian, either. Sorry, guys.)

For my first proposal, I’d like to point out CVE-2011-4815 which was about overflowing hashtables. The obvious fix is to randomise the hash per hashtable; to ensure this doesn’t leak, we sort ASCIIbetically the keys of an Object in the encoder. (Using Unicode is good here – we can just sort the keys as UTF-8 strings by their uint8_t value or as Unicode (UCS-2 or even UCS-4 or UTF-16) strings by the codepoints.) JSON was never preserving the order of elements in an Object anyway so we make it standardised (we still accept any order, and, when parsing, in collision cases, the later value wins). This also helps diffs.

For my second proposal, I’d like to forbid \u0000, \uFFFE, \uFFFF in strings. The first because many implementations use C strings, and for an information interchange format this is better; it also has security implications to allow NUL in a string. The other two, but not unpaired UTF-16 surrogates (as ECMAscript uses UCS-2 and got UTF-16 only later) because they’re not valid Unicode; JSON was not binary-safe already so why bother. Among other benefits, this also helps implementations.

For my third proposal, I’d like to agree that implementations should impose a nesting depth limit that may be user-defined, and in the face of which, cyclic checking may be ignored by an encoder. I emit nesting depth overflows as literalnull; might also throw an error. Since I was asked, the common “standard” value is to restrict nesting depth is 32, unless the user specified one. (I also saw 8, but 32 WFM pretty well.) Most seem to use it even if it may seem low at first. Only specialised applications probably need more, and they can always pass a value.

For my forth proposal, backslash-escape U+007F‥U+009F always. It may upset humans, editors, databases, etc. (This paragraph is newly added, after some IRC discussion.)

All these do not permit anything that wasn’t accepted to be accepted afterwards. I’ve got a fifth proposal which changes acceptance rules – but only for a subset of parsers: formally JSON is defined in ECMA-262 as industry standard that, in contrast to RFC 4627, always allowed any Value as top-level element of a JSON text. I’d like to make it so, and ignore the RFC’s requirement for it to be an Object or Array. Even so, the first two characters (after the BOM, if any) of a JSON text always are in the non-NUL 7-bit ASCII range, allowing for encoding detection. (This is done by the NUL octet pattern in the first four octets.)

JSON has only taken off because it’s a tightly defined simple format that can be used “everywhere” and isn’t too awful for humans (escaping not needed for U+0020‥U+D7FF and U+E000‥U+FFFD after all, although I’d also take the C1 control characters out, see my forth proposal above). I’ve started to use a trailing comma in indexed and associative arrays in code I write at work, when the array values are one a line, to help version control systems to do their diffs, but refrain from asking for a JSON extension to permit that in order to not endanger compatibility any (no comment needed, it’s just not worth it), but I’d like my above proposals to be followed by implementators (and I’m one of them).

Some more discussion with Jonathan pointed out that JSON5 allows for trailing commata in Object and Array; IMHO the only feature of it that is not bad or outright harmful. I’ll probably keep from accepting them because, on their own, they’re not that useful, and I usually would run JSON texts, even configs, through a parser/encoder roundtrip to pretty-print them which would lose them anyway.

As for binary-safeness: probably best to just use base64 and let the outer layers worry about compression. The data is usually unrelated to the JSON-encoded structure, and even if it’s related to other data the base64 representation is usually similar (unless misaligned).

Update 02.12.2012 – Wrong I was about the first two characters: “"€"” is a valid JSON text. Still possible to peek at four octets and determine the encoding by ordering the tests; updated my notes.

I’ve been debugging a weird problem at work – after upgrading a complex system from lenny to wheezy, some https clients failed to connect: GNU wget and Debian’s version of lynx(1) which is linked against libgnutls26 fail. NSS applications continue to work, as does cURL; wget and lynx on MirBSD (linked with OpenSSL of course) work. Even Debian’s gnutls-cli tools from both gnutls26 and gnutls28 work. Huh. The error_log shows renegotiation problems, yet setting the new Apache 2 configuration option to “use insecure renegotiation” doesn’t help either. (The option is a total #FAIL: its only other value is “use secure TLSv1.x renegotiation”, but I don’t want/need SSL renegortiation at all, anyway.) Natureshadow told me this was a hot issue on Debianforum at the moment, yet, nobody had a clue or enough information to file a formal bugreport against (initially) apache2, as that’s what changed. I tracked it down on a new VM with no configuration otherwise, and here are my findings so others don’t run into it.

Tracking down the problem, this can be reduced to the following configuration (minimised, to show the problem) in /etc/apache2/sites-enabled/1one:

	<VirtualHost *:443>
		ServerName wiki-70.lan.tarent.de
		RedirectMatch permanent . https://evolvis-70.lan.tarent.de/
		SSLEngine on
		SSLCertificateFile /etc/ssl/W_lan_tarent_de.cer
		SSLCertificateKeyFile /etc/ssl/private/W_lan_tarent_de.key
		SSLCertificateChainFile /etc/ssl/godaddy.ca
	</VirtualHost>
 

Do not mind the actual content, this is a very stripped-down demo on a not-actually-set-up-yet box.

Same is valid for the companion configuration file /etc/apache2/sites-enabled/2two:

	NameVirtualHost *:443

	<VirtualHost *:443>
		ServerName evolvis-70.lan.tarent.de
		SSLEngine on
		# workaround for BEAST (CVE-2011-3389), short-term
		SSLCipherSuite RC4-SHA
		SSLCertificateFile /etc/ssl/W_lan_tarent_de.cer
		SSLCertificateKeyFile /etc/ssl/private/W_lan_tarent_de.key
		SSLCertificateChainFile /etc/ssl/godaddy.ca
		SSLProtocol TLSv1
	</VirtualHost>
 

Turns out the BEAST workaround was at fault here: the differing SSLCipherSuites between the vhosts (on the same Legacy IP / TCP Port tuple, as we use Wildcard SSL Certificates) made Apache 2 want to renegotiate, so either commenting it on 2two or, better, adding it to 1one helped. Interestingly enough, the SSLProtocol directive did not matter (in my tests).

So, keep SSL settings synchronised between vhosts. In fact, those were already from include files, but 2two was from the “Evolvis 5” generation, whereas we added to 1one an Include of the httpd.ssl1.inc file generated by the previous releases of EvolvisForge and had not switched those legacy vhosts to the new configuration, as everything worked on lenny.

This wlog entry brought to you by the system administrators of tarent solutions GmbH and the Evolvis Project, based on FusionForge.


Update 17.05.2013 – Absolutely do not use RC4-SHA for SSL/TLS (https)! It can leak over 200 initial plaintext bytes easily. (arc4random(3) is not affected from this, especially on MirBSD, nor arc4random(9).)

Originally posted by bubulle on Planet Debian, a shell prompt that displays the current git branch, in colour on some terminals, after the current working directory. The following snippet does similar things for mksh users, except it doesn’t redefine your prompt but amend it – just throw it at the bottom of your ~/.mkshrc before that last line beginning with a colon (copy from /etc/skel/.mkshrc if you haven’t done that yet):

	function parse_git_branch {
		git branch 2>/dev/null | sed -n '/^\* \(.*\)/s//(\1)/p'
	}

	function amend_prompt_with_git {
		local p q='$(parse_git_branch)' r

		if [[ $TERM = @(xterm-color|xterm|screen*) ]]; then
			if [[ ${PS1:1:1} = $'\r' ]]; then
				p=${PS1:0:1}
			else
				p=$'\001'
				PS1=$p$'\r'$PS1
			fi
			q=$p$'\e[1;33m'$p$q$p$'\e[0m'$p
		fi

		p=${PS1%%*( )[#$]*( )}
		if [[ $p != "$PS1" ]]; then
			# prompt ends with space + #-or-$ + space, we can amend
			r=${PS1: ${#p}}
			PS1=$p$q$r
		fi
	}
	amend_prompt_with_git
	unset -f amend_prompt_with_git
 

The indirection by use of a function is not strictly necessary but allows the use of locals. I took the liberty of adding an asterisk after “screen” to match the GNU/Linux nonsense of having TERM=screen.xterm or somesuch.

KiBi is my hero of the day. I’ve long wondered why I couldn’t select fixed-misc as font on my workstation at the dayjob, which is running K?buntu Hardon Heroin. (Luckily, I managed to avoid upgrading to Prolonged Pain.) Now I guess that’ll work again.

My work laptop (running testing) also has got this X.org thingy. My keyboard layout now has got a grml branch (named after the person who first cursed about the insane idea of those toy-breaking boys to rearrange the keycodes) that works with it. Since Debian is marginally more sane than K?buntu, in contrast to the gnu branch I use on my orkstation, the grml branch still has Meta on the left Alt key, not Mode_switch, as it still works in uxterm, which reduces the diff between the MAIN branch (HEAD) on XFree86® and this beast.

And finally: X.org defaults to a black screen and disabled mouse pointer until an application first requests it. Totally unacceptable for evilwm(1) users, and letting people think it crashed, to boot. The Arch Linux guys found this, among others; the fix is: startx(1) users edit /etc/X11/xinit/xserverrc to add -retro behind the X, or copy the file to ~/.xserverrc and change it there:

	#!/bin/sh

	exec /usr/bin/X -retro -nolisten tcp "$@"
 

For display managers, similar files exist in /etc/kde4/kdm and related places.

Update: Also, newer xterm(1) justify an update to ~/.Xresources for we can finally get rid of cut buffers, and get a blinking underline cursor to boot!

On the other front, worked on Debian packaging, and upstream on pax(1) and jupp, with more things to follow (especially in mksh). Also fixed about ⅔ Linux klibc architectures and learned why I’m a BSD developer despite all the bad parts of it ☺ and fixed fakeroot with pax(1) on Hurd… incidentally in code originally designed to support the Linux pax. My dayjob’s keeping me busy, but I’ve got plans to run mksh(1) through Sonar, in addition to the static code analysēs done by (once again, thanks!) Coverity (commits to mksh pending) and Clang/LLVM scan-build. Uhm, what can I say more, grab me in IRC if you need it. Ah, and some other mksh things coming up that may be of interest to people needing to support legacy scripts.

While wtf(1) always has been a bit central to MirBSD, and the acronym database has been accessible by CVSweb, what we never had was a DAU compatible (and shellsnippets compatible) lookup. This has now changed: the above link to the acronyms file is a persistent link to its latest version (well, latest when the website was last recompiled), tooltips may very well follow soon, and we’ve got an online WTF lookup service.
Contributions to the acronym database are welcome, of course; just eMail them to tg@mirbsd.org.

Not to stop there, our online HTML manpage search is also new, shiny, and should replace the “!mbsdman” DuckDuckGo hash-bang shortly. (Both of these services offer a DDG search as fallback. Note that DDG is an external service included herein by linking, under their request to spread it, and not affiliated with The MirOS Project. They do, however, donate some advertising money to Debian.)
For all those who didn’t know: only manpages for software in the MirOS BSD base system and for the MirPorts Framework package tools are listed, not for third-party applications installable using ports or, recently, pkgsrc®. Still, if you want to have a peek at a modern classic BSD’s documentation, you’re welcome. (Not to mention content like re_format(7) and style(9) and that some of our documentation is much more legible than others.)

And because writing all that perl(1) made me ill, not to mention I don’t even know that language, I’ve hacked a bit more in the mirmake(1) and mksh(1) parts of the MirWebsite, finally implementing pointing out where in the navigation sidebar the visitor currently is.

We also have exciting mksh porting news involving RT trying a larger number of ancient platforms than I dare count, me fixing bugs in Linux klibc and diving into other things, learning more about why I consider me lucky for hacking a BSD operating system… sorry, I want to keep this short as it’s mostly an announcement.

The MirWebsite source code is, of course, also available. Improvements welcome. Except for these three CGIs, our website is fully statically precompiled, and that’s a good thing. Please help in making the CGIs secure.

On MirBSD and other sane OSes, you can just press ^T (Ctrl-T) when dd(1) runs; this sends it a SIGINFO (cf. sigaction(2)) which asks it to display (progress) information to the tty. This includes kFreeBSD, btw.

Update 07.01.2012 – this also works on Hurd. Linux neither has SIGINFO nor (cooked mode tty) support for it.

There’s also pv:

	dd if=/dev/mapper/vg01-${customername}--hudson bs=1048576 | \
	    pv -pter -B 1048576 -s 85899345920 | \
	    xz -0 >/mnt/ci-${customername}-snap-20120105-lenny.img.xz
 

I used this At wOrk today to back up a Jenkins VM before upgrading its underlying operating system for evaluation. Here, the -s flag is the total size (in bytes; don’t forget to multiply by 1024 when reading from Linux’ /proc/partitions) so pv can calculate a total and an ETA; -B is the same as bs; and xz is the currently best compressor to use, in any situation, unless you must stay compatible to gzip(1)-only systems. (Except that it’s not under an Open Source licence.)

clpbar might also be worth looking into. XTaran points out sid has this as bar.

PSA: Last of June, 2012, will be a leap second.

This is both a release announcement for the next installment of The MirBSD Korn Shell, mksh R40b, and a follow-up to Sune’s article about small tools of various degrees of usefulness.

I hope I don’t need to say too much about the first part; mksh(1) is packaged in a gazillion of operating environments (dear Planet readers, that of course includes Debian, which occasionally gets a development snapshot; I’ll wait uploading R40c until that two month fixed gcc bug will finally find its way into the packages for armel and armhf). Ah, we’re getting Arch Linux (after years) to include mksh now. (Probably because they couldn’t stand the teasing that Arch Hurd included it one day after having been told about its existence, wondering why it built without needing patches on Hurd…) MSYS is a supposedly supported target now, people are working on WinAPI and DJGPP in their spare time, and Cygwin and Debian packagers have deprecated pdksh in favour of mksh (thanks!). So, everything looking well on that front.

I’ve started a collection of shell snippets some time ago, where most of “those small things” of mine ends up. Even stuff I write at work – we’re an Open Source company and can generally publish under (currently) AGPLv3 or (if extending existing code) that code’s licence. I chose git as SCM in that FusionForge instance so that people would hopefully use it and contribute to it without fear, as it’s hosted on my current money source’s servers. (Can just clone it.) Feel free to register and ask for membership, to extend it (only if your shell-fu is up to the task, KNOPPIX-style scripts would be a bad style(9) example as the primary goal of the project is to give good examples to people who learn shell coding by looking at other peoples’ code).

Maybe you like my editor, too? At OpenRheinRuhr, the Atari people sure liked it as it uses WordStar® like key combinations, standardised across a lot of platforms and vendors (DR DOS Editor, Turbo Pascal, Borland C++ for Windows, …)

ObPromise: a posting to raise the level of ferrophility on the Planet aggregators this wlog reaches (got pix)

benz’ wedding, fun before

24.10.2011 by tg@
Tags: debian event fun

My dear MirBSD co-developer Benny did not only get his Doctor title but also recently married. There will be another post detailing this, including better photos of the two Doctors and the cake (with a Dæmon she made herself) on the wlog, but this is some fun beforehand:

No GPL cars!

Apparently, it is forbidden in France to drive GPL cars. (Without safety valve – but you have to admit the picture was fun. And we were like WTF? since the thing actually meant is LPG in German. Just like UTC is CUT (Coordinated Universal Time) in English, TUC (Temps Universel Coordonné) in French…)

I’m also working on improving our xterm(1) and GNU screen config, and other things. Explaining acronyms on our webpages is also coming some time. Benny is importing weird stuff from TNF for better pkgsrc® support, so there is activity. Just we’ve got dayjobs and a life… and mksh(1) still rocks (pdksh got orphaned in Debian today).

eMail

06.10.2011 by tg@
Tags: debian pcli rant tip

Would MTAs please stop sending hi-bit7 messages to other MTAs which do not advertise 8BITMIME! Recode it to QP or BASE64, damnit! The receiving MTA is entitled to strip the set bit7, which kinda makes things hard to read (while I know how to deal with blvde Stra_e, the advent of UTF-8 makes that blC6de StraC?e, introduces C0 control characters and makes typographic quotation marks into NUL-containing octet sequences (as their UTF-8 representation contains 0x80 octets) which let every sensible MDA terminate the line there). I even filed in the Debian BTS against the BTS (might be Drexim's fault, though).

Would MUAs please default to Quoted-Printable!

And mail hosters should use the same server when retrying delivery, to benefit greylisting. Or at least publish a list of outgoing IPv4 addresses they use for sending. Or use IPv6. Oh, and STARTTLS, while we are on my wishlist.

It's a sad day when the percentage of correctly encoded eMail messages in my INBOX is smaller than that of my Spambox...

Improvements welcome

30.09.2011 by tg@
Tags: debian tip

No I don’t really know any SQL. In fact, even at vocational school, where we focussed on database normalisation anyway, I tried hard to avoid the topic. Feel free to access here my entire knowledge about SQL ☺ (I did use Amaya, Arena and Arachne though. Liked only Arachne out of these three, and then, only under DOS, not its Unix version. Maybe the WWW could be named AAA instead? But then, lynx(1) is the one true browser…)

Ah, well. While at it… the entirety of my Perl knowledge is here: perltoc(1) with quick links to perlfunc(1).

The entirety of my (X)HTML and ECMAscript knowledge, DE: SELFHTML; although, the spec and DTD helped; and to write my notes on JSON, I took a peek at the formal ECMAscript spec as well… à propos, does anyone know a (good enough) indent(1) equivalent for ECMAscript, as I am trying to strip down some, inherited (GPL, yes) code for a hobby project, but Geo-people seem to produce illegible code?

Our MirBSD online manual pages and other assorted BSD documentation (except of course the merely copied ncurses, lynx etc. documentation and the texinfo generated HTML pages) has just gained a major facelift. They look alike in lynx(1) – best web browser ever – and less(1)/man(1) now, and remind of a DEC VT420 on a CSS capable Buntbrause.

Thanks to our contributor XTaran for aid with the colour scheme!

Since these are generated from catmanpages, heuristics are used for things like where should bold/underline begin/end (since nroff(1) is not always the brightest… but working on that), and hyperlinks can only be generated for other manpage references (whose targets may or may not exist, for example if they aren’t part of MirOS base/XFree86®). But on the other hand, Valid XHTML/1.1 and CSS speaks for itself ☻☺

All 1 2 3 4 5 6 7 8 9

MirOS Logo