Close

SQL for the fun of it: finding typing patterns in words (Part 1 of 3)

Part 2
Part 3

Yesterday, a coworker asked me a question just out of curiosity: “How would you determine which words can be typed with alternating hands?” He then added an additional criteria that he’d like to find the words that ended on the left hand (presumably, so you could press space or enter with the right.)

Obviously, any word could be typed with alternating hands; but in the spirit of the question, we’ll assume a standard U.S. QWERTY keyboard layout and normal typing technique with left hand resting position on “a-s-d-f” and right hand resting position on “j-k-l-;”. Thus the left-hand letters would be q,w,e,r,t,a,s,d,f,g,z,x,c,v,b and the right-hand letters would then be y,u,i,o,p,h,j,k,l,n,m. Also, the word list to be checked would have no nulls, thus there will always at least one key stroke needed.

He then proceeded to describe how he thought about it: Divide each word into odd and even letters. That is the letters typed on the 1st, 3rd, 5th, etc key stroke are the odd letters, and the 2nd, 4th, 6th, etc. are the even letters. Then check if all odd letters are left and all even letters are right, or vice versa.

I thought that was an interesting question and fairly straight forward approach so I spent a few minutes and came up with this to, more or less, implement exactly what he described.

SELECT word
  FROM (SELECT word,
               odds MULTISET EXCEPT DISTINCT  r x,
               evens MULTISET EXCEPT DISTINCT l y,
               evens MULTISET EXCEPT DISTINCT r w,
               odds MULTISET EXCEPT DISTINCT l z
          FROM (  SELECT word,
                         CAST(
                             COLLECT(
                                 CASE
                                     WHEN MOD(COLUMN_VALUE, 2) = 0 THEN SUBSTR(word, COLUMN_VALUE, 1)
                                 END
                             ) AS vctab
                         ) evens,
                         CAST(
                             COLLECT(
                                 CASE
                                     WHEN MOD(COLUMN_VALUE, 2) = 1 THEN SUBSTR(word, COLUMN_VALUE, 1)
                                 END
                             ) AS vctab
                         ) odds,
                        vctab('y','u','i','o','p','h','j','k','l','n','m') r,
                        vctab('q','w','e','r','t','a','s','d','f','g','z','x','c','v','b') l
                    FROM words,
                         TABLE(
                                 SELECT COLLECT(LEVEL)
                                   FROM DUAL
                             CONNECT BY LEVEL <= LENGTH(word)
                         )
                GROUP BY word)             
         WHERE SUBSTR(word, -1, 1) MEMBER OF l)
 WHERE (x IS EMPTY AND y IS EMPTY) OR (w IS EMPTY AND z IS EMPTY);

“vctab” is a simple collection type I already had defined.

CREATE OR REPLACE TYPE vctab AS TABLE OF VARCHAR2(4000);

The inner query would create the collections of keys needed for each word and each hand.

WORD                EVENS                       ODDS                     R                               L
------------------- --------------------------- ------------------------ ------------------------------  ------------------------------------
ahent               VCTAB(h,n)                  VCTAB(a,t,e)             VCTAB(y,u,i,o,p,h,j,k,l,n,m)    VCTAB(q,w,e,r,t,a,s,d,f,g,z,x,c,v,b)               
aid                 VCTAB(i)                    VCTAB(a,d)               VCTAB(y,u,i,o,p,h,j,k,l,n,m)    VCTAB(q,w,e,r,t,a,s,d,f,g,z,x,c,v,b)               
archaeographical    VCTAB(r,l,c,h,a,g,e,h)      VCTAB(a,a,i,p,r,o,a,c)   VCTAB(y,u,i,o,p,h,j,k,l,n,m)    VCTAB(q,w,e,r,t,a,s,d,f,g,z,x,c,v,b)               
argyrodites         VCTAB(r,e,i,o,y)            VCTAB(a,s,t,d,r,g)       VCTAB(y,u,i,o,p,h,j,k,l,n,m)    VCTAB(q,w,e,r,t,a,s,d,f,g,z,x,c,v,b)               
baklava             VCTAB(a,v,l)                VCTAB(b,a,a,k)           VCTAB(y,u,i,o,p,h,j,k,l,n,m)    VCTAB(q,w,e,r,t,a,s,d,f,g,z,x,c,v,b)               
bib                 VCTAB(i)                    VCTAB(b,b)               VCTAB(y,u,i,o,p,h,j,k,l,n,m)    VCTAB(q,w,e,r,t,a,s,d,f,g,z,x,c,v,b)               
boeuf               VCTAB(o,u)                  VCTAB(b,f,e)             VCTAB(y,u,i,o,p,h,j,k,l,n,m)    VCTAB(q,w,e,r,t,a,s,d,f,g,z,x,c,v,b)               
bullwork            VCTAB(u,k,o,l)              VCTAB(b,r,w,l)           VCTAB(y,u,i,o,p,h,j,k,l,n,m)    VCTAB(q,w,e,r,t,a,s,d,f,g,z,x,c,v,b)               
cacomorphia         VCTAB(a,i,p,o,o)            VCTAB(c,a,h,r,m,c)       VCTAB(y,u,i,o,p,h,j,k,l,n,m)    VCTAB(q,w,e,r,t,a,s,d,f,g,z,x,c,v,b)               
caitives            VCTAB(a,s,v,t)              VCTAB(c,e,i,i)           VCTAB(y,u,i,o,p,h,j,k,l,n,m)    VCTAB(q,w,e,r,t,a,s,d,f,g,z,x,c,v,b)    

Then it was simply a matter of applying the rules. Checking if a word should end with the left hand is simple enough: SUBSTR(word, -1, 1) MEMBER OF l.

To check if all even/odd are left/right or right/left is a little tricker; but not too bad.

  • odds MULTISET EXCEPT DISTINCT r – this will return any odd letter not typed with the right hand
  • evens MULTISET EXCEPT DISTINCT l – this will return any even letter not typed with the left hand
  • evens MULTISET EXCEPT DISTINCT r – this will return any even letter not typed with the right hand
  • odds MULTISET EXCEPT DISTINCT l – this will return any odd letter not typed with the left hand

If a word has a proper alternating hand typing pattern the pairs of odd/even must both be empty for either left/right or right/left (but not both.)

That worked, but it seemed cumbersome, so I wanted to rework it. I thought MATCH_RECOGNIZE might be a good option since this question is essentially just pattern matching.
That thought line took me to this:

SELECT *
  FROM (SELECT word, COLUMN_VALUE n, SUBSTR(word, COLUMN_VALUE, 1) c
          FROM words,
               TABLE(
                       SELECT COLLECT(LEVEL)
                         FROM DUAL
                   CONNECT BY LEVEL <= LENGTH(word)
               )
       )
 MATCH_RECOGNIZE(
     PARTITION BY word
     ORDER BY n
     PATTERN (^l? (r l)* $)
     DEFINE
         l AS (c MEMBER OF vctab('q','w','e','r','t','a','s','d','f','g','z','x','c','v','b')),
         r AS (c MEMBER OF vctab('y','u','i','o','p','h','j','k','l','n','m'))
 );

I liked this one better. For small sample sets of words it won’t make much difference but when I ran it on a dictionary of a few hundred thousand words this was much more efficient than my first attempt. The basic idea is to split each word into letters then check if the letters follow a repeating right-left pattern. I chose right-left instead of left-right, to ensure I ended with the left hand. To handle words that start with the left hand I make the first letter an optional left (l?).

I’m happy with that variant, but it does require 12c. So next I figured I would try using some analytic functions usable in older releases. Checking alternating hands would just be a matter of looking to see if the previous letter was not typed with the same hand as the current letter’s hand. I also decided to remove the correlated COLLECT subquery and instead do a CONNECT BY directly on the words table. Neither method is significantly better than the other but as long as I was exploring different syntax, I might as well try different driving queries too.

  SELECT word
    FROM (SELECT word,
                 n,
                 hand,
                 CASE WHEN n = 1 OR (hand != LAG(hand) OVER(PARTITION BY word ORDER BY n)) THEN 'yes' ELSE 'no' END
                     alternating,
                 LAST_VALUE(hand) OVER(PARTITION BY word ORDER BY n ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
                     last_hand
            FROM (    SELECT word,
                             LEVEL
                                 n,
                             CASE
                                 WHEN SUBSTR(word, LEVEL, 1) MEMBER OF
                                           vctab('y','u','i','o','p','h','j','k','l','n','m')
                                 THEN
                                     'right'
                                 ELSE
                                     'left'
                             END
                                 hand
                        FROM words
                  CONNECT BY word = PRIOR word AND LEVEL <= LENGTH(word) AND PRIOR SYS_GUID() IS NOT NULL))
   WHERE last_hand = 'left'
GROUP BY word
HAVING MIN(alternating) = 'yes'

With this approach I categorize each letter as being either right or left and check if each letter alternates hands from the previous letter. Then grouping by each word, if all letters are flagged as alternating from the previous letter I know my word alternates.

I also tried variants of the connect by in earlier queries and also using the correlated COLLECT subquery with the analytics. I didn’t find a significant difference in them. I’m sure there is a way to do this with the MODEL clause; but I have to admit I don’t use that much and thus will save that for another time when I get a little more practice with it.

This was a fun exercise, exploring several interesting and powerful features of Oracle’s SQL syntax, even if the goal was a bit silly.