Timothy Lethbridge's ideas on Technology and Politics: November 2011

Thursday, November 24, 2011

US companies dominate in patent power, and software dominates in patent category

IEEE Spectrum has published its annual statistics on patents. They show the dominance of US companies, such as IBM and Microsoft, and the rapid growth of Apple. The overview article is here.

The ranking approach uses what they call 'patent power' and takes into account the growth, impact, generality and originality of patent portfolios using various methods. The 2006 article describing the methodology is here.

Here are a few observations:

Top companies in terms of patent power in 2010: The companies with the most powerful patent portfolios, across all industries, are listed below; these are the ones with patent power over 2500. Companies were listed in just one category, even though they may have patents in several categories. In these numbers I have combined the figures where one company has bought another (e.g. Oracle having bought Sun).

IBM (8402 in Computer Systems)
Microsoft (7146 in Computer Software; this does not include the Nortel patents they bought in a consortium led by Apple; note also that if Microsoft buys Yahoo, as news articles suggest they are considering, the combined company would top the list at 9910)
Johnson and Johnson (6796 overall; 3610 in Biotechnology and Pharmaceuticals; plus 2810 through Ethicon Endo-Surgery and 376 through DePuy Spine)
Medtronic (5540 in Medical Equipment)
Covidien PLC (4544 overall, the top non-US company; 3333 in Medical Equipment; plus another 718 through Nellcor Puritan Bennett and 493 through Mallinckrodt)
Oracle (4171 overall; 3129 in Computer Software, plus 558 through BEA, 362 through Siebel, and 122 through Sun)
Samsung (4033 in Semiconductors; second-to-top non-US company)
Cisco (3299 in Communications/Internet Equipment)
Qualcomm (3170 in Communications Equipment)
Yahoo (2789 in Communications/Internet Services)
Apple (2764 in Electronics; this does not include the Nortel patents they bought with Microsoft and others - with their share of these, they would likely be in 10th place)
Hitachi (2669 overall; 2531 in Electronics; plus 138 through Hitachi Global Storage)

Software-related patents: I am particularly interested in software patents, since I think they are in general counterproductive, although it remains a necessary evil to keep patenting in this domain until the changes I have suggested are made. The rankings show IBM dominating the Computer Systems category; Microsoft and Oracle dominating the Computer Software category; and Samsung, Yahoo and Apple with a strong presence in other categories that also include software patents. Clearly software patenting is of dominant importance in the patent world, which is unfortunate. It also seems to be increasing rapidly, which bodes badly for ordinary software developers.

Categories with relatively low power patent presence: There are no companies with patent portfolios above 2500 in domains such as aerospace, automotive and chemicals. It is notable that General Motors (1030) is still doing well despite its troubles in the last recession; it is only just behind Toyota (1272) in the automotive category.

Canadian companies: Top Canadian companies are RIM (1064, Communications/Internet Services) and Magna (553, Automotive).

Google: I thought it notable that Google (2165, Communication/Internet Services) was ranked well behind IBM, Samsung, Microsoft, Oracle, Yahoo and Apple, companies they compete with or have intimate relationships with (e.g. Samsung is the top seller of Android systems). Google seems to be losing its reputation as an innovator. See yesterday's post for related comments. However, this year Google has bought some IBM patents, so if the list were re-created with 2011 data it might look a bit different.

Absolute numbers: In absolute numbers of patents (not 'patent power'), IBM also dominates with 5905, but Samsung comes second with 4599. Other companies are far, far behind.

Wednesday, November 23, 2011

In fear of the great Google Shutdown train: What could be next?

Google has been in a frenzy of 'cleanup' activity over the last few months. It is beginning to make me feel positively uncomfortable.

Google's latest blog post announces the shutdown of a number of services that are not doing well, notably Google Wave and Knol.

I think it is the latter that alarms me the most. They point out that the proposed replacement for Knol is WordPress based. Yet they have their own Blogger service (which you are using right now to read this) that competes against WordPress. Does that mean that the writing will be on the wall some time in the future for Blogger? It certainly makes me think about moving this blog elsewhere.

In addition to Blogger, I am heavily invested in Google Code for the Umple project. Will Google Code be on the chopping block too at some point? After all, like Blogger it is a service that Google provides on a largely public-service basis. Google is already shutting down 'Code search' with no announced replacement, and has severely limited the usability of Google Groups, as I have previously commented.

I think that if I had read all these announcements a year ago when I was first starting blogging and open-source development, I would likely have chosen different platforms. I imagine many people just setting out to use such online services will think the same thing. I chose Google because it is a large company with a reputation for stability, innovation and beneficence. Two of my factors in choosing Blogger and Google Code over WordPress and GitHub were the integration with Google accounts and Google search. However, long-term stability trumps all. Using Google services puts me at the mercy of Google Shutdown.

I will persist with Google for now, hoping their 'Don't be evil' mantra wins out in the end. However, I am continuing to take steps to protect myself. These include making all links to the Umple project go through the umple.org domain, so I can relocate them if I have to, and regularly backing up my Subversion repository and blog. I will be looking into mirroring my blog on a domain and server I have full control over.

I also hereby ask Google:

To make formal 10-year service guarantees for its services in which people invest huge amounts of personal time, like Google Code, Blogger and Google Groups.
To extend these guarantees each year, so any potential shutdown would always be 10 years out
To publish usage statistics trends of its services so we can be confident that we are using services that are not dwindling in numbers of users.
To keep data in shut-down services visible in a read-only manner without limit (e.g. keeping Knol URLs active for reading indefinitely, so links that point to them never go stale). There would be very little cost in doing this.

I think that unless Google does this, more and more people will migrate away, and fewer and fewer new users will adopt these services, resulting in a self-fulfilling prophecy of shutdown due to low usage.

And lest people think I am being unfair to Google, or risking having Google push me personally off its services for criticizing it, I have been just as critical of Apple, and have not even ventured near Microsoft's online services since they have historically been too closed. Google is still an excellent online service provider for blogging, open-source code hosting, and mailing lists, and is still probably a lower risk than smaller companies regarding potential shutdown. But overall, I now do not consider it a low risk in this regard.

Also in defense of Google, they do make their announcements in a reasonably friendly way and try to suggest alternatives. I do wish in their announcements they would tell us the number of users they estimate to be affected (both content providers and readers).

Wednesday, November 16, 2011

Give downloadable files more meaningful names

More and more websites give the option to download files containing such things as bills, software installers (or source code), receipts, posters, brochures, e-books and reports. People then need to file these downloads and keep them for extended periods.

Unfortunately it is common for the downloaded files to have meaningless names such as 'Document', 'Results', 'File', 'Report', 'Data', some arbitrary internal identifier, or other names that are not as helpful as they should be.

I propose the following general guidelines for downloadable files. Any downloaded file name should include:

Some way to identify the file's origin, such as the company name or website
Some succinct way to identify the type of content, such as 'bill', 'receipt', the software product etc.
The date. Files stored on computers have a date created by the operating system. But this is not part of the filename and is reset when you copy or edit a file. When the date is intrinsic to the data in a file, such as the date of issue of a report or receipt, the month of a bill, the release date of software or a press release, etc., then this information should be part of the filename on a permanent basis. Very few downloaded filenames have the date. The date should be in yyyy-mm-dd format to facilitate sorting by date. The date should be the date on which the data being sent was created, not the date of download. If it is a monthly report, always issued at the end of the month, then the day can be omitted.
If appropriate, a version number. Many software downloads have this, but too many do not.
If necessary, some way to ensure downloads of the same type but with different content are distinct. For example, if you are an avid sports fan and regularly download spreadsheets with sports statistics, you might want the time to which the data is valid, not just the date, as part of the filename.

For example:

SurveyMonkey's downloads appear as 'Results.zip'. When you unzip the file, the internal contents are a little nicer, for example 'SurveySummary_' followed by the date at which data was last updated. However, there is no way to distinguish among the various different surveys I am running, nor the filters or collectors I have applied when generating the data.
A recent download of Silverlight just said 'Silverlight.dmg'. There was no version number or release date.
The paystubs I receive from my employer say 'paystub' followed by a serial number, but do not identify the date of the pay in the filename.
A report I downloaded from the CRTC yesterday (subject of a future blog post probably) at URL http://www.crtc.gc.ca/eng/archive/2011/2011-703.pdf resulted in a filename that just has the year and a serial number, and no identification of its source (CRTC) or content.
When I download my Rogers bill from EPost it says 'RogersBill', but does not include the month of the bill. RogersBill-2011-11-15 would be better. Adding the account number on the end, or perhaps the account-holder's name, might also be useful to account for situations where people have to manage the bills for several accounts. I would even go as far as to tack on the bottom-line total (dollars spent or owing) to bills, receipts and other forms of statements. When later reviewing a long list of receipts, for example, this can prove extremely useful.
When I download an investment report from my investment company, it just says 'ClientReport.pdf'. The company, date and content description are missing. Here, the name of the investor would also be useful, to handle situations where one is downloading reports for different family members.

And on and on. In all these cases, people are forced to edit the filename after download. I can't count how many times I have found a file in my downloads folder and had to open it to remember its contents before editing its filename and then putting it in the correct place on my disk.

This aspect of usability seems to be overlooked by a very high percentage of software engineers and web designers. All the designers seem to think about is the content of the files, and perhaps the URL and the location of the file on the server. They forget that the file will have a life of its own and need to be identifiable once it leaves the server and resides on a client's computer.

One can, of course, go too far. The filename must be kept to a manageable length; at one time the absolute maximum was 8 characters in DOS (not including the extension), then 32 characters, and now typically about 255 bytes (which means fewer than 255 characters if non-ASCII Unicode characters are used). A human-usable limit is, I think, about 60 characters. There is a useful Wikipedia page describing the absolute limits and the characters that can be used. The following are some fictitious examples of meaningful download filenames that are kept to a reasonable length, adhere to the guidelines above, and are hopefully self-explanatory.

TeledirectCableBill-2011-11--66.13--acct9876543.pdf
Supersoft-FuriousCowsInstaller-2011-11-16-v1.2.3.dmg
GovNY-TaxLaw-ProposedChgs-2011-11-16-rpt87924.pdf

I recommend not using spaces in names, to make it easier for certain programs and shell scripts to process files. CamelCase and hyphens are useful to separate elements.
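As a rough illustration of how a server might assemble such names, here is a minimal Python sketch. The function download_filename, its parameters, and the two limit constants are hypothetical names chosen for this example; the 60-character cap is simply the human-usable limit suggested above, and the 255-byte check assumes the name is stored as UTF-8.

```python
from datetime import date
from typing import Optional

MAX_BYTES = 255  # typical absolute limit mentioned above, counted in bytes
MAX_CHARS = 60   # the human-usable limit suggested above


def download_filename(origin: str, content: str, issued: date, ext: str,
                      version: Optional[str] = None,
                      extra: Optional[str] = None,
                      include_day: bool = True) -> str:
    """Build origin-content-date[-vVersion][-extra].ext (hypothetical helper)."""
    # yyyy-mm-dd sorts chronologically; monthly documents can drop the day.
    stamp = issued.strftime("%Y-%m-%d" if include_day else "%Y-%m")
    parts = [origin, content, stamp]
    if version:
        parts.append("v" + version)
    if extra:
        parts.append(extra)
    name = "-".join(parts) + "." + ext
    if any(ch.isspace() for ch in name):
        raise ValueError("filename contains whitespace: " + repr(name))
    # Non-ASCII characters take more than one byte in UTF-8, so check both.
    if len(name) > MAX_CHARS or len(name.encode("utf-8")) > MAX_BYTES:
        raise ValueError("filename too long: " + repr(name))
    return name


print(download_filename("Supersoft", "FuriousCowsInstaller",
                        date(2011, 11, 16), "dmg", version="1.2.3"))
# prints: Supersoft-FuriousCowsInstaller-2011-11-16-v1.2.3.dmg
print(download_filename("Rogers", "Bill", date(2011, 11, 15), "pdf",
                        include_day=False))
# prints: Rogers-Bill-2011-11.pdf
```

The date passed in should be the date the data was issued, not the time of the download request, which is why it is a required argument rather than defaulting to today.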

Related situations

If the ability to download a pdf file is not available, it is common for people to save bills, receipts, and similar documents using various 'print to pdf' capabilities, such as those built into the Mac OS X print function. In these situations, the created pdf file will have a name derived from the html page's title. All rules I described above should therefore also be applied to the html title for pages that are commonly printed (bills, receipts, etc). I would go so far as to use a filename convention without spaces in the title of such html pages, specifically to facilitate one-step printing to pdf.

When emailing files to people, many senders just grab a file from their filesystem and add it as an attachment. People should get in the habit of giving meaningful names to files when they store them in their filesystem, and not just rely on identifying the file by way of its location in the filesystem; that helps prevent sending attachments with meaningless names. All too often I have received and had to save attachments with names such as report.docx.

Even when sending JPEG or other media files, the naming convention comes into play. I love the Olympus style of naming jpegs that embeds the date of the picture and a serial number right in the filename, e.g. PB171234.JPG for a picture stored in the 11th month of the year (B), on day 17, and with serial-number 1234. Many other cameras use filename formats that are not nearly so useful, like IMG_9999.JPG (see this Wikipedia page for naming convention details). In fact, the file naming convention alone is enough to tilt me towards a certain brand of cameras. When sending or uploading images, people should also consider adding a description of the contents to their filenames. All this metadata can be stored using EXIF attributes, but most people aren't sophisticated enough to browse using a tool that is EXIF-savvy.
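To show how much can be recovered from an Olympus-style name, here is a small Python sketch that decodes the example above. The function parse_olympus_name and the OlympusName type are hypothetical names for this illustration. The month codes assume the pattern 1-9 followed by A, B, C for October to December; only B (the 11th month) is confirmed by the example in this post. Note that the name carries the month and day but not the year.

```python
import re
from typing import NamedTuple

# Assumption: months 1-9 are digits, then A=10, B=11, C=12.
MONTH_CODES = "123456789ABC"

# 'P', one month code, two-digit day, four-digit serial number, '.JPG'
_PATTERN = re.compile(r"^P([1-9ABC])(\d{2})(\d{4})\.JPG$", re.IGNORECASE)


class OlympusName(NamedTuple):
    month: int
    day: int
    serial: int


def parse_olympus_name(filename: str) -> OlympusName:
    """Decode month, day and serial number from a name like PB171234.JPG."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError("not an Olympus-style name: " + repr(filename))
    month = MONTH_CODES.index(match.group(1).upper()) + 1
    return OlympusName(month, int(match.group(2)), int(match.group(3)))


print(parse_olympus_name("PB171234.JPG"))
# prints: OlympusName(month=11, day=17, serial=1234)
```

A name such as IMG_9999.JPG raises ValueError here, since it carries no date information to decode.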

I note with interest that even my own projects are sometimes guilty of inadequate names for downloads. For example in UmpleOnline, you can download a directory of generated code. The download has a name like 'JavaFromUmple.zip'. The source is identified (Umple); the content is identified (Java), but neither the name of the model nor the date and time are currently identified. Both of the latter would be useful. We will work on this problem.

Thursday, November 10, 2011

Stop wasting researchers' time: Drastically simplify grant selection

Today, someone sent me this excellent blog post by UBC mathematics professor N. Ghoussoub discussing the immense amount of valuable time professors waste applying for grants with low acceptance rates.

I have made the conscious decision to not even bother applying to grants with low acceptance rates. Even though I think my research is quite respectable, and I have over the years received a decent number of grants, simple economics tells me that it would be better to invest my time in doing actual research.

What is the solution? Clearly professors need grants to hire graduate students, buy equipment and travel to conferences. Junior professors also need prestigious grants to boost their careers. However, application processes for all grants should be streamlined.

There are excellent moves in this direction underway. The Canadian Common CV system will soon be in use by most granting agencies. Hopefully that will eliminate the need for researchers to spend any time in grant applications talking about their past research. Grant applications, at least for established researchers, should be primarily based on the CV (papers published, graduate students trained, etc.).

Let's take this three steps further:

There should be a 'common research proposal system'. A researcher would write a set of descriptions in a standard format for pieces of research they want to accomplish -- a maximum of 2-3 pages each, with a maximum of one page devoted to literature review, and few or no budget details. The proposals would be open to public scrutiny and comment. A proposal could be as focused as presenting a specific experiment the researcher wishes to conduct, or could outline a general set of research objectives, with their rationale. Proposals could be enhanced or removed as time goes by. The professor might actually finish some units of work, or might improve their ideas, for example.
The common-CV system should be enhanced to automatically tag each publication with data regarding citations. It should also compute indexes like the H-Index (excluding self-citations), G-Index, and variants of these that weight more recent work more highly.
Have a common simplified grant-application system. A researcher would select the grants they want to apply for, indicate which of their research proposals they would like to work on with each applied-for grant, and the amount of time they would want to spend on each task. Except for expensive equipment that requires quotes, the budget would simply indicate the number of masters, PhD and postdoctoral students, plus technical and administrative assistants that would be needed. The system would compute the budget based on standard salary rates plus an allowance for conference travel and basic equipment and supplies per researcher (the amount for these would be standardized for each field). This point about budget is important: In current application processes, professors have to write detailed budgets, but then almost never get the amount of funds matching the budget; the current process just forces professors to write essentially-fictitious budgets.
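The budget computation described in the last point could be as simple as the following Python sketch. Every number here is a made-up placeholder, not a real agency rate, and the names STANDARD_SALARY, ALLOWANCE_PER_RESEARCHER, RESEARCHER_ROLES and compute_budget are hypothetical. The sketch also assumes that the per-researcher allowance applies to students and postdocs but not to technical or administrative assistants, which the proposal above leaves open.

```python
from typing import Mapping

# Hypothetical standard yearly salary rates (placeholders only).
STANDARD_SALARY = {
    "masters": 20000,
    "phd": 25000,
    "postdoc": 45000,
    "technical": 50000,
    "admin": 40000,
}

# Hypothetical per-researcher allowance for conference travel plus basic
# equipment and supplies; in the proposal this would be standardized per field.
ALLOWANCE_PER_RESEARCHER = 3000

# Assumption: only these roles count as researchers for the allowance.
RESEARCHER_ROLES = {"masters", "phd", "postdoc"}


def compute_budget(headcount: Mapping[str, int],
                   quoted_equipment: float = 0) -> float:
    """Yearly budget from headcounts plus any separately quoted equipment."""
    salaries = sum(STANDARD_SALARY[role] * n for role, n in headcount.items())
    researchers = sum(n for role, n in headcount.items()
                      if role in RESEARCHER_ROLES)
    return salaries + researchers * ALLOWANCE_PER_RESEARCHER + quoted_equipment


# Two masters students and one PhD student, no expensive equipment:
print(compute_budget({"masters": 2, "phd": 1}))
# prints: 74000
```

The applicant supplies only the headcounts and any quotes; everything else comes from the standard tables, so there is no fictitious line-by-line budget to write.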

The above would drastically shorten the time a professor wastes applying for grants they have a low probability of receiving. Professors would essentially make 'standing offers' or 'standing requests' to do certain research.

Much of the computation of criteria for grant selection should be automated: Selection committees would give high weight in their decisions to data such as the citation indexes mentioned above, and versions of the citation indexes computed for publications relevant to the research field of the grant. They would factor in trends regarding graduate student training, and the professor's available time (a professor with many other grants would have less available time).
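Both indexes are cheap to automate. The Python sketch below uses the usual definitions: the H-Index is the largest h such that h publications have at least h citations each, and the G-Index is the largest g such that the g most-cited publications together have at least g squared citations. The function names are hypothetical, and the input is assumed to be a list of per-publication citation counts from which self-citations have already been removed; the recency-weighted variants are not shown.

```python
from typing import Iterable


def h_index(citations: Iterable[int]) -> int:
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count < rank:
            break
        h = rank
    return h


def g_index(citations: Iterable[int]) -> int:
    """Largest g such that the top g publications total >= g*g citations."""
    ranked = sorted(citations, reverse=True)
    total = 0
    g = 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g


counts = [10, 8, 5, 4, 3]  # made-up citation counts for five publications
print(h_index(counts), g_index(counts))
# prints: 4 5
```

In this version the G-Index is capped at the number of publications; some definitions allow it to exceed that by padding with zero-citation entries.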

For established researchers, the peer review process would be limited to two things: 1) Brief comments on the extent to which the researcher is continuing on an established line of research, and 2) suggestions regarding how the research could be improved. If a professor is branching out in a new direction, or is a new researcher, then more detailed comments would be needed. But for researchers that are continuing in the same broad line of research, the track record should largely speak for itself. The suggestions for improvements should primarily be to help guide the researcher.

With the above systems in place, the overhead of applying for and peer-reviewing grants could be drastically reduced.

Some would argue that citation indexes can be faulty. So can peer reviews. There would clearly be different biases in the above system, but I think there would be no major sources of unfairness introduced. The system would simply render the research process a whole lot more productive.

Ghoussoub's comments about waste of research resources also can be applied to conferences and journals with low acceptance rates. For those, however, the solution is completely different: Simply increase acceptance rates so they are generally at least in the 40% range. Far too many decent papers are rejected in my field because of arbitrarily-imposed low acceptance rates.