Importing Text or PDF Files

Post by workinstiff » Wed Jul 11, 2012 4:44 pm

I am a novice OOo Base user and I would like to know if there is any way to upload a text document or PDF into an existing database. The documents eventually need to be keyword-searchable through the database, so I can't upload them as simple JPEG images. I am not familiar with scripting in HSQL or MySQL, so I was hoping there is a function in the database service itself for uploading documents. Any tips?
Electric Boogaloo! Open office 3.4.0
workinstiff
 
Posts: 3
Joined: Wed Jul 11, 2012 3:48 pm

Re: Importing Text or PDF Files

Post by rudolfo » Thu Jul 12, 2012 12:02 am

This depends on what you mean by a text document.
XML documents (including unzipped OOo Writer files, typically content.xml, or unzipped .docx files) are plain text files. They can be loaded into a string or text column of a database, and that column can have a full-text index (and you are done). I have only a little experience with MySQL and full-text indexes, but I remember that there is an absolute size limit for text columns (in MySQL, 64 KB for a plain TEXT column; larger TEXT variants exist).
If you are beyond this limit you will have to use a BLOB (binary large object) column to store your data. The same is true if you use a format that is not plain text, like .doc or .pdf. BLOB columns can't have a full-text index in any of the databases I am aware of (Oracle, MySQL, HSQLDB, SQLite, PostgreSQL, MSSQL).
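To illustrate the difference between the two column types, here is a minimal Python sketch. It uses SQLite only because it ships with Python; the table docs, its columns, and the helper store_document are made-up names, and the LIKE search is a plain scan, not a full-text index.

```python
import sqlite3


def store_document(conn, name, data):
    """Store str payloads in the text column and bytes payloads in the BLOB column."""
    if isinstance(data, str):
        conn.execute(
            "INSERT INTO docs (name, body_text) VALUES (?, ?)", (name, data)
        )
    else:
        conn.execute(
            "INSERT INTO docs (name, body_blob) VALUES (?, ?)", (name, data)
        )


# Hypothetical table: one searchable text column, one opaque binary column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (name TEXT, body_text TEXT, body_blob BLOB)")

store_document(conn, "notes.txt", "Meeting notes about the London office")
store_document(conn, "scan.pdf", b"%PDF-1.4 binary payload")

# Only rows with text content can be matched by a keyword search.
hits = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM docs WHERE body_text LIKE ?", ("%London%",)
    )
]
print(hits)
```

The PDF row is stored, but its content never takes part in the keyword search, which is the practical drawback of the BLOB route.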
Note that storing XML in a string column is not the way it is meant to be. Some databases have special extensions to store XML objects or nodes in database fields. If you store XML in a plain string column and create a full-text index for it, the index will also contain tag names as entries. For example,
<root><city>London</city><city>Paris</city></root>
will also have city and root in the index, because the database can't understand the semantics of the XML structure and can't know that London and Paris are real content while city and root are only markup used to structure the content.
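One way around this is to strip the markup before the text goes into the database. The following is only a sketch in Python, using the standard library XML parser; the function name xml_to_plain_text is my own invention, and real content.xml files have namespaces and far more structure than this toy input.

```python
import xml.etree.ElementTree as ET


def xml_to_plain_text(xml_string):
    """Return only the text content of an XML document, without tag names."""
    root = ET.fromstring(xml_string)
    # itertext() walks the tree and yields the text between the tags.
    pieces = (piece.strip() for piece in root.itertext())
    return " ".join(piece for piece in pieces if piece)


plain = xml_to_plain_text("<root><city>London</city><city>Paris</city></root>")
print(plain)  # London Paris
```

The resulting string could then be stored in the text column instead of the raw XML, so that only London and Paris end up in the index.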
OpenOffice 3.1.1 (2.4.3 until October 2009) and LibreOffice 3.3.2 on Windows 2000, AOO 3.4.1 on Windows 7
There are several macro languages in OOo, but none of them is called Visual Basic or VB(A)! Please call it OOo Basic, Star Basic or simply Basic.
rudolfo
Volunteer
 
Posts: 1488
Joined: Wed Mar 19, 2008 11:34 am
Location: Germany

Re: Importing Text or PDF Files

Post by workinstiff » Fri Jul 13, 2012 4:36 pm

Great, thanks for the tips. Would you know how I format my columns to have a full text index?

Re: Importing Text or PDF Files

Post by rudolfo » Sat Jul 14, 2012 12:36 am

I can only tell you how to do it in a MySQL database, because that's the only engine I have used with full-text indexes. BTW, MySQL is also the backend database for this forum software, which is why the search box in the upper right corner ignores words with fewer than 4 letters: by default, MySQL's full-text indexing only takes into account words of at least 4 letters (the ft_min_word_len setting).
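As a rough illustration of that word-length rule (not MySQL's actual tokenizer, which also applies a stopword list), a Python sketch could look like this. The function name indexable_words and the constant MIN_WORD_LEN are hypothetical.

```python
import re

# Assumed to mirror the minimum word length described above.
MIN_WORD_LEN = 4


def indexable_words(text, min_len=MIN_WORD_LEN):
    """Return the lowercase words that are at least min_len letters long."""
    return [
        word.lower()
        for word in re.findall(r"[A-Za-z]+", text)
        if len(word) >= min_len
    ]


words = indexable_words("The cat sat on a warm beer barrel")
print(words)  # ['warm', 'beer', 'barrel']
```

A search for "cat" would find nothing here, because the word never made it into the index in the first place.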

http://dev.mysql.com/doc/refman//5.5/en/create-index.html wrote: FULLTEXT indexes are supported only for MyISAM tables and can include only CHAR, VARCHAR, and TEXT columns. Indexing always happens over the entire column; column prefix indexing is not supported and any prefix length is ignored if specified.


Code:
-- FULLTEXT needs a MyISAM table in MySQL 5.5, so set the engine explicitly.
CREATE TABLE your_table (_id INTEGER, text_col1 VARCHAR(255), text_col2 TEXT) ENGINE=MyISAM;
CREATE FULLTEXT INDEX your_index_name ON your_table (text_col1, text_col2);


A VARCHAR column has a length limit (it has grown across MySQL versions, but it is still there), while a TEXT column holds up to 64 KB and the MEDIUMTEXT and LONGTEXT variants hold up to 16 MB and 4 GB.

The following statement searches the index for the word "beer":
Code:
SELECT text_col1, text_col2 FROM your_table WHERE MATCH (text_col1, text_col2) AGAINST ('beer');

The order in which you see the matching records is based on MySQL's relevance weighting, so rows that contain beer 3 or 4 times usually come first, while "root beer" will tend to come later.

If you are more interested in other database backends with full-text indexes, Google will surely help you.