What type of data format is this page using?


Postby SteveH_66 » Mon Apr 14, 2008 9:07 am

Welcome beginner. Please answer all of the questions below which may provide information necessary to answer your question.
-----------------------------------------------------------------------------------------------------------
Which version of OpenOffice.org are you using? 2.4
What Operating System (version) are you using? Windows Vista Home Premium
What is your question or comment?

Here is the URL to the page in question:
"http://finance.yahoo.com/q/bs?s=INTC&annual"

I was wondering if anyone could tell me what type of data format this page is using? HTML with a table in it? JavaScript? CSS? I am trying to get 3 or 4 particular figures out of the page, and I would like to be able to do this directly, if possible. But I understand that there are limits imposed by the different formats which might make it impossible.

If it is HTML with a table, is it possible to extract the content of one particular table cell from the page using the Link to External Data function in the Spreadsheet, or using some other tool or function in the Spreadsheet? If not, then I will just have a lot of formatting to do in the spreadsheet. Thanks, Steve
Libre Office (ver: 4.2.3.3 in Linux) (ver: 4.2.8.2 in Windows 7 Home Premium)
SteveH_66
 
Posts: 37
Joined: Sun Apr 13, 2008 4:42 pm

Re: What type of data format is this page using?

Postby Villeroy » Mon Apr 14, 2008 3:23 pm

The page is designed for your human eyes. It shows some figures wrapped in advertisements. It is intentionally not machine readable (or not easily). You may get the desired information in a machine-readable form (via csv or direct database access?) if you pay for it. The whole matter has nothing to do with format. It's about content and someone who pays for it.

Import the whole thing into a spreadsheet and try to get out the relevant information with suitable lookups. Possibly you can set up a proxy server to filter irrelevant content before it gets into the office (see Tools>Options...Internet>Proxy). The best solution to extract relevant information out of HTML is certainly a script rather than a spreadsheet.
Please, edit this topic's initial post and add "[Solved]" to the subject line if your problem has been solved.
Ubuntu 18.04, no OpenOffice, LibreOffice 6.4
Villeroy
Volunteer
 
Posts: 28644
Joined: Mon Oct 08, 2007 1:35 am
Location: Germany

Re: What type of data format is this page using?

Postby SteveH_66 » Mon Apr 14, 2008 4:23 pm

Well, it is information that the site in question, Yahoo, is making publicly available, and you can gather as much of it as you want for free using their site. However, checking a list of, say, 200 stocks one page at a time just to get 3 figures out of each would be extremely time consuming. Hours. So I see nothing wrong with trying to extract the information by a less time-consuming method than bringing up each balance sheet a click at a time, or 2 clicks if you count having to click again on the link for annual data, because the page shows the quarterly reports by default. That's a lot of clicking and cutting & pasting just to get 3 numbers, don't you think? And then if you paste it one page at a time, delete all the rows of information that you didn't need, and do it all repeatedly, that's another few hours of work just to get 3 numbers.

Yahoo support sent me a reply to a question about that, stating that they fully understood the cost in time to an individual/organization trying to get that data the way they have it, and that they had added it to their suggestion list of new features, so I don't think they're kicking up a fuss about cost. I'm just trying to find a way to get this information more quickly until they act on that suggestion.

Please don't take that explanation as me trying to be confrontational or trying to start a flame war. I'm just answering the comment in your reply about cost and the people who pay for it by saying that, in this case, the people paying for it don't seem to mind me trying to get the information in a more efficient and rapid manner. :D

Anyway, thanks for your assistance and the reply to my post. Steve
Libre Office (ver: 4.2.3.3 in Linux) (ver: 4.2.8.2 in Windows 7 Home Premium)
SteveH_66
 
Posts: 37
Joined: Sun Apr 13, 2008 4:42 pm

Re: What type of data format is this page using?

Postby acknak » Mon Apr 14, 2008 4:42 pm

I guess I'm still not sure what you're after here.

I can paste that data (just the data, not the whole page) into Writer: it comes through very nicely.

Same for Calc. The data seem to come through with the table structure intact. You can copy/paste out of that to get the data you want.

Is it those manual manipulations (click/copy/paste) you're trying to avoid?

If so, that's very commonly done: it's called "screen scraping" and you have to write a special-purpose program that retrieves the web page, parses it to find the specific data you want, then exports the data in whatever format you need. It's exactly the same process you're doing now manually.

There's nothing OOo can (or ever could) do to automate that process. Every web page is different, and as Villeroy already pointed out, they're designed for human eyes, not for easy data extraction.

A screen scraping program is not terribly hard to write, but it is specific for a certain web page (or class of web pages), and is typically very fragile: if the web site changes the page structure slightly, the scraping program may fail.
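To make the "parse, find, export" steps concrete, here is a minimal sketch in Python using only the standard library. Everything specific in it is an assumption for illustration: SAMPLE_HTML is a made-up stand-in for a downloaded page (the real page's markup is not shown in this thread and will differ), and the row labels and figures are invented. It only shows the general shape of such a program.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical stand-in for a downloaded balance-sheet page. The labels
# and numbers are invented; a real page will be structured differently.
SAMPLE_HTML = """
<html><body>
<div class="ad">Buy now!</div>
<table>
  <tr><td>Period Ending</td><td>Dec 29, 2007</td></tr>
  <tr><td>Total Assets</td><td>1,000</td></tr>
  <tr><td>Total Liabilities</td><td>400</td></tr>
  <tr><td>Net Tangible Assets</td><td>600</td></tr>
</table>
</body></html>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_figures(html, wanted_labels):
    """Return {label: value} for rows whose first cell is a wanted label."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    figures = {}
    for row in parser.rows:
        if len(row) >= 2 and row[0] in wanted_labels:
            figures[row[0]] = row[1]
    return figures


def to_csv(ticker, figures, labels):
    """Export one ticker's figures as CSV text that Calc can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Ticker"] + list(labels))
    writer.writerow([ticker] + [figures.get(label, "") for label in labels])
    return buffer.getvalue()


LABELS = ["Total Assets", "Total Liabilities"]
figures = extract_figures(SAMPLE_HTML, LABELS)
csv_text = to_csv("INTC", figures, LABELS)
print(csv_text)
```

For a real run, the sample string would be replaced by a page fetched with something like urllib.request.urlopen, once per stock in the list. The lookup depends entirely on the row labels and table layout, which is why a small change to the page can break it.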
AOO4/LO5 • Linux • Fedora 23
acknak
Moderator
 
Posts: 22756
Joined: Mon Oct 08, 2007 1:25 am
Location: USA:NJ:E3

Re: What type of data format is this page using?

Postby SteveH_66 » Mon Apr 14, 2008 4:49 pm

acknak, thanks so much for the post and the additional explanation of what I am facing in trying to do it this way. You are right, the data does paste into Calc nicely and can be worked on from there. I was just hoping that there might be some way to do the process more easily for a spreadsheet with a list of 50 or 100 stocks that I need those 3 figures for. Writing a screen scraping program is far beyond my capabilities, so I'll just have to see if someone has a program out there that might help with this in some way. The idea of all that row deleting and copying is kind of daunting when you have a whole list of stocks to get data on. Thanks, Steve
Libre Office (ver: 4.2.3.3 in Linux) (ver: 4.2.8.2 in Windows 7 Home Premium)
SteveH_66
 
Posts: 37
Joined: Sun Apr 13, 2008 4:42 pm

